OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

Authors: Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez

Abstract: The advances in multimodal large language models (MLLMs) have led to growing interests in LLM-based autonomous driving agents to leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2311.16267 [pdf, other]

Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model

Authors: Yu-Chen Lin, Akhilesh Kumar, Norman Chang, Wenliang Zhang, Muhammad Zakir, Rucha Apte, Haiyang He, Chao Wang, Jyh-Shing Roger Jang

Abstract: We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of embeddings' space; (ii) introducing the Chain of Density for Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text Renovation (ATR) algorithm for assessing data renovation reliability; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) Prompt technique; and (iv) effectively refactoring existing scripts to generate new and high-quality scripts with LLMs. By using engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data pre-processing method for expanding and categorizing scripts. When combined with IKEC, these techniques enhance the Retrieval-Augmented Generation (RAG) method in retrieving more relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" for code generation problems in MapReduce applications.

Submitted 30 January, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

arXiv:2306.14035 [pdf, other]

Thinking Like an Annotator: Generation of Dataset Labeling Instructions

Authors: Nadine Chang, Francesco Ferroni, Michael J. Tarr, Martial Hebert, Deva Ramanan

Abstract: Large-scale datasets are essential to modern day deep learning. Advocates argue that understanding these methods requires dataset transparency (e.g. "dataset curation, motivation, composition, collection process, etc..."). However, almost no one has suggested the release of the detailed definitions and visual category examples provided to annotators - information critical to understanding the structure of the annotations present in each dataset. These labels are at the heart of public datasets, yet few datasets include the instructions that were used to generate them. We introduce a new task, Labeling Instruction Generation, to address missing publicly available labeling instructions. In Labeling Instruction Generation, we take a reasonably annotated dataset and: 1) generate a set of examples that are visually representative of each category in the dataset; 2) provide a text label that corresponds to each of the examples. We introduce a framework that requires no model training to solve this task and includes a newly created rapid retrieval system that leverages a large, pre-trained vision and language model. This framework acts as a proxy to human annotators that can help to both generate a final labeling instruction set and evaluate its quality. Our framework generates multiple diverse visual and text representations of dataset categories. The optimized instruction set outperforms our strongest baseline across 5 folds by 7.06 mAP for NuImages and 12.9 mAP for COCO.

Submitted 24 June, 2023; originally announced June 2023.

arXiv:2209.04741 [pdf, other]

A Thermal Machine Learning Solver For Chip Simulation

Authors: Rishikesh Ranade, Haiyang He, Jay Pathak, Norman Chang, Akhilesh Kumar, Jimin Wen

Abstract: Thermal analysis provides deeper insights into electronic chips behavior under different temperature scenarios and enables faster design exploration. However, obtaining detailed and accurate thermal profile on chip is very time-consuming using FEM or CFD. Therefore, there is an urgent need for speeding up the on-chip thermal solution to address various system scenarios. In this paper, we propose a thermal machine-learning (ML) solver to speed-up thermal simulations of chips. The thermal ML-Solver is an extension of the recent novel approach, CoAEMLSim (Composable Autoencoder Machine Learning Simulator) with modifications to the solution algorithm to handle constant and distributed HTC. The proposed method is validated against commercial solvers, such as Ansys MAPDL, as well as a latest ML baseline, UNet, under different scenarios to demonstrate its enhanced accuracy, scalability, and generalizability.

Submitted 10 September, 2022; originally announced September 2022.

arXiv:2110.03780 [pdf, other]

A composable autoencoder-based iterative algorithm for accelerating numerical simulations

Authors: Rishikesh Ranade, Chris Hill, Haiyang He, Amir Maleki, Norman Chang, Jay Pathak

Abstract: Numerical simulations for engineering applications solve partial differential equations (PDE) to model various physical processes. Traditional PDE solvers are very accurate but computationally costly. On the other hand, Machine Learning (ML) methods offer a significant computational speedup but face challenges with accuracy and generalization to different PDE conditions, such as geometry, boundary conditions, initial conditions and PDE source terms. In this work, we propose a novel ML-based approach, CoAE-MLSim (Composable AutoEncoder Machine Learning Simulation), which is an unsupervised, lower-dimensional, local method, that is motivated from key ideas used in commercial PDE solvers. This allows our approach to learn better with relatively fewer samples of PDE solutions. The proposed ML-approach is compared against commercial solvers for better benchmarks as well as latest ML-approaches for solving PDEs. It is tested for a variety of complex engineering cases to demonstrate its computational speed, accuracy, scalability, and generalization across different PDE conditions. The results show that our approach captures physics accurately across all metrics of comparison (including measures such as results on section cuts and lines).

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2104.05702 [pdf, other]

Image-Level or Object-Level? A Tale of Two Resampling Strategies for Long-Tailed Detection

Authors: Nadine Chang, Zhiding Yu, Yu-Xiong Wang, Anima Anandkumar, Sanja Fidler, Jose M. Alvarez

Abstract: Training on datasets with long-tailed distributions has been challenging for major recognition tasks such as classification and detection. To deal with this challenge, image resampling is typically introduced as a simple but effective approach. However, we observe that long-tailed detection differs from classification since multiple classes may be present in one image. As a result, image resampling alone is not enough to yield a sufficiently balanced distribution at the object level. We address object-level resampling by introducing an object-centric memory replay strategy based on dynamic, episodic memory banks. Our proposed strategy has two benefits: 1) convenient object-level resampling without significant extra computation, and 2) implicit feature-level augmentation from model updates. We show that image-level and object-level resamplings are both important, and thus unify them with a joint resampling strategy (RIO). Our method outperforms state-of-the-art long-tailed detection and segmentation methods on LVIS v0.5 across various backbones. Code is available at https://github.com/NVlabs/RIO.

Submitted 18 October, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: Accepted to ICML 2021

arXiv:2008.07073 [pdf, other]

AlphaNet: Improving Long-Tail Classification By Combining Classifiers

Authors: Nadine Chang, Jayanth Koushik, Aarti Singh, Martial Hebert, Yu-Xiong Wang, Michael J. Tarr

Abstract: Methods in long-tail learning focus on improving performance for data-poor (rare) classes; however, performance for such classes remains much lower than performance for more data-rich (frequent) classes. Analyzing the predictions of long-tail methods for rare classes reveals that a large number of errors are due to misclassification of rare items as visually similar frequent classes. To address this problem, we introduce AlphaNet, a method that can be applied to existing models, performing post hoc correction on classifiers of rare classes. Starting with a pre-trained model, we find frequent classes that are closest to rare classes in the model's representation space and learn weights to update rare class classifiers with a linear combination of frequent class classifiers. AlphaNet, applied to several models, greatly improves test accuracy for rare classes in multiple long-tailed datasets, with very little change to overall accuracy. Our method also provides a way to control the trade-off between rare class and overall accuracy, making it practical for long-tail classification in the wild.

Submitted 26 July, 2023; v1 submitted 16 August, 2020; originally announced August 2020.

arXiv:2005.09099 [pdf, other]

(Re)construing Meaning in NLP

Authors: Sean Trott, Tiago Timponi Torrent, Nancy Chang, Nathan Schneider

Abstract: Human speakers have an extensive toolkit of ways to express themselves. In this paper, we engage with an idea largely absent from discussions of meaning in natural language understanding--namely, that the way something is expressed reflects different ways of conceptualizing or construing the information being conveyed. We first define this phenomenon more precisely, drawing on considerable prior work in theoretical cognitive semantics and psycholinguistics. We then survey some dimensions of construed meaning and show how insights from construal could inform theoretical and practical work in NLP.

Submitted 18 May, 2020; originally announced May 2020.

Comments: ACL 2020 camera-ready

arXiv:2005.05495 [pdf, other]

Train and Deploy an Image Classifier for Disaster Response

Authors: Jianyu Mao, Kiana Harris, Nae-Rong Chang, Caleb Pennell, Yiming Ren

Abstract: With Deep Learning Image Classification becoming more powerful each year, it is apparent that its introduction to disaster response will increase the efficiency that responders can work with. Using several Neural Network Models, including AlexNet, ResNet, MobileNet, DenseNets, and 4-Layer CNN, we have classified flood disaster images from a large image data set with up to 79% accuracy. Our models and tutorials for working with the data set have created a foundation for others to classify other types of disasters contained in the images.

Submitted 11 May, 2020; originally announced May 2020.

Comments: 5 pages, 6 figures

arXiv:1902.08034 [pdf, other]

Mitigation of Adversarial Examples in RF Deep Classifiers Utilizing AutoEncoder Pre-training

Authors: Silvija Kokalj-Filipovic, Rob Miller, Nicholas Chang, Chi Leung Lau

Abstract: Adversarial examples in machine learning for images are widely publicized and explored. Illustrations of misclassifications caused by slightly perturbed inputs are abundant and commonly known (e.g., a picture of panda imperceptibly perturbed to fool the classifier into incorrectly labeling it as a gibbon). Similar attacks on deep learning (DL) for radio frequency (RF) signals and their mitigation strategies are scarcely addressed in the published work. Yet, RF adversarial examples (AdExs) with minimal waveform perturbations can cause drastic, targeted misclassification results, particularly against spectrum sensing/survey applications (e.g. BPSK is mistaken for 8-PSK). Our research on deep learning AdExs and proposed defense mechanisms are RF-centric, and incorporate physical world, over-the-air (OTA) effects. We herein present defense mechanisms based on pre-training the target classifier using an autoencoder. Our results validate this approach as a viable mitigation method to subvert adversarial attacks against deep learning-based communications and radar sensing systems.

Submitted 16 February, 2019; originally announced February 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1902.06044

arXiv:1811.00119 [pdf, other]

A task in a suit and a tie: paraphrase generation with semantic augmentation

Authors: Su Wang, Rahul Gupta, Nancy Chang, Jason Baldridge

Abstract: Paraphrasing is rooted in semantics. We show the effectiveness of transformers (Vaswani et al. 2017) for paraphrase generation and further improvements by incorporating PropBank labels via a multi-encoder. Evaluating on MSCOCO and WikiAnswers, we find that transformers are fast and effective, and that semantic augmentation for both transformers and LSTMs leads to sizable 2-3 point gains in BLEU, METEOR and TER. More importantly, we find surprisingly large gains on human evaluations compared to previous models. Nevertheless, manual inspection of generated paraphrases reveals ample room for improvement: even our best model produces human-acceptable paraphrases for only 28% of captions from the CHIA dataset (Sharma et al. 2018), and it fails spectacularly on sentences from Wikipedia. Overall, these results point to the potential for incorporating semantics in the task while highlighting the need for stronger evaluation.

Submitted 14 November, 2018; v1 submitted 31 October, 2018; originally announced November 2018.

Journal ref: Association for the Advancement of Artificial Intelligence (AAAI) 2019

arXiv:1809.01281 [pdf, other]

doi 10.1038/s41597-019-0052-3

BOLD5000: A public fMRI dataset of 5000 images

Authors: Nadine Chang, John A. Pyles, Abhinav Gupta, Michael J. Tarr, Elissa M. Aminoff

Abstract: Vision science, particularly machine vision, has been revolutionized by introducing large-scale image datasets and statistical learning approaches. Yet, human neuroimaging studies of visual perception still rely on small numbers of images (around 100) due to time-constrained experimental procedures. To apply statistical learning approaches that integrate neuroscience, the number of images used in neuroimaging must be significantly increased. We present BOLD5000, a human functional MRI (fMRI) study that includes almost 5,000 distinct images depicting real-world scenes. Beyond dramatically increasing image dataset size relative to prior fMRI studies, BOLD5000 also accounts for image diversity, overlapping with standard computer vision datasets by incorporating images from the Scene UNderstanding (SUN), Common Objects in Context (COCO), and ImageNet datasets. The scale and diversity of these image datasets, combined with a slow event-related fMRI design, enable fine-grained exploration into the neural representation of a wide range of visual features, categories, and semantics. Concurrently, BOLD5000 brings us closer to realizing Marr's dream of a singular vision science - the intertwined study of biological and computer vision.

Submitted 4 September, 2018; originally announced September 2018.

Comments: Currently in submission to Scientific Data

arXiv:1804.01574 [pdf, other]

Prediction-Based Fast Thermoelectric Generator Reconfiguration for Energy Harvesting from Vehicle Radiators

Authors: Hanchen Yang, Feiyang Kang, Caiwen Ding, Ji Li, Jaemin Kim, Donkyu Baek, Shahin Nazarian, Xue Lin, Paul Bogdan, Naehyuck Chang

Abstract: Thermoelectric generation (TEG) has increasingly drawn attention for being environmentally friendly. A few researches have focused on improving TEG efficiency at the system level on vehicle radiators. The most recent reconfiguration algorithm shows improvement in performance but suffers from major drawback on computational time and energy overhead, and non-scalability in terms of array size and processing frequency. In this paper, we propose a novel TEG array reconfiguration algorithm that determines near-optimal configuration with an acceptable computational time. More precisely, with $O(N)$ time complexity, our prediction-based fast TEG reconfiguration algorithm enables all modules to work at or near their maximum power points (MPP). Additionally, we incorporate prediction methods to further reduce the runtime and switching overhead during the reconfiguration process. Experimental results present $30\%$ performance improvement, almost $100\times$ reduction on switching overhead and $13\times$ enhancement on computational speed compared to the baseline and prior work. The scalability of our algorithm makes it applicable to larger scale systems such as industrial boilers and heat exchangers.

Submitted 28 March, 2018; originally announced April 2018.

Comments: 4 pages, 7 figures; accepted at Design Automation and Test in Europe (DATE) 2018

arXiv:1703.04216

Cognitive Inference of Demographic Data by User Ratings

Authors: Jinliang Xu, Shangguang Wang, Fangchun Yang, Rong N. Chang

Abstract: Cognitive inference of user demographics, such as gender and age, plays an important role in creating user profiles for adjusting marketing strategies and generating personalized recommendations because user demographic data is usually not available due to data privacy concerns. At present, users can readily express feedback regarding products or services that they have purchased. During this process, user demographics are concealed, but the data has never yet been successfully utilized to contribute to the cognitive inference of user demographics. In this paper, we investigate the inference power of user ratings data, and propose a simple yet general cognitive inference model, called rating to profile (R2P), to infer user demographics from user provided ratings. In particular, the proposed R2P model can achieve the following: 1. Correctly integrate user ratings into model training. 2. Infer multiple demographic attributes of users simultaneously, capturing the underlying relevance between different demographic attributes. 3. Train its two components, i.e. feature extractor and classifier, in an integrated manner under a supervised learning paradigm, which effectively helps to discover useful hidden patterns from highly sparse ratings data. We introduce how to incorporate user ratings data into the research field of cognitive inference of user demographic data, and detail the model development and optimization process for the proposed R2P. Extensive experiments are conducted on two real-world ratings datasets against various compared state-of-the-art methods, and the results from multiple aspects demonstrate that our proposed R2P model can significantly improve on the cognitive inference performance of user demographic data.

Submitted 16 March, 2017; v1 submitted 12 March, 2017; originally announced March 2017.

Comments: This paper has been withdrawn by the author due to a crucial sign error in some equations and figures

arXiv:cmp-lg/9405008 [pdf, ps]

A Stochastic Finite-State Word-Segmentation Algorithm for Chinese

Authors: Richard Sproat, Chilin Shih, William Gale, Nancy Chang

Abstract: We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.

Submitted 5 May, 1994; v1 submitted 3 May, 1994; originally announced May 1994.

Comments: To appear in Proceedings of ACL-94

Journal ref: in Proceedings of ACL 94

Showing 1–15 of 15 results for author: Chang, N