-
Intelligent EC Rearview Mirror: Enhancing Driver Safety with Dynamic Glare Mitigation via Cloud Edge Collaboration
Authors:
Junyi Yang,
Zefei Xu,
Huayi Lai,
Hongjian Chen,
Sifan Kong,
Yutong Wu,
Huan Yang
Abstract:
Sudden glare from trailing vehicles significantly increases driving safety risks. Existing anti-glare technologies such as electronic, manually-adjusted, and electrochromic rearview mirrors, are expensive and lack effective adaptability in different lighting conditions. To address these issues, our research introduces an intelligent rearview mirror system utilizing novel all-liquid electrochromic…
▽ More
Sudden glare from trailing vehicles significantly increases driving safety risks. Existing anti-glare technologies such as electronic, manually-adjusted, and electrochromic rearview mirrors, are expensive and lack effective adaptability in different lighting conditions. To address these issues, our research introduces an intelligent rearview mirror system utilizing novel all-liquid electrochromic technology. This system integrates IoT with ensemble and federated learning within a cloud edge collaboration framework, dynamically controlling voltage to effectively eliminate glare and maintain clear visibility. Utilizing an ensemble learning model, it automatically adjusts mirror transmittance based on light intensity, achieving a low RMSE of 0.109 on the test set. Furthermore, the system leverages federated learning for distributed data training across devices, which enhances privacy and updates the cloud model continuously. Distinct from conventional methods, our experiment utilizes the Schmidt-Clausen and Bindels de Boer 9-point scale with TOPSIS for comprehensive evaluation of rearview mirror glare. Designed to be convenient and costeffective, this system demonstrates how IoT and AI can significantly enhance rearview mirror anti-glare performance.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing
Authors:
Zhongwang Zhang,
Pengxiao Lin,
Zhiwei Wang,
Yaoyu Zhang,
Zhi-Qin John Xu
Abstract:
Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks using anchor functions. We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential solu…
▽ More
Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks using anchor functions. We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential solutions, which capture the underlying compositional primitives, or symmetric solutions, which simply memorize mappings without understanding the compositional structure. By analyzing the information flow and vector representations within the model, we reveal the distinct mechanisms underlying these solution types. We further find that inferential solutions exhibit low complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors. Building upon our understanding of these mechanisms, we can predict the learning behavior of models with different initialization scales when faced with data of varying inferential complexity. Our findings provide valuable insights into the role of initialization scale in shaping the type of solution learned by transformers and their ability to learn and generalize compositional functions.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
Masked Graph Transformer for Large-Scale Recommendation
Authors:
Huiyuan Chen,
Zhe Xu,
Chin-Chia Michael Yeh,
Vivian Lai,
Yan Zheng,
Minghua Xu,
Hanghang Tong
Abstract:
Graph Transformers have garnered significant attention for learning graph-structured data, thanks to their superb ability to capture long-range dependencies among nodes. However, the quadratic space and time complexity hinders the scalability of Graph Transformers, particularly for large-scale recommendation. Here we propose an efficient Masked Graph Transformer, named MGFormer, capable of capturi…
▽ More
Graph Transformers have garnered significant attention for learning graph-structured data, thanks to their superb ability to capture long-range dependencies among nodes. However, the quadratic space and time complexity hinders the scalability of Graph Transformers, particularly for large-scale recommendation. Here we propose an efficient Masked Graph Transformer, named MGFormer, capable of capturing all-pair interactions among nodes with a linear complexity. To achieve this, we treat all user/item nodes as independent tokens, enhance them with positional embeddings, and feed them into a kernelized attention module. Additionally, we incorporate learnable relative degree information to appropriately reweigh the attentions. Experimental results show the superior performance of our MGFormer, even with a single attention layer.
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
Authors:
Tianyi Zhang,
Jonah Yi,
Zhaozhuo Xu,
Anshumali Shrivastava
Abstract:
Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression,…
▽ More
Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model quality. Furthermore, we demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
Loss Jump During Loss Switch in Solving PDEs with Neural Networks
Authors:
Zhiwei Wang,
Lulu Zhang,
Zhongwang Zhang,
Zhi-Qin John Xu
Abstract:
Using neural networks to solve partial differential equations (PDEs) is gaining popularity as an alternative approach in the scientific computing community. Neural networks can integrate different types of information into the loss function. These include observation data, governing equations, and variational forms, etc. These loss functions can be broadly categorized into two types: observation d…
▽ More
Using neural networks to solve partial differential equations (PDEs) is gaining popularity as an alternative approach in the scientific computing community. Neural networks can integrate different types of information into the loss function. These include observation data, governing equations, and variational forms, etc. These loss functions can be broadly categorized into two types: observation data loss directly constrains and measures the model output, while other loss functions indirectly model the performance of the network, which can be classified as model loss. However, this alternative approach lacks a thorough understanding of its underlying mechanisms, including theoretical foundations and rigorous characterization of various phenomena. This work focuses on investigating how different loss functions impact the training of neural networks for solving PDEs. We discover a stable loss-jump phenomenon: when switching the loss function from the data loss to the model loss, which includes different orders of derivative information, the neural network solution significantly deviates from the exact solution immediately. Further experiments reveal that this phenomenon arises from the different frequency preferences of neural networks under different loss functions. We theoretically analyze the frequency preference of neural networks under model loss. This loss-jump phenomenon provides a valuable perspective for examining the underlying mechanisms of neural networks in solving PDEs.
△ Less
Submitted 5 May, 2024;
originally announced May 2024.
-
Easy over Hard: A Simple Baseline for Test Failures Causes Prediction
Authors:
Zhipeng Gao,
Zhipeng Xue,
Xing Hu,
Weiyi Shang,
Xin Xia
Abstract:
The test failure causes analysis is critical since it determines the subsequent way of handling different types of bugs, which is the prerequisite to get the bugs properly analyzed and fixed. After a test case fails, software testers have to inspect the test execution logs line by line to identify its root cause. However, manual root cause determination is often tedious and time-consuming, which c…
▽ More
The test failure causes analysis is critical since it determines the subsequent way of handling different types of bugs, which is the prerequisite to get the bugs properly analyzed and fixed. After a test case fails, software testers have to inspect the test execution logs line by line to identify its root cause. However, manual root cause determination is often tedious and time-consuming, which can cost 30-40% of the time needed to fix a problem. Therefore, there is a need for automatically predicting the test failure causes to lighten the burden of software testers. In this paper, we present a simple but hard-to-beat approach, named NCChecker to automatically identify the failure causes for failed test logs. Our approach can help developers efficiently identify the test failure causes, and flag the most probable log lines of indicating the root causes for investigation. Our approach has three main stages: log abstraction, lookup table construction, and failure causes prediction. We first perform log abstraction to parse the unstructured log messages into structured log events. NCChecker then automatically maintains and updates a lookup table via employing our heuristic rules, which record the matching score between different log events and test failure causes. When it comes to the failure cause prediction stage, for a newly generated failed test log, NCChecker can easily infer its failed reason by checking out the associated log events' scores from the lookup table. We have developed a prototype and evaluated our tool on a real-world industrial dataset with more than 10K test logs. The extensive experiments show the promising performance of our model over a set of benchmarks. Moreover, our approach is highly efficient and memory-saving, and can successfully handle the data imbalance problem.
△ Less
Submitted 5 May, 2024;
originally announced May 2024.
-
A Novel Endorsement Protocol to Secure BFT-Based Consensus in Permissionless Blockchain
Authors:
Ziqiang Xu,
Ahmad Salehi Shahraki,
Naveen Chilamkurti
Abstract:
Permissionless blockchain technology offers numerous potential benefits for decentralised applications, such as security, transparency, and openness. BFT-based consensus mechanisms are widely adopted in the permissioned blockchain to meet the high scalability requirements of the network. Sybil attacks are one of the most potential threats when applying BFT-based consensus mechanisms in permissionl…
▽ More
Permissionless blockchain technology offers numerous potential benefits for decentralised applications, such as security, transparency, and openness. BFT-based consensus mechanisms are widely adopted in the permissioned blockchain to meet the high scalability requirements of the network. Sybil attacks are one of the most potential threats when applying BFT-based consensus mechanisms in permissionless blockchain due to the lack of effective verification mechanisms for participants' identities. This paper presents a novel endorsement-based bootstrapping protocol with a signature algorithm that offers a streamlined, scalable identity endorsement and verification process. This approach effectively safeguards the BFT-based consensus mechanism against Sybil attacks. Using our proposed method, we have conducted thorough security analyses and simulation experiments to assess security, robustness, and scalability advantages in large-scale networks. Our results demonstrate that the scheme can effectively address the identity verification challenges when applying BFT-based consensus in a permissionless blockchain.
△ Less
Submitted 3 May, 2024;
originally announced May 2024.
-
Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy
Authors:
Wei-Ning Chen,
Berivan Isik,
Peter Kairouz,
Albert No,
Sewoong Oh,
Zheng Xu
Abstract:
We study $L_2$ mean estimation under central differential privacy and communication constraints, and address two key challenges: firstly, existing mean estimation schemes that simultaneously handle both constraints are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry, resulting in suboptimal leading constants in mean square…
▽ More
We study $L_2$ mean estimation under central differential privacy and communication constraints, and address two key challenges: firstly, existing mean estimation schemes that simultaneously handle both constraints are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry, resulting in suboptimal leading constants in mean square errors (MSEs); secondly, schemes achieving order-optimal communication-privacy trade-offs do not extend seamlessly to streaming differential privacy (DP) settings (e.g., tree aggregation or matrix factorization), rendering them incompatible with DP-FTRL type optimizers.
In this work, we tackle these issues by introducing a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP noise. Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that fast converge to those of the uncompressed Gaussian mechanism. Additionally, we extend the sparsification scheme to the matrix factorization framework under streaming DP and provide a precise accountant tailored for DP-FTRL type optimizers. Empirically, our method demonstrates at least a 100x improvement of compression for DP-SGD across various FL tasks.
△ Less
Submitted 1 May, 2024;
originally announced May 2024.
-
Hard-Thresholding Meets Evolution Strategies in Reinforcement Learning
Authors:
Chengqian Gao,
William de Vazelhes,
Hualin Zhang,
Bin Gu,
Zhiqiang Xu
Abstract:
Evolution Strategies (ES) have emerged as a competitive alternative for model-free reinforcement learning, showcasing exemplary performance in tasks like Mujoco and Atari. Notably, they shine in scenarios with imperfect reward functions, making them invaluable for real-world applications where dense reward signals may be elusive. Yet, an inherent assumption in ES, that all input features are task-…
▽ More
Evolution Strategies (ES) have emerged as a competitive alternative for model-free reinforcement learning, showcasing exemplary performance in tasks like Mujoco and Atari. Notably, they shine in scenarios with imperfect reward functions, making them invaluable for real-world applications where dense reward signals may be elusive. Yet, an inherent assumption in ES, that all input features are task-relevant, poses challenges, especially when confronted with irrelevant features common in real-world problems. This work scrutinizes this limitation, particularly focusing on the Natural Evolution Strategies (NES) variant. We propose NESHT, a novel approach that integrates Hard-Thresholding (HT) with NES to champion sparsity, ensuring only pertinent features are employed. Backed by rigorous analysis and empirical tests, NESHT demonstrates its promise in mitigating the pitfalls of irrelevant features and shines in complex decision-making problems like noisy Mujoco and Atari tasks.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Wildfire Risk Prediction: A Review
Authors:
Zhengsen Xu,
Jonathan Li,
Linlin Xu
Abstract:
Wildfires have significant impacts on global vegetation, wildlife, and humans. They destroy plant communities and wildlife habitats and contribute to increased emissions of carbon dioxide, nitrogen oxides, methane, and other pollutants. The prediction of wildfires relies on various independent variables combined with regression or machine learning methods. In this technical review, we describe the…
▽ More
Wildfires have significant impacts on global vegetation, wildlife, and humans. They destroy plant communities and wildlife habitats and contribute to increased emissions of carbon dioxide, nitrogen oxides, methane, and other pollutants. The prediction of wildfires relies on various independent variables combined with regression or machine learning methods. In this technical review, we describe the options for independent variables, data processing techniques, models, independent variables collinearity and importance estimation methods, and model performance evaluation metrics. First, we divide the independent variables into 4 aspects, including climate and meteorology conditions, socio-economical factors, terrain and hydrological features, and wildfire historical records. Second, preprocessing methods are described for different magnitudes, different spatial-temporal resolutions, and different formats of data. Third, the collinearity and importance evaluation methods of independent variables are also considered. Fourth, we discuss the application of statistical models, traditional machine learning models, and deep learning models in wildfire risk prediction. In this subsection, compared with other reviews, this manuscript particularly discusses the evaluation metrics and recent advancements in deep learning methods. Lastly, addressing the limitations of current research, this paper emphasizes the need for more effective deep learning time series forecasting algorithms, the utilization of three-dimensional data including ground and trunk fuel, extraction of more accurate historical fire point data, and improved model evaluation metrics.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Data Scoping: Effectively Learning the Evolution of Generic Transport PDEs
Authors:
Jiangce Chen,
Wenzhuo Xu,
Zeda Xu,
Noelia Grande Gutiérrez,
Sneha Prabha Narra,
Christopher McComb
Abstract:
Transport phenomena (e.g., fluid flows) are governed by time-dependent partial differential equations (PDEs) describing mass, momentum, and energy conservation, and are ubiquitous in many engineering applications. However, deep learning architectures are fundamentally incompatible with the simulation of these PDEs. This paper clearly articulates and then solves this incompatibility. The local-depe…
▽ More
Transport phenomena (e.g., fluid flows) are governed by time-dependent partial differential equations (PDEs) describing mass, momentum, and energy conservation, and are ubiquitous in many engineering applications. However, deep learning architectures are fundamentally incompatible with the simulation of these PDEs. This paper clearly articulates and then solves this incompatibility. The local-dependency of generic transport PDEs implies that it only involves local information to predict the physical properties at a location in the next time step. However, the deep learning architecture will inevitably increase the scope of information to make such predictions as the number of layers increases, which can cause sluggish convergence and compromise generalizability. This paper aims to solve this problem by proposing a distributed data scoping method with linear time complexity to strictly limit the scope of information to predict the local properties. The numerical experiments over multiple physics show that our data scoping method significantly accelerates training convergence and improves the generalizability of benchmark models on large-scale engineering simulations. Specifically, over the geometries not included in the training data for heat transferring simulation, it can increase the accuracy of Convolutional Neural Networks (CNNs) by 21.7 \% and that of Fourier Neural Operators (FNOs) by 38.5 \% on average.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Efficient and Flexible Method for Reducing Moderate-size Deep Neural Networks with Condensation
Authors:
Tianyi Chen,
Zhi-Qin John Xu
Abstract:
Neural networks have been extensively applied to a variety of tasks, achieving astounding results. Applying neural networks in the scientific field is an important research direction that is gaining increasing attention. In scientific applications, the scale of neural networks is generally moderate-size, mainly to ensure the speed of inference during application. Additionally, comparing neural net…
▽ More
Neural networks have been extensively applied to a variety of tasks, achieving astounding results. Applying neural networks in the scientific field is an important research direction that is gaining increasing attention. In scientific applications, the scale of neural networks is generally moderate-size, mainly to ensure the speed of inference during application. Additionally, comparing neural networks to traditional algorithms in scientific applications is inevitable. These applications often require rapid computations, making the reduction of neural network sizes increasingly important. Existing work has found that the powerful capabilities of neural networks are primarily due to their non-linearity. Theoretical work has discovered that under strong non-linearity, neurons in the same layer tend to behave similarly, a phenomenon known as condensation. Condensation offers an opportunity to reduce the scale of neural networks to a smaller subnetwork with similar performance. In this article, we propose a condensation reduction algorithm to verify the feasibility of this idea in practical problems. Our reduction method can currently be applied to both fully connected networks and convolutional networks, achieving positive results. In complex combustion acceleration tasks, we reduced the size of the neural network to 41.7% of its original scale while maintaining prediction accuracy. In the CIFAR10 image classification task, we reduced the network size to 11.5% of the original scale, still maintaining a satisfactory validation accuracy. Our method can be applied to most trained neural networks, reducing computational pressure and improving inference speed.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
S$^2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic
Authors:
Safa Messaoud,
Billel Mokeddem,
Zhenghai Xue,
Linsey Pang,
Bo An,
Haipeng Chen,
Sanjay Chawla
Abstract:
Learning expressive stochastic policies instead of deterministic ones has been proposed to achieve better stability, sample complexity, and robustness. Notably, in Maximum Entropy Reinforcement Learning (MaxEnt RL), the policy is modeled as an expressive Energy-Based Model (EBM) over the Q-values. However, this formulation requires the estimation of the entropy of such EBMs, which is an open probl…
▽ More
Learning expressive stochastic policies instead of deterministic ones has been proposed to achieve better stability, sample complexity, and robustness. Notably, in Maximum Entropy Reinforcement Learning (MaxEnt RL), the policy is modeled as an expressive Energy-Based Model (EBM) over the Q-values. However, this formulation requires the estimation of the entropy of such EBMs, which is an open problem. To address this, previous MaxEnt RL methods either implicitly estimate the entropy, resulting in high computational complexity and variance (SQL), or follow a variational inference procedure that fits simplified actor distributions (e.g., Gaussian) for tractability (SAC). We propose Stein Soft Actor-Critic (S$^2$AC), a MaxEnt RL algorithm that learns expressive policies without compromising efficiency. Specifically, S$^2$AC uses parameterized Stein Variational Gradient Descent (SVGD) as the underlying policy. We derive a closed-form expression of the entropy of such policies. Our formula is computationally efficient and only depends on first-order derivatives and vector products. Empirical results show that S$^2$AC yields more optimal solutions to the MaxEnt objective than SQL and SAC in the multi-goal environment, and outperforms SAC and SQL on the MuJoCo benchmark. Our code is available at: https://github.com/SafaMessaoud/S2AC-Energy-Based-RL-with-Stein-Soft-Actor-Critic
△ Less
Submitted 1 May, 2024;
originally announced May 2024.
-
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
Authors:
Kai Zhang,
Sai Bi,
Hao Tan,
Yuanbo Xiangli,
Nanxuan Zhao,
Kalyan Sunkavalli,
Zexiang Xu
Abstract:
We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian para…
▽ More
We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. In contrast to previous LRMs that can only reconstruct objects, by predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large variations in scale and complexity. We show that our model can work on both object and scene captures by training it on Objaverse and RealEstate10K respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks. Our project webpage is available at: https://sai-bi.github.io/project/gs-lrm/ .
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
MIPI 2024 Challenge on Nighttime Flare Removal: Methods and Results
Authors:
Yuekun Dai,
Dafeng Zhang,
Xiaoming Li,
Zongsheng Yue,
Chongyi Li,
Shangchen Zhou,
Ruicheng Feng,
Peiqing Yang,
Zhezhu Jin,
Guanqun Liu,
Chen Change Loy,
Lize Zhang,
Shuai Liu,
Chaoyu Feng,
Luyang Wang,
Shuan Chen,
Guangqi Shao,
Xiaotao Wang,
Lei Lei,
Qirui Yang,
Qihua Cheng,
Zhiqiang Xu,
Yihao Liu,
Huanjing Yue,
Jingyu Yang
, et al. (38 additional authors not shown)
Abstract:
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra…
▽ More
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Transformer-Enhanced Motion Planner: Attention-Guided Sampling for State-Specific Decision Making
Authors:
Lei Zhuang,
Jingdong Zhao,
Yuntao Li,
Zichun Xu,
Liangliang Zhao,
Hong Liu
Abstract:
Sampling-based motion planning (SBMP) algorithms are renowned for their robust global search capabilities. However, the inherent randomness in their sampling mechanisms often result in inconsistent path quality and limited search efficiency. In response to these challenges, this work proposes a novel deep learning-based motion planning framework, named Transformer-Enhanced Motion Planner (TEMP), w…
▽ More
Sampling-based motion planning (SBMP) algorithms are renowned for their robust global search capabilities. However, the inherent randomness in their sampling mechanisms often result in inconsistent path quality and limited search efficiency. In response to these challenges, this work proposes a novel deep learning-based motion planning framework, named Transformer-Enhanced Motion Planner (TEMP), which synergizes an Environmental Information Semantic Encoder (EISE) with a Motion Planning Transformer (MPT). EISE converts environmental data into semantic environmental information (SEI), providing MPT with an enriched environmental comprehension. MPT leverages an attention mechanism to dynamically recalibrate its focus on SEI, task objectives, and historical planning data, refining the sampling node generation. To demonstrate the capabilities of TEMP, we train our model using a dataset comprised of planning results produced by the RRT*. EISE and MPT are collaboratively trained, enabling EISE to autonomously learn and extract patterns from environmental data, thereby forming semantic representations that MPT could more effectively interpret and utilize for motion planning. Subsequently, we conducted a systematic evaluation of TEMP's efficacy across diverse task dimensions, which demonstrates that TEMP achieves exceptional performance metrics and a heightened degree of generalizability compared to state-of-the-art SBMPs.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Exploring the Capability of LLMs in Performing Low-Level Visual Analytic Tasks on SVG Data Visualizations
Authors:
Zhongzheng Xu,
Emily Wall
Abstract:
Data visualizations help extract insights from datasets, but reaching these insights requires decomposing high level goals into low-level analytic tasks that can be complex due to varying degrees of data literacy and visualization experience. Recent advancements in large language models (LLMs) have shown promise for lowering barriers for users to achieve tasks such as writing code and may likewise…
▽ More
Data visualizations help extract insights from datasets, but reaching these insights requires decomposing high level goals into low-level analytic tasks that can be complex due to varying degrees of data literacy and visualization experience. Recent advancements in large language models (LLMs) have shown promise for lowering barriers for users to achieve tasks such as writing code and may likewise facilitate visualization insight. Scalable Vector Graphics (SVG), a text-based image format common in data visualizations, matches well with the text sequence processing of transformer-based LLMs. In this paper, we explore the capability of LLMs to perform 10 low-level visual analytic tasks defined by Amar, Eagan, and Stasko directly on SVG-based visualizations. Using zero-shot prompts, we instruct the models to provide responses or modify the SVG code based on given visualizations. Our findings demonstrate that LLMs can effectively modify existing SVG visualizations for some tasks like Cluster but perform poorly on tasks requiring mathematical operations like Compute Derived Value. We also discovered that LLM performance can vary based on factors such as the number of data points, the presence of value labels, and the chart type. Our findings contribute to gauging the general capabilities of LLMs and highlight the need for further exploration and development to fully harness their potential in supporting visual analytic tasks.
△ Less
Submitted 30 April, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
A Survey on Diffusion Models for Time Series and Spatio-Temporal Data
Authors:
Yiyuan Yang,
Ming Jin,
Haomin Wen,
Chaoli Zhang,
Yuxuan Liang,
Lintao Ma,
Yi Wang,
Chenghao Liu,
Bin Yang,
Zenglin Xu,
Jiang Bian,
Shirui Pan,
Qingsong Wen
Abstract:
The study of time series data is crucial for understanding trends and anomalies over time, enabling predictive insights across various sectors. Spatio-temporal data, on the other hand, is vital for analyzing phenomena in both space and time, providing a dynamic perspective on complex system interactions. Recently, diffusion models have seen widespread application in time series and spatio-temporal…
▽ More
The study of time series data is crucial for understanding trends and anomalies over time, enabling predictive insights across various sectors. Spatio-temporal data, on the other hand, is vital for analyzing phenomena in both space and time, providing a dynamic perspective on complex system interactions. Recently, diffusion models have seen widespread application in time series and spatio-temporal data mining. Not only do they enhance the generative and inferential capabilities for sequential and temporal data, but they also extend to other downstream tasks. In this survey, we comprehensively and thoroughly review the use of diffusion models in time series and spatio-temporal data, categorizing them by model category, task type, data modality, and practical application domain. In detail, we categorize diffusion models into unconditioned and conditioned types and discuss time series data and spatio-temporal data separately. Unconditioned models, which operate unsupervised, are subdivided into probability-based and score-based models, serving predictive and generative tasks such as forecasting, anomaly detection, classification, and imputation. Conditioned models, on the other hand, utilize extra information to enhance performance and are similarly divided for both predictive and generative tasks. Our survey extensively covers their application in various fields, including healthcare, recommendation, climate, energy, audio, and transportation, providing a foundational understanding of how these models analyze and generate data. Through this structured overview, we aim to provide researchers and practitioners with a comprehensive understanding of diffusion models for time series and spatio-temporal data analysis, aiming to direct future innovations and applications by addressing traditional challenges and exploring innovative solutions within the diffusion model framework.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
Belt and Brace: When Federated Learning Meets Differential Privacy
Authors:
Xuebin Ren,
Shusen Yang,
Cong Zhao,
Julie McCann,
Zongben Xu
Abstract:
Federated learning (FL) has great potential for large-scale machine learning (ML) without exposing raw data.Differential privacy (DP) is the de facto standard of privacy protection with provable guarantees.Advances in ML suggest that DP would be a perfect fit for FL with comprehensive privacy preservation. Hence, extensive efforts have been devoted to achieving practically usable FL with DP, which…
▽ More
Federated learning (FL) has great potential for large-scale machine learning (ML) without exposing raw data.Differential privacy (DP) is the de facto standard of privacy protection with provable guarantees.Advances in ML suggest that DP would be a perfect fit for FL with comprehensive privacy preservation. Hence, extensive efforts have been devoted to achieving practically usable FL with DP, which however is still challenging.Practitioners often not only are not fully aware of its development and categorization, but also face a hard choice between privacy and utility. Therefore, it calls for a holistic review of current advances and an investigation on the challenges and opportunities for highly usable FL systems with a DP guarantee. In this article, we first introduce the primary concepts of FL and DP, and highlight the benefits of integration. We then review the current developments by categorizing different paradigms and notions. Aiming at usable FL with DP, we present the optimization principles to seek a better tradeoff between model utility and privacy loss. Finally, we discuss future challenges in the emergent areas and relevant research topics.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
Evaluating Concept-based Explanations of Language Models: A Study on Faithfulness and Readability
Authors:
Meng Li,
Haoran Jin,
Ruixuan Huang,
Zhihao Xu,
Defu Lian,
Zijia Lin,
Di Zhang,
Xiting Wang
Abstract:
Despite the surprisingly high intelligence exhibited by Large Language Models (LLMs), we are somehow intimidated to fully deploy them into real-life applications considering their black-box nature. Concept-based explanations arise as a promising avenue for explaining what the LLMs have learned, making them more transparent to humans. However, current evaluations for concepts tend to be heuristic a…
▽ More
Despite the surprisingly high intelligence exhibited by Large Language Models (LLMs), we are somehow intimidated to fully deploy them into real-life applications considering their black-box nature. Concept-based explanations arise as a promising avenue for explaining what the LLMs have learned, making them more transparent to humans. However, current evaluations for concepts tend to be heuristic and non-deterministic, e.g. case study or human evaluation, hindering the development of the field. To bridge the gap, we approach concept-based explanation evaluation via faithfulness and readability. We first introduce a formal definition of concept generalizable to diverse concept-based explanations. Based on this, we quantify faithfulness via the difference in the output upon perturbation. We then provide an automatic measure for readability, by measuring the coherence of patterns that maximally activate a concept. This measure serves as a cost-effective and reliable substitute for human evaluation. Finally, based on measurement theory, we describe a meta-evaluation method for evaluating the above measures via reliability and validity, which can be generalized to other tasks as well. Extensive experimental analysis has been conducted to validate and inform the selection of concept evaluation measures.
△ Less
Submitted 29 April, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
Additive Spanner Lower Bounds with Optimal Inner Graph Structure
Authors:
Greg Bodwin,
Gary Hoppenworth,
Virginia Vassilevska Williams,
Nicole Wein,
Zixuan Xu
Abstract:
We construct $n$-node graphs on which any $O(n)$-size spanner has additive error at least $+Ω(n^{3/17})$, improving on the previous best lower bound of $Ω(n^{1/7})$ [Bodwin-Hoppenworth FOCS '22]. Our construction completes the first two steps of a particular three-step research program, introduced in prior work and overviewed here, aimed at producing tight bounds for the problem by aligning aspect…
▽ More
We construct $n$-node graphs on which any $O(n)$-size spanner has additive error at least $+Ω(n^{3/17})$, improving on the previous best lower bound of $Ω(n^{1/7})$ [Bodwin-Hoppenworth FOCS '22]. Our construction completes the first two steps of a particular three-step research program, introduced in prior work and overviewed here, aimed at producing tight bounds for the problem by aligning aspects of the upper and lower bound constructions. More specifically, we develop techniques that enable the use of inner graphs in the lower bound framework whose technical properties are provably tight with the corresponding assumptions made in the upper bounds. As an additional application of our techniques, we improve the corresponding lower bound for $O(n)$-size additive emulators to $+Ω(n^{1/14})$.
△ Less
Submitted 28 April, 2024;
originally announced April 2024.
-
Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction
Authors:
Guozheng Li,
Peng Wang,
Wenjun Ke,
Yikai Guo,
Ke Ji,
Ziyu Shang,
Jiajun Liu,
Zijie Xu
Abstract:
Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still suffer from poor performances compared to most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from…
▽ More
Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still suffer from poor performances compared to most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from training examples, and (2) enabling LLMs exhibit strong ICL abilities in RE. On the one hand, retrieving good demonstrations is a non-trivial process in RE, which easily results in low relevance regarding entities and relations. On the other hand, ICL with an LLM achieves poor performance in RE while RE is different from language modeling in nature or the LLM is not large enough. In this work, we propose a novel recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora (training examples) to enable relevant retrieving and reliable in-context reasoning. Specifically, we distill the consistently ontological knowledge from training datasets to let LLMs generate relevant entity pairs grounded by retrieval corpora as valid queries. These entity pairs are then used to retrieve relevant training examples from the retrieval corpora as demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive experiments on different LLMs and RE datasets demonstrate that our method generates relevant and valid entity pairs and boosts ICL abilities of LLMs, achieving competitive or new state-of-the-art performance on sentence-level RE compared to previous supervised fine-tuning methods and ICL-based methods.
△ Less
Submitted 27 April, 2024;
originally announced April 2024.
-
Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors
Authors:
Guozheng Li,
Peng Wang,
Jiajun Liu,
Yikai Guo,
Ke Ji,
Ziyu Shang,
Zijie Xu
Abstract:
Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero and few-shot learning, recent studies indicate that current LLMs still struggle with zero and few-shot RE. Previous studies are mainly dedicated to design prompt formats and…
▽ More
Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero and few-shot learning, recent studies indicate that current LLMs still struggle with zero and few-shot RE. Previous studies are mainly dedicated to design prompt formats and select good examples for improving ICL-based RE. Although both factors are vital for ICL, if one can fundamentally boost the ICL capability of LLMs in RE, the zero and few-shot RE performance via ICL would be significantly improved. To this end, we introduce \textsc{Micre} (\textbf{M}eta \textbf{I}n-\textbf{C}ontext learning of LLMs for \textbf{R}elation \textbf{E}xtraction), a new meta-training framework for zero and few-shot RE where an LLM is tuned to do ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model becomes more effectively to learn a new RE task in context by conditioning on a few training examples with no parameter updates or task-specific templates at inference time, enabling better zero and few-shot task generalization. We experiment \textsc{Micre} on various LLMs with different model scales and 12 public RE datasets, and then evaluate it on unseen RE benchmarks under zero and few-shot settings. \textsc{Micre} delivers comparable or superior performance compared to a range of baselines including supervised fine-tuning and typical in-context learning methods. We find that the gains are particular significant for larger model scales, and using a diverse set of the meta-training RE datasets is key to improvements. Empirically, we show that \textsc{Micre} can transfer the relation semantic knowledge via relation label name during inference on target RE datasets.
△ Less
Submitted 27 April, 2024;
originally announced April 2024.
-
Empirical Analysis of Dialogue Relation Extraction with Large Language Models
Authors:
Guozheng Li,
Zijie Xu,
Ziyu Shang,
Jiajun Liu,
Ke Ji,
Yikai Guo
Abstract:
Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher person pronoun frequency and lower information density in dialogues. However, existing DRE methods still suffer from two serious issues: (1) hard to capture long and sparse multi-turn information, and (2) struggle to extract golden relat…
▽ More
Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher person pronoun frequency and lower information density in dialogues. However, existing DRE methods still suffer from two serious issues: (1) hard to capture long and sparse multi-turn information, and (2) struggle to extract golden relations based on partial dialogues, which motivates us to discover more effective methods that can alleviate the above issues. We notice that the rise of large language models (LLMs) has sparked considerable interest in evaluating their performance across diverse tasks. To this end, we initially investigate the capabilities of different LLMs in DRE, considering both proprietary models and open-source models. Interestingly, we discover that LLMs significantly alleviate two issues in existing DRE methods. Generally, we have following findings: (1) scaling up model size substantially boosts the overall DRE performance and achieves exceptional results, tackling the difficulty of capturing long and sparse multi-turn information; (2) LLMs encounter with much smaller performance drop from entire dialogue setting to partial dialogue setting compared to existing methods; (3) LLMs deliver competitive or superior performances under both full-shot and few-shot settings compared to current state-of-the-art; (4) LLMs show modest performances on inverse relations but much stronger improvements on general relations, and they can handle dialogues of various lengths especially for longer sequences.
△ Less
Submitted 27 April, 2024;
originally announced April 2024.
-
Verco: Learning Coordinated Verbal Communication for Multi-agent Reinforcement Learning
Authors:
Dapeng Li,
Hang Dong,
Lu Wang,
Bo Qiao,
Si Qin,
Qingwei Lin,
Dongmei Zhang,
Qi Zhang,
Zhiwei Xu,
Bin Zhang,
Guoliang Fan
Abstract:
In recent years, multi-agent reinforcement learning algorithms have made significant advancements in diverse gaming environments, leading to increased interest in the broader application of such techniques. To address the prevalent challenge of partial observability, communication-based algorithms have improved cooperative performance through the sharing of numerical embedding between agents. Howe…
▽ More
In recent years, multi-agent reinforcement learning algorithms have made significant advancements in diverse gaming environments, leading to increased interest in the broader application of such techniques. To address the prevalent challenge of partial observability, communication-based algorithms have improved cooperative performance through the sharing of numerical embedding between agents. However, the understanding of the formation of collaborative mechanisms is still very limited, making designing a human-understandable communication mechanism a valuable problem to address. In this paper, we propose a novel multi-agent reinforcement learning algorithm that embeds large language models into agents, endowing them with the ability to generate human-understandable verbal communication. The entire framework has a message module and an action module. The message module is responsible for generating and sending verbal messages to other agents, effectively enhancing information sharing among agents. To further enhance the message module, we employ a teacher model to generate message labels from the global view and update the student model through Supervised Fine-Tuning (SFT). The action module receives messages from other agents and selects actions based on current local observations and received messages. Experiments conducted on the Overcooked game demonstrate our method significantly enhances the learning efficiency and performance of existing methods, while also providing an interpretable tool for humans to understand the process of multi-agent cooperation.
△ Less
Submitted 27 April, 2024;
originally announced April 2024.
-
Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering
Authors:
Zhentao Xu,
Mark Jerome Cruz,
Matthew Guevara,
Tie Wang,
Manasi Deshpande,
Xiaofeng Wang,
Zheng Li
Abstract:
In customer service technical support, swiftly and accurately retrieving relevant past issues is critical for efficiently resolving customer inquiries. The conventional retrieval methods in retrieval-augmented generation (RAG) for large language models (LLMs) treat a large corpus of past issue tracking tickets as plain text, ignoring the crucial intra-issue structure and inter-issue relations, whi…
▽ More
In customer service technical support, swiftly and accurately retrieving relevant past issues is critical for efficiently resolving customer inquiries. The conventional retrieval methods in retrieval-augmented generation (RAG) for large language models (LLMs) treat a large corpus of past issue tracking tickets as plain text, ignoring the crucial intra-issue structure and inter-issue relations, which limits performance. We introduce a novel customer service question-answering method that amalgamates RAG with a knowledge graph (KG). Our method constructs a KG from historical issues for use in retrieval, retaining the intra-issue structure and inter-issue relations. During the question-answering phase, our method parses consumer queries and retrieves related sub-graphs from the KG to generate answers. This integration of a KG not only improves retrieval accuracy by preserving customer service structure information but also enhances answering quality by mitigating the effects of text segmentation. Empirical assessments on our benchmark datasets, utilizing key retrieval (MRR, Recall@K, NDCG@K) and text generation (BLEU, ROUGE, METEOR) metrics, reveal that our method outperforms the baseline by 77.6% in MRR and by 0.32 in BLEU. Our method has been deployed within LinkedIn's customer service team for approximately six months and has reduced the median per-issue resolution time by 28.6%.
△ Less
Submitted 6 May, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
-
Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos
Authors:
Zhengze Xu,
Mengting Chen,
Zhao Wang,
Linyu Xing,
Zhonghua Zhai,
Nong Sang,
Jinsong Lan,
Shuai Xiao,
Changxin Gao
Abstract:
Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-u…
▽ More
Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
△ Less
Submitted 26 April, 2024;
originally announced April 2024.
-
Adversarial Reweighting with $α$-Power Maximization for Domain Adaptation
Authors:
Xiang Gu,
Xi Yu,
Yan Yang,
Jian Sun,
Zongben Xu
Abstract:
The practical Domain Adaptation (DA) tasks, e.g., Partial DA (PDA), open-set DA, universal DA, and test-time adaptation, have gained increasing attention in the machine learning community. In this paper, we propose a novel approach, dubbed Adversarial Reweighting with $α$-Power Maximization (ARPM), for PDA where the source domain contains private classes absent in target domain. In ARPM, we propos…
▽ More
The practical Domain Adaptation (DA) tasks, e.g., Partial DA (PDA), open-set DA, universal DA, and test-time adaptation, have gained increasing attention in the machine learning community. In this paper, we propose a novel approach, dubbed Adversarial Reweighting with $α$-Power Maximization (ARPM), for PDA where the source domain contains private classes absent in target domain. In ARPM, we propose a novel adversarial reweighting model that adversarially learns to reweight source domain data to identify source-private class samples by assigning smaller weights to them, for mitigating potential negative transfer. Based on the adversarial reweighting, we train the transferable recognition model on the reweighted source distribution to be able to classify common class data. To reduce the prediction uncertainty of the recognition model on the target domain for PDA, we present an $α$-power maximization mechanism in ARPM, which enriches the family of losses for reducing the prediction uncertainty for PDA. Extensive experimental results on five PDA benchmarks, i.e., Office-31, Office-Home, VisDA-2017, ImageNet-Caltech, and DomainNet, show that our method is superior to recent PDA methods. Ablation studies also confirm the effectiveness of components in our approach. To theoretically analyze our method, we deduce an upper bound of target domain expected error for PDA, which is approximately minimized in our approach. We further extend ARPM to open-set DA, universal DA, and test time adaptation, and verify the usefulness through experiments.
△ Less
Submitted 26 April, 2024;
originally announced April 2024.
-
V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection
Authors:
Xuanyu Zhang,
Youmin Xu,
Runyi Li,
Jiwen Yu,
Weiqi Li,
Zhipei Xu,
Jian Zhang
Abstract:
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making video local editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To solve this urgent issue, V2A-Mark is proposed to address the limitations of current video tampering forensics, such as poor generalizabili…
▽ More
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making video local editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To solve this urgent issue, V2A-Mark is proposed to address the limitations of current video tampering forensics, such as poor generalizability, singular function, and single modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance the localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, crucial for the sustainable development of video editing in the AIGC video era.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Continual Learning of Large Language Models: A Comprehensive Survey
Authors:
Haizhou Shi,
Zihao Xu,
Hengyi Wang,
Weiyi Qin,
Wenyuan Wang,
Yibin Wang,
Hao Wang
Abstract:
The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant…
▽ More
The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant performance degradation in previous knowledge domains -- a phenomenon known as "catastrophic forgetting". While extensively studied in the continual learning (CL) community, it presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview of the current research progress on LLMs within the context of CL. This survey is structured into four main sections: we first describe an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). We then summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of evaluation protocols for continual learning with LLMs, along with the current available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6). The full list of papers examined in this survey is available at https://github.com/Wang-ML-Lab/llm-continual-learning-survey.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
More Asymmetry Yields Faster Matrix Multiplication
Authors:
Josh Alman,
Ran Duan,
Virginia Vassilevska Williams,
Yinzhan Xu,
Zixuan Xu,
Renfei Zhou
Abstract:
We present a new improvement on the laser method for designing fast matrix multiplication algorithms. The new method further develops the recent advances by [Duan, Wu, Zhou FOCS 2023] and [Vassilevska Williams, Xu, Xu, Zhou SODA 2024]. Surprisingly the new improvement is achieved by incorporating more asymmetry in the analysis, circumventing a fundamental tool of prior work that requires two of th…
▽ More
We present a new improvement on the laser method for designing fast matrix multiplication algorithms. The new method further develops the recent advances by [Duan, Wu, Zhou FOCS 2023] and [Vassilevska Williams, Xu, Xu, Zhou SODA 2024]. Surprisingly the new improvement is achieved by incorporating more asymmetry in the analysis, circumventing a fundamental tool of prior work that requires two of the three dimensions to be treated identically. The method yields a new bound on the square matrix multiplication exponent $$ω<2.371339,$$ improved from the previous bound of $ω<2.371552$. We also improve the bounds of the exponents for multiplying rectangular matrices of various shapes.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Reasoners
Authors:
Qihuang Zhong,
Kang Wang,
Ziyang Xu,
Juhua Liu,
Liang Ding,
Bo Du,
Dacheng Tao
Abstract:
Chain of Thought prompting strategy has enhanced the performance of Large Language Models (LLMs) across various NLP tasks. However, it still has shortcomings when dealing with complex reasoning tasks, including understanding errors, calculation errors and process errors (e.g., missing-step and hallucinations). Subsequently, our in-depth analyses among various error types show that deeply understan…
▽ More
Chain of Thought prompting strategy has enhanced the performance of Large Language Models (LLMs) across various NLP tasks. However, it still has shortcomings when dealing with complex reasoning tasks, including understanding errors, calculation errors and process errors (e.g., missing-step and hallucinations). Subsequently, our in-depth analyses among various error types show that deeply understanding the whole problem is critical in addressing complicated reasoning tasks. Motivated by this, we propose a simple-yet-effective method, namely Deeply Understanding the Problems (DUP), to enhance the LLMs' reasoning abilities. The core of our method is to encourage the LLMs to deeply understand the problems and leverage the key problem-solving information for better reasoning. Extensive experiments on 10 diverse reasoning benchmarks show that our DUP method consistently outperforms the other counterparts by a large margin. More encouragingly, DUP achieves a new SOTA result on the GSM8K benchmark, with an accuracy of 97.1% in a zero-shot setting.
△ Less
Submitted 28 April, 2024; v1 submitted 23 April, 2024;
originally announced April 2024.
-
"A good pun is its own reword": Can Large Language Models Understand Puns?
Authors:
Zhijun Xu,
Siyu Yuan,
Lingjie Chen,
Deqing Yang
Abstract:
Puns play a vital role in academic research due to their distinct structure and clear definition, which aid in the comprehensive analysis of linguistic humor. However, the understanding of puns in large language models (LLMs) has not been thoroughly examined, limiting their use in creative writing and humor creation. In this paper, we leverage three popular tasks, i.e., pun recognition, explanatio…
▽ More
Puns play a vital role in academic research due to their distinct structure and clear definition, which aid in the comprehensive analysis of linguistic humor. However, the understanding of puns in large language models (LLMs) has not been thoroughly examined, limiting their use in creative writing and humor creation. In this paper, we leverage three popular tasks, i.e., pun recognition, explanation and generation to systematically evaluate the capabilities of LLMs in pun understanding. In addition to adopting the automated evaluation metrics from prior research, we introduce new evaluation methods and metrics that are better suited to the in-context learning paradigm of LLMs. These new metrics offer a more rigorous assessment of an LLM's ability to understand puns and align more closely with human cognition than previous metrics. Our findings reveal the "lazy pun generation" pattern and identify the primary challenges LLMs encounter in understanding puns.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
DMesh: A Differentiable Representation for General Meshes
Authors:
Sanghyun Son,
Matheus Gadelha,
Yang Zhou,
Zexiang Xu,
Ming C. Lin,
Yi Zhou
Abstract:
We present a differentiable representation, DMesh, for general 3D triangular meshes. DMesh considers both the geometry and connectivity information of a mesh. In our design, we first get a set of convex tetrahedra that compactly tessellates the domain based on Weighted Delaunay Triangulation (WDT), and formulate probability of faces to exist on our desired mesh in a differentiable manner based on…
▽ More
We present a differentiable representation, DMesh, for general 3D triangular meshes. DMesh considers both the geometry and connectivity information of a mesh. In our design, we first get a set of convex tetrahedra that compactly tessellates the domain based on Weighted Delaunay Triangulation (WDT), and formulate probability of faces to exist on our desired mesh in a differentiable manner based on the WDT. This enables DMesh to represent meshes of various topology in a differentiable way, and allows us to reconstruct the mesh under various observations, such as point cloud and multi-view images using gradient-based optimization. The source code and full paper is available at: https://sonsang.github.io/dmesh-project.
△ Less
Submitted 20 April, 2024;
originally announced April 2024.
-
Foundation Model assisted Weakly Supervised LiDAR Semantic Segmentation
Authors:
Yilong Chen,
Zongyi Xu,
xiaoshui Huang,
Ruicheng Zhang,
Xinqi Jiang,
Xinbo Gao
Abstract:
Current point cloud semantic segmentation has achieved great advances when given sufficient labels. However, the dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points, followed by utilizing SAM (a Foundation model) to generate semant…
▽ More
Current point cloud semantic segmentation has achieved great advances when given sufficient labels. However, the dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points, followed by utilizing SAM (a Foundation model) to generate semantic segmentation labels for the images. Finally, by mapping the segmentation labels of the images to the LiDAR space using the intrinsic and extrinsic parameters of the camera and LiDAR, we obtain labels for point cloud semantic segmentation, and release Scatter-KITTI and Scatter-nuScenes, which are the first works to utilize image segmentation-based SAM for weakly supervised point cloud semantic segmentation. Furthermore, to mitigate the influence of erroneous pseudo labels obtained from sparse annotations on point cloud features, we propose a multi-modal weakly supervised network for LiDAR semantic segmentation, called MM-ScatterNet. This network combines features from both point cloud and image modalities, enhancing the representation learning of point clouds by introducing consistency constraints between multi-modal features and point cloud features. On the SemanticKITTI dataset, we achieve 66\% of fully supervised performance using only 0.02% of annotated data, and on the NuScenes dataset, we achieve 95% of fully supervised performance using only 0.1% labeled points.
△ Less
Submitted 19 April, 2024;
originally announced April 2024.
-
DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects
Authors:
Dominik Bauer,
Zhenjia Xu,
Shuran Song
Abstract:
Manipulation of elastoplastic objects like dough often involves topological changes such as splitting and merging. The ability to accurately predict these topological changes that a specific action might incur is critical for planning interactions with elastoplastic objects. We present DoughNet, a Transformer-based architecture for handling these challenges, consisting of two components. First, a…
▽ More
Manipulation of elastoplastic objects like dough often involves topological changes such as splitting and merging. The ability to accurately predict these topological changes that a specific action might incur is critical for planning interactions with elastoplastic objects. We present DoughNet, a Transformer-based architecture for handling these challenges, consisting of two components. First, a denoising autoencoder represents deformable objects of varying topology as sets of latent codes. Second, a visual predictive model performs autoregressive set prediction to determine long-horizon geometrical deformation and topological changes purely in latent space. Given a partial initial state and desired manipulation trajectories, it infers all resulting object geometries and topologies at each step. DoughNet thereby allows to plan robotic manipulation; selecting a suited tool, its pose and opening width to recreate robot- or human-made goals. Our experiments in simulated and real environments show that DoughNet is able to significantly outperform related approaches that consider deformation only as geometrical change.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
MeshLRM: Large Reconstruction Model for High-Quality Mesh
Authors:
Xinyue Wei,
Kai Zhang,
Sai Bi,
Hao Tan,
Fujun Luan,
Valentin Deschaintre,
Kalyan Sunkavalli,
Hao Su,
Zexiang Xu
Abstract:
We propose MeshLRM, a novel LRM-based approach that can reconstruct a high-quality mesh from merely four input images in less than one second. Different from previous large reconstruction models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering within the LRM framework. This allows for end-to-end mesh reconstruction by fine-tuning a p…
▽ More
We propose MeshLRM, a novel LRM-based approach that can reconstruct a high-quality mesh from merely four input images in less than one second. Different from previous large reconstruction models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering within the LRM framework. This allows for end-to-end mesh reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering. Moreover, we improve the LRM architecture by simplifying several complex designs in previous LRMs. MeshLRM's NeRF initialization is sequentially trained with low- and high-resolution images; this new LRM training strategy enables significantly faster convergence and thereby leads to better quality with less compute. Our approach achieves state-of-the-art mesh reconstruction from sparse-view inputs and also allows for many downstream applications, including text-to-3D and single-image-to-3D generation. Project page: https://sarahweiii.github.io/meshlrm/
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
CMNEE: A Large-Scale Document-Level Event Extraction Dataset based on Open-Source Chinese Military News
Authors:
Mengna Zhu,
Zijie Xu,
Kaisheng Zeng,
Kaiming Xiao,
Mao Wang,
Wenjun Ke,
Hongbin Huang
Abstract:
Extracting structured event knowledge, including event triggers and corresponding arguments, from military texts is fundamental to many applications, such as intelligence analysis and decision assistance. However, event extraction in the military field faces the data scarcity problem, which impedes the research of event extraction models in this domain. To alleviate this problem, we propose CMNEE,…
▽ More
Extracting structured event knowledge, including event triggers and corresponding arguments, from military texts is fundamental to many applications, such as intelligence analysis and decision assistance. However, event extraction in the military field faces the data scarcity problem, which impedes the research of event extraction models in this domain. To alleviate this problem, we propose CMNEE, a large-scale, document-level open-source Chinese Military News Event Extraction dataset. It contains 17,000 documents and 29,223 events, which are all manually annotated based on a pre-defined schema for the military domain including 8 event types and 11 argument role types. We designed a two-stage, multi-turns annotation strategy to ensure the quality of CMNEE and reproduced several state-of-the-art event extraction models with a systematic evaluation. The experimental results on CMNEE fall shorter than those on other domain datasets obviously, which demonstrates that event extraction for military domain poses unique challenges and requires further research efforts. Our code and data can be obtained from https://github.com/Mzzzhu/CMNEE.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector
Authors:
Zhihao Xu,
Ruixuan Huang,
Xiting Wang,
Fangzhao Wu,
Jing Yao,
Xing Xie
Abstract:
Current open-source large language models (LLMs) are often undergone careful safety alignment before public release. Some attack methods have also been proposed that help check for safety vulnerabilities in LLMs to ensure alignment robustness. However, many of these methods have moderate attack success rates. Even when successful, the harmfulness of their outputs cannot be guaranteed, leading to s…
▽ More
Current open-source large language models (LLMs) are often undergone careful safety alignment before public release. Some attack methods have also been proposed that help check for safety vulnerabilities in LLMs to ensure alignment robustness. However, many of these methods have moderate attack success rates. Even when successful, the harmfulness of their outputs cannot be guaranteed, leading to suspicions that these methods have not accurately identified the safety vulnerabilities of LLMs. In this paper, we introduce a LLM attack method utilizing concept-based model explanation, where we extract safety concept activation vectors (SCAVs) from LLMs' activation space, enabling efficient attacks on well-aligned LLMs like LLaMA-2, achieving near 100% attack success rate as if LLMs are completely unaligned. This suggests that LLMs, even after thorough safety alignment, could still pose potential risks to society upon public release. To evaluate the harmfulness of outputs resulting with various attack methods, we propose a comprehensive evaluation method that reduces the potential inaccuracies of existing evaluations, and further validate that our method causes more harmful content. Additionally, we discover that the SCAVs show some transferability across different open-source LLMs.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning
Authors:
Zhengyang Liang,
Meiyu Liang,
Wei Huang,
Yawen Li,
Zhe Xue
Abstract:
In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose a nove…
▽ More
In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose a novel dynamic self-adaptive multiscale distillation from pre-trained multimodal large model for efficient cross-modal representation learning for the first time. Unlike existing distillation methods, our strategy employs a multiscale perspective, enabling the extraction structural knowledge across from the pre-trained multimodal large model. Ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component eliminating the need for manual loss weight adjustments and dynamically balances each loss item during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited for various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments has demonstrated that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model utilizes only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Learning Feature Inversion for Multi-class Anomaly Detection under General-purpose COCO-AD Benchmark
Authors:
Jiangning Zhang,
Chengjie Wang,
Xiangtai Li,
Guanzhong Tian,
Zhucun Xue,
Yong Liu,
Guansong Pang,
Dacheng Tao
Abstract:
Anomaly detection (AD) is often focused on detecting anomaly areas for industrial quality inspection and medical lesion examination. However, due to the specific scenario targets, the data scale for AD is relatively small, and evaluation metrics are still deficient compared to classic vision tasks, such as object detection and semantic segmentation. To fill these gaps, this work first constructs a…
▽ More
Anomaly detection (AD) is often focused on detecting anomaly areas for industrial quality inspection and medical lesion examination. However, due to the specific scenario targets, the data scale for AD is relatively small, and evaluation metrics are still deficient compared to classic vision tasks, such as object detection and semantic segmentation. To fill these gaps, this work first constructs a large-scale and general-purpose COCO-AD dataset by extending COCO to the AD field. This enables fair evaluation and sustainable development for different methods on this challenging benchmark. Moreover, current metrics such as AU-ROC have nearly reached saturation on simple datasets, which prevents a comprehensive evaluation of different methods. Inspired by the metrics in the segmentation field, we further propose several more practical threshold-dependent AD-specific metrics, ie, m$F_1$$^{.2}_{.8}$, mAcc$^{.2}_{.8}$, mIoU$^{.2}_{.8}$, and mIoU-max. Motivated by GAN inversion's high-quality reconstruction capability, we propose a simple but more powerful InvAD framework to achieve high-quality feature reconstruction. Our method improves the effectiveness of reconstruction-based methods on popular MVTec AD, VisA, and our newly proposed COCO-AD datasets under a multi-class unsupervised setting, where only a single detection model is trained to detect anomalies from different classes. Extensive ablation experiments have demonstrated the effectiveness of each component of our InvAD. Full codes and models are available at https://github.com/zhangzjn/ader.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Conditional Prototype Rectification Prompt Learning
Authors:
Haoxing Chen,
Yaohui Li,
Zizheng Huang,
Yan Hong,
Zhuoer Xu,
Zhangxuan Gu,
Jun Lan,
Huijia Zhu,
Weiqiang Wang
Abstract:
Pre-trained large-scale vision-language models (VLMs) have acquired profound understanding of general visual concepts. Recent advancements in efficient transfer learning (ETL) have shown remarkable success in fine-tuning VLMs within the scenario of limited data, introducing only a few parameters to harness task-specific insights from VLMs. Despite significant progress, current leading ETL methods…
▽ More
Pre-trained large-scale vision-language models (VLMs) have acquired profound understanding of general visual concepts. Recent advancements in efficient transfer learning (ETL) have shown remarkable success in fine-tuning VLMs within the scenario of limited data, introducing only a few parameters to harness task-specific insights from VLMs. Despite significant progress, current leading ETL methods tend to overfit the narrow distributions of base classes seen during training and encounter two primary challenges: (i) only utilizing uni-modal information to modeling task-specific knowledge; and (ii) using costly and time-consuming methods to supplement knowledge. To address these issues, we propose a Conditional Prototype Rectification Prompt Learning (CPR) method to correct the bias of base examples and augment limited data in an effective way. Specifically, we alleviate overfitting on base classes from two aspects. First, each input image acquires knowledge from both textual and visual prototypes, and then generates sample-conditional text tokens. Second, we extract utilizable knowledge from unlabeled data to further refine the prototypes. These two strategies mitigate biases stemming from base classes, yielding a more effective classifier. Extensive experiments on 11 benchmark datasets show that our CPR achieves state-of-the-art performance on both few-shot classification and base-to-new generalization tasks. Our code is avaliable at \url{https://github.com/chenhaoxing/CPR}.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
On the Necessity of Collaboration in Online Model Selection with Decentralized Data
Authors:
Junfan Li,
Zenglin Xu,
Zheshun Wu,
Irwin King
Abstract:
We consider online model selection with decentralized data over $M$ clients, and study a fundamental problem: the necessity of collaboration. Previous work gave a negative answer from the perspective of worst-case regret minimization, while we give a different answer from the perspective of regret-computational cost trade-off. We separately propose a federated algorithm with and without communicat…
▽ More
We consider online model selection with decentralized data over $M$ clients, and study a fundamental problem: the necessity of collaboration. Previous work gave a negative answer from the perspective of worst-case regret minimization, while we give a different answer from the perspective of regret-computational cost trade-off. We separately propose a federated algorithm with and without communication constraint and prove regret bounds that show (i) collaboration is unnecessary if we do not limit the computational cost on each client; (ii) collaboration is necessary if we limit the computational cost on each client to $o(K)$, where $K$ is the number of candidate hypothesis spaces. As a by-product, we improve the regret bounds of algorithms for distributed online multi-kernel learning at a smaller computational and communication cost. Our algorithms rely on three new techniques, i.e., an improved Bernstein's inequality for martingale, a federated algorithmic framework, named FOMD-No-LU, and decoupling model selection and predictions, which might be of independent interest.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Self-Selected Attention Span for Accelerating Large Language Model Inference
Authors:
Tian Jin,
Wanzin Yazar,
Zifei Xu,
Sayeh Sharify,
Xin Wang
Abstract:
Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate with two specific tasks: (a) evaluat…
▽ More
Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate with two specific tasks: (a) evaluating complex arithmetic expressions and (b) summarizing news articles. For both tasks, we create custom datasets to fine-tune an LLM. The goal of fine-tuning is twofold: first, to make the LLM learn to solve the evaluation or summarization task, and second, to train it to identify the minimal attention spans required for each step of the task. As a result, the fine-tuned model is able to convert these self-identified minimal attention spans into sparse attention masks on-the-fly during inference. We develop a custom CUDA kernel to take advantage of the reduced context to attend to. We demonstrate that using this custom CUDA kernel improves the throughput of LLM inference by 28%. Our work presents an end-to-end demonstration showing that training LLMs to self-select their attention spans speeds up autoregressive inference in solving real-world tasks.
△ Less
Submitted 14 April, 2024;
originally announced April 2024.
-
Stability in Phase Retrieval: Characterizing Condition Numbers and the Optimal Vector Set
Authors:
Yu Xia,
Zhiqiang Xu,
Zili Xu
Abstract:
In this paper, we primarily focus on analyzing the stability property of phase retrieval by examining the bi-Lipschitz property of the map $Φ_{\boldsymbol{A}}(\boldsymbol{x})=|\boldsymbol{A}\boldsymbol{x}|\in \mathbb{R}_+^m$, where $\boldsymbol{x}\in \mathbb{H}^d$ and $\boldsymbol{A}\in \mathbb{H}^{m\times d}$ is the measurement matrix for $\mathbb{H}\in\{\mathbb{R},\mathbb{C}\}$. We define the co…
▽ More
In this paper, we primarily focus on analyzing the stability property of phase retrieval by examining the bi-Lipschitz property of the map $Φ_{\boldsymbol{A}}(\boldsymbol{x})=|\boldsymbol{A}\boldsymbol{x}|\in \mathbb{R}_+^m$, where $\boldsymbol{x}\in \mathbb{H}^d$ and $\boldsymbol{A}\in \mathbb{H}^{m\times d}$ is the measurement matrix for $\mathbb{H}\in\{\mathbb{R},\mathbb{C}\}$. We define the condition number $β_{\boldsymbol{A}}=\frac{U_{\boldsymbol{A}}}{L_{\boldsymbol{A}}}$, where $L_{\boldsymbol{A}}$ and $U_{\boldsymbol{A}}$ represent the optimal lower and upper Lipschitz constants, respectively. We establish the first universal lower bound on $β_{\boldsymbol{A}}$ by demonstrating that for any ${\boldsymbol{A}}\in\mathbb{H}^{m\times d}$, \begin{equation*} β_{\boldsymbol{A}}\geq β_0^{\mathbb{H}}=\begin{cases} \sqrt{\fracπ{π-2}}\,\,\approx\,\, 1.659 & \text{if $\mathbb{H}=\mathbb{R}$,}\\ \sqrt{\frac{4}{4-π}}\,\,\approx\,\, 2.159 & \text{if $\mathbb{H}=\mathbb{C}$.} \end{cases} \end{equation*} We prove that the condition number of a standard Gaussian matrix in $\mathbb{H}^{m\times d}$ asymptotically matches the lower bound $β_0^{\mathbb{H}}$ for both real and complex cases. This result indicates that the constant lower bound $β_0^{\mathbb{H}}$ is asymptotically tight, holding true for both the real and complex scenarios. As an application of this result, we utilize it to investigate the performance of quadratic models for phase retrieval. Lastly, we establish that for any odd integer $m\geq 3$, the harmonic frame $\boldsymbol{A}\in \mathbb{R}^{m\times 2}$ possesses the minimum condition number among all $\boldsymbol{A}\in \mathbb{R}^{m\times 2}$. We are confident that these findings carry substantial implications for enhancing our understanding of phase retrieval.
△ Less
Submitted 16 April, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
Post-hurricane building damage assessment using street-view imagery and structured data: A multi-modal deep learning approach
Authors:
Zhuoqun Xue,
Xiaojian Zhang,
David O. Prevatt,
Jennifer Bridge,
Susu Xu,
Xilei Zhao
Abstract:
Accurately assessing building damage is critical for disaster response and recovery. However, many existing models for detecting building damage have poor prediction accuracy due to their limited capabilities of identifying detailed, comprehensive structural and/or non-structural damage from the street-view image. Additionally, these models mainly rely on the imagery data for damage classification…
▽ More
Accurately assessing building damage is critical for disaster response and recovery. However, many existing models for detecting building damage have poor prediction accuracy due to their limited capabilities of identifying detailed, comprehensive structural and/or non-structural damage from the street-view image. Additionally, these models mainly rely on the imagery data for damage classification, failing to account for other critical information, such as wind speed, building characteristics, evacuation zones, and distance of the building to the hurricane track. To address these limitations, in this study, we propose a novel multi-modal (i.e., imagery and structured data) approach for post-hurricane building damage classification, named the Multi-Modal Swin Transformer (MMST). We empirically train and evaluate the proposed MMST using data collected from the 2022 Hurricane Ian in Florida, USA. Results show that MMST outperforms all selected state-of-the-art benchmark models and can achieve an accuracy of 92.67%, which are 7.71% improvement in accuracy compared to Visual Geometry Group 16 (VGG-16). In addition to the street-view imagery data, building value, building age, and wind speed are the most important predictors for damage level classification. The proposed MMST can be deployed to assist in rapid damage assessment and guide reconnaissance efforts in future hurricanes.
△ Less
Submitted 10 April, 2024;
originally announced April 2024.
-
SparseAD: Sparse Query-Centric Paradigm for Efficient End-to-End Autonomous Driving
Authors:
Diankun Zhang,
Guoan Wang,
Runwen Zhu,
Jianbo Zhao,
Xiwu Chen,
Siyu Zhang,
Jiahao Gong,
Qibin Zhou,
Wenyuan Zhang,
Ningzi Wang,
Feiyang Tan,
Hangning Zhou,
Ziyao Xu,
Haotian Yao,
Chi Zhang,
Xiaojun Liu,
Xiaoguang Di,
Bin Li
Abstract:
End-to-End paradigms use a unified framework to implement multi-tasks in an autonomous driving system. Despite simplicity and clarity, the performance of end-to-end autonomous driving methods on sub-tasks is still far behind the single-task methods. Meanwhile, the widely used dense BEV features in previous end-to-end methods make it costly to extend to more modalities or tasks. In this paper, we p…
▽ More
End-to-End paradigms use a unified framework to implement multi-tasks in an autonomous driving system. Despite simplicity and clarity, the performance of end-to-end autonomous driving methods on sub-tasks is still far behind the single-task methods. Meanwhile, the widely used dense BEV features in previous end-to-end methods make it costly to extend to more modalities or tasks. In this paper, we propose a Sparse query-centric paradigm for end-to-end Autonomous Driving (SparseAD), where the sparse queries completely represent the whole driving scenario across space, time and tasks without any dense BEV representation. Concretely, we design a unified sparse architecture for perception tasks including detection, tracking, and online mapping. Moreover, we revisit motion prediction and planning, and devise a more justifiable motion planner framework. On the challenging nuScenes dataset, SparseAD achieves SOTA full-task performance among end-to-end methods and significantly narrows the performance gap between end-to-end paradigms and single-task methods. Codes will be released soon.
△ Less
Submitted 10 April, 2024;
originally announced April 2024.
-
Multimodal Road Network Generation Based on Large Language Model
Authors:
Jiajing Chen,
Weihang Xu,
Haiming Cao,
Zihuan Xu,
Yu Zhang,
Zhao Zhang,
Siyao Zhang
Abstract:
With the increasing popularity of ChatGPT, large language models (LLMs) have demonstrated their capabilities in communication and reasoning, promising for transportation sector intelligentization. However, they still face challenges in domain-specific knowledge. This paper aims to leverage LLMs' reasoning and recognition abilities to replace traditional user interfaces and create an "intelligent o…
▽ More
With the increasing popularity of ChatGPT, large language models (LLMs) have demonstrated their capabilities in communication and reasoning, promising for transportation sector intelligentization. However, they still face challenges in domain-specific knowledge. This paper aims to leverage LLMs' reasoning and recognition abilities to replace traditional user interfaces and create an "intelligent operating system" for transportation simulation software, exploring their potential with transportation modeling and simulation. We introduce Network Generation AI (NGAI), integrating LLMs with road network modeling plugins, validated through experiments for accuracy and robustness. NGAI's effective use has reduced modeling costs, revolutionized transportation simulations, optimized user steps, and proposed a novel approach for LLM integration in the transportation field.
△ Less
Submitted 9 April, 2024;
originally announced April 2024.
-
Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
Authors:
Dazhong Shen,
Guanglu Song,
Zeyue Xue,
Fu-Yun Wang,
Yu Liu
Abstract:
Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifi…
▽ More
Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees into a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. our codes are available at https://github.com/SmilesDZgk/S-CFG.
△ Less
Submitted 8 April, 2024;
originally announced April 2024.
-
Self-Supervised Multi-Object Tracking with Path Consistency
Authors:
Zijia Lu,
Bing Shuai,
Yanbei Chen,
Zhenlin Xu,
Davide Modolo
Abstract:
In this paper, we propose a novel concept of path consistency to learn robust object matching without using manual object identity supervision. Our key idea is that, to track a object through frames, we can obtain multiple different association results from a model by varying the frames it can observe, i.e., skipping frames in observation. As the differences in observations do not alter the identi…
▽ More
In this paper, we propose a novel concept of path consistency to learn robust object matching without using manual object identity supervision. Our key idea is that, to track a object through frames, we can obtain multiple different association results from a model by varying the frames it can observe, i.e., skipping frames in observation. As the differences in observations do not alter the identities of objects, the obtained association results should be consistent. Based on this rationale, we generate multiple observation paths, each specifying a different set of frames to be skipped, and formulate the Path Consistency Loss that enforces the association results are consistent across different observation paths. We use the proposed loss to train our object matching model with only self-supervision. By extensive experiments on three tracking datasets (MOT17, PersonPath22, KITTI), we demonstrate that our method outperforms existing unsupervised methods with consistent margins on various evaluation metrics, and even achieves performance close to supervised methods.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.