-
Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
Authors:
Shan Chen,
Jack Gallifant,
Mingye Gao,
Pedro Moreira,
Nikolaj Munch,
Ajay Muthukkumar,
Arvind Rajan,
Jaya Kolluri,
Amelia Fiske,
Janna Hastings,
Hugo Aerts,
Brian Anthony,
Leo Anthony Celi,
William G. La Cava,
Danielle S. Bitterman
Abstract:
Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real world knowledge in LLMs, specifically focusing on the representation of disease prevalence…
▽ More
Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora like $ThePile$ influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalences in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods minimally resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: www.crosscare.net.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge
Authors:
Jingguang Tian,
Shuaishuai Ye,
Shunfei Chen,
Yang Xiang,
Zhaohui Yin,
Xinhui Hu,
Xinkang Xu
Abstract:
This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58\% compared to the official baseline on t…
▽ More
This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58\% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93\% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88\% on the track 2 evaluation set.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
Communication-Efficient Collaborative Perception via Information Filling with Codebook
Authors:
Yue Hu,
Juntong Peng,
Sifei Liu,
Junhao Ge,
Si Liu,
Siheng Chen
Abstract:
Collaborative perception empowers each agent to improve its perceptual ability through the exchange of perceptual messages with other agents. It inherently results in a fundamental trade-off between perception ability and communication cost. To address this bottleneck issue, our core idea is to optimize the collaborative messages from two key aspects: representation and selection. The proposed cod…
▽ More
Collaborative perception empowers each agent to improve its perceptual ability through the exchange of perceptual messages with other agents. It inherently results in a fundamental trade-off between perception ability and communication cost. To address this bottleneck issue, our core idea is to optimize the collaborative messages from two key aspects: representation and selection. The proposed codebook-based message representation enables the transmission of integer codes, rather than high-dimensional feature maps. The proposed information-filling-driven message selection optimizes local messages to collectively fill each agent's information demand, preventing information overflow among multiple agents. By integrating these two designs, we propose CodeFilling, a novel communication-efficient collaborative perception system, which significantly advances the perception-communication trade-off and is inclusive to both homogeneous and heterogeneous collaboration settings. We evaluate CodeFilling in both a real-world dataset, DAIR-V2X, and a new simulation dataset, OPV2VH+. Results show that CodeFilling outperforms previous SOTA Where2comm on DAIR-V2X/OPV2VH+ with 1,333/1,206 times lower communication volume. Our code is available at https://github.com/PhyllisH/CodeFilling.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results
Authors:
Yaqi Wu,
Zhihao Fan,
Xiaofeng Chu,
Jimmy S. Ren,
Xiaoming Li,
Zongsheng Yue,
Chongyi Li,
Shangcheng Zhou,
Ruicheng Feng,
Yuekun Dai,
Peiqing Yang,
Chen Change Loy,
Senyan Xu,
Zhijing Sun,
Jiaying Zhu,
Yurui Zhu,
Xueyang Fu,
Zheng-Jun Zha,
Jun Cao,
Cheng Li,
Shu Chen,
Liang Ma,
Shiyang Zhou,
Haijin Zeng,
Kai Feng
, et al. (24 additional authors not shown)
Abstract:
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra…
▽ More
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
Secure Inference for Vertically Partitioned Data Using Multiparty Homomorphic Encryption
Authors:
Shuangyi Chen,
Yue Ju,
Zhongwen Zhu,
Ashish Khisti
Abstract:
We propose a secure inference protocol for a distributed setting involving a single server node and multiple client nodes. We assume that the observed data vector is partitioned across multiple client nodes while the deep learning model is located at the server node. Each client node is required to encrypt its portion of the data vector and transmit the resulting ciphertext to the server node. The…
▽ More
We propose a secure inference protocol for a distributed setting involving a single server node and multiple client nodes. We assume that the observed data vector is partitioned across multiple client nodes while the deep learning model is located at the server node. Each client node is required to encrypt its portion of the data vector and transmit the resulting ciphertext to the server node. The server node is required to collect the ciphertexts and perform inference in the encrypted domain. We demonstrate an application of multi-party homomorphic encryption (MPHE) to satisfy these requirements. We propose a packing scheme, that enables the server to form the ciphertext of the complete data by aggregating the ciphertext of data subsets encrypted using MPHE. While our proposed protocol builds upon prior horizontal federated training protocol~\cite{sav2020poseidon}, we focus on the inference for vertically partitioned data and avoid the transmission of (encrypted) model weights from the server node to the client nodes.
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
Robust Collaborative Perception without External Localization and Clock Devices
Authors:
Zixing Lei,
Zhenyang Ni,
Ruize Han,
Shuo Tang,
Chen Feng,
Siheng Chen,
Yanfeng Wang
Abstract:
A consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal alignment, traditional methods depend on external devices to provide localization and clock signals. However, hardware-generated signals could be vulnerable to noise and…
▽ More
A consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal alignment, traditional methods depend on external devices to provide localization and clock signals. However, hardware-generated signals could be vulnerable to noise and potentially malicious attack, jeopardizing the precision of spatial-temporal alignment. Rather than relying on external hardwares, this work proposes a novel approach: aligning by recognizing the inherent geometric patterns within the perceptual data of various agents. Following this spirit, we propose a robust collaborative perception system that operates independently of external localization and clock devices. The key module of our system,~\emph{FreeAlign}, constructs a salient object graph for each agent based on its detected boxes and uses a graph neural network to identify common subgraphs between agents, leading to accurate relative pose and time. We validate \emph{FreeAlign} on both real-world and simulated datasets. The results show that, the ~\emph{FreeAlign} empowered robust collaborative perception system perform comparably to systems relying on precise localization and clock devices.
△ Less
Submitted 5 May, 2024;
originally announced May 2024.
-
MR-Transformer: Vision Transformer for Total Knee Replacement Prediction Using Magnetic Resonance Imaging
Authors:
Chaojie Zhang,
Shengjia Chen,
Ozkan Cigdem,
Haresh Rengaraj Rajamohan,
Kyunghyun Cho,
Richard Kijowski,
Cem M. Deniz
Abstract:
A transformer-based deep learning model, MR-Transformer, was developed for total knee replacement (TKR) prediction using magnetic resonance imaging (MRI). The model incorporates the ImageNet pre-training and captures three-dimensional (3D) spatial correlation from the MR images. The performance of the proposed model was compared to existing state-of-the-art deep learning models for knee injury dia…
▽ More
A transformer-based deep learning model, MR-Transformer, was developed for total knee replacement (TKR) prediction using magnetic resonance imaging (MRI). The model incorporates the ImageNet pre-training and captures three-dimensional (3D) spatial correlation from the MR images. The performance of the proposed model was compared to existing state-of-the-art deep learning models for knee injury diagnosis using MRI. Knee MR scans of four different tissue contrasts from the Osteoarthritis Initiative and Multicenter Osteoarthritis Study databases were utilized in the study. Experimental results demonstrated the state-of-the-art performance of the proposed model on TKR prediction using MRI.
△ Less
Submitted 4 May, 2024;
originally announced May 2024.
-
Beyond Relevance: Evaluate and Improve Retrievers on Perspective Awareness
Authors:
Xinran Zhao,
Tong Chen,
Sihao Chen,
Hongming Zhang,
Tongshuang Wu
Abstract:
The task of Information Retrieval (IR) requires a system to identify relevant documents based on users' information needs. In real-world scenarios, retrievers are expected to not only rely on the semantic relevance between the documents and the queries but also recognize the nuanced intents or perspectives behind a user query. For example, when asked to verify a claim, a retrieval system is expect…
▽ More
The task of Information Retrieval (IR) requires a system to identify relevant documents based on users' information needs. In real-world scenarios, retrievers are expected to not only rely on the semantic relevance between the documents and the queries but also recognize the nuanced intents or perspectives behind a user query. For example, when asked to verify a claim, a retrieval system is expected to identify evidence from both supporting vs. contradicting perspectives, for the downstream system to make a fair judgment call. In this work, we study whether retrievers can recognize and respond to different perspectives of the queries -- beyond finding relevant documents for a claim, can retrievers distinguish supporting vs. opposing documents? We reform and extend six existing tasks to create a benchmark for retrieval, where we have diverse perspectives described in free-form text, besides root, neutral queries. We show that current retrievers covered in our experiments have limited awareness of subtly different perspectives in queries and can also be biased toward certain perspectives. Motivated by the observation, we further explore the potential to leverage geometric features of retriever representation space to improve the perspective awareness of retrievers in a zero-shot manner. We demonstrate the efficiency and effectiveness of our projection-based methods on the same set of tasks. Further analysis also shows how perspective awareness improves performance on various downstream tasks, with 4.2% higher accuracy on AmbigQA and 29.9% more correlation with designated viewpoints on essay writing, compared to non-perspective-aware baselines.
△ Less
Submitted 4 May, 2024;
originally announced May 2024.
-
How to generate all possible rational Wilf-Zeilberger forms?
Authors:
Shaoshi Chen,
Christoph Koutschan,
Yisen Wang
Abstract:
Wilf-Zeilberger pairs are fundamental in the algorithmic theory of Wilf and Zeilberger for computer-generated proofs of combinatorial identities. Wilf-Zeilberger forms are their high-dimensional generalizations, which can be used for proving and discovering convergence acceleration formulas. This paper presents a structural description of all possible rational such forms, which can be viewed as an…
▽ More
Wilf-Zeilberger pairs are fundamental in the algorithmic theory of Wilf and Zeilberger for computer-generated proofs of combinatorial identities. Wilf-Zeilberger forms are their high-dimensional generalizations, which can be used for proving and discovering convergence acceleration formulas. This paper presents a structural description of all possible rational such forms, which can be viewed as an additive analog of the classical Ore-Sato theorem. Based on this analog, we show a structural decomposition of so-called multivariate hyperarithmetic terms, which extend multivariate hypergeometric terms to the additive setting.
△ Less
Submitted 3 May, 2024;
originally announced May 2024.
-
Efficient and Economic Large Language Model Inference with Attention Offloading
Authors:
Shaoyuan Chen,
Yutong Lin,
Mingxing Zhang,
Yongwei Wu
Abstract:
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention…
▽ More
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases. To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop Lamina, an LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48x-12.1x higher estimated throughput per dollar than homogeneous solutions.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Boosting Communication Efficiency of Federated Learning's Secure Aggregation
Authors:
Niousha Nazemi,
Omid Tavallaie,
Shuaijun Chen,
Albert Y. Zomaya,
Ralph Holz
Abstract:
Federated Learning (FL) is a decentralized machine learning approach where client devices train models locally and send them to a server that performs aggregation to generate a global model. FL is vulnerable to model inversion attacks, where the server can infer sensitive client data from trained models. Google's Secure Aggregation (SecAgg) protocol addresses this data privacy issue by masking eac…
▽ More
Federated Learning (FL) is a decentralized machine learning approach where client devices train models locally and send them to a server that performs aggregation to generate a global model. FL is vulnerable to model inversion attacks, where the server can infer sensitive client data from trained models. Google's Secure Aggregation (SecAgg) protocol addresses this data privacy issue by masking each client's trained model using shared secrets and individual elements generated locally on the client's device. Although SecAgg effectively preserves privacy, it imposes considerable communication and computation overhead, especially as network size increases. Building upon SecAgg, this poster introduces a Communication-Efficient Secure Aggregation (CESA) protocol that substantially reduces this overhead by using only two shared secrets per client to mask the model. We propose our method for stable networks with low delay variation and limited client dropouts. CESA is independent of the data distribution and network size (for higher than 6 nodes), preventing the honest-but-curious server from accessing unmasked models. Our initial evaluation reveals that CESA significantly reduces the communication cost compared to SecAgg.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Differentiable Particles for General-Purpose Deformable Object Manipulation
Authors:
Siwei Chen,
Yiqing Xu,
Cunjun Yu,
Linfeng Li,
David Hsu
Abstract:
Deformable object manipulation is a long-standing challenge in robotics. While existing approaches often focus narrowly on a specific type of object, we seek a general-purpose algorithm, capable of manipulating many different types of objects: beans, rope, cloth, liquid, . . . . One key difficulty is a suitable representation, rich enough to capture object shape, dynamics for manipulation and yet…
▽ More
Deformable object manipulation is a long-standing challenge in robotics. While existing approaches often focus narrowly on a specific type of object, we seek a general-purpose algorithm, capable of manipulating many different types of objects: beans, rope, cloth, liquid, . . . . One key difficulty is a suitable representation, rich enough to capture object shape, dynamics for manipulation and yet simple enough to be acquired effectively from sensor data. Specifically, we propose Differentiable Particles (DiPac), a new algorithm for deformable object manipulation. DiPac represents a deformable object as a set of particles and uses a differentiable particle dynamics simulator to reason about robot manipulation. To find the best manipulation action, DiPac combines learning, planning, and trajectory optimization through differentiable trajectory tree optimization. Differentiable dynamics provides significant benefits and enable DiPac to (i) estimate the dynamics parameters efficiently, thereby narrowing the sim-to-real gap, and (ii) choose the best action by backpropagating the gradient along sampled trajectories. Both simulation and real-robot experiments show promising results. DiPac handles a variety of object types. By combining planning and learning, DiPac outperforms both pure model-based planning methods and pure data-driven learning methods. In addition, DiPac is robust and adapts to changes in dynamics, thereby enabling the transfer of an expert policy from one object to another with different physical properties, e.g., from a rigid rod to a deformable rope.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Causal Evaluation of Language Models
Authors:
Sirui Chen,
Bo Peng,
Meiqi Chen,
Ruiqi Wang,
Mengying Xu,
Xingyu Zeng,
Rui Zhao,
Shengjie Zhao,
Yu Qiao,
Chaochao Lu
Abstract:
Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive ben…
▽ More
Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. First, we propose the CaLM framework, which establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results). This taxonomy defines a broad evaluation design space while systematically selecting criteria and priorities. Second, we compose the CaLM dataset, comprising 126,334 data samples, to provide curated sets of causal targets, adaptations, metrics, and errors, offering extensive coverage for diverse research pursuits. Third, we conduct an extensive evaluation of 28 leading language models on a core set of 92 causal targets, 9 adaptations, 7 metrics, and 12 error types. Fourth, we perform detailed analyses of the evaluation results across various dimensions (e.g., adaptation, scale). Fifth, we present 50 high-level empirical findings across 9 dimensions (e.g., model), providing valuable guidance for future language model development. Finally, we develop a multifaceted platform, including a website, leaderboards, datasets, and toolkits, to support scalable and adaptable assessments. We envision CaLM as an ever-evolving benchmark for the community, systematically updated with new causal targets, adaptations, models, metrics, and error types to reflect ongoing research advancements. Project website is at https://opencausalab.github.io/CaLM.
△ Less
Submitted 1 May, 2024;
originally announced May 2024.
-
MIPI 2024 Challenge on Nighttime Flare Removal: Methods and Results
Authors:
Yuekun Dai,
Dafeng Zhang,
Xiaoming Li,
Zongsheng Yue,
Chongyi Li,
Shangchen Zhou,
Ruicheng Feng,
Peiqing Yang,
Zhezhu Jin,
Guanqun Liu,
Chen Change Loy,
Lize Zhang,
Shuai Liu,
Chaoyu Feng,
Luyang Wang,
Shuan Chen,
Guangqi Shao,
Xiaotao Wang,
Lei Lei,
Qirui Yang,
Qihua Cheng,
Zhiqiang Xu,
Yihao Liu,
Huanjing Yue,
Jingyu Yang
, et al. (38 additional authors not shown)
Abstract:
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra…
▽ More
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Quater-GCN: Enhancing 3D Human Pose Estimation with Orientation and Semi-supervised Training
Authors:
Xingyu Song,
Zhan Li,
Shi Chen,
Kazuyuki Demachi
Abstract:
3D human pose estimation is a vital task in computer vision, involving the prediction of human joint positions from images or videos to reconstruct a skeleton of a human in three-dimensional space. This technology is pivotal in various fields, including animation, security, human-computer interaction, and automotive safety, where it promotes both technological progress and enhanced human well-bein…
▽ More
3D human pose estimation is a vital task in computer vision, involving the prediction of human joint positions from images or videos to reconstruct a skeleton of a human in three-dimensional space. This technology is pivotal in various fields, including animation, security, human-computer interaction, and automotive safety, where it promotes both technological progress and enhanced human well-being. The advent of deep learning significantly advances the performance of 3D pose estimation by incorporating temporal information for predicting the spatial positions of human joints. However, traditional methods often fall short as they primarily focus on the spatial coordinates of joints and overlook the orientation and rotation of the connecting bones, which are crucial for a comprehensive understanding of human pose in 3D space. To address these limitations, we introduce Quater-GCN (Q-GCN), a directed graph convolutional network tailored to enhance pose estimation by orientation. Q-GCN excels by not only capturing the spatial dependencies among node joints through their coordinates but also integrating the dynamic context of bone rotations in 2D space. This approach enables a more sophisticated representation of human poses by also regressing the orientation of each bone in 3D space, moving beyond mere coordinate prediction. Furthermore, we complement our model with a semi-supervised training strategy that leverages unlabeled data, addressing the challenge of limited orientation ground truth data. Through comprehensive evaluations, Q-GCN has demonstrated outstanding performance against current state-of-the-art methods.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Optimal tradeoffs for estimating Pauli observables
Authors:
Sitan Chen,
Weiyuan Gong,
Qi Ye
Abstract:
We revisit the problem of Pauli shadow tomography: given copies of an unknown $n$-qubit quantum state $ρ$, estimate $\text{tr}(Pρ)$ for some set of Pauli operators $P$ to within additive error $ε$. This has been a popular testbed for exploring the advantage of protocols with quantum memory over those without: with enough memory to measure two copies at a time, one can use Bell sampling to estimate…
▽ More
We revisit the problem of Pauli shadow tomography: given copies of an unknown $n$-qubit quantum state $ρ$, estimate $\text{tr}(Pρ)$ for some set of Pauli operators $P$ to within additive error $ε$. This has been a popular testbed for exploring the advantage of protocols with quantum memory over those without: with enough memory to measure two copies at a time, one can use Bell sampling to estimate $|\text{tr}(Pρ)|$ for all $P$ using $O(n/ε^4)$ copies, but with $k\le n$ qubits of memory, $Ω(2^{(n-k)/3})$ copies are needed.
These results leave open several natural questions. How does this picture change in the physically relevant setting where one only needs to estimate a certain subset of Paulis? What is the optimal dependence on $ε$? What is the optimal tradeoff between quantum memory and sample complexity?
We answer all of these questions. For any subset $A$ of Paulis and any family of measurement strategies, we completely characterize the optimal sample complexity, up to $\log |A|$ factors. We show any protocol that makes $\text{poly}(n)$-copy measurements must make $Ω(1/ε^4)$ measurements. For any protocol that makes $\text{poly}(n)$-copy measurements and only has $k < n$ qubits of memory, we show that $\widetildeΘ(\min\{2^n/ε^2, 2^{n-k}/ε^4\})$ copies are necessary and sufficient.
The protocols we propose can also estimate the actual values $\text{tr}(Pρ)$, rather than just their absolute values as in prior work. Additionally, as a byproduct of our techniques, we establish tight bounds for the task of purity testing and show that it exhibits an intriguing phase transition not present in the memory-sample tradeoff for Pauli shadow tomography.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
Learning general Gaussian mixtures with efficient score matching
Authors:
Sitan Chen,
Vasilis Kontonis,
Kulin Shah
Abstract:
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions. We make no separation assumptions on the underlying mixture components: we only require that the covariance matrices have bounded condition number and that the means and covariances lie in a ball of bounded radius. We give an algorithm that draws $d^{\mathrm{poly}(k/\varepsilon)}$ samples from the target mixture, runs in…
▽ More
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions. We make no separation assumptions on the underlying mixture components: we only require that the covariance matrices have bounded condition number and that the means and covariances lie in a ball of bounded radius. We give an algorithm that draws $d^{\mathrm{poly}(k/\varepsilon)}$ samples from the target mixture, runs in sample-polynomial time, and constructs a sampler whose output distribution is $\varepsilon$-far from the unknown mixture in total variation. Prior works for this problem either (i) required exponential runtime in the dimension $d$, (ii) placed strong assumptions on the instance (e.g., spherical covariances or clusterability), or (iii) had doubly exponential dependence on the number of components $k$.
Our approach departs from commonly used techniques for this problem like the method of moments. Instead, we leverage a recently developed reduction, based on diffusion models, from distribution learning to a supervised learning task called score matching. We give an algorithm for the latter by proving a structural result showing that the score function of a Gaussian mixture can be approximated by a piecewise-polynomial function, and there is an efficient algorithm for finding it. To our knowledge, this is the first example of diffusion models achieving a state-of-the-art theoretical guarantee for an unsupervised learning task.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
MileBench: Benchmarking MLLMs in Long Context
Authors:
Dingjie Song,
Shunian Chen,
Guiming Hardy Chen,
Fei Yu,
Xiang Wan,
Benyou Wang
Abstract:
Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on specific task…
▽ More
Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on specific task (e.g time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, revealed that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
Static Application Security Testing (SAST) Tools for Smart Contracts: How Far Are We?
Authors:
Kaixuan Li,
Yue Xue,
Sen Chen,
Han Liu,
Kairan Sun,
Ming Hu,
Haijun Wang,
Yang Liu,
Yixiang Chen
Abstract:
In recent years, the importance of smart contract security has been heightened by the increasing number of attacks against them. To address this issue, a multitude of static application security testing (SAST) tools have been proposed for detecting vulnerabilities in smart contracts. However, objectively comparing these tools to determine their effectiveness remains challenging. Existing studies o…
▽ More
In recent years, the importance of smart contract security has been heightened by the increasing number of attacks against them. To address this issue, a multitude of static application security testing (SAST) tools have been proposed for detecting vulnerabilities in smart contracts. However, objectively comparing these tools to determine their effectiveness remains challenging. Existing studies often fall short due to the taxonomies and benchmarks only covering a coarse and potentially outdated set of vulnerability types, which leads to evaluations that are not entirely comprehensive and may display bias.
In this paper, we fill this gap by proposing an up-to-date and fine-grained taxonomy that includes 45 unique vulnerability types for smart contracts. Taking it as a baseline, we develop an extensive benchmark that covers 40 distinct types and includes a diverse range of code characteristics, vulnerability patterns, and application scenarios. Based on them, we evaluated 8 SAST tools using this benchmark, which comprises 788 smart contract files and 10,394 vulnerabilities. Our results reveal that the existing SAST tools fail to detect around 50% of vulnerabilities in our benchmark and suffer from high false positives, with precision not surpassing 10%. We also discover that by combining the results of multiple tools, the false negative rate can be reduced effectively, at the expense of flagging 36.77 percentage points more functions. Nevertheless, many vulnerabilities, especially those beyond Access Control and Reentrancy vulnerabilities, remain undetected. We finally highlight the valuable insights from our study, hoping to provide guidance on tool development, enhancement, evaluation, and selection for developers, researchers, and practitioners.
△ Less
Submitted 28 April, 2024;
originally announced April 2024.
-
SafePaint: Anti-forensic Image Inpainting with Domain Adaptation
Authors:
Dunyun Chen,
Xin Liao,
Xiaoshuai Wu,
Shiwei Chen
Abstract:
Existing image inpainting methods have achieved remarkable accomplishments in generating visually appealing results, often accompanied by a trend toward creating more intricate structural textures. However, while these models excel at creating more realistic image content, they often leave noticeable traces of tampering, posing a significant threat to security. In this work, we take the anti-foren…
▽ More
Existing image inpainting methods have achieved remarkable accomplishments in generating visually appealing results, often accompanied by a trend toward creating more intricate structural textures. However, while these models excel at creating more realistic image content, they often leave noticeable traces of tampering, posing a significant threat to security. In this work, we take the anti-forensic capabilities into consideration, firstly proposing an end-to-end training framework for anti-forensic image inpainting named SafePaint. Specifically, we innovatively formulated image inpainting as two major tasks: semantically plausible content completion and region-wise optimization. The former is similar to current inpainting methods that aim to restore the missing regions of corrupted images. The latter, through domain adaptation, endeavors to reconcile the discrepancies between the inpainted region and the unaltered area to achieve anti-forensic goals. Through comprehensive theoretical analysis, we validate the effectiveness of domain adaptation for anti-forensic performance. Furthermore, we meticulously crafted a region-wise separated attention (RWSA) module, which not only aligns with our objective of anti-forensics but also enhances the performance of the model. Extensive qualitative and quantitative evaluations show our approach achieves comparable results to existing image inpainting methods while offering anti-forensic capabilities not available in other methods.
△ Less
Submitted 28 April, 2024;
originally announced April 2024.
-
Dynamic Against Dynamic: An Open-set Self-learning Framework
Authors:
Haifeng Yang,
Chuanxing Geng,
Pong C. Yuen,
Songcan Chen
Abstract:
In open-set recognition, existing methods generally learn statically fixed decision boundaries using known classes to reject unknown classes. Though they have achieved promising results, such decision boundaries are evidently insufficient for universal unknown classes in dynamic and open scenarios as they can potentially appear at any position in the feature space. Moreover, these methods just sim…
▽ More
In open-set recognition, existing methods generally learn statically fixed decision boundaries using known classes to reject unknown classes. Though they have achieved promising results, such decision boundaries are evidently insufficient for universal unknown classes in dynamic and open scenarios as they can potentially appear at any position in the feature space. Moreover, these methods just simply reject unknown class samples during testing without any effective utilization for them. In fact, such samples completely can constitute the true instantiated representation of the unknown classes to further enhance the model's performance. To address these issues, this paper proposes a novel dynamic against dynamic idea, i.e., dynamic method against dynamic changing open-set world, where an open-set self-learning (OSSL) framework is correspondingly developed. OSSL starts with a good closed-set classifier trained by known classes and utilizes available test samples for model adaptation during testing, thus gaining the adaptability to changing data distributions. In particular, a novel self-matching module is designed for OSSL, which can achieve the adaptation in automatically identifying known class samples while rejecting unknown class samples which are further utilized to enhance the discriminability of the model as the instantiated representation of unknown classes. Our method establishes new performance milestones respectively in almost all standard and cross-data benchmarks.
△ Less
Submitted 2 May, 2024; v1 submitted 27 April, 2024;
originally announced April 2024.
-
Part-Guided 3D RL for Sim2Real Articulated Object Manipulation
Authors:
Pengwei Xie,
Rui Chen,
Siang Chen,
Yuzhe Qin,
Fanbo Xiang,
Tianyu Sun,
Jing Xu,
Guijin Wang,
Hao Su
Abstract:
Manipulating unseen articulated objects through visual feedback is a critical but challenging task for real robots. Existing learning-based solutions mainly focus on visual affordance learning or other pre-trained visual models to guide manipulation policies, which face challenges for novel instances in real-world scenarios. In this paper, we propose a novel part-guided 3D RL framework, which can…
▽ More
Manipulating unseen articulated objects through visual feedback is a critical but challenging task for real robots. Existing learning-based solutions mainly focus on visual affordance learning or other pre-trained visual models to guide manipulation policies, which face challenges for novel instances in real-world scenarios. In this paper, we propose a novel part-guided 3D RL framework, which can learn to manipulate articulated objects without demonstrations. We combine the strengths of 2D segmentation and 3D RL to improve the efficiency of RL policy training. To improve the stability of the policy on real robots, we design a Frame-consistent Uncertainty-aware Sampling (FUS) strategy to get a condensed and hierarchical 3D representation. In addition, a single versatile RL policy can be trained on multiple articulated object manipulation tasks simultaneously in simulation and shows great generalizability to novel categories and instances. Experimental results demonstrate the effectiveness of our framework in both simulation and real-world settings. Our code is available at https://github.com/THU-VCLab/Part-Guided-3D-RL-for-Sim2Real-Articulated-Object-Manipulation.
△ Less
Submitted 26 April, 2024;
originally announced April 2024.
-
MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition
Authors:
Zheng Lian,
Haiyang Sun,
Licai Sun,
Zhuofan Wen,
Siyuan Zhang,
Shun Chen,
Hao Gu,
Jinming Zhao,
Ziyang Ma,
Xie Chen,
Jiangyan Yi,
Rui Liu,
Kele Xu,
Bin Liu,
Erik Cambria,
Guoying Zhao,
Björn W. Schuller,
Jianhua Tao
Abstract:
Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing dataset size and building more effective architectures. However, due to various reasons (such as complex environments and inaccurate labels), current systems still cannot meet the demands of practical applications. Therefore, w…
▽ More
Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing dataset size and building more effective architectures. However, due to various reasons (such as complex environments and inaccurate labels), current systems still cannot meet the demands of practical applications. Therefore, we plan to organize a series of challenges around emotion recognition to further promote the development of this field. Last year, we launched MER2023, focusing on three topics: multi-label learning, noise robustness, and semi-supervised learning. This year, we continue to organize MER2024. In addition to expanding the dataset size, we introduce a new track around open-vocabulary emotion recognition. The main consideration for this track is that existing datasets often fix the label space and use majority voting to enhance annotator consistency, but this process may limit the model's ability to describe subtle emotions. In this track, we encourage participants to generate any number of labels in any category, aiming to describe the emotional state as accurately as possible. Our baseline is based on MERTools and the code is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024.
△ Less
Submitted 28 April, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing
Authors:
Peizhuang Cong,
Aomufei Yuan,
Shimao Chen,
Yuxuan Tian,
Bowen Ye,
Tong Yang
Abstract:
MoE facilitates the development of large models by making the computational complexity of the model no longer scale linearly with increasing parameters. The learning sparse gating network selects a set of experts for each token to be processed; however, this may lead to differences in the number of tokens processed by each expert over several successive iterations, i.e., the expert load fluctuatio…
▽ More
MoE facilitates the development of large models by making the computational complexity of the model no longer scale linearly with increasing parameters. The learning sparse gating network selects a set of experts for each token to be processed; however, this may lead to differences in the number of tokens processed by each expert over several successive iterations, i.e., the expert load fluctuations, which reduces computational parallelization and resource utilization. To this end, we traced and analyzed loads of each expert in the training iterations for several large language models in this work, and defined the transient state with "obvious load fluctuation" and the stable state with "temporal locality". Moreover, given the characteristics of these two states and the computational overhead, we deployed three classical prediction algorithms that achieve accurate expert load prediction results. For the GPT3 350M model, the average error rates for predicting the expert load proportion over the next 1,000 and 2,000 steps are approximately 1.3% and 1.8%, respectively. This work can provide valuable guidance for expert placement or resource allocation for MoE model training. Based on this work, we will propose an expert placement scheme for transient and stable states in our coming work.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
NTIRE 2024 Quality Assessment of AI-Generated Content Challenge
Authors:
Xiaohong Liu,
Xiongkuo Min,
Guangtao Zhai,
Chunyi Li,
Tengchuan Kou,
Wei Sun,
Haoning Wu,
Yixuan Gao,
Yuqin Cao,
Zicheng Zhang,
Xiele Wu,
Radu Timofte,
Fei Peng,
Huiyuan Fu,
Anlong Ming,
Chuanming Wang,
Huadong Ma,
Shuai He,
Zifei Dou,
Shu Chen,
Huacong Zhang,
Haiyi Xie,
Chengwei Wang,
Baoying Chen,
Jishen Zeng
, et al. (89 additional authors not shown)
Abstract:
This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Conte…
▽ More
This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into the image track and the video track. The image track uses the AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track has a total of 318 registered participants. A total of 1,646 submissions are received in the development phase, and 221 submissions are received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants have registered in the video track. A total of 991 submissions are received in the development phase, and 185 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on AIGC.
△ Less
Submitted 7 May, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
SECO: Secure Inference With Model Splitting Across Multi-Server Hierarchy
Authors:
Shuangyi Chen,
Ashish Khisti
Abstract:
In the context of prediction-as-a-service, concerns about the privacy of the data and the model have been brought up and tackled via secure inference protocols. These protocols are built up by using single or multiple cryptographic tools designed under a variety of different security assumptions.
In this paper, we introduce SECO, a secure inference protocol that enables a user holding an input d…
▽ More
In the context of prediction-as-a-service, concerns about the privacy of the data and the model have been brought up and tackled via secure inference protocols. These protocols are built up by using single or multiple cryptographic tools designed under a variety of different security assumptions.
In this paper, we introduce SECO, a secure inference protocol that enables a user holding an input data vector and multiple server nodes deployed with a split neural network model to collaboratively compute the prediction, without compromising either party's data privacy. We extend prior work on secure inference that requires the entire neural network model to be located on a single server node, to a multi-server hierarchy, where the user communicates to a gateway server node, which in turn communicates to remote server nodes. The inference task is split across the server nodes and must be performed over an encrypted copy of the data vector.
We adopt multiparty homomorphic encryption and multiparty garbled circuit schemes, making the system secure against dishonest majority of semi-honest servers as well as protecting the partial model structure from the user. We evaluate SECO on multiple models, achieving the reduction of computation and communication cost for the user, making the protocol applicable to user's devices with limited resources.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
Entanglement-Based Artificial Topology: Neighboring Remote Network Nodes
Authors:
Si-Yi Chen,
Jessica Illiano,
Angela Sara Cacciapuoti,
Marcello Caleffi
Abstract:
Entanglement is unanimously recognized as the key communication resource of the Quantum Internet. Yet, the possibility of implementing novel network functionalities by exploiting the marvels of entanglement has been poorly investigated so far, by mainly restricting the attention to bipartite entanglement. Conversely, in this paper, we aim at exploiting multipartite entanglement as inter-network re…
▽ More
Entanglement is unanimously recognized as the key communication resource of the Quantum Internet. Yet, the possibility of implementing novel network functionalities by exploiting the marvels of entanglement has been poorly investigated so far, by mainly restricting the attention to bipartite entanglement. Conversely, in this paper, we aim at exploiting multipartite entanglement as inter-network resource. Specifically, we consider the interconnection of different Quantum Local Area Networks (QLANs), and we show that multipartite entanglement allows to dynamically generate an inter-QLAN artificial topology, by means of local operations only, that overcomes the limitations of the physical QLAN topologies. To this aim, we first design the multipartite entangled state to be distributed within each QLAN. Then, we show how such a state can be engineered to: i) interconnect nodes belonging to different QLANs, and ii) dynamically adapt to different inter-QLAN traffic patterns. Our contribution aims at providing the network engineering community with a hands-on guideline towards the concept of artificial topology and artificial neighborhood.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos
Authors:
Zerui Chen,
Shizhe Chen,
Cordelia Schmid,
Ivan Laptev
Abstract:
In this work, we aim to learn a unified vision-based policy for a multi-fingered robot hand to manipulate different objects in diverse poses. Though prior work has demonstrated that human videos can benefit policy learning, performance improvement has been limited by physically implausible trajectories extracted from videos. Moreover, reliance on privileged object information such as ground-truth…
▽ More
In this work, we aim to learn a unified vision-based policy for a multi-fingered robot hand to manipulate different objects in diverse poses. Though prior work has demonstrated that human videos can benefit policy learning, performance improvement has been limited by physically implausible trajectories extracted from videos. Moreover, reliance on privileged object information such as ground-truth object states further limits the applicability in realistic scenarios. To address these limitations, we propose a new framework ViViDex to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory guided rewards to train state-based policies for each video, obtaining both visually natural and physically plausible trajectories from the video. We then rollout successful episodes from state-based policies and train a unified visual policy without using any privileged information. A coordinate transformation method is proposed to significantly boost the performance. We evaluate our method on three dexterous manipulation tasks and demonstrate a large improvement over state-of-the-art algorithms.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
CWF: Consolidating Weak Features in High-quality Mesh Simplification
Authors:
Rui Xu,
Longdu Liu,
Ningna Wang,
Shuangmin Chen,
Shiqing Xin,
Xiaohu Guo,
Zichun Zhong,
Taku Komura,
Wenping Wang,
Changhe Tu
Abstract:
In mesh simplification, common requirements like accuracy, triangle quality, and feature alignment are often considered as a trade-off. Existing algorithms concentrate on just one or a few specific aspects of these requirements. For example, the well-known Quadric Error Metrics (QEM) approach prioritizes accuracy and can preserve strong feature lines/points as well but falls short in ensuring high…
▽ More
In mesh simplification, common requirements like accuracy, triangle quality, and feature alignment are often considered as a trade-off. Existing algorithms concentrate on just one or a few specific aspects of these requirements. For example, the well-known Quadric Error Metrics (QEM) approach prioritizes accuracy and can preserve strong feature lines/points as well but falls short in ensuring high triangle quality and may degrade weak features that are not as distinctive as strong ones. In this paper, we propose a smooth functional that simultaneously considers all of these requirements. The functional comprises a normal anisotropy term and a Centroidal Voronoi Tessellation (CVT) energy term, with the variables being a set of movable points lying on the surface. The former inherits the spirit of QEM but operates in a continuous setting, while the latter encourages even point distribution, allowing various surface metrics. We further introduce a decaying weight to automatically balance the two terms. We selected 100 CAD models from the ABC dataset, along with 21 organic models, to compare the existing mesh simplification algorithms with ours. Experimental results reveal an important observation: the introduction of a decaying weight effectively reduces the conflict between the two terms and enables the alignment of weak features. This distinctive feature sets our approach apart from most existing mesh simplification methods and demonstrates significant potential in shape understanding.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
Minimal Evidence Group Identification for Claim Verification
Authors:
Xiangci Li,
Sihao Chen,
Rajvi Kapadia,
Jessica Ouyang,
Fan Zhang
Abstract:
Claim verification in real-world settings (e.g. against a large collection of candidate evidences retrieved from the web) typically requires identifying and aggregating a complete set of evidence pieces that collectively provide full support to the claim. The problem becomes particularly challenging when there exists distinct sets of evidence that could be used to verify the claim from different p…
▽ More
Claim verification in real-world settings (e.g. against a large collection of candidate evidences retrieved from the web) typically requires identifying and aggregating a complete set of evidence pieces that collectively provide full support to the claim. The problem becomes particularly challenging when there exists distinct sets of evidence that could be used to verify the claim from different perspectives. In this paper, we formally define and study the problem of identifying such minimal evidence groups (MEGs) for claim verification. We show that MEG identification can be reduced from Set Cover problem, based on entailment inference of whether a given evidence group provides full/partial support to a claim. Our proposed approach achieves 18.4% and 34.8% absolute improvements on the WiCE and SciFact datasets over LLM prompting. Finally, we demonstrate the benefits of MEGs in downstream applications such as claim generation.
△ Less
Submitted 23 April, 2024;
originally announced April 2024.
-
FASTTRACK: Fast and Accurate Fact Tracing for LLMs
Authors:
Si Chen,
Feiyang Kang,
Ning Yu,
Ruoxi Jia
Abstract:
Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are me…
▽ More
Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are merely relevant and those that actually provide supportive evidence for the information sought by the query. This limitation often results in suboptimal effectiveness. Moreover, these approaches necessitate the examination of the similarity of individual training points for each query, imposing significant computational demands and creating a substantial barrier for practical applications. This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries and at the same time clusters the training database towards a reduced extent for LLMs to trace facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than 100\% improvement in F1 score over the state-of-the-art methods while being X33 faster than \texttt{TracIn}.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning
Authors:
Wenjin Hou,
Shiming Chen,
Shuhuang Chen,
Ziming Hong,
Yan Wang,
Xuetao Feng,
Salman Khan,
Fahad Shahbaz Khan,
Xinge You
Abstract:
Generative Zero-shot learning (ZSL) learns a generator to synthesize visual samples for unseen classes, which is an effective way to advance ZSL. However, existing generative methods rely on the conditions of Gaussian noise and the predefined semantic prototype, which limit the generator only optimized on specific seen classes rather than characterizing each visual instance, resulting in poor gene…
▽ More
Generative Zero-shot learning (ZSL) learns a generator to synthesize visual samples for unseen classes, which is an effective way to advance ZSL. However, existing generative methods rely on the conditions of Gaussian noise and the predefined semantic prototype, which limit the generator only optimized on specific seen classes rather than characterizing each visual instance, resulting in poor generalizations (\textit{e.g.}, overfitting to seen classes). To address this issue, we propose a novel Visual-Augmented Dynamic Semantic prototype method (termed VADS) to boost the generator to learn accurate semantic-visual mapping by fully exploiting the visual-augmented knowledge into semantic conditions. In detail, VADS consists of two modules: (1) Visual-aware Domain Knowledge Learning module (VDKL) learns the local bias and global prior of the visual features (referred to as domain visual knowledge), which replace pure Gaussian noise to provide richer prior noise information; (2) Vision-Oriented Semantic Updation module (VOSU) updates the semantic prototype according to the visual representations of the samples. Ultimately, we concatenate their output as a dynamic semantic prototype, which serves as the condition of the generator. Extensive experiments demonstrate that our VADS achieves superior CZSL and GZSL performances on three prominent datasets and outperforms other state-of-the-art methods with averaging increases by 6.4\%, 5.9\% and 4.2\% on SUN, CUB and AWA2, respectively.
△ Less
Submitted 23 April, 2024;
originally announced April 2024.
-
Think-Program-reCtify: 3D Situated Reasoning with Large Language Models
Authors:
Qingrong He,
Kejun Lin,
Shizhe Chen,
Anwen Hu,
Qin Jin
Abstract:
This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and generalization ability. Inspired by the recent success of levera…
▽ More
This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and generalization ability. Inspired by the recent success of leveraging large language models (LLMs) for visual reasoning, we propose LLM-TPC, a novel framework that leverages the planning, tool usage, and reflection capabilities of LLMs through a ThinkProgram-reCtify loop. The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules. Finally, the Rectify phase adjusts the plan and code if the program fails to execute. Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method. Our code is publicly available at https://qingrongh.github.io/LLM-TPC/.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?
Authors:
Adrian de Wynter,
Ishaan Watts,
Nektar Ege Altıntoprak,
Tua Wongsangaroonsri,
Minghui Zhang,
Noura Farra,
Lena Baur,
Samantha Claudet,
Pavel Gajdusek,
Can Gören,
Qilong Gu,
Anna Kaminska,
Tomasz Kaminski,
Ruby Kuo,
Akiko Kyuba,
Jongho Lee,
Kartik Mathur,
Petter Merok,
Ivana Milovanović,
Nani Paananen,
Vesa-Matti Paananen,
Anna Pavlenko,
Bruno Pereira Vidal,
Luciano Strika,
Yueh Tsao
, et al. (8 additional authors not shown)
Abstract:
Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end we introduce RTP-LX, a human-transc…
▽ More
Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate seven S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when judging holistically the toxicity of a prompt, and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g. microagressions, bias). We release of this dataset to contribute to further reduce harmful uses of these models and improve their safe deployment.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
Authors:
Eric Brachmann,
Jamie Wynn,
Shuai Chen,
Tommaso Cavallari,
Áron Monszpart,
Daniyar Turmukhambetov,
Victor Adrian Prisacariu
Abstract:
We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of…
▽ More
We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Different from other learning-based reconstruction methods, we do not require pose priors nor sequential inputs, and we optimize efficiently over thousands of images. Our method, ACE0 (ACE Zero), estimates camera poses to an accuracy comparable to feature-based SfM, as demonstrated by novel view synthesis. Project page: https://nianticlabs.github.io/acezero/
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results
Authors:
Xiaoning Liu,
Zongwei Wu,
Ao Li,
Florin-Alexandru Vasluianu,
Yulun Zhang,
Shuhang Gu,
Le Zhang,
Ce Zhu,
Radu Timofte,
Zhi Jin,
Hongjun Wu,
Chenxi Wang,
Haitao Ling,
Yuanhao Cai,
Hao Bian,
Yuxin Zheng,
Jing Lin,
Alan Yuille,
Ben Shao,
Jin Guo,
Tianli Liu,
Mohao Wu,
Yixu Feng,
Shuo Hou,
Haotian Lin
, et al. (87 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlig…
▽ More
This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlighting, extreme darkness, and night scenes. A notable total of 428 participants registered for the challenge, with 22 teams ultimately making valid submissions. This paper meticulously evaluates the state-of-the-art advancements in enhancing low-light images, reflecting the significant progress and creativity in this field.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
Beyond the Edge: An Advanced Exploration of Reinforcement Learning for Mobile Edge Computing, its Applications, and Future Research Trajectories
Authors:
Ning Yang,
Shuo Chen,
Haijun Zhang,
Randall Berry
Abstract:
Mobile Edge Computing (MEC) broadens the scope of computation and storage beyond the central network, incorporating edge nodes close to end devices. This expansion facilitates the implementation of large-scale "connected things" within edge networks. The advent of applications necessitating real-time, high-quality service presents several challenges, such as low latency, high data rate, reliabilit…
▽ More
Mobile Edge Computing (MEC) broadens the scope of computation and storage beyond the central network, incorporating edge nodes close to end devices. This expansion facilitates the implementation of large-scale "connected things" within edge networks. The advent of applications necessitating real-time, high-quality service presents several challenges, such as low latency, high data rate, reliability, efficiency, and security, all of which demand resolution. The incorporation of reinforcement learning (RL) methodologies within MEC networks promotes a deeper understanding of mobile user behaviors and network dynamics, thereby optimizing resource use in computing and communication processes. This paper offers an exhaustive survey of RL applications in MEC networks, initially presenting an overview of RL from its fundamental principles to the latest advanced frameworks. Furthermore, it outlines various RL strategies employed in offloading, caching, and communication within MEC networks. Finally, it explores open issues linked with software and hardware platforms, representation, RL robustness, safe RL, large-scale scheduling, generalization, security, and privacy. The paper proposes specific RL techniques to mitigate these issues and provides insights into their practical applications.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
Beyond MMSE: Rank-1 Subspace Channel Estimator for Massive MIMO Systems
Authors:
Bin Li,
Ziping Wei,
Shaoshi Yang,
Yang Zhang,
Jun Zhang,
Chenglin Zhao,
Sheng Chen
Abstract:
To glean the benefits offered by massive multi-input multi-output (MIMO) systems, channel state information must be accurately acquired. Despite the high accuracy, the computational complexity of classical linear minimum mean squared error (MMSE) estimator becomes prohibitively high in the context of massive MIMO, while the other low-complexity methods degrade the estimation accuracy seriously. In…
▽ More
To glean the benefits offered by massive multi-input multi-output (MIMO) systems, channel state information must be accurately acquired. Despite the high accuracy, the computational complexity of classical linear minimum mean squared error (MMSE) estimator becomes prohibitively high in the context of massive MIMO, while the other low-complexity methods degrade the estimation accuracy seriously. In this paper, we develop a novel rank-1 subspace channel estimator to approximate the maximum likelihood (ML) estimator, which outperforms the linear MMSE estimator, but incurs a surprisingly low computational complexity. Our method first acquires the highly accurate angle-of-arrival (AoA) information via a constructed space-embedding matrix and the rank-1 subspace method. Then, it adopts the post-reception beamforming to acquire the unbiased estimate of channel gains. Furthermore, a fast method is designed to implement our new estimator. Theoretical analysis shows that the extra gain achieved by our method over the linear MMSE estimator grows according to the rule of O($\log_{10}M$), while its computational complexity is linearly scalable to the number of antennas $M$. Numerical simulations also validate the theoretical results. Our new method substantially extends the accuracy-complexity region and constitutes a promising channel estimation solution to the emerging massive MIMO communications.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
SpringGrasp: Synthesizing Compliant, Dexterous Grasps under Shape Uncertainty
Authors:
Sirui Chen,
Jeannette Bohg,
C. Karen Liu
Abstract:
Generating stable and robust grasps on arbitrary objects is critical for dexterous robotic hands, marking a significant step towards advanced dexterous manipulation. Previous studies have mostly focused on improving differentiable grasping metrics with the assumption of precisely known object geometry. However, shape uncertainty is ubiquitous due to noisy and partial shape observations, which intr…
▽ More
Generating stable and robust grasps on arbitrary objects is critical for dexterous robotic hands, marking a significant step towards advanced dexterous manipulation. Previous studies have mostly focused on improving differentiable grasping metrics with the assumption of precisely known object geometry. However, shape uncertainty is ubiquitous due to noisy and partial shape observations, which introduce challenges in grasp planning. We propose, SpringGrasp planner, a planner that considers uncertain observations of the object surface for synthesizing compliant dexterous grasps. A compliant dexterous grasp could minimize the effect of unexpected contact with the object, leading to more stable grasp with shape-uncertain objects. We introduce an analytical and differentiable metric, SpringGrasp metric, that evaluates the dynamic behavior of the entire compliant grasping process. Planning with SpringGrasp planner, our method achieves a grasp success rate of 89% from two viewpoints and 84% from a single viewpoints in experiment with a real robot on 14 common objects. Compared with a force-closure based planner, our method achieves at least 18% higher grasp success rate.
△ Less
Submitted 25 April, 2024; v1 submitted 21 April, 2024;
originally announced April 2024.
-
NeurCADRecon: Neural Representation for Reconstructing CAD Surfaces by Enforcing Zero Gaussian Curvature
Authors:
Qiujie Dong,
Rui Xu,
Pengfei Wang,
Shuangmin Chen,
Shiqing Xin,
Xiaohong Jia,
Wenping Wang,
Changhe Tu
Abstract:
Despite recent advances in reconstructing an organic model with the neural signed distance function (SDF), the high-fidelity reconstruction of a CAD model directly from low-quality unoriented point clouds remains a significant challenge. In this paper, we address this challenge based on the prior observation that the surface of a CAD model is generally composed of piecewise surface patches, each a…
▽ More
Despite recent advances in reconstructing an organic model with the neural signed distance function (SDF), the high-fidelity reconstruction of a CAD model directly from low-quality unoriented point clouds remains a significant challenge. In this paper, we address this challenge based on the prior observation that the surface of a CAD model is generally composed of piecewise surface patches, each approximately developable even around the feature line. Our approach, named NeurCADRecon, is self-supervised, and its loss includes a developability term to encourage the Gaussian curvature toward 0 while ensuring fidelity to the input points. Noticing that the Gaussian curvature is non-zero at tip points, we introduce a double-trough curve to tolerate the existence of these tip points. Furthermore, we develop a dynamic sampling strategy to deal with situations where the given points are incomplete or too sparse. Since our resulting neural SDFs can clearly manifest sharp feature points/lines, one can easily extract the feature-aligned triangle mesh from the SDF and then decompose it into smooth surface patches, greatly reducing the difficulty of recovering the parametric CAD design. A comprehensive comparison with existing state-of-the-art methods shows the significant advantage of our approach in reconstructing faithful CAD shapes.
△ Less
Submitted 20 April, 2024;
originally announced April 2024.
-
Physical Backdoor Attack can Jeopardize Driving with Vision-Large-Language Models
Authors:
Zhenyang Ni,
Rui Ye,
Yuxi Wei,
Zhen Xiang,
Yanfeng Wang,
Siheng Chen
Abstract:
Vision-Large-Language-models(VLMs) have great application prospects in autonomous driving. Despite the ability of VLMs to comprehend and make decisions in complex scenarios, their integration into safety-critical autonomous driving systems poses serious security risks. In this paper, we propose BadVLMDriver, the first backdoor attack against VLMs for autonomous driving that can be launched in prac…
▽ More
Vision-Large-Language-models(VLMs) have great application prospects in autonomous driving. Despite the ability of VLMs to comprehend and make decisions in complex scenarios, their integration into safety-critical autonomous driving systems poses serious security risks. In this paper, we propose BadVLMDriver, the first backdoor attack against VLMs for autonomous driving that can be launched in practice using physical objects. Unlike existing backdoor attacks against VLMs that rely on digital modifications, BadVLMDriver uses common physical items, such as a red balloon, to induce unsafe actions like sudden acceleration, highlighting a significant real-world threat to autonomous vehicle safety. To execute BadVLMDriver, we develop an automated pipeline utilizing natural language instructions to generate backdoor training samples with embedded malicious behaviors. This approach allows for flexible trigger and behavior selection, enhancing the stealth and practicality of the attack in diverse scenarios. We conduct extensive experiments to evaluate BadVLMDriver for two representative VLMs, five different trigger objects, and two types of malicious backdoor behaviors. BadVLMDriver achieves a 92% attack success rate in inducing a sudden acceleration when coming across a pedestrian holding a red balloon. Thus, BadVLMDriver not only demonstrates a critical security risk but also emphasizes the urgent need for developing robust defense mechanisms to protect against such vulnerabilities in autonomous driving technologies.
△ Less
Submitted 22 April, 2024; v1 submitted 19 April, 2024;
originally announced April 2024.
-
Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations
Authors:
Sibei Chen,
Yeye He,
Weiwei Cui,
Ju Fan,
Song Ge,
Haidong Zhang,
Dongmei Zhang,
Surajit Chaudhuri
Abstract:
Spreadsheets are widely recognized as the most popular end-user programming tools, which blend the power of formula-based computation, with an intuitive table-based interface. Today, spreadsheets are used by billions of users to manipulate tables, most of whom are neither database experts nor professional programmers.
Despite the success of spreadsheets, authoring complex formulas remains challe…
▽ More
Spreadsheets are widely recognized as the most popular end-user programming tools, which blend the power of formula-based computation, with an intuitive table-based interface. Today, spreadsheets are used by billions of users to manipulate tables, most of whom are neither database experts nor professional programmers.
Despite the success of spreadsheets, authoring complex formulas remains challenging, as non-technical users need to look up and understand non-trivial formula syntax. To address this pain point, we leverage the observation that there is often an abundance of similar-looking spreadsheets in the same organization, which not only have similar data, but also share similar computation logic encoded as formulas. We develop an Auto-Formula system that can accurately predict formulas that users want to author in a target spreadsheet cell, by learning and adapting formulas that already exist in similar spreadsheets, using contrastive-learning techniques inspired by "similar-face recognition" from compute vision.
Extensive evaluations on over 2K test formulas extracted from real enterprise spreadsheets show the effectiveness of Auto-Formula over alternatives. Our benchmark data is available at https://github.com/microsoft/Auto-Formula to facilitate future research.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
Ethical-Lens: Curbing Malicious Usages of Open-Source Text-to-Image Models
Authors:
Yuzhu Cai,
Sheng Yin,
Yuxi Wei,
Chenxin Xu,
Weibo Mao,
Felix Juefei-Xu,
Siheng Chen,
Yanfeng Wang
Abstract:
The burgeoning landscape of text-to-image models, exemplified by innovations such as Midjourney and DALLE 3, has revolutionized content creation across diverse sectors. However, these advancements bring forth critical ethical concerns, particularly with the misuse of open-source models to generate content that violates societal norms. Addressing this, we introduce Ethical-Lens, a framework designe…
▽ More
The burgeoning landscape of text-to-image models, exemplified by innovations such as Midjourney and DALLE 3, has revolutionized content creation across diverse sectors. However, these advancements bring forth critical ethical concerns, particularly with the misuse of open-source models to generate content that violates societal norms. Addressing this, we introduce Ethical-Lens, a framework designed to facilitate the value-aligned usage of text-to-image tools without necessitating internal model revision. Ethical-Lens ensures value alignment in text-to-image models across toxicity and bias dimensions by refining user commands and rectifying model outputs. Systematic evaluation metrics, combining GPT4-V, HEIM, and FairFace scores, assess alignment capability. Our experiments reveal that Ethical-Lens enhances alignment capabilities to levels comparable with or superior to commercial models like DALLE 3, ensuring user-generated content adheres to ethical standards while maintaining image quality. This study indicates the potential of Ethical-Lens to ensure the sustainable development of open-source text-to-image tools and their beneficial integration into society. Our code is available at https://github.com/yuzhu-cai/Ethical-Lens.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
SIGformer: Sign-aware Graph Transformer for Recommendation
Authors:
Sirui Chen,
Jiawei Chen,
Sheng Zhou,
Bohao Wang,
Shen Han,
Chanfei Su,
Yuqing Yuan,
Can Wang
Abstract:
In recommender systems, most graph-based methods focus on positive user feedback, while overlooking the valuable negative feedback. Integrating both positive and negative feedback to form a signed graph can lead to a more comprehensive understanding of user preferences. However, the existing efforts to incorporate both types of feedback are sparse and face two main limitations: 1) They process pos…
▽ More
In recommender systems, most graph-based methods focus on positive user feedback, while overlooking the valuable negative feedback. Integrating both positive and negative feedback to form a signed graph can lead to a more comprehensive understanding of user preferences. However, the existing efforts to incorporate both types of feedback are sparse and face two main limitations: 1) They process positive and negative feedback separately, which fails to holistically leverage the collaborative information within the signed graph; 2) They rely on MLPs or GNNs for information extraction from negative feedback, which may not be effective.
To overcome these limitations, we introduce SIGformer, a new method that employs the transformer architecture to sign-aware graph-based recommendation. SIGformer incorporates two innovative positional encodings that capture the spectral properties and path patterns of the signed graph, enabling the full exploitation of the entire graph. Our extensive experiments across five real-world datasets demonstrate the superiority of SIGformer over state-of-the-art methods. The code is available at https://github.com/StupidThree/SIGformer.
△ Less
Submitted 6 May, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Fact :Teaching MLLMs with Faithful, Concise and Transferable Rationales
Authors:
Minghe Gao,
Shuang Chen,
Liang Pang,
Yuan Yao,
Jisheng Dang,
Wenqiao Zhang,
Juncheng Li,
Siliang Tang,
Yueting Zhuang,
Tat-Seng Chua
Abstract:
The remarkable performance of Multimodal Large Language Models (MLLMs) has unequivocally demonstrated their proficient understanding capabilities in handling a wide array of visual tasks. Nevertheless, the opaque nature of their black-box reasoning processes persists as an enigma, rendering them uninterpretable and struggling with hallucination. Their ability to execute intricate compositional rea…
▽ More
The remarkable performance of Multimodal Large Language Models (MLLMs) has unequivocally demonstrated their proficient understanding capabilities in handling a wide array of visual tasks. Nevertheless, the opaque nature of their black-box reasoning processes persists as an enigma, rendering them uninterpretable and struggling with hallucination. Their ability to execute intricate compositional reasoning tasks is also constrained, culminating in a stagnation of learning progression for these models. In this work, we introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. This paradigm utilizes verifiable visual programming to generate executable code guaranteeing faithfulness and precision. Subsequently, through a series of operations including pruning, merging, and bridging, the rationale enhances its conciseness. Furthermore, we filter rationales that can be transferred to end-to-end paradigms from programming paradigms to guarantee transferability. Empirical evidence from experiments demonstrates the superiority of our method across models of varying parameter sizes, significantly enhancing their compositional reasoning and generalization ability. Our approach also reduces hallucinations owing to its high correlation between images and text.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing
Authors:
Sherry X. Chen,
Yaron Vaxman,
Elad Ben Baruch,
David Asulin,
Aviad Moreshet,
Kuo-Chin Lien,
Misha Sra,
Pradeep Sen
Abstract:
Despite many attempts to leverage pre-trained text-to-image models (T2I) like Stable Diffusion (SD) for controllable image editing, producing good predictable results remains a challenge. Previous approaches have focused on either fine-tuning pre-trained T2I models on specific datasets to generate certain kinds of images (e.g., with a specific object or person), or on optimizing the weights, text…
▽ More
Despite many attempts to leverage pre-trained text-to-image models (T2I) like Stable Diffusion (SD) for controllable image editing, producing good predictable results remains a challenge. Previous approaches have focused on either fine-tuning pre-trained T2I models on specific datasets to generate certain kinds of images (e.g., with a specific object or person), or on optimizing the weights, text prompts, and/or learning features for each input image in an attempt to coax the image generator to produce the desired result. However, these approaches all have shortcomings and fail to produce good results in a predictable and controllable manner. To address this problem, we present TiNO-Edit, an SD-based method that focuses on optimizing the noise patterns and diffusion timesteps during editing, something previously unexplored in the literature. With this simple change, we are able to generate results that both better align with the original images and reflect the desired result. Furthermore, we propose a set of new loss functions that operate in the latent domain of SD, greatly speeding up the optimization when compared to prior approaches, which operate in the pixel domain. Our method can be easily applied to variations of SD including Textual Inversion and DreamBooth that encode new concepts and incorporate them into the edited results. We present a host of image-editing capabilities enabled by our approach. Our code is publicly available at https://github.com/SherryXTChen/TiNO-Edit.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
Authors:
Hongxin Zhang,
Zeyuan Wang,
Qiushi Lyu,
Zheyuan Zhang,
Sunli Chen,
Tianmin Shu,
Yilun Du,
Chuang Gan
Abstract:
In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions given only partial egocentric visual ob…
▽ More
In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state given partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating the video. By leveraging this compositional world model, in combination with Vision Language Models to infer the actions of other agents, we can use a tree search procedure to integrate these modules and facilitate online cooperative planning. To evaluate the efficacy of our methods, we create two challenging embodied multi-agent long-horizon cooperation tasks using the ThreeDWorld simulator and conduct experiments with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed framework. More videos can be found at https://vis-www.cs.umass.edu/combo/.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience
Authors:
Haixia Han,
Tingyun Li,
Shisong Chen,
Jie Shi,
Chengyu Du,
Yanghua Xiao,
Jiaqing Liang,
Xin Lin
Abstract:
Large Language Models (LLMs) have exhibited remarkable performance across various downstream tasks, but they may generate inaccurate or false information with a confident tone. One of the possible solutions is to empower the LLM confidence expression capability, in which the confidence expressed can be well-aligned with the true probability of the generated answer being correct. However, leveragin…
▽ More
Large Language Models (LLMs) have exhibited remarkable performance across various downstream tasks, but they may generate inaccurate or false information with a confident tone. One of the possible solutions is to empower the LLM confidence expression capability, in which the confidence expressed can be well-aligned with the true probability of the generated answer being correct. However, leveraging the intrinsic ability of LLMs or the signals from the output logits of answers proves challenging in accurately capturing the response uncertainty in LLMs. Therefore, drawing inspiration from cognitive diagnostics, we propose a method of Learning from Past experience (LePe) to enhance the capability for confidence expression. Specifically, we first identify three key problems: (1) How to capture the inherent confidence of the LLM? (2) How to teach the LLM to express confidence? (3) How to evaluate the confidence expression of the LLM? Then we devise three stages in LePe to deal with these problems. Besides, to accurately capture the confidence of an LLM when constructing the training data, we design a complete pipeline including question preparation and answer sampling. We also conduct experiments using the Llama family of LLMs to verify the effectiveness of our proposed method on four datasets.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Robot Positioning Using Torus Packing for Multisets
Authors:
Chung Shue Chen,
Peter Keevash,
Sean Kennedy,
Élie de Panafieu,
Adrian Vetta
Abstract:
We consider the design of a positioning system where a robot determines its position from local observations. This is a well-studied problem of considerable practical importance and mathematical interest. The dominant paradigm derives from the classical theory of de Bruijn sequences, where the robot has access to a window within a larger code and can determine its position if these windows are dis…
▽ More
We consider the design of a positioning system where a robot determines its position from local observations. This is a well-studied problem of considerable practical importance and mathematical interest. The dominant paradigm derives from the classical theory of de Bruijn sequences, where the robot has access to a window within a larger code and can determine its position if these windows are distinct. We propose an alternative model in which the robot has more limited observational powers, which we argue is more realistic in terms of engineering: the robot does not have access to the full pattern of colours (or letters) in the window, but only to the intensity of each colour (or the number of occurrences of each letter). This leads to a mathematically interesting problem with a different flavour to that arising in the classical paradigm, requiring new construction techniques. The parameters of our construction are optimal up to a constant factor, and computing the position requires only a constant number of arithmetic operations.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
ChatShop: Interactive Information Seeking with Language Agents
Authors:
Sanxing Chen,
Sam Wiseman,
Bhuwan Dhingra
Abstract:
The desire and ability to seek new information strategically are fundamental to human learning but often overlooked in current language agent development. Using a web shopping task as an example, we show that it can be reformulated and solved as a retrieval task without a requirement of interactive information seeking. We then redesign the task to introduce a new role of shopper, serving as a real…
▽ More
The desire and ability to seek new information strategically are fundamental to human learning but often overlooked in current language agent development. Using a web shopping task as an example, we show that it can be reformulated and solved as a retrieval task without a requirement of interactive information seeking. We then redesign the task to introduce a new role of shopper, serving as a realistically constrained communication channel. The agents in our proposed ChatShop task explore user preferences in open-ended conversation to make informed decisions. Our experiments demonstrate that the proposed task can effectively evaluate the agent's ability to explore and gradually accumulate information through multi-turn interaction. We also show that LLM-simulated shoppers serve as a good proxy to real human shoppers and discover similar error patterns of agents.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.