Search | arXiv e-print repository

Unifying 3D Vision-Language Understanding via Promptable Queries

Authors: Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li

Abstract: A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified m… ▽ More A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by 1.8% (AP), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: Project page: https://pq3d.github.io

arXiv:2404.19738 [pdf, other]

DiaryHelper: Exploring the Use of an Automatic Contextual Information Recording Agent for Elicitation Diary Study

Authors: Junze Li, Changyang He, Jiaxiong Hu, Boyang Jia, Alon Halevy, Xiaojuan Ma

Abstract: Elicitation diary studies, a type of qualitative, longitudinal research method, involve participants to self-report aspects of events of interest at their occurrences as memory cues for providing details and insights during post-study interviews. However, due to time constraints and lack of motivation, participants' diary entries may be vague or incomplete, impairing their later recall. To address… ▽ More Elicitation diary studies, a type of qualitative, longitudinal research method, involve participants to self-report aspects of events of interest at their occurrences as memory cues for providing details and insights during post-study interviews. However, due to time constraints and lack of motivation, participants' diary entries may be vague or incomplete, impairing their later recall. To address this challenge, we designed an automatic contextual information recording agent, DiaryHelper, based on the theory of episodic memory. DiaryHelper can predict five dimensions of contextual information and confirm with participants. We evaluated the use of DiaryHelper in both the recording period and the elicitation interview through a within-subject study (N=12) over a period of two weeks. Our results demonstrated that DiaryHelper can assist participants in capturing abundant and accurate contextual information without significant burden, leading to a more detailed recall of recorded events and providing greater insights. △ Less

Submitted 30 April, 2024; originally announced April 2024.

Comments: CHI 2024

arXiv:2404.19021 [pdf]

Enhancing Autonomous Vehicle Design and Testing: A Comprehensive Review of AR and VR Integration

Authors: Emanuella Ejichukwu, Lauren Tong, Gadir Hazime, Bochen Jia

Abstract: This comprehensive literature review explores the potential of Augmented Reality and Virtual Reality technologies to enhance the design and testing of autonomous vehicles. By analyzing existing research, the review aims to identify how AR and VR can be leveraged to improve various aspects of autonomous vehicle development, including: creating more realistic and comprehensive testing environments,… ▽ More This comprehensive literature review explores the potential of Augmented Reality and Virtual Reality technologies to enhance the design and testing of autonomous vehicles. By analyzing existing research, the review aims to identify how AR and VR can be leveraged to improve various aspects of autonomous vehicle development, including: creating more realistic and comprehensive testing environments, facilitating the design of user centered interfaces, and safely evaluating driver behavior in complex scenarios. Ultimately, the review highlights AR and VR utilization as a key driver in the development of adaptable testing environments, fostering more dependable autonomous vehicle technology, and ultimately propelling significant advancements within the field. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.18989 [pdf]

Cyberbully and Online Harassment: Issues Associated with Digital Wellbeing

Authors: Manasi Kulkarni, Siddhi Durve, Bochen Jia

Abstract: As digital technology becomes increasingly embedded in daily life, its impact on social interactions has become a critical area of study, particularly concerning cyberbullying. This meta-analysis investigates the dual role of technology in cyberbullying both as a catalyst that can exacerbate the issue and as a potential solution. Cyberbullying, characterized by the use of digital platforms to hara… ▽ More As digital technology becomes increasingly embedded in daily life, its impact on social interactions has become a critical area of study, particularly concerning cyberbullying. This meta-analysis investigates the dual role of technology in cyberbullying both as a catalyst that can exacerbate the issue and as a potential solution. Cyberbullying, characterized by the use of digital platforms to harass, threaten, or humiliate individuals, poses significant challenges to mental and social wellbeing. This research synthesizes empirical findings from diverse studies to evaluate how innovative technological interventions, such as content monitoring algorithms, anonymous reporting systems, and educational initiatives integrated within digital platforms, contribute to reducing the prevalence of cyberbullying. The study focuses on the effectiveness of these interventions in various settings, highlighting the need for adaptive strategies that respond to the dynamic digital landscape. By offering a comprehensive overview of the current state of cyberbullying and the efficacy of technology based solutions, this analysis provides valuable insights for stakeholders, including educators, policymakers, and technology developers, aiming to enhance digital wellbeing and create safer online environments. The findings underscore the importance of leveraging technology not only as a medium of communication but also as a strategic tool to combat the negative impacts of cyberbullying, thus promoting a more inclusive and respectful digital world. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: 35 pages, 7 figures

ACM Class: J.4

arXiv:2404.18695 [pdf, other]

Dual-Modal Prompting for Sketch-Based Image Retrieval

Authors: Liying Gao, Bingliang Jiao, Peng Wang, Shizhou Zhang, Hanwang Zhang, Yanning Zhang

Abstract: Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category… ▽ More Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category and fine-grained recognition task from the generalization perspective may be inadequate since the knowledge accumulated from limited seen categories might not be fully valuable or transferable to unseen target categories. Inspired by this, in this work, we propose a dual-modal prompting CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. Specifically, to facilitate the adaptation of our DP-CLIP toward unpredictable target categories, we employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. By integrating the generated guidance, DP-CLIP could gain valuable category-centric insights, efficiently adapting to novel categories and capturing unique discriminative clues for effective retrieval within each target category. With these designs, our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. Meanwhile, in the other two category-level zero-shot SBIR benchmarks, our method also achieves promising performance. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such… ▽ More This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/. △ Less

Submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

arXiv:2404.10220 [pdf, other]

Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

Authors: Peiyuan Zhi, Zhiyuan Zhang, Muzhi Han, Zeyu Zhang, Zhitian Li, Ziyuan Jiao, Baoxiong Jia, Siyuan Huang

Abstract: Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. We present COME-robot, the first closed-loop framework utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. We meticulously construct a library of action primitives for robot exploration, navigation, a… ▽ More Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. We present COME-robot, the first closed-loop framework utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. We meticulously construct a library of action primitives for robot exploration, navigation, and manipulation, serving as callable execution modules for GPT-4V in task planning. On top of these modules, GPT-4V serves as the brain that can accomplish multimodal reasoning, generate action policy with code, verify the task progress, and provide feedback for replanning. Such design enables COME-robot to (i) actively perceive the environments, (ii) perform situated reasoning, and (iii) recover from failures. Through comprehensive experiments involving 8 challenging real-world tabletop and manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~25%) compared to state-of-the-art baseline methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning. △ Less

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.09465 [pdf, other]

PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI

Authors: Yandan Yang, Baoxiong Jia, Peiyuan Zhi, Siyuan Huang

Abstract: With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we int… ▽ More With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: http://physcene.github.io. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024, 18 pages

arXiv:2404.01576 [pdf, other]

Leveraging Digital Perceptual Technologies for Remote Perception and Analysis of Human Biomechanical Processes: A Contactless Approach for Workload and Joint Force Assessment

Authors: Jesudara Omidokun, Darlington Egeonu, Bochen Jia, Liang Yang

Abstract: This study presents an innovative computer vision framework designed to analyze human movements in industrial settings, aiming to enhance biomechanical analysis by integrating seamlessly with existing software. Through a combination of advanced imaging and modeling techniques, the framework allows for comprehensive scrutiny of human motion, providing valuable insights into kinematic patterns and k… ▽ More This study presents an innovative computer vision framework designed to analyze human movements in industrial settings, aiming to enhance biomechanical analysis by integrating seamlessly with existing software. Through a combination of advanced imaging and modeling techniques, the framework allows for comprehensive scrutiny of human motion, providing valuable insights into kinematic patterns and kinetic data. Utilizing Convolutional Neural Networks (CNNs), Direct Linear Transform (DLT), and Long Short-Term Memory (LSTM) networks, the methodology accurately detects key body points, reconstructs 3D landmarks, and generates detailed 3D body meshes. Extensive evaluations across various movements validate the framework's effectiveness, demonstrating comparable results to traditional marker-based models with minor differences in joint angle estimations and precise estimations of weight and height. Statistical analyses consistently support the framework's reliability, with joint angle estimations showing less than a 5-degree difference for hip flexion, elbow flexion, and knee angle methods. Additionally, weight estimation exhibits an average error of less than 6 % for weight and less than 2 % for height when compared to ground-truth values from 10 subjects. The integration of the Biomech-57 landmark skeleton template further enhances the robustness and reinforces the framework's credibility. This framework shows significant promise for meticulous biomechanical analysis in industrial contexts, eliminating the need for cumbersome markers and extending its utility to diverse research domains, including the study of specific exoskeleton devices' impact on facilitating the prompt return of injured workers to their tasks. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.20025 [pdf, ps, other]

Secure Full-Duplex Communication via Movable Antennas

Authors: Jingze Ding, Zijian Zhou, Chenbo Wang, Wenyao Li, Lifeng Lin, Bingli Jiao

Abstract: This paper investigates physical layer security (PLS) for a movable antenna (MA)-assisted full-duplex (FD) system. In this system, an FD base station (BS) with multiple MAs for transmission and reception provides services for an uplink (UL) user and a downlink (DL) user. Each user operates in half-duplex (HD) mode and is equipped with a single fixed-position antenna (FPA), in the presence of a sin… ▽ More This paper investigates physical layer security (PLS) for a movable antenna (MA)-assisted full-duplex (FD) system. In this system, an FD base station (BS) with multiple MAs for transmission and reception provides services for an uplink (UL) user and a downlink (DL) user. Each user operates in half-duplex (HD) mode and is equipped with a single fixed-position antenna (FPA), in the presence of a single-FPA eavesdropper (Eve). To ensure secure communication, artificial noise (AN) is transmitted to obstruct the interception of Eve. The objective of this paper is to maximize the sum secrecy rate (SSR) of the UL and DL users by jointly optimizing the beamformers of the BS and the positions of MAs. This paper also proposes an alternating optimization (AO) method to address the non-convex problem, which decomposes the optimization problem into three subproblems and solves them iteratively. Simulation results demonstrate a significant performance gain in the SSR achieved by the proposed scheme compared to the benchmark schemes. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: This paper has been submitted for possible publication

arXiv:2403.18036 [pdf, other]

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Authors: Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, Siyuan Huang

Abstract: Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcit… ▽ More Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality, language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: CVPR 2024; 16 pages

arXiv:2403.01164 [pdf, other]

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

Authors: Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You

Abstract: In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly larger model size, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory inference but often suffer from efficiency due to I/O bottlenecks. To achieve low-latency LLMs inference on resource-constrained devices, we introduce HeteGen, a nov… ▽ More In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly larger model size, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory inference but often suffer from efficiency due to I/O bottlenecks. To achieve low-latency LLMs inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by over 317% at most. △ Less

Submitted 2 March, 2024; originally announced March 2024.

Comments: MLSys 2024

arXiv:2401.17049 [pdf, ps, other]

Movable Antenna-Enabled Co-Frequency Co-Time Full-Duplex Wireless Communication

Authors: Jingze Ding, Zijian Zhou, Wenyao Li, Chenbo Wang, Lifeng Lin, Bingli Jiao

Abstract: Movable antenna (MA) provides an innovative way to arrange antennas that can contribute to improved signal quality and more effective interference management. This method is especially beneficial for co-frequency co-time full-duplex (CCFD) wireless communication, which struggles with self-interference (SI) that usually overpowers the desired incoming signals. By dynamically repositioning transmit/… ▽ More Movable antenna (MA) provides an innovative way to arrange antennas that can contribute to improved signal quality and more effective interference management. This method is especially beneficial for co-frequency co-time full-duplex (CCFD) wireless communication, which struggles with self-interference (SI) that usually overpowers the desired incoming signals. By dynamically repositioning transmit/receive antennas, we can mitigate the SI and enhance the reception of incoming signals. Thus, this paper proposes a novel MA-enabled point-to-point CCFD system and formulates the minimum achievable rate of two CCFD terminals. To maximize the minimum achievable rate and determine the near-optimal positions of the MAs, we introduce a solution based on projected particle swarm optimization (PPSO), which can circumvent common suboptimal positioning issues. Moreover, numerical results reveal that the PPSO method leads to a better performance compared to the conventional alternating position optimization (APO). The results also demonstrate that an MA-enabled CCFD system outperforms the one using fixed-position antennas (FPAs). △ Less

Submitted 7 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: This paper has been submitted to IEEE Wireless Communications Letters

arXiv:2401.10652 [pdf, other]

AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference

Authors: Xuanlei Zhao, Shenggan Cheng, Guangyang Lu, Jiarui Fang, Haotian Zhou, Bin Jia, Ziming Liu, Yang You

Abstract: Large deep learning models have achieved impressive performance across a range of applications. However, their large memory requirements, including parameter memory and activation memory, have become a significant challenge for their practical serving. While existing methods mainly address parameter memory, the importance of activation memory has been overlooked. Especially for long input sequence… ▽ More Large deep learning models have achieved impressive performance across a range of applications. However, their large memory requirements, including parameter memory and activation memory, have become a significant challenge for their practical serving. While existing methods mainly address parameter memory, the importance of activation memory has been overlooked. Especially for long input sequences, activation memory is expected to experience a significant exponential growth as the length of sequences increases. In this approach, we propose AutoChunk, an automatic and adaptive compiler system that efficiently reduces activation memory for long sequence inference by chunk strategies. The proposed system generates chunk plans by optimizing through multiple stages. In each stage, the chunk search pass explores all possible chunk candidates and the chunk selection pass identifies the optimal one. At runtime, AutoChunk employs code generation to automatically apply chunk strategies. The experiments demonstrate that AutoChunk can reduce over 80\% of activation memory while maintaining speed loss within 10%, extend max sequence length by 3.2x to 11.7x, and outperform state-of-the-art methods by a large margin. △ Less

Submitted 2 March, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: ICLR 2024

arXiv:2401.09340 [pdf, other]

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Authors: Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

Abstract: 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and int… ▽ More 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io. △ Less

Submitted 6 March, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: 21 pages

arXiv:2312.15993 [pdf]

Adaptive Kalman-based hybrid car following strategy using TD3 and CACC

Authors: Yuqi Zheng, Ruidong Yan, Bin Jia, Rui Jiang, Adriana TAPUS, Xiaojing Chen, Shiteng Zheng, Ying Shang

Abstract: In autonomous driving, the hybrid strategy of deep reinforcement learning and cooperative adaptive cruise control (CACC) can fully utilize the advantages of the two algorithms and significantly improve the performance of car following. However, it is challenging for the traditional hybrid strategy based on fixed coefficients to adapt to mixed traffic flow scenarios, which may decrease the performa… ▽ More In autonomous driving, the hybrid strategy of deep reinforcement learning and cooperative adaptive cruise control (CACC) can fully utilize the advantages of the two algorithms and significantly improve the performance of car following. However, it is challenging for the traditional hybrid strategy based on fixed coefficients to adapt to mixed traffic flow scenarios, which may decrease the performance and even lead to accidents. To address the above problems, a hybrid car following strategy based on an adaptive Kalman Filter is proposed by regarding CACC and Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithms. Different from traditional hybrid strategy based on fixed coefficients, the Kalman gain H, using as an adaptive coefficient, is derived from multi-timestep predictions and Monte Carlo Tree Search. At the end of study, simulation results with 4157745 timesteps indicate that, compared with the TD3 and HCFS algorithms, the proposed algorithm in this study can substantially enhance the safety of car following in mixed traffic flow without compromising the comfort and efficiency. △ Less

Submitted 26 December, 2023; originally announced December 2023.

Comments: 32pages,13figures

arXiv:2311.12871 [pdf, other]

An Embodied Generalist Agent in 3D World

Authors: Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

Abstract: Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently def… ▽ More Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in 3D world, e.g., 3D grounding, embodied reasoning and acting. We argue these limitations significantly hinder current models from performing real-world tasks and approaching general intelligence. To this end, we introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. LEO is trained with a unified task interface, model architecture, and objective in two stages: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning. We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world. Moreover, we meticulously design an LLM-assisted pipeline to produce high-quality 3D VL data. Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation. Our ablative studies and scaling analyses further provide valuable insights for developing future embodied generalist agents. Code and data are available on project page. △ Less

Submitted 9 May, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

Comments: ICML 2024. The first four authors contribute equally. Project page: https://embodied-generalist.github.io

arXiv:2311.00556 [pdf, other]

ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab

Authors: Jieming Cui, Ziren Gong, Baoxiong Jia, Siyuan Huang, Zilong Zheng, Jianzhu Ma, Yixin Zhu

Abstract: The challenge of replicating research results has posed a significant impediment to the field of molecular biology. The advent of modern intelligent systems has led to notable progress in various domains. Consequently, we embarked on an investigation of intelligent monitoring systems as a means of tackling the issue of the reproducibility crisis. Specifically, we first curate a comprehensive multi… ▽ More The challenge of replicating research results has posed a significant impediment to the field of molecular biology. The advent of modern intelligent systems has led to notable progress in various domains. Consequently, we embarked on an investigation of intelligent monitoring systems as a means of tackling the issue of the reproducibility crisis. Specifically, we first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. This dataset comprises fine-grained hierarchical annotations intended for the purpose of studying activity understanding in BioLab. Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings. Finally, we provide a thorough experimental evaluation of contemporary video understanding models and highlight their limitations in this specialized domain to identify potential avenues for future research. We hope ProBio with associated benchmarks may garner increased focus on modern AI techniques in the realm of molecular biology. △ Less

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2309.08919 [pdf, other]

Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution

Authors: Wenyu Zhang, Xin Deng, Baojun Jia, Xingtong Yu, Yifan Chen, jin Ma, Qing Ding, Xinming Zhang

Abstract: Current Scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we pro… ▽ More Current Scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves 2-3 orders of magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss ($\mathcal{L}_{lca}$) to enhance the model's perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7\% and 2.6\%, respectively, increasing the performance from 52.6\% and 53.7\% to 53.3\% and 56.3\%. The code is available at https://github.com/wenyu1009/RTSRN. △ Less

Submitted 16 September, 2023; originally announced September 2023.

Journal ref: ACM Multimedia 2023

arXiv:2309.08457 [pdf, other]

Sim-to-Real Brush Manipulation using Behavior Cloning and Reinforcement Learning

Authors: Biao Jia, Dinesh Manocha

Abstract: Developing proficient brush manipulation capabilities in real-world scenarios is a complex and challenging endeavor, with wide-ranging applications in fields such as art, robotics, and digital design. In this study, we introduce an approach designed to bridge the gap between simulated environments and real-world brush manipulation. Our framework leverages behavior cloning and reinforcement learnin… ▽ More Developing proficient brush manipulation capabilities in real-world scenarios is a complex and challenging endeavor, with wide-ranging applications in fields such as art, robotics, and digital design. In this study, we introduce an approach designed to bridge the gap between simulated environments and real-world brush manipulation. Our framework leverages behavior cloning and reinforcement learning to train a painting agent, seamlessly integrating it into both virtual and real-world environments. Additionally, we employ a real painting environment featuring a robotic arm and brush, mirroring the MyPaint virtual environment. Our results underscore the agent's effectiveness in acquiring policies for high-dimensional continuous action spaces, facilitating the smooth transfer of brush manipulation techniques from simulation to practical, real-world applications. △ Less

Submitted 15 September, 2023; originally announced September 2023.

arXiv:2308.16149 [pdf, other]

Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models

Authors: Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock , et al. (7 additional authors not shown)

Abstract: We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning… ▽ More We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model -- the foundation Jais model, and an instruction-tuned Jais-chat variant -- with the aim of promoting research on Arabic LLMs. Available at https://huggingface.co/inception-mbzuai/jais-13b-chat △ Less

Submitted 29 September, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

Comments: Arabic-centric, foundation model, large-language model, LLM, generative model, instruction-tuned, Jais, Jais-chat

MSC Class: 68T50 ACM Class: F.2.2; I.2.7

arXiv:2308.14279 [pdf]

Sampling unknown large networks restricted by low sampling rates

Authors: Bo Jiao

Abstract: Graph sampling plays an important role in data mining for large networks. Specifically, larger networks often correspond to lower sampling rates. Under the situation, traditional traversal-based samplings for large networks usually have an excessive preference for densely-connected network core nodes. Aim at this issue, this paper proposes a sampling method for unknown networks at low sampling rat… ▽ More Graph sampling plays an important role in data mining for large networks. Specifically, larger networks often correspond to lower sampling rates. Under the situation, traditional traversal-based samplings for large networks usually have an excessive preference for densely-connected network core nodes. Aim at this issue, this paper proposes a sampling method for unknown networks at low sampling rates, called SLSR, which first adopts a random node sampling to evaluate a degree threshold, utilized to distinguish the core from periphery, and the average degree in unknown networks, and then runs a double-layer sampling strategy on the core and periphery. SLSR is simple that results in a high time efficiency, but experimental evaluation confirms that the proposed method can accurately preserve many critical structures of unknown large networks with low sampling rates and low variances. △ Less

Submitted 4 January, 2024; v1 submitted 27 August, 2023; originally announced August 2023.

Comments: 25 pages,11 figures

arXiv:2308.12729 [pdf, other]

Out of the Box Thinking: Improving Customer Lifetime Value Modelling via Expert Routing and Game Whale Detection

Authors: Shijie Zhang, Xin Yan, Xuejiao Yang, Binfeng Jia, Shuangyang Wang

Abstract: Customer lifetime value (LTV) prediction is essential for mobile game publishers trying to optimize the advertising investment for each user acquisition based on the estimated worth. In mobile games, deploying microtransactions is a simple yet effective monetization strategy, which attracts a tiny group of game whales who splurge on in-game purchases. The presence of such game whales may impede th… ▽ More Customer lifetime value (LTV) prediction is essential for mobile game publishers trying to optimize the advertising investment for each user acquisition based on the estimated worth. In mobile games, deploying microtransactions is a simple yet effective monetization strategy, which attracts a tiny group of game whales who splurge on in-game purchases. The presence of such game whales may impede the practicality of existing LTV prediction models, since game whales' purchase behaviours always exhibit varied distribution from general users. Consequently, identifying game whales can open up new opportunities to improve the accuracy of LTV prediction models. However, little attention has been paid to applying game whale detection in LTV prediction, and existing works are mainly specialized for the long-term LTV prediction with the assumption that the high-quality user features are available, which is not applicable in the UA stage. In this paper, we propose ExpLTV, a novel multi-task framework to perform LTV prediction and game whale detection in a unified way. In ExpLTV, we first innovatively design a deep neural network-based game whale detector that can not only infer the intrinsic order in accordance with monetary value, but also precisely identify high spenders (i.e., game whales) and low spenders. Then, by treating the game whale detector as a gating network to decide the different mixture patterns of LTV experts assembling, we can thoroughly leverage the shared information and scenario-specific information (i.e., game whales modelling and low spenders modelling). Finally, instead of separately designing a purchase rate estimator for two tasks, we design a shared estimator that can preserve the inner task relationships. The superiority of ExpLTV is further validated via extensive experiments on three industrial datasets. △ Less

Submitted 24 August, 2023; originally announced August 2023.

arXiv:2308.11181

Temporal Interaction -- Bridging Time and Experience in Human-Computer Interaction

Authors: Li He, Baixi Jiao, Yuxi Liu

Abstract: Traditional static user interfaces (UI) have given way to dynamic systems that can intelligently adapt to and respond to users' changing needs. Temporal interaction is an emerging field in human-computer interaction (HCI), which refers to the study and design of UI that are capable of adapting and responding to the user's changing behavioral and emotional states. By comprehending and incorporating… ▽ More Traditional static user interfaces (UI) have given way to dynamic systems that can intelligently adapt to and respond to users' changing needs. Temporal interaction is an emerging field in human-computer interaction (HCI), which refers to the study and design of UI that are capable of adapting and responding to the user's changing behavioral and emotional states. By comprehending and incorporating the temporal component of user interactions, it focuses on developing dynamic and individualized user experiences. This idea places a strong emphasis on the value of adjusting to user behavior and emotions in order to create a more unique and interesting user experience. The potential of temporal interaction to alter user interface design is highlighted by this paper's examination of its capacity to adjust to user behavior and react to emotional states. Designers can create interfaces that respond to the changing demands, emotions, and behaviors of users by utilizing temporal interactions. This produces interfaces that are not only highly functional but also form an emotional connection with the users. △ Less

Submitted 15 November, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: The content is not particularly relevant to the research

arXiv:2308.10441 [pdf, other]

X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events

Authors: Bo Dai, Linge Wang, Baoxiong Jia, Zeyu Zhang, Song-Chun Zhu, Chi Zhang, Yixin Zhu

Abstract: Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy. Nonetheless, replicating this level of intuitive physics in artificial intelligence (AI) remains a formidable challenge. This study introduces X-VoE, a comprehensive benchmark dataset, to assess AI agents' grasp of intuitive physics. Built on the development… ▽ More Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy. Nonetheless, replicating this level of intuitive physics in artificial intelligence (AI) remains a formidable challenge. This study introduces X-VoE, a comprehensive benchmark dataset, to assess AI agents' grasp of intuitive physics. Built on the developmental psychology-rooted Violation of Expectation (VoE) paradigm, X-VoE establishes a higher bar for the explanatory capacities of intuitive physics models. Each VoE scenario within X-VoE encompasses three distinct settings, probing models' comprehension of events and their underlying explanations. Beyond model evaluation, we present an explanation-based learning system that captures physics dynamics and infers occluded object states solely from visual sequences, without explicit occlusion labels. Experimental outcomes highlight our model's alignment with human commonsense when tested against X-VoE. A remarkable feature is our model's ability to visually expound VoE events by reconstructing concealed scenes. Concluding, we discuss the findings' implications and outline future research directions. Through X-VoE, we catalyze the advancement of AI endowed with human-like intuitive physics capabilities. △ Less

Submitted 20 August, 2023; originally announced August 2023.

Comments: 19 pages, 16 figures, selected for an Oral presentation at ICCV 2023. Project link: https://pku.ai/publication/intuitive2023iccv/

arXiv:2305.10036 [pdf, other]

Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark

Authors: Wenjun Peng, Jingwei Yi, Fangzhao Wu, Shangxi Wu, Bin Zhu, Lingjuan Lyu, Binxing Jiao, Tong Xu, Guangzhong Sun, Xing Xie

Abstract: Large language models (LLMs) have demonstrated powerful capabilities in both text understanding and generation. Companies have begun to offer Embedding as a Service (EaaS) based on these LLMs, which can benefit various natural language processing (NLP) tasks for customers. However, previous studies have shown that EaaS is vulnerable to model extraction attacks, which can cause significant losses f… ▽ More Large language models (LLMs) have demonstrated powerful capabilities in both text understanding and generation. Companies have begun to offer Embedding as a Service (EaaS) based on these LLMs, which can benefit various natural language processing (NLP) tasks for customers. However, previous studies have shown that EaaS is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs, as training these models is extremely expensive. To protect the copyright of LLMs for EaaS, we propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings. Our method selects a group of moderate-frequency words from a general text corpus to form a trigger set, then selects a target embedding as the watermark, and inserts it into the embeddings of texts containing trigger words as the backdoor. The weight of insertion is proportional to the number of trigger words included in the text. This allows the watermark backdoor to be effectively transferred to EaaS-stealer's model for copyright verification while minimizing the adverse impact on the original embeddings' utility. Our extensive experiments on various datasets show that our method can effectively protect the copyright of EaaS models without compromising service quality. △ Less

Submitted 2 June, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted by ACL 2023

arXiv:2305.01341 [pdf, other]

Next-Generation Full Duplex Networking System Empowered by Reconfigurable Intelligent Surfaces

Authors: Yingyang Chen, Yuncong Li, Miaowen Wen, Duoying Zhang, Bingli Jiao, Zhiguo Ding, Theodoros A. Tsiftsis, H. Vincent Poor

Abstract: Full duplex (FD) radio has attracted extensive attention due to its co-time and co-frequency transceiving capability. {However, the potential gain brought by FD radios is closely related to the management of self-interference (SI), which imposes high or even stringent requirements on SI cancellation (SIC) techniques. When the FD deployment evolves into next-generation mobile networking, the SI pro… ▽ More Full duplex (FD) radio has attracted extensive attention due to its co-time and co-frequency transceiving capability. {However, the potential gain brought by FD radios is closely related to the management of self-interference (SI), which imposes high or even stringent requirements on SI cancellation (SIC) techniques. When the FD deployment evolves into next-generation mobile networking, the SI problem becomes more complicated, significantly limiting its potential gains.} In this paper, we conceive a multi-cell FD networking scheme by deploying a reconfigurable intelligent surface (RIS) at the cell boundary to configure the radio environment proactively. To achieve the full potential of the system, we aim to maximize the sum rate (SR) of multiple cells by jointly optimizing the transmit precoding (TPC) matrices at FD base stations (BSs) and users and the phase shift matrix at RIS. Since the original problem is non-convex, we reformulate and decouple it into a pair of subproblems by utilizing the relationship between the SR and minimum mean square error (MMSE). The optimal solutions of TPC matrices are obtained in closed form, while both complex circle manifold (CCM) and successive convex approximation (SCA) based algorithms are developed to resolve the phase shift matrix suboptimally. Our simulation results show that introducing an RIS into an FD networking system not only improves the overall SR significantly but also enhances the cell edge performance prominently. More importantly, we validate that the RIS deployment with optimized phase shifts can reduce the requirement for SIC and the number of BS antennas, which further reduces the hardware cost and power consumption, especially with a sufficient number of reflecting elements. As a result, the utilization of an RIS enables the originally cumbersome FD networking system to become efficient and practical. △ Less

Submitted 2 May, 2023; originally announced May 2023.

Comments: 15 pages, 14 figures

arXiv:2304.04487 [pdf, other]

Inference with Reference: Lossless Acceleration of Large Language Models

Authors: Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei

Abstract: We propose LLMA, an LLM accelerator to losslessly speed up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the reference that is available in many real world scenarios (e.g., retrieved documents). LLMA first selects a text span from the reference and copies its tokens t… ▽ More We propose LLMA, an LLM accelerator to losslessly speed up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the reference that is available in many real world scenarios (e.g., retrieved documents). LLMA first selects a text span from the reference and copies its tokens to the decoder and then efficiently checks the tokens' appropriateness as the decoding result in parallel within one decoding step. The improved computational parallelism allows LLMA to achieve over 2x speed-up for LLMs with identical generation results as greedy decoding in many practical generation scenarios where significant overlap between in-context reference and outputs exists (e.g., search engines and multi-turn conversations). △ Less

Submitted 10 April, 2023; originally announced April 2023.

Comments: 9 pages

arXiv:2304.04321 [pdf, other]

ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes

Authors: Ran Gong, Jiangyong Huang, Yizhou Zhao, Haoran Geng, Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng Zhou, Demetri Terzopoulos, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

Abstract: Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task learning benchmarks assume discrete (e.g., binary) object goal states, which poses challenges for the learning of complex tasks and transferring learned policy from simulated environments to the real world. Furthermore, state discretization limits a robot's abil… ▽ More Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task learning benchmarks assume discrete (e.g., binary) object goal states, which poses challenges for the learning of complex tasks and transferring learned policy from simulated environments to the real world. Furthermore, state discretization limits a robot's ability to follow human instructions based on the grounding of actions and states. To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD is comprised of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. To promote language-instructed learning, we provide expert demonstrations with template-generated language descriptions. We assess task performance by utilizing the latest language-conditioned policy learning models. Our results indicate that current models for language-conditioned manipulations continue to experience significant challenges in novel goal-state generalizations, scene generalizations, and object generalizations. These findings highlight the need to develop new algorithms that address this gap and underscore the potential for further research in this area. Project website: https://arnold-benchmark.github.io. △ Less

Submitted 11 September, 2023; v1 submitted 9 April, 2023; originally announced April 2023.

Comments: The first two authors contributed equally; 20 pages; 17 figures; project availalbe: https://arnold-benchmark.github.io/ ICCV 2023

arXiv:2302.12095 [pdf, other]

On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective

Authors: Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang, Wei Ye, Xiubo Geng, Binxin Jiao, Yue Zhang, Xing Xie

Abstract: ChatGPT is a recent chatbot service released by OpenAI and is receiving increasing attention over the past few months. While evaluations of various aspects of ChatGPT have been done, its robustness, i.e., the performance to unexpected inputs, is still unclear to the public. Robustness is of particular concern in responsible AI, especially for safety-critical applications. In this paper, we conduct… ▽ More ChatGPT is a recent chatbot service released by OpenAI and is receiving increasing attention over the past few months. While evaluations of various aspects of ChatGPT have been done, its robustness, i.e., the performance to unexpected inputs, is still unclear to the public. Robustness is of particular concern in responsible AI, especially for safety-critical applications. In this paper, we conduct a thorough evaluation of the robustness of ChatGPT from the adversarial and out-of-distribution (OOD) perspective. To do so, we employ the AdvGLUE and ANLI benchmarks to assess adversarial robustness and the Flipkart review and DDXPlus medical diagnosis datasets for OOD evaluation. We select several popular foundation models as baselines. Results show that ChatGPT shows consistent advantages on most adversarial and OOD classification and translation tasks. However, the absolute performance is far from perfection, which suggests that adversarial and OOD robustness remains a significant threat to foundation models. Moreover, ChatGPT shows astounding performance in understanding dialogue-related texts and we find that it tends to provide informal suggestions for medical tasks instead of definitive answers. Finally, we present in-depth discussions of possible research directions. △ Less

Submitted 29 August, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

Comments: Highlighted paper at ICLR 2023 workshop on Trustworthy and Reliable Large-Scale Machine Learning Models; code is at: https://github.com/microsoft/robustlearn; more works: https://llm-eval.github.io/

arXiv:2301.06015 [pdf, other]

Diffusion-based Generation, Optimization, and Planning in 3D Scenes

Authors: Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, Song-Chun Zhu

Abstract: We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior works, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates the scene-aware generation,… ▽ More We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior works, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates the scene-aware generation, physics-based optimization, and goal-oriented planning via a diffusion-based denoising process in a fully differentiable fashion. Such a design alleviates the discrepancies among different modules and the posterior collapse of previous scene-conditioned generative models. We evaluate SceneDiffuser with various 3D scene understanding tasks, including human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms. The results show significant improvements compared with previous models, demonstrating the tremendous potential of SceneDiffuser for the broad community of 3D scene understanding. △ Less

Submitted 14 January, 2023; originally announced January 2023.

Comments: 20 pages

arXiv:2212.10192 [pdf, other]

Adam: Dense Retrieval Distillation with Adaptive Dark Examples

Authors: Chang Liu, Chongyang Tao, Xiubo Geng, Tao Shen, Dongyan Zhao, Can Xu, Binxing Jiao, Daxin Jiang

Abstract: To improve the performance of the dual-encoder retriever, one effective approach is knowledge distillation from the cross-encoder ranker. Existing works construct the candidate passages following the supervised learning setting where a query is paired with a positive passage and a batch of negatives. However, through empirical observation, we find that even the hard negatives from advanced methods… ▽ More To improve the performance of the dual-encoder retriever, one effective approach is knowledge distillation from the cross-encoder ranker. Existing works construct the candidate passages following the supervised learning setting where a query is paired with a positive passage and a batch of negatives. However, through empirical observation, we find that even the hard negatives from advanced methods are still too trivial for the teacher to distinguish, preventing the teacher from transferring abundant dark knowledge to the student through its soft label. To alleviate this issue, we propose ADAM, a knowledge distillation framework that can better transfer the dark knowledge held in the teacher with Adaptive Dark exAMples. Different from previous works that only rely on one positive and hard negatives as candidate passages, we create dark examples that all have moderate relevance to the query through mixing-up and masking in discrete space. Furthermore, as the quality of knowledge held in different training instances varies as measured by the teacher's confidence score, we propose a self-paced distillation strategy that adaptively concentrates on a subset of high-quality instances to conduct our dark-example-based knowledge distillation to help the student learn better. We conduct experiments on two widely-used benchmarks and verify the effectiveness of our method. △ Less

Submitted 20 December, 2022; originally announced December 2022.

Comments: 9 pages, 2 figures

arXiv:2212.03533 [pdf, other]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Authors: Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

Abstract: This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clu… ▽ More This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters. △ Less

Submitted 22 February, 2024; v1 submitted 7 December, 2022; originally announced December 2022.

Comments: 17 pages, v2 fixes the SummEval numbers

arXiv:2212.02398 [pdf, other]

Generalizable Person Re-Identification via Viewpoint Alignment and Fusion

Authors: Bingliang Jiao, Lingqiao Liu, Liying Gao, Guosheng Lin, Ruiqi Wu, Shizhou Zhang, Peng Wang, Yanning Zhang

Abstract: In the current person Re-identification (ReID) methods, most domain generalization works focus on dealing with style differences between domains while largely ignoring unpredictable camera view change, which we identify as another major factor leading to a poor generalization of ReID methods. To tackle the viewpoint change, this work proposes to use a 3D dense pose estimation model and a texture m… ▽ More In the current person Re-identification (ReID) methods, most domain generalization works focus on dealing with style differences between domains while largely ignoring unpredictable camera view change, which we identify as another major factor leading to a poor generalization of ReID methods. To tackle the viewpoint change, this work proposes to use a 3D dense pose estimation model and a texture mapping module to map the pedestrian images to canonical view images. Due to the imperfection of the texture mapping module, the canonical view images may lose the discriminative detail clues from the original images, and thus directly using them for ReID will inevitably result in poor performance. To handle this issue, we propose to fuse the original image and canonical view image via a transformer-based module. The key insight of this design is that the cross-attention mechanism in the transformer could be an ideal solution to align the discriminative texture clues from the original image with the canonical view image, which could compensate for the low-quality texture information of the canonical view image. Through extensive experiments, we show that our method can lead to superior performance over the existing approaches in various evaluation settings. △ Less

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.15402 [pdf, other]

Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

Authors: Jiangyong Huang, William Yicheng Zhu, Baoxiong Jia, Zan Wang, Xiaojian Ma, Qing Li, Siyuan Huang

Abstract: Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spec… ▽ More Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains $\unicode{x2014}$ Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework to allow for the evaluation of arbitrary visual representation on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbone generally outperforms CNN-based backbone on G-VUE, (2) visual representations from vision-language pre-training are superior to those with vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations. △ Less

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.11275 [pdf, other]

doi 10.1109/TMM.2023.3275873

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Authors: Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei

Abstract: Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech rep… ▽ More Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning was not well explored. In this paper, we propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual related downstream tasks, including audio-visual speech recognition (AVSR), visual speech recognition (VSR) tasks. Results show that the proposed VATLM outperforms previous the state-of-the-art models, such as audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm. △ Less

Submitted 19 May, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: 11 pages, Accepted by IEEE Transactions on Multimedia

arXiv:2210.08990 [pdf, other]

Improving Object-centric Learning with Query Optimization

Authors: Baoxiong Jia, Yu Liu, Siyuan Huang

Abstract: The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In the recent culmination of unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and fostered many powerful variants. These methods, however, have been exceedingly difficult to t… ▽ More The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In the recent culmination of unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and fostered many powerful variants. These methods, however, have been exceedingly difficult to train without supervision and are ambiguous in the notion of object, especially for complex natural scenes. In this paper, we propose to address these issues by investigating the potential of learnable queries as initializations for Slot-Attention learning, uniting it with efforts from existing attempts on improving Slot-Attention learning with bi-level optimization. With simple code adjustments on Slot-Attention, our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin. We provide thorough ablative studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits great potential for concept binding and zero-shot learning. Our work is made publicly available at https://bo-qsa.github.io △ Less

Submitted 10 February, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: Published as a conference paper at ICLR 2023

arXiv:2210.08809 [pdf, other]

Effective and Efficient Query-aware Snippet Extraction for Web Search

Authors: Jingwei Yi, Fangzhao Wu, Chuhan Wu, Xiaolong Huang, Binxing Jiao, Guangzhong Sun, Xing Xie

Abstract: Query-aware webpage snippet extraction is widely used in search engines to help users better understand the content of the returned webpages before clicking. Although important, it is very rarely studied. In this paper, we propose an effective query-aware webpage snippet extraction method named DeepQSE, aiming to select a few sentences which can best summarize the webpage content in the context of… ▽ More Query-aware webpage snippet extraction is widely used in search engines to help users better understand the content of the returned webpages before clicking. Although important, it is very rarely studied. In this paper, we propose an effective query-aware webpage snippet extraction method named DeepQSE, aiming to select a few sentences which can best summarize the webpage content in the context of input query. DeepQSE first learns query-aware sentence representations for each sentence to capture the fine-grained relevance between query and sentence, and then learns document-aware query-sentence relevance representations for snippet extraction. Since the query and each sentence are jointly modeled in DeepQSE, its online inference may be slow. Thus, we further propose an efficient version of DeepQSE, named Efficient-DeepQSE, which can significantly improve the inference speed of DeepQSE without affecting its performance. The core idea of Efficient-DeepQSE is to decompose the query-aware snippet extraction task into two stages, i.e., a coarse-grained candidate sentence selection stage where sentence representations can be cached, and a fine-grained relevance modeling stage. Experiments on two real-world datasets validate the effectiveness and efficiency of our methods. △ Less

Submitted 27 October, 2022; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: Accepted by EMNLP2022

arXiv:2210.03929 [pdf, other]

EgoTaskQA: Understanding Human Tasks in Egocentric Videos

Authors: Baoxiong Jia, Ting Lei, Song-Chun Zhu, Siyuan Huang

Abstract: Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism from multi-tasking and partia… ▽ More Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism from multi-tasking and partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark that provides a single home for the crucial dimensions of task understanding through question-answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?) to provide diagnostic analyses on spatial, temporal, and causal understandings of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show their significant gaps between humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move onward with goal-oriented video understanding and reasoning. △ Less

Submitted 8 October, 2022; originally announced October 2022.

Comments: Published at NeurIPS Track on Datasets and Benchmarks 2022

arXiv:2209.04128 [pdf, other]

Modelling Power Consumptions for Multi-rotor UAVs

Authors: Hao Gong, Baoqi Huang, Bing Jia, Hansu Dai

Abstract: Unmanned aerial vehicles (UAVs) have various advantages, but their practical applications are influenced by their limited energy. Therefore, it is important to manage their power consumption and also important to establish corresponding power consumption models. However, most of existing works either establish theoretical power consumption models for fixed-wing UAVs and single-rotor UAVs, or provi… ▽ More Unmanned aerial vehicles (UAVs) have various advantages, but their practical applications are influenced by their limited energy. Therefore, it is important to manage their power consumption and also important to establish corresponding power consumption models. However, most of existing works either establish theoretical power consumption models for fixed-wing UAVs and single-rotor UAVs, or provide heuristic power consumption models for multi-rotor UAVs without rigorous mathematical derivations. This paper aims to establish theoretical power consumption models for multi-rotor UAVs. To be specific, the closed-form power consumption models for a multi-rotor UAV in three flight statuses, i.e., forward flight, vertical ascent and vertical descent, are derived by leveraging the relationship between single-rotor UAVs and multi-rotor UAVs in terms of power consumptions. On this basis, a generic flight power consumption model for the UAV in a three-dimensional (3-D) scenario is obtained. Extensive experiments are conducted by using DJI M210 and a mobile app made by DJI Mobile SDK in real scenarios, and confirm the correctness and effectiveness of these models; in addition, simulations are performed to further investigate the effect of the rotor numbers on the power consumption for the UAV. The proposed power consumption models not only reveal how the power consumption of multi-rotor UAVs are affected by various factors, but also pave the way for introducing other novel applications. △ Less

Submitted 9 September, 2022; originally announced September 2022.

arXiv:2208.14754 [pdf, other]

LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

Authors: Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang

Abstract: In large-scale retrieval, the lexicon-weighting paradigm, learning weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Despite it deeply exploiting the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval -- the former preferring certain or low-… ▽ More In large-scale retrieval, the lexicon-weighting paradigm, learning weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Despite it deeply exploiting the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval -- the former preferring certain or low-entropy words whereas the latter favoring pivot or high-entropy words -- becoming the main barrier to lexicon-weighting performance for large-scale retrieval. To bridge this gap, we propose a brand-new pre-training framework, lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we present a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to the lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark, MS-Marco, it achieves 42.6% MRR@10 with 45.8 QPS for the passage dataset and 44.4% MRR@100 with 134.8 QPS for the document dataset, by a CPU machine. And LexMAE shows state-of-the-art zero-shot transfer capability on BEIR benchmark with 12 datasets. △ Less

Submitted 4 June, 2023; v1 submitted 31 August, 2022; originally announced August 2022.

Comments: Appeared at ICLR 2023

arXiv:2208.13661 [pdf, other]

LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval

Authors: Kai Zhang, Chongyang Tao, Tao Shen, Can Xu, Xiubo Geng, Binxing Jiao, Daxin Jiang

Abstract: Retrieval models based on dense representations in semantic space have become an indispensable branch for first-stage retrieval. These retrievers benefit from surging advances in representation learning towards compressive global sequence-level embeddings. However, they are prone to overlook local salient phrases and entity mentions in texts, which usually play pivot roles in first-stage retrieval… ▽ More Retrieval models based on dense representations in semantic space have become an indispensable branch for first-stage retrieval. These retrievers benefit from surging advances in representation learning towards compressive global sequence-level embeddings. However, they are prone to overlook local salient phrases and entity mentions in texts, which usually play pivot roles in first-stage retrieval. To mitigate this weakness, we propose to make a dense retriever align a well-performing lexicon-aware representation model. The alignment is achieved by weakened knowledge distillations to enlighten the retriever via two aspects -- 1) a lexicon-augmented contrastive objective to challenge the dense encoder and 2) a pair-wise rank-consistent regularization to make dense model's behavior incline to the other. We evaluate our model on three public benchmarks, which shows that with a comparable lexicon-aware retriever as the teacher, our proposed dense one can bring consistent and significant improvements, and even outdo its teacher. In addition, we found our improvement on the dense retriever is complementary to the standard ranker distillation, which can further lift state-of-the-art performance. △ Less

Submitted 2 March, 2023; v1 submitted 29 August, 2022; originally announced August 2022.

Comments: 14 pages, 6 tables, 4 figures. WWW 2023

arXiv:2207.07869 [pdf, other]

CA-SpaceNet: Counterfactual Analysis for 6D Pose Estimation in Space

Authors: Shunli Wang, Shuaibing Wang, Bo Jiao, Dingkang Yang, Liuzhen Su, Peng Zhai, Chixiao Chen, Lihua Zhang

Abstract: Reliable and stable 6D pose estimation of uncooperative space objects plays an essential role in on-orbit servicing and debris removal missions. Considering that the pose estimator is sensitive to background interference, this paper proposes a counterfactual analysis framework named CASpaceNet to complete robust 6D pose estimation of the spaceborne targets under complicated background. Specificall… ▽ More Reliable and stable 6D pose estimation of uncooperative space objects plays an essential role in on-orbit servicing and debris removal missions. Considering that the pose estimator is sensitive to background interference, this paper proposes a counterfactual analysis framework named CASpaceNet to complete robust 6D pose estimation of the spaceborne targets under complicated background. Specifically, conventional methods are adopted to extract the features of the whole image in the factual case. In the counterfactual case, a non-existent image without the target but only the background is imagined. Side effect caused by background interference is reduced by counterfactual analysis, which leads to unbiased prediction in final results. In addition, we also carry out lowbit-width quantization for CA-SpaceNet and deploy part of the framework to a Processing-In-Memory (PIM) accelerator on FPGA. Qualitative and quantitative results demonstrate the effectiveness and efficiency of our proposed method. To our best knowledge, this paper applies causal inference and network quantization to the 6D pose estimation of space-borne targets for the first time. The code is available at https://github.com/Shunli-Wang/CA-SpaceNet. △ Less

Submitted 16 July, 2022; originally announced July 2022.

Comments: 8 pages, 6 figures, IROS-2022 conference paper

ACM Class: I.4.9; I.2.9

arXiv:2207.02578 [pdf, other]

SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

Authors: Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

Abstract: In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, which is inspired by ELECTRA, to improve th… ▽ More In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, which is inspired by ELECTRA, to improve the sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to unlabeled corpus, and is more broadly applicable when there are no labeled data or queries. We conduct experiments on several large-scale passage retrieval datasets, and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2 which incurs significantly more storage cost. Our code and model check points are available at https://github.com/microsoft/unilm/tree/master/simlm . △ Less

Submitted 12 May, 2023; v1 submitted 6 July, 2022; originally announced July 2022.

Comments: Accepted to ACL 2023

arXiv:2206.08063 [pdf, other]

Towards Robust Ranker for Text Retrieval

Authors: Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Binxing Jiao, Daxin Jiang

Abstract: A ranker plays an indispensable role in the de facto 'retrieval & rerank' pipeline, but its training still lags behind -- learning from moderate negatives or/and serving as an auxiliary module for a retriever. In this work, we first identify two major barriers to a robust ranker, i.e., inherent label noises caused by a well-trained retriever and non-ideal negatives sampled for a high-capable ranke… ▽ More A ranker plays an indispensable role in the de facto 'retrieval & rerank' pipeline, but its training still lags behind -- learning from moderate negatives or/and serving as an auxiliary module for a retriever. In this work, we first identify two major barriers to a robust ranker, i.e., inherent label noises caused by a well-trained retriever and non-ideal negatives sampled for a high-capable ranker. Thereby, we propose multiple retrievers as negative generators improve the ranker's robustness, where i) involving extensive out-of-distribution label noises renders the ranker against each noise distribution, and ii) diverse hard negatives from a joint distribution are relatively close to the ranker's negative distribution, leading to more challenging thus effective training. To evaluate our robust ranker (dubbed R$^2$anker), we conduct experiments in various settings on the popular passage retrieval benchmark, including BM25-reranking, full-ranking, retriever distillation, etc. The empirical results verify the new state-of-the-art effectiveness of our model. △ Less

Submitted 16 June, 2022; originally announced June 2022.

Comments: 11 pages of main content, 4 tables, 3 figures

arXiv:2206.05895 [pdf, other]

Latent Diffusion Energy-Based Model for Interpretable Text Modeling

Authors: Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu

Abstract: Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interests in generative modeling. Fueled by its flexibility in the formulation and strong modeling power of the latent space, recent works built upon it have made interesting attempts aiming at the interpretability of text modeling. However, latent space EBMs also inherit some flaws from EBMs in data spa… ▽ More Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interests in generative modeling. Fueled by its flexibility in the formulation and strong modeling power of the latent space, recent works built upon it have made interesting attempts aiming at the interpretability of text modeling. However, latent space EBMs also inherit some flaws from EBMs in data space; the degenerate MCMC sampling quality in practice can lead to poor generation quality and instability in training, especially on data with complex latent structures. Inspired by the recent efforts that leverage diffusion recovery likelihood learning as a cure for the sampling issue, we introduce a novel symbiosis between the diffusion models and latent space EBMs in a variational learning framework, coined as the latent diffusion energy-based model. We develop a geometric clustering-based regularization jointly with the information bottleneck to further improve the quality of the learned latent space. Experiments on several challenging tasks demonstrate the superior performance of our model on interpretable text modeling over strong counterparts. △ Less

Submitted 4 October, 2023; v1 submitted 12 June, 2022; originally announced June 2022.

Comments: ICML 2022

arXiv:2206.00277 [pdf, other]

Task-Specific Expert Pruning for Sparse Mixture-of-Experts

Authors: Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei

Abstract: The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results due to its model capacity. However, with trillions of parameters, MoE is hard to be deployed on cloud or mobile environment. The inference of MoE requires expert parallelism, which is not hardware-friendly and communication expensive. Especially for resource-limited downstream task… ▽ More The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results due to its model capacity. However, with trillions of parameters, MoE is hard to be deployed on cloud or mobile environment. The inference of MoE requires expert parallelism, which is not hardware-friendly and communication expensive. Especially for resource-limited downstream tasks, such sparse structure has to sacrifice a lot of computing efficiency for limited performance gains. In this work, we observe most experts contribute scarcely little to the MoE fine-tuning and inference. We further propose a general method to progressively drop the non-professional experts for the target downstream task, which preserves the benefits of MoE while reducing the MoE model into one single-expert dense model. Our experiments reveal that the fine-tuned single-expert model could preserve 99.3% benefits from MoE across six different types of tasks while enjoying 2x inference speed with free communication cost. △ Less

Submitted 1 June, 2022; v1 submitted 1 June, 2022; originally announced June 2022.

Comments: under review

arXiv:2206.00216 [pdf, other]

THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption

Authors: Tianyu Chen, Hangbo Bao, Shaohan Huang, Li Dong, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei

Abstract: As more and more pre-trained language models adopt on-cloud deployment, the privacy issues grow quickly, mainly for the exposure of plain-text user data (e.g., search history, medical record, bank account). Privacy-preserving inference of transformer models is on the demand of cloud service users. To protect privacy, it is an attractive choice to compute only with ciphertext in homomorphic encrypt… ▽ More As more and more pre-trained language models adopt on-cloud deployment, the privacy issues grow quickly, mainly for the exposure of plain-text user data (e.g., search history, medical record, bank account). Privacy-preserving inference of transformer models is on the demand of cloud service users. To protect privacy, it is an attractive choice to compute only with ciphertext in homomorphic encryption (HE). However, enabling pre-trained models inference on ciphertext data is difficult due to the complex computations in transformer blocks, which are not supported by current HE tools yet. In this work, we introduce $\textit{THE-X}$, an approximation approach for transformers, which enables privacy-preserving inference of pre-trained models developed by popular frameworks. $\textit{THE-X}$ proposes a workflow to deal with complex computation in transformer networks, including all the non-polynomial functions like GELU, softmax, and LayerNorm. Experiments reveal our proposed $\textit{THE-X}$ can enable transformer inference on encrypted data for different downstream tasks, all with negligible performance drop but enjoying the theory-guaranteed privacy-preserving advantage. △ Less

Submitted 1 June, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

Comments: Findings of ACL 2022

arXiv:2111.12990 [pdf, other]

Learning Algebraic Representation for Systematic Generalization in Abstract Reasoning

Authors: Chi Zhang, Sirui Xie, Baoxiong Jia, Ying Nian Wu, Song-Chun Zhu, Yixin Zhu

Abstract: Is intelligence realized by connectionist or classicist? While connectionist approaches have achieved superhuman performance, there has been growing evidence that such task-specific superiority is particularly fragile in systematic generalization. This observation lies in the central debate between connectionist and classicist, wherein the latter continually advocates an algebraic treatment in cog… ▽ More Is intelligence realized by connectionist or classicist? While connectionist approaches have achieved superhuman performance, there has been growing evidence that such task-specific superiority is particularly fragile in systematic generalization. This observation lies in the central debate between connectionist and classicist, wherein the latter continually advocates an algebraic treatment in cognitive architectures. In this work, we follow the classicist's call and propose a hybrid approach to improve systematic generalization in reasoning. Specifically, we showcase a prototype with algebraic representation for the abstract spatial-temporal reasoning task of Raven's Progressive Matrices (RPM) and present the ALgebra-Aware Neuro-Semi-Symbolic (ALANS) learner. The ALANS learner is motivated by abstract algebra and the representation theory. It consists of a neural visual perception frontend and an algebraic abstract reasoning backend: the frontend summarizes the visual information from object-based representation, while the backend transforms it into an algebraic structure and induces the hidden operator on the fly. The induced operator is later executed to predict the answer's representation, and the choice most similar to the prediction is selected as the solution. Extensive experiments show that by incorporating an algebraic treatment, the ALANS learner outperforms various pure connectionist models in domains requiring systematic generalization. We further show the generative nature of the learned algebraic representation; it can be decoded by isomorphism to generate an answer. △ Less

Submitted 20 July, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

Comments: ECCV 2022 paper. Supplementary: http://wellyzhang.github.io/attach/eccv22zhang_alans_supp.pdf Project: http://wellyzhang.github.io/project/alans.html

arXiv:2108.09193 [pdf, other]

Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer

Authors: Chuhan Wu, Fangzhao Wu, Tao Qi, Binxing Jiao, Daxin Jiang, Yongfeng Huang, Xing Xie

Abstract: Transformer has achieved great success in NLP. However, the quadratic complexity of the self-attention mechanism in Transformer makes it inefficient in handling long sequences. Many existing works explore to accelerate Transformers by computing sparse self-attention instead of a dense one, which usually attends to tokens at certain positions or randomly selected tokens. However, manually selected… ▽ More Transformer has achieved great success in NLP. However, the quadratic complexity of the self-attention mechanism in Transformer makes it inefficient in handling long sequences. Many existing works explore to accelerate Transformers by computing sparse self-attention instead of a dense one, which usually attends to tokens at certain positions or randomly selected tokens. However, manually selected or random tokens may be uninformative for context modeling. In this paper, we propose Smart Bird, which is an efficient and effective Transformer with learnable sparse attention. In Smart Bird, we first compute a sketched attention matrix with a single-head low-dimensional Transformer, which aims to find potential important interactions between tokens. We then sample token pairs based on their probability scores derived from the sketched attention matrix to generate different sparse attention index matrices for different attention heads. Finally, we select token embeddings according to the index matrices to form the input of sparse attention networks. Extensive experiments on six benchmark datasets for different tasks validate the efficiency and effectiveness of Smart Bird in text modeling. △ Less

Submitted 2 September, 2021; v1 submitted 20 August, 2021; originally announced August 2021.

Showing 1–50 of 113 results for author: Jiao, B