
Showing 1–50 of 245 results for author: Gao, P

Searching in archive cs.
  1. arXiv:2405.05945  [pdf, other]

    cs.CV

    Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

    Authors: Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

    Abstract: Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified f…

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: Technical Report; Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  2. arXiv:2405.04883  [pdf, other]

    cs.CV cs.AI cs.LG

    Molecule-Space: Free Lunch in Unified Multimodal Space via Knowledge Fusion

    Authors: Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, Zhou Zhao

    Abstract: Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and catastrophic forgetting problems make it challenging to further enhance pre-trained unified spaces. In this work, we propose Molecule-Space, an idea that treats multimodal representation spaces as "molecules", and augments pre-trained unified space…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Accepted by ICML 2024. The code and checkpoints are released at https://github.com/MoleculeSpace/MoleculeSpace

  3. arXiv:2404.16212  [pdf, other]

    cs.CR cs.CV cs.LG

    An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape

    Authors: Sifat Muhammad Abdullah, Aravind Cheruvu, Shravya Kanchi, Taejoong Chung, Peng Gao, Murtuza Jadliwala, Bimal Viswanath

    Abstract: Deepfake or synthetic images produced using deep generative models pose serious risks to online platforms. This has triggered several research efforts to accurately detect deepfake images, achieving excellent performance on publicly available deepfake datasets. In this work, we study 8 state-of-the-art detectors and argue that they are far from being ready for deployment due to two recent developm…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: Accepted to IEEE S&P 2024; 19 pages, 10 figures

  4. arXiv:2404.16006  [pdf, other]

    cs.CV

    MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

    Authors: Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

    Abstract: Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 77 pages, 41 figures

  5. arXiv:2404.14759  [pdf, other]

    cs.CV

    Unified Unsupervised Salient Object Detection via Knowledge Transfer

    Authors: Yao Yuan, Wutao Liu, Pan Gao, Qun Dai, Jie Qin

    Abstract: Recently, unsupervised salient object detection (USOD) has gained increasing attention due to its annotation-free nature. However, current methods mainly focus on specific tasks such as RGB and RGB-D, neglecting the potential for task migration. In this paper, we propose a unified USOD framework for generic USOD tasks. Firstly, we propose a Progressive Curriculum Learning-based Saliency Distilling…

    Submitted 23 April, 2024; originally announced April 2024.

  6. arXiv:2404.13550  [pdf, other]

    cs.CV eess.IV

    Pointsoup: High-Performance and Extremely Low-Decoding-Latency Learned Geometry Codec for Large-Scale Point Cloud Scenes

    Authors: Kang You, Kai Liu, Li Yu, Pan Gao, Dandan Ding

    Abstract: Despite considerable progress being achieved in point cloud geometry compression, there still remains a challenge in effectively compressing large-scale scenes with sparse surfaces. Another key challenge lies in reducing decoding latency, a crucial requirement in real-world application. In this paper, we propose Pointsoup, an efficient learning-based geometry codec that attains high-performance an…

    Submitted 21 April, 2024; originally announced April 2024.

  7. arXiv:2404.06936  [pdf, other]

    cs.CV cs.MM

    Efficient and Generic Point Model for Lossless Point Cloud Attribute Compression

    Authors: Kang You, Pan Gao, Zhan Ma

    Abstract: The past several years have witnessed the emergence of learned point cloud compression (PCC) techniques. However, current learning-based lossless point cloud attribute compression (PCAC) methods either suffer from high computational complexity or deteriorated compression performance. Moreover, the significant variations in point cloud scale and sparsity encountered in real-world applications make…

    Submitted 10 April, 2024; originally announced April 2024.

  8. arXiv:2404.04050  [pdf, other]

    cs.CV

    No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

    Authors: Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Jiaming Liu, Han Xiao, Chaoyou Fu, Hao Dong, Peng Gao

    Abstract: To reduce the reliance on large-scale datasets, recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot segmentation methods first pre-train models on 'seen' classes, and then evaluate their generalization performance on 'unseen' classes. However, the prior pre-training stage not only introduces excessive time overhead but also incurs a significant domain gap on 'unseen' c…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: CVPR Highlight. Code is available at https://github.com/yangyangyang127/Seg-NN. arXiv admin note: text overlap with arXiv:2308.12961

  9. arXiv:2404.01618  [pdf, other]

    cs.RO

    Multi-Robot Collaborative Navigation with Formation Adaptation

    Authors: Zihao Deng, Peng Gao, Williard Joshua Jose, Hao Zhang

    Abstract: Multi-robot collaborative navigation is an essential ability where teamwork and synchronization are key. In complex and uncertain environments, adaptive formation is vital, as rigid formations prove to be inadequate. The ability of robots to dynamically adjust their formation enables navigation through unpredictable spaces, maintaining cohesion, and effectively responding to environmental challen…

    Submitted 1 April, 2024; originally announced April 2024.

  10. arXiv:2403.20271  [pdf, other]

    cs.CV

    Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

    Authors: Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li

    Abstract: The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand p…

    Submitted 31 March, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: 16 pages, 7 figures

  11. arXiv:2403.17770  [pdf, other]

    eess.IV cs.CV

    CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation

    Authors: Yongrui Yu, Hanyu Chen, Zitian Zhang, Qiong Xiao, Wenhui Lei, Linrui Dai, Yu Fu, Hui Tan, Guan Wang, Peng Gao, Xiaofan Zhang

    Abstract: Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node…

    Submitted 26 March, 2024; originally announced March 2024.

  12. arXiv:2403.14624  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Authors: Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li

    Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: 46 Pages, Work in Progress, Benchmark Project Page: https://mathverse-cuhk.github.io

  13. arXiv:2403.11289  [pdf, other]

    cs.RO

    ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

    Authors: Siyuan Huang, Iaroslav Ponomarenko, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, Hao Dong

    Abstract: The integration of Multimodal Large Language Models (MLLMs) with robotic systems has significantly enhanced the ability of robots to interpret and act upon natural language instructions. Despite these advancements, conventional MLLMs are typically trained on generic image-text pairs, lacking essential robotics knowledge such as affordances and physical knowledge, which hampers their efficacy in ma…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: Code and dataset will be made publicly available at https://github.com/SiyuanHuang95/ManipVQA

  14. arXiv:2403.07692  [pdf, other]

    cs.CV

    Masked AutoDecoder is Effective Multi-Task Vision Generalist

    Authors: Han Qiu, Jiaxing Huang, Peng Gao, Lewei Lu, Xiaoqin Zhang, Shijian Lu

    Abstract: Inspired by the success of general-purpose models in NLP, recent studies attempt to unify different vision tasks in the same sequence format and employ autoregressive Transformers for sequence prediction. They apply uni-directional attention to capture sequential dependencies and generate task sequences recursively. However, such autoregressive Transformers may not fit vision tasks well, as vision…

    Submitted 14 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  15. arXiv:2402.17098  [pdf, other]

    cs.CV

    In Defense and Revival of Bayesian Filtering for Thermal Infrared Object Tracking

    Authors: Peng Gao, Shi-Min Li, Feng Gao, Fei Wang, Ru-Yue Yuan, Hamido Fujita

    Abstract: Deep learning-based methods monopolize the latest research in the field of thermal infrared (TIR) object tracking. However, relying solely on deep learning models to obtain better tracking results requires carefully selecting feature information that is beneficial to representing the target object and designing a reasonable template update strategy, which undoubtedly increases the difficulty of mo…

    Submitted 26 February, 2024; originally announced February 2024.

  16. arXiv:2402.16880  [pdf, other]

    cs.LG cs.AI cs.CL

    BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

    Authors: Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

    Abstract: Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer…

    Submitted 19 April, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

  17. arXiv:2402.16570  [pdf, other]

    cs.CV cs.LG

    Searching a Lightweight Network Architecture for Thermal Infrared Pedestrian Tracking

    Authors: Peng Gao, Xiao Liu, Yu Wang, Ru-Yue Yuan

    Abstract: Manually-designed network architectures for thermal infrared pedestrian tracking (TIR-PT) require substantial effort from human experts. Neural networks with ResNet backbones are popular for TIR-PT. However, TIR-PT is a tracking task and more challenging than classification and detection. This paper makes an early attempt to search an optimal network architecture for TIR-PT automatically, employin…

    Submitted 26 February, 2024; originally announced February 2024.

  18. arXiv:2402.14309  [pdf, other]

    cs.CV

    YOLO-TLA: An Efficient and Lightweight Small Object Detection Model based on YOLOv5

    Authors: Peng Gao, Chun-Lin Ji, Tao Yu, Ru-Yue Yuan

    Abstract: Object detection, a crucial aspect of computer vision, has seen significant advancements in accuracy and robustness. Despite these advancements, practical applications still face notable challenges, primarily the inaccurate detection or missed detection of small objects. In this paper, we propose YOLO-TLA, an advanced object detection model building on YOLOv5. We first introduce an additional dete…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: 11 pages, 11 figures, 7 tables

  19. arXiv:2402.14304   

    cs.RO cs.AI cs.CV

    Vision-Language Navigation with Embodied Intelligence: A Survey

    Authors: Peng Gao, Peng Wang, Feng Gao, Fei Wang, Ruyue Yuan

    Abstract: As a long-term vision in the field of artificial intelligence, the core goal of embodied intelligence is to improve the perception, understanding, and interaction capabilities of agents and the environment. Vision-language navigation (VLN), as a critical research path to achieve embodied intelligence, focuses on exploring how agents use natural language to communicate effectively with humans, rece…

    Submitted 15 March, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: The pictures in Figures 2, 4, and 5 are used without authorization, and the literatures in Table 1 have been cited improperly

  20. arXiv:2402.14236  [pdf, other]

    cs.LG cs.AI cs.AR

    Automated Design and Optimization of Distributed Filtering Circuits via Reinforcement Learning

    Authors: Peng Gao, Tao Yu, Fei Wang, Ru-Yue Yuan

    Abstract: Designing distributed filtering circuits (DFCs) is complex and time-consuming, with the circuit performance relying heavily on the expertise and experience of electronics engineers. However, manual design methods tend to have exceedingly low efficiency. This study proposes a novel end-to-end automated method for fabricating circuits to improve the design of DFCs. The proposed method harnesses rein…

    Submitted 21 February, 2024; originally announced February 2024.

    Comments: 13 pages, 7 figures, 4 tables

  21. arXiv:2402.05935  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

    Authors: Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao

    Abstract: We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

  22. arXiv:2402.03327  [pdf, other]

    cs.CV cs.AI cs.CL

    Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models

    Authors: Dingning Liu, Xiaoshui Huang, Yuenan Hou, Zhihui Wang, Zhenfei Yin, Yongshun Gong, Peng Gao, Wanli Ouyang

    Abstract: In this paper, we introduce Uni3D-LLM, a unified framework that leverages a Large Language Model (LLM) to integrate tasks of 3D perception, generation, and editing within point cloud scenes. This framework empowers users to effortlessly generate and modify objects at specified locations within a scene, guided by the versatility of natural language descriptions. Uni3D-LLM harnesses the expressive p…

    Submitted 9 January, 2024; originally announced February 2024.

    Comments: 10 pages, 6 figures

  23. arXiv:2402.01767  [pdf, other]

    cs.CL cs.AI cs.LG

    HiQA: A Hierarchical Contextual Augmentation RAG for Massive Documents QA

    Authors: Xinyue Chen, Pengyu Gao, Jiangjiang Song, Xiaoyang Tan

    Abstract: As language model agents leveraging external tools rapidly evolve, significant progress has been made in question-answering (QA) methodologies utilizing supplementary documents and the Retrieval-Augmented Generation (RAG) approach. This advancement has improved the response quality of language models and alleviated the appearance of hallucination. However, these methods exhibit limited retrieval ac…

    Submitted 31 January, 2024; originally announced February 2024.

  24. arXiv:2401.05861  [pdf, other]

    cs.CL

    Towards Boosting Many-to-Many Multilingual Machine Translation with Large Language Models

    Authors: Pengzhi Gao, Zhongjun He, Hua Wu, Haifeng Wang

    Abstract: The training paradigm for machine translation has gradually shifted, from learning neural machine translation (NMT) models with extensive parallel corpora to instruction finetuning on multilingual large language models (LLMs) with high-quality translation pairs. In this paper, we focus on boosting many-to-many multilingual translation of LLMs with an emphasis on zero-shot translation directions. W…

    Submitted 7 February, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

  25. arXiv:2401.02384  [pdf, other]

    cs.CV

    ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

    Authors: Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo

    Abstract: Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. While vision-language models trained on chart data excel in comprehension, they struggle with generalization. To a…

    Submitted 15 February, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Updated and corrected experimental results, removal of inappropriate experiments, and a more comprehensive experimental setup

  26. arXiv:2312.15869  [pdf, other]

    cs.CL cs.AI

    Medical Report Generation based on Segment-Enhanced Contrastive Representation Learning

    Authors: Ruoqing Zhao, Xi Wang, Hongliang Dai, Pan Gao, Piji Li

    Abstract: Automated radiology report generation has the potential to improve radiology reporting and alleviate the workload of radiologists. However, the medical report generation task poses unique challenges due to the limited availability of medical data and the presence of data bias. To maximize the utility of available data and reduce data bias, we propose MSCL (Medical image Segmentation with Contrasti…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: NLPCC 2023

  27. arXiv:2312.15653  [pdf, other]

    cs.IT eess.SP

    Index Modulation for Fluid Antenna-Assisted MIMO Communications: System Design and Performance Analysis

    Authors: Jing Zhu, Gaojie Chen, Pengyu Gao, Pei Xiao, Zihuai Lin, Atta Quddus

    Abstract: In this paper, we propose a transmission mechanism for fluid antennas (FAs) enabled multiple-input multiple-output (MIMO) communication systems based on index modulation (IM), named FA-IM, which incorporates the principle of IM into FAs-assisted MIMO system to improve the spectral efficiency (SE) without increasing the hardware complexity. In FA-IM, the information bits are mapped not only to the…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: 12 pages, 9 figures, publish to TWC

  28. arXiv:2312.14074  [pdf, other]

    cs.CV

    LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

    Authors: Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, Shanghang Zhang

    Abstract: Recently, Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and 2D image understanding. While these models are powerful, they have not yet been developed to comprehend the more challenging 3D physical scenes, especially when it comes to the sparse outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw LiDAR dat…

    Submitted 21 December, 2023; originally announced December 2023.

  29. arXiv:2312.13655  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    Compositional Zero-Shot Learning for Attribute-Based Object Reference in Human-Robot Interaction

    Authors: Peng Gao, Ahmed Jaafar, Brian Reily, Christopher Reardon, Hao Zhang

    Abstract: Language-enabled robots have been widely studied over the past years to enable natural human-robot interaction and teaming in various real-world applications. Language-enabled robots must be able to comprehend referring expressions to identify a particular object from visual perception using a set of referring attributes extracted from natural language. However, visual observations of an object ma…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Equal contribution from the first two authors

  30. arXiv:2312.12436  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

    Authors: Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, Xing Sun

    Abstract: The surge of interest towards Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM built from the gr…

    Submitted 20 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: Total 120 pages. See our project at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

  31. arXiv:2312.09738  [pdf, other]

    cs.AI

    3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V

    Authors: Dingning Liu, Xiaomeng Dong, Renrui Zhang, Xu Luo, Peng Gao, Xiaoshui Huang, Yongshun Gong, Zhihui Wang

    Abstract: In this work, we present a new visual prompting method called 3DAxiesPrompts (3DAP) to unleash the capabilities of GPT-4V in performing 3D spatial tasks. Our investigation reveals that while GPT-4V exhibits proficiency in discerning the position and interrelations of 2D entities through current visual prompting techniques, its abilities in handling 3D spatial tasks have yet to be explored. In our…

    Submitted 15 December, 2023; originally announced December 2023.

  32. arXiv:2312.06995  [pdf, other]

    cs.CV eess.IV

    Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning

    Authors: Jinsong Shi, Pan Gao, Jie Qin

    Abstract: Image Quality Assessment (IQA) has long been a research hotspot in the field of image processing, especially No-Reference Image Quality Assessment (NR-IQA). Due to the powerful feature extraction ability, existing Convolution Neural Network (CNN) and Transformers based NR-IQA methods have achieved considerable progress. However, they still exhibit limited capability when facing unknown authentic d…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI24

  33. arXiv:2312.06462  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

    Authors: Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, Shiming Xiang

    Abstract: Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral…

    Submitted 7 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Highlight. 13 pages, 10 figures

  34. arXiv:2312.04547  [pdf, other]

    cs.CV cs.AI cs.GR cs.HC

    Digital Life Project: Autonomous 3D Characters with Social Intelligence

    Authors: Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, Xiangyu Fan, Han Du, Liang Pan, Peng Gao, Zhitao Yang, Yang Gao, Jiaqi Li, Tianxiang Ren, Yukun Wei, Xiaogang Wang, Chen Change Loy, Lei Yang, Ziwei Liu

    Abstract: In this work, we present Digital Life Project, a framework utilizing language as the universal medium to build autonomous 3D characters, who are capable of engaging in social interactions and expressing with articulated body motions, thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models perso…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Homepage: https://digital-life-project.com/

  35. arXiv:2312.03700  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    OneLLM: One Framework to Align All Modalities with Language

    Authors: Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue

    Abstract: Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: Code: https://github.com/csuhan/OneLLM

  36. arXiv:2311.17963  [pdf, other]

    cs.CV

    M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation

    Authors: Xiaowei Chi, Rongyu Zhang, Zhengkai Jiang, Yijiang Liu, Yatian Wang, Xingqun Qi, Wenhan Luo, Peng Gao, Shanghang Zhang, Qifeng Liu, Yike Guo

    Abstract: While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generations, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose M$^{2}$Chat, a novel unified multimodal LLM framework for generating interleaved text-image conversation across various…

    Submitted 13 April, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

  37. arXiv:2311.07575  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

    Authors: Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao

    Abstract: We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains…

    Submitted 13 November, 2023; originally announced November 2023.

    Comments: Work in progress. Code and demos are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

  38. arXiv:2311.05533  [pdf, ps, other]

    math.CO cs.DM

    Building Hamiltonian Cycles in the Semi-Random Graph Process in Less Than $2n$ Rounds

    Authors: Alan Frieze, Pu Gao, Calum MacRury, Paweł Prałat, Gregory Sorkin

    Abstract: The semi-random graph process is an adaptive random graph process in which an online algorithm is initially presented an empty graph on $n$ vertices. In each round, a vertex $u$ is presented to the algorithm independently and uniformly at random. The algorithm then adaptively selects a vertex $v$, and adds the edge $uv$ to the graph. For a given graph property, the objective of the algorithm is to…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: 28 pages. arXiv admin note: substantial text overlap with arXiv:2205.02350

  39. arXiv:2310.20491  [pdf, other]

    cs.RO

    Collaborative Decision-Making Using Spatiotemporal Graphs in Connected Autonomy

    Authors: Peng Gao, Yu Shen, Ming C. Lin

    Abstract: Collaborative decision-making is an essential capability for multi-robot systems, such as connected vehicles, to collaboratively control autonomous vehicles in accident-prone scenarios. Under limited communication bandwidth, capturing comprehensive situational awareness by integrating connected agents' observation is very challenging. In this paper, we propose a novel collaborative decision-making…

    Submitted 31 October, 2023; originally announced October 2023.

  40. arXiv:2310.08358  [pdf, other]

    cs.LG

    Towards Demystifying the Generalization Behaviors When Neural Collapse Emerges

    Authors: Peifeng Gao, Qianqian Xu, Yibo Yang, Peisong Wen, Huiyang Shao, Zhiyong Yang, Bernard Ghanem, Qingming Huang

    Abstract: Neural Collapse (NC) is a well-known phenomenon of deep neural networks in the terminal phase of training (TPT). It is characterized by the collapse of features and classifier into a symmetrical structure, known as simplex equiangular tight frame (ETF). While there have been extensive studies on optimization characteristics showing the global optimality of neural collapse, little research has been…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 20 pages, 6 figures. arXiv admin note: substantial text overlap with arXiv:2304.08914

  41. arXiv:2310.06311  [pdf, other]

    cs.CV cs.MM

    Improving Compositional Text-to-image Generation with Large Vision-Language Models

    Authors: Song Wen, Guian Fang, Renrui Zhang, Peng Gao, Hao Dong, Dimitris Metaxas

    Abstract: Recent advancements in text-to-image models, particularly diffusion models, have shown significant promise. However, compositional text-to-image models frequently encounter difficulties in generating high-quality images that accurately align with input texts describing multiple objects, variable attributes, and intricate spatial relationships. To address this limitation, we employ large vision-lan…

    Submitted 10 October, 2023; originally announced October 2023.

  42. arXiv:2310.04180  [pdf, other]

    cs.CV

    Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

    Authors: Qingguo Liu, Pan Gao, Kang Han, Ningzhong Liu, Wei Xiang

    Abstract: Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their abilities to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-atte…

    Submitted 6 October, 2023; originally announced October 2023.

    Comments: 12 pages

  43. arXiv:2310.01443  [pdf, other]

    cs.LG cs.AI cs.ET quant-ph

    Quantum-Based Feature Selection for Multi-classification Problem in Complex Systems with Edge Computing

    Authors: Wenjie Liu, Junxiu Chen, Yuxiang Wang, Peipei Gao, Zhibin Lei, Xu Ma

    Abstract: The complex systems with edge computing require a huge amount of multi-feature data to extract appropriate insights for their decision making, so it is important to find a feasible feature selection method to improve the computational efficiency and save the resource consumption. In this paper, a quantum-based feature selection algorithm for the multi-classification problem, namely, QReliefF, is p…

    Submitted 30 September, 2023; originally announced October 2023.

    Comments: 22 pages, 11 figures

    Journal ref: Complexity, 2020. 2020: p. 8216874

  44. arXiv:2309.16583  [pdf, other]

    cs.CL

    GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

    Authors: Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang

    Abstract: With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we in…

    Submitted 1 April, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: Accepted by NAACL 2024

  45. arXiv:2309.14366  [pdf, other]

    quant-ph cs.AI cs.ET cs.LG

    A Unitary Weights Based One-Iteration Quantum Perceptron Algorithm for Non-Ideal Training Sets

    Authors: Wenjie Liu, Peipei Gao, Yuxiang Wang, Wenbin Yu, Maojun Zhang

    Abstract: In order to solve the problem of non-ideal training sets (i.e., the less-complete or over-complete sets) and implement one-iteration learning, a novel efficient quantum perceptron algorithm based on unitary weights is proposed, where the singular value decomposition of the total weight matrix from the training set is calculated to make the weight matrix to be unitary. The example validation of qua…

    Submitted 23 September, 2023; originally announced September 2023.

    Comments: 12 pages, 5 figures

    Journal ref: IEEE Access, 2019. 7: p. 36854-36865

  46. arXiv:2309.13193  [pdf, other]

    cs.HC

    SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model

    Authors: Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, Jiangtao Gong

    Abstract: Simulation plays a critical role in the research and development of autonomous driving and intelligent transportation systems. However, the current simulation platforms exhibit limitations in the realism and diversity of agent behaviors, which impede the transfer of simulation outcomes to the real world. In this paper, we propose a generative driver agent simulation framework based on large langua…

    Submitted 22 September, 2023; originally announced September 2023.

    Comments: 12 pages, 8 figures

    MSC Class: H.5.2

  47. An Efficient and Secure Arbitrary N-Party Quantum Key Agreement Protocol Using Bell States

    Authors: Wen-Jie Liu, Yong Xu, Ching-Nung Yang, Pei-Pei Gao, Wen-Bin Yu

    Abstract: Two quantum key agreement protocols using Bell states and Bell measurement were recently proposed by Shukla et al. (Quantum Inf. Process. 13(11), 2391-2405, 2014). However, Zhu et al. pointed out that there are some security flaws and proposed an improved version (Quantum Inf. Process. 14(11), 4245-4254, 2015). In this study, we will show Zhu et al.'s improvement still exists some security problems…

    Submitted 22 September, 2023; originally announced September 2023.

    Comments: 13 pages, 5 figures

    Journal ref: International Journal of Theoretical Physics, 2018. 57(1): p. 195-207

  48. arXiv:2309.10309  [pdf, other]

    cs.RO

    Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill

    Authors: Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, Hao Dong

    Abstract: Zero-shot object navigation is a challenging task for home-assistance robots. This task emphasizes visual grounding, commonsense inference and locomotion abilities, where the first two are inherent in foundation models. But for the locomotion part, most works still depend on map-based planning approaches. The gap between RGB space and map space makes it difficult to directly transfer the knowledge…

    Submitted 20 September, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: 8 pages, 5 figures

  49. arXiv:2309.08365  [pdf, other]

    cs.CV cs.AI

    M$^3$Net: Multilevel, Mixed and Multistage Attention Network for Salient Object Detection

    Authors: Yao Yuan, Pan Gao, XiaoYang Tan

    Abstract: Most existing salient object detection methods mostly use U-Net or feature pyramid structure, which simply aggregates feature maps of different scales, ignoring the uniqueness and interdependence of them and their respective contributions to the final prediction. To overcome these, we propose the M$^3$Net, i.e., the Multilevel, Mixed and Multistage attention network for Salient Object Detection (S…

    Submitted 15 September, 2023; originally announced September 2023.

  50. arXiv:2309.08214  [pdf, other]

    cs.RO

    MTG: Mapless Trajectory Generator with Traversability Coverage for Outdoor Navigation

    Authors: Jing Liang, Peng Gao, Xuesu Xiao, Adarsh Jagan Sathyamoorthy, Mohamed Elnoor, Ming C. Lin, Dinesh Manocha

    Abstract: We present a novel learning-based trajectory generation algorithm for outdoor robot navigation. Our goal is to compute collision-free paths that also satisfy the environment-specific traversability constraints. Our approach is designed for global planning using limited onboard robot perception in mapless environments while ensuring comprehensive coverage of all traversable directions. Our formulat…

    Submitted 4 March, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: 9