
Showing 1–50 of 218 results for author: Gao, D

Searching in archive cs.
  1. arXiv:2507.04055  [pdf, ps, other]

    cs.CR cs.AI cs.SE

    Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG

    Authors: Yufan Chen, Daoyuan Wu, Juantao Zhong, Zicheng Zhang, Debin Gao, Shuai Wang, Yingjiu Li, Ning Liu

    Abstract: Malware Family Classification (MFC) aims to identify the fine-grained family (e.g., GuLoader or BitRAT) to which a potential malware sample belongs, in contrast to malware detection or sample classification that predicts only a Yes/No. Accurate family identification can greatly facilitate automated sample labeling and understanding on crowdsourced malware analysis platforms such as VirusTotal and…

    Submitted 5 July, 2025; originally announced July 2025.

  2. arXiv:2507.02233  [pdf]

    cs.DC

    Domain-Adversarial Transfer Learning for Fault Root Cause Identification in Cloud Computing Systems

    Authors: Bruce Fang, Danyi Gao

    Abstract: This paper addresses the challenge of fault root cause identification in cloud computing environments. The difficulty arises from complex system structures, dense service coupling, and limited fault information. To solve this problem, an intelligent identification algorithm based on transfer learning is proposed. The method introduces a shared feature extraction module and a domain adversarial mec…

    Submitted 2 July, 2025; originally announced July 2025.

  3. arXiv:2507.00550  [pdf]

    cs.DC

    Collaborative Multi-Agent Reinforcement Learning Approach for Elastic Cloud Resource Scaling

    Authors: Bruce Fang, Danyi Gao

    Abstract: This paper addresses the challenges of rapid resource variation and highly uncertain task loads in cloud computing environments. It proposes an optimization method for elastic cloud resource scaling based on a multi-agent system. The method deploys multiple autonomous agents to perceive resource states in parallel and make local decisions. While maintaining the distributed nature of the system, it…

    Submitted 1 July, 2025; originally announced July 2025.

  4. arXiv:2506.12258  [pdf, ps, other]

    cs.CV cs.CY

    EgoPrivacy: What Your First-Person Camera Says About You?

    Authors: Yijiang Li, Genpei Zhang, Jiacheng Cheng, Yi Li, Xiaojun Shan, Dashan Gao, Jiancheng Lyu, Yuan Li, Ning Bi, Nuno Vasconcelos

    Abstract: While the rapid proliferation of wearable cameras has raised significant concerns about egocentric video privacy, prior work has largely overlooked the unique privacy threats posed to the camera wearer. This work investigates the core question: How much privacy information about the camera wearer can be inferred from their first-person view videos? We introduce EgoPrivacy, the first large-scale be…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: ICML 2025

  5. arXiv:2506.08149  [pdf, other]

    cs.RO cs.AI

    Ego-centric Learning of Communicative World Models for Autonomous Driving

    Authors: Hang Wang, Dechen Gao, Junshan Zhang

    Abstract: We study multi-agent reinforcement learning (MARL) for tasks in complex high-dimensional environments, such as autonomous driving. MARL is known to suffer from the partial observability and non-stationarity issues. To tackle these challenges, information sharing is often employed, which however faces major hurdles in practice, including overwhelming communication overhead and sca…

    Submitted 9 June, 2025; originally announced June 2025.

  6. arXiv:2506.01678  [pdf, ps, other]

    cond-mat.mtrl-sci cs.AI

    Overcoming Data Scarcity in Scanning Tunnelling Microscopy Image Segmentation

    Authors: Nikola L. Kolev, Max Trouton, Filippo Federici Canova, Geoff Thornton, David Z. Gao, Neil J. Curson, Taylor J. Z. Stock

    Abstract: Scanning tunnelling microscopy (STM) is a powerful technique for imaging surfaces with atomic resolution, providing insight into physical and chemical processes at the level of single atoms and molecules. A regular task of STM image analysis is the identification and labelling of features of interest against a uniform background. Performing this manually is a labour-intensive task, requiring signi…

    Submitted 2 June, 2025; originally announced June 2025.

  7. arXiv:2505.24466  [pdf, ps, other]

    cs.CV

    SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking

    Authors: Yingjia Xu, Jinlin Wu, Zhen Chen, Daming Gao, Yang Yang, Zhen Lei, Min Cao

    Abstract: Text-based person retrieval aims to identify a target individual from a gallery of images based on a natural language description. It presents a significant challenge due to the complexity of real-world scenes and the ambiguity of appearance-related descriptions. Existing methods primarily emphasize appearance-based cross-modal retrieval, often neglecting the contextual information embedded within…

    Submitted 26 June, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

    Comments: 22 pages, 7 figures. Under review

  8. arXiv:2505.19100  [pdf, other]

    cs.CL cs.CV

    ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning

    Authors: Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai

    Abstract: Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segmen…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: Accepted by ACL 2025 findings

  9. arXiv:2505.17826  [pdf, ps, other]

    cs.LG cs.CL cs.DC

    Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models

    Authors: Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou

    Abstract: Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high eff…

    Submitted 14 July, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: This technical report will be continuously updated as the codebase evolves. GitHub: https://github.com/modelscope/Trinity-RFT

  10. arXiv:2505.10442  [pdf, ps, other]

    cs.RO cs.AI

    IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning

    Authors: Dechen Gao, Hang Wang, Hanchu Zhou, Nejib Ammar, Shatadal Mishra, Ahmadreza Moradipari, Iman Soltani, Junshan Zhang

    Abstract: Imitation learning (IL) and reinforcement learning (RL) each offer distinct advantages for robotics policy learning: IL provides stable learning from demonstrations, and RL promotes generalization through exploration. While existing robot learning approaches using IL-based pre-training followed by RL-based fine-tuning are promising, this two-step learning paradigm often suffers from instability an…

    Submitted 15 May, 2025; originally announced May 2025.

  11. arXiv:2505.05752  [pdf, other]

    cs.CV cs.CY cs.LG cs.RO eess.IV

    Automating Infrastructure Surveying: A Framework for Geometric Measurements and Compliance Assessment Using Point Cloud Data

    Authors: Amin Ghafourian, Andrew Lee, Dechen Gao, Tyler Beer, Kin Yen, Iman Soltani

    Abstract: Automation can play a prominent role in improving efficiency, accuracy, and scalability in infrastructure surveying and assessing construction and compliance standards. This paper presents a framework for automation of geometric measurements and compliance assessment using point cloud data. The proposed approach integrates deep learning-based detection and segmentation, in conjunction with geometr…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: 19 pages, 15 figures, 4 tables

  12. arXiv:2504.01168  [pdf, other]

    cs.DS

    LimTDD: A Compact Decision Diagram Integrating Tensor and Local Invertible Map Representations

    Authors: Xin Hong, Aochu Dai, Dingchao Gao, Sanjiang Li, Zhengfeng Ji, Mingsheng Ying

    Abstract: Tensor Decision Diagrams (TDDs) provide an efficient structure for representing tensors by combining techniques from both tensor networks and decision diagrams, demonstrating competitive performance in quantum circuit simulation and verification. However, existing decision diagrams, including TDDs, fail to exploit isomorphisms within tensors, limiting their compression efficiency. This paper intro…

    Submitted 1 April, 2025; originally announced April 2025.

  13. arXiv:2503.22759  [pdf, other]

    cs.CR cs.AI

    Data Poisoning in Deep Learning: A Survey

    Authors: Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, Ou Wu

    Abstract: Deep learning has become a cornerstone of modern artificial intelligence, enabling transformative applications across a wide range of domains. As the core element of deep learning, the quality and security of training data critically influence model performance and reliability. However, during the training process, deep learning models face the significant threat of data poisoning, where attackers…

    Submitted 27 March, 2025; originally announced March 2025.

  14. arXiv:2503.18556  [pdf, other]

    cs.CV cs.CL

    Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models

    Authors: Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang

    Abstract: Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an…

    Submitted 24 March, 2025; originally announced March 2025.

    Comments: Accepted by ICME2025

  15. arXiv:2503.13563  [pdf, other]

    cs.CL cs.AI cs.IR

    MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG

    Authors: Pingyu Wu, Daiheng Gao, Jing Tang, Huimin Chen, Wenbo Zhou, Weiming Zhang, Nenghai Yu

    Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by using external knowledge, but it struggles with precise entity information retrieval. In this paper, we propose the MES-RAG framework, which enhances entity-specific query handling and provides accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that ensure system integrity by applying…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: NAACL 2025

  16. arXiv:2503.11290  [pdf, ps, other]

    cs.CV eess.IV

    EmoAgent: A Multi-Agent Framework for Diverse Affective Image Manipulation

    Authors: Qi Mao, Haobo Hu, Yujie He, Difei Gao, Haokun Chen, Libiao Jin

    Abstract: Affective Image Manipulation (AIM) aims to alter visual elements within an image to evoke specific emotional responses from viewers. However, existing AIM approaches rely on rigid one-to-one mappings between emotions and visual cues, making them ill-suited for the inherently subjective and diverse ways in which humans perceive and express emotion. To address this, we introduce a novel task s…

    Submitted 23 June, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

  17. arXiv:2503.06353  [pdf]

    cs.CY cs.AI

    The AI Pentad, the CHARME$^{2}$D Model, and an Assessment of Current-State AI Regulation

    Authors: Di Kevin Gao, Sudip Mittal, Jiming Wu, Hongwei Du, Jingdao Chen, Shahram Rahimi

    Abstract: Artificial Intelligence (AI) has made remarkable progress in the past few years with AI-enabled applications beginning to permeate every aspect of our society. Despite the widespread consensus on the need to regulate AI, there remains a lack of a unified approach to framing, developing, and assessing AI regulations. Many of the existing methods take a value-based approach, for example, accountabil…

    Submitted 8 March, 2025; originally announced March 2025.

  18. arXiv:2503.04146  [pdf, other]

    cs.DS cs.ET quant-ph

    Image Computation for Quantum Transition Systems

    Authors: Xin Hong, Dingchao Gao, Sanjiang Li, Shenggang Ying, Mingsheng Ying

    Abstract: With the rapid progress in quantum hardware and software, the need for verification of quantum systems becomes increasingly crucial. While model checking is a dominant and very successful technique for verifying classical systems, its application to quantum systems is still an underdeveloped research area. This paper advances the development of model checking quantum systems by providing efficient…

    Submitted 6 March, 2025; originally announced March 2025.

  19. arXiv:2502.18480  [pdf, other]

    cs.IR cs.AI cs.CL

    QExplorer: Large Language Model Based Query Extraction for Toxic Content Exploration

    Authors: Shaola Ren, Li Ke, Longtao Huang, Dehong Gao, Hui Xue

    Abstract: Automatically extracting effective queries is challenging in information retrieval, especially in toxic content exploration, as such content is likely to be disguised. With the recent achievements in generative Large Language Models (LLMs), we are able to leverage the capabilities of LLMs to extract effective queries for similar content exploration directly. This study proposes QExplorer, an approac…

    Submitted 6 February, 2025; originally announced February 2025.

  20. arXiv:2502.15281  [pdf, ps, other]

    cs.CR cs.SE

    DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications

    Authors: Chengyan Ma, Ruidong Han, Jieke Shi, Ye Liu, Yuqing Niu, Di Lu, Chuang Tian, Jianfeng Ma, Debin Gao, David Lo

    Abstract: Trusted Execution Environment (TEE) enhances the security of mobile applications and cloud services by isolating sensitive code in the secure world from the non-secure normal world. However, TEE applications are still confronted with vulnerabilities stemming from bad partitioning. Bad partitioning can lead to critical security problems of TEE, such as leaking sensitive data to the normal world or…

    Submitted 9 July, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

  21. SEM-CLIP: Precise Few-Shot Learning for Nanoscale Defect Detection in Scanning Electron Microscope Image

    Authors: Qian Jin, Yuqi Jiang, Xudong Lu, Yumeng Liu, Yining Chen, Dawei Gao, Qi Sun, Cheng Zhuo

    Abstract: In the field of integrated circuit manufacturing, the detection and classification of nanoscale wafer defects are critical for subsequent root cause analysis and yield enhancement. The complex background patterns observed in scanning electron microscope (SEM) images and the diverse textures of the defects pose significant challenges. Traditional methods usually suffer from insufficient data, label…

    Submitted 15 February, 2025; originally announced February 2025.

    Comments: Published in ACM/IEEE International Conference on Computer-Aided Design (ICCAD), 2024

  22. arXiv:2502.14215  [pdf, other]

    cs.SE cs.AI

    Towards Secure Program Partitioning for Smart Contracts with LLM's In-Context Learning

    Authors: Ye Liu, Yuqing Niu, Chengyan Ma, Ruidong Han, Wei Ma, Yi Li, Debin Gao, David Lo

    Abstract: Smart contracts are highly susceptible to manipulation attacks due to the leakage of sensitive information. Addressing manipulation vulnerabilities is particularly challenging because they stem from inherent data confidentiality issues rather than straightforward implementation bugs. To tackle this by preventing sensitive information leakage, we present PartitionGPT, the first LLM-driven approach…

    Submitted 19 February, 2025; originally announced February 2025.

  23. arXiv:2502.13379  [pdf, other]

    cs.CR cs.SE

    AutoTEE: Automated Migration and Protection of Programs in Trusted Execution Environments

    Authors: Ruidong Han, Zhou Yang, Chengyan Ma, Ye Liu, Yuqing Niu, Siqi Ma, Debin Gao, David Lo

    Abstract: Trusted Execution Environments (TEEs) isolate a special space within a device's memory that is not accessible to the normal world (also known as Untrusted Environment), even when the device is compromised. Thus, developers can utilize TEEs to provide strong security guarantees for their programs, making sensitive operations like encrypted data storage, fingerprint verification, and remote attestat…

    Submitted 18 February, 2025; originally announced February 2025.

    Comments: 14 pages

  24. arXiv:2502.11427  [pdf, other]

    cs.CL cs.CV

    Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

    Authors: Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen

    Abstract: Visual instruction tuning has become the predominant technology in eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite the success, as visual instructions require images as the input, it would leave the gap in inheriting the task-solving capabilities from the backbone LLMs, and make it costly to collect a large-scale dataset. To address it, we propos…

    Submitted 16 February, 2025; originally announced February 2025.

    Comments: under review

  25. arXiv:2502.09596  [pdf, other]

    cs.AI cs.MA

    KIMAs: A Configurable Knowledge Integrated Multi-Agent System

    Authors: Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding

    Abstract: Knowledge-intensive conversations supported by large language models (LLMs) have become one of the most popular and helpful applications that can assist people in different aspects. Many current knowledge-intensive applications are centered on retrieval-augmented generation (RAG) techniques. While many open-source RAG frameworks facilitate the development of RAG-based applications, they often fall…

    Submitted 13 February, 2025; originally announced February 2025.

  26. arXiv:2502.08047  [pdf, ps, other]

    cs.AI cs.MA

    WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

    Authors: Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

    Abstract: GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to the sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in…

    Submitted 9 June, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: Technical Report

  27. arXiv:2501.04963  [pdf, other]

    cs.CR

    Shelving it rather than Ditching it: Dynamically Debloating DEX and Native Methods of Android Applications without APK Modification

    Authors: Zicheng Zhang, Jiakun Liu, Ferdian Thung, Haoyu Ma, Rui Li, Yan Naing Tun, Wei Minn, Lwin Khin Shar, Shahar Maoz, Eran Toch, David Lo, Joshua Wong, Debin Gao

    Abstract: Today's Android developers tend to include numerous features to accommodate diverse user requirements, which inevitably leads to bloated apps. Yet more often than not, only a fraction of these features are frequently utilized by users, thus a bloated app costs dearly in potential vulnerabilities, expanded attack surfaces, and additional resource consumption. Especially in the event of severe secur…

    Submitted 8 January, 2025; originally announced January 2025.

  28. arXiv:2412.20413  [pdf, other]

    cs.CV

    EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

    Authors: Daiheng Gao, Shilin Lu, Shaw Walters, Wenbo Zhou, Jiaming Chu, Jie Zhang, Bang Zhang, Mengxi Jia, Jian Zhao, Zhaoxin Fan, Weiming Zhang

    Abstract: Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-…

    Submitted 2 January, 2025; v1 submitted 29 December, 2024; originally announced December 2024.

    Comments: 24 pages, 18 figures

  29. arXiv:2412.20062  [pdf, other]

    cs.CV

    MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion

    Authors: Zechao Zhan, Dehong Gao, Jinxia Zhang, Jiale Huang, Yang Hu, Xin Wang

    Abstract: Text-guided image editing models have achieved great success in the general domain. However, directly applying these models to the fashion domain may encounter two issues: (1) inaccurate localization of the editing region; (2) weak editing magnitude. To address these issues, the MADiff model is proposed. Specifically, to more accurately identify the editing region, the MaskNet is proposed, in which the foregrou…

    Submitted 15 January, 2025; v1 submitted 28 December, 2024; originally announced December 2024.

  30. arXiv:2412.19997  [pdf, other]

    cs.CV

    FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-Language Pre-training

    Authors: Jiale Huang, Dehong Gao, Jinxia Zhang, Zechao Zhan, Yang Hu, Xin Wang

    Abstract: Large-scale Vision-Language Pre-training (VLP) has demonstrated remarkable success in the general domain. However, in the fashion domain, items are distinguished by fine-grained attributes like texture and material, which are crucial for tasks such as retrieval. Existing models often fail to leverage these fine-grained attributes from both text and image modalities. To address the above issues, we…

    Submitted 12 January, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

    Comments: 5 pages, Accepted by ICASSP2025, full paper

  31. arXiv:2412.16869  [pdf, other]

    cs.CV

    CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models

    Authors: Yeyuan Wang, Dehong Gao, Bin Li, Rujiao Long, Lei Yi, Xiaoyan Cai, Libin Yang, Jinxia Zhang, Shanqing Yu, Qi Xuan

    Abstract: The impressive performance of Large Language Models (LLMs) has prompted researchers to develop Multi-modal LLMs (MLLMs), which have shown great potential for various multi-modal tasks. However, current MLLMs often struggle to effectively address fine-grained multi-modal challenges. We argue that this limitation is closely linked to the models' visual grounding capabilities. The restricted spatial aware…

    Submitted 22 December, 2024; originally announced December 2024.

    Comments: 5 pages, Accepted by ICASSP2025, full paper

  32. arXiv:2412.11596  [pdf, ps, other]

    cs.CV cs.GR

    MeshArt: Generating Articulated Meshes with Structure-Guided Transformers

    Authors: Daoyi Gao, Yawar Siddiqui, Lei Li, Angela Dai

    Abstract: Articulated 3D object generation is fundamental for creating realistic, functional, and interactable virtual assets which are not simply static. We introduce MeshArt, a hierarchical transformer-based approach to generate articulated 3D meshes with clean, compact geometry, reminiscent of human-crafted 3D models. We approach articulated mesh generation in a part-by-part fashion across two stages. Fi…

    Submitted 8 June, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

    Comments: Project Page: https://daoyig.github.io/Mesh_Art/, Video: https://www.youtube.com/watch?v=0XaHFbmb_FQ

  33. arXiv:2412.10029  [pdf, other]

    cs.CV

    Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples

    Authors: Yeyuan Wang, Dehong Gao, Lei Yi, Linbo Jin, Jinxia Zhang, Libin Yang, Xiaoyan Cai

    Abstract: Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intri…

    Submitted 13 December, 2024; originally announced December 2024.

    Comments: 15 pages, Accepted by AAAI2025, full paper

  34. arXiv:2412.09782  [pdf, other]

    cs.RO cs.CV cs.MA

    EI-Drive: A Platform for Cooperative Perception with Realistic Communication Models

    Authors: Hanchu Zhou, Edward Xie, Wei Shao, Dechen Gao, Michelle Dong, Junshan Zhang

    Abstract: The growing interest in autonomous driving calls for realistic simulation platforms capable of accurately simulating the cooperative perception process in realistic traffic scenarios. Existing studies for cooperative perception often have not accounted for transmission latency and errors in real-world environments. To address this gap, we introduce EI-Drive, an edge-AI based autonomous driving simulat…

    Submitted 12 December, 2024; originally announced December 2024.

  35. arXiv:2412.07405  [pdf, other]

    cs.LG cs.AI

    MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning

    Authors: Yufei Ma, Zihan Liang, Huangyu Dai, Ben Chen, Dehong Gao, Zhuoran Ran, Wang Zihan, Linbo Jin, Wen Jiang, Guannan Zhang, Xiaoyan Cai, Libin Yang

    Abstract: The growing demand for larger-scale models in the development of Large Language Models (LLMs) poses challenges for efficient training within limited computational resources. Traditional fine-tuning methods often exhibit instability in multi-task learning and rely heavily on extensive training resources. Here, we propose MoDULA (Mixture of Domai…

    Submitted 10 December, 2024; originally announced December 2024.

  36. arXiv:2411.17465  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC

    ShowUI: One Vision-Language-Action Model for GUI Visual Agent

    Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

    Abstract: Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or accessibility tree), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-langu…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: Technical Report. Github: https://github.com/showlab/ShowUI

  37. arXiv:2411.10323  [pdf, other]

    cs.AI cs.CL cs.CV

    The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

    Authors: Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou

    Abstract: The recently released model, Claude 3.5 Computer Use, stands out as the first frontier AI model to offer computer use in public beta as a graphical user interface (GUI) agent. As an early beta, its capability in the real-world complex environment remains unknown. In this case study to explore Claude 3.5 Computer Use, we curate and organize a collection of carefully designed tasks spanning a variet…

    Submitted 15 November, 2024; originally announced November 2024.

    Comments: 40 pages, 21 figures, preprint

  38. arXiv:2410.14659  [pdf, other]

    cs.LG stat.ML

    Harnessing Causality in Reinforcement Learning With Bagged Decision Times

    Authors: Daiqi Gao, Hsin-Yu Lai, Predrag Klasnja, Susan A. Murphy

    Abstract: We consider reinforcement learning (RL) for a class of problems with bagged decision times. A bag contains a finite sequence of consecutive decision times. The transition dynamics are non-Markovian and non-stationary within a bag. All actions within a bag jointly impact a single reward, observed at the end of the bag. For example, in mobile health, multiple activity suggestions in a day collective…

    Submitted 6 May, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

  39. arXiv:2410.13973  [pdf, ps, other]

    cs.RO cs.AI

    MarineFormer: A Spatio-Temporal Attention Model for USV Navigation in Dynamic Marine Environments

    Authors: Ehsan Kazemi, Dechen Gao, Iman Soltani

    Abstract: Autonomous navigation in marine environments can be extremely challenging, especially in the presence of spatially varying flow disturbances and dynamic and static obstacles. In this work, we demonstrate that incorporating local flow field measurements fundamentally alters the nature of the problem, transforming otherwise unsolvable navigation scenarios into tractable ones. However, the mere avail…

    Submitted 9 July, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

  40. arXiv:2410.04360  [pdf, ps, other]

    cs.MA cs.AI

    GenSim: A General Social Simulation Platform with Large Language Model based Agents

    Authors: Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, Bolin Ding, Jingren Zhou, Jun Wang, Ji-Rong Wen

    Abstract: With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during…

    Submitted 3 July, 2025; v1 submitted 6 October, 2024; originally announced October 2024.

    Comments: NAACL 2025 Demo Track

  41. arXiv:2409.17435  [pdf, other]

    cs.RO

    Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation

    Authors: Ian Chuang, Andrew Lee, Dechen Gao, M-Mahdi Naddaf-Sh, Iman Soltani

    Abstract: Imitation learning has demonstrated significant potential in performing high-precision manipulation tasks using visual feedback. However, it is common practice in imitation learning for cameras to be fixed in place, resulting in issues like occlusion and limited field of view. Furthermore, cameras are often placed in broad, general locations, without an effective viewpoint specific to the robot's…

    Submitted 7 March, 2025; v1 submitted 25 September, 2024; originally announced September 2024.

    Comments: 6 pages, 4 figures

  42. GEVO: Memory-Efficient Monocular Visual Odometry Using Gaussians

    Authors: Dasong Gao, Peter Zhi Xuan Li, Vivienne Sze, Sertac Karaman

    Abstract: Constructing a high-fidelity representation of the 3D scene using a monocular camera can enable a wide range of applications on mobile devices, such as micro-robots, smartphones, and AR/VR headsets. On these devices, memory is often limited in capacity and its access often dominates the consumption of compute energy. Although Gaussian Splatting (GS) allows for high-fidelity reconstruction of 3D sc…

    Submitted 29 January, 2025; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: 8 pages

  43. arXiv:2409.03185  [pdf, ps, other]

    quant-ph cs.ET

    DasAtom: A Divide-and-Shuttle Atom Approach to Quantum Circuit Transformation

    Authors: Yunqi Huang, Dingchao Gao, Shenggang Ying, Sanjiang Li

    Abstract: Neutral atom (NA) quantum systems are emerging as a leading platform for quantum computation, offering superior or competitive qubit count and gate fidelity compared to superconducting circuits and ion traps. However, the unique features of NA devices, such as long-range interactions, long qubit coherence time, and the ability to physically move qubits, present distinct challenges for quantum circ…

    Submitted 20 January, 2025; v1 submitted 4 September, 2024; originally announced September 2024.

    Comments: Accepted by IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

  44. arXiv:2408.16251  [pdf, other]

    cs.IT eess.SP

    Neural Network-Assisted Hybrid Model Based Message Passing for Parametric Holographic MIMO Near Field Channel Estimation

    Authors: Zhengdao Yuan, Yabo Guo, Dawei Gao, Qinghua Guo, Zhongyong Wang, Chongwen Huang, Ming Jin, Kai-Kit Wong

    Abstract: Holographic multiple-input and multiple-output (HMIMO) is a promising technology with the potential to achieve high energy and spectral efficiencies, enhance system capacity and diversity, etc. In this work, we address the challenge of HMIMO near field (NF) channel estimation, which is complicated by the intricate model introduced by the dyadic Green's function. Despite its complexity, the channel…

    Submitted 29 August, 2024; originally announced August 2024.

  45. arXiv:2408.08913  [pdf, other]

    cs.IR

    MLoRA: Multi-Domain Low-Rank Adaptive Network for CTR Prediction

    Authors: Zhiming Yang, Haining Gao, Dehong Gao, Luwei Yang, Libin Yang, Xiaoyan Cai, Wei Ning, Guannan Zhang

    Abstract: Click-through rate (CTR) prediction is one of the fundamental tasks in the industry, especially in e-commerce, social media, and streaming media. It directly impacts website revenues, user satisfaction, and user retention. However, real-world production platforms often encompass various domains to cater for diverse customer needs. Traditional CTR prediction models struggle in multi-domain recommen…

    Submitted 14 August, 2024; originally announced August 2024.

    Comments: 11 pages. Accepted by RecSys'2024, full paper

  46. arXiv:2407.21757  [pdf, other]

    cs.CV cs.MM

    Learning Video Context as Interleaved Multimodal Sequences

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou

    Abstract: Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who, relationship, and reason). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as i…

    Submitted 12 September, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  47. arXiv:2407.17789  [pdf, other]

    cs.MA cs.AI

    Very Large-Scale Multi-Agent Simulation in AgentScope

    Authors: Xuchen Pan, Dawei Gao, Yuexiang Xie, Yushuo Chen, Zhewei Wei, Yaliang Li, Bolin Ding, Ji-Rong Wen, Jingren Zhou

    Abstract: Recent advances in large language models (LLMs) have opened new avenues for applying multi-agent systems in very large-scale simulations. However, there remain several challenges when conducting multi-agent simulations with existing platforms, such as limited scalability and low efficiency, unsatisfied agent diversity, and effort-intensive management processes. To address these challenges, we deve…

    Submitted 28 October, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

    Comments: Code released at https://github.com/modelscope/agentscope/tree/main/examples/paper_large_scale_simulation

  48. arXiv:2407.16224  [pdf, other]

    cs.CV

    OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

    Authors: Ke Sun, Jian Cao, Qi Wang, Linrui Tian, Xindi Zhang, Lian Zhuo, Bang Zhang, Liefeng Bo, Wenbo Zhou, Weiming Zhang, Daiheng Gao

    Abstract: Virtual Try-On (VTON) has become a transformative technology, empowering users to experiment with fashion without ever having to physically try on clothing. However, existing methods often struggle with generating high-fidelity and detail-consistent results. While diffusion models, such as Stable Diffusion series, have shown their capability in creating high-quality and photorealistic images, they…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: 10 pages, 13 figures

  49. arXiv:2406.13719  [pdf, other]

    cs.CV

    GUI Action Narrator: Where and When Did That Action Take Place?

    Authors: Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

    Abstract: The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. T…

    Submitted 19 June, 2024; originally announced June 2024.

  50. arXiv:2406.11816  [pdf, other]

    cs.CV

    VideoLLM-online: Online Video Large Language Model for Streaming Video

    Authors: Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

    Abstract: Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-St…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: CVPR 2024. This arXiv version is upgraded with Llama-3