
Showing 1–50 of 225 results for author: Jiang, K

Searching in archive cs.
  1. arXiv:2507.08307  [pdf, ps, other]

    cs.CV

    M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation

    Authors: Kui Jiang, Shiyu Liu, Junjun Jiang, Xin Yang, Hongxun Yang, Xiaopeng Fan

    Abstract: Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head genera…

    Submitted 11 July, 2025; originally announced July 2025.

  2. arXiv:2507.01949  [pdf, ps, other]

    cs.CV

    Kwai Keye-VL Technical Report

    Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, et al. (35 additional authors not shown)

    Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video unde…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Technical Report: https://github.com/Kwai-Keye/Keye

  3. arXiv:2506.24044  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    A Survey on Vision-Language-Action Models for Autonomous Driving

    Authors: Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun

    Abstract: The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructio…

    Submitted 30 June, 2025; originally announced June 2025.

  4. arXiv:2506.23590  [pdf, ps, other]

    cs.CV

    CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

    Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin

    Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs' a…

    Submitted 30 June, 2025; originally announced June 2025.

  5. arXiv:2506.14493  [pdf, ps, other]

    cs.CL cs.CR

    LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops

    Authors: Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo, Dingkang Yang, Zhaoyu Chen, Wenqiang Zhang

    Abstract: Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect th…

    Submitted 17 June, 2025; originally announced June 2025.

  6. arXiv:2506.14020  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Bures-Wasserstein Flow Matching for Graph Generation

    Authors: Keyue Jiang, Jiahao Cui, Xiaowen Dong, Laura Toni

    Abstract: Graph generation has emerged as a critical task in fields ranging from molecule design to drug discovery. Contemporary approaches, notably diffusion and flow-based models, have achieved solid graph generative performance through constructing a probability path that interpolates between a reference distribution and the data distribution. However, these methods typically model the evolution of indiv…

    Submitted 23 June, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

  7. arXiv:2506.13260  [pdf, ps, other]

    cs.CV

    COME: Adding Scene-Centric Forecasting Control to Occupancy World Model

    Authors: Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, Diange Yang

    Abstract: World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolvement (agent interactions), leading to suboptimal predictions. Instead, we propose to separate environmental changes from ego-motion by leveraging the scene-centric coordinate systems. In th…

    Submitted 16 June, 2025; originally announced June 2025.

  8. arXiv:2506.11073  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention

    Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Libo Qin, Yichong Huang, Baohang Li, Kui Jiang, Yang Xiang, Zhirui Zhang, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address these rely on pretraining or fine-tuning, which are resource-intensiv…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: ACL2025 Main

  9. arXiv:2506.05815  [pdf, ps, other]

    cs.CV

    NTIRE 2025 Challenge on HR Depth from Images of Specular and Transparent Surfaces

    Authors: Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, Alex Costanzino, Matteo Poggi, Samuele Salti, Stefano Mattoccia, Zhe Zhang, Yang Yang, Wu Chen, Anlong Ming, Mingshuai Zhao, Mengying Yu, Shida Gao, Xiangfeng Wang, Feng Xue, Jun Shi, Yong Yang, Yong A, Yixiang Jin, Dingzhe Li, Aryan Shukla, Liam Frija-Altarac, Matthew Toews, et al. (14 additional authors not shown)

    Abstract: This paper reports on the NTIRE 2025 challenge on HR Depth From images of Specular and Transparent surfaces, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2025. This challenge aims to advance the research on depth estimation, specifically to address two of the main open issues in the field: high-resolution and non-Lambertian surfaces. The cha…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: NTIRE Workshop Challenge Report, CVPR 2025

  10. arXiv:2506.01511  [pdf, other]

    cs.CV

    Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment

    Authors: Kaixun Jiang, Zhaoyu Chen, Haijing Guo, Jinglun Li, Jiyuan Fu, Pinxue Guo, Hao Tang, Bo Li, Wenqiang Zhang

    Abstract: Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectivenes…

    Submitted 2 June, 2025; originally announced June 2025.

  11. arXiv:2505.24449  [pdf, other]

    cs.CL

    When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways

    Authors: Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang, Zhi Gao, Zilong Zheng, Lei Liu, Bin Li, Qing Li

    Abstract: Large language/multimodal models (LLMs/LMMs) store extensive pre-trained knowledge but struggle to maintain consistency with real-world updates, making it difficult to avoid catastrophic forgetting while acquiring evolving knowledge. Previous work focused on constructing textual knowledge datasets and exploring knowledge injection in LLMs, lacking exploration of multimodal evolving knowledge injec…

    Submitted 30 May, 2025; originally announced May 2025.

  12. arXiv:2505.24207  [pdf, other]

    cs.CV

    Boosting All-in-One Image Restoration via Self-Improved Privilege Learning

    Authors: Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu

    Abstract: Unified image restoration models for diverse and mixed degradations often suffer from unstable optimization dynamics and inter-task conflicts. This paper introduces Self-Improved Privilege Learning (SIPL), a novel paradigm that overcomes these limitations by innovatively extending the utility of privileged information (PI) beyond training into the inference stage. Unlike conventional Privilege Lea…

    Submitted 30 May, 2025; originally announced May 2025.

  13. arXiv:2505.20675  [pdf, ps, other]

    cs.CV

    Contrastive Desensitization Learning for Cross Domain Face Forgery Detection

    Authors: Lingyu Qiu, Ke Jiang, Xiaoyang Tan

    Abstract: In this paper, we propose a new cross-domain face forgery detection method that is insensitive to different and possibly unseen forgery methods while ensuring an acceptable low false positive rate. Although existing face forgery detection methods are applicable to multiple domains to some degree, they often come with a high false positive rate, which can greatly disrupt the usability of the system…

    Submitted 26 May, 2025; originally announced May 2025.

  14. arXiv:2505.20653  [pdf, other]

    cs.CV cs.AI

    RoGA: Towards Generalizable Deepfake Detection through Robust Gradient Alignment

    Authors: Lingyu Qiu, Ke Jiang, Xiaoyang Tan

    Abstract: Recent advancements in domain generalization for deepfake detection have attracted significant attention, with previous methods often incorporating additional modules to prevent overfitting to domain-specific patterns. However, such regularization can hinder the optimization of the empirical risk minimization (ERM) objective, ultimately degrading model performance. In this paper, we propose a nove…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted to ICME2025

  15. arXiv:2505.19509  [pdf, ps, other]

    cs.LG cs.AI

    Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

    Authors: Yifan Jia, Kailin Jiang, Yuyang Liang, Qihan Ren, Yi Xin, Rui Yang, Fenze Feng, Mingcai Chen, Hengyang Lu, Haozhe Wang, Xiaoye Qu, Dongrui Liu, Lizhen Cui, Yuntao Du

    Abstract: Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks where the contextual information from external sources may contradict the model's internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most f…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: The source code is available at https://github.com/MLLMKCBENCH/MLLMKC

  16. arXiv:2505.19459  [pdf, ps, other]

    cs.LG cs.AI

    Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation

    Authors: Kaichao Jiang, He Wang, Xiaoshuai Hao, Xiulong Yang, Ajian Liu, Qi Chu, Yunfeng Diao

    Abstract: Joint Energy-based Models (JEMs), a class of hybrid generative-discriminative models, are well known for their ability to achieve both high classification accuracy and generative capability within a single model. However, their robustness still lags significantly behind classifiers based on adversarial training (AT). Conversely, while AT is currently the most effective approach to improving the c…

    Submitted 25 May, 2025; originally announced May 2025.

  17. arXiv:2505.15298  [pdf, ps, other]

    cs.RO cs.CL cs.CV

    AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

    Authors: Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang

    Abstract: Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style too…

    Submitted 12 June, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

    Comments: 18 pages, 8 figures

  18. arXiv:2505.12199  [pdf, ps, other]

    cs.CV cs.AI

    Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather

    Authors: Kui Jiang, Jing Cao, Zhaocheng Yu, Junjun Jiang, Jingchun Zhou

    Abstract: Monocular depth estimation is critical for applications such as autonomous driving and scene reconstruction. While existing methods perform well under normal scenarios, their performance declines in adverse weather, due to challenging domain shifts and difficulties in extracting scene information. To address this issue, we present a robust monocular depth estimation method called ACDepth…

    Submitted 17 May, 2025; originally announced May 2025.

  19. arXiv:2505.11594  [pdf, ps, other]

    cs.LG cs.AI cs.AR cs.CV cs.PF

    SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

    Authors: Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen

    Abstract: The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attent…

    Submitted 16 May, 2025; originally announced May 2025.

  20. arXiv:2505.02835  [pdf, ps, other]

    cs.CV cs.CL

    R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

    Authors: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang

    Abstract: Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In…

    Submitted 9 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

    Comments: Home page: https://github.com/yfzhang114/r1_reward

  21. arXiv:2505.01224  [pdf, other]

    cs.CV eess.IV

    RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement

    Authors: Kui Jiang, Yan Luo, Junjun Jiang, Xin Xu, Fei Ma, Fei Yu

    Abstract: Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications, where wavelength-dependent attenuation causes severe content degradation and color distortion. While recent state space models like Mamba show potential for long-range dependency modeling, their unfolding operations and fixed scan paths on 1D sequences fail to adapt to local object semantics and glo…

    Submitted 2 May, 2025; originally announced May 2025.

  22. arXiv:2505.00503  [pdf, ps, other]

    cs.LG cs.AI cs.RO

    Variational OOD State Correction for Offline Reinforcement Learning

    Authors: Ke Jiang, Wen Jiang, Xiaoyang Tan

    Abstract: The performance of offline reinforcement learning is significantly impacted by the issue of state distributional shift, and out-of-distribution (OOD) state correction is a popular approach to address this problem. In this paper, we propose a novel method named Density-Aware Safety Perception (DASP) for OOD state correction. Specifically, our method encourages the agent to prioritize actions that l…

    Submitted 7 July, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

  23. arXiv:2504.14904  [pdf, other]

    cs.SI cs.AI cs.CL cs.MM

    VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform

    Authors: Xingyu Lu, Tianke Zhang, Chang Meng, Xiaobei Wang, Jinpeng Wang, YiFan Zhang, Shisong Tang, Changyi Liu, Haojie Ding, Kaiyu Jiang, Kaiyu Tang, Bin Wen, Hai-Tao Zheng, Fan Yang, Tingting Gao, Di Zhang, Kun Gai

    Abstract: Exponentially growing short video platforms (SVPs) face significant challenges in moderating content detrimental to users' mental health, particularly for minors. The dissemination of such content on SVPs can lead to catastrophic societal consequences. Although substantial efforts have been dedicated to moderating such content, existing methods suffer from critical limitations: (1) Manual review i…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 20 pages, 6 figures

  24. arXiv:2504.14600  [pdf, ps, other]

    cs.CV

    NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results

    Authors: Zheng Chen, Jingkai Wang, Kai Liu, Jue Gong, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Jianxing Zhang, Jinlong Wu, Jun Wang, Zheng Xie, Hakjae Jeon, Suejin Han, Hyung-Ju Chun, Hyunhee Park, Zhicun Yin, Junjie Chen, Ming Liu, Xiaoming Li, Chao Zhou, Wangmeng Zuo, Weixia Zhang, Dingquan Li, Kede Ma, et al. (29 additional authors not shown)

    Abstract: This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or…

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_RealWorld_Face_Restoration

  25. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  26. arXiv:2504.10738  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates

    Authors: Ankit Kumar Shaw, Kun Jiang, Tuopu Wen, Chandan Kumar Sah, Yining Shi, Mengmeng Yang, Diange Yang, Xiaoli Lian

    Abstract: The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Mul…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Kun Jiang, Mengmeng Yang and Diange Yang are Corresponding Author. The main paper and supplementary material are both included here, total 23 pages (main paper is 10 pages and supplementary material is 13 pages), total 17 figures (6 figures in main paper and 11 figures in supplementary material), this paper is Accepted to CVPR WDFM-AD Workshop 2025, The code will be available at https://Ankit-Zefan.github.io/CleanMap/

    ACM Class: I.2.9; I.2.7; I.2.10; I.5.5; I.5.4; I.2.11

  27. arXiv:2504.10329  [pdf, other]

    cs.CV

    InstructEngine: Instruction-driven Text-to-Image Alignment

    Authors: Xingyu Lu, Yuhang Hu, YiFan Zhang, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Jinpeng Wang, Chun Yuan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang

    Abstract: Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithm. For training data, most approaches rely on manually annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. H…

    Submitted 21 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: 8 pages, 7 figures

  28. arXiv:2504.09973  [pdf, other]

    cs.CV

    Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration

    Authors: Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie

    Abstract: All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from p…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Project page: https://github.com/Aitical/CPLIR

  29. arXiv:2504.09844  [pdf, other]

    cs.DC cs.AI

    OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training

    Authors: Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, Junda Feng, Jianyu Jiang, Yangrui Chen, Shuaishuai Cao, Jialing He, Kaihua Jiang, Yuanzhe Hu, Shibiao Nong, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu

    Abstract: Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of training data. Under multisource preprocessing, two fundamental challenges exist. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to significant…

    Submitted 18 May, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

  30. arXiv:2504.08222  [pdf, other]

    cs.CV cs.AI

    F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos

    Authors: Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, Jin Song Dong

    Abstract: Analyzing Fast, Frequent, and Fine-grained (F$^3$) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F$^3$ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F$^3$Set, a benchmark that consists of vi…

    Submitted 14 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

    Comments: ICLR 2025; Website URL: https://lzyandy.github.io/f3set-website/

  31. arXiv:2504.05649  [pdf, other]

    cs.CV

    POD: Predictive Object Detection with Single-Frame FMCW LiDAR Point Cloud

    Authors: Yining Shi, Kun Jiang, Xin Zhao, Kangan Qian, Chuchu Xie, Tuopu Wen, Mengmeng Yang, Diange Yang

    Abstract: LiDAR-based 3D object detection is a fundamental task in the field of autonomous driving. This paper explores the unique advantage of Frequency Modulated Continuous Wave (FMCW) LiDAR in autonomous perception. Given a single frame FMCW point cloud with radial velocity measurements, we expect that our object detector can detect the short-term future locations of objects using only the current frame…

    Submitted 7 April, 2025; originally announced April 2025.

  32. arXiv:2504.05164  [pdf, other]

    cs.CV

    Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion

    Authors: Xingyu Hu, Junjun Jiang, Chenyang Wang, Kui Jiang, Xianming Liu, Jiayi Ma

    Abstract: Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion…

    Submitted 7 April, 2025; originally announced April 2025.

  33. arXiv:2504.04869  [pdf, other]

    cs.CV

    Content-Aware Transformer for All-in-one Image Restoration

    Authors: Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu

    Abstract: Image restoration has witnessed significant advancements with the development of deep learning models. Although Transformer architectures have progressed considerably in recent years, challenges remain, particularly the limited receptive field in window-based self-attention. In this work, we propose DSwinIR, a Deformable Sliding window Transformer for Image Restoration. DSwinIR introduces a novel…

    Submitted 7 April, 2025; originally announced April 2025.

  34. arXiv:2504.03753  [pdf, other]

    cs.LG stat.ME

    MMCE: A Framework for Deep Monotonic Modeling of Multiple Causal Effects

    Authors: Juhua Chen, Karson shi, Jialing He, North Chen, Kele Jiang

    Abstract: When we plan to use money as an incentive to change the behavior of a person (such as making riders deliver more orders or making consumers buy more items), the common approach to this problem is to adopt a two-stage framework in order to maximize ROI under cost constraints. In the first stage, the individual price response curve is obtained. In the second stage, business goals and resource…

    Submitted 1 April, 2025; originally announced April 2025.

  35. arXiv:2504.01719  [pdf, other]

    cs.LG cs.RO

    Beyond Non-Expert Demonstrations: Outcome-Driven Action Constraint for Offline Reinforcement Learning

    Authors: Ke Jiang, Wen Jiang, Yao Li, Xiaoyang Tan

    Abstract: We address the challenge of offline reinforcement learning using realistic data, specifically non-expert data collected through sub-optimal behavior policies. Under such circumstances, the learned policy must be safe enough to manage distribution shift while maintaining sufficient flexibility to deal with non-expert (bad) demonstrations from offline data. To tackle this issue, we introduce a novel m…

    Submitted 2 April, 2025; v1 submitted 2 April, 2025; originally announced April 2025.

  36. arXiv:2503.23963  [pdf, other]

    cs.CV cs.RO

    A Benchmark for Vision-Centric HD Mapping by V2I Systems

    Authors: Miao Fan, Shanshan Yu, Shengtong Xu, Kun Jiang, Haoyi Xiong, Xiangzeng Liu

    Abstract: Autonomous driving faces safety challenges due to a lack of global perspective and the semantic information of vectorized high-definition (HD) maps. Information from roadside cameras can greatly expand the map perception range through vehicle-to-infrastructure (V2I) communications. However, there is still no dataset from the real world available for the study on map vectorization onboard under the…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: Accepted by IEEE IV'25

  37. arXiv:2503.19739  [pdf, other]

    cs.CV

    FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion

    Authors: Pihai Sun, Junjun Jiang, Yuanqi Yao, Youyu Chen, Wenbo Zhao, Kui Jiang, Xianming Liu

    Abstract: Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to inef…

    Submitted 26 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: 8 pages, 6 figures

  38. arXiv:2503.19443  [pdf, other]

    cs.CV

    COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting

    Authors: Jiaxin Zhang, Junjun Jiang, Youyu Chen, Kui Jiang, Xianming Liu

    Abstract: Accurate object segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we intr…

    Submitted 26 March, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  39. arXiv:2503.18402  [pdf, other]

    cs.CV

    DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds

    Authors: Youyu Chen, Junjun Jiang, Kui Jiang, Xiao Tang, Zhihao Li, Xianming Liu, Yinyu Nie

    Abstract: 3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where the rendering resolution and the primitive number, concluded as the optimization complexity, dominate the time cost in primitive optimization. In this paper, we propose DashGaussian, a scheduling scheme over the optimization complexity of 3DGS that strips redundant complexity to accelerate 3DGS optimization. Spec…

    Submitted 26 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025. Project page: https://dashgaussian.github.io

  40. arXiv:2503.11117  [pdf, other]

    cs.CV

    Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering

    Authors: Kaixuan Jiang, Yang Liu, Weixing Chen, Jingzhou Luo, Ziliang Chen, Ling Pan, Guanbin Li, Liang Lin

    Abstract: Embodied Question Answering (EQA) is a challenging task in embodied intelligence that requires agents to dynamically explore 3D environments, actively gather visual information, and perform multi-step reasoning to answer questions. However, current EQA approaches suffer from critical limitations in exploration efficiency, dataset design, and evaluation metrics. Moreover, existing datasets often in…

    Submitted 23 May, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

  41. arXiv:2503.08760  [pdf, other]

    cs.LG cs.AI stat.ML

    Heterogeneous Graph Structure Learning through the Lens of Data-generating Processes

    Authors: Keyue Jiang, Bohan Tang, Xiaowen Dong, Laura Toni

    Abstract: Inferring the graph structure from observed data is a key task in graph machine learning to capture the intrinsic relationship between data entities. While significant advancements have been made in learning the structure of homogeneous graphs, many real-world graphs exhibit heterogeneous patterns where nodes and edges have multiple types. This paper fills this gap by introducing the first approac…

    Submitted 11 March, 2025; originally announced March 2025.

  42. arXiv:2503.08162  [pdf, other]

    cs.RO cs.CL

    FASIONAD++: Integrating High-Level Instruction and Information Bottleneck in FAst-Slow fusION Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback

    Authors: Kangan Qian, Ziang Luo, Sicong Jiang, Zilin Huang, Jinyu Miao, Zhikun Ma, Tianze Zhu, Jiayin Li, Yangfan He, Zheng Fu, Yining Shi, Boyue Wang, Hezhe Lin, Ziyu Chen, Jiangbo Yu, Xinyu Jiao, Mengmeng Yang, Kun Jiang, Diange Yang

    Abstract: Ensuring safe, comfortable, and efficient planning is crucial for autonomous driving systems. While end-to-end models trained on large datasets perform well in standard driving scenarios, they struggle with complex low-frequency events. Recent Large Language Models (LLMs) and Vision Language Models (VLMs) advancements offer enhanced reasoning but suffer from computational inefficiency. Inspired by…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: 8 pages, 4 figures

  43. arXiv:2503.07826  [pdf, other]

    cs.CL

    Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation

    Authors: Fan Yin, Zifeng Wang, I-Hung Hsu, Jun Yan, Ke Jiang, Yanfei Chen, Jindong Gu, Long T. Le, Kai-Wei Chang, Chen-Yu Lee, Hamid Palangi, Tomas Pfister

    Abstract: Large language models (LLMs) have exhibited the ability to effectively utilize external tools to address user queries. However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To address this, we propose Magnet, a principled framework for synthesizing high-quality training trajectories to enhance the function calling capability of large lang…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 12 pages, 3 figures, 4 tables

  44. arXiv:2503.07367  [pdf, other]

    cs.CV

    LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction

    Authors: Kangan Qian, Jinyu Miao, Ziang Luo, Zheng Fu, Jinchen Li, Yining Shi, Yunlong Wang, Kun Jiang, Mengmeng Yang, Diange Yang

    Abstract: Accurate and reliable spatial and motion information plays a pivotal role in autonomous driving systems. However, object-level perception models struggle with handling open scenario categories and lack precise intrinsic geometry. On the other hand, occupancy-based class-agnostic methods excel in representing scenes but fail to ensure physics consistency and ignore the importance of interactions be…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 8 pages, 4 figures

  45. arXiv:2503.06446  [pdf, other]

    cs.CV

    M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification

    Authors: Mingxiang Cao, Weiying Xie, Xin Zhang, Jiaqing Zhang, Kai Jiang, Jie Lei, Yunsong Li

    Abstract: Multi-modal fusion holds great promise for integrating information from different modalities. However, due to a lack of consideration for modal consistency, existing multi-modal fusion methods in the field of remote sensing still face challenges of incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the visual language pre-trai…

    Submitted 9 March, 2025; originally announced March 2025.

  46. arXiv:2503.06313  [pdf]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection

    Authors: Chandan Kumar Sah, Ankit Kumar Shaw, Xiaoli Lian, Arsalan Shahid Baig, Tuopu Wen, Kun Jiang, Mengmeng Yang, Diange Yang

    Abstract: Autonomous vehicles (AVs) require reliable traffic sign recognition and robust lane detection capabilities to ensure safe navigation in complex and dynamic environments. This paper introduces an integrated approach combining advanced deep learning techniques and Multimodal Large Language Models (MLLMs) for comprehensive road perception. For traffic sign recognition, we systematically evaluate ResN…

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: 11 pages, 9 figures

  47. arXiv:2503.04223  [pdf, other]

    cs.CV

    Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks

    Authors: Yi Xiao, Qiangqiang Yuan, Kui Jiang, Qiang Zhang, Tingting Zheng, Chia-Wen Lin, Liangpei Zhang

    Abstract: Spiking neural networks (SNNs) are emerging as a promising alternative to traditional artificial neural networks (ANNs), offering biological plausibility and energy efficiency. Despite these merits, SNNs are frequently hampered by limited capacity and insufficient representation power, yet remain underexplored in remote sensing super-resolution (SR) tasks. In this paper, we first observe that spik…

    Submitted 6 March, 2025; originally announced March 2025.

  48. arXiv:2503.00862  [pdf, other]

    cs.RO

    Efficient End-to-end Visual Localization for Autonomous Driving with Decoupled BEV Neural Matching

    Authors: Jinyu Miao, Tuopu Wen, Ziang Luo, Kangan Qian, Zheng Fu, Yunlong Wang, Kun Jiang, Mengmeng Yang, Jin Huang, Zhihua Zhong, Diange Yang

    Abstract: Accurate localization plays an important role in high-level autonomous driving systems. Conventional map matching-based localization methods solve the poses by explicitly matching map elements with sensor observations, generally sensitive to perception noise, therefore requiring costly hyper-parameter tuning. In this paper, we propose an end-to-end localization neural network which directly estima…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: 8 pages, 5 figures, 4 tables

  49. arXiv:2502.21134  [pdf, other]

    cs.RO cs.AI

    Dynamically Local-Enhancement Planner for Large-Scale Autonomous Driving

    Authors: Nanshan Deng, Weitao Zhou, Bo Zhang, Junze Wen, Kun Jiang, Zhong Cao, Diange Yang

    Abstract: Current autonomous vehicles operate primarily within limited regions, but there is increasing demand for broader applications. However, as models scale, their limited capacity becomes a significant challenge for adapting to novel scenarios. It is increasingly difficult to improve models for new situations using a single monolithic model. To address this issue, we introduce the concept of dynamical…

    Submitted 28 February, 2025; originally announced February 2025.

  50. arXiv:2502.21130  [pdf, other]

    cs.CV

    Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning

    Authors: Jiuyang Dong, Junjun Jiang, Kui Jiang, Jiahan Li, Yongbing Zhang

    Abstract: Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to processing numerous patches from gigapixel whole slide images (WSIs). To address this, we propose HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL…

    Submitted 3 March, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

    Comments: 11 pages, 4 figures, accepted by CVPR2025