Showing 1–50 of 1,058 results for author: Han, S

Searching in archive cs.

  1. arXiv:2507.00506  [pdf, ps, other]

    cs.CV

    SCING: Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning

    Authors: Yunfei Xie, Yuxuan Cheng, Juncheng Wu, Haoyu Zhang, Yuyin Zhou, Shoudong Han

    Abstract: Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tun…

    Submitted 1 July, 2025; originally announced July 2025.

  2. arXiv:2506.23601  [pdf, ps, other]

    cs.CL cs.AI

    Semantic-guided Diverse Decoding for Large Language Model

    Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou

    Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or…

    Submitted 30 June, 2025; originally announced June 2025.

  3. arXiv:2506.23329  [pdf, ps, other]

    cs.CV

    IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

    Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng

    Abstract: Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using pr…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Project Page: https://ir3d-bench.github.io/

  4. arXiv:2506.23266  [pdf, ps, other]

    cs.LG

    Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging

    Authors: Lujun Li, Zhu Qiyuan, Jiacheng Wang, Wei Li, Hao Gu, Sirui Han, Yike Guo

    Abstract: Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods promise greater efficiency by consolidating multiple experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compress…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Work in progress, revisions ongoing

  5. arXiv:2506.21132  [pdf, ps, other]

    cs.CV

    Learning to See in the Extremely Dark

    Authors: Hai Jiang, Binhao Guan, Zhen Liu, Xiaohong Liu, Jian Yu, Zheng Liu, Songchen Han, Shuaicheng Liu

    Abstract: Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability to extremely dark scenes where the environmental illuminance drops as low as 0.0001 lux remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025

  6. arXiv:2506.21039  [pdf, ps, other]

    cs.LG cs.AI

    Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

    Authors: Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han

    Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, they often suffer from subgoal infeasibility and inefficient planning. We introduce Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that enforces singl…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: 9 technical pages followed by references and appendix

  7. arXiv:2506.20451  [pdf, ps, other]

    cs.LG cs.AI

    Automatic Demonstration Selection for LLM-based Tabular Data Classification

    Authors: Shuchu Han, Wolfgang Bruckner

    Abstract: A fundamental question in applying In-Context Learning (ICL) for tabular data classification is how to determine the ideal number of demonstrations in the prompt. This work addresses this challenge by presenting an algorithm to automatically select a reasonable number of required demonstrations. Our method distinguishes itself by integrating not only the tabular data's distribution but also the us…

    Submitted 25 June, 2025; originally announced June 2025.

  8. arXiv:2506.20357  [pdf, ps, other]

    cs.AI

    Tabular Feature Discovery With Reasoning Type Exploration

    Authors: Sungwon Han, Sungkyu Park, Seungeon Lee

    Abstract: Feature engineering for tabular data remains a critical yet challenging step in machine learning. Recently, large language models (LLMs) have been used to automatically generate new features by leveraging their vast knowledge. However, existing LLM-based approaches often produce overly simple or repetitive features, partly due to inherent biases in the transformations the LLM chooses and the lack…

    Submitted 25 June, 2025; originally announced June 2025.

  9. arXiv:2506.19852  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

    Authors: Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han

    Abstract: Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal d…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Code: https://github.com/mit-han-lab/radial-attention

  10. arXiv:2506.19417  [pdf, ps, other]

    cs.LG cs.MA

    Center of Gravity-Guided Focusing Influence Mechanism for Multi-Agent Reinforcement Learning

    Authors: Yisak Park, Sunwoo Lee, Seungyul Han

    Abstract: Cooperative multi-agent reinforcement learning (MARL) under sparse rewards presents a fundamental challenge due to limited exploration and insufficient coordinated attention among agents. In this work, we propose the Focusing Influence Mechanism (FIM), a novel framework that enhances cooperation by directing agent influence toward task-critical elements, referred to as Center of Gravity (CoG) stat…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: 9 technical pages followed by references and appendix

  11. arXiv:2506.18897  [pdf, ps, other]

    cs.RO cs.AI

    MinD: Unified Visual Imagination and Control via Hierarchical World Models

    Authors: Xiaowei Chi, Kuangzhi Ge, Jiaming Liu, Siyuan Zhou, Peidong Jia, Zichen He, Yuzhen Liu, Tingguang Li, Lei Han, Sirui Han, Shanghang Zhang, Yike Guo

    Abstract: Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics by integrating simulation, prediction, and manipulation. However, their practical application remains limited due to (1) slow generation speed, which limits real-time interaction, and (2) poor consistency between imagined videos and executable actions. To address these challenges, we propose Manipulate i…

    Submitted 23 June, 2025; originally announced June 2025.

  12. arXiv:2506.17877  [pdf, ps, other]

    cs.NI cs.OS

    Supporting Deterministic Traffic on Standard NICs

    Authors: Chuanyu Xue, Tianyu Zhang, Andrew Loveless, Song Han

    Abstract: Networked mission-critical applications (e.g., avionic control and industrial automation systems) require deterministic packet transmissions to support a range of sensing and control tasks with stringent timing constraints. While specialized network infrastructure (e.g., time-sensitive networking (TSN) switches) provides deterministic data transport across the network, achieving strict end-to-end…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 20 pages

  13. arXiv:2506.17104  [pdf, ps, other]

    cs.AI cs.CL cs.LO

    Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving

    Authors: Chuxue Cao, Mengze Li, Juntao Dai, Jinluan Yang, Zijian Zhao, Shengyu Zhang, Weijie Shi, Chengzhong Liu, Sirui Han, Yike Guo

    Abstract: Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated b…

    Submitted 20 June, 2025; originally announced June 2025.

  14. arXiv:2506.16754  [pdf, ps, other]

    cs.LG cs.AI cs.SI

    Metapath-based Hyperbolic Contrastive Learning for Heterogeneous Graph Embedding

    Authors: Jongmin Park, Seunghoon Han, Won-Yong Shin, Sungsu Lim

    Abstract: The hyperbolic space, characterized by a constant negative curvature and exponentially expanding space, aligns well with the structural properties of heterogeneous graphs. However, although heterogeneous graphs inherently possess diverse power-law structures, most hyperbolic heterogeneous graph embedding models rely on a single hyperbolic space. This approach may fail to effectively capture the di…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: 14 pages, 9 figures

  15. arXiv:2506.16741  [pdf, ps, other]

    eess.AS cs.AI

    RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

    Authors: Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song

    Abstract: We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, Ra…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Accepted on Interspeech 2025

  16. arXiv:2506.16500  [pdf, ps, other]

    cs.LG

    SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity

    Authors: Samir Khaki, Xiuyu Li, Junxian Guo, Ligeng Zhu, Chenfeng Xu, Konstantinos N. Plataniotis, Amir Yazdanbakhsh, Kurt Keutzer, Song Han, Zhijian Liu

    Abstract: Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as QLoRA and DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they may even slow down fine-tuning. In this paper, we introduce SparseLoRA, a method that accelerates LLM fine-tuning through contextual sparsi…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: ICML 2025. The first three authors contributed equally to this work. Project page: https://z-lab.ai/projects/sparselora

  17. arXiv:2506.16073  [pdf, ps, other]

    cs.CV

    TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading

    Authors: Byung Hoon Lee, Wooseok Shin, Sung Won Han

    Abstract: The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: 15 pages, 6 figures

    ACM Class: I.4.8; I.5.4; I.2.10

  18. arXiv:2506.13342  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers

    Authors: Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu

    Abstract: Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the…

    Submitted 16 June, 2025; originally announced June 2025.

  19. arXiv:2506.13015  [pdf, ps, other]

    cs.LG cs.AI

    Geometric Embedding Alignment via Curvature Matching in Transfer Learning

    Authors: Sung Moon Ko, Jaewan Lee, Sumin Lee, Soorin Yim, Kyunghoon Bae, Sehui Han

    Abstract: Geometrical interpretations of deep learning models offer insightful perspectives into their underlying mathematical structures. In this work, we introduce a novel approach that leverages differential geometry, particularly concepts from Riemannian geometry, to integrate multiple models into a unified transfer learning framework. By aligning the Ricci curvature of latent space of individual models…

    Submitted 15 June, 2025; originally announced June 2025.

    Comments: 13+19 pages, 7 figures, 8 tables, 1 pseudo code

  20. arXiv:2506.12040  [pdf, other]

    cs.LG cs.AI cs.CV

    BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook

    Authors: Hao Gu, Lujun Li, Zheyu Wang, Bei Liu, Qiyuan Zhu, Sirui Han, Yike Guo

    Abstract: Binary quantization represents the most extreme form of large language model (LLM) compression, reducing weights to $\pm$1 for maximal memory and computational efficiency. While recent sparsity-aware binarization methods achieve sub-1-bit compression by pruning redundant binary weights, they suffer from three critical challenges: performance deterioration, computational complexity from sparse mask…

    Submitted 23 May, 2025; originally announced June 2025.

  21. arXiv:2506.10242  [pdf, ps, other]

    cs.CV

    DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos

    Authors: Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli

    Abstract: Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propo…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: CVPR 2025 Workshop on Autonomous Driving

  22. arXiv:2506.10145  [pdf, ps, other]

    cs.CV

    RoCA: Robust Cross-Domain End-to-End Autonomous Driving

    Authors: Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Litian Liu, Shweta Mahajan, Apratim Bhattacharyya, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli

    Abstract: End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohib…

    Submitted 17 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  23. arXiv:2506.09417  [pdf, ps, other]

    cs.CV

    ODG: Occupancy Prediction Using Dual Gaussians

    Authors: Yunxiao Shi, Yinhao Zhu, Shizhong Han, Jisoo Jeong, Amin Ansari, Hong Cai, Fatih Porikli

    Abstract: Occupancy prediction infers fine-grained 3D geometry and semantics from camera images of the surrounding environment, making it a critical perception task for autonomous driving. Existing methods either adopt dense grids as scene representation, which is difficult to scale to high resolution, or learn the entire scene using a single set of sparse queries, which is insufficient to handle the variou…

    Submitted 12 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  24. arXiv:2506.09375  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    CoLMbo: Speaker Language Model for Descriptive Profiling

    Authors: Massa Baali, Shuo Han, Syed Abdul Hannan, Purusottam Samal, Karanveer Singh, Soham Deshmukh, Rita Singh, Bhiksha Raj

    Abstract: Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that…

    Submitted 10 June, 2025; originally announced June 2025.

  25. arXiv:2506.08964  [pdf, other]

    cs.CV

    ORIDa: Object-centric Real-world Image Composition Dataset

    Authors: Jinwoo Kim, Sangmin Han, Jinho Jeong, Jiwoo Choi, Dongyoung Kim, Seon Joo Kim

    Abstract: Object compositing, the task of placing and harmonizing objects in images of diverse visual scenes, has become an important task in computer vision with the rise of generative models. However, existing datasets lack the diversity and scale required to comprehensively explore real-world scenarios. We introduce ORIDa (Object-centric Real-world Image Composition Dataset), a large-scale, real-captured…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: Accepted at CVPR 2025

  26. arXiv:2506.08125  [pdf, ps, other]

    cs.LG cs.CL

    Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning

    Authors: Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang

    Abstract: Large language models have demonstrated impressive reasoning capabilities, yet they often suffer from inefficiencies due to unnecessarily verbose or redundant outputs. While many works have explored reinforcement learning (RL) to enhance reasoning abilities, most primarily focus on improving accuracy, with limited attention to reasoning efficiency. Some existing approaches introduce direct length-…

    Submitted 9 June, 2025; originally announced June 2025.

  27. arXiv:2506.07443  [pdf, ps, other]

    cs.AI

    LegalReasoner: Step-wised Verification-Correction for Legal Judgment Reasoning

    Authors: Weijie Shi, Han Zhu, Jiaming Ji, Mengze Li, Jipeng Zhang, Ruiyuan Zhang, Jia Zhu, Jiajie Xu, Sirui Han, Yike Guo

    Abstract: Legal judgment prediction (LJP) aims to function as a judge by making final rulings based on case claims and facts, which plays a vital role in the judicial domain for supporting court decision-making and improving judicial efficiency. However, existing methods often struggle with logical errors when conducting complex legal reasoning. We propose LegalReasoner, which enhances LJP reliability throu…

    Submitted 9 June, 2025; originally announced June 2025.

  28. arXiv:2506.07002  [pdf, ps, other]

    cs.CV

    BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction

    Authors: Yunxiao Shi, Hong Cai, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Amin Ansari, Fatih Porikli

    Abstract: 3D occupancy provides fine-grained 3D geometry and semantics for scene understanding which is critical for autonomous driving. Most existing methods, however, carry high compute costs, requiring dense 3D feature volume and cross-attention to effectively aggregate information. More recent works have adopted Bird's Eye View (BEV) or sparse points as scene representation with much reduced cost, but s…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: Two-page abstract version available at CVPR 2025 Embodied AI Workshop

  29. arXiv:2506.06803  [pdf, ps, other]

    cs.CY

    Spatial Disparities in Fire Shelter Accessibility: Capacity Challenges in the Palisades and Eaton Fires

    Authors: Su Yeon Han, Yubin Lee, Jooyoung Yoo, Jeon-Young Kang, Jinwoo Park, Soe W. Myint, Eunsang Cho, Xin Gu, Joon-Seok Kim

    Abstract: The increasing frequency and severity of wildfire in California, exacerbated by prolonged drought and environmental changes, pose significant challenges to urban community resilience and equitable emergency response. The study investigates issues of accessibility to shelters during the Palisades and Eaton Fires which started in January 2025 in Southern California that led to over 180,000 displacem…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: 35 pages, 11 figures

  30. arXiv:2506.06677  [pdf, ps, other]

    cs.RO cs.CV

    RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

    Authors: Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, Si Liu

    Abstract: Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities-characterized by deliberative, goal-directed thinking-remain underexplored due to the limite…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: 23 pages, 18 figures

  31. arXiv:2506.06636  [pdf, other]

    cs.CL

    SafeLawBench: Towards Safe Alignment of Large Language Models

    Authors: Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Zhenghao Zhu, Yinyu Wu, Juntao Dai, Yaodong Yang, Sirui Han, Yike Guo

    Abstract: With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs' safety evaluation from a legal perspective by proposing the SafeLawBench benchma…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: Accepted to ACL2025 Findings

  32. arXiv:2506.06400  [pdf, ps, other]

    eess.IV cs.CV

    ResPF: Residual Poisson Flow for Efficient and Physically Consistent Sparse-View CT Reconstruction

    Authors: Changsheng Fang, Yongtong Liu, Bahareh Morovati, Shuo Han, Yu Shi, Li Zhou, Shuyi Fan, Hengyong Yu

    Abstract: Sparse-view computed tomography (CT) is a practical solution to reduce radiation dose, but the resulting ill-posed inverse problem poses significant challenges for accurate image reconstruction. Although deep learning and diffusion-based methods have shown promising results, they often lack physical interpretability or suffer from high computational costs due to iterative sampling starting from ra…

    Submitted 5 June, 2025; originally announced June 2025.

  33. arXiv:2506.05587  [pdf, ps, other]

    cs.AI cs.CL cs.DB cs.LG

    MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

    Authors: Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish

    Abstract: Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenario…

    Submitted 5 June, 2025; originally announced June 2025.

  34. arXiv:2506.05207  [pdf, ps, other]

    cs.CV

    Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning

    Authors: Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen

    Abstract: Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generations. As for the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptations (LoRAs) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to larg…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: project page: https://follow-your-motion.github.io/

  35. arXiv:2506.04499  [pdf, ps, other]

    cs.CV

    FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

    Authors: Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli

    Abstract: Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast…

    Submitted 4 June, 2025; originally announced June 2025.

  36. arXiv:2506.03765  [pdf, ps, other]

    cs.CR

    Prediction Inconsistency Helps Achieve Generalizable Detection of Adversarial Examples

    Authors: Sicong Han, Chenhao Lin, Zhengyu Zhao, Xiyuan Wang, Xinlei He, Qian Li, Cong Wang, Qian Wang, Chao Shen

    Abstract: Adversarial detection protects models from adversarial attacks by refusing suspicious test samples. However, current detection methods often suffer from weak generalization: their effectiveness tends to degrade significantly when applied to adversarially trained models rather than naturally trained ones, and they generally struggle to achieve consistent effectiveness across both white-box and blac…

    Submitted 4 June, 2025; originally announced June 2025.

  37. arXiv:2506.02615  [pdf, ps, other]

    cs.CV cs.AI

    Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models

    Authors: Safaa Abdullahi Moallim Mohamud, Minjin Baek, Dong Seog Han

    Abstract: In this paper, we present a hierarchical question-answering (QA) approach for scene understanding in autonomous vehicles, balancing cost-efficiency with detailed visual interpretation. The method fine-tunes a compact vision-language model (VLM) on a custom dataset specific to the geographical area in which the vehicle operates to capture key driving-related visual elements. At the inference stage,…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: This work has been submitted to the IEEE for possible publication

  38. arXiv:2506.02197  [pdf, ps, other]

    eess.IV cs.CV

    NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution

    Authors: Marcos V. Conde, Radu Timofte, Zihao Lu, Xiangyu Kong, Xiaoxia Xing, Fan Wang, Suejin Han, MinKyu Park, Tianyu Zhang, Xin Luo, Yeda Chen, Dong Liu, Li Pang, Yuhang Yang, Hongzhong Wang, Xiangyong Cao, Ruixuan Jiang, Senyan Xu, Siyuan Jiang, Xueyang Fu, Zheng-Jun Zha, Tianyu Hao, Yuhong He, Ruoqi Li, Yueqi Yang, et al. (14 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Restoration and Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. The goal of this challenge is twofold, (i) restore RAW images with blur and…

    Submitted 4 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

  39. arXiv:2506.01908  [pdf, ps, other]

    cs.CV

    Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency

    Authors: Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, Si Liu

    Abstract: Understanding real-world videos with complex semantics and long temporal dependencies remains a fundamental challenge in computer vision. Recent progress in multimodal large language models (MLLMs) has demonstrated strong capabilities in vision-language tasks, while reinforcement learning tuning (RLT) has further improved their reasoning abilities. In this work, we explore RLT as a post-training s…

    Submitted 2 June, 2025; originally announced June 2025.

  40. arXiv:2506.01460  [pdf, ps, other]

    cs.SD eess.AS

    Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement

    Authors: Seungu Han, Sungho Lee, Juheon Lee, Kyogu Lee

    Abstract: Deep generative models have recently been employed for speech enhancement to generate perceptually valid clean speech on large-scale datasets. Several diffusion models have been proposed, and more recently, a tractable Schrödinger Bridge has been introduced to transport between the clean and noisy speech distributions. However, these models often suffer from an iterative reverse process and requir…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  41. arXiv:2506.01388  [pdf, ps, other]

    cs.CV cs.AI

    VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding

    Authors: Yihao Ding, Soyeon Caren Han, Yan Li, Josiah Poon

    Abstract: Visually Rich Document Understanding (VRDU) has emerged as a critical field in document intelligence, enabling automated extraction of key information from complex documents across domains such as medical, financial, and educational applications. However, form-like documents pose unique challenges due to their complex layouts, multi-stakeholder involvement, and high structural variability. Address…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted at IJCAI 2025 Demonstrations Track

  42. arXiv:2506.01300  [pdf, ps, other]

    cs.CV

    ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

    Authors: Yiyang Zhou, Yangfan He, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, Huaxiu Yao

    Abstract: Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: 31 pages, 18 figures

  43. arXiv:2506.01096  [pdf, ps, other]

    cs.AI

    SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning

    Authors: Yihao Liu, Shuocheng Li, Lang Cao, Yuhang Xie, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang

    Abstract: Large language models are increasingly used for complex reasoning tasks where high-quality offline data such as expert-annotated solutions and distilled reasoning traces are often available. However, in environments with sparse rewards, reinforcement learning struggles to sample successful trajectories, leading to inefficient learning. At the same time, these offline trajectories that represent co…

    Submitted 1 June, 2025; originally announced June 2025.

  44. arXiv:2506.00982  [pdf, other]

    cs.RO cs.MA

    Robust and Safe Multi-Agent Reinforcement Learning Framework with Communication for Autonomous Vehicles

    Authors: Keshawn Smith, Zhili Zhang, H M Sabbir Ahmad, Ehsan Sabouni, Maniak Mondal, Song Han, Wenchao Li, Fei Miao

    Abstract: Deep multi-agent reinforcement learning (MARL) has been demonstrated effectively in simulations for many multi-robot problems. For autonomous vehicles, the development of vehicle-to-vehicle (V2V) communication technologies provides opportunities to further enhance the safety of the system. However, zero-shot transfer of simulator-trained MARL policies to hardware dynamic systems remains challenging, an…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 19 pages, 9 Figures

  45. arXiv:2506.00799  [pdf, ps, other]

    cs.LG

    Uni-LoRA: One Vector is All You Need

    Authors: Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji

    Abstract: Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strate…

    Submitted 31 May, 2025; originally announced June 2025.

  46. arXiv:2506.00751  [pdf, ps, other]

    cs.AI cs.LG

    Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences?

    Authors: Zhuojun Gu, Quan Wang, Shuchu Han

    Abstract: Recent advances in Large Language Models (LLMs) highlight the need to align their behaviors with human values. A critical, yet understudied, issue is the potential divergence between an LLM's stated preferences (its reported alignment with general principles) and its revealed preferences (inferred from decisions in contextualized scenarios). Such deviations raise fundamental concerns for the inter…

    Submitted 31 May, 2025; originally announced June 2025.

  47. arXiv:2505.24714  [pdf, ps, other]

    cs.CL

    FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation

    Authors: Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Yike Guo

    Abstract: Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asse…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: ACL 2025 Main Conference

  48. arXiv:2505.23950  [pdf, ps, other]

    cs.AI

    InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

    Authors: Boyuan Chen, Donghai Hong, Jiaming Ji, Jiacheng Zheng, Bowen Dong, Jiayi Zhou, Kaile Wang, Juntao Dai, Xuyao Wang, Wenqi Chen, Qirui Zheng, Wenxin Li, Sirui Han, Yike Guo, Yaodong Yang

    Abstract: As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support mul…

    Submitted 29 May, 2025; originally announced May 2025.

  49. arXiv:2505.23667  [pdf, ps, other]

    cs.AI

    Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models

    Authors: Lang Cao, Jingxian Xu, Hanbing Liu, Jinyu Wang, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang

    Abstract: Tables are a fundamental structure for organizing and analyzing data, making effective table understanding a critical capability for intelligent systems. While large language models (LMs) demonstrate strong general reasoning abilities, they continue to struggle with accurate numerical or symbolic reasoning over tabular data, especially in complex scenarios. Spreadsheet formulas provide a powerful…

    Submitted 31 May, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

  50. arXiv:2505.22618  [pdf, ps, other]

    cs.CL

    Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

    Authors: Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie

    Abstract: Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introdu…

    Submitted 28 May, 2025; originally announced May 2025.