Showing 1–50 of 13,459 results for author: Wang, Y

Searching in archive cs.
  1. arXiv:2507.02841  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

    Authors: Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, Rui Yan

    Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where…

    Submitted 3 July, 2025; originally announced July 2025.

  2. arXiv:2507.02792  [pdf, ps, other]

    cs.CV

    RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

    Authors: Liheng Zhang, Lexi Pang, Hang Ye, Xiaoxuan Ma, Yizhou Wang

    Abstract: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer fr…

    Submitted 3 July, 2025; originally announced July 2025.

  3. arXiv:2507.02768  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, et al. (3 additional authors not shown)

    Abstract: We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

  4. arXiv:2507.02714  [pdf, ps, other]

    cs.CV cs.AI

    FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models

    Authors: Yuxuan Wang, Tianwei Cao, Huayu Zhang, Zhongjiang He, Kongming Liang, Zhanyu Ma

    Abstract: Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach design…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: ICCV 2025

  5. arXiv:2507.02626  [pdf, ps, other]

    cs.MM

    VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

    Authors: Siran Chen, Boyu Chen, Chenyun Yu, Yuxiao Luo, Ouyang Yi, Lei Cheng, Chengxiang Zhuo, Zang Li, Yali Wang

    Abstract: Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt-based simulation using frozen LLMs and encounter the intricate challenge of multimodal content unders…

    Submitted 3 July, 2025; originally announced July 2025.

  6. arXiv:2507.02565  [pdf, ps, other]

    cs.CV

    Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning

    Authors: Buzhen Huang, Chen Li, Chongyang Xu, Dongyue Lu, Jinnan Chen, Yangang Wang, Gim Hee Lee

    Abstract: Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacle…

    Submitted 3 July, 2025; originally announced July 2025.

  7. arXiv:2507.02562  [pdf, ps, other]

    eess.AS cs.SD

    Multi-Utterance Speech Separation and Association Trained on Short Segments

    Authors: Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Current deep neural network (DNN) based speech separation faces a fundamental challenge -- while the models need to be trained on short segments due to computational constraints, real-world applications typically require processing significantly longer recordings with multiple utterances per speaker than seen during training. In this paper, we investigate how existing approaches perform in this ch…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 5 pages, accepted by WASPAA 2025

  8. arXiv:2507.02376  [pdf, ps, other]

    cs.SE cs.AI cs.DC

    VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software

    Authors: Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei, Leye Wang

    Abstract: Vertical Federated Learning (VFL) is a distributed AI software deployment mechanism for cross-silo collaboration without accessing participants' data. However, existing VFL work lacks a mechanism to audit the execution correctness of the inference software of the data party. To address this problem, we design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to…

    Submitted 3 July, 2025; originally announced July 2025.

  9. arXiv:2507.02200  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

    Authors: Xiao Wang, Jingtao Jiang, Qiang Chen, Lan Chen, Lin Zhu, Yaowei Wang, Yonghong Tian, Jin Tang

    Abstract: Event stream based scene text recognition is a newly arising research topic in recent years which performs better than the widely used RGB cameras in extremely challenging scenarios, especially the low illumination, fast motion. Existing works either adopt end-to-end encoder-decoder framework or large language models for enhanced recognition, however, they are still limited by the challenges of in…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: A Strong Baseline for Reasoning based Event Stream Scene Text Recognition

  10. arXiv:2507.02187  [pdf, ps, other]

    cs.HC

    VergeIO: Depth-Aware Eye Interaction on Glasses

    Authors: Xiyuxing Zhang, Duc Vu, Chengyi Shen, Yuntao Wang, Yuanchun Shi, Justin Chan

    Abstract: There is growing industry interest in creating unobtrusive designs for electrooculography (EOG) sensing of eye gestures on glasses (e.g. JINS MEME and Apple eyewear). We present VergeIO, the first EOG-based glasses that enables depth-aware eye interaction using vergence with an optimized electrode layout and novel smart glass prototype. It can distinguish between four and six depth-based eye gestu…

    Submitted 2 July, 2025; originally announced July 2025.

  11. arXiv:2507.02145  [pdf, ps, other]

    cs.CL cs.AI

    Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization

    Authors: Keyan Jin, Yapeng Wang, Leonel Santos, Tao Fang, Xu Yang, Sio Kei Im, Hugo Gonçalo Oliveira

    Abstract: Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplor…

    Submitted 2 July, 2025; originally announced July 2025.

  12. arXiv:2507.02107  [pdf, ps, other]

    cs.SE cs.PL

    Structural Code Search using Natural Language Queries

    Authors: Ben Limpanukorn, Yanjun Wang, Zach Patterson, Pranav Garg, Murali Krishna Ramanathan, Xiaofei Ma, Anoop Deoras, Miryung Kim

    Abstract: Searching code is a common task that developers perform to understand APIs, learn common code patterns, and navigate code. Currently, developers most commonly search using keywords and regular expressions that are easy to use and widely available. Beyond keywords and regular expressions, structural code search tools allow developers to search for code based on its syntactic structure. This has num…

    Submitted 2 July, 2025; originally announced July 2025.

  13. arXiv:2507.02076  [pdf, ps, other]

    cs.AI cs.LG

    Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs

    Authors: Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, Peng Cheng, Yunzhou Wang, Pengyi Liao, Hanrui Huang, Bin Wang, Jianye Hao, Mark Coates

    Abstract: Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad spectrum of tasks. However, current models remain inefficient at reasoning: they apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones. This survey presents a comprehensive review of efficient test-time compute (TTC…

    Submitted 2 July, 2025; originally announced July 2025.

  14. arXiv:2507.01951  [pdf, ps, other]

    cs.LG cs.CL

    Test-Time Scaling with Reflective Generative Model

    Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie

    Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3's performance via the self-supervised process reward model (SPRM). Through sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without extra pr…

    Submitted 2 July, 2025; originally announced July 2025.

  15. arXiv:2507.01925  [pdf, ps, other]

    cs.RO

    A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    Authors: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang

    Abstract: The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 70 pages, 5 figures

  16. arXiv:2507.01908  [pdf, ps, other]

    cs.CV

    Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

    Authors: Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang

    Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and…

    Submitted 2 July, 2025; originally announced July 2025.

  17. arXiv:2507.01792  [pdf, ps, other]

    cs.CV

    FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization

    Authors: Peng Zheng, Ye Wang, Rui Ma, Zuxuan Wu

    Abstract: Subject-driven image generation plays a crucial role in applications such as virtual try-on and poster design. Existing approaches typically fine-tune pretrained generative models or apply LoRA-based adaptations for individual subjects. However, these methods struggle with multi-subject personalization, as combining independently adapted modules often requires complex re-tuning or joint optimizati…

    Submitted 2 July, 2025; originally announced July 2025.

  18. arXiv:2507.01653  [pdf, ps, other]

    cs.CV

    RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather

    Authors: Yuran Wang, Yingping Liang, Yutao Hu, Ying Fu

    Abstract: Learning-based stereo matching models struggle in adverse weather conditions due to the scarcity of corresponding training data and the challenges in extracting discriminative features from degraded images. These limitations significantly hinder zero-shot generalization to out-of-distribution weather conditions. In this paper, we propose RobuSTereo, a novel framework that enhances the zer…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: accepted by ICCV25

  19. arXiv:2507.01630  [pdf, ps, other]

    cs.CV cs.AI

    Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss

    Authors: Yuxiao Wang, Yu Lei, Zhenao Wei, Weiying Xue, Xinyu Jiang, Nan Zhuang, Qi Liu

    Abstract: The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to too much segmentation in areas with little interaction, and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed \te…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  20. arXiv:2507.01381  [pdf, ps, other]

    cs.LG cs.AI

    Distributional Soft Actor-Critic with Diffusion Policy

    Authors: Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, Shengbo Eben Li

    Abstract: Reinforcement learning has been proven to be highly effective in handling complex control tasks. Traditional methods typically use unimodal distributions, such as Gaussian distributions, to model the output of value distributions. However, unimodal distribution often and easily causes bias in value function estimation, leading to poor algorithm performance. This paper proposes a distributional rei…

    Submitted 3 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted IEEE ITSC 2025

  21. arXiv:2507.01335  [pdf, ps, other]

    cs.CL cs.AI

    LEDOM: An Open and Fundamental Reverse Language Model

    Authors: Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan

    Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Work in progress

  22. arXiv:2507.01334  [pdf, ps, other]

    cs.CL

    Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

    Authors: Nifu Dan, Yujun Cai, Yiwei Wang

    Abstract: Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challen…

    Submitted 3 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  23. arXiv:2507.01281  [pdf, ps, other]

    cs.CL cs.AI

    Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization

    Authors: Juan Chen, Baolong Bi, Wei Zhang, Jingyan Sui, Xiaofei Zhu, Yuanzhuo Wang, Lingrui Mei, Shenghua Liu

    Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating their parametric knowledge with external retrieved content. However, knowledge conflicts caused by internal inconsistencies or noisy retrieved content can severely undermine the generation reliability of RAG systems. In this work, we argue that LLMs should rethink all evidence, including both retrieved content…

    Submitted 1 July, 2025; originally announced July 2025.

  24. arXiv:2507.01041  [pdf, ps, other]

    cs.LG cs.AI

    Fast AI Model Splitting over Edge Networks

    Authors: Zuguang Li, Wen Wu, Shaohua Wu, Songge Zhang, Ye Wang, Xuemin Shen

    Abstract: Split learning (SL) has emerged as a computationally efficient approach for artificial intelligence (AI) model training, which can alleviate device-side computational workloads. However, complex AI model architectures pose high computational complexity to obtain the optimal model splitting. In this paper, we represent an arbitrary AI model as a directed acyclic graph (DAG), and then reformulate th…

    Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced July 2025.

    Comments: 13 pages, 14 figures

  25. arXiv:2507.01016  [pdf, ps, other]

    cs.RO cs.CV

    VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

    Authors: Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, Tong He

    Abstract: In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  26. arXiv:2507.01006  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Authors: GLM-V Team: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, et al. (54 additional authors not shown)

    Abstract: We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the fi…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  27. arXiv:2507.00992  [pdf, ps, other]

    cs.CV

    UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

    Authors: Yuanrui Wang, Cong Han, Yafei Li, Zhipeng Jin, Xiawei Li, SiNan Du, Wen Tao, Yi Yang, Shuanglong Li, Chun Yuan, Liu Lin

    Abstract: Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  28. arXiv:2507.00949  [pdf, ps, other]

    cs.DC cs.AR

    How Fast Can Graph Computations Go on Fine-grained Parallel Architectures

    Authors: Yuqing Wang, Charles Colley, Brian Wheatman, Jiya Su, David F. Gleich, Andrew A. Chien

    Abstract: Large-scale graph problems are of critical and growing importance and historically parallel architectures have provided little support. In the spirit of co-design, we explore the question, How fast can graph computing go on a fine-grained architecture? We explore the possibilities of an architecture optimized for fine-grained parallelism, natural programming, and the irregularity and skew found in…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 13 pages, 11 figures, 6 tables

  29. arXiv:2507.00891  [pdf, ps, other]

    cs.CL cs.AI

    MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes

    Authors: Yuheng Wang, Xianhe Tang, Pufeng Huang

    Abstract: Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions provide. To address these challenges, we introduce MemeCMD, an automati…

    Submitted 1 July, 2025; originally announced July 2025.

  30. arXiv:2507.00884  [pdf]

    physics.chem-ph cs.AI cs.LG physics.bio-ph

    A Scalable and Quantum-Accurate Foundation Model for Biomolecular Force Field via Linearly Tensorized Quadrangle Attention

    Authors: Qun Su, Kai Zhu, Qiaolin Gou, Jintu Zhang, Renling Hu, Yurong Li, Yongze Wang, Hui Zhang, Ziyi You, Linlong Jiang, Yu Kang, Jike Wang, Chang-Yu Hsieh, Tingjun Hou

    Abstract: Accurate atomistic biomolecular simulations are vital for disease mechanism understanding, drug discovery, and biomaterial design, but existing simulation methods exhibit significant limitations. Classical force fields are efficient but lack accuracy for transition states and fine conformational details critical in many chemical and biological processes. Quantum Mechanics (QM) methods are highly a…

    Submitted 1 July, 2025; originally announced July 2025.

  31. arXiv:2507.00880  [pdf, ps, other]

    cs.LG cs.AI

    NN-Former: Rethinking Graph Structure in Neural Architecture Representation

    Authors: Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang

    Abstract: The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each method has its disadvantages. GNNs lack the capabilities to represent compli…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to CVPR 2025. Code is available at https://github.com/XuRuihan/NNFormer

  32. arXiv:2507.00790  [pdf, ps, other]

    cs.CV cs.AI

    LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

    Authors: Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu

    Abstract: Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recur…

    Submitted 1 July, 2025; originally announced July 2025.

  33. arXiv:2507.00721  [pdf, ps, other]

    cs.CV

    UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

    Authors: Xiao Zhang, Fei Wei, Yong Wang, Wenda Zhao, Feiyi Li, Xiangxiang Chu

    Abstract: Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on m…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: ICCV2025

  34. arXiv:2507.00699  [pdf, ps, other]

    cs.SE

    A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback

    Authors: Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

    Abstract: Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluat…

    Submitted 1 July, 2025; originally announced July 2025.

  35. arXiv:2507.00608  [pdf, ps, other]

    cs.CV

    De-Simplifying Pseudo Labels to Enhancing Domain Adaptive Object Detection

    Authors: Zehua Fu, Chenguang Liu, Yuyu Chen, Jiaqi Zhou, Qingjie Liu, Yunhong Wang

    Abstract: Despite its significant success, object detection in traffic and transportation scenarios requires time-consuming and laborious efforts in acquiring high-quality labeled data. Therefore, Unsupervised Domain Adaptation (UDA) for object detection has recently gained increasing research attention. UDA for object detection has been dominated by domain alignment methods, which achieve top performance.…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by IEEE Transactions on Intelligent Transportation Systems. 15 pages, 10 figures

  36. arXiv:2507.00501  [pdf, ps, other]

    cs.CV

    Laplace-Mamba: Laplace Frequency Prior-Guided Mamba-CNN Fusion Network for Image Dehazing

    Authors: Yongzhen Wang, Liangliang Chen, Bingwen Hu, Heng Liu, Xiao-Ping Zhang, Mingqiang Wei

    Abstract: Recent progress in image restoration has underscored Spatial State Models (SSMs) as powerful tools for modeling long-range dependencies, owing to their appealing linear complexity and computational efficiency. However, SSM-based approaches exhibit limitations in reconstructing localized structures and tend to be less effective when handling high-dimensional data, frequently resulting in suboptimal…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 12 pages, 11 figures, 6 tables

  37. arXiv:2507.00435  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation

    Authors: Yi Ru Wang, Carter Ung, Grant Tannert, Jiafei Duan, Josephine Li, Amy Le, Rishabh Oswal, Markus Grotz, Wilbert Pumacay, Yuquan Deng, Ranjay Krishna, Dieter Fox, Siddhartha Srinivasa

    Abstract: We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior -- such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suit…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Project page: https://robo-eval.github.io

  38. arXiv:2507.00427  [pdf, ps, other]

    cs.DB

    Zero-Knowledge Verifiable Graph Query Evaluation via Expansion-Centric Operator Decomposition

    Authors: Hao Wu, Changzheng Wei, Yanhao Wang, Li Lin, Yilong Leng, Shiyu He, Minghao Zhao, Hanghang Wu, Ying Yan, Aoying Zhou

    Abstract: This paper investigates the feasibility of achieving zero-knowledge verifiability for graph databases, enabling database owners to cryptographically prove the query execution correctness without disclosing the underlying data. Although similar capabilities have been explored for relational databases, their implementation for graph databases presents unique challenges. This is mainly attributed to…

    Submitted 1 July, 2025; originally announced July 2025.

  39. arXiv:2507.00286  [pdf, ps, other]

    cs.HC cs.AI cs.ET

    Visual Privacy Management with Generative AI for Blind and Low-Vision People

    Authors: Tanusree Sharma, Yu-Yun Tseng, Lotus Zhang, Ayae Ide, Kelly Avery Mack, Leah Findlater, Danna Gurari, Yang Wang

    Abstract: Blind and low vision (BLV) individuals use Generative AI (GenAI) tools to interpret and manage visual content in their daily lives. While such tools can enhance the accessibility of visual content and so enable greater user independence, they also introduce complex challenges around visual privacy. In this paper, we investigate the current practices and future design preferences of blind and low v…

    Submitted 30 June, 2025; originally announced July 2025.

  40. arXiv:2507.00185  [pdf]

    eess.IV cs.AI cs.CV

    Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

    Authors: Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting

    Abstract: Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundatio…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: 42 pages, 3 composite figures, 4 tables

  41. arXiv:2507.00008  [pdf, other]

    cs.AI cs.CV cs.HC

    DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

    Authors: Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang

    Abstract: Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolit…

    Submitted 11 June, 2025; originally announced July 2025.

    Comments: 8 pages, 6 figures

  42. arXiv:2506.24120  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime

    Authors: Yuqing Wang, Shangding Gu

    Abstract: Data selection plays a crucial role in data-driven decision-making, including in large language models (LLMs), and is typically task-dependent. Properties such as data quality and diversity have been extensively studied and are known to enhance model performance. However, it remains unclear whether there exist other quantitative and general principles of data selection that can consistently improv…

    Submitted 30 June, 2025; originally announced June 2025.

  43. arXiv:2506.24081  [pdf, ps, other]

    quant-ph cs.AI cs.LG

    SQUASH: A SWAP-Based Quantum Attack to Sabotage Hybrid Quantum Neural Networks

    Authors: Rahul Kumar, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, Juntao Chen

    Abstract: We propose a circuit-level attack, SQUASH, a SWAP-Based Quantum Attack to sabotage Hybrid Quantum Neural Networks (HQNNs) for classification tasks. SQUASH is executed by inserting SWAP gate(s) into the variational quantum circuit of the victim HQNN. Unlike conventional noise-based or adversarial input attacks, SQUASH directly manipulates the circuit structure, leading to qubit misalignment and dis…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Keywords: Quantum Machine Learning, Hybrid Quantum Neural Networks, SWAP Test, Fidelity, Circuit-level Attack

  44. arXiv:2506.24063  [pdf, ps, other]

    cs.CV

    Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios

    Authors: Deng Li, Aming Wu, Yang Li, Yaowei Wang, Yahong Han

    Abstract: In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers.…

    Submitted 30 June, 2025; originally announced June 2025.

  45. arXiv:2506.24044  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    A Survey on Vision-Language-Action Models for Autonomous Driving

    Authors: Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun

    Abstract: The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructio…

    Submitted 30 June, 2025; originally announced June 2025.

  46. arXiv:2506.24026  [pdf, ps, other]

    cs.AI

    Constructing Non-Markovian Decision Process via History Aggregator

    Authors: Yongyi Wang, Wenxin Li

    Abstract: In the domain of algorithmic decision-making, non-Markovian dynamics manifest as a significant impediment, especially for paradigms such as Reinforcement Learning (RL), thereby exerting far-reaching consequences on the advancement and effectiveness of the associated systems. Nevertheless, the existing benchmarks are deficient in comprehensively assessing the capacity of decision algorithms to hand…

    Submitted 30 June, 2025; originally announced June 2025.

  47. arXiv:2506.23854  [pdf, ps, other]

    cs.CV cs.GR

    HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity

    Authors: Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Xianpeng Lang

    Abstract: Neural surface reconstruction faces persistent challenges in reconciling geometric fidelity with photometric consistency under complex scene conditions. We present HiNeuS, a unified framework that holistically addresses three core limitations in existing approaches: multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal co…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Published in International Conference on Computer Vision (ICCV) 2025

  48. arXiv:2506.23825  [pdf, ps, other]

    cs.CV

    Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

    Authors: Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin

    Abstract: Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025

  49. arXiv:2506.23726  [pdf, ps, other]

    cs.LG cs.AI

    System-Embedded Diffusion Bridge Models

    Authors: Bartlomiej Sobieski, Matthew Tivnan, Yuang Wang, Siyeop Yoon, Pengfei Jin, Dufan Wu, Quanzheng Li, Przemyslaw Biecek

    Abstract: Solving inverse problems -- recovering signals from incomplete or noisy measurements -- is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic pr…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Preprint

  50. arXiv:2506.23690  [pdf, ps, other]

    cs.CV

    SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

    Authors: Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen

    Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce vis…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Project page: https://lucaria-academy.github.io/SynMotion/