Showing 1–50 of 349 results for author: Mao, Q

Searching in archive cs.
  1. arXiv:2507.00898  [pdf, ps, other]

    cs.CV cs.CL

    ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

    Authors: Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie

    Abstract: Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world app…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025. Project page: https://zifuwan.github.io/ONLY/

  2. arXiv:2506.23783  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking

    Authors: Shiao Wang, Ju Huang, Qingchuan Ma, Jinfeng Gao, Chunyi Xu, Xiao Wang, Lan Chen, Bo Jiang

    Abstract: Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effe…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Journal extension of Mamba-FETrack which was published on Pattern Recognition and Computer Vision (PRCV) 2024

  3. arXiv:2506.18797  [pdf, ps, other]

    cs.LG

    A Multi-view Divergence-Convergence Feature Augmentation Framework for Drug-related Microbes Prediction

    Authors: Xin An, Ruijie Li, Qiao Ning, Shikai Guo, Hui Li, Qian Ma

    Abstract: In the study of drug function and precision medicine, identifying new drug-microbe associations is crucial. However, current methods isolate association and similarity analysis of drug and microbe, lacking effective inter-view optimization and coordinated multi-view feature fusion. In our study, a multi-view Divergence-Convergence Feature Augmentation framework for Drug-related Microbes Prediction…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 10 pages, 8 figures (including subfigures), 1 table. Xin An and Ruijie Li contributed equally to this work and should be considered co-first authors

  4. arXiv:2506.16578  [pdf, ps, other]

    cs.CV

    SafeTriage: Facial Video De-identification for Privacy-Preserving Stroke Triage

    Authors: Tongan Cai, Haomiao Ni, Wenchao Ma, Yuan Xue, Qian Ma, Rachel Leicht, Kelvin Wong, John Volpi, Stephen T. C. Wong, James Z. Wang, Sharon X. Huang

    Abstract: Effective stroke triage in emergency settings often relies on clinicians' ability to identify subtle abnormalities in facial muscle coordination. While recent AI models have shown promise in detecting such patterns from patient facial videos, their reliance on real patient data raises significant ethical and privacy challenges -- especially when training robust and generalizable models across inst…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: IPMI 2025

  5. arXiv:2506.13977  [pdf, ps, other]

    cs.SE cs.CL

    CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

    Authors: Shiting Huang, Zhen Fang, Zehui Chen, Siyu Yuan, Junjie Ye, Yu Zeng, Lin Chen, Qi Mao, Feng Zhao

    Abstract: The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as…

    Submitted 11 June, 2025; originally announced June 2025.

  6. arXiv:2506.11375  [pdf, ps, other]

    cs.AI cs.CL

    Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables

    Authors: Yitong Zhou, Mingyue Cheng, Qingyang Mao, Yucong Luo, Qi Liu, Yupeng Li, Xiaohan Zhang, Deguang Liu, Xin Li, Enhong Chen

    Abstract: Chemical tables encode complex experimental knowledge through symbolic expressions, structured variables, and embedded molecular graphics. Existing benchmarks largely overlook this multimodal and domain-specific complexity, limiting the ability of multimodal large language models to support scientific understanding in chemistry. In this work, we introduce ChemTable, a large-scale benchmark of real…

    Submitted 12 June, 2025; originally announced June 2025.

  7. arXiv:2506.08710  [pdf, ps, other]

    cs.CV

    SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

    Authors: Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang, Bin Ren, Nikola Popovic, Mingqiang Wei, Nicu Sebe, Luc Van Gool, Theo Gevers, Martin R. Oswald, Danda Pani Paudel

    Abstract: 3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. The current Language Gaussian Splatting line of work falls into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) general…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 15 pages, codes, data and benchmark will be released

  8. arXiv:2506.03691  [pdf, ps, other]

    cs.SE

    A Two-Staged LLM-Based Framework for CI/CD Failure Detection and Remediation with Industrial Validation

    Authors: Weiyuan Xu, Juntao Luo, Tao Huang, Kaixin Sui, Jie Geng, Qijun Ma, Isami Akasaka, Xiaoxue Shi, Jing Tang, Peng Cai

    Abstract: Continuous Integration and Continuous Deployment (CI/CD) pipelines are pivotal to modern software engineering, yet diagnosing and resolving their failures remains a complex and labor-intensive challenge. In this paper, we present LogSage, the first end-to-end LLM-powered framework that performs root cause analysis and solution generation from failed CI/CD pipeline logs. During the root cause analy…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 12 pages, 5 figures

  9. arXiv:2506.03598  [pdf]

    cs.CL cs.AI

    Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments

    Authors: Zetong Tang, Qian Ma, Di Wu

    Abstract: Using the best Text-to-SQL methods in resource-constrained environments is challenging due to their reliance on resource-intensive open-source models. This paper introduces Auto Prompt SQL (AP-SQL), a novel architecture designed to bridge the gap between resource-efficient small open-source models and the powerful capabilities of large closed-source models for Text-to-SQL translation. Our method de…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 4 pages, 2 figures, EITCE 2025

    MSC Class: 68T50

  10. arXiv:2506.03373  [pdf, ps, other]

    cs.CV cs.AI

    A Foundation Model for Spatial Proteomics

    Authors: Muhammad Shaban, Yuzhou Chang, Huaying Qiu, Yao Yu Yeo, Andrew H. Song, Guillaume Jaume, Yuchen Wang, Luca L. Weishaupt, Tong Ding, Anurag Vaidya, Abdallah Lamane, Daniel Shao, Mohammed Zidane, Yunhao Bai, Paige McCallum, Shuli Luo, Wenrui Wu, Yang Wang, Precious Cramer, Chi Ngai Chan, Pierre Stephan, Johanna Schaffenrath, Jia Le Lee, Hendrik A. Michel, Caiwei Tian, et al. (35 additional authors not shown)

    Abstract: Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-superv…

    Submitted 3 June, 2025; originally announced June 2025.

  11. arXiv:2506.01405  [pdf, ps, other]

    cs.LG

    SOC-DGL: Social Interaction Behavior Inspired Dual Graph Learning Framework for Drug-Target Interaction Identification

    Authors: Xiang Zhao, Ruijie Li, Qiao Ning, Shikai Guo, Hui Li, Qian Ma

    Abstract: The identification of drug-target interactions (DTI) is crucial for drug discovery and repositioning, as it reveals potential uses of existing drugs, aiding in the acceleration of the drug development process and reducing associated costs. Although the similarity information in DTI is important, most models are limited to mining direct similarity information within homogeneous graphs, overlooking t…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: 14 pages, 17 figures (including subfigures), 4 tables. Xiang Zhao and Ruijie Li contributed equally to this work and should be considered co-first authors. The source code and datasets are available at https://github.com/Zhaoxiang0422/SOC-DGL

  12. arXiv:2506.01297  [pdf, ps, other]

    cs.AI

    MobCLIP: Learning General-purpose Geospatial Representation at Scale

    Authors: Ya Wen, Jixuan Cai, Qiyao Ma, Linyan Li, Xinhua Chen, Chris Webster, Yulun Zhou

    Abstract: Representation learning of geospatial locations remains a core challenge in achieving general geospatial intelligence. Current embedding methods often lack versatility, limiting their utility across diverse tasks in both human and natural domains. We present MobCLIP, the first nationwide general-purpose location encoder, integrating an unprecedented diversity of data modalities through effective a…

    Submitted 3 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

  13. arXiv:2505.24567  [pdf, ps, other]

    cs.CV

    Unleashing the Power of Intermediate Domains for Mixed Domain Semi-Supervised Medical Image Segmentation

    Authors: Qinghe Ma, Jian Zhang, Lei Qi, Qian Yu, Yinghuan Shi, Yang Gao

    Abstract: Both limited annotation and domain shift are prevalent challenges in medical image segmentation. Traditional semi-supervised segmentation and unsupervised domain adaptation methods address one of these issues separately. However, the coexistence of limited annotation and domain shift is quite common, which motivates us to introduce a novel and challenging scenario: Mixed Domain Semi-supervised med…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted by IEEE TMI 2025. arXiv admin note: text overlap with arXiv:2404.08951

  14. arXiv:2505.23833  [pdf, ps, other]

    cs.CL

    Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

    Authors: Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji

    Abstract: In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract pat…

    Submitted 28 May, 2025; originally announced May 2025.

  15. arXiv:2505.17431  [pdf, ps, other]

    cs.LG

    HyperIMTS: Hypergraph Neural Network for Irregular Multivariate Time Series Forecasting

    Authors: Boyuan Li, Yicheng Luo, Zhen Liu, Junhao Zheng, Jianming Lv, Qianli Ma

    Abstract: Irregular multivariate time series (IMTS) are characterized by irregular time intervals within variables and unaligned observations across variables, posing challenges in learning temporal and variable dependencies. Many existing IMTS models either require padded samples to learn separately from temporal and variable dimensions, or represent original samples via bipartite graphs or sets. However,…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Accepted in ICML 2025

  16. arXiv:2505.16533  [pdf, ps, other]

    cs.CV

    Motion Matters: Compact Gaussian Streaming for Free-Viewpoint Video Reconstruction

    Authors: Jiacong Chen, Qingyu Mao, Youneng Bao, Xiandong Meng, Fanyang Meng, Ronggang Wang, Yongsheng Liang

    Abstract: 3D Gaussian Splatting (3DGS) has emerged as a high-fidelity and efficient paradigm for online free-viewpoint video (FVV) reconstruction, offering viewers rapid responsiveness and immersive experiences. However, existing online methods face the challenge of prohibitive storage requirements, primarily due to point-wise modeling that fails to exploit the motion properties. To address this limitation, we p…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: 17 pages, 9 figures

  17. arXiv:2505.15629  [pdf, ps, other]

    cs.MM

    Relationship Analysis of Image-Text Pair in SNS Posts

    Authors: Takuto Nabeoka, Yijun Duan, Qiang Ma

    Abstract: Social networking services (SNS) contain vast amounts of image-text posts, necessitating effective analysis of their relationships for improved information retrieval. This study addresses the classification of image-text pairs in SNS, overcoming prior limitations in distinguishing relationships beyond similarity. We propose a graph-based method to classify image-text pairs into similar and complem…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 15 pages, 5 figures

  18. arXiv:2505.13360  [pdf, ps, other]

    cs.CL cs.SE

    What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts

    Authors: Chenyang Yang, Yike Shi, Qianou Ma, Michael Xieyang Liu, Christian Kästner, Tongshuang Wu

    Abstract: Building LLM-powered software requires developers to communicate their requirements through natural language, but developer prompts are frequently underspecified, failing to fully capture many user-important requirements. In this paper, we present an in-depth analysis of prompt underspecification, showing that while LLMs can often (41.1%) guess unspecified requirements by default, such behavior is…

    Submitted 19 May, 2025; originally announced May 2025.

  19. arXiv:2505.11942  [pdf, ps, other]

    cs.AI

    LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners

    Authors: Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, Qianli Ma

    Abstract: Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)-based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to sys…

    Submitted 29 May, 2025; v1 submitted 17 May, 2025; originally announced May 2025.

    Comments: Project Page: https://caixd-220529.github.io/LifelongAgentBench/

  20. arXiv:2505.06892  [pdf, ps, other]

    cs.LG

    Learning Soft Sparse Shapes for Efficient Time-Series Classification

    Authors: Zhen Liu, Yicheng Luo, Boyuan Li, Emadeldeen Eldele, Min Wu, Qianli Ma

    Abstract: Shapelets are discriminative subsequences (or shapes) with high interpretability in time series classification. Due to the time-intensive nature of shapelet discovery, existing shapelet-based methods mainly focus on selecting discriminative shapes while discarding others to achieve candidate subsequence sparsification. However, this approach may exclude beneficial shapes and overlook the varying c…

    Submitted 3 June, 2025; v1 submitted 11 May, 2025; originally announced May 2025.

    Comments: Accepted in ICML 2025

  21. arXiv:2504.21323  [pdf, other]

    cs.CR cs.AI cs.LG

    How to Backdoor the Knowledge Distillation

    Authors: Chen Wu, Qian Ma, Prasenjit Mitra, Sencun Zhu

    Abstract: Knowledge distillation has become a cornerstone in modern machine learning systems, celebrated for its ability to transfer knowledge from a large, complex teacher model to a more efficient student model. Traditionally, this process is regarded as secure, assuming the teacher model is clean. This belief stems from conventional backdoor attacks relying on poisoned training data with backdoor trigger…

    Submitted 30 April, 2025; originally announced April 2025.

  22. arXiv:2504.20851  [pdf]

    cs.CY cs.AI

    Fostering Self-Directed Growth with Generative AI: Toward a New Learning Analytics Framework

    Authors: Qianrun Mao

    Abstract: In an era increasingly shaped by decentralized knowledge ecosystems and pervasive AI technologies, fostering sustainable learner agency has become a critical educational imperative. This study introduces a novel conceptual framework integrating Generative Artificial Intelligence and Learning Analytics to cultivate Self-Directed Growth, a dynamic competency that enables learners to iteratively driv…

    Submitted 29 April, 2025; originally announced April 2025.

  23. arXiv:2504.19867  [pdf, other]

    cs.CL cs.DC cs.LG

    semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

    Authors: Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang

    Abstract: Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticat…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: 18 pages, 16 figures

  24. arXiv:2504.19519  [pdf, other]

    cs.DC cs.CL cs.LG

    FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation

    Authors: Ke Hong, Xiuhong Li, Minxu Liu, Qiuli Mao, Tianqi Wu, Zixiao Huang, Lufang Chen, Zhong Wang, Yichong Zhang, Zhenhua Zhu, Guohao Dai, Yu Wang

    Abstract: Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency is an effective technique for mitigating the communication overhead…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: 17 pages, 11 figures, 4 tables

  25. arXiv:2504.19101  [pdf, other]

    cs.CL

    Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation

    Authors: Qianren Mao, Qili Zhang, Hanwen Hao, Zhentao Han, Runhua Xu, Weifeng Jiang, Qi Hu, Zhijun Chen, Tyler Zhou, Bo Li, Yangqiu Song, Jin Dong, Jianxin Li, Philip S. Yu

    Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution for enhancing the accuracy and credibility of Large Language Models (LLMs), particularly in Question & Answer tasks. This is achieved by incorporating proprietary and private data from integrated databases. However, private RAG systems face significant challenges due to the scarcity of private domain data and critica…

    Submitted 27 April, 2025; originally announced April 2025.

  26. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  27. arXiv:2504.12323  [pdf, other]

    cs.CL cs.AI

    The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation

    Authors: Zheng Zhang, Ning Li, Qi Liu, Rui Li, Weibo Gao, Qingyang Mao, Zhenya Huang, Baosheng Yu, Dacheng Tao

    Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents from external knowledge sources. By referencing this external knowledge, RAG effectively reduces the generation of factually incorrect content and addresses hallucination issues within LLMs. Recently, there has been growing attention to improving the performance and efficiency of RAG systems…

    Submitted 19 April, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

    Comments: 12 pages

  28. arXiv:2504.09248  [pdf, ps, other]

    eess.SY cs.CR

    Asymptotic stabilization under homomorphic encryption: A re-encryption free method

    Authors: Shuai Feng, Qian Ma, Junsoo Kim, Shengyuan Xu

    Abstract: In this paper, we propose methods to encrypt a pre-given dynamic controller with homomorphic encryption, without re-encrypting the control inputs. We first present a preliminary result showing that the coefficients in a pre-given dynamic controller can be scaled up into integers by the zooming-in factor in dynamic quantization, without utilizing re-encryption. However, a sufficiently small zoomi…

    Submitted 12 April, 2025; originally announced April 2025.

  29. arXiv:2504.07971  [pdf, other]

    cs.HC cs.AI

    SPHERE: An Evaluation Card for Human-AI Systems

    Authors: Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu

    Abstract: In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion on human-AI system evaluation design options, we present an evaluation card SPHERE, which encompasses five key dimensions: 1) What is being evaluated?; 2) How i…

    Submitted 24 March, 2025; originally announced April 2025.

  30. arXiv:2504.05594  [pdf, other]

    cs.CV

    Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model

    Authors: Qi Mao, Lan Chen, Yuchao Gu, Mike Zheng Shou, Ming-Hsuan Yang

    Abstract: Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly…

    Submitted 7 April, 2025; originally announced April 2025.

    Comments: under review

  31. arXiv:2504.05076  [pdf, other]

    cs.CV

    Content-Distortion High-Order Interaction for Blind Image Quality Assessment

    Authors: Shuai Liu, Qingyu Mao, Chao Li, Jiacong Chen, Fanyang Meng, Yonghong Tian, Yongsheng Liang

    Abstract: The content and distortion are widely recognized as the two primary factors affecting the visual quality of an image. While existing No-Reference Image Quality Assessment (NR-IQA) methods have modeled these factors, they fail to capture the complex interactions between content and distortions. This shortfall impairs their ability to accurately perceive quality. To confront this, we analyze the key…

    Submitted 7 April, 2025; originally announced April 2025.

    Comments: 19 pages (main text: 14 pages + appendix: 5 pages), 9 figures, 23 tables. In submission

  32. arXiv:2504.01440  [pdf, other]

    cs.LG

    Solving Time-Fractional Partial Integro-Differential Equations Using Tensor Neural Network

    Authors: Zhongshuo Lin, Qingkui Ma, Hehu Xie, Xiaobo Yin

    Abstract: In this paper, we propose a novel machine learning method based on adaptive tensor neural network subspace to solve linear time-fractional diffusion-wave equations and nonlinear time-fractional partial integro-differential equations. In this framework, the tensor neural network and Gauss-Jacobi quadrature are effectively combined to construct a universal numerical scheme for the temporal Caputo de…

    Submitted 6 April, 2025; v1 submitted 2 April, 2025; originally announced April 2025.

  33. arXiv:2504.01204  [pdf, other]

    cs.GR cs.CV

    Articulated Kinematics Distillation from Video Diffusion Models

    Authors: Xuan Li, Qianli Ma, Tsung-Yi Lin, Yongxin Chen, Chenfanfu Jiang, Ming-Yu Liu, Donglai Xiang

    Abstract: We present Articulated Kinematics Distillation (AKD), a framework for generating high-fidelity character animations by merging the strengths of skeleton-based animation and modern generative models. AKD uses a skeleton-based representation for rigged 3D assets, drastically reducing the Degrees of Freedom (DoFs) by focusing on joint-level control, which allows for efficient, consistent motion synth…

    Submitted 1 April, 2025; originally announced April 2025.

  34. arXiv:2503.23771  [pdf, other]

    cs.CV

    XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

    Authors: Fengxiang Wang, Hongzhen Wang, Mingshuo Chen, Di Wang, Yulin Wang, Zonghao Guo, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, Zhiyuan Liu, Maosong Sun

    Abstract: The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution that incorporates extremely complex semantic relationships. Existi…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: It has been accepted by CVPR2025

  35. arXiv:2503.22020  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    Authors: Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin

    Abstract: Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input--output mappings, lacking the intermediate reasoning steps crucial…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Project website: https://cot-vla.github.io/

    Journal ref: CVPR 2025

  36. arXiv:2503.18355  [pdf, other]

    cs.IR

    Food Recommendation With Balancing Comfort and Curiosity

    Authors: Yuto Sakai, Qiang Ma

    Abstract: Food is a key pleasure of traveling, but travelers face a trade-off between exploring curious new local food and choosing comfortable, familiar options. This creates demand for personalized recommendation systems that balance these competing factors. To the best of our knowledge, conventional recommendation methods cannot provide recommendations that offer both curiosity and comfort for food unkno…

    Submitted 24 March, 2025; originally announced March 2025.

  37. arXiv:2503.18052  [pdf, ps, other]

    cs.CV

    SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

    Authors: Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Martin R. Oswald, Danda Pani Paudel

    Abstract: Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or together at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Mean…

    Submitted 3 June, 2025; v1 submitted 23 March, 2025; originally announced March 2025.

    Comments: Our code, model, and dataset will be released at https://unique1i.github.io/SceneSplat_webpage/

  38. arXiv:2503.16997  [pdf, other]

    cs.CV

    Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation

    Authors: Qinghe Ma, Jian Zhang, Zekun Li, Lei Qi, Qian Yu, Yinghuan Shi

    Abstract: Large pretrained visual foundation models exhibit impressive general capabilities. However, the extensive prior knowledge inherent in these models can sometimes be a double-edged sword when adapting them to downstream tasks in specific domains. In the context of semi-supervised medical image segmentation with domain shift, foundation models like MedSAM tend to make overconfident predictions, some…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  39. arXiv:2503.14492  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

    Authors: NVIDIA: Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, et al. (16 additional authors not shown)

    Abstract: We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly contro…

    Submitted 1 April, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

  40. arXiv:2503.13327  [pdf, ps, other]

    cs.CV

    Edit Transfer: Learning Image Editing via Vision In-Context Relations

    Authors: Lan Chen, Qi Mao, Yuchao Gu, Mike Zheng Shou

    Abstract: We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style…

    Submitted 1 July, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

  41. arXiv:2503.11290  [pdf, ps, other]

    cs.CV eess.IV

    EmoAgent: A Multi-Agent Framework for Diverse Affective Image Manipulation

    Authors: Qi Mao, Haobo Hu, Yujie He, Difei Gao, Haokun Chen, Libiao Jin

    Abstract: Affective Image Manipulation (AIM) aims to alter visual elements within an image to evoke specific emotional responses from viewers. However, existing AIM approaches rely on rigid one-to-one mappings between emotions and visual cues, making them ill-suited for the inherently subjective and diverse ways in which humans perceive and express emotion. To address this, we introduce a novel task s…

    Submitted 23 June, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

  42. arXiv:2503.09735  [pdf, other]

    cs.CR cs.CV cs.LG

    Enhancing Adversarial Example Detection Through Model Explanation

    Authors: Qian Ma, Ziping Ye

    Abstract: Adversarial examples are a major problem for machine learning models, leading to a continuous search for effective defenses. One promising direction is to leverage model explanations to better understand and defend against these attacks. We looked at AmI, a method proposed by a NeurIPS 2018 spotlight paper that uses model explanations to detect adversarial examples. Our study shows that while AmI…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: 5 pages, 1 figure

  43. arXiv:2503.09642  [pdf, other]

    cs.GR cs.AI

    Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

    Authors: Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, et al. (7 additional authors not shown)

    Abstract: Video generation models have achieved remarkable progress in the past year. The quality of AI video continues to improve, but at the cost of larger model size, increased data quantity, and greater demand for training compute. In this report, we present Open-Sora 2.0, a commercial-level video generation model trained for only $200k. With this model, we demonstrate that the cost of training a top-pe…

    Submitted 23 March, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

  44. arXiv:2503.08160  [pdf, other]

    cs.LG

    Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation

    Authors: Taojie Kuang, Qianli Ma, Athanasios V. Vasilakos, Yu Wang, Qiang Cheng, Zhixiang Ren

    Abstract: In recent years, deep learning techniques have made significant strides in molecular generation for specific targets, driving advancements in drug discovery. However, existing molecular generation methods present significant limitations: those operating at the atomic level often lack synthetic feasibility, drug-likeness, and interpretability, while fragment-based approaches frequently overlook com…

    Submitted 11 March, 2025; originally announced March 2025.

  45. arXiv:2503.03792  [pdf, other]

    cs.LG cs.AI

    Rebalanced Multimodal Learning with Data-aware Unimodal Sampling

    Authors: Qingyuan Jiang, Zhouyang Chi, Xiao Ma, Qirong Mao, Yang Yang, Jinhui Tang

    Abstract: To address the modality learning degeneration caused by modality imbalance, existing multimodal learning (MML) approaches primarily attempt to balance the optimization process of each modality from the perspective of model learning. However, almost all existing methods ignore the modality imbalance caused by unimodal data sampling, i.e., equal unimodal data sampling often results in discrepancies…

    Submitted 5 March, 2025; originally announced March 2025.

  46. arXiv:2503.02662  [pdf, other]

    cs.CV

    10K is Enough: An Ultra-Lightweight Binarized Network for Infrared Small-Target Detection

    Authors: Biqiao Xin, Qianchen Mao, Bingshu Wang, Jiangbin Zheng, Yong Zhao, C. L. Philip Chen

    Abstract: The widespread deployment of Infrared Small-Target Detection (IRSTD) algorithms on edge devices necessitates the exploration of model compression techniques. Binarized neural networks (BNNs) are distinguished by their exceptional efficiency in model compression. However, the small size of infrared targets introduces stringent precision requirements for the IRSTD task, while the inherent precision…

    Submitted 10 March, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

  47. arXiv:2502.20847  [pdf, other]

    cs.LG

    Gradient Imbalance in Direct Preference Optimization

    Authors: Qinwei Ma, Jingzhe Shi, Can Jin, Jenq-Neng Hwang, Serge Belongie, Lei Li

    Abstract: Direct Preference Optimization (DPO) has been proposed as a promising alternative to Proximal Policy Optimization (PPO) based Reinforcement Learning with Human Feedback (RLHF). However, empirical evaluations consistently reveal suboptimal performance in DPO compared to common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance a…

    Submitted 28 February, 2025; originally announced February 2025.

    Comments: 15 pages, 2 figures

  48. arXiv:2502.18036  [pdf, other]

    cs.CL

    Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

    Authors: Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Dingqi Yang, Hailong Sun, Philip S. Yu

    Abstract: LLM Ensemble -- which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths -- has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemb…

    Submitted 15 May, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

    Comments: 9 pages, 2 figures, codebase: https://github.com/junchenzhi/Awesome-LLM-Ensemble

  49. arXiv:2502.16770  [pdf, other]

    cs.CL cs.AI

    LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint

    Authors: Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao

    Abstract: Fine-tuning pre-trained Large Language Models (LLMs) for specialized tasks incurs substantial computational and data costs. While model merging offers a training-free solution to integrate multiple task-specific models, existing methods suffer from safety-utility conflicts where enhanced general capabilities degrade safety safeguards. We identify two root causes: neuron misidentification…

    Submitted 23 February, 2025; originally announced February 2025.

  50. arXiv:2502.16399  [pdf, other]

    cs.IR cs.AI cs.CL

    Ensemble ToT of LLMs and Its Application to Automatic Grading System for Supporting Self-Learning

    Authors: Yuki Ito, Qiang Ma

    Abstract: Providing students with detailed and timely grading feedback is essential for self-learning. While existing LLM-based grading systems are promising, most of them rely on one single model, which limits their performance. To address this, we propose Ensemble Tree-of-Thought (ToT), a framework that enhances LLM outputs by integrating multiple models. Using this framework, we develop a grading system.…

    Submitted 22 February, 2025; originally announced February 2025.

    Comments: 33 pages, 25 figures

    ACM Class: I.2.7; K.3.1; K.3.2