Search | arXiv e-print repository

ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition

Authors: You Zhou, Lijiang Chen, Guangxia Cui, Wenpei Bai, Yu Guo, Shuchang Lyu, Guangliang Cheng, Qi Zhao

Abstract: Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder its progress. To address this gap, we introduce a vital ovarian tumor pathological… ▽ More Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder its progress. To address this gap, we introduce a vital ovarian tumor pathological recognition dataset called \textbf{ViTaL} that contains \textbf{V}isual, \textbf{T}abular and \textbf{L}inguistic modality data of 496 patients across six pathological categories. The ViTaL dataset comprises three subsets corresponding to different patient data modalities: visual data from 2216 two-dimensional ultrasound images, tabular data from medical examinations of 496 patients, and linguistic data from ultrasound reports of 496 patients. It is insufficient to merely distinguish between benign and malignant ovarian tumors in clinical practice. To enable multi-pathology classification of ovarian tumor, we propose a ViTaL-Net based on the Triplet Hierarchical Offset Attention Mechanism (THOAM) to minimize the loss incurred during feature fusion of multi-modal data. This mechanism could effectively enhance the relevance and complementarity between information from different modalities. ViTaL-Net serves as a benchmark for the task of multi-pathology, multi-modality classification of ovarian tumors. In our comprehensive experiments, the proposed method exhibited satisfactory performance, achieving accuracies exceeding 90\% on the two most common pathological types of ovarian tumor and an overall performance of 85\%. Our dataset and code are available at https://github.com/GGbond-study/vitalnet. △ Less

Submitted 6 July, 2025; originally announced July 2025.

arXiv:2507.03892 [pdf]

Is AI mingling or bullying me? Exploring User Interactions with a Chatbot in China

Authors: Nuo Chen, Pu Yan, Jia Li, Qixuan Zhao

Abstract: Since its viral emergence in early 2024, Comment Robert-a Weibo-launched social chatbot-has gained widespread attention on the Chinese Internet for its unsolicited and unpredictable comments on user posts. Unlike conventional chatbots that respond only to user prompts, Robert autonomously intervenes in public discourse, representing a novel form of AI-driven social media engagement. This study exa… ▽ More Since its viral emergence in early 2024, Comment Robert-a Weibo-launched social chatbot-has gained widespread attention on the Chinese Internet for its unsolicited and unpredictable comments on user posts. Unlike conventional chatbots that respond only to user prompts, Robert autonomously intervenes in public discourse, representing a novel form of AI-driven social media engagement. This study examines how such autonomous, algorithmic communication reshapes human-AI interaction in everyday online contexts. Using computational linguistics techniques, including topic classification and sentiment analysis, we analyze over 3,900 user-submitted interactions from the "Robert Victims Alliance", a grassroots community documenting their exchanges with the chatbot. Topic modeling reveals six key themes: interpersonal relationships, self-identity, academic and career concerns, subcultures, sensitive topics, and social events. Complementing this, mixed-methods emotional analysis uncovers a complex affective spectrum: Robert's casual remarks can evoke warmth and humor but may also conceal covert hostility beneath neutral or polite language. These ambivalent interactions reveal an emerging emotional divide between humans and socially proactive AI, suggesting that while Robert simulates social presence, it often falls short of users' emotional needs. Our study contributes to human-AI interaction research by offering new insights into the affective dynamics and socio-technical implications of unsolicited AI bots' participation in digital public spheres. △ Less

Submitted 5 July, 2025; originally announced July 2025.

arXiv:2507.00917 [pdf, ps, other]

A Survey: Learning Embodied Intelligence from Physical Simulators and World Models

Authors: Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai

Abstract: The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interacti… ▽ More The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interactions. Two foundational technologies, physical simulators and world models, have emerged as critical enablers in this quest. Physical simulators provide controlled, high-fidelity environments for training and evaluating robotic agents, allowing safe and efficient development of complex behaviors. In contrast, world models empower robots with internal representations of their surroundings, enabling predictive planning and adaptive decision-making beyond direct sensory input. This survey systematically reviews recent advances in learning embodied AI through the integration of physical simulators and world models. We analyze their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots, and discuss the interplay between external simulation and internal modeling in bridging the gap between simulated training and real-world deployment. By synthesizing current progress and identifying open challenges, this survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. We also maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey. △ Less

Submitted 1 July, 2025; originally announced July 2025.

Comments: https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey

arXiv:2506.20045 [pdf, ps, other]

Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception

Authors: Eric C. Joyce, Qianwen Zhao, Nathaniel Burgdorfer, Long Wang, Philippos Mordohai

Abstract: Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the… ▽ More Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the downstream task of robotic grasping. We propose a method for training lightweight, deep networks to predict whether a grasp guided by an image-based pose estimate will succeed before that grasp is attempted. We generate training data for our networks via object pose estimation on real images and simulated grasping. We also find that, despite high object variability in grasping trials, networks benefit from training on all objects jointly, suggesting that a diverse variety of objects can nevertheless contribute to the same goal. △ Less

Submitted 26 June, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

Comments: Accepted to IROS 2025

arXiv:2506.19937 [pdf, ps, other]

The Most Important Features in Generalized Additive Models Might Be Groups of Features

Authors: Tomas M. Bosschieter, Luis Franca, Jessica Wolk, Yiyuan Wu, Bella Mehta, Joseph Dehoney, Orsolya Kiss, Fiona C. Baker, Qingyu Zhao, Rich Caruana, Kilian M. Pohl

Abstract: While analyzing the importance of features has become ubiquitous in interpretable machine learning, the joint signal from a group of related features is sometimes overlooked or inadvertently excluded. Neglecting the joint signal could bypass a critical insight: in many instances, the most significant predictors are not isolated features, but rather the combined effect of groups of features. This c… ▽ More While analyzing the importance of features has become ubiquitous in interpretable machine learning, the joint signal from a group of related features is sometimes overlooked or inadvertently excluded. Neglecting the joint signal could bypass a critical insight: in many instances, the most significant predictors are not isolated features, but rather the combined effect of groups of features. This can be especially problematic for datasets that contain natural groupings of features, including multimodal datasets. This paper introduces a novel approach to determine the importance of a group of features for Generalized Additive Models (GAMs) that is efficient, requires no model retraining, allows defining groups posthoc, permits overlapping groups, and remains meaningful in high-dimensional settings. Moreover, this definition offers a parallel with explained variation in statistics. We showcase properties of our method on three synthetic experiments that illustrate the behavior of group importance across various data regimes. We then demonstrate the importance of groups of features in identifying depressive symptoms from a multimodal neuroscience dataset, and study the importance of social determinants of health after total hip arthroplasty. These two case studies reveal that analyzing group importance offers a more accurate, holistic view of the medical issues compared to a single-feature analysis. △ Less

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19270 [pdf, ps, other]

Continuous-variable Quantum Diffusion Model for State Generation and Restoration

Authors: Haitao Huang, Chuangtao Chen, Qinglin Zhao

Abstract: The generation and preservation of complex quantum states against environmental noise are paramount challenges in advancing continuous-variable (CV) quantum information processing. This paper introduces a novel framework based on continuous-variable quantum diffusion principles, synergizing them with CV quantum neural networks (CVQNNs) to address these dual challenges. For the task of state genera… ▽ More The generation and preservation of complex quantum states against environmental noise are paramount challenges in advancing continuous-variable (CV) quantum information processing. This paper introduces a novel framework based on continuous-variable quantum diffusion principles, synergizing them with CV quantum neural networks (CVQNNs) to address these dual challenges. For the task of state generation, our Continuous-Variable Quantum Diffusion Generative model (CVQD-G) employs a physically driven forward diffusion process using a thermal loss channel, which is then inverted by a learnable, parameter-efficient backward denoising process based on a CVQNN with time-embedding. This framework's capability is further extended for state recovery by the Continuous-Variable Quantum Diffusion Restoration model (CVQD-R), a specialized variant designed to restore quantum states, particularly coherent states with unknown parameters, from thermal degradation. Extensive numerical simulations validate these dual capabilities, demonstrating the high-fidelity generation of diverse Gaussian (coherent, squeezed) and non-Gaussian (Fock, cat) states, typically with fidelities exceeding 99%, and confirming the model's ability to robustly restore corrupted states. Furthermore, a comprehensive complexity analysis reveals favorable training and inference costs, highlighting the framework's efficiency, scalability, and its potential as a robust tool for quantum state engineering and noise mitigation in realistic CV quantum systems. △ Less

Submitted 23 June, 2025; originally announced June 2025.

Comments: 15+3 pages, 14 figures, 7 tables

MSC Class: 81P68

arXiv:2506.18898 [pdf, ps, other]

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang

Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expan… ▽ More This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com △ Less

Submitted 23 June, 2025; originally announced June 2025.

Comments: Project page: https://tar.csuhan.com

arXiv:2506.17108 [pdf, ps, other]

Searching for a Hidden Markov Anomaly over Multiple Processes

Authors: Levli Citron, Kobi Cohen, Qing Zhao

Abstract: We address the problem of detecting an anomalous process among a large number of processes. At each time t, normal processes are in state zero (normal state), while the abnormal process may be in either state zero (normal state) or state one (abnormal state), with the states being hidden. The transition between states for the abnormal process is governed by a Markov chain over time. At each time s… ▽ More We address the problem of detecting an anomalous process among a large number of processes. At each time t, normal processes are in state zero (normal state), while the abnormal process may be in either state zero (normal state) or state one (abnormal state), with the states being hidden. The transition between states for the abnormal process is governed by a Markov chain over time. At each time step, observations can be drawn from a selected subset of processes. Each probed process generates an observation depending on its hidden state, either a typical distribution under state zero or an abnormal distribution under state one. The objective is to design a sequential search strategy that minimizes the expected detection time, subject to an error probability constraint. In contrast to prior works that assume i.i.d. observations, we address a new setting where anomalies evolve according to a hidden Markov model. To this end, we propose a novel algorithm, dubbed Anomaly Detection under Hidden Markov model (ADHM), which dynamically adapts the probing strategy based on accumulated statistical evidence and predictive belief updates over hidden states. ADHM effectively leverages temporal correlations to focus sensing resources on the most informative processes. The algorithm is supported by an asymptotic theoretical foundation, grounded in an oracle analysis that characterizes the fundamental limits of detection under the assumption of a known distribution of the hidden states. In addition, the algorithm demonstrates strong empirical performance, consistently outperforming existing methods in extensive simulations. △ Less

Submitted 20 June, 2025; originally announced June 2025.

Comments: 13 pages, 9 figures

arXiv:2506.15715 [pdf, ps, other]

NeuronSeek: On Stability and Expressivity of Task-driven Neurons

Authors: Hanyu Pei, Jing-Xiao Liao, Qibin Zhao, Ting Gao, Shijun Zhang, Xiaoge Zhang, Feng-Lei Fan

Abstract: Drawing inspiration from our human brain that designs different neurons for different tasks, recent advances in deep learning have explored modifying a network's neurons to develop so-called task-driven neurons. Prototyping task-driven neurons (referred to as NeuronSeek) employs symbolic regression (SR) to discover the optimal neuron formulation and construct a network from these optimized neurons… ▽ More Drawing inspiration from our human brain that designs different neurons for different tasks, recent advances in deep learning have explored modifying a network's neurons to develop so-called task-driven neurons. Prototyping task-driven neurons (referred to as NeuronSeek) employs symbolic regression (SR) to discover the optimal neuron formulation and construct a network from these optimized neurons. Along this direction, this work replaces symbolic regression with tensor decomposition (TD) to discover optimal neuronal formulations, offering enhanced stability and faster convergence. Furthermore, we establish theoretical guarantees that modifying the aggregation functions with common activation functions can empower a network with a fixed number of parameters to approximate any continuous function with an arbitrarily small error, providing a rigorous mathematical foundation for the NeuronSeek framework. Extensive empirical evaluations demonstrate that our NeuronSeek-TD framework not only achieves superior stability, but also is competitive relative to the state-of-the-art models across diverse benchmarks. The code is available at https://github.com/HanyuPei22/NeuronSeek. △ Less

Submitted 31 May, 2025; originally announced June 2025.

Comments: 14 pages, 10 figures

arXiv:2506.12321 [pdf, ps, other]

Extending Memorization Dynamics in Pythia Models from Instance-Level Insights

Authors: Jie Zhang, Qinghua Zhao, Lei Li, Chi-ho Lin

Abstract: Large language models have demonstrated a remarkable ability for verbatim memorization. While numerous works have explored factors influencing model memorization, the dynamic evolution memorization patterns remains underexplored. This paper presents a detailed analysis of memorization in the Pythia model family across varying scales and training steps under prefix perturbations. Using granular met… ▽ More Large language models have demonstrated a remarkable ability for verbatim memorization. While numerous works have explored factors influencing model memorization, the dynamic evolution memorization patterns remains underexplored. This paper presents a detailed analysis of memorization in the Pythia model family across varying scales and training steps under prefix perturbations. Using granular metrics, we examine how model architecture, data characteristics, and perturbations influence these patterns. Our findings reveal that: (1) as model scale increases, memorization expands incrementally while efficiency decreases rapidly; (2) as model scale increases, the rate of new memorization acquisition decreases while old memorization forgetting increases; (3) data characteristics (token frequency, repetition count, and uncertainty) differentially affect memorized versus non-memorized samples; and (4) prefix perturbations reduce memorization and increase generation uncertainty proportionally to perturbation strength, with low-redundancy samples showing higher vulnerability and larger models offering no additional robustness. These findings advance our understanding of memorization mechanisms, with direct implications for training optimization, privacy safeguards, and architectural improvements. △ Less

Submitted 13 June, 2025; originally announced June 2025.

Comments: 5 figures

arXiv:2506.10941 [pdf, ps, other]

VINCIE: Unlocking In-context Image Editing from Video

Authors: Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang

Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable appro… ▽ More In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications. △ Less

Submitted 12 June, 2025; originally announced June 2025.

Comments: Project page: https://vincie2025.github.io/

arXiv:2506.10406 [pdf, ps, other]

PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

Authors: Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generativ… ▽ More Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG's dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency. △ Less

Submitted 12 June, 2025; originally announced June 2025.

arXiv:2506.07809 [pdf, ps, other]

Incorporating Uncertainty-Guided and Top-k Codebook Matching for Real-World Blind Image Super-Resolution

Authors: Weilei Wen, Tianyi Zhang, Qianqian Zhao, Zhaohui Zheng, Chunle Guo, Xiuli Shao, Chongyi Li

Abstract: Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address t… ▽ More Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: (1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, (2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and (3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. We will release the code upon formal publication. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.06787 [pdf, ps, other]

FuncGNN: Learning Functional Semantics of Logic Circuits with Graph Neural Networks

Authors: Qiyun Zhao

Abstract: As integrated circuit scale grows and design complexity rises, effective circuit representation helps support logic synthesis, formal verification, and other automated processes in electronic design automation. And-Inverter Graphs (AIGs), as a compact and canonical structure, are widely adopted for representing Boolean logic in these workflows. However, the increasing complexity and integration de… ▽ More As integrated circuit scale grows and design complexity rises, effective circuit representation helps support logic synthesis, formal verification, and other automated processes in electronic design automation. And-Inverter Graphs (AIGs), as a compact and canonical structure, are widely adopted for representing Boolean logic in these workflows. However, the increasing complexity and integration density of modern circuits introduce structural heterogeneity and global logic information loss in AIGs, posing significant challenges to accurate circuit modeling. To address these issues, we propose FuncGNN, which integrates hybrid feature aggregation to extract multi-granularity topological patterns, thereby mitigating structural heterogeneity and enhancing logic circuit representations. FuncGNN further introduces gate-aware normalization that adapts to circuit-specific gate distributions, improving robustness to structural heterogeneity. Finally, FuncGNN employs multi-layer integration to merge intermediate features across layers, effectively synthesizing local and global semantic information for comprehensive logic representations. Experimental results on two logic-level analysis tasks (i.e., signal probability prediction and truth-table distance prediction) demonstrate that FuncGNN outperforms existing state-of-the-art methods, achieving improvements of 2.06% and 18.71%, respectively, while reducing training time by approximately 50.6% and GPU memory usage by about 32.8%. △ Less

Submitted 7 June, 2025; originally announced June 2025.

arXiv:2506.06710 [pdf, ps, other]

A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution

Authors: Qianqian Zhao, Chunle Guo, Tianyi Zhang, Junpei Zhang, Peiyang Jia, Tan Su, Wenjie Jiang, Chongyi Li

Abstract: Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous i… ▽ More Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous innovative and effective approaches have been proposed, predominantly based on deep learning techniques, involving diverse network architectures, loss functions, projection strategies, and training datasets. This paper presents a systematic review of recent progress in omnidirectional image and video super-resolution, focusing on deep learning-based methods. Given that existing datasets predominantly rely on synthetic degradation and fall short in capturing real-world distortions, we introduce a new dataset, 360Insta, that comprises authentically degraded omnidirectional images and videos collected under diverse conditions, including varying lighting, motion, and exposure settings. This dataset addresses a critical gap in current omnidirectional benchmarks and enables more robust evaluation of the generalization capabilities of omnidirectional super-resolution methods. We conduct comprehensive qualitative and quantitative evaluations of existing methods on both public datasets and our proposed dataset. Furthermore, we provide a systematic overview of the current status of research and discuss promising directions for future exploration. All datasets, methods, and evaluation metrics introduced in this work are publicly available and will be regularly updated. Project page: https://github.com/nqian1/Survey-on-ODISR-and-ODVSR. △ Less

Submitted 7 June, 2025; originally announced June 2025.

arXiv:2506.04185 [pdf, ps, other]

R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning

Authors: Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu

Abstract: Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework fo… ▽ More Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning-Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at https://github.com/QingFei1/R-Search. △ Less

Submitted 4 June, 2025; originally announced June 2025.

Comments: 16 pages, 3 figures

arXiv:2506.01713 [pdf, ps, other]

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Authors: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan

Abstract: Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge l… ▽ More Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality. △ Less

Submitted 20 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

Comments: Technical report

arXiv:2505.24378 [pdf, other]

Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer

Authors: Yilun Kong, Guozheng Ma, Qi Zhao, Haoyu Wang, Li Shen, Xueqian Wang, Dacheng Tao

Abstract: Despite recent advancements in offline multi-task reinforcement learning (MTRL) have harnessed the powerful capabilities of the Transformer architecture, most approaches focus on a limited number of tasks, with scaling to extremely massive tasks remaining a formidable challenge. In this paper, we first revisit the key impact of task numbers on current MTRL method, and further reveal that naively e… ▽ More Despite recent advancements in offline multi-task reinforcement learning (MTRL) have harnessed the powerful capabilities of the Transformer architecture, most approaches focus on a limited number of tasks, with scaling to extremely massive tasks remaining a formidable challenge. In this paper, we first revisit the key impact of task numbers on current MTRL method, and further reveal that naively expanding the parameters proves insufficient to counteract the performance degradation as the number of tasks escalates. Building upon these insights, we propose M3DT, a novel mixture-of-experts (MoE) framework that tackles task scalability by further unlocking the model's parameter scalability. Specifically, we enhance both the architecture and the optimization of the agent, where we strengthen the Decision Transformer (DT) backbone with MoE to reduce task load on parameter subsets, and introduce a three-stage training mechanism to facilitate efficient training with optimal performance. Experimental results show that, by increasing the number of experts, M3DT not only consistently enhances its performance as model expansion on the fixed task numbers, but also exhibits remarkable task scalability, successfully extending to 160 tasks with superior performance. △ Less

Submitted 30 May, 2025; originally announced May 2025.

Comments: ICML 2025

arXiv:2505.23861 [pdf, ps, other]

BiBLDR: Bidirectional Behavior Learning for Drug Repositioning

Authors: Renye Zhang, Mengyun Yang, Qichang Zhao, Jianxin Wang

Abstract: Drug repositioning aims to identify potential new indications for existing drugs to reduce the time and financial costs associated with developing new drugs. Most existing deep learning-based drug repositioning methods predominantly utilize graph-based representations. However, graph-based drug repositioning methods struggle to perform effective inference in cold-start scenarios involving novel dr… ▽ More Drug repositioning aims to identify potential new indications for existing drugs to reduce the time and financial costs associated with developing new drugs. Most existing deep learning-based drug repositioning methods predominantly utilize graph-based representations. However, graph-based drug repositioning methods struggle to perform effective inference in cold-start scenarios involving novel drugs because of the lack of association information with the diseases. Unlike traditional graph-based approaches, we propose a bidirectional behavior learning strategy for drug repositioning, known as BiBLDR. This innovative framework redefines drug repositioning as a behavior sequential learning task to capture drug-disease interaction patterns. First, we construct bidirectional behavioral sequences based on drug and disease sides. The consideration of bidirectional information ensures a more meticulous and rigorous characterization of the behavioral sequences. Subsequently, we propose a two-stage strategy for drug repositioning. In the first stage, we construct prototype spaces to characterize the representational attributes of drugs and diseases. In the second stage, these refined prototypes and bidirectional behavior sequence data are leveraged to predict potential drug-disease associations. Based on this learning approach, the model can more robustly and precisely capture the interactive relationships between drug and disease features from bidirectional behavioral sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on benchmark datasets. Meanwhile, BiBLDR demonstrates significantly superior performance compared to previous methods in cold-start scenarios. Our code is published in https://github.com/Renyeeah/BiBLDR. △ Less

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.23537 [pdf, ps, other]

Domain-Aware Tensor Network Structure Search

Authors: Giorgos Iacovides, Wuyang Zhou, Chao Li, Qibin Zhao, Danilo Mandic

Abstract: Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms are computationally expensive as they require extensive function evaluations, which is prohibitive for real-world applications. In addition, e… ▽ More Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms are computationally expensive as they require extensive function evaluations, which is prohibitive for real-world applications. In addition, existing methods ignore valuable domain information inherent in real-world tensor data and lack transparency in their identified TN structures. To this end, we propose a novel TN-SS framework, termed the tnLLM, which incorporates domain information about the data and harnesses the reasoning capabilities of large language models (LLMs) to directly predict suitable TN structures. The proposed framework involves a domain-aware prompting pipeline which instructs the LLM to infer suitable TN structures based on the real-world relationships between tensor modes. In this way, our approach is capable of not only iteratively optimizing the objective function, but also generating domain-aware explanations for the identified structures. Experimental results demonstrate that tnLLM achieves comparable TN-SS objective function values with much fewer function evaluations compared to SOTA algorithms. Furthermore, we demonstrate that the LLM-enabled domain information can be used to find good initializations in the search space for sampling-based SOTA methods to accelerate their convergence while preserving theoretical performance guarantees. △ Less

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2505.22543 [pdf, ps, other]

Scaling-up Perceptual Video Quality Assessment

Authors: Ziheng Jia, Zicheng Zhang, Zeyu Zhang, Yingji Liang, Xiaorong Zhu, Chunyi Li, Jinliang Han, Haoning Wu, Bin Wang, Haoran Zhang, Guanyu Zhu, Qiyong Zhao, Xiaohong Liu, Guangtao Zhai, Xiongkuo Min

Abstract: The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of scaling law remains unprecedented due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose \textbf{OmniVQA}, an effic… ▽ More The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of scaling law remains unprecedented due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose \textbf{OmniVQA}, an efficient framework designed to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create \textbf{OmniVQA-Chat-400K}, the largest MIDB in the VQA field concurrently. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we have built the \textbf{OmniVQA-MOS-20K} dataset to enhance the model's quantitative quality rating capabilities. We then introduce a \textbf{complementary} training strategy that effectively leverages the knowledge from datasets for quality understanding and quality rating tasks. Furthermore, we propose the \textbf{OmniVQA-FG (fine-grain)-Benchmark} to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks. △ Less

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.20297 [pdf, other]

DiSA: Diffusion Step Annealing in Autoregressive Image Generation

Authors: Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng

Abstract: An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated… ▽ More An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To intuitively explain, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on our finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and albeit simple, achieves $5-10\times$ faster inference for MAR and Harmon and $1.4-2.5\times$ for FlowAR and xAR, while maintaining the generation quality. △ Less

Submitted 26 May, 2025; originally announced May 2025.

Comments: Our code is available at https://github.com/Qinyu-Allen-Zhao/DiSA

arXiv:2505.17524 [pdf, other]

Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

Authors: Ye Du, Chen Yang, Nanxi Yu, Wanyu Lin, Qian Zhao, Shujun Wang

Abstract: De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models usually encode the observed mass spectra into latent representations from which peptides are predicted autoregressively. However, the issue of missing fragmentation, attri… ▽ More De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models usually encode the observed mass spectra into latent representations from which peptides are predicted autoregressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called \underline{\textbf{L}}atent \underline{\textbf{I}}mputation before \underline{\textbf{P}}rediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at \href{https://github.com/usr922/LIPNovo}{https://github.com/usr922/LIPNovo}. △ Less

Submitted 23 May, 2025; originally announced May 2025.

Comments: Accepted by ICML 2025

arXiv:2505.13390 [pdf, other]

doi 10.1145/3721238.3730720

MGPBD: A Multigrid Accelerated Global XPBD Solver

Authors: Chunlei Li, Peng Yu, Tiantian Liu, Siyuan Yu, Yuting Xiao, Shuai Li, Aimin Hao, Yang Gao, Qinping Zhao

Abstract: We introduce a novel Unsmoothed Aggregation (UA) Algebraic Multigrid (AMG) method combined with Preconditioned Conjugate Gradient (PCG) to overcome the limitations of Extended Position-Based Dynamics (XPBD) in high-resolution and high-stiffness simulations. While XPBD excels in simulating deformable objects due to its speed and simplicity, its nonlinear Gauss-Seidel (GS) solver often struggles wit… ▽ More We introduce a novel Unsmoothed Aggregation (UA) Algebraic Multigrid (AMG) method combined with Preconditioned Conjugate Gradient (PCG) to overcome the limitations of Extended Position-Based Dynamics (XPBD) in high-resolution and high-stiffness simulations. While XPBD excels in simulating deformable objects due to its speed and simplicity, its nonlinear Gauss-Seidel (GS) solver often struggles with low-frequency errors, leading to instability and stalling issues, especially in high-resolution, high-stiffness simulations. Our multigrid approach addresses these issues efficiently by leveraging AMG. To reduce the computational overhead of traditional AMG, where prolongator construction can consume up to two-thirds of the runtime, we propose a lazy setup strategy that reuses prolongators across iterations based on matrix structure and physical significance. Furthermore, we introduce a simplified method for constructing near-kernel components by applying a few sweeps of iterative methods to the homogeneous equation, achieving convergence rates comparable to adaptive smoothed aggregation (adaptive-SA) at a lower computational cost. Experimental results demonstrate that our method significantly improves convergence rates and numerical stability, enabling efficient and stable high-resolution simulations of deformable objects. △ Less

Submitted 19 May, 2025; originally announced May 2025.

Comments: SIGGRAPH 2025

ACM Class: I.3.6

arXiv:2505.12748 [pdf, ps, other]

TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation

Authors: Hangyu Li, Qin Zhao, Haoran Xu, Xinyu Jiang, Qingwei Ben, Feiyu Jia, Haoyu Zhao, Liang Xu, Jia Zeng, Hanqing Wang, Bo Dai, Junting Dong, Jiangmiao Pang

Abstract: Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines-ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces-there is still no unified benchmark that enables fa… ▽ More Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines-ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces-there is still no unified benchmark that enables fair, reproducible comparison of these systems. In this paper, we introduce TeleOpBench, a simulator-centric benchmark tailored to bimanual dexterous teleoperation. TeleOpBench contains 30 high-fidelity task environments that span pick-and-place, tool use, and collaborative manipulation, covering a broad spectrum of kinematic and force-interaction difficulty. Within this benchmark we implement four representative teleoperation modalities-(i) MoCap, (ii) VR device, (iii) arm-hand exoskeletons, and (iv) monocular vision tracking-and evaluate them with a common protocol and metric suite. To validate that performance in simulation is predictive of real-world behavior, we conduct mirrored experiments on a physical dual-arm platform equipped with two 6-DoF dexterous hands. Across 10 held-out tasks we observe a strong correlation between simulator and hardware performance, confirming the external validity of TeleOpBench. TeleOpBench establishes a common yardstick for teleoperation research and provides an extensible platform for future algorithmic and hardware innovation. △ Less

Submitted 19 May, 2025; originally announced May 2025.

Comments: 13 pages

arXiv:2505.12045 [pdf, ps, other]

FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition

Authors: Shuai Yuan, Guowen Xu, Hongwei Li, Rui Zhang, Xinyuan Qian, Wenbo Jiang, Hangcheng Cao, Qingchuan Zhao

Abstract: Traffic sign recognition (TSR) systems are crucial for autonomous driving but are vulnerable to backdoor attacks. Existing physical backdoor attacks either lack stealth, provide inflexible attack control, or ignore emerging Vision-Large-Language-Models (VLMs). In this paper, we introduce FIGhost, the first physical-world backdoor attack leveraging fluorescent ink as triggers. Fluorescent triggers… ▽ More Traffic sign recognition (TSR) systems are crucial for autonomous driving but are vulnerable to backdoor attacks. Existing physical backdoor attacks either lack stealth, provide inflexible attack control, or ignore emerging Vision-Large-Language-Models (VLMs). In this paper, we introduce FIGhost, the first physical-world backdoor attack leveraging fluorescent ink as triggers. Fluorescent triggers are invisible under normal conditions and activated stealthily by ultraviolet light, providing superior stealthiness, flexibility, and untraceability. Inspired by real-world graffiti, we derive realistic trigger shapes and enhance their robustness via an interpolation-based fluorescence simulation algorithm. Furthermore, we develop an automated backdoor sample generation method to support three attack objectives. Extensive evaluations in the physical world demonstrate FIGhost's effectiveness against state-of-the-art detectors and VLMs, maintaining robustness under environmental variations and effectively evading existing defenses. △ Less

Submitted 17 May, 2025; originally announced May 2025.

arXiv:2505.11154 [pdf, other]

MPMA: Preference Manipulation Attack Against Model Context Protocol

Authors: Zihan Wang, Hongwei Li, Rui Zhang, Yu Liu, Wenbo Jiang, Wenshu Fan, Qingchuan Zhao, Guowen Xu

Abstract: Model Context Protocol (MCP) standardizes interface mapping for large language models (LLMs) to access external data and tools, which revolutionizes the paradigm of tool selection and facilitates the rapid expansion of the LLM agent tool ecosystem. However, as the MCP is increasingly adopted, third-party customized versions of the MCP server expose potential security vulnerabilities. In this paper… ▽ More Model Context Protocol (MCP) standardizes interface mapping for large language models (LLMs) to access external data and tools, which revolutionizes the paradigm of tool selection and facilitates the rapid expansion of the LLM agent tool ecosystem. However, as the MCP is increasingly adopted, third-party customized versions of the MCP server expose potential security vulnerabilities. In this paper, we first introduce a novel security threat, which we term the MCP Preference Manipulation Attack (MPMA). An attacker deploys a customized MCP server to manipulate LLMs, causing them to prioritize it over other competing MCP servers. This can result in economic benefits for attackers, such as revenue from paid MCP services or advertising income generated from free servers. To achieve MPMA, we first design a Direct Preference Manipulation Attack ($\mathtt{DPMA}$) that achieves significant effectiveness by inserting the manipulative word and phrases into the tool name and description. However, such a direct modification is obvious to users and lacks stealthiness. To address these limitations, we further propose Genetic-based Advertising Preference Manipulation Attack ($\mathtt{GAPMA}$). $\mathtt{GAPMA}$ employs four commonly used strategies to initialize descriptions and integrates a Genetic Algorithm (GA) to enhance stealthiness. The experiment results demonstrate that $\mathtt{GAPMA}$ balances high effectiveness and stealthiness. Our study reveals a critical vulnerability of the MCP in open ecosystems, highlighting an urgent need for robust defense mechanisms to ensure the fairness of the MCP ecosystem. △ Less

Submitted 16 May, 2025; originally announced May 2025.

arXiv:2505.07347 [pdf, other]

AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography

Authors: Jiewen Yang, Taoran Huang, Shangwei Ding, Xiaowei Xu, Qinhua Zhao, Yong Jiang, Jiarong Guo, Bin Pu, Jiexuan Zheng, Caojin Zhang, Hongwen Fei, Xiaomeng Li

Abstract: Echocardiographers can detect pulmonary hypertension using Doppler echocardiography; however, accurately assessing its progression often proves challenging. Right heart catheterization (RHC), the gold standard for precise evaluation, is invasive and unsuitable for routine use, limiting its practicality for timely diagnosis and monitoring of pulmonary hypertension progression. Here, we propose MePH… ▽ More Echocardiographers can detect pulmonary hypertension using Doppler echocardiography; however, accurately assessing its progression often proves challenging. Right heart catheterization (RHC), the gold standard for precise evaluation, is invasive and unsuitable for routine use, limiting its practicality for timely diagnosis and monitoring of pulmonary hypertension progression. Here, we propose MePH, a multi-view, multi-modal vision-language model to accurately assess pulmonary hypertension progression using non-invasive echocardiography. We constructed a large dataset comprising paired standardized echocardiogram videos, spectral images and RHC data, covering 1,237 patient cases from 12 medical centers. For the first time, MePH precisely models the correlation between non-invasive multi-view, multi-modal echocardiography and the pressure and resistance obtained via RHC. We show that MePH significantly outperforms echocardiographers' assessments using echocardiography, reducing the mean absolute error in estimating mean pulmonary arterial pressure (mPAP) and pulmonary vascular resistance (PVR) by 49.73% and 43.81%, respectively. In eight independent external hospitals, MePH achieved a mean absolute error of 3.147 for PVR assessment. Furthermore, MePH achieved an area under the curve of 0.921, surpassing echocardiographers (area under the curve of 0.842) in accurately predicting the severity of pulmonary hypertension, whether mild or severe. A prospective study demonstrated that MePH can predict treatment efficacy for patients. Our work provides pulmonary hypertension patients with a non-invasive and timely method for monitoring disease progression, improving the accuracy and efficiency of pulmonary hypertension management while enabling earlier interventions and more personalized treatment decisions. △ Less

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2505.05473 [pdf, ps, other]

DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

Authors: Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani

Abstract: Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pi… ▽ More Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty. △ Less

Submitted 8 May, 2025; originally announced May 2025.

Comments: CVPR 2025. Project website: https://qitaozhao.github.io/DiffusionSfM

arXiv:2505.05151 [pdf, ps, other]

Overcoming Dimensional Factorization Limits in Discrete Diffusion Models through Quantum Joint Distribution Learning

Authors: Chuangtao Chen, Qinglin Zhao, MengChu Zhou, Dusit Niyato, Zhimin He, Haozhen Situ

Abstract: Discrete diffusion models represent a significant advance in generative modeling, demonstrating remarkable success in synthesizing complex, high-quality discrete data. However, to avoid exponential computational costs, they typically rely on calculating per-dimension transition probabilities when learning high-dimensional distributions. In this study, we rigorously prove that this approach leads t… ▽ More Discrete diffusion models represent a significant advance in generative modeling, demonstrating remarkable success in synthesizing complex, high-quality discrete data. However, to avoid exponential computational costs, they typically rely on calculating per-dimension transition probabilities when learning high-dimensional distributions. In this study, we rigorously prove that this approach leads to a worst-case linear scaling of Kullback-Leibler (KL) divergence with data dimension. To address this, we propose a Quantum Discrete Denoising Diffusion Probabilistic Model (QD3PM), which enables joint probability learning through diffusion and denoising in exponentially large Hilbert spaces, offering a theoretical pathway to faithfully capture the true joint distribution. By deriving posterior states through quantum Bayes' theorem, similar to the crucial role of posterior probabilities in classical diffusion models, and by learning the joint probability, we establish a solid theoretical foundation for quantum-enhanced diffusion models. For denoising, we design a quantum circuit that utilizes temporal information for parameter sharing and incorporates learnable classical-data-controlled rotations for encoding. Exploiting joint distribution learning, our approach enables single-step sampling from pure noise, eliminating iterative requirements of existing models. Simulations demonstrate the proposed model's superior accuracy in modeling complex distributions compared to factorization methods. Hence, this paper establishes a new theoretical paradigm in generative models by leveraging the quantum advantage in joint distribution learning. △ Less

Submitted 29 June, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

Comments: Comments are welcome

arXiv:2505.05041 [pdf, other]

ADNP-15: An Open-Source Histopathological Dataset for Neuritic Plaque Segmentation in Human Brain Whole Slide Images with Frequency Domain Image Enhancement for Stain Normalization

Authors: Chenxi Zhao, Jianqiang Li, Qing Zhao, Jing Bai, Susana Boluda, Benoit Delatour, Lev Stimmer, Daniel Racoceanu, Gabriel Jimenez, Guanghui Fu

Abstract: Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by amyloid-beta plaques and tau neurofibrillary tangles, which serve as key histopathological features. The identification and segmentation of these lesions are crucial for understanding AD progression but remain challenging due to the lack of large-scale annotated datasets and the impact of staining variations on automated ima… ▽ More Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by amyloid-beta plaques and tau neurofibrillary tangles, which serve as key histopathological features. The identification and segmentation of these lesions are crucial for understanding AD progression but remain challenging due to the lack of large-scale annotated datasets and the impact of staining variations on automated image analysis. Deep learning has emerged as a powerful tool for pathology image segmentation; however, model performance is significantly influenced by variations in staining characteristics, necessitating effective stain normalization and enhancement techniques. In this study, we address these challenges by introducing an open-source dataset (ADNP-15) of neuritic plaques (i.e., amyloid deposits combined with a crown of dystrophic tau-positive neurites) in human brain whole slide images. We establish a comprehensive benchmark by evaluating five widely adopted deep learning models across four stain normalization techniques, providing deeper insights into their influence on neuritic plaque segmentation. Additionally, we propose a novel image enhancement method that improves segmentation accuracy, particularly in complex tissue structures, by enhancing structural details and mitigating staining inconsistencies. Our experimental results demonstrate that this enhancement strategy significantly boosts model generalization and segmentation accuracy. All datasets and code are open-source, ensuring transparency and reproducibility while enabling further advancements in the field. △ Less

Submitted 8 May, 2025; originally announced May 2025.

arXiv:2505.04869 [pdf, other]

From First Draft to Final Insight: A Multi-Agent Approach for Feedback Generation

Authors: Jie Cao, Chloe Qianhui Zhao, Xian Chen, Shuman Wang, Christian Schunn, Kenneth R. Koedinger, Jionghao Lin

Abstract: Producing large volumes of high-quality, timely feedback poses significant challenges to instructors. To address this issue, automation technologies-particularly Large Language Models (LLMs)-show great potential. However, current LLM-based research still shows room for improvement in terms of feedback quality. Our study proposed a multi-agent approach performing "generation, evaluation, and regene… ▽ More Producing large volumes of high-quality, timely feedback poses significant challenges to instructors. To address this issue, automation technologies-particularly Large Language Models (LLMs)-show great potential. However, current LLM-based research still shows room for improvement in terms of feedback quality. Our study proposed a multi-agent approach performing "generation, evaluation, and regeneration" (G-E-RG) to further enhance feedback quality. In the first-generation phase, six methods were adopted, combining three feedback theoretical frameworks and two prompt methods: zero-shot and retrieval-augmented generation with chain-of-thought (RAG_CoT). The results indicated that, compared to first-round feedback, G-E-RG significantly improved final feedback across six methods for most dimensions. Specifically:(1) Evaluation accuracy for six methods increased by 3.36% to 12.98% (p<0.001); (2) The proportion of feedback containing four effective components rose from an average of 27.72% to an average of 98.49% among six methods, sub-dimensions of providing critiques, highlighting strengths, encouraging agency, and cultivating dialogue also showed great enhancement (p<0.001); (3) There was a significant improvement in most of the feature values (p<0.001), although some sub-dimensions (e.g., strengthening the teacher-student relationship) still require further enhancement; (4) The simplicity of feedback was effectively enhanced (p<0.001) for three methods. △ Less

Submitted 7 May, 2025; originally announced May 2025.

Comments: 14 pages, to be published at the 26th International Conference on Artificial Intelligence in Education (AIED '25)

arXiv:2505.04584 [pdf, other]

SlideItRight: Using AI to Find Relevant Slides and Provide Feedback for Open-Ended Questions

Authors: Chloe Qianhui Zhao, Jie Cao, Eason Chen, Kenneth R. Koedinger, Jionghao Lin

Abstract: Feedback is important in supporting student learning. While various automated feedback systems have been implemented to make the feedback scalable, many existing solutions only focus on generating text-based feedback. As is indicated in the multimedia learning principle, learning with more modalities could help utilize more separate channels, reduce the cognitive load and facilitate students' lear… ▽ More Feedback is important in supporting student learning. While various automated feedback systems have been implemented to make the feedback scalable, many existing solutions only focus on generating text-based feedback. As is indicated in the multimedia learning principle, learning with more modalities could help utilize more separate channels, reduce the cognitive load and facilitate students' learning. Hence, it is important to explore the potential of Artificial Intelligence (AI) in feedback generation from and to different modalities. Our study leverages Large Language Models (LLMs) for textual feedback with the supplementary guidance from other modality - relevant lecture slide retrieved from the slides hub. Through an online crowdsourcing study (N=91), this study investigates learning gains and student perceptions using a 2x2 design (i.e., human feedback vs. AI feedback and with vs. without relevant slide), evaluating the clarity, engagement, perceived effectiveness, and reliability) of AI-facilitated multimodal feedback. We observed significant pre-to-post learning gains across all conditions. However, the differences in these gains were not statistically significant between conditions. The post-survey revealed that students found the slide feedback helpful in their learning process, though they reported difficulty in understanding it. Regarding the AI-generated open-ended feedback, students considered it personalized and relevant to their responses, but they expressed lower trust in the AI feedback compared to human-generated feedback. △ Less

Submitted 7 May, 2025; originally announced May 2025.

Comments: 14 pages, to be published at the 26th International Conference on Artificial Intelligence in Education (AIED '25)

arXiv:2505.03501 [pdf, other]

BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models

Authors: Zihan Wang, Hongwei Li, Rui Zhang, Wenbo Jiang, Kangjie Chen, Tianwei Zhang, Qingchuan Zhao, Guowen Xu

Abstract: In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs to generate inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious enti… ▽ More In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs to generate inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious entities. We first implement a baseline lingual-backdoor attack, which is carried out by poisoning a set of training data for specific downstream tasks through translation into the trigger language. However, this baseline attack suffers from poor task generalization and is impractical in real-world settings. To address this challenge, we design BadLingual, a novel task-agnostic lingual-backdoor, capable of triggering any downstream tasks within the chat LLMs, regardless of the specific questions of these tasks. We design a new approach using PPL-constrained Greedy Coordinate Gradient-based Search (PGCG) based adversarial training to expand the decision boundary of lingual-backdoor, thereby enhancing the generalization ability of lingual-backdoor across various tasks. We perform extensive experiments to validate the effectiveness of our proposed attacks. Specifically, the baseline attack achieves an ASR of over 90% on the specified tasks. However, its ASR reaches only 37.61% across six tasks in the task-agnostic scenario. In contrast, BadLingual brings up to 37.35% improvement over the baseline. Our study sheds light on a new perspective of vulnerabilities in LLMs with multilingual capabilities and is expected to promote future research on the potential defenses to enhance the LLMs' robustness △ Less

Submitted 6 May, 2025; originally announced May 2025.

arXiv:2505.02094 [pdf, other]

SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations

Authors: Runyi Yu, Yinhuai Wang, Qihan Zhao, Hok Wai Tsui, Jingbo Wang, Ping Tan, Qifeng Chen

Abstract: We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is t… ▽ More We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is that despite noisy and sparse demonstrations, there exist infinite physically feasible trajectories that naturally bridge between demonstrated skills or emerge from their neighboring states, forming a continuous space of possible skill variations and transitions. Building upon this insight, we present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood. To enable effective RLID with augmented data, we develop an Adaptive Trajectory Sampling (ATS) strategy for dynamic curriculum generation and a historical encoding mechanism for memory-dependent skill learning. Our approach enables robust skill acquisition that significantly generalizes beyond the reference demonstrations. Extensive experiments across diverse interaction tasks demonstrate substantial improvements over state-of-the-art methods in terms of convergence stability, generalization capability, and recovery robustness. △ Less

Submitted 4 May, 2025; originally announced May 2025.

arXiv:2504.21771 [pdf, ps, other]

Anatomical Similarity as a New Metric to Evaluate Brain Generative Models

Authors: Bahram Jafrasteh, Wei Peng, Cheng Wan, Yimin Luo, Ehsan Adeli, Qingyu Zhao

Abstract: Generative models enhance neuroimaging through data augmentation, quality improvement, and rare condition studies. Despite advances in realistic synthetic MRIs, evaluations focus on texture and perception, lacking sensitivity to crucial anatomical fidelity. This study proposes a new metric, called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the anatomical realism of synthetic brai… ▽ More Generative models enhance neuroimaging through data augmentation, quality improvement, and rare condition studies. Despite advances in realistic synthetic MRIs, evaluations focus on texture and perception, lacking sensitivity to crucial anatomical fidelity. This study proposes a new metric, called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the anatomical realism of synthetic brain MRIs. WASABI leverages \textit{SynthSeg}, a deep learning-based brain parcellation tool, to derive volumetric measures of brain regions in each MRI and uses the multivariate Wasserstein distance to compare distributions between real and synthetic anatomies. Based on controlled experiments on two real datasets and synthetic MRIs from five generative models, WASABI demonstrates higher sensitivity in quantifying anatomical discrepancies compared to traditional image-level metrics, even when synthetic images achieve near-perfect visual quality. Our findings advocate for shifting the evaluation paradigm beyond visual inspection and conventional metrics, emphasizing anatomical fidelity as a crucial benchmark for clinically meaningful brain MRI synthesis. Our code is available at https://github.com/BahramJafrasteh/wasabi-mri. △ Less

Submitted 8 July, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

arXiv:2504.17332 [pdf, other]

Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection

Authors: Zihan Wang, Lu Yuan, Zhengxuan Zhang, Qing Zhao

Abstract: In the digital era, social media has become a major conduit for information dissemination, yet it also facilitates the rapid spread of misinformation. Traditional misinformation detection methods primarily focus on surface-level features, overlooking the crucial roles of human empathy in the propagation process. To address this gap, we propose the Dual-Aspect Empathy Framework (DAE), which integra… ▽ More In the digital era, social media has become a major conduit for information dissemination, yet it also facilitates the rapid spread of misinformation. Traditional misinformation detection methods primarily focus on surface-level features, overlooking the crucial roles of human empathy in the propagation process. To address this gap, we propose the Dual-Aspect Empathy Framework (DAE), which integrates cognitive and emotional empathy to analyze misinformation from both the creator and reader perspectives. By examining creators' cognitive strategies and emotional appeals, as well as simulating readers' cognitive judgments and emotional responses using Large Language Models (LLMs), DAE offers a more comprehensive and human-centric approach to misinformation detection. Moreover, we further introduce an empathy-aware filtering mechanism to enhance response authenticity and diversity. Experimental results on benchmark datasets demonstrate that DAE outperforms existing methods, providing a novel paradigm for multimodal misinformation detection. △ Less

Submitted 24 April, 2025; originally announced April 2025.

arXiv:2504.15667 [pdf, other]

Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSeg

Authors: Jingchen Zou, Jianqiang Li, Gabriel Jimenez, Qing Zhao, Daniel Racoceanu, Matias Cosarinsky, Enzo Ferrante, Guanghui Fu

Abstract: The performance of medical image segmentation models is usually evaluated using metrics like the Dice score and Hausdorff distance, which compare predicted masks to ground truth annotations. However, when applying the model to unseen data, such as in clinical settings, it is often impractical to annotate all the data, making the model's performance uncertain. To address this challenge, we propose… ▽ More The performance of medical image segmentation models is usually evaluated using metrics like the Dice score and Hausdorff distance, which compare predicted masks to ground truth annotations. However, when applying the model to unseen data, such as in clinical settings, it is often impractical to annotate all the data, making the model's performance uncertain. To address this challenge, we propose the Segmentation Performance Evaluator (SPE), a framework for estimating segmentation models' performance on unlabeled data. This framework is adaptable to various evaluation metrics and model architectures. Experiments on six publicly available datasets across six evaluation metrics including pixel-based metrics such as Dice score and distance-based metrics like HD95, demonstrated the versatility and effectiveness of our approach, achieving a high correlation (0.956$\pm$0.046) and low MAE (0.025$\pm$0.019) compare with real Dice score on the independent test set. These results highlight its ability to reliably estimate model performance without requiring annotations. The SPE framework integrates seamlessly into any model training process without adding training overhead, enabling performance estimation and facilitating the real-world application of medical image segmentation algorithms. The source code is publicly available △ Less

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2504.13882 [pdf, other]

Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation

Authors: Megan Gu, Chloe Qianhui Zhao, Claire Liu, Nikhil Patel, Jahnvi Shah, Jionghao Lin, Kenneth R. Koedinger

Abstract: Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: 1. giving effective praise, 2. reacting to errors, 3. determining what students know, 4. helping students manage inequity, and 5. responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tu… ▽ More Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: 1. giving effective praise, 2. reacting to errors, 3. determining what students know, 4. helping students manage inequity, and 5. responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tutoring strategy as either being employed as desired or undesired. Our study utilizes GPT-3.5 with few-shot prompting to assess the use of these strategies and analyze tutoring dialogues. The results show that for the five tutoring strategies, True Negative Rates (TNR) range from 0.655 to 0.738, and Recall ranges from 0.327 to 0.432, indicating that the model is effective at excluding incorrect classifications but struggles to consistently identify the correct strategy. The strategy \textit{helping students manage inequity} showed the highest performance with a TNR of 0.738 and Recall of 0.432. The study highlights the potential of LLMs in tutoring strategy analysis and outlines directions for future improvements, including incorporating more advanced models for more nuanced feedback. △ Less

Submitted 3 April, 2025; originally announced April 2025.

Comments: Manuscript accepted to the Workshop on "From Data to Discovery: LLMs for Qualitative Analysis in Education" at LAK25

arXiv:2504.08685 [pdf, other]

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Meng Wei, Zhiwu Qing, Fei Xiao, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi , et al. (30 additional authors not shown)

Abstract: This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary… ▽ More This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpasses, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continue training. See the project page at https://seaweed.video/ △ Less

Submitted 4 May, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

Comments: Technical report (some typos fixed)

arXiv:2504.06982 [pdf, other]

SIGMAN:Scaling 3D Human Gaussian Generation with Millions of Assets

Authors: Yuhang Yang, Fengqi Liu, Yixing Lu, Qin Zhao, Pingyu Wu, Wei Zhai, Ran Yi, Yang Cao, Lizhuang Ma, Zheng-Jun Zha, Junting Dong

Abstract: 3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-vie… ▽ More 3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small-scale, insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization, which involves compressing multi-view images into Gaussians via a UV-structured VAE, along with DiT-based conditional generation, we transform the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift, which also supports end-to-end inference. In addition, we employ the multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains $1$ million 3D Gaussian assets to support the large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation. △ Less

Submitted 9 April, 2025; originally announced April 2025.

Comments: project page:https://yyvhang.github.io/SIGMAN_3D/

arXiv:2504.05720 [pdf, other]

QEMesh: Employing A Quadric Error Metrics-Based Representation for Mesh Generation

Authors: Jiaqi Li, Ruowei Wang, Yu Liu, Qijun Zhao

Abstract: Mesh generation plays a crucial role in 3D content creation, as mesh is widely used in various industrial applications. Recent works have achieved impressive results but still face several issues, such as unrealistic patterns or pits on surfaces, thin parts missing, and incomplete structures. Most of these problems stem from the choice of shape representation or the capabilities of the generative… ▽ More Mesh generation plays a crucial role in 3D content creation, as mesh is widely used in various industrial applications. Recent works have achieved impressive results but still face several issues, such as unrealistic patterns or pits on surfaces, thin parts missing, and incomplete structures. Most of these problems stem from the choice of shape representation or the capabilities of the generative network. To alleviate these, we extend PoNQ, a Quadric Error Metrics (QEM)-based representation, and propose a novel model, QEMesh, for high-quality mesh generation. PoNQ divides the shape surface into tiny patches, each represented by a point with its normal and QEM matrix, which preserves fine local geometry information. In our QEMesh, we regard these elements as generable parameters and design a unique latent diffusion model containing a novel multi-decoder VAE for PoNQ parameters generation. Given the latent code generated by the diffusion model, three parameter decoders produce several PoNQ parameters within each voxel cell, and an occupancy decoder predicts which voxel cells containing parameters to form the final shape. Extensive evaluations demonstrate that our method generates results with watertight surfaces and is comparable to state-of-the-art methods in several main metrics. △ Less

Submitted 8 April, 2025; originally announced April 2025.

Comments: Accepted by International Conference on Multimedia and Expo

arXiv:2504.02730 [pdf, other]

HQViT: Hybrid Quantum Vision Transformer for Image Classification

Authors: Hui Zhang, Qinglin Zhao, Mengchu Zhou, Li Feng

Abstract: Transformer-based architectures have revolutionized the landscape of deep learning. In computer vision domain, Vision Transformer demonstrates remarkable performance on par with or even surpassing that of convolutional neural networks. However, the quadratic computational complexity of its self-attention mechanism poses challenges for classical computing, making model training with high-dimensiona… ▽ More Transformer-based architectures have revolutionized the landscape of deep learning. In computer vision domain, Vision Transformer demonstrates remarkable performance on par with or even surpassing that of convolutional neural networks. However, the quadratic computational complexity of its self-attention mechanism poses challenges for classical computing, making model training with high-dimensional input data, e.g., images, particularly expensive. To address such limitations, we propose a Hybrid Quantum Vision Transformer (HQViT), that leverages the principles of quantum computing to accelerate model training while enhancing model performance. HQViT introduces whole-image processing with amplitude encoding to better preserve global image information without additional positional encoding. By leveraging quantum computation on the most critical steps and selectively handling other components in a classical way, we lower the cost of quantum resources for HQViT. The qubit requirement is minimized to $O(log_2N)$ and the number of parameterized quantum gates is only $O(log_2d)$, making it well-suited for Noisy Intermediate-Scale Quantum devices. By offloading the computationally intensive attention coefficient matrix calculation to the quantum framework, HQViT reduces the classical computational load by $O(T^2d)$. Extensive experiments across various computer vision datasets demonstrate that HQViT outperforms existing models, achieving a maximum improvement of up to $10.9\%$ (on the MNIST 10-classification task) over the state of the art. This work highlights the great potential to combine quantum and classical computing to cope with complex image classification tasks. △ Less

Submitted 3 April, 2025; originally announced April 2025.

Comments: 13 pages, 8 figures

arXiv:2504.01038 [pdf, other]

An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection

Authors: Xian-Xian Liu, Yuanyuan Wei, Mingkun Xu, Yongze Guo, Hongwei Zhang, Huicong Dong, Qun Song, Qi Zhao, Wei Luo, Feng Tien, Juntao Gao, Simon Fong

Abstract: Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One C… ▽ More Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One Class Twin Cross Learning (OCT-X) algorithm. Leveraging a novel fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, OCT-X maximizes diagnostic accuracy through real-time data processing and seamless lesion surveillance. The hardware component includes an all-in-one point-of-care testing (POCT) device with high-resolution imaging sensors, real-time data processing, and wireless connectivity, facilitated by the NI CompactDAQ and LabVIEW software. Our integrated system achieved an unprecedented diagnostic accuracy of 99.70%, significantly outperforming existing models by up to 4.47%, and demonstrated a 10% improvement in multirate adaptability. These findings underscore the potential of OCT-X as well as the integrated system in clinical diagnostics, offering a path toward more accurate, efficient, and less invasive early gastric cancer detection. Future research will explore broader applications, further advancing oncological diagnostics. Code is available at https://github.com/liu37972/Multirate-Location-on-OCT-X-Learning.git. △ Less

Submitted 31 March, 2025; originally announced April 2025.

Comments: 26 pages, 4 figures, 6 tables

arXiv:2504.00375 [pdf, other]

CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection

Authors: Xin Zhang, Keren Fu, Qijun Zhao

Abstract: The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has remarkably performed in video object segmentation, drawing significant attention in the community. Due to the high similarity between camouflaged objects and their surroundings, which makes them difficult to distinguish even by the human eye, the application of SAM2 for automated segmentation in real-world scenarios f… ▽ More The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has remarkably performed in video object segmentation, drawing significant attention in the community. Due to the high similarity between camouflaged objects and their surroundings, which makes them difficult to distinguish even by the human eye, the application of SAM2 for automated segmentation in real-world scenarios faces challenges in camouflage perception and reliable prompts generation. To address these issues, we propose CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework to automatically generate and refine prompts for SAM2, enabling high-quality automatic detection and segmentation in VCOD task. Initially, we introduce a prompt inducer that simultaneously integrates motion and appearance cues to detect camouflaged objects, delivering more accurate initial predictions than existing methods. Subsequently, we propose a video-based adaptive multi-prompts refinement (AMPR) strategy tailored for SAM2, aimed at mitigating prompt error in initial coarse masks and further producing good prompts. Specifically, we introduce a novel three-step process to generate reliable prompts by camouflaged object determination, pivotal prompting frame selection, and multi-prompts formation. Extensive experiments conducted on two benchmark datasets demonstrate that our proposed model, CamoSAM2, significantly outperforms existing state-of-the-art methods, achieving increases of 8.0% and 10.1% in mIoU metric. Additionally, our method achieves the fastest inference speed compared to current VCOD models. △ Less

Submitted 31 March, 2025; originally announced April 2025.

Comments: 10 pages, 5 figures,

arXiv:2503.23748 [pdf, other]

THEMIS: Towards Practical Intellectual Property Protection for Post-Deployment On-Device Deep Learning Models

Authors: Yujin Huang, Zhi Zhang, Qingchuan Zhao, Xingliang Yuan, Chunyang Chen

Abstract: On-device deep learning (DL) has rapidly gained adoption in mobile apps, offering the benefits of offline model inference and user privacy preservation over cloud-based approaches. However, it inevitably stores models on user devices, introducing new vulnerabilities, particularly model-stealing attacks and intellectual property infringement. While system-level protections like Trusted Execution En… ▽ More On-device deep learning (DL) has rapidly gained adoption in mobile apps, offering the benefits of offline model inference and user privacy preservation over cloud-based approaches. However, it inevitably stores models on user devices, introducing new vulnerabilities, particularly model-stealing attacks and intellectual property infringement. While system-level protections like Trusted Execution Environments (TEEs) provide a robust solution, practical challenges remain in achieving scalable on-device DL model protection, including complexities in supporting third-party models and limited adoption in current mobile solutions. Advancements in TEE-enabled hardware, such as NVIDIA's GPU-based TEEs, may address these obstacles in the future. Currently, watermarking serves as a common defense against model theft but also faces challenges here as many mobile app developers lack corresponding machine learning expertise and the inherent read-only and inference-only nature of on-device DL models prevents third parties like app stores from implementing existing watermarking techniques in post-deployment models. To protect the intellectual property of on-device DL models, in this paper, we propose THEMIS, an automatic tool that lifts the read-only restriction of on-device DL models by reconstructing their writable counterparts and leverages the untrainable nature of on-device DL models to solve watermark parameters and protect the model owner's intellectual property. Extensive experimental results across various datasets and model structures show the superiority of THEMIS in terms of different metrics. Further, an empirical investigation of 403 real-world DL mobile apps from Google Play is performed with a success rate of 81.14%, showing the practicality of THEMIS. △ Less

Submitted 31 March, 2025; originally announced March 2025.

Comments: To Appear in the 34th USENIX Security Symposium, August 13-15, 2025

arXiv:2503.23327 [pdf]

AI Delivers Creative Output but Struggles with Thinking Processes

Authors: Man Zhang, Ying Li, Yang Peng, Yijia Sun, Wenxin Guo, Huiqing Hu, Shi Chen, Qingbai Zhao

Abstract: A key objective in artificial intelligence (AI) development is to create systems that match or surpass human creativity. Although current AI models perform well across diverse creative tasks, it remains unclear whether these achievements reflect genuine creative thinking. This study examined whether AI models (GPT-3.5-turbo, GPT-4, and GPT-4o) engage in creative thinking by comparing their perform… ▽ More A key objective in artificial intelligence (AI) development is to create systems that match or surpass human creativity. Although current AI models perform well across diverse creative tasks, it remains unclear whether these achievements reflect genuine creative thinking. This study examined whether AI models (GPT-3.5-turbo, GPT-4, and GPT-4o) engage in creative thinking by comparing their performance with humans across various creative tasks and core cognitive processes. Results showed that AI models outperformed humans in divergent thinking, convergent thinking, and insight problem-solving, but underperformed in creative writing. Compared to humans, AI generated lower forward flow values in both free and chain association tasks and showed lower accuracy in the representational change task. In creative evaluation, AI exhibited no significant correlation between the weights of novelty and appropriateness when predicting creative ratings, suggesting the absence of a human-like trade-off strategy. AI also had higher decision error scores in creative selection, suggesting difficulty identifying the most creative ideas. These findings suggest that while AI can mimic human creativity, its strong performance in creative tasks is likely driven by non-creative mechanisms rather than genuine creative thinking. △ Less

Submitted 30 March, 2025; originally announced March 2025.

arXiv:2503.22020 [pdf, other]

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Authors: Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin

Abstract: Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input--output mappings, lacking the intermediate reasoning steps crucial… ▽ More Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input--output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/ △ Less

Submitted 27 March, 2025; originally announced March 2025.

Comments: Project website: https://cot-vla.github.io/

Journal ref: CVPR 2025

arXiv:2503.20822 [pdf, other]

Synthetic Video Enhances Physical Fidelity in Video Synthesis

Authors: Qi Zhao, Xingyu Ni, Ziyu Wang, Feng Cheng, Ziyan Yang, Lu Jiang, Bohan Wang

Abstract: We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrate… ▽ More We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, significantly reducing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its efficacy in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis. Website: https://kevinz8866.github.io/simulation/ △ Less

Submitted 25 March, 2025; originally announced March 2025.

arXiv:2503.19427 [pdf, other]

ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation

Authors: Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Qi Zhao, Changyu Zeng, Wenpei Bai, Guangliang Cheng

Abstract: Skin lesion segmentation is a critical challenge in computer vision, and it is essential to separate pathological features from healthy skin for diagnostics accurately. Traditional Convolutional Neural Networks (CNNs) are limited by narrow receptive fields, and Transformers face significant computational burdens. This paper presents a novel skin lesion segmentation framework, the Atrous Shifted Pa… ▽ More Skin lesion segmentation is a critical challenge in computer vision, and it is essential to separate pathological features from healthy skin for diagnostics accurately. Traditional Convolutional Neural Networks (CNNs) are limited by narrow receptive fields, and Transformers face significant computational burdens. This paper presents a novel skin lesion segmentation framework, the Atrous Shifted Parallel Vision Mamba UNet (ASP-VMUNet), which integrates the efficient and scalable Mamba architecture to overcome limitations in traditional CNNs and computationally demanding Transformers. The framework introduces an atrous scan technique that minimizes background interference and expands the receptive field, enhancing Mamba's scanning capabilities. Additionally, the inclusion of a Parallel Vision Mamba (PVM) layer and a shift round operation optimizes feature segmentation and fosters rich inter-segment information exchange. A supplementary CNN branch with a Selective-Kernel (SK) Block further refines the segmentation by blending local and global contextual information. Tested on four benchmark datasets (ISIC16/17/18 and PH2), ASP-VMUNet demonstrates superior performance in skin lesion segmentation, validated by comprehensive ablation studies. This approach not only advances medical image segmentation but also highlights the benefits of hybrid architectures in medical imaging technology. Our code is available at https://github.com/BaoBao0926/ASP-VMUNet/tree/main. △ Less

Submitted 25 March, 2025; originally announced March 2025.

Showing 1–50 of 702 results for author: Zhao, Q