Search | arXiv e-print repository

arXiv:2507.02092 [pdf, ps, other]

Energy-Based Transformers are Scalable Learners and Thinkers

Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal

Abstract: Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question "Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?" Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) -- a new class of Energy-Based Models (EBMs) -- to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01663 [pdf, ps, other]

AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training

Authors: Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, Jianping Wu

Abstract: Reinforcement learning (RL) has become a pivotal technology in the post-training phase of large language models (LLMs). Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks, while task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance. Moreover, most existing frameworks are tightly coupled with LLM training or inference engines, making it difficult to support custom-designed engines. To address these challenges, we propose AsyncFlow, an asynchronous streaming RL framework for efficient post-training. Specifically, we introduce a distributed data storage and transfer module that provides a unified data management and fine-grained scheduling capability in a fully streamed manner. This architecture inherently facilitates automated pipeline overlapping among RL tasks and dynamic load balancing. Moreover, we propose a producer-consumer-based asynchronous workflow engineered to minimize computational idleness by strategically deferring the parameter update process within staleness thresholds. Finally, the core capability of AsyncFlow is architecturally decoupled from underlying training and inference engines and encapsulated by service-oriented user interfaces, offering a modular and customizable user experience. Extensive experiments demonstrate an average 1.59x throughput improvement compared with the state-of-the-art baseline. The presented architecture in this work provides actionable insights for next-generation RL training system designs.

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2506.23918 [pdf, ps, other]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Authors: Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung

Abstract: Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.

Submitted 3 July, 2025; v1 submitted 30 June, 2025; originally announced June 2025.

Comments: Preprint in progress. We maintain a real-time GitHub repository tracking progress at: https://github.com/zhaochen0110/Awesome_Think_With_Images

arXiv:2506.23445 [pdf]

Topotactic phase transformation in correlated vanadium dioxide through oxygen vacancy ordering

Authors: Xuanchi Zhou, Xiaohui Yao, Xiaomei Qiao, Jiahui Ji, Guowei Zhou, Huihui Ji, Xiaohong Xu

Abstract: Controlling the insulator-metal transition (IMT) in correlated oxide system through oxygen vacancy ordering opens up a new paradigm for exploring exotic structural transformation and physical functionality. Oxygen vacancy serves as a powerful tuning knob for adjusting the IMT property in VO2, though driving topochemical reduction to V2O3 remains challenging due to structural incompatibility and competing phase instability. Here we unveil consecutive oxygen-vacancy-driven VO2-VO2-x-V2O3 topotactic phase transformation route with enticing facet-dependent anisotropy, engendering tunable IMT properties over an extended temperature range. Remarkably, topochemically reduced V2O3 inherits the crystallographic characteristics from parent VO2, enabling emergent lattice framework and IMT behavior inaccessible via direct epitaxial growth. Analogous electron doping arising from hydrogenation and oxygen vacancy contributes cooperatively to drive the Mott phase transition in VO2 through band-filling control. Our work not only unveils sequential topotactic phase transformations in VO2 through oxygen vacancy ordering but also provides fundamentally new insights for defect-mediated Mott transitions.

Submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.20949 [pdf, ps, other]

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

Authors: Chenkai Sun, Denghui Zhang, ChengXiang Zhai, Heng Ji

Abstract: Given the growing influence of language model-based agents on high-stakes societal decisions, from public policy to healthcare, ensuring their beneficial impact requires understanding the far-reaching implications of their suggestions. We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems on a macroscopic scale over time, enabling more robust alignment. To assess the long-term safety awareness of language models, we also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts. Our approach achieves not only over 20% improvement on the new dataset but also an average win rate exceeding 70% against strong baselines on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for safer agents.

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.07459 [pdf, ps, other]

ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning

Authors: Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, Ge Liu

Abstract: Protein generative models have shown remarkable promise in protein design but still face limitations in success rate, due to the scarcity of high-quality protein datasets for supervised pretraining. We present ProteinZero, a novel framework that enables scalable, automated, and continuous self-improvement of the inverse folding model through online reinforcement learning. To achieve computationally tractable online feedback, we introduce efficient proxy reward models based on ESM-fold and a novel rapid ddG predictor that significantly accelerates evaluation speed. ProteinZero employs a general RL framework balancing multi-reward maximization, KL-divergence from a reference model, and a novel protein-embedding level diversity regularization that prevents mode collapse while promoting higher sequence diversity. Through extensive experiments, we demonstrate that ProteinZero substantially outperforms existing methods across every key metric in protein design, achieving significant improvements in structural accuracy, designability, thermodynamic stability, and sequence diversity. Most impressively, ProteinZero reduces design failure rates by approximately 36% - 48% compared to widely-used methods like ProteinMPNN, ESM-IF and InstructPLM, consistently achieving success rates exceeding 90% across diverse and complex protein folds. Notably, the entire RL run on CATH-4.3 can be done with a single 8 X GPU node in under 3 days, including reward computation. Our work establishes a new paradigm for protein design where models evolve continuously from their own generated outputs, opening new possibilities for exploring the vast protein design space.

Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07413 [pdf, ps, other]

Variational Supervised Contrastive Learning

Authors: Ziwen Wang, Jiajun Fan, Thao Nguyen, Heng Ji, Ge Liu

Abstract: Contrastive learning has proven to be highly efficient and adaptable in shaping representation spaces across diverse modalities by pulling similar samples together and pushing dissimilar ones apart. However, two key limitations persist: (1) Without explicit regulation of the embedding distribution, semantically related instances can inadvertently be pushed apart unless complementary signals guide pair selection, and (2) excessive reliance on large in-batch negatives and tailored augmentations hinders generalization. To address these limitations, we propose Variational Supervised Contrastive Learning (VarCon), which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted evidence lower bound (ELBO) that replaces exhaustive pair-wise comparisons for efficient class-aware matching and grants fine-grained control over intra-class dispersion in the embedding space. Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while converging in just 200 epochs; (2) yields substantially clearer decision boundaries and semantic organization in the embedding space, as evidenced by KNN classification, hierarchical clustering results, and transfer-learning assessments; and (3) demonstrates superior performance in few-shot learning than supervised baseline and superior robustness across various augmentation strategies.

Submitted 26 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.06972 [pdf, ps, other]

Atomic Reasoning for Scientific Table Claim Verification

Authors: Yuji Zhang, Qingyun Wang, Cheng Qian, Jiateng Liu, Chenkai Sun, Denghui Zhang, Tarek Abdelzaher, Chengxiang Zhai, Preslav Nakov, Heng Ji

Abstract: Scientific texts often convey authority due to their technical language and complex data. However, this complexity can sometimes lead to the spread of misinformation. Non-experts are particularly susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning, resulting in errors and a lack of precision in verifying scientific claims. Inspired by Cognitive Load Theory, we propose that enhancing a model's ability to interpret table-based claims involves reducing cognitive load by developing modular, reusable reasoning components (i.e., atomic skills). We introduce a skill-chaining schema that dynamically composes these skills to facilitate more accurate and generalizable reasoning with a reduced cognitive load. To evaluate this, we create SciAtomicBench, a cross-domain benchmark with fine-grained reasoning annotations. With only 350 fine-tuning examples, our model trained by atomic reasoning outperforms GPT-4o's chain-of-thought method, achieving state-of-the-art results with far less training data.

Submitted 7 June, 2025; originally announced June 2025.

arXiv:2506.05869 [pdf, ps, other]

Loss Functions for Predictor-based Neural Architecture Search

Authors: Han Ji, Yuqi Feng, Jiahao Fan, Yanan Sun

Abstract: Evaluation is a critical but costly procedure in neural architecture search (NAS). Performance predictors have been widely adopted to reduce evaluation costs by directly estimating architecture performance. The effectiveness of predictors is heavily influenced by the choice of loss functions. While traditional predictors employ regression loss functions to evaluate the absolute accuracy of architectures, recent approaches have explored various ranking-based loss functions, such as pairwise and listwise ranking losses, to focus on the ranking of architecture performance. Despite their success in NAS, the effectiveness and characteristics of these loss functions have not been thoroughly investigated. In this paper, we conduct the first comprehensive study on loss functions in performance predictors, categorizing them into three main types: regression, ranking, and weighted loss functions. Specifically, we assess eight loss functions using a range of NAS-relevant metrics on 13 tasks across five search spaces. Our results reveal that specific categories of loss functions can be effectively combined to enhance predictor-based NAS. Furthermore, our findings could provide practical guidance for selecting appropriate loss functions for various tasks. We hope this work provides meaningful insights to guide the development of loss functions for predictor-based methods in the NAS community.

Submitted 6 June, 2025; originally announced June 2025.

arXiv:2506.05297 [pdf, ps, other]

DM-SegNet: Dual-Mamba Architecture for 3D Medical Image Segmentation with Global Context Modeling

Authors: Hangyu Ji

Abstract: Accurate 3D medical image segmentation demands architectures capable of reconciling global context modeling with spatial topology preservation. While State Space Models (SSMs) like Mamba show potential for sequence modeling, existing medical SSMs suffer from encoder-decoder incompatibility: the encoder's 1D sequence flattening compromises spatial structures, while conventional decoders fail to leverage Mamba's state propagation. We present DM-SegNet, a Dual-Mamba architecture integrating directional state transitions with anatomy-aware hierarchical decoding. The core innovations include a quadri-directional spatial Mamba module employing four-directional 3D scanning to maintain anatomical spatial coherence, a gated spatial convolution layer that enhances spatially sensitive feature representation prior to state modeling, and a Mamba-driven decoding framework enabling bidirectional state synchronization across scales. Extensive evaluation on two clinically significant benchmarks demonstrates the efficacy of DM-SegNet: achieving state-of-the-art Dice Similarity Coefficient (DSC) of 85.44% on the Synapse dataset for abdominal organ segmentation and 90.22% on the BraTS2023 dataset for brain tumor segmentation.

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.05021 [pdf]

Mechanistic Insights into Water-Splitting, Proton Migration, and Hydrogen Evolution Reaction in g-C3N4/TiO2-B and Li-F co-doped Heterostructures

Authors: Shuhan Tang, Qi Jiang, Shuang Qiu, Hanyang Ji, Xiaojie Liu

Abstract: Solar water splitting has received a lot of attention due to its high efficiency and clean energy production potential. Herein, based on the band alignment principle, the g-C3N4/TiO2-B(001) heterostructure is strategically designed, then a Li-F co-doping approach is developed and implemented, leading to significant enhancement in the photocatalytic hydrogen evolution efficiency of the heterostructure systems. The decomposition of water molecules on the surface of the heterostructures, the migration and diffusion of protons across the interface, and the hydrogen evolution performance are systematically studied and comprehensively analyzed. The results demonstrate that the heterojunction surface exhibits a relatively low energy barrier for water decomposition, facilitating both hydrogen evolution reaction (HER) and oxygen evolution reaction (OER). Proton transfer preferentially occurs from the TiO2-B(001) surface to the g-C3N4 surface through the interface. The presence of polar covalent bonds establishes a substantial energy barrier for proton migration from the TiO2-B(001) surface to the interface, representing a rate-determining factor in the hydrogen evolution process. The formation of hydrogen bonds significantly reduces the migration energy barrier for protons crossing the interface to the g-C3N4 surface. Hydrogen adsorption free energy analysis shows that the heterojunction surface exhibits optimal proton adsorption and desorption characteristics. The synergistic combination of a low water decomposition energy barrier, reduced proton migration energy barriers, and exceptional HER performance endows both the g-C3N4/TiO2-B(001) heterostructure and the Li-F co-doped g-C3N4/TiO2-B(001) heterojunction with remarkable potential as efficient HER photocatalysts.

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2506.04001 [pdf, ps, other]

CARL: Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor

Authors: Han Ji, Yuqi Feng, Jiahao Fan, Yanan Sun

Abstract: Performance predictors have emerged as a promising method to accelerate the evaluation stage of neural architecture search (NAS). These predictors estimate the performance of unseen architectures by learning from the correlation between a small set of trained architectures and their performance. However, most existing predictors ignore the inherent distribution shift between limited training samples and diverse test samples. Hence, they tend to learn spurious correlations as shortcuts to predictions, leading to poor generalization. To address this, we propose a Causality-guided Architecture Representation Learning (CARL) method aiming to separate critical (causal) and redundant (non-causal) features of architectures for generalizable architecture performance prediction. Specifically, we employ a substructure extractor to split the input architecture into critical and redundant substructures in the latent space. Then, we generate multiple interventional samples by pairing critical representations with diverse redundant representations to prioritize critical features. Extensive experiments on five NAS search spaces demonstrate the state-of-the-art accuracy and superior interpretability of CARL. For instance, CARL achieves 97.67% top-1 accuracy on CIFAR-10 using DARTS.

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2506.02167 [pdf, other]

Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360-Degree Firefighting Videos

Authors: Aditi Tiwari, Farzaneh Masoud, Dac Trong Nguyen, Jill Kraft, Heng Ji, Klara Nahrstedt

Abstract: Modern AI systems struggle most in environments where reliability is critical - scenes with smoke, poor visibility, and structural deformation. Each year, tens of thousands of firefighters are injured on duty, often due to breakdowns in situational perception. We introduce Fire360, a benchmark for evaluating perception and reasoning in safety-critical firefighting scenarios. The dataset includes 228 360-degree videos from professional training sessions under diverse conditions (e.g., low light, thermal distortion), annotated with action segments, object locations, and degradation metadata. Fire360 supports five tasks: Visual Question Answering, Temporal Action Captioning, Object Localization, Safety-Critical Reasoning, and Transformed Object Retrieval (TOR). TOR tests whether models can match pristine exemplars to fire-damaged counterparts in unpaired scenes, evaluating transformation-invariant recognition. While human experts achieve 83.5% on TOR, models like GPT-4o lag significantly, exposing failures in reasoning under degradation. By releasing Fire360 and its evaluation suite, we aim to advance models that not only see, but also remember, reason, and act under uncertainty. The dataset is available at: https://uofi.box.com/v/fire360dataset.

Submitted 2 June, 2025; originally announced June 2025.

Comments: 20 pages, 9 figures, 6 tables

arXiv:2506.00886 [pdf, ps, other]

Toward a Theory of Agents as Tool-Use Decision-Makers

Authors: Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Kam-Fai Wong

Abstract: As Large Language Models (LLMs) evolve into increasingly autonomous agents, fundamental questions about their epistemic foundations remain unresolved: What defines an agent? How should it make decisions? And what objectives should guide its behavior? In this position paper, we argue that true autonomy requires agents to be grounded in a coherent epistemic framework that governs what they know, what they need to know, and how to acquire that knowledge efficiently. We propose a unified theory that treats internal reasoning and external actions as equivalent epistemic tools, enabling agents to systematically coordinate introspection and interaction. Building on this framework, we advocate for aligning an agent's tool use decision-making boundary with its knowledge boundary, thereby minimizing unnecessary tool use and maximizing epistemic efficiency. This perspective shifts the design of agents from mere action executors to knowledge-driven intelligence systems, offering a principled path toward building foundation agents capable of adaptive, efficient, and goal-directed behavior.

Submitted 1 June, 2025; originally announced June 2025.

arXiv:2506.00671 [pdf, ps, other]

DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA

Authors: Yuelyu Ji, Hang Zhang, Shiven Verma, Hui Ji, Chun Li, Yushui Han, Yanshan Wang

Abstract: We propose DeepRAG, a novel framework that integrates DeepSeek hierarchical question decomposition capabilities with RAG Gym unified retrieval-augmented generation optimization using process level supervision. Targeting the challenging MedHopQA biomedical question answering task, DeepRAG systematically decomposes complex queries into precise sub-queries and employs concept level reward signals informed by the UMLS ontology to enhance biomedical accuracy. Preliminary evaluations on the MedHopQA dataset indicate that DeepRAG significantly outperforms baseline models, including standalone DeepSeek and RAG Gym, achieving notable improvements in both Exact Match and concept level accuracy.

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2505.22379 [pdf, other]

Dynamics of thin film flows on a vertical fibre with vapor absorption

Authors: Souradip Chattopadhyay, Zihao Yu, Y. Sungtaek Ju, Hangjie Ji

Abstract: Water vapor capture through free surface flows plays a crucial role in various industrial applications, such as liquid desiccant air conditioning systems, water harvesting, and dewatering. This paper studies the dynamics of a silicone liquid sorbent (also known as water-absorbing silicone oil) flowing down a vertical cylindrical fibre while absorbing water vapor. We propose a one-sided thin-film-type model for these dynamics, where the governing equations form a coupled system of nonlinear fourth-order partial differential equations for the liquid film thickness and oil concentration. The model incorporates gravity, surface tension, Marangoni effects induced by concentration gradients, and non-mass-conserving effects due to absorption flux. Interfacial instabilities, driven by the competition between mass-conserving and non-mass-conserving effects, are investigated via stability analysis. We numerically show that water absorption can lead to the formation of irregular wavy patterns and trigger droplet coalescence downstream. Systematic simulations further identify parameter ranges for the Marangoni number and absorption parameter that lead to the onset of droplet coalescence dynamics and regime transitions.

Submitted 28 May, 2025; originally announced May 2025.

Comments: 32 pages, 13 figures

arXiv:2505.21397 [pdf, ps, other]

DecisionFlow: Advancing Large Language Model as Principled Decision Maker

Authors: Xiusi Chen, Shanyong Wang, Cheng Qian, Hongru Wang, Peixuan Han, Heng Ji

Abstract: In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model's reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. We release the data and code at https://github.com/xiusic/DecisionFlow.

Submitted 27 May, 2025; originally announced May 2025.

Comments: 24 pages, 13 figures

arXiv:2505.20759 [pdf, ps, other]

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Authors: Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji

Abstract: Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.

Submitted 15 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

Comments: 18 pages

arXiv:2505.20067 [pdf, ps, other]

Community Moderation and the New Epistemology of Fact Checking on Social Media

Authors: Isabelle Augenstein, Michiel Bakker, Tanmoy Chakraborty, David Corney, Emilio Ferrara, Iryna Gurevych, Scott Hale, Eduard Hovy, Heng Ji, Irene Larraz, Filippo Menczer, Preslav Nakov, Paolo Papotti, Dhruv Sahnan, Greta Warren, Giovanni Zagni

Abstract: Social media platforms have traditionally relied on internal moderation teams and partnerships with independent fact-checking organizations to identify and flag misleading content. Recently, however, platforms including X (formerly Twitter) and Meta have shifted towards community-driven content moderation by launching their own versions of crowd-sourced fact-checking -- Community Notes. If effectively scaled and governed, such crowd-checking initiatives have the potential to combat misinformation with increased scale and speed as successfully as community-driven efforts once did with spam. Nevertheless, general content moderation, especially for misinformation, is inherently more complex. Public perceptions of truth are often shaped by personal biases, political leanings, and cultural contexts, complicating consensus on what constitutes misleading content. This suggests that community efforts, while valuable, cannot replace the indispensable role of professional fact-checkers. Here we systemically examine the current approaches to misinformation detection across major platforms, explore the emerging role of community-driven moderation, and critically evaluate both the promises and challenges of crowd-checking at scale.

Submitted 26 May, 2025; originally announced May 2025.

Comments: 1 Figure, 2 tables

arXiv:2505.16832 [pdf, other]

From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization

Authors: Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao

Abstract: While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at https://github.com/aiming-lab/EduVisBench and https://github.com/aiming-lab/EduVisAgent.

Submitted 27 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: 16 pages; 7 figures

arXiv:2505.15181 [pdf]

Manipulating the hydrogen-induced insulator-metal transition through artificial microstructure engineering

Authors: Xuanchi Zhou, Xiaohui Yao, Wentian Lu, Jinjian Guo, Jiahui Ji, Lili Lang, Guowei Zhou, Chunwei Yao, Xiaomei Qiao, Huihui Ji, Zhe Yuan, Xiaohong Xu

Abstract: Hydrogen-associated filling-controlled Mottronics within electron-correlated system provides a groundbreaking paradigm to explore exotic physical functionality and phenomena. Dynamically controlling hydrogen-induced phase transitions through external fields offers a promising route for designing protonic devices in multidisciplinary fields, but faces high-speed bottlenecks owing to slow bulk diffusion of hydrogens. Here, we present a promising pathway to kinetically expedite hydrogen-related Mott transition in correlated VO2 system by taking advantage of artificial microstructure design. Typically, inclined domain boundary configuration and cR-faceted preferential orientation simultaneously realized in VO2/Al2O3 (102) heterostructure significantly lower the diffusion barrier via creating an unobstructed conduit for hydrogen diffusion. As a result, the achievable switching speed through hydrogenation outperforms that of counterpart grown on widely-reported c-plane Al2O3 substrate by 2-3 times, with resistive switching concurrently improved by an order of magnitude. Of particular interest, an anomalous uphill hydrogen diffusion observed for VO2 with a highway for hydrogen diffusion fundamentally deviates from basic Fick's law, unveiling a deterministic role of hydrogen spatial distribution in tailoring electronic state evolution. The present work not only provides a versatile strategy for manipulating ionic evolution, endowing with great potential in designing high-speed protonic devices, but also deepens the understanding of hydrogen-induced Mott transitions in electron-correlated system.

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.15068 [pdf, other]

ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges

Authors: Cheng Qian, Hongyi Du, Hongru Wang, Xiusi Chen, Yuji Zhang, Avirup Sil, Chengxiang Zhai, Kathleen McKeown, Heng Ji

Abstract: Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real-world problems, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate outputs, we further propose ModelingJudge, an expert-in-the-loop system leveraging LLMs as domain-specialized judges assessing solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges.

Submitted 20 May, 2025; originally announced May 2025.

Comments: 36 Pages, 26 Figures, 5 Tables

arXiv:2505.12565 [pdf, ps, other]

mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model

Authors: Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Bowen Jin, Chetan Kumar Prasad, Sara Szymkuć, Bartosz A. Grzybowski, Ying Diao, Jiawei Han, Ge Liu, Hao Peng, Martin D. Burke, Heng Ji

Abstract: Despite their ability to understand chemical knowledge and accurately generate sequential representations, large language models (LLMs) remain limited in their capacity to propose novel molecules with drug-like properties. In addition, the molecules that LLMs propose can often be challenging to make in the lab. To more effectively enable the discovery of functional small molecules, LLMs need to learn a molecular language. However, LLMs are currently limited by encoding molecules from atoms. In this paper, we argue that just like tokenizing texts into (sub-)word tokens instead of characters, molecules should be decomposed and reassembled at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model tokenizing molecules into building blocks and learning a bilingual language model of both natural language descriptions of functions and molecule building blocks. By reasoning on such functional building blocks, mCLM guarantees to generate efficiently synthesizable molecules thanks to recent progress in block-based chemistry, while also improving the functions of molecules in a principled manner. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials. More importantly, mCLM can reason on multiple functions and improve the FDA-rejected drugs ("fallen angels") over multiple iterations to greatly improve their shortcomings.

Submitted 18 May, 2025; originally announced May 2025.

arXiv:2505.11961 [pdf, ps, other]

An Immersed Finite Element Method for Anisotropic Elliptic Interface Problems with Nonhomogeneous Jump Conditions

Authors: Haifeng Ji, Zhilin Li

Abstract: A new finite element method (FEM) using meshes that do not necessarily align with the interface is developed for two- and three-dimensional anisotropic elliptic interface problems with nonhomogeneous jump conditions. The degrees of freedom of the proposed method are the same as those of traditional nonconforming FEMs, while the function space is modified to account for the jump conditions of the solution. The modified function space on an interface element is shown to exist uniquely, independent of the element's shape and the manner in which the interface intersects it. Optimal error estimates for the method, along with the usual bound on the condition number of the stiffness matrix, are proven, with the error constant independent of the interface's location relative to the mesh. To solve the resulting linear system, a preconditioner is proposed in which a Gauss-Seidel smoother with the interface correction is employed to ensure robustness against large jumps in the diffusion matrix. Numerical experiments are provided to demonstrate the optimal convergence of the proposed method and the efficiency of the preconditioner.

Submitted 17 May, 2025; originally announced May 2025.

MSC Class: 65N15; 65N30; 35R05

arXiv:2505.08971 [pdf, ps, other]

Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training

Authors: Yangyi Chen, Hao Peng, Tong Zhang, Heng Ji

Abstract: In standard large vision-language models (LVLMs) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLMs training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.

Submitted 13 May, 2025; originally announced May 2025.

Comments: The code will be available at https://github.com/Yangyi-Chen/PRIOR

arXiv:2505.08162 [pdf, ps, other]

GDNTT: an Area-Efficient Parallel NTT Accelerator Using Glitch-Driven Near-Memory Computing and Reconfigurable 10T SRAM

Authors: Hengyu Ding, Houran Ji, Jia Li, Jinhang Chen, Chin-Wing Sham, Yao Wang

Abstract: With the rapid advancement of quantum computing technology, post-quantum cryptography (PQC) has emerged as a pivotal direction for next-generation encryption standards. Among these, lattice-based cryptographic schemes rely heavily on the fast Number Theoretic Transform (NTT) over polynomial rings, whose performance directly determines encryption/decryption throughput and energy efficiency. However, existing software-based NTT implementations struggle to meet the real-time performance and low-power requirements of IoT and edge devices. To address this challenge, this paper proposes an area-efficient highly parallel NTT accelerator with glitch-driven near-memory computing (GDNTT). The design integrates a 10T SRAM for data storage, enabling flexible row/column data access and streamlining circuit mapping strategies. Furthermore, a glitch generator is incorporated into the near-memory computing unit, significantly reducing the latency of butterfly operations. Evaluation results show that the proposed NTT accelerator achieves a 1.5-28× improvement in throughput-per-area compared to the state-of-the-art.

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2505.07849 [pdf, ps, other]

SweRank: Software Issue Localization with Code Ranking

Authors: Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty

Abstract: Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step reasoning and relying on closed-source LLMs. Alternatively, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.

Submitted 7 May, 2025; originally announced May 2025.

arXiv:2505.07775 [pdf, ps, other]

Must Read: A Systematic Survey of Computational Persuasion

Authors: Nimet Beyza Bozdag, Shuhaib Mehri, Xiaocheng Yang, Hyeonjeong Ha, Zirui Cheng, Esin Durmus, Jiaxuan You, Heng Ji, Gokhan Tur, Dilek Hakkani-Tür

Abstract: Persuasion is a fundamental aspect of communication, influencing decision-making across diverse contexts, from everyday conversations to high-stakes scenarios such as politics, marketing, and law. The rise of conversational AI systems has significantly expanded the scope of persuasion, introducing both opportunities and risks. AI-driven persuasion can be leveraged for beneficial applications, but also poses threats through manipulation and unethical influence. Moreover, AI systems are not only persuaders, but also susceptible to persuasion, making them vulnerable to adversarial attacks and bias reinforcement. Despite rapid advancements in AI-generated persuasive content, our understanding of what makes persuasion effective remains limited due to its inherently subjective and context-dependent nature. In this survey, we provide a comprehensive overview of computational persuasion, structured around three key perspectives: (1) AI as a Persuader, which explores AI-generated persuasive content and its applications; (2) AI as a Persuadee, which examines AI's susceptibility to influence and manipulation; and (3) AI as a Persuasion Judge, which analyzes AI's role in evaluating persuasive strategies, detecting manipulation, and ensuring ethical persuasion. We introduce a taxonomy for computational persuasion research and discuss key challenges, including evaluating persuasiveness, mitigating manipulative persuasion, and developing responsible AI-driven persuasive systems. Our survey outlines future research directions to enhance the safety, fairness, and effectiveness of AI-powered persuasion while addressing the risks posed by increasingly capable language models.

Submitted 12 May, 2025; originally announced May 2025.

arXiv:2505.07062 [pdf, ps, other]

Seed1.5-VL Technical Report

Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng , et al. (172 additional authors not shown)

Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

Submitted 11 May, 2025; originally announced May 2025.

arXiv:2505.06566 [pdf]

Dynamic Uncertainty Learning with Noisy Correspondence for Text-Based Person Search

Authors: Zequn Xie, Haoming Ji, Lingwei Meng

Abstract: Text-to-image person search aims to identify an individual based on a text description. To reduce data collection costs, large-scale text-image datasets are created from co-occurrence pairs found online. However, this can introduce noise, particularly mismatched pairs, which degrade retrieval performance. Existing methods often focus on negative samples, amplifying this noise. To address these issues, we propose the Dynamic Uncertainty and Relational Alignment (DURA) framework, which includes the Key Feature Selector (KFS) and a new loss function, Dynamic Softmax Hinge Loss (DSH-Loss). KFS captures and models noise uncertainty, improving retrieval reliability. The bidirectional evidence from cross-modal similarity is modeled as a Dirichlet distribution, enhancing adaptability to noisy data. DSH adjusts the difficulty of negative samples to improve robustness in noisy environments. Our experiments on three datasets show that the method offers strong noise resistance and improves retrieval performance in both low- and high-noise scenarios.

Submitted 10 May, 2025; originally announced May 2025.

arXiv:2505.02784 [pdf, other]

Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge

Authors: Vladyslav Zalevskyi, Thomas Sanchez, Misha Kaandorp, Margaux Roulet, Diego Fajardo-Rojas, Liu Li, Jana Hutter, Hongwei Bran Li, Matthew Barkovich, Hui Ji, Luca Wilhelmi, Aline Dändliker, Céline Steger, Mériam Koob, Yvan Gomez, Anton Jakovčić, Melita Klaić, Ana Adžić, Pavel Marković, Gracia Grabarić, Milan Rados, Jordina Aviles Verdera, Gregor Kasprian, Gregor Dovjak, Raphael Gaubert-Rachmühl , et al. (45 additional authors not shown)

Abstract: Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.

Submitted 8 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

arXiv:2505.02387 [pdf, ps, other]

RM-R1: Reward Modeling as Reasoning

Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji

Abstract: Reward modeling is essential for aligning large language models with human preferences through reinforcement learning from human feedback. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances in long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM interpretability and performance. To this end, we introduce a new class of generative reward models - Reasoning Reward Models (ReasRMs) - which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism - self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve state-of-the-art performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough empirical analyses to understand the key ingredients of successful ReasRM training.
To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.

Submitted 17 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

Comments: 25 pages, 8 figures

arXiv:2505.02332 [pdf, other]

Record Magnetic Field Generation by Laser-Driven Capacitor-Coil Targets

Authors: Lan Gao, Yang Zhang, Hantao Ji, Brandon K. Russell, Geoffrey Pomraning, Jesse Griff-McMahon, Sallee Klein, Carolyn Kuranz, Mingsheng Wei

Abstract: Magnetic fields generated by capacitor-coil targets driven by intense short-pulse lasers have been characterized using ultrafast proton radiography. A 1-kJ, 15-ps laser at a center wavelength of 1053 nm irradiated the back plate of the capacitor with an intensity of $\sim 8.3 \times 10^{18}$ W/cm$^{2}$, creating ultra-large currents in the connecting coils. High-quality proton data obtained in the axial probing geometry show definitive signatures of magnetic field generation, allowing precision measurement of the field distribution and strength. The data show a peak coil current of $150 \pm 20$ kA producing $250 \pm 30$ Tesla magnetic fields at the coil center. This sets a new record for magnetic field generation by short-pulse-powered capacitor-coil targets.

Submitted 4 May, 2025; originally announced May 2025.

arXiv:2505.02326 [pdf, other]

Determining Magnetic and Electric Field Generations in Laser-Driven Coil Targets

Authors: Yang Zhang, Lan Gao, Hantao Ji, Brandon K. Russell, Geoffrey Pomraning, Jesse Griff-McMahon, Sallee Klein, Carolyn Kuranz, Mingsheng Wei

Abstract: Laser-driven capacitor coils are widely used to generate intense magnetic fields for various applications in high-energy-density physics research. Accurate measurement of the magnetic fields is essential but challenging, due to the overlapping contributions from magnetic and electric fields in proton radiography, which is the primary tool diagnosing the field generation around the coils. In this study, we systematically analyze proton radiographs obtained from laser-driven capacitor-coil targets along two orthogonal axes under various electromagnetic field conditions, including magnetic field only, electric field only, and combined electromagnetic fields. By analyzing key features in the radiographs, we distinguish and characterize the respective contributions from magnetic and electric fields. Using detailed simulations validated by experimental benchmarks, methods to isolate and quantify the magnetic field and electric field are given. The methods are successfully applied to determine the electric current and charge distribution in a double coil configuration. Our findings provide insights into improving the diagnostic capability of proton radiography, potentially leading to more accurate measurements of electromagnetic fields and enhancing the utility of laser-driven capacitor coils in high-energy-density experiments.

Submitted 4 May, 2025; originally announced May 2025.

arXiv:2504.20314 [pdf, other]

Perturbation-efficient Zeroth-order Optimization for Hardware-friendly On-device Training

Authors: Qitao Tan, Sung-En Chang, Rui Xia, Huidong Ji, Chence Yang, Ci Zhang, Jun Liu, Zheng Zhan, Zhou Zou, Yanzhi Wang, Jin Lu, Geng Yuan

Abstract: Zeroth-order (ZO) optimization is an emerging deep neural network (DNN) training paradigm that offers computational simplicity and memory savings. However, this seemingly promising approach faces a significant and long-ignored challenge. ZO requires generating a substantial number of Gaussian random numbers, which poses significant difficulties and can even be infeasible on hardware platforms such as FPGAs and ASICs. In this paper, we identify this critical issue, which arises from the mismatch between algorithm and hardware designers. To address this issue, we propose PeZO, a perturbation-efficient ZO framework. Specifically, we design random number reuse strategies to significantly reduce the demand for random number generation and introduce a hardware-friendly adaptive scaling method to replace the costly Gaussian distribution with a uniform distribution. Our experiments show that PeZO reduces the LUTs and FFs required for random number generation by 48.6% and 12.7%, respectively, and saves up to 86% of power consumption, all without compromising training performance, making ZO optimization feasible for on-device training. To the best of our knowledge, we are the first to explore the potential of on-device ZO optimization, providing valuable insights for future research.

Submitted 28 April, 2025; originally announced April 2025.
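
As a rough illustration of the zeroth-order idea described in the abstract above, the following is a minimal sketch of a generic two-point ZO gradient estimator that draws its perturbation from a uniform distribution instead of a Gaussian. It is not PeZO: the random number reuse strategies and the adaptive scaling method are not reproduced, and the names `zo_gradient`, `zo_sgd`, and `quadratic`, as well as the step size and perturbation scale, are assumptions chosen for this example.

```python
import random
from typing import Callable, List

Vector = List[float]


def zo_gradient(f: Callable[[Vector], float], x: Vector,
                eps: float, rng: random.Random) -> Vector:
    """Two-point zeroth-order estimate from a single uniform perturbation.

    Only two evaluations of f are needed; no backpropagation is used.
    With u ~ Uniform[-1, 1]^d the estimate is, in expectation, proportional
    to the true gradient (factor 1/3 for this distribution).
    """
    u = [rng.uniform(-1.0, 1.0) for _ in x]
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    # Central finite difference along the random direction u.
    scale = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


def zo_sgd(f: Callable[[Vector], float], x0: Vector, lr: float = 0.1,
           eps: float = 1e-3, steps: int = 200, seed: int = 0) -> Vector:
    """Plain SGD driven by the zeroth-order estimate (hypothetical settings)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, eps, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x: Vector) -> float:
    """Toy objective standing in for a training loss."""
    return sum(xi * xi for xi in x)


start = [1.0, -2.0, 0.5, 3.0]
end = zo_sgd(quadratic, start)
print(quadratic(start), quadratic(end))
```

Every step consumes one fresh random number per parameter, which is the generation cost the abstract identifies as the hardware bottleneck.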

arXiv:2504.18838 [pdf, other]

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

Authors: Yixin Cao, Shibo Hong, Xinze Li, Jiahao Ying, Yubo Ma, Haiyuan Liang, Yantao Liu, Zijun Yao, Xiaozhi Wang, Dan Huang, Wenxuan Zhang, Lifu Huang, Muhao Chen, Lei Hou, Qianru Sun, Xingjun Ma, Zuxuan Wu, Min-Yen Kan, David Lo, Qi Zhang, Heng Ji, Jing Jiang, Juanzi Li, Aixin Sun, Xuanjing Huang, et al. (2 additional authors not shown)

Abstract: Large Language Models (LLMs) are advancing at an amazing speed and have become indispensable across academia, industry, and daily applications. To keep pace with the status quo, this survey probes the core challenges that the rise of LLMs poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety; and (ii) from manual to automated evaluation, encompassing dynamic dataset curation and "LLM-as-a-judge" scoring. Yet, even with these transitions, a crucial obstacle persists: the evaluation generalization issue. Bounded test sets cannot scale alongside models whose abilities grow seemingly without limit. We dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics. Because the field is evolving quickly, we will maintain a living GitHub repository (links are in each section) to crowd-source updates and corrections, and warmly invite contributors and collaborators.

Submitted 26 April, 2025; originally announced April 2025.

arXiv:2504.17040 [pdf, other]

DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

Authors: Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu

Abstract: We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: https://mikewangwzhl.github.io/dymu/.

Submitted 10 May, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

arXiv:2504.16939 [pdf, other]

A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions

Authors: Emre Can Acikgoz, Cheng Qian, Hongru Wang, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur

Abstract: Recent advances in Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. Yet, fundamental questions about their capabilities, limitations, and paths forward remain open. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence. To that end, we systematically analyze LLM-driven Conversational Agents by organizing their capabilities into three primary dimensions: (i) Reasoning - logical, systematic thinking inspired by human intelligence for decision making, (ii) Monitor - encompassing self-awareness and user interaction monitoring, and (iii) Control - focusing on tool utilization and policy following. Building upon this, we introduce a novel taxonomy by classifying recent work on Conversational Agents around our proposed desideratum. We identify critical research gaps and outline key directions, including realistic evaluations, long-term multi-turn reasoning skills, self-evolution capabilities, collaborative and multi-agent task completion, personalization, and proactivity. This work aims to provide a structured foundation, highlight existing limitations, and offer insights into potential future research directions for Conversational Agents, ultimately advancing progress toward Artificial General Intelligence (AGI).
We maintain a curated repository of papers at: https://github.com/emrecanacikgoz/awesome-conversational-agents.

Submitted 7 April, 2025; originally announced April 2025.

arXiv:2504.14870 [pdf, ps, other]

Acting Less is Reasoning More! Teaching Model to Act Efficiently

Authors: Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji

Abstract: Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools during long-form reasoning, such as search engines and code interpreters, to solve tasks beyond the capabilities of internal reasoning. While reinforcement learning (RL) has shown promise in training such agents, most existing approaches optimize only for final correctness without considering the efficiency or necessity of external tool use. This often leads to excessive tool calling, incurring high computational costs and hindering the development of internal reasoning capabilities - a phenomenon known as cognitive offloading. To this end, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers answer correctness and the model's tool-use behavior in reaching that answer. To validate its effectiveness, we introduce the metric of tool productivity, defined as the ratio between the number of correct answers and the total number of tool calls across all test cases. This metric reflects how efficiently tool usage contributes to successful task completion, with higher values indicating smarter and more autonomous reasoning. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO.
Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.

Submitted 31 May, 2025; v1 submitted 21 April, 2025; originally announced April 2025.
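
The tool productivity metric defined in the abstract above (correct answers divided by total tool calls across all test cases) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `tool_productivity`, the input layout, and the handling of the zero-tool-call case are all assumptions, since the abstract does not say how that case is treated.

```python
from typing import Sequence


def tool_productivity(correct: Sequence[bool],
                      tool_calls: Sequence[int]) -> float:
    """Ratio of correct answers to total tool calls over a test set.

    correct[i]    -- whether test case i was answered correctly
    tool_calls[i] -- number of tool calls made on test case i
    """
    if len(correct) != len(tool_calls):
        raise ValueError("correct and tool_calls must have the same length")
    total_correct = sum(1 for c in correct if c)
    total_calls = sum(tool_calls)
    if total_calls == 0:
        # Assumption: the ratio is undefined without tool calls, so report
        # infinity when anything was solved tool-free and 0.0 otherwise.
        return float("inf") if total_correct > 0 else 0.0
    return total_correct / total_calls


# Three test cases: two correct answers, four tool calls in total.
print(tool_productivity([True, False, True], [1, 3, 0]))  # 0.5
```

Because the denominator pools calls over the whole test set, a model that answers the same questions correctly with fewer calls scores strictly higher.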

arXiv:2504.14574 [pdf]

Utilizing Optic Fiber Interferometry in Forced Vibration Experimentation for Educational Purposes

Authors: Mingyuan Wang, Manli Zhou, Hengda Ji, Tao Lan

Abstract: This study introduces an experimental teaching method that employs optic fiber interferometry (OFI) to investigate forced vibration phenomena. It is designed for undergraduate physics majors with foundational mechanics and optics training and optics-focused graduate students. This approach aims to deepen students' understanding of forced vibration theory and interferometric measurement principles while fostering skills in experimental design, data analysis, and problem solving. Leveraging OFI's high-precision displacement measurement capabilities, the experiment enabled accurate tracking of frequency and displacement variations. By scanning the driving force frequency, students obtained amplitude frequency curves to determine the system's natural frequency, which closely aligned with theoretical predictions. This method may bridge theoretical concepts and practical applications, offering insights into teaching vibration theory and precision measurement techniques and equipping students with integrated knowledge for real-world challenges.

Submitted 20 April, 2025; originally announced April 2025.

arXiv:2504.13958 [pdf, other]

ToolRL: Reward is All Tool Learning Needs

Authors: Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji

Abstract: Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.

Submitted 16 April, 2025; originally announced April 2025.

Comments: 19 Pages, 12 Figures, 12 Tables

arXiv:2504.13460 [pdf, other]

Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization

Authors: Hongwei Ji, Wulian Yun, Mengshi Qi, Huadong Ma

Abstract: Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the localization task. Therefore, we propose a new few-shot temporal action localization method based on Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level to assist action localization, we design a Chain of Thought (CoT)-like reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoT-like text descriptions for videos. The generated texts can capture more variance of action than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets. We introduce the first dataset named Human-related Anomaly Localization and explore the application of the TAL task in human anomaly detection.
The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. We will release our code, data and benchmark.

Submitted 6 May, 2025; v1 submitted 18 April, 2025; originally announced April 2025.

arXiv:2504.12643 [pdf, ps, other]

RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding

Authors: Hang Ji, Tao Ni, Xufeng Huang, Zhan Shi, Tao Luo, Xin Zhan, Junbo Chen

Abstract: This technical report introduces a targeted improvement to the StreamPETR framework, specifically aimed at enhancing velocity estimation, a critical factor influencing the overall NuScenes Detection Score (NDS). While StreamPETR exhibits strong 3D bounding box detection performance, as reflected by its high mean Average Precision, our analysis identified velocity estimation as a substantial bottleneck when evaluated on the NuScenes dataset. To overcome this limitation, we propose a customized positional embedding strategy tailored to enhance temporal modeling capabilities. Experimental evaluations conducted on the NuScenes test set demonstrate that our improved approach achieves a state-of-the-art NDS of 70.86% using the ViT-L backbone, setting a new benchmark for camera-only 3D object detection.

Submitted 6 June, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

arXiv:2504.10707 [pdf]

Distinct hydrologic response patterns and trends worldwide revealed by physics-embedded learning

Authors: Haoyu Ji, Yalan Song, Tadd Bindas, Chaopeng Shen, Yuan Yang, Ming Pan, Jiangtao Liu, Farshid Rahmani, Ather Abbas, Hylke Beck, Kathryn Lawson, Yoshihide Wada

Abstract: To track rapid changes within our water sector, Global Water Models (GWMs) need to realistically represent hydrologic systems' response patterns - such as baseflow fraction - but are hindered by their limited ability to learn from data. Here we introduce a high-resolution physics-embedded big-data-trained model as a breakthrough in reliably capturing characteristic hydrologic response patterns ('signatures') and their shifts. By realistically representing the long-term water balance, the model revealed widespread shifts - up to ~20% over 20 years - in fundamental green-blue-water partitioning and baseflow ratios worldwide. Shifts in these response patterns, previously considered static, contributed to increasing flood risks in northern mid-latitudes, heightening water supply stresses in southern subtropical regions, and declining freshwater inputs to many European estuaries, all with ecological implications. With more accurate simulations at monthly and daily scales than current operational systems, this next-generation model resolves large, nonlinear seasonal runoff responses to rainfall ('elasticity') and streamflow flashiness in semi-arid and arid regions. These metrics highlight regions with management challenges due to large water supply variability and high climate sensitivity, but also provide tools to forecast seasonal water availability. This capability newly enables global-scale models to deliver reliable and locally relevant insights for water management.

Submitted 22 April, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

arXiv:2504.07316 [pdf, other]

Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization

Authors: Shujin Wu, Cheng Qian, Yi R. Fung, Paul Pu Liang, Heng Ji

Abstract: The growing capabilities of large language models (LLMs) present a key challenge of maintaining effective human oversight. Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable LLMs using weaker ones. Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. This hinders students from employing their knowledge during training and reaching their full potential. In this work, we introduce Alice (pro{A}ctive {l}earning w{i}th tea{c}her's D{e}monstrations), a framework that leverages complementary knowledge between teacher and student to enhance the learning process. We probe the knowledge base of the teacher model by eliciting its uncertainty, and then use these insights together with the teacher's responses as demonstrations to guide student models in self-generating improved responses for supervision. In addition, for situations with significant capability gaps between teacher and student models, we introduce cascade Alice, which employs a hierarchical training approach where weak teachers initially supervise intermediate models, which then guide stronger models in sequence. Experimental results demonstrate that our method significantly enhances W2SG performance, yielding substantial improvements in three key tasks compared to the original W2SG: knowledge-based reasoning (+4.0%), mathematical reasoning (+22.62%), and logical reasoning (+12.11%).
This highlights the effectiveness of our new W2SG paradigm, which enables more robust knowledge transfer and supervision outcomes.

Submitted 11 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

arXiv:2504.06659 [pdf, other]

Bridging the Gap Between Preference Alignment and Machine Unlearning

Authors: Xiaohua Feng, Yuyuan Li, Huwei Ji, Jiaming Zhang, Li Zhang, Tianyu Du, Chaochao Chen

Abstract: Despite advances in Preference Alignment (PA) for Large Language Models (LLMs), mainstream methods like Reinforcement Learning with Human Feedback (RLHF) face notable challenges. These approaches require high-quality datasets of positive preference examples, which are costly to obtain, and are computationally intensive due to training instability, limiting their use in low-resource scenarios. LLM unlearning presents a promising alternative by directly removing the influence of negative examples. However, current research has primarily focused on empirical validation, lacking systematic quantitative analysis. To bridge this gap, we propose a framework to explore the relationship between PA and LLM unlearning. Specifically, we introduce a bi-level optimization-based method to quantify the impact of unlearning specific negative examples on PA performance. Our analysis reveals that not all negative examples contribute equally to alignment improvement when unlearned, and the effect varies significantly across examples. Building on this insight, we pose a crucial question: how can we optimally select and weight negative examples for unlearning to maximize PA performance? To answer this, we propose a framework called Unlearning to Align (U2A), which leverages bi-level optimization to efficiently select and unlearn examples for optimal PA performance. We validate the proposed method through extensive experiments, with results confirming its effectiveness.

Submitted 9 April, 2025; originally announced April 2025.

Comments: 17 pages

arXiv:2504.04238 [pdf, other]

Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models

Authors: Yuheng Wu, Wentao Guo, Zirui Liu, Heng Ji, Zhaozhuo Xu, Denghui Zhang

Abstract: This paper investigates the emergence of Theory-of-Mind (ToM) capabilities in large language models (LLMs) from a mechanistic perspective, focusing on the role of extremely sparse parameter patterns. We introduce a novel method to identify ToM-sensitive parameters and reveal that perturbing as little as 0.001% of these parameters significantly degrades ToM performance while also impairing contextual localization and language understanding. To understand this effect, we analyze their interaction with core architectural components of LLMs. Our findings demonstrate that these sensitive parameters are closely linked to the positional encoding module, particularly in models using Rotary Position Embedding (RoPE), where perturbations disrupt dominant-frequency activations critical for contextual processing. Furthermore, we show that perturbing ToM-sensitive parameters affects LLM's attention mechanism by modulating the angle between queries and keys under positional encoding. These insights provide a deeper understanding of how LLMs acquire social reasoning abilities, bridging AI interpretability with cognitive science. Our results have implications for enhancing model alignment, mitigating biases, and improving AI systems designed for human interaction.

Submitted 5 April, 2025; originally announced April 2025.

arXiv:2503.24377 [pdf, other]

Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models

Authors: Rui Wang, Hongru Wang, Boyang Xue, Jianhui Pang, Shudong Liu, Yi Chen, Jiahao Qiu, Derek Fai Wong, Heng Ji, Kam-Fai Wong

Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks, transitioning from fast and intuitive thinking (System 1) to slow and deep reasoning (System 2). While System 2 reasoning improves task accuracy, it often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors. In contrast, System 1 reasoning is computationally efficient but leads to suboptimal performance. Consequently, it is critical to balance the trade-off between performance (benefits) and computational costs (budgets), giving rise to the concept of reasoning economy. In this survey, we provide a comprehensive analysis of reasoning economy in both the post-training and test-time inference stages of LLMs, encompassing i) the cause of reasoning inefficiency, ii) behavior analysis of different reasoning patterns, and iii) potential solutions to achieve reasoning economy. By offering actionable insights and highlighting open challenges, we aim to shed light on strategies for improving the reasoning economy of LLMs, thereby serving as a valuable resource for advancing research in this evolving area. We also provide a public repository to continually track developments in this fast-evolving field.

Submitted 31 March, 2025; originally announced March 2025.

Comments: In Progress; Paper list Repo: https://github.com/DevoAllen/Awesome-Reasoning-Economy-Papers

arXiv:2503.20666 [pdf, other]

TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

Authors: Huimin Xu, Seungjun Yi, Terence Lim, Jiawei Xu, Andrew Well, Carlos Mery, Aidong Zhang, Yuji Zhang, Heng Ji, Keshav Pingali, Yan Leng, Ying Ding

Abstract: Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. TA provides valuable insights in healthcare but is resource-intensive. Large Language Models (LLMs) have been introduced to perform TA, yet their applications in healthcare remain unexplored. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We leverage the scalability and coherence of multi-agent systems through structured conversations between agents and coordinate the expertise of cardiac experts in TA. Using interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness. TAMA demonstrates strong potential for automated TA in clinical settings by leveraging multi-agent LLM systems with human-in-the-loop integration by enhancing quality while significantly reducing manual workload.

Submitted 26 March, 2025; originally announced March 2025.

Comments: Submitted to the American Medical Informatics Association (AMIA) 2025 Annual Symposium, 10 pages

arXiv:2503.15126 [pdf, other]

Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation

Authors: Haoyu Ji, Bowen Chen, Weihong Ren, Wenze Huang, Zhihao Yang, Zhiyong Wang, Honghai Liu

Abstract: Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLM) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.

Submitted 19 March, 2025; originally announced March 2025.

Showing 1–50 of 684 results for author: Ji, H