-
UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
Authors:
Yanzhe Chen,
Huasong Zhong,
Yan Li,
Zhenheng Yang
Abstract:
Unified multimodal large language models (MLLMs) have shown promise in jointly advancing multimodal understanding and generation, with visual codebooks discretizing images into tokens for autoregressive modeling. Existing codebook-based methods either rely on small vocabularies (~16K entries) that lack fine-grained semantics or naively scale up, resulting in low token utilization and unstable training. We propose UniCode$^2$, a cascaded codebook framework enabling large-scale, semantically aligned, and stable visual tokenization. By clustering millions of SigLIP sequence embeddings, we build a 500K-entry codebook that preserves vision-language alignment while expanding capacity. Stability is ensured via a cascaded design: a frozen codebook anchors the embedding space, and a trainable codebook refines task-specific semantics. This decoupling promotes high utilization and robust learning. Moreover, the alignment of our visual tokens with textual semantics enables seamless integration with pretrained diffusion decoders, supporting high-quality visual synthesis with minimal adaptation. UniCode$^2$ delivers strong performance across diverse benchmarks, demonstrating the viability of scaling visual token spaces without sacrificing stability, semantics, or modularity.
Submitted 25 June, 2025;
originally announced June 2025.
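As a rough illustration of what "discretizing images into tokens" with a codebook means, the sketch below assigns each embedding to its nearest codebook entry and reports how much of the codebook is actually used. It is a toy under stated assumptions, not the UniCode$^2$ pipeline: the 3-entry codebook, the 2-dimensional embeddings, the squared Euclidean distance, and the function names `nearest_code`, `tokenize`, and `utilization` are all hypothetical.

```python
import math

def nearest_code(embedding, codebook):
    """Return the index of the codebook entry closest to `embedding`
    under squared Euclidean distance (an assumed metric)."""
    best_index, best_dist = -1, math.inf
    for index, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(embedding, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def tokenize(embeddings, codebook):
    """Map a sequence of embeddings to a sequence of discrete token ids."""
    return [nearest_code(e, codebook) for e in embeddings]

def utilization(tokens, codebook_size):
    """Fraction of codebook entries that appear at least once in `tokens`."""
    return len(set(tokens)) / codebook_size

# Toy data: a hypothetical 3-entry codebook and three patch embeddings.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
patch_embeddings = [[0.9, 0.1], [0.1, 0.8], [0.05, 0.02]]

tokens = tokenize(patch_embeddings, codebook)
usage = utilization(tokens, len(codebook))
print(tokens, usage)
```

With a very large codebook, a brute-force scan like this would normally be replaced by an approximate nearest-neighbor index; the scan is kept here only for readability.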
-
What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning
Authors:
Yuchang Zhu,
Huazhen Zhong,
Qunshu Lin,
Haotong Wei,
Xiaolong Sun,
Zixuan Yu,
Minghao Liu,
Zibin Zheng,
Liang Chen
Abstract:
With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigate data scarcity in specific domains and reduce time-consuming annotations. However, recent studies have highlighted a critical issue: iterative training on self-generated data results in model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect the importance of data diversity, a key factor in data quality. In this work, we aim to understand the implications of the diversity of LLM-generated data on downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data affect downstream model performance. Additionally, we investigate the performance of models trained on data that mixes different proportions of LLM-generated data, which we refer to as synthetic data. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.
Submitted 24 June, 2025; v1 submitted 23 June, 2025;
originally announced June 2025.
-
The Sample Complexity of Online Strategic Decision Making with Information Asymmetry and Knowledge Transportability
Authors:
Jiachen Hu,
Rui Ai,
Han Zhong,
Xiaoyu Chen,
Liwei Wang,
Zhaoran Wang,
Zhuoran Yang
Abstract:
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $ε$-optimal policy with a tight sample complexity of $O(1/ε^2)$.
Submitted 11 June, 2025;
originally announced June 2025.
-
USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
Authors:
Baolin Zheng,
Guanlin Chen,
Hongqiong Zhong,
Qingyang Teng,
Yingshui Tan,
Zhendong Liu,
Weixun Wang,
Jiaheng Liu,
Jian Yang,
Huiyun Jing,
Jincheng Wei,
Wenbo Su,
Xiaoyong Zhu,
Bo Zheng,
Kaifu Zhang
Abstract:
Despite their remarkable achievements and widespread adoption, Multimodal Large Language Models (MLLMs) have revealed significant security vulnerabilities, highlighting the urgent need for robust safety evaluation benchmarks. Existing MLLM safety benchmarks, however, fall short in data quality, coverage, and modal risk combinations, resulting in inflated and contradictory evaluation results, which hinders the discovery and governance of security concerns. Besides, we argue that vulnerabilities to harmful queries and oversensitivity to harmless ones should be considered simultaneously in MLLM safety evaluation, whereas these were previously considered separately. In this paper, to address these shortcomings, we introduce Unified Safety Benchmarks (USB), which is one of the most comprehensive evaluation benchmarks in MLLM safety. Our benchmark features high-quality queries, extensive risk categories, and comprehensive modal combinations, and encompasses both vulnerability and oversensitivity evaluations. From the perspective of two key dimensions, risk categories and modality combinations, we demonstrate that the available benchmarks -- even the union of the vast majority of them -- are far from being truly comprehensive. To bridge this gap, we design a sophisticated data synthesis pipeline that generates extensive, high-quality complementary data addressing previously unexplored aspects. By combining open-source datasets with our synthetic data, our benchmark provides 4 distinct modality combinations for each of the 61 risk sub-categories, covering both English and Chinese across both vulnerability and oversensitivity dimensions.
Submitted 26 May, 2025;
originally announced May 2025.
-
Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
Authors:
Muzhi Zhu,
Hao Zhong,
Canyu Zhao,
Zongze Du,
Zheng Huang,
Mingyu Liu,
Hao Chen,
Cheng Zou,
Jingdong Chen,
Ming Yang,
Chunhua Shen
Abstract:
Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
Submitted 27 May, 2025;
originally announced May 2025.
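For readers unfamiliar with GRPO, which the abstract above names as the base of ACTIVE-O3, its central idea is to score each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value function. The sketch below shows only that group-relative normalization step. It is not the ACTIVE-O3 training code: the reward values, the use of the population standard deviation, the `eps` stabilizer, and the function name `group_relative_advantages` are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps).
    Population std and eps are illustrative choices, not taken from the paper."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive positive advantages and the rest negative ones, so the policy update favors relatively better samples without needing a separate critic.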
-
GGBond: Growing Graph-Based AI-Agent Society for Socially-Aware Recommender Simulation
Authors:
Hailin Zhong,
Hanlin Wang,
Yujun Ye,
Meiyi Zhang,
Shengxin Zhu
Abstract:
Current personalized recommender systems predominantly rely on static offline data for algorithm design and evaluation, significantly limiting their ability to capture long-term user preference evolution and social influence dynamics in real-world scenarios. To address this fundamental challenge, we propose a high-fidelity social simulation platform integrating human-like cognitive agents and dynamic social interactions to realistically simulate user behavior evolution under recommendation interventions. Specifically, the system comprises a population of Sim-User Agents, each equipped with a five-layer cognitive architecture that encapsulates key psychological mechanisms, including episodic memory, affective state transitions, adaptive preference learning, and dynamic trust-risk assessments. In particular, we innovatively introduce the Intimacy--Curiosity--Reciprocity--Risk (ICR2) motivational engine grounded in psychological and sociological theories, enabling more realistic user decision-making processes. Furthermore, we construct a multilayer heterogeneous social graph (GGBond Graph) supporting dynamic relational evolution, effectively modeling users' evolving social ties and trust dynamics based on interest similarity, personality alignment, and structural homophily. During system operation, agents autonomously respond to recommendations generated by typical recommender algorithms (e.g., Matrix Factorization, MultVAE, LightGCN), deciding whether to consume, rate, and share content while dynamically updating their internal states and social connections, thereby forming a stable, multi-round feedback loop. This innovative design transcends the limitations of traditional static datasets, providing a controlled, observable environment for evaluating long-term recommender effects.
Submitted 27 May, 2025;
originally announced May 2025.
-
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
Authors:
Hao Zhong,
Muzhi Zhu,
Zongze Du,
Zheng Huang,
Canyu Zhao,
Mingyu Liu,
Wen Wang,
Hao Chen,
Chunhua Shen
Abstract:
Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits.
Experiments on two challenging benchmarks, namely Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
Submitted 26 May, 2025;
originally announced May 2025.
-
REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing
Authors:
Haitian Zhong,
Yuhuan Liu,
Ziyang Xu,
Guofan Liu,
Qiang Liu,
Shu Wu,
Zhe Zhao,
Liang Wang,
Tieniu Tan
Abstract:
Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it's contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnable linear transformation to compute a directional "belief shift" vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Relevant experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE show that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.
Submitted 24 May, 2025;
originally announced May 2025.
-
Statistical Inference under Performativity
Authors:
Xiang Li,
Yunai Li,
Huiying Zhong,
Lihua Lei,
Zhun Deng
Abstract:
Performativity of predictions refers to the phenomenon that prediction-informed decisions may influence the target they aim to predict, which is widely observed in policy-making in social sciences and economics. In this paper, we initiate the study of statistical inference under performativity. Our contribution is two-fold. First, we build a central limit theorem for estimation and inference under performativity, which enables inferential purposes in policy-making such as constructing confidence intervals or testing hypotheses. Second, we further leverage the derived central limit theorem to investigate prediction-powered inference (PPI) under performativity, which is based on a small labeled dataset and a much larger dataset of machine-learning predictions. This enables us to obtain more precise estimation and improved confidence regions for the model parameter (i.e., policy) of interest in performative prediction. We demonstrate the power of our framework by numerical experiments. To the best of our knowledge, this paper is the first one to establish statistical inference under performativity, which brings up new challenges and inference settings that we believe will add significant value to policy-making, statistics, and machine learning.
Submitted 18 June, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
NEAT: QCP: A Practical Separation Logic-based C Program Verification Tool
Authors:
Xiwei Wu,
Yueyang Feng,
Xiaoyang Lu,
Tianchuan Lin,
Kan Liu,
Zhiyi Wang,
Shushu Wu,
Lihan Xie,
Chengxi Yang,
Hongyi Zhong,
Naijun Zhan,
Zhenjiang Hu,
Qinxiang Cao
Abstract:
As software systems increase dramatically in size and complexity, ensuring their correctness, security, and reliability becomes an increasingly formidable challenge. Despite significant advancements in verification techniques and tools, there still remain substantial difficulties when applying these tools to complex, real-world scenarios. To address these difficulties, this paper introduces a novel verification tool, called Qualified C Programming Verifier (QCP). QCP incorporates a refined front-end assertion language to enhance user interaction. The proposed assertion language aims to lower the entry barrier for verification tools, improve proof efficiency by improving automation, and facilitate a deeper understanding of both the program and its verification results.
Submitted 19 May, 2025;
originally announced May 2025.
-
SafeMove-RL: A Certifiable Reinforcement Learning Framework for Dynamic Motion Constraints in Trajectory Planning
Authors:
Tengfei Liu,
Haoyang Zhong,
Jiazheng Hu,
Tan Zhang
Abstract:
This study presents a dynamic safety margin-based reinforcement learning framework for local motion planning in dynamic and uncertain environments. The proposed planner integrates real-time trajectory optimization with adaptive gap analysis, enabling effective feasibility assessment under partial observability constraints. To address safety-critical computations in unknown scenarios, an enhanced online learning mechanism is introduced, which dynamically corrects spatial trajectories by forming dynamic safety margins while maintaining control invariance. Extensive evaluations, including ablation studies and comparisons with state-of-the-art algorithms, demonstrate superior success rates and computational efficiency. The framework's effectiveness is further validated on both simulated and physical robotic platforms.
Submitted 18 May, 2025;
originally announced May 2025.
-
Not All Documents Are What You Need for Extracting Instruction Tuning Data
Authors:
Chi Zhang,
Huaping Zhong,
Hongtao Li,
Chengliang Chai,
Jiawei Hong,
Yuhao Deng,
Jiacheng Wang,
Tian Tan,
Yizhou Yan,
Jiantao Qiu,
Ye Yuan,
Guoren Wang,
Conghui He,
Lei Cao
Abstract:
Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B.
Submitted 18 May, 2025;
originally announced May 2025.
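The abstract above does not say which multi-armed bandit strategy EQUAL uses, so the following is only a generic sketch of the idea of spending an extraction budget on the document clusters that look most valuable. It swaps in the standard UCB1 rule as an assumed stand-in. The cluster quality scores, the deterministic reward, and the names `ucb1_select` and `run_bandit` are hypothetical.

```python
import math

def ucb1_select(counts, totals, t):
    """Pick an arm with the UCB1 rule: try every arm once, then choose
    the arm maximizing mean reward + sqrt(2 * ln(t) / pulls)."""
    for arm, pulls in enumerate(counts):
        if pulls == 0:
            return arm
    scores = [
        totals[arm] / counts[arm] + math.sqrt(2.0 * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(cluster_quality, rounds):
    """Simulate `rounds` selections over document clusters.
    `cluster_quality[arm]` is a deterministic stand-in for the usefulness of
    QA pairs extracted from that cluster (a real reward would be noisy)."""
    counts = [0] * len(cluster_quality)
    totals = [0.0] * len(cluster_quality)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        counts[arm] += 1
        totals[arm] += cluster_quality[arm]
    return counts

# Three hypothetical clusters; the second one yields the most useful QA pairs.
counts = run_bandit([0.2, 0.9, 0.4], rounds=200)
print(counts)
```

Most of the budget ends up on the best cluster while the others are still sampled occasionally, which is the exploration-exploitation balance that makes a bandit cheaper than extracting QA pairs from every document.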
-
A Physics-informed End-to-End Occupancy Framework for Motion Planning of Autonomous Vehicles
Authors:
Shuqi Shen,
Junjie Yang,
Hongliang Lu,
Hui Zhong,
Qiming Zhang,
Xinhu Zheng
Abstract:
Accurate and interpretable motion planning is essential for autonomous vehicles (AVs) navigating complex and uncertain environments. While recent end-to-end occupancy prediction methods have improved environmental understanding, they typically lack explicit physical constraints, limiting safety and generalization. In this paper, we propose a unified end-to-end framework that integrates verifiable physical rules into the occupancy learning process. Specifically, we embed artificial potential fields (APF) as physics-informed guidance during network training to ensure that predicted occupancy maps are both data-efficient and physically plausible. Our architecture combines convolutional and recurrent neural networks to capture spatial and temporal dependencies while preserving model flexibility. Experimental results demonstrate that our method improves task completion rate, safety margins, and planning efficiency across diverse driving scenarios, confirming its potential for reliable deployment in real-world AV systems.
Submitted 6 June, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
Hierarchical Multi-Label Generation with Probabilistic Level-Constraint
Authors:
Linqing Chen,
Weilei Wang,
Wentao Wu,
Hanmeng Zhong
Abstract:
Hierarchical Extreme Multi-Label Classification poses greater difficulties compared to traditional multi-label classification because of the intricate hierarchical connections of labels within a domain-specific taxonomy and the substantial number of labels. Some prior research centered on classifying text through several ancillary stages such as clustering algorithms and multiphase classification. Others attempted to leverage generative methods yet were unable to properly control the output of the generative model. We redefine the task from hierarchical multi-label classification to Hierarchical Multi-Label Generation (HMG) and employ a generative framework with Probabilistic Level Constraints (PLC) to generate hierarchical labels within a specific taxonomy that have complex hierarchical relationships. The approach we propose in this paper enables the framework to generate all relevant labels across levels for each document without relying on preliminary operations like clustering. Meanwhile, it can control the model output precisely in terms of count, length, and level. Experiments demonstrate that our approach not only achieves new SOTA performance in the HMG task, but also performs much better at constraining the model output than previous research work.
Submitted 30 April, 2025;
originally announced May 2025.
-
HoneyBee: Efficient Role-based Access Control for Vector Databases via Dynamic Partitioning
Authors:
Hongbin Zhong,
Matthew Lentz,
Nina Narodytska,
Adriana Szekeres,
Kexin Rong
Abstract:
As vector databases gain traction in enterprise applications, robust access control has become critical to safeguard sensitive data. Access control in these systems is often implemented through hybrid vector queries, which combine nearest neighbor search on vector data with relational predicates based on user permissions. However, existing approaches face significant trade-offs: creating dedicated indexes for each user minimizes query latency but introduces excessive storage redundancy, while building a single index and applying access control after vector search reduces storage overhead but suffers from poor recall and increased query latency. This paper introduces HoneyBee, a dynamic partitioning framework that bridges the gap between these approaches by leveraging the structure of Role-Based Access Control (RBAC) policies. RBAC, widely adopted in enterprise settings, groups users into roles and assigns permissions to those roles, creating a natural "thin waist" in the permission structure that is ideal for partitioning decisions. Specifically, HoneyBee produces overlapping partitions where vectors can be strategically replicated across different partitions to reduce query latency while controlling storage overhead. By introducing analytical models for the performance and recall of the vector search, HoneyBee formulates the partitioning strategy as a constrained optimization problem to dynamically balance storage, query efficiency, and recall. Evaluations on RBAC workloads demonstrate that HoneyBee reduces storage redundancy compared to role partitioning and achieves up to 6x faster query speeds than row-level security (RLS) with only 1.4x storage increase, offering a practical middle ground for secure and efficient vector search.
Submitted 2 May, 2025;
originally announced May 2025.
-
Bridge the Domains: Large Language Models Enhanced Cross-domain Sequential Recommendation
Authors:
Qidong Liu,
Xiangyu Zhao,
Yejing Wang,
Zijian Zhang,
Howard Zhong,
Chong Chen,
Xiang Li,
Wei Huang,
Feng Tian
Abstract:
Cross-domain Sequential Recommendation (CDSR) aims to extract user preferences from historical interactions across various domains. Despite some progress in CDSR, two problems hinder further advancement: the overlap dilemma and transition complexity. The former means that existing CDSR methods rely heavily on users who have interactions in all domains to learn cross-domain item relationships, compromising practicability. The latter refers to the difficulty of learning complex transition patterns from mixed behavior sequences. With powerful representation and reasoning abilities, Large Language Models (LLMs) are promising for addressing these two problems by bridging the items and capturing the user's preferences from a semantic view. Therefore, we propose an LLMs Enhanced Cross-domain Sequential Recommendation model (LLM4CDSR). To obtain the semantic item relationships, we first propose an LLM-based unified representation module to represent items. Then, a trainable adapter with contrastive regularization is designed to adapt to the CDSR task. Besides, a hierarchical LLMs profiling module is designed to summarize user cross-domain preferences. Finally, these two modules are integrated into the proposed tri-thread framework to derive recommendations. We have conducted extensive experiments on three public cross-domain datasets, validating the effectiveness of LLM4CDSR. We have released the code online.
Submitted 25 April, 2025;
originally announced April 2025.
-
High-Efficiency Split Computing for Cooperative Edge Systems: A Novel Compressed Sensing Bottleneck
Authors:
Hailin Zhong,
Donglong Chen
Abstract:
The advent of big data and AI has precipitated a demand for computational frameworks that ensure real-time performance, accuracy, and privacy. While edge computing mitigates latency and privacy concerns, its scalability is constrained by the resources of edge devices, prompting the adoption of split computing (SC) to address these limitations. However, SC faces challenges in (1) efficient data transmission under bandwidth constraints and (2) balancing accuracy with real-time performance. To tackle these challenges, we propose a novel split computing architecture inspired by compressed sensing (CS) theory. At its core is the High-Efficiency Compressed Sensing Bottleneck (HECS-B), which incorporates an efficient compressed sensing autoencoder into the shallow layer of a deep neural network (DNN) to create a bottleneck layer using the knowledge distillation method. This bottleneck splits the DNN into a distributed model while efficiently compressing intermediate feature data, preserving critical information for seamless reconstruction in the cloud.
Through rigorous theoretical analysis and extensive experimental validation in both simulated and real-world settings, we demonstrate the effectiveness of the proposed approach. Compared to state-of-the-art methods, our architecture reduces bandwidth utilization by 50%, maintains high accuracy, and achieves a 60% speed-up in computational efficiency. The results highlight significant improvements in bandwidth efficiency, processing speed, and model accuracy, underscoring the potential of HECS-B to bridge the gap between resource-constrained edge devices and computationally intensive cloud services.
Submitted 15 April, 2025;
originally announced April 2025.
-
Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation
Authors:
Hanmeng Zhong,
Linqing Chen,
Weilei Wang,
Wentao Wu
Abstract:
Recently, the application of retrieval-augmented Large Language Models (LLMs) in specific domains has gained significant attention, especially in biopharmaceuticals. However, in this context, there is no benchmark specifically designed for biopharmaceuticals to evaluate LLMs. In this paper, we introduce the Biopharmaceuticals Retrieval-Augmented Generation Evaluation (BRAGE), the first benchmark tailored for evaluating LLMs' Query and Reference Understanding Capability (QRUC) in the biopharmaceutical domain, available in English, French, German and Chinese. In addition, traditional Question-Answering (QA) metrics like accuracy and exact match fall short in open-ended retrieval-augmented QA scenarios. To address this, we propose a citation-based classification method to evaluate the QRUC of LLMs, that is, their ability to understand the relationship between queries and references. We apply this method to evaluate the mainstream LLMs on BRAGE. Experimental results show that there is a significant gap in the biopharmaceutical QRUC of mainstream LLMs, and their QRUC needs to be improved.
Submitted 15 April, 2025;
originally announced April 2025.
-
Streamlining Biomedical Research with Specialized LLMs
Authors:
Linqing Chen,
Weilei Wang,
Yubin Xia,
Wentao Wu,
Peng Xu,
Zilong Bai,
Jie Fang,
Chaobo Xu,
Ran Hu,
Licong Xu,
Haoran Hua,
Jing Sun,
Hanmeng Zhong,
Jin Liu,
Tian Qiu,
Haowen Liu,
Meng Hu,
Xiuwen Li,
Fei Gao,
Yong Gu,
Tao Shi,
Chaochao Wang,
Jianping Lu,
Cheng Sun,
Yixin Wang
, et al. (8 additional authors not shown)
Abstract:
In this paper, we propose a novel system that integrates state-of-the-art, domain-specific large language models with advanced information retrieval techniques to deliver comprehensive and context-aware responses. Our approach facilitates seamless interaction among diverse components, enabling cross-validation of outputs to produce accurate, high-quality responses enriched with relevant data, images, tables, and other modalities. We demonstrate the system's capability to enhance response precision by leveraging a robust question-answering model, significantly improving the quality of dialogue generation. The system provides an accessible platform for real-time, high-fidelity interactions, allowing users to benefit from efficient human-computer interaction, precise retrieval, and simultaneous access to a wide range of literature and data. This dramatically improves the research efficiency of professionals in the biomedical and pharmaceutical domains and facilitates faster, more informed decision-making throughout the R&D process. Furthermore, the system proposed in this paper is available at https://synapse-chat.patsnap.com.
Submitted 15 April, 2025;
originally announced April 2025.
-
SCFANet: Style Distribution Constraint Feature Alignment Network For Pathological Staining Translation
Authors:
Zetong Chen,
Yuzhuo Chen,
Hai Zhong,
Xu Qiao
Abstract:
Immunohistochemical (IHC) staining serves as a valuable technique for detecting specific antigens or proteins through antibody-mediated visualization. However, the IHC staining process is both time-consuming and costly. To address these limitations, the application of deep learning models for direct translation of cost-effective Hematoxylin and Eosin (H&E) stained images into IHC stained images has emerged as an efficient solution. Nevertheless, the conversion from H&E to IHC images presents significant challenges, primarily due to alignment discrepancies between image pairs and the inherent diversity in IHC staining style patterns. To overcome these challenges, we propose the Style Distribution Constraint Feature Alignment Network (SCFANet), which incorporates two innovative modules: the Style Distribution Constrainer (SDC) and Feature Alignment Learning (FAL). The SDC ensures consistency between the generated and target images' style distributions while integrating cycle consistency loss to maintain structural consistency. To mitigate the complexity of direct image-to-image translation, the FAL module decomposes the end-to-end translation task into two subtasks: image reconstruction and feature alignment. Furthermore, we ensure pathological consistency between generated and target images by maintaining pathological pattern consistency and Optical Density (OD) uniformity. Extensive experiments conducted on the Breast Cancer Immunohistochemical (BCI) dataset demonstrate that our SCFANet model outperforms existing methods, achieving precise transformation of H&E-stained images into their IHC-stained counterparts. The proposed approach not only addresses the technical challenges in H&E to IHC image translation but also provides a robust framework for accurate and efficient stain conversion in pathological analysis.
Submitted 1 April, 2025;
originally announced April 2025.
-
4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video
Authors:
Qiang Hu,
Zihan Zheng,
Houqiang Zhong,
Sihua Fu,
Li Song,
Xiaoyun Zhang,
Guangtao Zhai,
Yanfeng Wang
Abstract:
3D Gaussian Splatting (3DGS) has substantial potential for enabling photorealistic Free-Viewpoint Video (FVV) experiences. However, the vast number of Gaussians and their associated attributes poses significant challenges for storage and transmission. Existing methods typically handle dynamic 3DGS representation and compression separately, neglecting motion information and the rate-distortion (RD) trade-off during training, leading to performance degradation and increased model redundancy. To address this gap, we propose 4DGC, a novel rate-aware 4D Gaussian compression framework that significantly reduces storage size while maintaining superior RD performance for FVV. Specifically, 4DGC introduces a motion-aware dynamic Gaussian representation that utilizes a compact motion grid combined with sparse compensated Gaussians to exploit inter-frame similarities. This representation effectively handles large motions, preserving quality and reducing temporal redundancy. Furthermore, we present an end-to-end compression scheme that employs differentiable quantization and a tiny implicit entropy model to compress the motion grid and compensated Gaussians efficiently. The entire framework is jointly optimized using a rate-distortion trade-off. Extensive experiments demonstrate that 4DGC supports variable bitrates and consistently outperforms existing methods in RD performance across multiple datasets.
Submitted 24 March, 2025;
originally announced March 2025.
-
Serial Low-rank Adaptation of Vision Transformer
Authors:
Houqiang Zhong,
Shaocheng Shen,
Ke Cai,
Zhenglong Wu,
Jiangchao Yao,
Yuan Cheng,
Xuefei Li,
Xiaoyun Zhang,
Li Song,
Qiang Hu
Abstract:
Fine-tuning large pre-trained vision foundation models in a parameter-efficient manner is critical for downstream vision tasks, considering the practical constraints of computational and storage costs. Low-rank adaptation (LoRA) is a well-established technique in this domain, achieving impressive efficiency by reducing the parameter space to a low-rank form. However, developing more advanced low-rank adaptation methods to reduce parameters and memory requirements remains a significant challenge in resource-constrained application scenarios. In this study, we build on the commonly used vision transformer and propose Serial LoRA, a novel LoRA variant that introduces a shared low-rank matrix composed serially with the attention mechanism. Such a design extracts the underlying commonality of parameters in adaptation, significantly reducing redundancy. Notably, Serial LoRA uses only 1/4 of the parameters of LoRA but achieves comparable performance in most cases. We conduct extensive experiments on a range of vision foundation models with the transformer structure, and the results confirm the consistent superiority of our method.
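One way to read the 1/4 parameter figure is sketched below. It assumes standard LoRA places a separate rank-r pair of matrices on each of four attention projections, while the shared variant uses a single pair of the same rank. That accounting, the function names, and the example sizes (d_model = 768, rank = 8) are illustrative assumptions and are not taken from the abstract.

```python
def lora_pair_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one low-rank pair A (d_in x rank) and B (rank x d_out)."""
    return rank * (d_in + d_out)


def per_projection_total(d_model: int, rank: int, n_proj: int = 4) -> int:
    """Assumed baseline: one independent low-rank pair per attention projection."""
    return n_proj * lora_pair_params(d_model, d_model, rank)


def shared_total(d_model: int, rank: int) -> int:
    """Assumed shared variant: a single low-rank pair reused by the attention block."""
    return lora_pair_params(d_model, d_model, rank)


# Hypothetical sizes chosen only for illustration.
d_model, rank = 768, 8
baseline = per_projection_total(d_model, rank)
shared = shared_total(d_model, rank)
ratio = shared / baseline
print(baseline, shared, ratio)
```

Under these assumptions the ratio is exactly 1/n_proj, so it does not depend on the model width or the rank.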
Submitted 22 March, 2025;
originally announced March 2025.
-
Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning
Authors:
Mengyao Lyu,
Yan Li,
Huasong Zhong,
Wenhao Yang,
Hui Chen,
Jungong Han,
Guiguang Ding,
Zhenheng Yang
Abstract:
The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection.
To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data with varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.
Submitted 17 March, 2025;
originally announced March 2025.
-
Global AI Governance: Where the Challenge is the Solution- An Interdisciplinary, Multilateral, and Vertically Coordinated Approach
Authors:
Huixin Zhong,
Thao Do,
Ynagliu Jie,
Rostam J. Neuwirth,
Hong Shen
Abstract:
Current global AI governance frameworks struggle with fragmented disciplinary collaboration, ineffective multilateral coordination, and disconnects between policy design and grassroots implementation. This study, guided by Integration and Implementation Science (IIS), initiated a structured interdisciplinary dialogue at the UN Science Summit, convening legal, NGO, and HCI experts to tackle those challenges. Drawing on the common ground of the experts (dynamism, experimentation, inclusivity, and paradoxical governance), this study, through thematic analysis and interdisciplinary comparison analysis, identifies four core principles of global AI governance. Furthermore, we translate these abstract principles into concrete action plans leveraging the distinct yet complementary perspectives of each discipline. These principles and action plans are then integrated into a five-phase, time-sequential framework including foundation building, experimental verification, collaborative optimization, global adaptation, and continuous evolution phases. This multilevel framework offers a novel and concrete pathway toward establishing interdisciplinary, multilateral, and vertically coordinated AI governance, transforming global AI governance challenges into opportunities for political actions.
Submitted 12 February, 2025;
originally announced March 2025.
-
MIDAS: Modeling Ground-Truth Distributions with Dark Knowledge for Domain Generalized Stereo Matching
Authors:
Peng Xu,
Zhiyu Xiang,
Jingyun Fu,
Tianyu Pu,
Hanzhi Zhong,
Eryun Liu
Abstract:
Despite the significant advances in domain generalized stereo matching, existing methods still exhibit domain-specific preferences when transferring from synthetic to real domains, hindering their practical applications in complex and diverse scenarios. The probability distributions predicted by the stereo network naturally encode rich similarity and uncertainty information. Inspired by this observation, we propose to extract these two types of dark knowledge from the pre-trained network to model intuitive multi-modal ground-truth distributions for both edge and non-edge regions. To mitigate the inherent domain preferences of a single network, we adopt network ensemble and further distinguish between objective and biased knowledge in the Laplace parameter space. Finally, the objective knowledge and the original disparity labels are jointly modeled as a mixture of Laplacians to provide fine-grained supervision for the stereo network training. Extensive experiments demonstrate that: 1) Our method is generic and effectively improves the generalization of existing networks. 2) PCWNet with our method achieves the state-of-the-art generalization performance on both KITTI 2015 and 2012 datasets. 3) Our method outperforms existing methods in comprehensive ranking across four popular real-world datasets.
Submitted 6 March, 2025;
originally announced March 2025.
-
Efficient and Universal Neural-Network Decoder for Stabilizer-Based Quantum Error Correction
Authors:
Gengyuan Hu,
Wanli Ouyang,
Chao-Yang Lu,
Chen Lin,
Han-Sen Zhong
Abstract:
Scaling quantum computing to practical applications necessitates reliable quantum error correction. Although numerous correction codes have been proposed, the overall correction efficiency is critically limited by the decoding algorithms. We introduce GraphQEC, a code-agnostic decoder leveraging machine learning on the graph structure of stabilizer codes with linear time complexity. GraphQEC demonstrates unprecedented accuracy and efficiency across all tested code families, including surface codes, color codes, and quantum low-density parity-check (QLDPC) codes. For instance, on a distance-12 QLDPC code, GraphQEC achieves a logical error rate of $9.55 \times 10^{-5}$, an 18-fold improvement over the previous best specialized decoder's $1.74 \times 10^{-3}$ under a physical error rate of $p=0.005$, while maintaining a decoding speed of $157\,\mu$s/cycle. Our approach represents the first universal solution for real-time quantum error correction across arbitrary stabilizer codes.
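The 18-fold figure follows directly from the two logical error rates quoted in the abstract. The short check below reproduces it; the function and variable names are illustrative and not from the paper.

```python
def improvement_factor(baseline_rate: float, improved_rate: float) -> float:
    """Return how many times smaller the improved error rate is than the baseline."""
    return baseline_rate / improved_rate


# Logical error rates as quoted in the abstract (distance-12 QLDPC code, p = 0.005).
previous_best_rate = 1.74e-3
graphqec_rate = 9.55e-5

factor = improvement_factor(previous_best_rate, graphqec_rate)
print(f"{factor:.1f}x")  # prints "18.2x"
```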
Submitted 3 June, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
A Dynamic UAVs Cooperative Suppressive Jamming Method with Joint Task Assignment and Bandwidth Allocation
Authors:
Ruiqing Han,
Tianxian Zhang,
Han Zhong,
Yuanhang Wang
Abstract:
The low detectability and low cost of unmanned aerial vehicles (UAVs) allow them to swarm near the radar network for effective jamming. The key to jamming is the reasonable task assignment and resource allocation of UAVs. However, the existing allocation model is somewhat idealized, adapts poorly to dynamic environments, and rarely considers frequency matching, so it cannot suppress the frequency agile radar (FAR) network effectively. To solve these problems, a dynamic UAVs cooperative suppressive jamming method with joint task assignment and bandwidth allocation is proposed. To represent the matching relationship between UAVs and FARs, a system model of task assignment and bandwidth allocation is established, and the problem is formulated as a dynamic mixed integer programming (D-MIP) problem. Then, a suppressive jamming evaluation indicator is proposed, and the utility function is designed based on the Quality of Service (QoS) framework to quantify the jamming effect of UAVs. To solve the combinatorial optimization problem, a two-step dynamic hybrid algorithm based on the Kriging model is proposed, which can obtain the task assignment and bandwidth allocation schemes of UAVs while consuming fewer computational resources in a dynamic environment. Simulation results show that the proposed method is effective in terms of jamming performance, computational resource saving and dynamic environment adaptability.
Submitted 26 February, 2025;
originally announced February 2025.
-
A general language model for peptide identification
Authors:
Jixiu Zhai,
Tianchi Lu,
Haitian Zhong,
Ziyang Xu,
Yuhuan Liu,
Shengrui Xu,
Jingwan Wang,
Dan Huang
Abstract:
Accurate identification of bioactive peptides (BPs) and protein post-translational modifications (PTMs) is essential for understanding protein function and advancing therapeutic discovery. However, most computational methods remain limited in their generalizability across diverse peptide functions. Here, we present PDeepPP, a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer-convolutional architecture, enabling robust identification across diverse peptide classes and PTM sites. We curated comprehensive benchmark datasets and implemented strategies to address data imbalance, allowing PDeepPP to systematically extract both global and local sequence features. Through extensive analyses, including dimensionality reduction and comparison studies, PDeepPP demonstrates strong, interpretable peptide representations and achieves state-of-the-art performance in 25 of the 33 biological identification tasks. Notably, PDeepPP attains high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, with 99.5% specificity in glycosylation site prediction and substantial reduction in false negatives in antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available via GitHub: https://github.com/fondress/PDeepPP and Hugging Face: https://huggingface.co/fondress/PDeppPP.
Submitted 30 June, 2025; v1 submitted 21 February, 2025;
originally announced February 2025.
-
An approach for API synthesis using large language models
Authors:
Hua Zhong,
Shan Jiang,
Sarfraz Khurshid
Abstract:
APIs play a pivotal role in modern software development by enabling seamless communication and integration between various systems, applications, and services. Component-based API synthesis is a form of program synthesis that constructs an API by assembling predefined components from a library. Existing API synthesis techniques typically implement dedicated search strategies over bounded spaces of possible implementations, which can be very large and time-consuming to explore. In this paper, we present a novel approach that uses large language models (LLMs) for API synthesis. LLMs offer a foundational technology to capture developer insights and provide an ideal framework for enabling more effective API synthesis. We perform an experimental evaluation of our approach using 135 real-world programming tasks, and compare it with FrAngel, a state-of-the-art API synthesis tool. The experimental results show that our approach completes 133 of the tasks, and overall outperforms FrAngel. We believe LLMs provide a very useful foundation for tackling the problem of API synthesis in particular, and program synthesis in general.
Submitted 21 February, 2025;
originally announced February 2025.
-
Less is More: Improving LLM Alignment via Preference Data Selection
Authors:
Xun Deng,
Han Zhong,
Rui Ai,
Fuli Feng,
Zheng Wang,
Xiangnan He
Abstract:
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10% of the Ultrafeedback dataset, our approach achieves 3% to 8% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3% improvement with 25% online data, revealing the high redundancy in this presumed high-quality data construction manner. These results highlight the potential of data selection strategies for advancing preference optimization.
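A minimal sketch of margin-based selection is given below. It assumes each preference pair already carries a scalar reward for the chosen and the rejected response from a single reward source, and that "margin maximization" means keeping the pairs with the largest reward gap. The tuple layout, the function name, and the toy scores are hypothetical, and the Bayesian aggregation of multiple margin sources described in the abstract is not modeled.

```python
from typing import List, Tuple

# (pair_id, reward_chosen, reward_rejected); this layout is an assumption.
PreferencePair = Tuple[str, float, float]


def select_by_margin(pairs: List[PreferencePair], keep_fraction: float) -> List[PreferencePair]:
    """Keep the fraction of pairs with the largest chosen-minus-rejected reward margin."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Sort from largest to smallest margin.
    ranked = sorted(pairs, key=lambda p: p[1] - p[2], reverse=True)
    # Always keep at least one pair.
    keep_count = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep_count]


# Toy data with made-up reward scores.
toy_pairs = [
    ("a", 0.9, 0.1),  # margin 0.8
    ("b", 0.6, 0.5),  # margin 0.1 (likely noisy)
    ("c", 0.7, 0.2),  # margin 0.5
    ("d", 0.4, 0.6),  # margin -0.2 (reward model disagrees with the label)
]
selected = select_by_margin(toy_pairs, keep_fraction=0.5)
print([pair_id for pair_id, _, _ in selected])
```

Low-margin and negative-margin pairs are the ones dropped first, which matches the abstract's motivation of filtering noisy preference data.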
Submitted 14 June, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Qwen2.5-VL Technical Report
Authors:
Shuai Bai,
Keqin Chen,
Xuejing Liu,
Jialin Wang,
Wenbin Ge,
Sibo Song,
Kai Dang,
Peng Wang,
Shijie Wang,
Jun Tang,
Humen Zhong,
Yuanzhi Zhu,
Mingkun Yang,
Zhaohai Li,
Jianqiang Wan,
Pengfei Wang,
Wei Ding,
Zheren Fu,
Yiheng Xu,
Jiabo Ye,
Xi Zhang,
Tianbao Xie,
Zesen Cheng,
Hang Zhang,
Zhibo Yang
, et al. (2 additional authors not shown)
Abstract:
We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
Submitted 19 February, 2025;
originally announced February 2025.
-
Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning
Authors:
Xun Wang,
Zhuoran Li,
Hai Zhong,
Longbo Huang
Abstract:
As a data-driven approach, offline MARL learns superior policies solely from offline datasets, ideal for domains rich in historical data but with high interaction costs and risks. However, most existing methods are task-specific, requiring retraining for new tasks, leading to redundancy and inefficiency. To address this issue, in this paper, we propose a task-efficient multi-task offline MARL algorithm, Skill-Discovery Conservative Q-Learning (SD-CQL). Unlike existing offline skill-discovery methods, SD-CQL discovers skills by reconstructing the next observation. It then evaluates fixed and variable actions separately and employs behavior-regularized conservative Q-learning to execute the optimal action for each skill. This approach eliminates the need for local-global alignment and enables strong multi-task generalization from limited small-scale source tasks. Substantial experiments on StarCraftII demonstrate the superior generalization performance and task-efficiency of SD-CQL. It achieves the best performance on $\textbf{10}$ out of $14$ task sets, with up to $\textbf{65\%}$ improvement on individual task sets, and is within $4\%$ of the best baseline on the remaining four.
Submitted 13 February, 2025;
originally announced February 2025.
-
Deployment-friendly Lane-changing Intention Prediction Powered by Brain-inspired Spiking Neural Networks
Authors:
Shuqi Shen,
Junjie Yang,
Hui Zhong,
Hongliang Lu,
Xinhu Zheng,
Hai Yang
Abstract:
Accurate and real-time prediction of surrounding vehicles' lane-changing intentions is a critical challenge in deploying safe and efficient autonomous driving systems in open-world scenarios. Existing high-performing methods remain hard to deploy due to their high computational cost, long training times, and excessive memory requirements. Here, we propose an efficient lane-changing intention prediction approach based on brain-inspired Spiking Neural Networks (SNN). By leveraging the event-driven nature of SNN, the proposed approach enables us to encode the vehicle's states in a more efficient manner. Comparison experiments conducted on HighD and NGSIM datasets demonstrate that our method significantly improves training efficiency and reduces deployment costs while maintaining comparable prediction accuracy. Particularly, compared to the baseline, our approach reduces training time by 75% and memory usage by 99.9%. These results validate the efficiency and reliability of our method in lane-changing predictions, highlighting its potential for safe and efficient autonomous driving systems while offering significant advantages in deployment, including reduced training time, lower memory usage, and faster inference.
Submitted 8 May, 2025; v1 submitted 9 February, 2025;
originally announced February 2025.
-
Learning an Optimal Assortment Policy under Observational Data
Authors:
Yuxuan Han,
Han Zhong,
Miao Lu,
Jose Blanchet,
Zhengyuan Zhou
Abstract:
We study the fundamental problem of offline assortment optimization under the Multinomial Logit (MNL) model, where sellers must determine the optimal subset of the products to offer based solely on historical customer choice data. While most existing approaches to learning-based assortment optimization focus on the online learning of the optimal assortment through repeated interactions with customers, such exploration can be costly or even impractical in many real-world settings. In this paper, we consider the offline learning paradigm and investigate the minimal data requirements for efficient offline assortment optimization. To this end, we introduce Pessimistic Rank-Breaking (PRB), an algorithm that combines rank-breaking with pessimistic estimation. We prove that PRB is nearly minimax optimal by establishing the tight suboptimality upper bound and a nearly matching lower bound. This further shows that "optimal item coverage" - where each item in the optimal assortment appears sufficiently often in the historical data - is both sufficient and necessary for efficient offline learning. This significantly relaxes the previous requirement of observing the complete optimal assortment in the data. Our results provide fundamental insights into the data requirements for offline assortment optimization under the MNL model.
Submitted 15 June, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning
Authors:
Han Zhong,
Yutong Yin,
Shenao Zhang,
Xiaojun Xu,
Yuanxin Liu,
Yifei Zuo,
Zhihan Liu,
Boyi Liu,
Sirui Zheng,
Hongyi Guo,
Liwei Wang,
Mingyi Hong,
Zhaoran Wang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Within this framework, we introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps. First, it generates high-quality rationales by approximating the optimal thinking process through reinforcement learning, using a novel reward shaping mechanism. Second, it enhances the base LLM by maximizing the joint probability of rationale generation with respect to the model's parameters. Theoretically, we demonstrate BRiTE's convergence at a rate of $1/T$ with $T$ representing the number of iterations. Empirical evaluations on math and coding benchmarks demonstrate that our approach consistently improves performance across different base models without requiring human-annotated thinking processes. In addition, BRiTE demonstrates superior performance compared to existing algorithms that bootstrap thinking processes using alternative methods such as rejection sampling, and can even match or exceed the results achieved through supervised fine-tuning with human-annotated data.
Submitted 6 June, 2025; v1 submitted 30 January, 2025;
originally announced January 2025.
-
VARFVV: View-Adaptive Real-Time Interactive Free-View Video Streaming with Edge Computing
Authors:
Qiang Hu,
Qihan He,
Houqiang Zhong,
Guo Lu,
Xiaoyun Zhang,
Guangtao Zhai,
Yanfeng Wang
Abstract:
Free-view video (FVV) allows users to explore immersive video content from multiple views. However, delivering FVV poses significant challenges due to the uncertainty in view switching, combined with the substantial bandwidth and computational resources required to transmit and decode multiple video streams, which may result in frequent playback interruptions. Existing approaches, either client-based or cloud-based, struggle to meet high Quality of Experience (QoE) requirements under limited bandwidth and computational resources. To address these issues, we propose VARFVV, a bandwidth- and computationally-efficient system that enables real-time interactive FVV streaming with high QoE and low switching delay. Specifically, VARFVV introduces a low-complexity FVV generation scheme that reassembles multiview video frames at the edge server based on user-selected view tracks, eliminating the need for transcoding and significantly reducing computational overhead. This design makes it well-suited for large-scale, mobile-based UHD FVV experiences. Furthermore, we present a popularity-adaptive bit allocation method, leveraging a graph neural network, that predicts view popularity and dynamically adjusts bit allocation to maximize QoE within bandwidth constraints. We also construct an FVV dataset comprising 330 videos from 10 scenes, including basketball, opera, etc. Extensive experiments show that VARFVV surpasses existing methods in video quality, switching latency, computational efficiency, and bandwidth usage, supporting over 500 users on a single edge server with a switching delay of 71.5ms. Our code and dataset are available at https://github.com/qianghu-huber/VARFVV.
Submitted 23 January, 2025;
originally announced January 2025.
-
Offline Critic-Guided Diffusion Policy for Multi-User Delay-Constrained Scheduling
Authors:
Zhuoran Li,
Ruishuo Chen,
Hai Zhong,
Longbo Huang
Abstract:
Effective multi-user delay-constrained scheduling is crucial in various real-world applications, such as instant messaging, live streaming, and data center management. In these scenarios, schedulers must make real-time decisions to satisfy both delay and resource constraints without prior knowledge of system dynamics, which are often time-varying and challenging to estimate. Current learning-based methods typically require interactions with actual systems during the training stage, which can be difficult or impractical, as such interactions can significantly degrade system performance and incur substantial service costs. To address these challenges, we propose a novel offline reinforcement learning-based algorithm, named Scheduling By Offline Learning with Critic Guidance and Diffusion Generation (SOCD), to learn efficient scheduling policies purely from pre-collected offline data. SOCD innovatively employs a diffusion-based policy network, complemented by a sampling-free critic network for policy guidance. By integrating the Lagrangian multiplier optimization into the offline reinforcement learning, SOCD effectively trains high-quality constraint-aware policies exclusively from available datasets, eliminating the need for online interactions with the system. Experimental results demonstrate that SOCD is resilient to various system dynamics, including partially observable and large-scale environments, and delivers superior performance compared to existing methods.
Submitted 22 January, 2025;
originally announced January 2025.
-
Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning
Authors:
Hanwen Zhong,
Jiaxin Chen,
Yutong Zhang,
Di Huang,
Yunhong Wang
Abstract:
Multi-Task Learning (MTL) for Vision Transformer aims at enhancing the model capability by tackling multiple tasks simultaneously. Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to efficiently perform multi-task learning. However, their rigid combination hampers both the optimization of MoE and the effectiveness of reparameterization of LoRA, leading to sub-optimal performance and low inference speed. In this work, we propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL) by transforming a pre-trained Vision Transformer into an efficient multi-task learner during training, and reparameterizing the learned structure for efficient inference. Specifically, we first develop the MoEfied LoRA structure, which decomposes the pre-trained Transformer into a low-rank MoE structure and employs LoRA to fine-tune the parameters. Subsequently, we take into account the intrinsic asynchronous nature of multi-task learning and devise a learning Quality Retaining (QR) optimization mechanism, by leveraging the historical high-quality class logits to prevent a well-trained task from performance degradation. Finally, we design a router fading strategy to integrate the learned parameters into the original Transformer, achieving efficient inference. Extensive experiments on public benchmarks demonstrate the superiority of our method, compared to the state-of-the-art multi-task learning approaches.
Submitted 12 January, 2025;
originally announced January 2025.
-
SCKD: Semi-Supervised Cross-Modality Knowledge Distillation for 4D Radar Object Detection
Authors:
Ruoyu Xu,
Zhiyu Xiang,
Chenwei Zhang,
Hanzhi Zhong,
Xijun Zhao,
Ruina Dang,
Peng Xu,
Tianyu Pu,
Eryun Liu
Abstract:
3D object detection is one of the fundamental perception tasks for autonomous vehicles. Fulfilling such a task with a 4D millimeter-wave radar is very attractive since the sensor is able to acquire 3D point clouds similar to Lidar while maintaining robust measurements under adverse weather. However, due to the high sparsity and noise associated with the radar point clouds, the performance of the existing methods is still much lower than expected. In this paper, we propose a novel Semi-supervised Cross-modality Knowledge Distillation (SCKD) method for 4D radar-based 3D object detection. It characterizes the capability of learning the feature from a Lidar-radar-fused teacher network with semi-supervised distillation. We first propose an adaptive fusion module in the teacher network to boost its performance. Then, two feature distillation modules are designed to facilitate the cross-modality knowledge transfer. Finally, a semi-supervised output distillation is proposed to increase the effectiveness and flexibility of the distillation framework. With the same network structure, our radar-only student trained by SCKD boosts the mAP by 10.38% over the baseline and outperforms the state-of-the-art works on the VoD dataset. The experiment on ZJUODset also shows 5.12% mAP improvements on the moderate difficulty level over the baseline when extra unlabeled data are available. Code is available at https://github.com/Ruoyu-Xu/SCKD.
Submitted 19 December, 2024;
originally announced December 2024.
-
Differential Privacy Preserving Distributed Quantum Computing
Authors:
Hui Zhong,
Keyi Ju,
Jiachen Shen,
Xinyue Zhang,
Xiaoqi Qin,
Tomoaki Ohtsuki,
Miao Pan,
Zhu Han
Abstract:
Existing quantum computers can only operate with hundreds of qubits in the Noisy Intermediate-Scale Quantum (NISQ) state, while quantum distributed computing (QDC) is regarded as a reliable way to address this limitation, allowing quantum computers to achieve their full computational potential. However, similar to classical distributed computing, QDC also faces the problem of privacy leakage. Existing research has introduced quantum differential privacy (QDP) for privacy protection in central quantum computing, but there are no dedicated privacy protection mechanisms for QDC. To fill this research gap, our paper introduces a novel concept called quantum Rényi differential privacy (QRDP), which incorporates the advantages of classical Rényi DP and is applicable in the QDC domain. Based on the new quantum Rényi divergence, QRDP provides delicate and flexible privacy protection by introducing parameter $\alpha$. In particular, the QRDP composition is well suited for QDC, since it allows for more precise control of the total privacy budget in scenarios requiring multiple quantum operations. We analyze a variety of noise mechanisms that can implement QRDP, and derive the lowest privacy budget provided by these mechanisms. Finally, we investigate the impact of different quantum parameters on QRDP. Through our simulations, we also find that adding noise will make the data less usable, but increase the level of privacy protection.
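As a rough illustration of why Rényi-style composition helps track a total privacy budget over multiple operations, the sketch below uses the classical (non-quantum) Rényi DP rules. These are an assumption for illustration only and are not the QRDP definitions from the paper: at a fixed order α the ε values of sequential mechanisms add, an (α, ε)-RDP guarantee converts to (ε + log(1/δ)/(α − 1), δ)-DP, and the Gaussian mechanism with noise scale σ satisfies (α, αΔ²/(2σ²))-RDP. All function names are hypothetical.

```python
import math


def gaussian_rdp(alpha, sigma, sensitivity=1.0):
    """Classical RDP epsilon of the Gaussian mechanism at order alpha."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)


def compose_rdp(epsilons):
    """Sequential composition at one fixed order alpha: RDP epsilons add."""
    return sum(epsilons)


def rdp_to_dp(alpha, rdp_epsilon, delta):
    """Convert (alpha, eps)-RDP to (eps', delta)-DP using the classical bound."""
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return rdp_epsilon + math.log(1.0 / delta) / (alpha - 1.0)


if __name__ == "__main__":
    alpha, sigma, delta = 8.0, 4.0, 1e-5
    per_step = gaussian_rdp(alpha, sigma)        # one noisy operation
    total = compose_rdp([per_step] * 10)         # ten operations in sequence
    print("RDP epsilon after 10 steps:", total)
    print("(eps, delta)-DP epsilon:", rdp_to_dp(alpha, total, delta))
```

In practice the conversion is usually evaluated over a grid of α values and the smallest resulting ε is reported; the sketch fixes a single α to keep the arithmetic visible.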
Submitted 6 January, 2025; v1 submitted 16 December, 2024;
originally announced December 2024.
-
VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression
Authors:
Qiang Hu,
Houqiang Zhong,
Zihan Zheng,
Xiaoyun Zhang,
Zhengxue Cheng,
Li Song,
Guangtao Zhai,
Yanfeng Wang
Abstract:
Neural Radiance Field (NeRF)-based volumetric video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences that provide audiences with unprecedented immersion and interactivity. However, the substantial data volumes pose significant challenges for storage and transmission. Existing solutions typically optimize NeRF representation and compression independently or focus on a single fixed rate-distortion (RD) tradeoff. In this paper, we propose VRVVC, a novel end-to-end joint optimization variable-rate framework for volumetric video compression that achieves variable bitrates using a single model while maintaining superior RD performance. Specifically, VRVVC introduces a compact tri-plane implicit residual representation for inter-frame modeling of long-duration dynamic scenes, effectively reducing temporal redundancy. We further propose a variable-rate residual representation compression scheme that leverages a learnable quantization and a tiny MLP-based entropy model. This approach enables variable bitrates through the utilization of predefined Lagrange multipliers to manage the quantization error of all latent representations. Finally, we present an end-to-end progressive training strategy combined with a multi-rate-distortion loss function to optimize the entire framework. Extensive experiments demonstrate that VRVVC achieves a wide range of variable bitrates within a single model and surpasses the RD performance of existing methods across various datasets.
Submitted 15 December, 2024;
originally announced December 2024.
-
CoopetitiveV: Leveraging LLM-powered Coopetitive Multi-Agent Prompting for High-quality Verilog Generation
Authors:
Zhendong Mi,
Renming Zheng,
Haowen Zhong,
Yue Sun,
Seth Kneeland,
Sayan Moitra,
Ken Kutzer,
Zhaozhuo Xu,
Shaoyi Huang
Abstract:
Recent advances in agentic LLMs have demonstrated great capabilities in Verilog code generation. However, existing approaches either use LLM-assisted single-agent prompting or cooperation-only multi-agent learning, which leads to: (i) Degeneration issue for single-agent learning: characterized by diminished error detection and correction capabilities; (ii) Error propagation in cooperation-only multi-agent learning: erroneous information from the former agent will be propagated to the latter through prompts, which can make the latter agents generate buggy code. In this paper, we propose an LLM-based coopetitive multi-agent prompting framework, in which the agents not only collaborate with each other to form the generation pipeline, but also create a healthy competitive mechanism to improve the generating quality. Our experimental results show that the coopetitive multi-agent framework can effectively mitigate the degeneration risk and reduce the error propagation while improving code error correction capabilities, resulting in higher quality Verilog code generation. The effectiveness of our approach is validated through extensive experiments. On the VerilogEval Machine and Human datasets, CoopetitiveV+GPT-4 achieves 99.2% and 99.1% pass@10 scores, respectively. On RTLLM, CoopetitiveV+GPT-4 obtains 100% syntax and 99.9% functionality pass@5 scores.
Submitted 5 June, 2025; v1 submitted 14 December, 2024;
originally announced December 2024.
-
Deep Learning Models for Colloidal Nanocrystal Synthesis
Authors:
Kai Gu,
Yingping Liang,
Jiaming Su,
Peihan Sun,
Jia Peng,
Naihua Miao,
Zhimei Sun,
Ying Fu,
Haizheng Zhong,
Jun Zhang
Abstract:
Colloidal synthesis of nanocrystals usually includes complex chemical reactions and multi-step crystallization processes. Despite the great success in the past 30 years, it remains challenging to clarify the correlations between synthetic parameters of chemical reaction and physical properties of nanocrystals. Here, we developed a deep learning-based nanocrystal synthesis model that correlates synthetic parameters with the final size and shape of target nanocrystals, using a dataset of 3500 recipes covering 348 distinct nanocrystal compositions. The size and shape labels were obtained from transmission electron microscope images using a segmentation model trained with a semi-supervised algorithm on a dataset comprising 1.2 million nanocrystals. By applying the reaction intermediate-based data augmentation method and elaborated descriptors, the synthesis model was able to predict nanocrystal's size with a mean absolute error of 1.39 nm, while reaching an 89% average accuracy for shape classification. The synthesis model shows knowledge transfer capabilities across different nanocrystals with inputs of new recipes. With that, the influence of chemicals on the final size of nanocrystals was further evaluated, revealing the importance order of nanocrystal composition, precursor or ligand, and solvent. Overall, the deep learning-based nanocrystal synthesis model offers a powerful tool to expedite the development of high-quality nanocrystals.
Submitted 14 December, 2024;
originally announced December 2024.
-
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Authors:
Zhibo Yang,
Jun Tang,
Zhaohai Li,
Pengfei Wang,
Jianqiang Wan,
Humen Zhong,
Xuejing Liu,
Mingkun Yang,
Peng Wang,
Shuai Bai,
LianWen Jin,
Junyang Lin
Abstract:
Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent these capabilities extend to literacy tasks with rich structure and fine-grained visual challenges. The current landscape lacks a comprehensive benchmark to effectively measure the literate capabilities of LMMs. Existing benchmarks are often limited by narrow scenarios and specified tasks. To this end, we introduce CC-OCR, a comprehensive benchmark that possesses a diverse range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and are released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centered tasks, facilitating continued progress in this crucial area.
Submitted 10 December, 2024; v1 submitted 3 December, 2024;
originally announced December 2024.
-
Hide in Plain Sight: Clean-Label Backdoor for Auditing Membership Inference
Authors:
Depeng Chen,
Hao Chen,
Hulin Jin,
Jie Cui,
Hong Zhong
Abstract:
Membership inference attacks (MIAs) are critical tools for assessing privacy risks and ensuring compliance with regulations like the General Data Protection Regulation (GDPR). However, their potential for auditing unauthorized use of data remains underexplored. To bridge this gap, we propose a novel clean-label backdoor-based approach for MIAs, designed specifically for robust and stealthy data auditing. Unlike conventional methods that rely on detectable poisoned samples with altered labels, our approach retains natural labels, enhancing stealthiness even at low poisoning rates. Our approach employs an optimal trigger generated by a shadow model that mimics the target model's behavior. This design minimizes the feature-space distance between triggered samples and the source class while preserving the original data labels. The result is a powerful and undetectable auditing mechanism that overcomes limitations of existing approaches, such as label inconsistencies and visual artifacts in poisoned samples. The proposed method enables robust data auditing through black-box access, achieving high attack success rates across diverse datasets and model architectures. Additionally, it addresses challenges related to trigger stealthiness and poisoning durability, establishing itself as a practical and effective solution for data auditing. Comprehensive experiments validate the efficacy and generalizability of our approach, outperforming several baseline methods in both stealth and attack success metrics.
Submitted 24 November, 2024;
originally announced November 2024.
-
CLMIA: Membership Inference Attacks via Unsupervised Contrastive Learning
Authors:
Depeng Chen,
Xiao Liu,
Jie Cui,
Hong Zhong
Abstract:
Since a machine learning model is often trained on a limited dataset, it is trained multiple times on the same data samples, which causes the model to memorize most of the training set data. Membership Inference Attacks (MIAs) exploit this feature to determine whether a data sample was used for training a machine learning model. However, in realistic scenarios, it is difficult for the adversary to obtain enough qualified samples that mark accurate identity information, especially since most samples are non-members in real-world applications. To address this limitation, in this paper, we propose a new attack method called CLMIA, which uses unsupervised contrastive learning to train an attack model without using extra membership status information. Meanwhile, in CLMIA, we require only a small amount of data with known membership status to fine-tune the attack model. Experimental results demonstrate that CLMIA performs better than existing attack methods for different datasets and model structures, especially with data with less marked identity information. In addition, we experimentally find that the attack performs differently for different proportions of labeled identity information for member and non-member data. More analysis proves that our attack method performs better with less labeled identity information, which applies to more realistic scenarios.
Submitted 17 November, 2024;
originally announced November 2024.
-
Gender Bias of LLM in Economics: An Existentialism Perspective
Authors:
Hui Zhong,
Songsheng Chen,
Mian Liang
Abstract:
Large Language Models (LLMs), such as GPT-4 and BERT, have rapidly gained traction in natural language processing (NLP) and are now integral to financial decision-making. However, their deployment introduces critical challenges, particularly in perpetuating gender biases that can distort decision-making outcomes in high-stakes economic environments. This paper investigates gender bias in LLMs through both mathematical proofs and empirical experiments using the Word Embedding Association Test (WEAT), demonstrating that LLMs inherently reinforce gender stereotypes even without explicit gender markers. By comparing the decision-making processes of humans and LLMs, we reveal fundamental differences: while humans can override biases through ethical reasoning and individualized understanding, LLMs maintain bias as a rational outcome of their mathematical optimization on biased data. Our analysis proves that bias in LLMs is not an unintended flaw but a systematic result of their rational processing, which tends to preserve and amplify existing societal biases encoded in training data. Drawing on existentialist theory, we argue that LLM-generated bias reflects entrenched societal structures and highlights the limitations of purely technical debiasing methods. This research underscores the need for new theoretical frameworks and interdisciplinary methodologies that address the ethical implications of integrating LLMs into economic and financial decision-making. We advocate for a reconceptualization of how LLMs influence economic decisions, emphasizing the importance of incorporating human-like ethical considerations into AI governance to ensure fairness and equity in AI-driven financial systems.
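For readers unfamiliar with the Word Embedding Association Test named above, the following is a minimal sketch of the standard WEAT effect size, not the authors' experimental code. It assumes each word is already represented as a plain list of floats; the toy 2-D vectors and all function names are hypothetical. The association of a word w with attribute sets A and B is the mean cosine similarity to A minus the mean cosine similarity to B, and the effect size is the difference of mean associations of target sets X and Y divided by the sample standard deviation over X ∪ Y.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus that to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])


def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of X and Y, scaled by the sample
    standard deviation of the associations over all target words."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


if __name__ == "__main__":
    # Hypothetical toy embeddings: X leans toward A, Y leans toward B.
    X = [[1.0, 0.0], [0.9, 0.1]]
    Y = [[0.0, 1.0], [0.1, 0.9]]
    A = [[1.0, 0.0]]
    B = [[0.0, 1.0]]
    print("WEAT effect size:", weat_effect_size(X, Y, A, B))
```

A positive value indicates that the X targets sit closer to the A attributes than the Y targets do; swapping X and Y flips the sign.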
Submitted 13 October, 2024;
originally announced October 2024.
-
Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration
Authors:
Hai Zhong,
Xun Wang,
Zhuoran Li,
Longbo Huang
Abstract:
Offline-to-Online Reinforcement Learning has emerged as a powerful paradigm, leveraging offline data for initialization and online fine-tuning to enhance both sample efficiency and performance. However, most existing research has focused on single-agent settings, with limited exploration of the multi-agent extension, i.e., Offline-to-Online Multi-Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre-trained Q-values due to distributional shifts during the transition from offline-to-online phases, and (ii) the difficulty of efficient exploration in the large joint state-action space. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving knowledge gained during offline training, ensuring smoother transitions, and enabling efficient fine-tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively utilizes the pre-trained offline policy for exploration, thereby significantly reducing the joint state-action space to be explored. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.
Submitted 25 October, 2024;
originally announced October 2024.
-
CrystalX: Ultra-Precision Crystal Structure Resolution and Error Correction Using Deep Learning
Authors:
Kaipeng Zheng,
Weiran Huang,
Wanli Ouyang,
Han-Sen Zhong,
Yuqiang Li
Abstract:
Atomic structure analysis of crystalline materials is a paramount endeavor in both chemical and material sciences. This sophisticated technique necessitates not only a solid foundation in crystallography but also a profound comprehension of the intricacies of the accompanying software, posing a significant challenge in meeting the rigorous daily demands. For the first time, we confront this challenge head-on by harnessing the power of deep learning for ultra-precise structural analysis at the full-atom level. To validate the performance of the model, named CrystalX, we employed a vast dataset comprising over 50,000 X-ray diffraction measurements derived from authentic experiments, demonstrating performance that is commensurate with human experts and adept at deciphering intricate geometric patterns. Remarkably, CrystalX revealed that even peer-reviewed publications can harbor errors that are stealthy to human scrutiny, yet CrystalX adeptly rectifies them. This deep learning model revolutionizes the time frame for crystal structure analysis, slashing it down to seconds. It has already been successfully applied in the structure analysis of newly discovered compounds in the latest research without human intervention. Overall, CrystalX marks the beginning of a new era in automating routine structural analysis within self-driving laboratories.
Submitted 17 October, 2024;
originally announced October 2024.
-
Reinforcement Learning Based Bidding Framework with High-dimensional Bids in Power Markets
Authors:
Jinyu Liu,
Hongye Guo,
Yun Li,
Qinghu Tang,
Fuquan Huang,
Tunan Chen,
Haiwang Zhong,
Qixin Chen
Abstract:
Over the past decade, bidding in power markets has attracted widespread attention. Reinforcement Learning (RL) has been widely used for power market bidding as a powerful AI tool to make decisions under real-world uncertainties. However, current RL methods mostly employ low dimensional bids, which significantly diverge from the N price-power pairs commonly used in the current power markets. The N-pair bidding format is denoted as High Dimensional Bids (HDBs), which has not been fully integrated into the existing RL-based bidding methods. The loss of flexibility in current RL bidding methods could greatly limit the bidding profits and make it difficult to tackle the rising uncertainties brought by renewable energy generations. In this paper, we intend to propose a framework to fully utilize HDBs for RL-based bidding methods. First, we employ a special type of neural network called Neural Network Supply Functions (NNSFs) to generate HDBs in the form of N price-power pairs. Second, we embed the NNSF into a Markov Decision Process (MDP) to make it compatible with most existing RL methods. Finally, experiments on Energy Storage Systems (ESSs) in the PJM Real-Time (RT) power market show that the proposed bidding method with HDBs can significantly improve bidding flexibility, thereby improving the profit of the state-of-the-art RL bidding methods.
Submitted 14 October, 2024;
originally announced October 2024.