Search | arXiv e-print repository

Concolic Testing of Quantum Programs

Authors: Shangzhou Xia, Jianjun Zhao, Fuyuan Zhang, Xiaoyu Guo

Abstract: This paper presents the first concolic testing framework specifically designed for quantum programs. The framework defines quantum conditional statements that quantify quantum states and presents a symbolization method for quantum variables. Utilizing this framework, we generate path constraints for each concrete execution path of a quantum program. These constraints guide the exploration of new p… ▽ More This paper presents the first concolic testing framework specifically designed for quantum programs. The framework defines quantum conditional statements that quantify quantum states and presents a symbolization method for quantum variables. Utilizing this framework, we generate path constraints for each concrete execution path of a quantum program. These constraints guide the exploration of new paths, with a quantum constraint solver determining the outcomes to generate novel input samples and enhance branch coverage. We implemented this framework in Python and integrated it with Qiskit for practical evaluation. Experimental results demonstrate that our concolic testing framework significantly improves branch coverage and the quality of quantum input samples, demonstrating its effectiveness and efficiency in quantum software testing. △ Less

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2405.03969 [pdf, other]

Speak the Same Language: Global LiDAR Registration on BIM Using Pose Hough Transform

Authors: Zhijian Qiao, Haoming Huang, Chuhao Liu, Shaojie Shen, Fumin Zhang, Huan Yin

Abstract: The construction and robotic sensing data originate from disparate sources and are associated with distinct frames of reference. The primary objective of this study is to align LiDAR point clouds with building information modeling (BIM) using a global point cloud registration approach, aimed at establishing a shared understanding between the two modalities, i.e., ``speak the same language''. To ac… ▽ More The construction and robotic sensing data originate from disparate sources and are associated with distinct frames of reference. The primary objective of this study is to align LiDAR point clouds with building information modeling (BIM) using a global point cloud registration approach, aimed at establishing a shared understanding between the two modalities, i.e., ``speak the same language''. To achieve this, we design a cross-modality registration method, spanning from front end the back end. At the front end, we extract descriptors by identifying walls and capturing the intersected corners. Subsequently, for the back-end pose estimation, we employ the Hough transform for pose estimation and estimate multiple pose candidates. The final pose is verified by wall-pixel correlation. To evaluate the effectiveness of our method, we conducted real-world multi-session experiments in a large-scale university building, involving two different types of LiDAR sensors. We also report our findings and plan to make our collected dataset open-sourced. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: 12 pages, 10 figures

arXiv:2405.03565 [pdf, other]

Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing

Authors: Han Liu, Siyang Zhao, Xiaotong Zhang, Feng Zhang, Wei Wang, Fenglong Ma, Hongyang Chen, Hong Yu, Xianchao Zhang

Abstract: Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance via transferring knowledge from seen classes to unseen classes, they are still limited by (1) Inherent dissimilarities among classes make the transformation of features learned from seen classes t… ▽ More Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance via transferring knowledge from seen classes to unseen classes, they are still limited by (1) Inherent dissimilarities among classes make the transformation of features learned from seen classes to unseen classes both difficult and inefficient. (2) Rare labeled novel samples usually cannot provide enough supervision signals to enable the model to adjust from the source distribution to the target distribution, especially for complicated scenarios. To alleviate the above issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, thereby enabling it to predict unseen categories without the necessity of training on seen classes. Specifically, for mining more related unseen category knowledge, we utilize a large pre-trained language model to generate pseudo novel samples, and select the most representative ones as category anchors. After that, we convert the multi-class classification task into a binary classification task and use the similarities of query-anchor pairs for prediction to fully leverage the limited supervision signals. Extensive experiments on six widely used public datasets show that our proposed method can outperform other strong baselines significantly in few-shot and zero-shot tasks, even without using any seen class samples. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: Accepted to AAAI 2024

arXiv:2405.01329 [pdf, other]

Decentralization of Ethereum's Builder Market

Authors: Sen Yang, Kartik Nayak, Fan Zhang

Abstract: Blockchains protect an ecosystem worth more than $500bn with their strong security properties derived from the principle of decentralization. Is today's blockchain really decentralized? In this paper, we empirically studied one of the least decentralized parts of Ethereum -- the most used blockchain system in practice -- and shed light on the decentralization issue from a new perspective. To avo… ▽ More Blockchains protect an ecosystem worth more than $500bn with their strong security properties derived from the principle of decentralization. Is today's blockchain really decentralized? In this paper, we empirically studied one of the least decentralized parts of Ethereum -- the most used blockchain system in practice -- and shed light on the decentralization issue from a new perspective. To avoid centralization caused by Maximal Extractable Value (MEV), Ethereum adopts a novel mechanism that produces blocks through a builder market. After two years in operation, however, the builder market has evolved to a highly centralized one with three builders producing more than 90% of blocks. Why does the builder market centralize, given that it is permissionless and anyone can join? Moreover, what are the security implications of a centralized builder market to MEV-Boost auctions? Through a rigorous empirical study of the builder market's core mechanism, MEV-Boost auctions, we answered these two questions using a large-scale auction dataset we curated since 2022. Unlike previous works that focus on who wins the auctions, we focus on why they win, to shed light on the {openness, competitiveness, and efficiency} of MEV-Boost auctions. Our findings also help identify directions for improving the decentralization of builder markets. △ Less

Submitted 2 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

arXiv:2405.00269 [pdf, other]

Adaptive Integral Sliding Mode Control for Attitude Tracking of Underwater Robots With Large Range Pitch Variations in Confined Space

Authors: Xiaorui Wang, Zeyu Sha, Feitian Zhang

Abstract: Underwater robots play a crucial role in exploring aquatic environments. The ability to flexibly adjust their attitudes is essential for underwater robots to effectively accomplish tasks in confined space. However, the highly coupled six degrees of freedom dynamics resulting from attitude changes and the complex turbulence within limited spatial areas present significant challenges. To address the… ▽ More Underwater robots play a crucial role in exploring aquatic environments. The ability to flexibly adjust their attitudes is essential for underwater robots to effectively accomplish tasks in confined space. However, the highly coupled six degrees of freedom dynamics resulting from attitude changes and the complex turbulence within limited spatial areas present significant challenges. To address the problem of attitude control of underwater robots, this letter investigates large-range pitch angle tracking during station holding as well as simultaneous roll and yaw angle control to enable versatile attitude adjustments. Based on dynamic modeling, this letter proposes an adaptive integral sliding mode controller (AISMC) that integrates an integral module into traditional sliding mode control (SMC) and adaptively adjusts the switching gain for improved tracking accuracy, reduced chattering, and enhanced robustness. The stability of the closed-loop control system is established through Lyapunov analysis. Extensive experiments and comparison studies are conducted using a commercial remotely operated vehicle (ROV), the results of which demonstrate that AISMC achieves satisfactory performance in attitude tracking control in confined space with unknown disturbances, significantly outperforming both PID and SMC. △ Less

Submitted 30 April, 2024; originally announced May 2024.

arXiv:2404.19357 [pdf, other]

doi 10.1145/3626772.3661369

Interest Clock: Time Perception in Real-Time Streaming Recommendation System

Authors: Yongchun Zhu, Jingwu Chen, Ling Chen, Yitan Li, Feng Zhang, Zuotao Liu

Abstract: User preferences follow a dynamic pattern over a day, e.g., at 8 am, a user might prefer to read news, while at 8 pm, they might prefer to watch movies. Time modeling aims to enable recommendation systems to perceive time changes to capture users' dynamic preferences over time, which is an important and challenging problem in recommendation systems. Especially, streaming recommendation systems in… ▽ More User preferences follow a dynamic pattern over a day, e.g., at 8 am, a user might prefer to read news, while at 8 pm, they might prefer to watch movies. Time modeling aims to enable recommendation systems to perceive time changes to capture users' dynamic preferences over time, which is an important and challenging problem in recommendation systems. Especially, streaming recommendation systems in the industry, with only available samples of the current moment, present greater challenges for time modeling. There is still a lack of effective time modeling methods for streaming recommendation systems. In this paper, we propose an effective and universal method Interest Clock to perceive time information in recommendation systems. Interest Clock first encodes users' time-aware preferences into a clock (hour-level personalized features) and then uses Gaussian distribution to smooth and aggregate them into the final interest clock embedding according to the current time for the final prediction. By arming base models with Interest Clock, we conduct online A/B tests, obtaining +0.509% and +0.758% improvements on user active days and app duration respectively. Besides, the extended offline experiments show improvements as well. Interest Clock has been deployed on Douyin Music App. △ Less

Submitted 30 April, 2024; originally announced April 2024.

Comments: Accepted by SIGIR 2024

arXiv:2404.18580 [pdf, other]

Data-Driven Dynamics Modeling of Miniature Robotic Blimps Using Neural ODEs With Parameter Auto-Tuning

Authors: Yongjian Zhu, Hao Cheng, Feitian Zhang

Abstract: Miniature robotic blimps, as one type of lighter-than-air aerial vehicles, have attracted increasing attention in the science and engineering community for their enhanced safety, extended endurance, and quieter operation compared to quadrotors. Accurately modeling the dynamics of these robotic blimps poses a significant challenge due to the complex aerodynamics stemming from their large lifting bo… ▽ More Miniature robotic blimps, as one type of lighter-than-air aerial vehicles, have attracted increasing attention in the science and engineering community for their enhanced safety, extended endurance, and quieter operation compared to quadrotors. Accurately modeling the dynamics of these robotic blimps poses a significant challenge due to the complex aerodynamics stemming from their large lifting bodies. Traditional first-principle models have difficulty obtaining accurate aerodynamic parameters and often overlook high-order nonlinearities, thus coming to its limit in modeling the motion dynamics of miniature robotic blimps. To tackle this challenge, this letter proposes the Auto-tuning Blimp-oriented Neural Ordinary Differential Equation method (ABNODE), a data-driven approach that integrates first-principle and neural network modeling. Spiraling motion experiments of robotic blimps are conducted, comparing the ABNODE with first-principle and other data-driven benchmark models, the results of which demonstrate the effectiveness of the proposed method. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: 8 pages, 8 figures

arXiv:2404.18416 [pdf, other]

Capabilities of Gemini Models in Medicine

Authors: Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, Juanma Zambrano Chaves, Szu-Yeu Hu, Mike Schaekermann, Aishwarya Kamath, Yong Cheng, David G. T. Barrett, Cathy Cheung, Basil Mustafa, Anil Palepu, Daniel McDuff, Le Hou, Tomer Golany, Luyang Liu, Jean-baptiste Alayrac, Neil Houlsby , et al. (42 additional authors not shown)

Abstract: Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-G… ▽ More Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain. △ Less

Submitted 1 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.17862 [pdf, other]

Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum

Authors: Tao Meng, Fuchen Zhang, Yuntao Shou, Wei Ai, Nan Yin, Keqin Li

Abstract: Efficiently capturing consistent and complementary semantic features in a multimodal conversation context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model dialogue context semantic dependencies and employ Graph Neural Networks (GNN) to capture multimodal semantic features for emotion recognition. However, these methods are… ▽ More Efficiently capturing consistent and complementary semantic features in a multimodal conversation context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model dialogue context semantic dependencies and employ Graph Neural Networks (GNN) to capture multimodal semantic features for emotion recognition. However, these methods are limited by some inherent characteristics of GNN, such as over-smoothing and low-pass filtering, resulting in the inability to learn long-distance consistency information and complementary information efficiently. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits the problem of multimodal emotion recognition in conversation from the perspective of the graph spectrum. Specifically, we propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework GS-MCC. First, GS-MCC uses a sliding window to construct a multimodal interaction graph to model conversational relationships and uses efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information, respectively. Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementarity and consistent semantic collaboration with high and low-frequency signals, thereby improving the ability of high and low-frequency information to reflect real emotions. Finally, GS-MCC inputs the collaborative high and low-frequency information into the MLP network and softmax function for emotion prediction. Extensive experiments have proven the superiority of the GS-MCC architecture proposed in this paper on two benchmark data sets. △ Less

Submitted 2 May, 2024; v1 submitted 27 April, 2024; originally announced April 2024.

Comments: 10 pages, 4 figures

arXiv:2404.17858 [pdf, other]

Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion

Authors: Yuntao Shou, Tao Meng, Fuchen Zhang, Nan Yin, Keqin Li

Abstract: Multi-modal Emotion Recognition in Conversation (MERC) has received considerable attention in various fields, e.g., human-computer interaction and recommendation systems. Most existing works perform feature disentanglement and fusion to extract emotional contextual information from multi-modal features and emotion classification. After revisiting the characteristic of MERC, we argue that long-rang… ▽ More Multi-modal Emotion Recognition in Conversation (MERC) has received considerable attention in various fields, e.g., human-computer interaction and recommendation systems. Most existing works perform feature disentanglement and fusion to extract emotional contextual information from multi-modal features and emotion classification. After revisiting the characteristic of MERC, we argue that long-range contextual semantic information should be extracted in the feature disentanglement stage and the inter-modal semantic information consistency should be maximized in the feature fusion stage. Inspired by recent State Space Models (SSMs), Mamba can efficiently model long-distance dependencies. Therefore, in this work, we fully consider the above insights to further improve the performance of MERC. Specifically, on the one hand, in the feature disentanglement stage, we propose a Broad Mamba, which does not rely on a self-attention mechanism for sequence modeling, but uses state space models to compress emotional representation, and utilizes broad learning systems to explore the potential data distribution in broad space. Different from previous SSMs, we design a bidirectional SSM convolution to extract global context information. On the other hand, we design a multi-modal fusion strategy based on probability guidance to maximize the consistency of information between modalities. Experimental results show that the proposed method can overcome the computational and memory limitations of Transformer when modeling long-distance contexts, and has great potential to become a next-generation general architecture in MERC. △ Less

Submitted 2 May, 2024; v1 submitted 27 April, 2024; originally announced April 2024.

Comments: 10 pages, 6 figures

arXiv:2404.16220 [pdf, other]

When does a bent concatenation not belong to the completed Maiorana-McFarland class?

Authors: Sadmir Kudin, Enes Pasalic, Alexandr Polujan, Fengrong Zhang

Abstract: Every Boolean bent function $f$ can be written either as a concatenation $f=f_1||f_2$ of two complementary semi-bent functions $f_1,f_2$; or as a concatenation $f=f_1||f_2||f_3||f_4$ of four Boolean functions $f_1,f_2,f_3,f_4$, all of which are simultaneously bent, semi-bent, or 5-valued spectra-functions. In this context, it is essential to ask: When does a bent concatenation $f$ (not) belong to… ▽ More Every Boolean bent function $f$ can be written either as a concatenation $f=f_1||f_2$ of two complementary semi-bent functions $f_1,f_2$; or as a concatenation $f=f_1||f_2||f_3||f_4$ of four Boolean functions $f_1,f_2,f_3,f_4$, all of which are simultaneously bent, semi-bent, or 5-valued spectra-functions. In this context, it is essential to ask: When does a bent concatenation $f$ (not) belong to the completed Maiorana-McFarland class $\mathcal{M}^\#$? In this article, we answer this question completely by providing a full characterization of the structure of $\mathcal{M}$-subspaces for the concatenation of the form $f=f_1||f_2$ and $f=f_1||f_2||f_3||f_4$, which allows us to specify the necessary and sufficient conditions so that $f$ is outside $\mathcal{M}^\#$. Based on these conditions, we propose several explicit design methods of specifying bent functions outside $\mathcal{M}^\#$ in the special case when $f=g||h||g||(h+1)$, where $g$ and $h$ are bent functions. △ Less

Submitted 24 April, 2024; originally announced April 2024.

Comments: This is the authors' version of the camera-ready version to be presented at the 2024 IEEE International Symposium on Information Theory (ISIT 2024)

arXiv:2404.15588 [pdf, other]

Minimal Evidence Group Identification for Claim Verification

Authors: Xiangci Li, Sihao Chen, Rajvi Kapadia, Jessica Ouyang, Fan Zhang

Abstract: Claim verification in real-world settings (e.g. against a large collection of candidate evidences retrieved from the web) typically requires identifying and aggregating a complete set of evidence pieces that collectively provide full support to the claim. The problem becomes particularly challenging when there exists distinct sets of evidence that could be used to verify the claim from different p… ▽ More Claim verification in real-world settings (e.g. against a large collection of candidate evidences retrieved from the web) typically requires identifying and aggregating a complete set of evidence pieces that collectively provide full support to the claim. The problem becomes particularly challenging when there exists distinct sets of evidence that could be used to verify the claim from different perspectives. In this paper, we formally define and study the problem of identifying such minimal evidence groups (MEGs) for claim verification. We show that MEG identification can be reduced from Set Cover problem, based on entailment inference of whether a given evidence group provides full/partial support to a claim. Our proposed approach achieves 18.4% and 34.8% absolute improvements on the WiCE and SciFact datasets over LLM prompting. Finally, we demonstrate the benefits of MEGs in downstream applications such as claim generation. △ Less

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.15041 [pdf, other]

LEAF: Unveiling Two Sides of the Same Coin in Semi-supervised Facial Expression Recognition

Authors: Fan Zhang, Zhi-Qi Cheng, Jian Zhao, Xiaojiang Peng, Xuelong Li

Abstract: Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition (FER) task. However, current state-of-the-art methods primarily focus on one side of the coin, i.e., generating high-quality pseudo-labels, while overlooking the other side: enhancing expression-relevant representations. In this paper, we unveil both sides of the… ▽ More Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition (FER) task. However, current state-of-the-art methods primarily focus on one side of the coin, i.e., generating high-quality pseudo-labels, while overlooking the other side: enhancing expression-relevant representations. In this paper, we unveil both sides of the coin by proposing a unified framework termed hierarchicaL dEcoupling And Fusing (LEAF) to coordinate expression-relevant representations and pseudo-labels for semi-supervised FER. LEAF introduces a hierarchical expression-aware aggregation strategy that operates at three levels: semantic, instance, and category. (1) At the semantic and instance levels, LEAF decouples representations into expression-agnostic and expression-relevant components, and adaptively fuses them using learnable gating weights. (2) At the category level, LEAF assigns ambiguous pseudo-labels by decoupling predictions into positive and negative parts, and employs a consistency loss to ensure agreement between two augmented views of the same image. Extensive experiments on benchmark datasets demonstrate that by unveiling and harmonizing both sides of the coin, LEAF outperforms state-of-the-art semi-supervised FER methods, effectively leveraging both labeled and unlabeled data. Moreover, the proposed expression-aware aggregation strategy can be seamlessly integrated into existing semi-supervised frameworks, leading to significant performance gains. Our code is available at https://anonymous.4open.science/r/LEAF-BC57/. △ Less

Submitted 26 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.11968 [pdf, other]

P-NAL: an Effective and Interpretable Entity Alignment Method

Authors: Chuanhao Xu, Jingwei Cheng, Fu Zhang

Abstract: Entity alignment (EA) aims to find equivalent entities between two Knowledge Graphs. Existing embedding-based EA methods usually encode entities as embeddings, triples as embeddings' constraint and learn to align the embeddings. The structural and side information are usually utilized via embedding propagation, aggregation or interaction. However, the details of the underlying logical inference st… ▽ More Entity alignment (EA) aims to find equivalent entities between two Knowledge Graphs. Existing embedding-based EA methods usually encode entities as embeddings, triples as embeddings' constraint and learn to align the embeddings. The structural and side information are usually utilized via embedding propagation, aggregation or interaction. However, the details of the underlying logical inference steps among the alignment process are usually omitted, resulting in inadequate inference process. In this paper, we introduce P-NAL, an entity alignment method that captures two types of logical inference paths with Non-Axiomatic Logic (NAL). Type 1 is the bridge-like inference path between to-be-aligned entity pairs, consisting of two relation/attribute triples and a similarity sentence between the other two entities. Type 2 links the entity pair by their embeddings. P-NAL iteratively aligns entities and relations by integrating the conclusions of the inference paths. Moreover, our method is logically interpretable and extensible due to the expressiveness of NAL. Our proposed method is suitable for various EA settings. Experimental results show that our method outperforms state-of-the-art methods in terms of Hits@1, achieving 0.98+ on all three datasets of DBP15K with both supervised and unsupervised settings. To our knowledge, we present the first in-depth analysis of entity alignment's basic principles from a unified logical perspective. △ Less

Submitted 18 April, 2024; originally announced April 2024.

Comments: 13 pages, 2 figures

ACM Class: I.2.4

arXiv:2404.11095 [pdf, other]

Inductive-Deductive Strategy Reuse for Multi-Turn Instructional Dialogues

Authors: Jiao Ou, Jiayu Wu, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai

Abstract: Aligning large language models (LLMs) with human expectations requires high-quality instructional dialogues, which can be achieved by raising diverse, in-depth, and insightful instructions that deepen interactions. Existing methods target instructions from real instruction dialogues as a learning goal and fine-tune a user simulator for posing instructions. However, the user simulator struggles to… ▽ More Aligning large language models (LLMs) with human expectations requires high-quality instructional dialogues, which can be achieved by raising diverse, in-depth, and insightful instructions that deepen interactions. Existing methods target instructions from real instruction dialogues as a learning goal and fine-tune a user simulator for posing instructions. However, the user simulator struggles to implicitly model complex dialogue flows and pose high-quality instructions. In this paper, we take inspiration from the cognitive abilities inherent in human learning and propose the explicit modeling of complex dialogue flows through instructional strategy reuse. Specifically, we first induce high-level strategies from various real instruction dialogues. These strategies are applied to new dialogue scenarios deductively, where the instructional strategies facilitate high-quality instructions. Experimental results show that our method can generate diverse, in-depth, and insightful instructions for a given dialogue history. The constructed multi-turn instructional dialogues can outperform competitive baselines on the downstream chat model. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: 27 pages, 3 figures, 12 tables

arXiv:2404.09571 [pdf, other]

MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution

Authors: Yuxuan Jiang, Chen Feng, Fan Zhang, David Bull

Abstract: Knowledge distillation (KD) has emerged as a promising technique in deep learning, typically employed to enhance a compact student network through learning from their high-performance but more complex teacher variant. When applied in the context of image super-resolution, most KD approaches are modified versions of methods developed for other computer vision tasks, which are based on training stra… ▽ More Knowledge distillation (KD) has emerged as a promising technique in deep learning, typically employed to enhance a compact student network through learning from their high-performance but more complex teacher variant. When applied in the context of image super-resolution, most KD approaches are modified versions of methods developed for other computer vision tasks, which are based on training strategies with a single teacher and simple loss functions. In this paper, we propose a novel Multi-Teacher Knowledge Distillation (MTKD) framework specifically for image super-resolution. It exploits the advantages of multiple teachers by combining and enhancing the outputs of these teacher models, which then guides the learning process of the compact student network. To achieve more effective learning performance, we have also developed a new wavelet-based loss function for MTKD, which can better optimize the training process by observing differences in both the spatial and frequency domains. We fully evaluate the effectiveness of the proposed method by comparing it to five commonly used KD methods for image super-resolution based on three popular network architectures. The results show that the proposed MTKD method achieves evident improvements in super-resolution performance, up to 0.46dB (based on PSNR), over state-of-the-art KD approaches across different network structures. The source code of MTKD will be made available here for public evaluation. △ Less

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.08353 [pdf, other]

TDANet: Target-Directed Attention Network For Object-Goal Visual Navigation With Zero-Shot Ability

Authors: Shiwei Lian, Feitian Zhang

Abstract: The generalization of the end-to-end deep reinforcement learning (DRL) for object-goal visual navigation is a long-standing challenge since object classes and placements vary in new test environments. Learning domain-independent visual representation is critical for enabling the trained DRL agent with the ability to generalize to unseen scenes and objects. In this letter, a target-directed attenti… ▽ More The generalization of the end-to-end deep reinforcement learning (DRL) for object-goal visual navigation is a long-standing challenge since object classes and placements vary in new test environments. Learning domain-independent visual representation is critical for enabling the trained DRL agent with the ability to generalize to unseen scenes and objects. In this letter, a target-directed attention network (TDANet) is proposed to learn the end-to-end object-goal visual navigation policy with zero-shot ability. TDANet features a novel target attention (TA) module that learns both the spatial and semantic relationships among objects to help TDANet focus on the most relevant observed objects to the target. With the Siamese architecture (SA) design, TDANet distinguishes the difference between the current and target states and generates the domain-independent visual representation. To evaluate the navigation performance of TDANet, extensive experiments are conducted in the AI2-THOR embodied AI environment. The simulation results demonstrate a strong generalization ability of TDANet to unseen scenes and target objects, with higher navigation success rate (SR) and success weighted by length (SPL) than other state-of-the-art models. △ Less

Submitted 12 April, 2024; originally announced April 2024.

arXiv:2404.08322 [pdf, other]

doi 10.1145/3589334.3645580.

BOND: Bootstrapping From-Scratch Name Disambiguation with Multi-task Promoting

Authors: Yuqing Cheng, Bo Chen, Fanjin Zhang, Jie Tang

Abstract: From-scratch name disambiguation is an essential task for establishing a reliable foundation for academic platforms. It involves partitioning documents authored by identically named individuals into groups representing distinct real-life experts. Canonically, the process is divided into two decoupled tasks: locally estimating the pairwise similarities between documents followed by globally groupin… ▽ More From-scratch name disambiguation is an essential task for establishing a reliable foundation for academic platforms. It involves partitioning documents authored by identically named individuals into groups representing distinct real-life experts. Canonically, the process is divided into two decoupled tasks: locally estimating the pairwise similarities between documents followed by globally grouping these documents into appropriate clusters. However, such a decoupled approach often inhibits optimal information exchange between these intertwined tasks. Therefore, we present BOND, which bootstraps the local and global informative signals to promote each other in an end-to-end regime. Specifically, BOND harnesses local pairwise similarities to drive global clustering, subsequently generating pseudo-clustering labels. These global signals further refine local pairwise characterizations. The experimental results establish BOND's superiority, outperforming other advanced baselines by a substantial margin. Moreover, an enhanced version, BOND+, incorporating ensemble and post-match techniques, rivals the top methods in the WhoIsWho competition. △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: TheWebConf 2024 (WWW '24)

ACM Class: H.3.7; H.3.3

Journal ref: Proceedings of TheWebConf 2024 (WWW '24), May 13--17, 2024, Singapore

arXiv:2404.06495 [pdf, other]

Mechanism Design for ZK-Rollup Prover Markets

Authors: Wenhao Wang, Lulu Zhou, Aviv Yaish, Fan Zhang, Ben Fisch, Benjamin Livshits

Abstract: In ZK-Rollups, provers spend significant computational resources to generate validity proofs. Their costs should be compensated properly, so a sustainable prover market can form over time. Existing transaction fee mechanisms (TFMs) such as EIP-1559, however, do not work in this setting, as EIP-1559 only generates negligible revenue because of burning, while provers often create or purchase special… ▽ More In ZK-Rollups, provers spend significant computational resources to generate validity proofs. Their costs should be compensated properly, so a sustainable prover market can form over time. Existing transaction fee mechanisms (TFMs) such as EIP-1559, however, do not work in this setting, as EIP-1559 only generates negligible revenue because of burning, while provers often create or purchase specialized hardware in hopes of creating long-term revenue from proving, somewhat reminiscent of proof-of-work miners in the case of chains like Bitcoin. In this paper, we explore the design of transaction fee mechanisms for prover markets. The desiderata for such mechanisms include efficiency (social welfare is maximized), incentive compatibility (it is rational to bid honestly), collusion resistance (no profitable collusion among provers exists), and off-chain agreement proofness (no profitable collusion between users and provers exists). To demonstrate the difficulties of our new setting, we put forward several simple strawman mechanisms, and show they suffer from notable deficiencies. △ Less

Submitted 26 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.05220 [pdf, other]

StylizedGS: Controllable Stylization for 3D Gaussian Splatting

Authors: Dingxi Zhang, Zhuoxun Chen, Yu-Jie Yuan, Fang-Lue Zhang, Zhenliang He, Shiguang Shan, Lin Gao

Abstract: With the rapid development of XR, 3D generation and editing are becoming more and more important, among which, stylization is an important tool of 3D appearance editing. It can achieve consistent 3D artistic stylization given a single reference style image and thus is a user-friendly editing way. However, recent NeRF-based 3D stylization methods face efficiency issues that affect the actual user e… ▽ More With the rapid development of XR, 3D generation and editing are becoming more and more important, among which, stylization is an important tool of 3D appearance editing. It can achieve consistent 3D artistic stylization given a single reference style image and thus is a user-friendly editing way. However, recent NeRF-based 3D stylization methods face efficiency issues that affect the actual user experience and the implicit nature limits its ability to transfer the geometric pattern styles. Additionally, the ability for artists to exert flexible control over stylized scenes is considered highly desirable, fostering an environment conducive to creative exploration. In this paper, we introduce StylizedGS, a 3D neural style transfer framework with adaptable control over perceptual factors based on 3D Gaussian Splatting (3DGS) representation. The 3DGS brings the benefits of high efficiency. We propose a GS filter to eliminate floaters in the reconstruction which affects the stylization effects before stylization. Then the nearest neighbor-based style loss is introduced to achieve stylization by fine-tuning the geometry and color parameters of 3DGS, while a depth preservation loss with other regularizations is proposed to prevent the tampering of geometry content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylized scale and regions during the stylization to possess customized capabilities. Our method can attain high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method concerning both stylization quality and inference FPS. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.01579 [pdf]

Diffusion Deepfake

Authors: Chaitali Bhattacharyya, Hanxiao Wang, Feng Zhang, Sungho Kim, Xiatian Zhu

Abstract: Recent progress in generative AI, primarily through diffusion models, presents significant challenges for real-world deepfake detection. The increased realism in image details, diverse content, and widespread accessibility to the general public complicates the identification of these sophisticated deepfakes. Acknowledging the urgency to address the vulnerability of current deepfake detectors to th… ▽ More Recent progress in generative AI, primarily through diffusion models, presents significant challenges for real-world deepfake detection. The increased realism in image details, diverse content, and widespread accessibility to the general public complicates the identification of these sophisticated deepfakes. Acknowledging the urgency to address the vulnerability of current deepfake detectors to this evolving threat, our paper introduces two extensive deepfake datasets generated by state-of-the-art diffusion models as other datasets are less diverse and low in quality. Our extensive experiments also showed that our dataset is more challenging compared to the other face deepfake datasets. Our strategic dataset creation not only challenge the deepfake detectors but also sets a new benchmark for more evaluation. Our comprehensive evaluation reveals the struggle of existing detection methods, often optimized for specific image domains and manipulations, to effectively adapt to the intricate nature of diffusion deepfakes, limiting their practical utility. To address this critical issue, we investigate the impact of enhancing training data diversity on representative detection methods. This involves expanding the diversity of both manipulation techniques and image domains. Our findings underscore that increasing training data diversity results in improved generalizability. Moreover, we propose a novel momentum difficulty boosting strategy to tackle the additional challenge posed by training data heterogeneity. This strategy dynamically assigns appropriate sample weights based on learning difficulty, enhancing the model's adaptability to both easy and challenging samples. Extensive experiments on both existing and newly proposed benchmarks demonstrate that our model optimization approach surpasses prior alternatives significantly. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: 28 pages including Supplementary material

arXiv:2403.19001 [pdf, other]

Cross-domain Fiber Cluster Shape Analysis for Language Performance Cognitive Score Prediction

Authors: Yui Lo, Yuqian Chen, Dongnan Liu, Wan Liu, Leo Zekelman, Fan Zhang, Yogesh Rathi, Nikos Makris, Alexandra J. Golby, Weidong Cai, Lauren J. O'Donnell

Abstract: Shape plays an important role in computer graphics, offering informative features to convey an object's morphology and functionality. Shape analysis in brain imaging can help interpret structural and functionality correlations of the human brain. In this work, we investigate the shape of the brain's 3D white matter connections and its potential predictive relationship to human cognitive function.… ▽ More Shape plays an important role in computer graphics, offering informative features to convey an object's morphology and functionality. Shape analysis in brain imaging can help interpret structural and functionality correlations of the human brain. In this work, we investigate the shape of the brain's 3D white matter connections and its potential predictive relationship to human cognitive function. We reconstruct brain connections as sequences of 3D points using diffusion magnetic resonance imaging (dMRI) tractography. To describe each connection, we extract 12 shape descriptors in addition to traditional dMRI connectivity and tissue microstructure features. We introduce a novel framework, Shape--fused Fiber Cluster Transformer (SFFormer), that leverages a multi-head cross-attention feature fusion module to predict subject-specific language performance based on dMRI tractography. We assess the performance of the method on a large dataset including 1065 healthy young adults. The results demonstrate that both the transformer-based SFFormer model and its inter/intra feature fusion with shape, microstructure, and connectivity are informative, and together, they improve the prediction of subject-specific language performance scores. Overall, our results indicate that the shape of the brain's connections is predictive of human language function. △ Less

Submitted 29 March, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

Comments: 2 figures, 11 pages

arXiv:2403.16405 [pdf, other]

Ensemble Adversarial Defense via Integration of Multiple Dispersed Low Curvature Models

Authors: Kaikang Zhao, Xi Chen, Wei Huang, Liuxin Ding, Xianglong Kong, Fan Zhang

Abstract: The integration of an ensemble of deep learning models has been extensively explored to enhance defense against adversarial attacks. The diversity among sub-models increases the attack cost required to deceive the majority of the ensemble, thereby improving the adversarial robustness. While existing approaches mainly center on increasing diversity in feature representations or dispersion of first-… ▽ More The integration of an ensemble of deep learning models has been extensively explored to enhance defense against adversarial attacks. The diversity among sub-models increases the attack cost required to deceive the majority of the ensemble, thereby improving the adversarial robustness. While existing approaches mainly center on increasing diversity in feature representations or dispersion of first-order gradients with respect to input, the limited correlation between these diversity metrics and adversarial robustness constrains the performance of ensemble adversarial defense. In this work, we aim to enhance ensemble diversity by reducing attack transferability. We identify second-order gradients, which depict the loss curvature, as a key factor in adversarial robustness. Computing the Hessian matrix involved in second-order gradients is computationally expensive. To address this, we approximate the Hessian-vector product using differential approximation. Given that low curvature provides better robustness, our ensemble model was designed to consider the influence of curvature among different sub-models. We introduce a novel regularizer to train multiple more-diverse low-curvature network models. Extensive experiments across various datasets demonstrate that our ensemble model exhibits superior robustness against a range of attacks, underscoring the effectiveness of our approach. △ Less

Submitted 24 March, 2024; originally announced March 2024.

Comments: Accepted to The 2024 International Joint Conference on Neural Networks (IJCNN)

arXiv:2403.15369 [pdf, other]

OceanPlan: Hierarchical Planning and Replanning for Natural Language AUV Piloting in Large-scale Unexplored Ocean Environments

Authors: Ruochu Yang, Fumin Zhang, Mengxue Hou

Abstract: We develop a hierarchical LLM-task-motion planning and replanning framework to efficiently ground an abstracted human command into tangible Autonomous Underwater Vehicle (AUV) control through enhanced representations of the world. We also incorporate a holistic replanner to provide real-world feedback with all planners for robust AUV operation. While there has been extensive research in bridging t… ▽ More We develop a hierarchical LLM-task-motion planning and replanning framework to efficiently ground an abstracted human command into tangible Autonomous Underwater Vehicle (AUV) control through enhanced representations of the world. We also incorporate a holistic replanner to provide real-world feedback with all planners for robust AUV operation. While there has been extensive research in bridging the gap between LLMs and robotic missions, they are unable to guarantee success of AUV applications in the vast and unknown ocean environment. To tackle specific challenges in marine robotics, we design a hierarchical planner to compose executable motion plans, which achieves planning efficiency and solution quality by decomposing long-horizon missions into sub-tasks. At the same time, real-time data stream is obtained by a replanner to address environmental uncertainties during plan execution. Experiments validate that our proposed framework delivers successful AUV performance of long-duration missions through natural language piloting. △ Less