-
Direct Training High-Performance Deep Spiking Neural Networks: A Review of Theories and Methods
Authors:
Chenlin Zhou,
Han Zhang,
Liutao Yu,
Yumin Ye,
Zhaokun Zhou,
Liwei Huang,
Zhengyu Ma,
Xiaopeng Fan,
Huihui Zhou,
Yonghong Tian
Abstract:
Spiking neural networks (SNNs) offer a promising energy-efficient alternative to artificial neural networks (ANNs), in virtue of their high biological plausibility, rich spatial-temporal dynamics, and event-driven computation. The direct training algorithms based on the surrogate gradient method provide sufficient flexibility to design novel SNN architectures and explore the spatial-temporal dynam…
▽ More
Spiking neural networks (SNNs) offer a promising energy-efficient alternative to artificial neural networks (ANNs), in virtue of their high biological plausibility, rich spatial-temporal dynamics, and event-driven computation. The direct training algorithms based on the surrogate gradient method provide sufficient flexibility to design novel SNN architectures and explore the spatial-temporal dynamics of SNNs. According to previous studies, the performance of models is highly dependent on their sizes. Recently, direct training deep SNNs have achieved great progress on both neuromorphic datasets and large-scale static datasets. Notably, transformer-based SNNs show comparable performance with their ANN counterparts. In this paper, we provide a new perspective to summarize the theories and methods for training deep SNNs with high performance in a systematic and comprehensive way, including theory fundamentals, spiking neuron models, advanced SNN models and residual architectures, software frameworks and neuromorphic hardware, applications, and future trends. The reviewed papers are collected at https://github.com/zhouchenlin2096/Awesome-Spiking-Neural-Networks
△ Less
Submitted 6 May, 2024;
originally announced May 2024.
-
SSI4IoT: Unlocking the Potential of IoT Tailored Self-Sovereign Identity
Authors:
Thusitha Dayaratne,
Xinxin Fan,
Yuhong Liu,
Carsten Rudolph
Abstract:
The emerging Self-Sovereign Identity (SSI) techniques, such as Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs), move control of digital identity from conventional identity providers to individuals and lay down the foundation for people, organizations, and things establishing rich digital relationship. The existing applications of SSI mainly focus on creating person-to-person and…
▽ More
The emerging Self-Sovereign Identity (SSI) techniques, such as Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs), move control of digital identity from conventional identity providers to individuals and lay down the foundation for people, organizations, and things establishing rich digital relationship. The existing applications of SSI mainly focus on creating person-to-person and person-to-service relationships, whereas person-to-device and device-to-device interactions have been largely overlooked. In this paper, we close this gap by identifying a number of key challenges of applying SSI to the Internet of Things (IoT) and providing a comprehensive taxonomy and usage of VCs in the IoT context with respect to their validity period, trust and interoperability level, and scope of usage. The life-cycle management of VCs as well as various optimization techniques for realizing SSI in IoT environments are also addressed in great detail. This work is a noteworthy step towards massive adoption of SSI for securing existing and future IoT applications in practice.
△ Less
Submitted 3 May, 2024;
originally announced May 2024.
-
Optimized Distribution of Entanglement Graph States in Quantum Networks
Authors:
Xiaojie Fan,
Caitao Zhan,
Himanshu Gupta,
C. R. Ramakrishnan
Abstract:
Building large-scale quantum computers, essential to demonstrating quantum advantage, is a key challenge. Quantum Networks (QNs) can help address this challenge by enabling the construction of large, robust, and more capable quantum computing platforms by connecting smaller quantum computers. Moreover, unlike classical systems, QNs can enable fully secured long-distance communication. Thus, quantu…
▽ More
Building large-scale quantum computers, essential to demonstrating quantum advantage, is a key challenge. Quantum Networks (QNs) can help address this challenge by enabling the construction of large, robust, and more capable quantum computing platforms by connecting smaller quantum computers. Moreover, unlike classical systems, QNs can enable fully secured long-distance communication. Thus, quantum networks lie at the heart of the success of future quantum information technologies. In quantum networks, multipartite entangled states distributed over the network help implement and support many quantum network applications for communications, sensing, and computing. Our work focuses on developing optimal techniques to generate and distribute multipartite entanglement states efficiently. Prior works on generating general multipartite entanglement states have focused on the objective of minimizing the number of maximally entangled pairs (EPs) while ignoring the heterogeneity of the network nodes and links as well as the stochastic nature of underlying processes. In this work, we develop a hypergraph based linear programming framework that delivers optimal (under certain assumptions) generation schemes for general multipartite entanglement represented by graph states, under the network resources, decoherence, and fidelity constraints, while considering the stochasticity of the underlying processes. We illustrate our technique by developing generation schemes for the special cases of path and tree graph states, and discuss optimized generation schemes for more general classes of graph states. Using extensive simulations over a quantum network simulator (NetSquid), we demonstrate the effectiveness of our developed techniques and show that they outperform prior known schemes by up to orders of magnitude.
△ Less
Submitted 30 April, 2024;
originally announced May 2024.
-
Multimodal Information Interaction for Medical Image Segmentation
Authors:
Xinxin Fan,
Lin Liu,
Haoran Zhang
Abstract:
The use of multimodal data in assisted diagnosis and segmentation has emerged as a prominent area of interest in current research. However, one of the primary challenges is how to effectively fuse multimodal features. Most of the current approaches focus on the integration of multimodal features while ignoring the correlation and consistency between different modal features, leading to the inclusi…
▽ More
The use of multimodal data in assisted diagnosis and segmentation has emerged as a prominent area of interest in current research. However, one of the primary challenges is how to effectively fuse multimodal features. Most of the current approaches focus on the integration of multimodal features while ignoring the correlation and consistency between different modal features, leading to the inclusion of potentially irrelevant information. To address this issue, we introduce an innovative Multimodal Information Cross Transformer (MicFormer), which employs a dual-stream architecture to simultaneously extract features from each modality. Leveraging the Cross Transformer, it queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features. Additionally, we incorporate a deformable Transformer architecture to expand the search space. We conducted experiments on the MM-WHS dataset, and in the CT-MRI multimodal image segmentation task, we successfully improved the whole-heart segmentation DICE score to 85.57 and MIoU to 75.51. Compared to other multimodal segmentation techniques, our method outperforms by margins of 2.83 and 4.23, respectively. This demonstrates the efficacy of MicFormer in integrating relevant information between different modalities in multimodal tasks. These findings hold significant implications for multimodal image tasks, and we believe that MicFormer possesses extensive potential for broader applications across various domains. Access to our method is available at https://github.com/fxxJuses/MICFormer
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
True random number generation using metastable 1T' molybdenum ditelluride
Authors:
Yang Liu,
Pengyu Liu,
Yingyi Wen,
Zihan Liang,
Songwei Liu,
Lekai Song,
Jingfang Pei,
Xiaoyue Fan,
Teng Ma,
Gang Wang,
Shuo Gao,
Kong-Pang Pun,
Xiaolong Chen,
Guohua Hu
Abstract:
True random numbers play a critical role in secure cryptography. The generation relies on a stable and readily extractable entropy source. Here, from solution-processed structurally metastable 1T' MoTe2, we prove stable output of featureless, stochastic, and yet stable conductance noise at a broad temperature (down to 15 K) with minimal power consumption (down to 0.05 micro-W). Our characterizatio…
▽ More
True random numbers play a critical role in secure cryptography. The generation relies on a stable and readily extractable entropy source. Here, from solution-processed structurally metastable 1T' MoTe2, we prove stable output of featureless, stochastic, and yet stable conductance noise at a broad temperature (down to 15 K) with minimal power consumption (down to 0.05 micro-W). Our characterizations and statistical analysis of the characteristics of the conductance noise suggest that the noise arises from the volatility of the stochastic polarization of the underlying ferroelectric dipoles in the 1T' MoTe2. Further, as proved in our experiments and indicated by our Monte Carlo simulation, the ferroelectric dipole polarization is a reliable entropy source with the stochastic polarization persistent and stable over time. Exploiting the conductance noise, we achieve the generation of true random numbers and demonstrate their use in common cryptographic applications, for example, password generation and data encryption. Besides, particularly, we show a privacy safeguarding approach to sensitive data that can be critical for the cryptography of neural networks. We believe our work will bring insights into the understanding of the metastable 1T' MoTe2 and, more importantly, underpin its great potential in secure cryptography.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
FedSI: Federated Subnetwork Inference for Efficient Uncertainty Quantification
Authors:
Hui Chen,
Hengyu Liu,
Zhangkai Wu,
Xuhui Fan,
Longbing Cao
Abstract:
While deep neural networks (DNNs) based personalized federated learning (PFL) is demanding for addressing data heterogeneity and shows promising performance, existing methods for federated learning (FL) suffer from efficient systematic uncertainty quantification. The Bayesian DNNs-based PFL is usually questioned of either over-simplified model structures or high computational and memory costs. In…
▽ More
While deep neural networks (DNNs) based personalized federated learning (PFL) is demanding for addressing data heterogeneity and shows promising performance, existing methods for federated learning (FL) suffer from efficient systematic uncertainty quantification. The Bayesian DNNs-based PFL is usually questioned of either over-simplified model structures or high computational and memory costs. In this paper, we introduce FedSI, a novel Bayesian DNNs-based subnetwork inference PFL framework. FedSI is simple and scalable by leveraging Bayesian methods to incorporate systematic uncertainties effectively. It implements a client-specific subnetwork inference mechanism, selects network parameters with large variance to be inferred through posterior distributions, and fixes the rest as deterministic ones. FedSI achieves fast and scalable inference while preserving the systematic uncertainties to the fullest extent. Extensive experiments on three different benchmark datasets demonstrate that FedSI outperforms existing Bayesian and non-Bayesian FL baselines in heterogeneous FL scenarios.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
On Modeling Multi-Criteria Decision Making with Uncertain Information using Probabilistic Rules
Authors:
Shengxin Hong,
Xiuyi Fan
Abstract:
Decision-making processes often involve dealing with uncertainty, which is traditionally addressed through probabilistic models. However, in practical scenarios, assessing probabilities reliably can be challenging, compounded by diverse perceptions of probabilistic information among decision makers. To address this variability and accommodate diverse preferences regarding uncertainty, we introduce…
▽ More
Decision-making processes often involve dealing with uncertainty, which is traditionally addressed through probabilistic models. However, in practical scenarios, assessing probabilities reliably can be challenging, compounded by diverse perceptions of probabilistic information among decision makers. To address this variability and accommodate diverse preferences regarding uncertainty, we introduce the Probabilistic Abstract Decision Framework (PADF). PADF offers a structured approach for reasoning across different decision criteria, encompassing the optimistic, pessimistic, and Laplace perspectives, each tailored to distinct perceptions of uncertainty. We illustrate how PADF facilitates the computation of optimal decisions aligned with these criteria by leveraging probabilistic rules. Furthermore, we present strategies for optimizing the computational efficiency of these rules, leveraging appropriate independence assumptions to navigate the extensive search space inherent in PADF. Through these contributions, our framework provides a robust and adaptable tool for effectively navigating the complexities of decision-making under uncertainty.
△ Less
Submitted 20 April, 2024;
originally announced April 2024.
-
FedPFT: Federated Proxy Fine-Tuning of Foundation Models
Authors:
Zhaopeng Peng,
Xiaoliang Fan,
Yufan Chen,
Zheng Wang,
Shirui Pan,
Chenglu Wen,
Ruisheng Zhang,
Cheng Wang
Abstract:
Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges a promising strategy for protecting data privacy and valuable FMs. Existing methods fine-tune FM by allocating sub-FM to clients in FL, however, leading to suboptimal performance due to insufficient tuning and inevitable error accumulations of gradients. In this paper, we propose Federated Proxy Fine-Tuni…
▽ More
Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges a promising strategy for protecting data privacy and valuable FMs. Existing methods fine-tune FM by allocating sub-FM to clients in FL, however, leading to suboptimal performance due to insufficient tuning and inevitable error accumulations of gradients. In this paper, we propose Federated Proxy Fine-Tuning (FedPFT), a novel method enhancing FMs adaptation in downstream tasks through FL by two key modules. First, the sub-FM construction module employs a layer-wise compression approach, facilitating comprehensive FM fine-tuning across all layers by emphasizing those crucial neurons. Second, the sub-FM alignment module conducts a two-step distillations-layer-level and neuron-level-before and during FL fine-tuning respectively, to reduce error of gradient by accurately aligning sub-FM with FM under theoretical guarantees. Experimental results on seven commonly used datasets (i.e., four text and three vision) demonstrate the superiority of FedPFT.
△ Less
Submitted 28 April, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Kilometer-Level Coupled Modeling Using 40 Million Cores: An Eight-Year Journey of Model Development
Authors:
Xiaohui Duan,
Yuxuan Li,
Zhao Liu,
Bin Yang,
Juepeng Zheng,
Haohuan Fu,
Shaoqing Zhang,
Shiming Xu,
Yang Gao,
Wei Xue,
Di Wei,
Xiaojing Lv,
Lifeng Yan,
Haopeng Huang,
Haitian Lu,
Lingfeng Wan,
Haoran Lin,
Qixin Chang,
Chenlin Li,
Quanjie He,
Zeyu Song,
Xuantong Wang,
Yangyang Yu,
Xilong Fan,
Zhaopeng Qu
, et al. (16 additional authors not shown)
Abstract:
With current and future leading systems adopting heterogeneous architectures, adapting existing models for heterogeneous supercomputers is of urgent need for improving model resolution and reducing modeling uncertainty. This paper presents our three-week effort on porting a complex earth system model, CESM 2.2, to a 40-million-core Sunway supercomputer. Taking a non-intrusive approach that tries t…
▽ More
With current and future leading systems adopting heterogeneous architectures, adapting existing models for heterogeneous supercomputers is of urgent need for improving model resolution and reducing modeling uncertainty. This paper presents our three-week effort on porting a complex earth system model, CESM 2.2, to a 40-million-core Sunway supercomputer. Taking a non-intrusive approach that tries to minimizes manual code modifications, our project tries to achieve both improvement of performance and consistency of the model code. By using a hierarchical grid system and an OpenMP-based offloading toolkit, our porting and parallelization effort covers over 80% of the code, and achieves a simulation speed of 340 SDPD (simulated days per day) for 5-km atmosphere, 265 SDPD for 3-km ocean, and 222 SDPD for a coupled model, thus making multi-year or even multi-decadal experiments at such high resolution possible.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding
Authors:
Wenrui Li,
Xiaopeng Hong,
Xiaopeng Fan
Abstract:
Temporal video grounding (TVG) is a critical task in video content understanding. Despite significant advancements, existing methods often limit in capturing the fine-grained relationships between multimodal inputs and the high computational costs with processing long video sequences. To address these limitations, we introduce a novel SpikeMba: multi-modal spiking saliency mamba for temporal video…
▽ More
Temporal video grounding (TVG) is a critical task in video content understanding. Despite significant advancements, existing methods often limit in capturing the fine-grained relationships between multimodal inputs and the high computational costs with processing long video sequences. To address these limitations, we introduce a novel SpikeMba: multi-modal spiking saliency mamba for temporal video grounding. In our work, we integrate the Spiking Neural Networks (SNNs) and state space models (SSMs) to capture the fine-grained relationships of multimodal features effectively. Specifically, we introduce the relevant slots to enhance the model's memory capabilities, enabling a deeper contextual understanding of video sequences. The contextual moment reasoner leverages these slots to maintain a balance between contextual information preservation and semantic relevance exploration. Simultaneously, the spiking saliency detector capitalizes on the unique properties of SNNs to accurately locate salient proposals. Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods across mainstream benchmarks.
△ Less
Submitted 1 April, 2024;
originally announced April 2024.
-
CAESAR: Enhancing Federated RL in Heterogeneous MDPs through Convergence-Aware Sampling with Screening
Authors:
Hei Yi Mak,
Flint Xiaofeng Fan,
Luca A. Lanzendörfer,
Cheston Tan,
Wei Tsang Ooi,
Roger Wattenhofer
Abstract:
In this study, we delve into Federated Reinforcement Learning (FedRL) in the context of value-based agents operating across diverse Markov Decision Processes (MDPs). Existing FedRL methods typically aggregate agents' learning by averaging the value functions across them to improve their performance. However, this aggregation strategy is suboptimal in heterogeneous environments where agents converg…
▽ More
In this study, we delve into Federated Reinforcement Learning (FedRL) in the context of value-based agents operating across diverse Markov Decision Processes (MDPs). Existing FedRL methods typically aggregate agents' learning by averaging the value functions across them to improve their performance. However, this aggregation strategy is suboptimal in heterogeneous environments where agents converge to diverse optimal value functions. To address this problem, we introduce the Convergence-AwarE SAmpling with scReening (CAESAR) aggregation scheme designed to enhance the learning of individual agents across varied MDPs. CAESAR is an aggregation strategy used by the server that combines convergence-aware sampling with a screening mechanism. By exploiting the fact that agents learning in identical MDPs are converging to the same optimal value function, CAESAR enables the selective assimilation of knowledge from more proficient counterparts, thereby significantly enhancing the overall learning efficiency. We empirically validate our hypothesis and demonstrate the effectiveness of CAESAR in enhancing the learning efficiency of agents, using both a custom-built GridWorld environment and the classical FrozenLake-v1 task, each presenting varying levels of environmental heterogeneity.
△ Less
Submitted 16 April, 2024; v1 submitted 29 March, 2024;
originally announced March 2024.
-
Analysis on reservoir activation with the nonlinearity harnessed from solution-processed MoS2 devices
Authors:
Songwei Liu,
Yang Liu,
Yingyi Wen,
Jingfang Pei,
Pengyu Liu,
Lekai Song,
Xiaoyue Fan,
Wenchen Yang,
Danmei Pan,
Teng Ma,
Yue Lin,
Gang Wang,
Guohua Hu
Abstract:
Reservoir computing is a recurrent neural network that has been applied across various domains in machine learning. The implementation of reservoir computing, however, often demands heavy computations for activating the reservoir. Configuring physical reservoir networks and harnessing the nonlinearity from the underlying devices for activation is an emergent solution to address the computational c…
▽ More
Reservoir computing is a recurrent neural network that has been applied across various domains in machine learning. The implementation of reservoir computing, however, often demands heavy computations for activating the reservoir. Configuring physical reservoir networks and harnessing the nonlinearity from the underlying devices for activation is an emergent solution to address the computational challenge. Herein, we analyze the feasibility of employing the nonlinearity from solution-processed molybdenum disulfide (MoS2) devices for reservoir activation. The devices, fabricated using liquid-phase exfoliated MoS2, exhibit a high-order nonlinearity achieved by Stark modulation of the MoS2 material. We demonstrate that this nonlinearity can be fitted and employed as the activation function to facilitate reservoir computing implementation. Notably, owing to the high-order nonlinearity, the network exhibits long-term synchronization and robust generalization abilities for approximating complex dynamical systems. Given the remarkable reservoir activation capability, coupled with the scalability of the device fabrication, our findings open the possibility for the physical realization of lightweight, efficient reservoir computing for, for instance, signal classification, motion tracking, and pattern recognition of complex time series as well as secure cryptography. As an example, we show the network can be appointed to generate chaotic random numbers for secure data encryption.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
QKFormer: Hierarchical Spiking Transformer using Q-K Attention
Authors:
Chenlin Zhou,
Han Zhang,
Zhaokun Zhou,
Liutao Yu,
Liwei Huang,
Xiaopeng Fan,
Li Yuan,
Zhengyu Ma,
Huihui Zhou,
Yonghong Tian
Abstract:
Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mecha…
▽ More
Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mechanism, tailored for SNNs, which efficiently models the importance of token or channel dimensions through binary vectors with linear complexity. ii) We incorporate the hierarchical structure, which significantly benefits the performance of both the brain and artificial neural networks, into spiking transformers to obtain multi-scale spiking representation. iii) We design a versatile and powerful patch embedding module with a deformed shortcut specifically for spiking transformers. Together, we develop QKFormer, a hierarchical spiking transformer based on Q-K attention with direct training. QKFormer shows significantly superior performance over existing state-of-the-art SNN models on various mainstream datasets. Notably, with comparable size to Spikformer (66.34 M, 74.81%), QKFormer (64.96 M) achieves a groundbreaking top-1 accuracy of 85.65% on ImageNet-1k, substantially outperforming Spikformer by 10.84%. To our best knowledge, this is the first time that directly training SNNs have exceeded 85% accuracy on ImageNet-1K. The code and models are publicly available at https://github.com/zhouchenlin2096/QKFormer
△ Less
Submitted 25 March, 2024;
originally announced March 2024.
-
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
Authors:
Xiang Fan,
Anand Bhattad,
Ranjay Krishna
Abstract:
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through…
▽ More
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.
△ Less
Submitted 22 March, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
CSCNET: Class-Specified Cascaded Network for Compositional Zero-Shot Learning
Authors:
Yanyi Zhang,
Qi Jia,
Xin Fan,
Yu Liu,
Ran He
Abstract:
Attribute and object (A-O) disentanglement is a fundamental and critical problem for Compositional Zero-shot Learning (CZSL), whose aim is to recognize novel A-O compositions based on foregone knowledge. Existing methods based on disentangled representation learning lose sight of the contextual dependency between the A-O primitive pairs. Inspired by this, we propose a novel A-O disentangled framew…
▽ More
Attribute and object (A-O) disentanglement is a fundamental and critical problem for Compositional Zero-shot Learning (CZSL), whose aim is to recognize novel A-O compositions based on foregone knowledge. Existing methods based on disentangled representation learning lose sight of the contextual dependency between the A-O primitive pairs. Inspired by this, we propose a novel A-O disentangled framework for CZSL, namely Class-specified Cascaded Network (CSCNet). The key insight is to firstly classify one primitive and then specifies the predicted class as a priori for guiding another primitive recognition in a cascaded fashion. To this end, CSCNet constructs Attribute-to-Object and Object-to-Attribute cascaded branches, in addition to a composition branch modeling the two primitives as a whole. Notably, we devise a parametric classifier (ParamCls) to improve the matching between visual and semantic embeddings. By improving the A-O disentanglement, our framework achieves superior results than previous competitive methods.
△ Less
Submitted 13 March, 2024; v1 submitted 9 March, 2024;
originally announced March 2024.
-
MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process
Authors:
Xinyao Fan,
Yueying Wu,
Chang Xu,
Yuhao Huang,
Weiqing Liu,
Jiang Bian
Abstract:
Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature.…
▽ More
Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature. To address this challenge, we introduce a novel Multi-Granularity Time Series Diffusion (MG-TSD) model, which achieves state-of-the-art predictive performance by leveraging the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. The way to construct the targets is motivated by the observation that the forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarse-grained representation, both of which result in a gradual loss of fine distribution features. In the study, we derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels. More importantly, our approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate that our MG-TSD model outperforms existing time series prediction methods.
△ Less
Submitted 15 March, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
mmPlace: Robust Place Recognition with Intermediate Frequency Signal of Low-cost Single-chip Millimeter Wave Radar
Authors:
Chengzhen Meng,
Yifan Duan,
Chenming He,
Dequan Wang,
Xiaoran Fan,
Yanyong Zhang
Abstract:
Place recognition is crucial for tasks like loop-closure detection and re-localization. Single-chip millimeter wave radar (single-chip radar in short) emerges as a low-cost sensor option for place recognition, with the advantage of insensitivity to degraded visual environments. However, it encounters two challenges. Firstly, sparse point cloud from single-chip radar leads to poor performance when…
▽ More
Place recognition is crucial for tasks like loop-closure detection and re-localization. Single-chip millimeter wave radar (single-chip radar in short) emerges as a low-cost sensor option for place recognition, with the advantage of insensitivity to degraded visual environments. However, it encounters two challenges. Firstly, sparse point cloud from single-chip radar leads to poor performance when using current place recognition methods, which assume much denser data. Secondly, its performance significantly declines in scenarios involving rotational and lateral variations, due to limited overlap in its field of view (FOV). We propose mmPlace, a robust place recognition system to address these challenges. Specifically, mmPlace transforms intermediate frequency (IF) signal into range azimuth heatmap and employs a spatial encoder to extract features. Additionally, to improve the performance in scenarios involving rotational and lateral variations, mmPlace employs a rotating platform and concatenates heatmaps in a rotation cycle, effectively expanding the system's FOV. We evaluate mmPlace's performance on the milliSonic dataset, which is collected on the University of Science and Technology of China (USTC) campus, the city roads surrounding the campus, and an underground parking garage. The results demonstrate that mmPlace outperforms point cloud-based methods and achieves 87.37% recall@1 in scenarios involving rotational and lateral variations.
△ Less
Submitted 7 March, 2024;
originally announced March 2024.
-
DV-Hop localization based on Distance Estimation using Multinode and Hop Loss in WSNs
Authors:
Penghong Wang,
Xingtao Wang,
Wenrui Li,
Xiaopeng Fan,
Debin Zhao
Abstract:
Location awareness is a critical issue in wireless sensor network applications. For more accurate location estimation, the two issues should be considered extensively: 1) how to sufficiently utilize the connection information between multiple nodes and 2) how to select a suitable solution from multiple solutions obtained by the Euclidean distance loss. In this paper, a DV-Hop localization based on…
▽ More
Location awareness is a critical issue in wireless sensor network applications. For more accurate location estimation, the two issues should be considered extensively: 1) how to sufficiently utilize the connection information between multiple nodes and 2) how to select a suitable solution from multiple solutions obtained by the Euclidean distance loss. In this paper, a DV-Hop localization based on the distance estimation using multinode (DEMN) and the hop loss in WSNs is proposed to address the two issues. In DEMN, when multiple anchor nodes can detect an unknown node, the distance expectation between the unknown node and an anchor node is calculated using the cross-domain information and is considered as the expected distance between them, which narrows the search space. When minimizing the traditional Euclidean distance loss, multiple solutions may exist. To select a suitable solution, the hop loss is proposed, which minimizes the difference between the real and its predicted hops. Finally, the Euclidean distance loss calculated by the DEMN and the hop loss are embedded into the multi-objective optimization algorithm. The experimental results show that the proposed method gains 86.11\% location accuracy in the randomly distributed network, which is 6.05% better than the DEM-DV-Hop, while DEMN and the hop loss can contribute 2.46% and 3.41%, respectively.
△ Less
Submitted 7 March, 2024;
originally announced March 2024.
-
Deep Network for Image Compressed Sensing Coding Using Local Structural Sampling
Authors:
Wenxue Cui,
Xingtao Wang,
Xiaopeng Fan,
Shaohui Liu,
Xinwei Gao,
Debin Zhao
Abstract:
Existing image compressed sensing (CS) coding frameworks usually solve an inverse problem based on measurement coding and optimization-based image reconstruction, which still exist the following two challenges: 1) The widely used random sampling matrix, such as the Gaussian Random Matrix (GRM), usually leads to low measurement coding efficiency. 2) The optimization-based reconstruction methods gen…
▽ More
Existing image compressed sensing (CS) coding frameworks usually solve an inverse problem based on measurement coding and optimization-based image reconstruction, which still exist the following two challenges: 1) The widely used random sampling matrix, such as the Gaussian Random Matrix (GRM), usually leads to low measurement coding efficiency. 2) The optimization-based reconstruction methods generally maintain a much higher computational complexity. In this paper, we propose a new CNN based image CS coding framework using local structural sampling (dubbed CSCNet) that includes three functional modules: local structural sampling, measurement coding and Laplacian pyramid reconstruction. In the proposed framework, instead of GRM, a new local structural sampling matrix is first developed, which is able to enhance the correlation between the measurements through a local perceptual sampling strategy. Besides, the designed local structural sampling matrix can be jointly optimized with the other functional modules during training process. After sampling, the measurements with high correlations are produced, which are then coded into final bitstreams by the third-party image codec. At last, a Laplacian pyramid reconstruction network is proposed to efficiently recover the target image from the measurement domain to the image domain. Extensive experimental results demonstrate that the proposed scheme outperforms the existing state-of-the-art CS coding methods, while maintaining fast computational speed.
△ Less
Submitted 29 February, 2024;
originally announced February 2024.
-
MOSAIC: A Modular System for Assistive and Interactive Cooking
Authors:
Huaxiaoyue Wang,
Kushal Kedia,
Juntao Ren,
Rahma Abdullah,
Atiksh Bhardwaj,
Angela Chao,
Kelly Y Chen,
Nathaniel Chin,
Prithwish Dan,
Xinyi Fan,
Gonzalo Gonzalez-Pumariega,
Aditya Kompella,
Maximus Adrian Pace,
Yash Sharma,
Xiangwan Sun,
Neha Sunkara,
Sanjiban Choudhury
Abstract:
We present MOSAIC, a modular architecture for home robots to perform complex collaborative tasks, such as cooking with everyday users. MOSAIC tightly collaborates with humans, interacts with users using natural language, coordinates multiple robots, and manages an open vocabulary of everyday objects. At its core, MOSAIC employs modularity: it leverages multiple large-scale pre-trained models for g…
▽ More
We present MOSAIC, a modular architecture for home robots to perform complex collaborative tasks, such as cooking with everyday users. MOSAIC tightly collaborates with humans, interacts with users using natural language, coordinates multiple robots, and manages an open vocabulary of everyday objects. At its core, MOSAIC employs modularity: it leverages multiple large-scale pre-trained models for general tasks like language and image recognition, while using streamlined modules designed for task-specific control. We extensively evaluate MOSAIC on 60 end-to-end trials where two robots collaborate with a human user to cook a combination of 6 recipes. We also extensively test individual modules with 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations of the task planner. We show that MOSAIC is able to efficiently collaborate with humans by running the overall system end-to-end with a real human user, completing 68.3% (41/60) collaborative cooking trials of 6 different recipes with a subtask completion rate of 91.6%. Finally, we discuss the limitations of the current system and exciting open challenges in this domain. The project's website is at https://portal-cornell.github.io/MOSAIC/
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection
Authors:
Xun Huang,
Hai Wu,
Xin Li,
Xiaoliang Fan,
Chenglu Wen,
Cheng Wang
Abstract:
LiDAR-based 3D object detection models have traditionally struggled under rainy conditions due to the degraded and noisy scanning signals. Previous research has attempted to address this by simulating the noise from rain to improve the robustness of detection models. However, significant disparities exist between simulated and actual rain-impacted data points. In this work, we propose a novel rain…
▽ More
LiDAR-based 3D object detection models have traditionally struggled under rainy conditions due to the degraded and noisy scanning signals. Previous research has attempted to address this by simulating the noise from rain to improve the robustness of detection models. However, significant disparities exist between simulated and actual rain-impacted data points. In this work, we propose a novel rain simulation method, termed DRET, that unifies Dynamics and Rainy Environment Theory to provide a cost-effective means of expanding the available realistic rain data for 3D detection training. Furthermore, we present a Sunny-to-Rainy Knowledge Distillation (SRKD) approach to enhance 3D detection under rainy conditions. Extensive experiments on the WaymoOpenDataset large-scale dataset show that, when combined with the state-of-the-art DSVT model and other classical 3D detectors, our proposed framework demonstrates significant detection accuracy improvements, without losing efficiency. Remarkably, our framework also improves detection capabilities under sunny conditions, therefore offering a robust solution for 3D detection regardless of whether the weather is rainy or sunny
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
QUCE: The Minimisation and Quantification of Path-Based Uncertainty for Generative Counterfactual Explanations
Authors:
Jamie Duell,
Monika Seisenberger,
Hsuan Fu,
Xiuyi Fan
Abstract:
Deep Neural Networks (DNNs) stand out as one of the most prominent approaches within the Machine Learning (ML) domain. The efficacy of DNNs has surged alongside recent increases in computational capacity, allowing these approaches to scale to significant complexities for addressing predictive challenges in big data. However, as the complexity of DNN models rises, interpretability diminishes. In re…
▽ More
Deep Neural Networks (DNNs) stand out as one of the most prominent approaches within the Machine Learning (ML) domain. The efficacy of DNNs has surged alongside recent increases in computational capacity, allowing these approaches to scale to significant complexities for addressing predictive challenges in big data. However, as the complexity of DNN models rises, interpretability diminishes. In response to this challenge, explainable models such as Adversarial Gradient Integration (AGI) leverage path-based gradients provided by DNNs to elucidate their decisions. Yet the performance of path-based explainers can be compromised when gradients exhibit irregularities during out-of-distribution path traversal. In this context, we introduce Quantified Uncertainty Counterfactual Explanations (QUCE), a method designed to mitigate out-of-distribution traversal by minimizing path uncertainty. QUCE not only quantifies uncertainty when presenting explanations but also generates more certain counterfactual examples. We showcase the performance of the QUCE method by comparing it with competing methods for both path-based explanations and generative counterfactual examples.
△ Less
Submitted 29 April, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks
Authors:
Zhiying Jiang,
Xingyuan Li,
Jinyuan Liu,
Xin Fan,
Risheng Liu
Abstract:
Image stitching seamlessly integrates images captured from varying perspectives into a single wide field-of-view image. Such integration not only broadens the captured scene but also augments holistic perception in computer vision applications. Given a pair of captured images, subtle perturbations and distortions which go unnoticed by the human visual system tend to attack the correspondence match…
▽ More
Image stitching seamlessly integrates images captured from varying perspectives into a single wide field-of-view image. Such integration not only broadens the captured scene but also augments holistic perception in computer vision applications. Given a pair of captured images, subtle perturbations and distortions which go unnoticed by the human visual system tend to attack the correspondence matching, impairing the performance of image stitching algorithms. In light of this challenge, this paper presents the first attempt to improve the robustness of image stitching against adversarial attacks. Specifically, we introduce a stitching-oriented attack~(SoA), tailored to amplify the alignment loss within overlapping regions, thereby targeting the feature matching procedure. To establish an attack resistant model, we delve into the robustness of stitching architecture and develop an adaptive adversarial training~(AAT) to balance attack resistance with stitching precision. In this way, we relieve the gap between the routine adversarial training and benign models, ensuring resilience without quality compromise. Comprehensive evaluation across real-world and synthetic datasets validate the deterioration of SoA on stitching performance. Furthermore, AAT emerges as a more robust solution against adversarial perturbations, delivering superior stitching results. Code is available at:https://github.com/Jzy2017/TRIS.
△ Less
Submitted 24 February, 2024;
originally announced February 2024.
-
Bringing Generative AI to Adaptive Learning in Education
Authors:
Hang Li,
Tianlong Xu,
Chaoli Zhang,
Eason Chen,
Jing Liang,
Xing Fan,
Haoyang Li,
Jiliang Tang,
Qingsong Wen
Abstract:
The recent surge in generative AI technologies, such as large language models and diffusion models, have boosted the development of AI applications in various domains, including science, finance, and education. Concurrently, adaptive learning, a concept that has gained substantial interest in the educational sphere, has proven its efficacy in enhancing students' learning efficiency. In this positi…
▽ More
The recent surge in generative AI technologies, such as large language models and diffusion models, have boosted the development of AI applications in various domains, including science, finance, and education. Concurrently, adaptive learning, a concept that has gained substantial interest in the educational sphere, has proven its efficacy in enhancing students' learning efficiency. In this position paper, we aim to shed light on the intersectional studies of these two methods, which combine generative AI with adaptive learning concepts. By presenting discussions about the benefits, challenges, and potentials in this field, we argue that this union will contribute significantly to the development of the next stage learning format in education.
△ Less
Submitted 22 February, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
ChatCell: Facilitating Single-Cell Analysis with Natural Language
Authors:
Yin Fang,
Kangwei Liu,
Ningyu Zhang,
Xinle Deng,
Penghui Yang,
Zhuo Chen,
Xiangru Tang,
Mark Gerstein,
Xiaohui Fan,
Huajun Chen
Abstract:
As Large Language Models (LLMs) rapidly evolve, their influence in science is becoming increasingly prominent. The emerging capabilities of LLMs in task generalization and free-form dialogue can significantly advance fields like chemistry and biology. However, the field of single-cell biology, which forms the foundational building blocks of living organisms, still faces several challenges. High kn…
▽ More
As Large Language Models (LLMs) rapidly evolve, their influence in science is becoming increasingly prominent. The emerging capabilities of LLMs in task generalization and free-form dialogue can significantly advance fields like chemistry and biology. However, the field of single-cell biology, which forms the foundational building blocks of living organisms, still faces several challenges. High knowledge barriers and limited scalability in current methods restrict the full exploitation of LLMs in mastering single-cell data, impeding direct accessibility and rapid iteration. To this end, we introduce ChatCell, which signifies a paradigm shift by facilitating single-cell analysis with natural language. Leveraging vocabulary adaptation and unified sequence generation, ChatCell has acquired profound expertise in single-cell biology and the capability to accommodate a diverse range of analysis tasks. Extensive experiments further demonstrate ChatCell's robust performance and potential to deepen single-cell insights, paving the way for more accessible and intuitive exploration in this pivotal field. Our project homepage is available at https://zjunlp.github.io/project/ChatCell.
△ Less
Submitted 19 February, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
Authors:
Zhiheng Xi,
Wenxiang Chen,
Boyang Hong,
Senjie Jin,
Rui Zheng,
Wei He,
Yiwen Ding,
Shichun Liu,
Xin Guo,
Junzhe Wang,
Honglin Guo,
Wei Shen,
Xiaoran Fan,
Yuhao Zhou,
Shihan Dou,
Xiao Wang,
Xinbo Zhang,
Peng Sun,
Tao Gui,
Qi Zhang,
Xuanjing Huang
Abstract:
In this paper, we propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for o…
▽ More
In this paper, we propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R$^3$ overcomes these limitations by learning from correct demonstrations. Specifically, R$^3$ progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at all stages. Thus, R$^3$ establishes a step-wise curriculum, allowing outcome supervision to offer step-level signals and precisely pinpoint errors. Using Llama2-7B, our method surpasses RL baseline on eight reasoning tasks by $4.1$ points on average. Notebaly, in program-based reasoning on GSM8K, it exceeds the baseline by $4.2$ points across three backbone models, and without any extra data, Codellama-7B + R$^3$ performs comparable to larger models or closed-source models.
△ Less
Submitted 17 March, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
Authors:
Shihan Dou,
Yan Liu,
Haoxiang Jia,
Limao Xiong,
Enyu Zhou,
Wei Shen,
Junjie Shan,
Caishuang Huang,
Xiao Wang,
Xiaoran Fan,
Zhiheng Xi,
Yuhao Zhou,
Tao Ji,
Rui Zheng,
Qi Zhang,
Xuanjing Huang,
Tao Gui
Abstract:
The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback for exploring the output space of LLMs to enhance code generation quality. However, the lengthy code generated by LLMs in response to complex human requirements makes RL exploration a challenge. Also, since the unit te…
▽ More
The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback for exploring the output space of LLMs to enhance code generation quality. However, the lengthy code generated by LLMs in response to complex human requirements makes RL exploration a challenge. Also, since the unit tests may not cover the complicated code, optimizing LLMs by using these unexecuted code snippets is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation, consisting of two main components: CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks, while FGO only optimizes the model by masking the unexecuted code segments to provide Fine-Grained Optimization. In addition, we furthermore construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks. Our dataset APPS+ and StepCoder are available online.
△ Less
Submitted 5 February, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
MouSi: Poly-Visual-Expert Vision-Language Models
Authors:
Xiaoran Fan,
Tao Ji,
Changhao Jiang,
Shuo Li,
Senjie Jin,
Sirui Song,
Junke Wang,
Boyang Hong,
Lu Chen,
Guodong Zheng,
Ming Zhang,
Caishuang Huang,
Rui Zheng,
Zhiheng Xi,
Yuhao Zhou,
Shihan Dou,
Junjie Ye,
Hang Yan,
Tao Gui,
Qi Zhang,
Xipeng Qiu,
Xuanjing Huang,
Zuxuan Wu,
Yu-Gang Jiang
Abstract:
Current large vision-language models (VLMs) often encounter challenges such as insufficient capabilities of a single visual component and excessively long visual tokens. These issues can limit the model's effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability…
▽ More
Current large vision-language models (VLMs) often encounter challenges such as insufficient capabilities of a single visual component and excessively long visual tokens. These issues can limit the model's effectiveness in accurately interpreting complex visual information and over-lengthy contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes the use of ensemble experts technique to synergizes the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, image segmentation, etc. This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encoding caused by lengthy image feature sequences, effectively addressing the issue of position overflow and length limitations. For instance, in our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1. Experimental results demonstrate that VLMs with multiple experts exhibit consistently superior performance over isolated visual encoders and mark a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
△ Less
Submitted 30 January, 2024;
originally announced January 2024.
-
Scientific Large Language Models: A Survey on Biological & Chemical Domains
Authors:
Qiang Zhang,
Keyang Ding,
Tianwen Lyv,
Xinda Wang,
Qingyu Yin,
Yiwen Zhang,
Jing Yu,
Yuhao Wang,
Xiaotong Li,
Zhuoyi Xiang,
Xiang Zhuang,
Zeyuan Wang,
Ming Qin,
Mengyao Zhang,
Jinlu Zhang,
Jiyu Cui,
Renjun Xu,
Hongyang Chen,
Xiaohui Fan,
Huabin Xing,
Huajun Chen
Abstract:
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent o…
▽ More
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.
△ Less
Submitted 26 January, 2024;
originally announced January 2024.
-
Integrated Sensing, Communication, and Computing: An Information-oriented Resource Transaction Mechanism
Authors:
Ning Chen,
Zhipeng Cheng,
Xuwei Fan,
Zhang Liu,
Bangzhen Huang,
Jie Yang,
Yifeng Zhao,
Lianfen Huang
Abstract:
Information acquisition from target perception represents the key enabling technology of the Internet of Automatic Vehicles (IoAV), which is essential for the decision-making and control operation of connected automatic vehicles (CAVs). Exploring target information involves multiple operations on data, e.g., wireless sensing (for data acquisition), communication (for data transmission), and comput…
▽ More
Information acquisition from target perception represents the key enabling technology of the Internet of Automatic Vehicles (IoAV), which is essential for the decision-making and control operation of connected automatic vehicles (CAVs). Exploring target information involves multiple operations on data, e.g., wireless sensing (for data acquisition), communication (for data transmission), and computing (for data analysis), which all rely on the consumption of time-space-frequency-computing (TSFC) multi-domain resources. Due to the coupled resource sharing of sensing, communication, and computing procedures, the resource management of information-oriented IoAV is commonly formulated as a non-convex NP-hard problem. In this article, further combining the integrated sensing and communication (ISAC) and computing, we introduce the integrated sensing, communication, and computing (ISCC), wherein the TSFC resources are decoupled from the specific processes and shared universally among sensing, communication, and computing processes. Furthermore, the information-oriented resource trading platform (IRTP) is established, which transforms the problem of ISCC resource management into a resource-information substitution model. Finally, we embed the employment topology structure in IoAV into neural network architecture, taking advantage of the graph neural network (GNN) and multi-worker reinforcement learning, and propose the dynamic resource management strategy based on the asynchronous advantage GNN (A2GNN) algorithm, which can achieve the convergence both of information gain maximization and resource consumption minimization, realizing efficient information-oriented resource management.
△ Less
Submitted 22 January, 2024;
originally announced January 2024.
-
RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning
Authors:
Junjie Ye,
Yilong Wu,
Songyang Gao,
Caishuang Huang,
Sixian Li,
Guanyu Li,
Xiaoran Fan,
Qi Zhang,
Tao Gui,
Xuanjing Huang
Abstract:
Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs' capacity to utilize tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce RoTBench, a multi-level b…
▽ More
Tool learning has generated widespread interest as a vital means of interaction between Large Language Models (LLMs) and the physical world. Current research predominantly emphasizes LLMs' capacity to utilize tools in well-structured environments while overlooking their stability when confronted with the inevitable noise of the real world. To bridge this gap, we introduce RoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool learning. Specifically, we establish five external environments, each featuring varying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union), providing an in-depth analysis of the model's resilience across three critical phases: tool selection, parameter identification, and content filling. Experiments involving six widely-used models underscore the urgent necessity for enhancing the robustness of LLMs in tool learning. For instance, the performance of GPT-4 even drops significantly from 80.00 to 58.10 when there is no substantial change in manual accuracy. More surprisingly, the noise correction capability inherent in the GPT family paradoxically impedes its adaptability in the face of mild noise. In light of these findings, we propose RoTTuning, a strategy that enriches the diversity of training environments to bolster the robustness of LLMs in tool learning. The code and data are available at https://github.com/Junjie-Ye/RoTBench.
△ Less
Submitted 19 January, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
The Devil is in the Details: Boosting Guided Depth Super-Resolution via Rethinking Cross-Modal Alignment and Aggregation
Authors:
Xinni Jiang,
Zengsheng Kuang,
Chunle Guo,
Ruixun Zhang,
Lei Cai,
Xiao Fan,
Chongyi Li
Abstract:
Guided depth super-resolution (GDSR) involves restoring missing depth details using the high-resolution RGB image of the same scene. Previous approaches have struggled with the heterogeneity and complementarity of the multi-modal inputs, and neglected the issues of modal misalignment, geometrical misalignment, and feature selection. In this study, we rethink some essential components in GDSR netwo…
▽ More
Guided depth super-resolution (GDSR) involves restoring missing depth details using the high-resolution RGB image of the same scene. Previous approaches have struggled with the heterogeneity and complementarity of the multi-modal inputs, and neglected the issues of modal misalignment, geometrical misalignment, and feature selection. In this study, we rethink some essential components in GDSR networks and propose a simple yet effective Dynamic Dual Alignment and Aggregation network (D2A2). D2A2 mainly consists of 1) a dynamic dual alignment module that adapts to alleviate the modal misalignment via a learnable domain alignment block and geometrically align cross-modal features by learning the offset; and 2) a mask-to-pixel feature aggregate module that uses the gated mechanism and pixel attention to filter out irrelevant texture noise from RGB features and combine the useful features with depth features. By combining the strengths of RGB and depth features while minimizing disturbance introduced by the RGB image, our method with simple reuse and redesign of basic components achieves state-of-the-art performance on multiple benchmark datasets. The code is available at https://github.com/JiangXinni/D2A2.
△ Less
Submitted 16 January, 2024;
originally announced January 2024.
-
Secrets of RLHF in Large Language Models Part II: Reward Modeling
Authors:
Binghai Wang,
Rui Zheng,
Lu Chen,
Yan Liu,
Shihan Dou,
Caishuang Huang,
Wei Shen,
Senjie Jin,
Enyu Zhou,
Chenyu Shi,
Songyang Gao,
Nuo Xu,
Yuhao Zhou,
Xiaoran Fan,
Zhiheng Xi,
Jun Zhao,
Xiao Wang,
Tao Ji,
Hang Yan,
Lixing Shen,
Zhan Chen,
Tao Gui,
Qi Zhang,
Xipeng Qiu,
Xuanjing Huang
, et al. (2 additional authors not shown)
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they f…
▽ More
Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training.
In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with varying preference strengths have different impacts on reward model performance. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative RLHF optimization.
△ Less
Submitted 12 January, 2024; v1 submitted 11 January, 2024;
originally announced January 2024.
-
Probability-based Distance Estimation Model for 3D DV-Hop Localization in WSNs
Authors:
Penghong Wang,
Hao Wang,
Wenrui Li,
Xiaopeng Fan,
Debin Zhao
Abstract:
Localization is one of the pivotal issues in wireless sensor network applications. In 3D localization studies, most algorithms focus on enhancing the location prediction process, lacking theoretical derivation of the detection distance of an anchor node at the varying hops, engenders a localization performance bottleneck. To address this issue, we propose a probability-based average distance estim…
▽ More
Localization is one of the pivotal issues in wireless sensor network applications. In 3D localization studies, most algorithms focus on enhancing the location prediction process, lacking theoretical derivation of the detection distance of an anchor node at the varying hops, engenders a localization performance bottleneck. To address this issue, we propose a probability-based average distance estimation (PADE) model that utilizes the probability distribution of node distances detected by an anchor node. The aim is to mathematically derive the average distances of nodes detected by an anchor node at different hops. First, we develop a probability-based maximum distance estimation (PMDE) model to calculate the upper bound of the distance detected by an anchor node. Then, we present the PADE model, which relies on the upper bound obtained of the distance by the PMDE model. Finally, the obtained average distance is used to construct a distance loss function, and it is embedded with the traditional distance loss function into a multi-objective genetic algorithm to predict the locations of unknown nodes. The experimental results demonstrate that the proposed method achieves state-of-the-art performance in random and multimodal distributed sensor networks. The average localization accuracy is improved by 3.49\%-12.66\% and 3.99%-22.34%, respectively.
△ Less
Submitted 11 January, 2024;
originally announced January 2024.
-
Decentralized Federated Policy Gradient with Byzantine Fault-Tolerance and Provably Fast Convergence
Authors:
Philip Jordan,
Florian Grötschla,
Flint Xiaofeng Fan,
Roger Wattenhofer
Abstract:
In Federated Reinforcement Learning (FRL), agents aim to collaboratively learn a common task, while each agent is acting in its local environment without exchanging raw trajectories. Existing approaches for FRL either (a) do not provide any fault-tolerance guarantees (against misbehaving agents), or (b) rely on a trusted central agent (a single point of failure) for aggregating updates. We provide…
▽ More
In Federated Reinforcement Learning (FRL), agents aim to collaboratively learn a common task, while each agent is acting in its local environment without exchanging raw trajectories. Existing approaches for FRL either (a) do not provide any fault-tolerance guarantees (against misbehaving agents), or (b) rely on a trusted central agent (a single point of failure) for aggregating updates. We provide the first decentralized Byzantine fault-tolerant FRL method. Towards this end, we first propose a new centralized Byzantine fault-tolerant policy gradient (PG) algorithm that improves over existing methods by relying only on assumptions standard for non-fault-tolerant PG. Then, as our main contribution, we show how a combination of robust aggregation and Byzantine-resilient agreement methods can be leveraged in order to eliminate the need for a trusted central entity. Since our results represent the first sample complexity analysis for Byzantine fault-tolerant decentralized federated non-convex optimization, our technical contributions may be of independent interest. Finally, we corroborate our theoretical results experimentally for common RL environments, demonstrating the speed-up of decentralized federations w.r.t. the number of participating agents and resilience against various Byzantine attacks.
△ Less
Submitted 7 January, 2024;
originally announced January 2024.
-
Towards Integrated Fine-tuning and Inference when Generative AI meets Edge Intelligence
Authors:
Ning Chen,
Zhipeng Cheng,
Xuwei Fan,
Xiaoyu Xia,
Lianfen Huang
Abstract:
The high-performance generative artificial intelligence (GAI) represents the latest evolution of computational intelligence, while the blessing of future 6G networks also makes edge intelligence (EI) full of development potential. The inevitable encounter between GAI and EI can unleash new opportunities, where GAI's pre-training based on massive computing resources and large-scale unlabeled corpor…
▽ More
The high-performance generative artificial intelligence (GAI) represents the latest evolution of computational intelligence, while the blessing of future 6G networks also makes edge intelligence (EI) full of development potential. The inevitable encounter between GAI and EI can unleash new opportunities, where GAI's pre-training based on massive computing resources and large-scale unlabeled corpora can provide strong foundational knowledge for EI, while EI can harness fragmented computing resources to aggregate personalized knowledge for GAI. However, the natural contradictory features pose significant challenges to direct knowledge sharing. To address this, in this paper, we propose the GAI-oriented synthetical network (GaisNet), a collaborative cloud-edge-end intelligence framework that buffers contradiction leveraging data-free knowledge relay, where the bidirectional knowledge flow enables GAI's virtuous-cycle model fine-tuning and task inference, achieving mutualism between GAI and EI with seamless fusion and collaborative evolution. Experimental results demonstrate the effectiveness of the proposed mechanisms. Finally, we discuss the future challenges and directions in the interplay between GAI and EI.
△ Less
Submitted 5 January, 2024;
originally announced January 2024.
-
GainNet: Coordinates the Odd Couple of Generative AI and 6G Networks
Authors:
Ning Chen,
Jie Yang,
Zhipeng Cheng,
Xuwei Fan,
Zhang Liu,
Bangzhen Huang,
Yifeng Zhao,
Lianfen Huang,
Xiaojiang Du,
Mohsen Guizani
Abstract:
The rapid expansion of AI-generated content (AIGC) reflects the iteration from assistive AI towards generative AI (GAI) with creativity. Meanwhile, the 6G networks will also evolve from the Internet-of-everything to the Internet-of-intelligence with hybrid heterogeneous network architectures. In the future, the interplay between GAI and the 6G will lead to new opportunities, where GAI can learn th…
▽ More
The rapid expansion of AI-generated content (AIGC) reflects the iteration from assistive AI towards generative AI (GAI) with creativity. Meanwhile, the 6G networks will also evolve from the Internet-of-everything to the Internet-of-intelligence with hybrid heterogeneous network architectures. In the future, the interplay between GAI and the 6G will lead to new opportunities, where GAI can learn the knowledge of personalized data from the massive connected 6G end devices, while GAI's powerful generation ability can provide advanced network solutions for 6G network and provide 6G end devices with various AIGC services. However, they seem to be an odd couple, due to the contradiction of data and resources. To achieve a better-coordinated interplay between GAI and 6G, the GAI-native networks (GainNet), a GAI-oriented collaborative cloud-edge-end intelligence framework, is proposed in this paper. By deeply integrating GAI with 6G network design, GainNet realizes the positive closed-loop knowledge flow and sustainable-evolution GAI model optimization. On this basis, the GAI-oriented generic resource orchestration mechanism with integrated sensing, communication, and computing (GaiRom-ISCC) is proposed to guarantee the efficient operation of GainNet. Two simple case studies demonstrate the effectiveness and robustness of the proposed schemes. Finally, we envision the key challenges and future directions concerning the interplay between GAI models and 6G networks.
△ Less
Submitted 5 January, 2024;
originally announced January 2024.
-
A Hybrid Neural Network Model For Predicting The Nitrate Concentration In The Recirculating Aquaculture System
Authors:
Xiangyu Fan,
Jiaxin Lia,
Yingzhe Wang,
Yingsha Qu,
Hao Li,
Keming Qu,
Zhengguo Cui
Abstract:
This study was groundbreaking in its application of neural network models for nitrate management in the Recirculating Aquaculture System (RAS). A hybrid neural network model was proposed, which accurately predicted daily nitrate concentration and its trends using six water quality parameters. We conducted a 105-day aquaculture experiment, during which we collected 450 samples from five sets of RAS…
▽ More
This study was groundbreaking in its application of neural network models for nitrate management in the Recirculating Aquaculture System (RAS). A hybrid neural network model was proposed, which accurately predicted daily nitrate concentration and its trends using six water quality parameters. We conducted a 105-day aquaculture experiment, during which we collected 450 samples from five sets of RAS to train our model (C-L-A model) which incorporates Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and self-Attention. Furthermore, we obtained 90 samples from a standalone RAS as the testing data to evaluate the performance of the model in practical applications. The experimental results proved that the C-L-A model accurately predicted nitrate concentration in RAS and maintained good performance even with a reduced proportion of training data. We recommend using water quality parameters from the past 7 days to forecast future nitrate concentration, as this timeframe allows the model to achieve maximum generalization capability. Additionally, we compared the performance of the C-L-A model with three basic neural network models (CNN, LSTM, self-Attention) as well as three hybrid neural network models (CNN-LSTM, CNN-Attention, LSTM-Attention). The results demonstrated that the C-L-A model (R2=0.956) significantly outperformed the other neural network models (R2=0.901-0.927). Our study suggests that the utilization of neural network models, specifically the C-L-A model, could potentially assist the RAS industry in conserving resources for daily nitrate monitoring.
△ Less
Submitted 15 January, 2024; v1 submitted 2 January, 2024;
originally announced January 2024.
-
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
Authors:
Junjie Ye,
Guanyu Li,
Songyang Gao,
Caishuang Huang,
Yilong Wu,
Sixian Li,
Xiaoran Fan,
Shihan Dou,
Qi Zhang,
Tao Gui,
Xuanjing Huang
Abstract:
Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined, diverging from genuine needs. Furthermore, a sole emphasis on outcomes disregards the intricate capabilities essential for LLMs to effectively ut…
▽ More
Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined, diverging from genuine needs. Furthermore, a sole emphasis on outcomes disregards the intricate capabilities essential for LLMs to effectively utilize tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary between LLMs and the physical world. Evaluations involving ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, expanding the model size even exacerbates the hindrance to tool learning. These findings offer instructive insights aimed at advancing the field of tool learning. The data is available att https://github.com/Junjie-Ye/ToolEyes.
△ Less
Submitted 14 January, 2024; v1 submitted 1 January, 2024;
originally announced January 2024.
-
From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion
Authors:
Xingyuan Li,
Yang Zou,
Jinyuan Liu,
Zhiying Jiang,
Long Ma,
Xin Fan,
Risheng Liu
Abstract:
With the rapid progression of deep learning technologies, multi-modality image fusion has become increasingly prevalent in object detection tasks. Despite its popularity, the inherent disparities in how different sources depict scene content make fusion a challenging problem. Current fusion methodologies identify shared characteristics between the two modalities and integrate them within this shar…
▽ More
With the rapid progression of deep learning technologies, multi-modality image fusion has become increasingly prevalent in object detection tasks. Despite its popularity, the inherent disparities in how different sources depict scene content make fusion a challenging problem. Current fusion methodologies identify shared characteristics between the two modalities and integrate them within this shared domain using either iterative optimization or deep learning architectures, which often neglect the intricate semantic relationships between modalities, resulting in a superficial understanding of inter-modal connections and, consequently, suboptimal fusion outcomes. To address this, we introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. This method capitalizes on the complementary characteristics of diverse modalities, bolstering both the accuracy and robustness of object detection. The codebook is utilized to enhance a streamlined and concise depiction of the fused intra- and inter-domain dynamics, fine-tuned for optimal performance in detection tasks. We present a bilevel optimization strategy that establishes a nexus between the joint problem of fusion and detection, optimizing both processes concurrently. Furthermore, we introduce the first dataset of paired infrared and visible images accompanied by text prompts, paving the way for future research. Extensive experiments on several datasets demonstrate that our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
△ Less
Submitted 31 December, 2023;
originally announced January 2024.
-
QoE-oriented Dependent Task Scheduling under Multi-dimensional QoS Constraints over Distributed Networks
Authors:
Xuwei Fan,
Zhipeng Cheng,
Ning Chen,
Lianfen Huang,
Xianbin Wang
Abstract:
Task scheduling as an effective strategy can improve application performance on computing resource-limited devices over distributed networks. However, existing evaluation mechanisms fail to depict the complexity of diverse applications, which involve dependencies among tasks, computing resource requirements, and multi-dimensional quality of service (QoS) constraints. Furthermore, traditional QoS-o…
▽ More
Task scheduling as an effective strategy can improve application performance on computing resource-limited devices over distributed networks. However, existing evaluation mechanisms fail to depict the complexity of diverse applications, which involve dependencies among tasks, computing resource requirements, and multi-dimensional quality of service (QoS) constraints. Furthermore, traditional QoS-oriented task scheduling strategies struggle to meet the performance requirements without considering differences in satisfaction and acceptance of application, leading application failures and resource wastage. To tackle these issues, a quality of experience (QoE) cost model is designed to evaluate application completion, depicting the relationship among application satisfaction, communications, and computing resources in the distributed networks. Specifically, considering the sensitivity and preference of QoS, we model the different dimensional QoS degradation cost functions for dependent tasks, which are then integrated into the QoE cost model. Based on the QoE model, the dependent task scheduling problem is formulated as the minimization of overall QoE cost, aiming to improve the application performance in the distributed networks, which is proven Np-hard. Moreover, a heuristic Hierarchical Multi-queue Task Scheduling Algorithm (HMTSA) is proposed to address the QoE-oriented task scheduling problem among multiple dependent tasks, which utilizes hierarchical multiple queues to determine the optimal task execution order and location according to different dimensional QoS priorities. Finally, extensive experiments demonstrate that the proposed algorithm can significantly improve the satisfaction of applications.
△ Less
Submitted 29 December, 2023;
originally announced December 2023.
-
Air-to-Ground Communications Beyond 5G: UAV Swarm Formation Control and Tracking
Authors:
Xiao Fan,
Peiran Wu,
Minghua Xia
Abstract:
Unmanned aerial vehicle (UAV) communications have been widely accepted as promising technologies to support air-to-ground communications in the forthcoming sixth-generation (6G) wireless networks. This paper proposes a novel air-to-ground communication model consisting of aerial base stations served by UAVs and terrestrial user equipments (UEs) by integrating the technique of coordinated multi-poi…
▽ More
Unmanned aerial vehicle (UAV) communications have been widely accepted as promising technologies to support air-to-ground communications in the forthcoming sixth-generation (6G) wireless networks. This paper proposes a novel air-to-ground communication model consisting of aerial base stations served by UAVs and terrestrial user equipments (UEs) by integrating the technique of coordinated multi-point (CoMP) transmission with the theory of stochastic geometry. In particular, a CoMP set consisting of multiple UAVs is developed based on the theory of Poisson-Delaunay tetrahedralization. Effective UAV formation control and UAV swarm tracking schemes for two typical scenarios, including static and mobile UEs, are also developed using the multi-agent system theory to ensure that collaborative UAVs can efficiently reach target spatial positions for mission execution. Thanks to the ease of mathematical tractability, this model provides explicit performance expressions for a typical UE's coverage probability and achievable ergodic rate. Extensive simulation and numerical results corroborate that the proposed scheme outperforms UAV communications without CoMP transmission and obtains similar performance to the conventional CoMP scheme while avoiding search overhead.
△ Less
Submitted 25 December, 2023;
originally announced December 2023.
-
Learning Dense Correspondence for NeRF-Based Face Reenactment
Authors:
Songlin Yang,
Wei Wang,
Yushi Lan,
Xiangyu Fan,
Bo Peng,
Lei Yang,
Jing Dong
Abstract:
Face reenactment is challenging due to the need to establish dense correspondence between various face representations for motion transfer. Recent studies have utilized Neural Radiance Field (NeRF) as fundamental representation, which further enhanced the performance of multi-view face reenactment in photo-realism and 3D consistency. However, establishing dense correspondence between different fac…
▽ More
Face reenactment is challenging due to the need to establish dense correspondence between various face representations for motion transfer. Recent studies have utilized Neural Radiance Field (NeRF) as fundamental representation, which further enhanced the performance of multi-view face reenactment in photo-realism and 3D consistency. However, establishing dense correspondence between different face NeRFs is non-trivial, because implicit representations lack ground-truth correspondence annotations like mesh-based 3D parametric models (e.g., 3DMM) with index-aligned vertexes. Although aligning 3DMM space with NeRF-based face representations can realize motion control, it is sub-optimal for their limited face-only modeling and low identity fidelity. Therefore, we are inspired to ask: Can we learn the dense correspondence between different NeRF-based face representations without a 3D parametric model prior? To address this challenge, we propose a novel framework, which adopts tri-planes as fundamental NeRF representation and decomposes face tri-planes into three components: canonical tri-planes, identity deformations, and motion. In terms of motion control, our key contribution is proposing a Plane Dictionary (PlaneDict) module, which efficiently maps the motion conditions to a linear weighted addition of learnable orthogonal plane bases. To the best of our knowledge, our framework is the first method that achieves one-shot multi-view face reenactment without a 3D parametric model prior. Extensive experiments demonstrate that we produce better results in fine-grained motion control and identity preservation than previous methods.
△ Less
Submitted 18 December, 2023; v1 submitted 16 December, 2023;
originally announced December 2023.
-
LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin
Authors:
Shihan Dou,
Enyu Zhou,
Yan Liu,
Songyang Gao,
Jun Zhao,
Wei Shen,
Yuhao Zhou,
Zhiheng Xi,
Xiao Wang,
Xiaoran Fan,
Shiliang Pu,
Jiang Zhu,
Rui Zheng,
Tao Gui,
Qi Zhang,
Xuanjing Huang
Abstract:
Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling them to align with human instructions and enhance their capabilities in downstream tasks. Increasing instruction data substantially is a direct solution to align the model with a broader range of downstream tasks or notably improve its performance on a specific task. However, we find that large-scale increase…
▽ More
Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling them to align with human instructions and enhance their capabilities in downstream tasks. Increasing instruction data substantially is a direct solution to align the model with a broader range of downstream tasks or notably improve its performance on a specific task. However, we find that large-scale increases in instruction data can damage the world knowledge previously stored in LLMs. To address this challenge, we propose LoRAMoE, a novelty framework that introduces several low-rank adapters (LoRA) and integrates them by using a router network, like a plugin version of Mixture of Experts (MoE). It freezes the backbone model and forces a portion of LoRAs to focus on leveraging world knowledge to solve downstream tasks, to alleviate world knowledge-edge forgetting. Experimental results show that, as the instruction data increases, LoRAMoE can significantly improve the ability to process downstream tasks, while maintaining the world knowledge stored in the LLM.
△ Less
Submitted 8 March, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
Neural Gaussian Similarity Modeling for Differential Graph Structure Learning
Authors:
Xiaolong Fan,
Maoguo Gong,
Yue Wu,
Zedong Tang,
Jieyi Liu
Abstract:
Graph Structure Learning (GSL) has demonstrated considerable potential in the analysis of graph-unknown non-Euclidean data across a wide range of domains. However, constructing an end-to-end graph structure learning model poses a challenge due to the impediment of gradient flow caused by the nearest neighbor sampling strategy. In this paper, we construct a differential graph structure learning mod…
▽ More
Graph Structure Learning (GSL) has demonstrated considerable potential in the analysis of graph-unknown non-Euclidean data across a wide range of domains. However, constructing an end-to-end graph structure learning model poses a challenge due to the impediment of gradient flow caused by the nearest neighbor sampling strategy. In this paper, we construct a differential graph structure learning model by replacing the non-differentiable nearest neighbor sampling with a differentiable sampling using the reparameterization trick. Under this framework, we argue that the act of sampling \mbox{nearest} neighbors may not invariably be essential, particularly in instances where node features exhibit a significant degree of similarity. To alleviate this issue, the bell-shaped Gaussian Similarity (GauSim) modeling is proposed to sample non-nearest neighbors. To adaptively model the similarity, we further propose Neural Gaussian Similarity (NeuralGauSim) with learnable parameters featuring flexible sampling behaviors. In addition, we develop a scalable method by transferring the large-scale graph to the transition graph to significantly reduce the complexity. Experimental results demonstrate the effectiveness of the proposed methods.
△ Less
Submitted 14 December, 2023;
originally announced December 2023.
-
FAPP: Fast and Adaptive Perception and Planning for UAVs in Dynamic Cluttered Environments
Authors:
Minghao Lu,
Xiyu Fan,
Han Chen,
Peng Lu
Abstract:
Obstacle avoidance for Unmanned Aerial Vehicles (UAVs) in cluttered environments is significantly challenging. Existing obstacle avoidance for UAVs either focuses on fully static environments or static environments with only a few dynamic objects. In this paper, we take the initiative to consider the obstacle avoidance of UAVs in dynamic cluttered environments in which dynamic objects are the domi…
▽ More
Obstacle avoidance for Unmanned Aerial Vehicles (UAVs) in cluttered environments is significantly challenging. Existing obstacle avoidance for UAVs either focuses on fully static environments or static environments with only a few dynamic objects. In this paper, we take the initiative to consider the obstacle avoidance of UAVs in dynamic cluttered environments in which dynamic objects are the dominant objects. This type of environment poses significant challenges to both perception and planning. Multiple dynamic objects possess various motions, making it extremely difficult to estimate and predict their motions using one motion model. The planning must be highly efficient to avoid cluttered dynamic objects. This paper proposes Fast and Adaptive Perception and Planning (FAPP) for UAVs flying in complex dynamic cluttered environments. A novel and efficient point cloud segmentation strategy is proposed to distinguish static and dynamic objects. To address multiple dynamic objects with different motions, an adaptive estimation method with covariance adaptation is proposed to quickly and accurately predict their motions. Our proposed trajectory optimization algorithm is highly efficient, enabling it to avoid fast objects. Furthermore, an adaptive re-planning method is proposed to address the case when the trajectory optimization cannot find a feasible solution, which is common for dynamic cluttered environments. Extensive validations in both simulation and real-world experiments demonstrate the effectiveness of our proposed system for highly dynamic and cluttered environments.
△ Less
Submitted 14 December, 2023;
originally announced December 2023.
-
PCRDiffusion: Diffusion Probabilistic Models for Point Cloud Registration
Authors:
Yue Wu,
Yongzhe Yuan,
Xiaolong Fan,
Xiaoshui Huang,
Maoguo Gong,
Qiguang Miao
Abstract:
We propose a new framework that formulates point cloud registration as a denoising diffusion process from noisy transformation to object transformation. During training stage, object transformation diffuses from ground-truth transformation to random distribution, and the model learns to reverse this noising process. In sampling stage, the model refines randomly generated transformation to the outp…
▽ More
We propose a new framework that formulates point cloud registration as a denoising diffusion process from noisy transformation to object transformation. During training stage, object transformation diffuses from ground-truth transformation to random distribution, and the model learns to reverse this noising process. In sampling stage, the model refines randomly generated transformation to the output result in a progressive way. We derive the variational bound in closed form for training and provide implementations of the model. Our work provides the following crucial findings: (i) In contrast to most existing methods, our framework, Diffusion Probabilistic Models for Point Cloud Registration (PCRDiffusion) does not require repeatedly update source point cloud to refine the predicted transformation. (ii) Point cloud registration, one of the representative discriminative tasks, can be solved by a generative way and the unified probabilistic formulation. Finally, we discuss and provide an outlook on the application of diffusion model in different scenarios for point cloud registration. Experimental results demonstrate that our model achieves competitive performance in point cloud registration. In correspondence-free and correspondence-based scenarios, PCRDifussion can both achieve exceeding 50\% performance improvements.
△ Less
Submitted 10 December, 2023;
originally announced December 2023.
-
Urban Region Representation Learning with Attentive Fusion
Authors:
Fengze Sun,
Jianzhong Qi,
Yanchuan Chang,
Xiaoliang Fan,
Shanika Karunasekera,
Egemen Tanin
Abstract:
An increasing number of related urban data sources have brought forth novel opportunities for learning urban region representations, i.e., embeddings. The embeddings describe latent features of urban regions and enable discovering similar regions for urban planning applications. Existing methods learn an embedding for a region using every different type of region feature data, and subsequently fus…
▽ More
An increasing number of related urban data sources have brought forth novel opportunities for learning urban region representations, i.e., embeddings. The embeddings describe latent features of urban regions and enable discovering similar regions for urban planning applications. Existing methods learn an embedding for a region using every different type of region feature data, and subsequently fuse all learned embeddings of a region to generate a unified region embedding. However, these studies often overlook the significance of the fusion process. The typical fusion methods rely on simple aggregation, such as summation and concatenation, thereby disregarding correlations within the fused region embeddings.
To address this limitation, we propose a novel model named HAFusion. Our model is powered by a dual-feature attentive fusion module named DAFusion, which fuses embeddings from different region features to learn higher-order correlations between the regions as well as between the different types of region features. DAFusion is generic - it can be integrated into existing models to enhance their fusion process. Further, motivated by the effective fusion capability of an attentive module, we propose a hybrid attentive feature learning module named HALearning to enhance the embedding learning from each individual type of region features. Extensive experiments on three real-world datasets demonstrate that our model HAFusion outperforms state-of-the-art methods across three different prediction tasks. Using our learned region embedding leads to consistent and up to 31% improvements in the prediction accuracy.
△ Less
Submitted 26 April, 2024; v1 submitted 7 December, 2023;
originally announced December 2023.
-
Digital Life Project: Autonomous 3D Characters with Social Intelligence
Authors:
Zhongang Cai,
Jianping Jiang,
Zhongfei Qing,
Xinying Guo,
Mingyuan Zhang,
Zhengyu Lin,
Haiyi Mei,
Chen Wei,
Ruisi Wang,
Wanqi Yin,
Xiangyu Fan,
Han Du,
Liang Pan,
Peng Gao,
Zhitao Yang,
Yang Gao,
Jiaqi Li,
Tianxiang Ren,
Yukun Wei,
Xiaogang Wang,
Chen Change Loy,
Lei Yang,
Ziwei Liu
Abstract:
In this work, we present Digital Life Project, a framework utilizing language as the universal medium to build autonomous 3D characters, who are capable of engaging in social interactions and expressing with articulated body motions, thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models perso…
▽ More
In this work, we present Digital Life Project, a framework utilizing language as the universal medium to build autonomous 3D characters, who are capable of engaging in social interactions and expressing with articulated body motions, thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models personalities with systematic few-shot exemplars, incorporates a reflection process based on psychology principles, and emulates autonomy by initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis paradigm for controlling the character's digital body. It integrates motion matching, a proven industry technique to ensure motion quality, with cutting-edge advancements in motion generation for diversity. Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain. Collectively, they enable virtual characters to initiate and sustain dialogues autonomously, while evolving their socio-psychological states. Concurrently, these characters can perform contextually relevant bodily movements. Additionally, a motion captioning module further allows the virtual character to recognize and appropriately respond to human players' actions. Homepage: https://digital-life-project.com/
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
Physics Inspired Criterion for Pruning-Quantization Joint Learning
Authors:
Weiying Xie,
Xiaoyi Fan,
Xin Zhang,
Yunsong Li,
Jie Lei,
Leyuan Fang
Abstract:
Pruning-quantization joint learning always facilitates the deployment of deep neural networks (DNNs) on resource-constrained edge devices. However, most existing methods do not jointly learn a global criterion for pruning and quantization in an interpretable way. In this paper, we propose a novel physics inspired criterion for pruning-quantization joint learning (PIC-PQ), which is explored from an…
▽ More
Pruning-quantization joint learning always facilitates the deployment of deep neural networks (DNNs) on resource-constrained edge devices. However, most existing methods do not jointly learn a global criterion for pruning and quantization in an interpretable way. In this paper, we propose a novel physics inspired criterion for pruning-quantization joint learning (PIC-PQ), which is explored from an analogy we first draw between elasticity dynamics (ED) and model compression (MC). Specifically, derived from Hooke's law in ED, we establish a linear relationship between the filters' importance distribution and the filter property (FP) by a learnable deformation scale in the physics inspired criterion (PIC). Furthermore, we extend PIC with a relative shift variable for a global view. To ensure feasibility and flexibility, available maximum bitwidth and penalty factor are introduced in quantization bitwidth assignment. Experiments on benchmarks of image classification demonstrate that PIC-PQ yields a good trade-off between accuracy and bit-operations (BOPs) compression ratio e.g., 54.96X BOPs compression ratio in ResNet56 on CIFAR10 with 0.10% accuracy drop and 53.24X in ResNet18 on ImageNet with 0.61% accuracy drop). The code will be available at https://github.com/fanxxxxyi/PIC-PQ.
△ Less
Submitted 1 December, 2023;
originally announced December 2023.