-
Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models
Authors:
Zhe Ma,
Xuhong Zhang,
Qingming Li,
Tianyu Du,
Wenzhi Chen,
Zonghui Wang,
Shouling Ji
Abstract:
The past few years have witnessed substantial advancement in text-guided image generation powered by diffusion models. However, it was shown that text-to-image diffusion models are vulnerable to training image memorization, raising concerns about copyright infringement and privacy invasion. In this work, we perform practical analysis of memorization in text-to-image diffusion models. Targeting a set of images to protect, we conduct quantitative analysis on them without the need to collect any prompts. Specifically, we first formally define the memorization of an image and identify three necessary conditions of memorization: similarity, existence, and probability. We then reveal the correlation between the model's prediction error and image replication. Based on the correlation, we propose to utilize inversion techniques to verify the safety of target images against memorization and measure the extent to which they are memorized. Model developers can utilize our analysis method to discover memorized images or reliably claim safety against memorization. Extensive experiments on Stable Diffusion, a popular open-source text-to-image diffusion model, demonstrate the effectiveness of our analysis method.
Submitted 9 May, 2024;
originally announced May 2024.
-
TroLLoc: Logic Locking and Layout Hardening for IC Security Closure against Hardware Trojans
Authors:
Fangzhou Wang,
Qijing Wang,
Lilas Alrahis,
Bangqi Fu,
Shui Jiang,
Xiaopeng Zhang,
Ozgur Sinanoglu,
Tsung-Yi Ho,
Evangeline F. Y. Young,
Johann Knechtel
Abstract:
Due to cost benefits, supply chains of integrated circuits (ICs) are largely outsourced nowadays. However, passing ICs through various third-party providers gives rise to many security threats, like piracy of IC intellectual property or insertion of hardware Trojans, i.e., malicious circuit modifications.
In this work, we proactively and systematically protect the physical layouts of ICs against post-design insertion of Trojans. Toward that end, we propose TroLLoc, a novel scheme for IC security closure that employs, for the first time, logic locking and layout hardening in unison. TroLLoc is fully integrated into a commercial-grade design flow, and TroLLoc is shown to be effective, efficient, and robust. Our work provides in-depth layout and security analysis considering the challenging benchmarks of the ISPD'22/23 contests for security closure. We show that TroLLoc successfully renders layouts resilient, with reasonable overheads, against (i) general prospects for Trojan insertion as in the ISPD'22 contest, (ii) actual Trojan insertion as in the ISPD'23 contest, and (iii) potential second-order attacks where adversaries would first (i.e., before Trojan insertion) try to bypass the locking defense, e.g., using advanced machine learning attacks. Finally, we release all our artifacts for independent verification [2].
Submitted 9 May, 2024;
originally announced May 2024.
-
Rotation Initialization and Stepwise Refinement for Universal LiDAR Calibration
Authors:
Yifan Duan,
Xinran Zhang,
Guoliang You,
Yilong Wu,
Xingchen Li,
Yao Li,
Xiaomeng Chu,
Jie Peng,
Yu Zhang,
Jianmin Ji,
Yanyong Zhang
Abstract:
Autonomous systems often employ multiple LiDARs to leverage their integrated advantages, enhancing perception and robustness. The most critical prerequisite in this setting is estimating the extrinsics between the LiDARs, i.e., calibration. Despite the exciting progress in multi-LiDAR calibration efforts, a universal, sensor-agnostic calibration method remains elusive. Following a coarse-to-fine framework, we first design a spherical descriptor, TERRA, for 3-DoF rotation initialization with no prior knowledge. To further optimize, we present JEEP for the joint estimation of extrinsic and pose, integrating geometric and motion information to overcome factors affecting the point cloud registration. Finally, the LiDAR poses optimized by the hierarchical optimization module are input to the time synchronization module to produce the final calibration results, including the time offset. To verify the effectiveness, we conduct extensive experiments on eight datasets, testing 16 diverse types of LiDARs in total and dozens of calibration tasks. In the challenging tasks, the calibration errors can still be controlled within 5 cm and 1° with a high success rate.
Submitted 9 May, 2024;
originally announced May 2024.
-
Towards Robust Physical-world Backdoor Attacks on Lane Detection
Authors:
Xinwei Zhang,
Aishan Liu,
Tianyuan Zhang,
Siyuan Liang,
Xianglong Liu
Abstract:
Deep learning-based lane detection (LD) plays a critical role in autonomous driving systems, such as adaptive cruise control. However, it is vulnerable to backdoor attacks. Existing backdoor attack methods on LD exhibit limited effectiveness in dynamic real-world scenarios, primarily because they fail to consider dynamic scene factors, including changes in driving perspectives (e.g., viewpoint transformations) and environmental conditions (e.g., weather or lighting changes). To tackle this issue, this paper introduces BadLANE, a dynamic scene adaptation backdoor attack for LD designed to withstand changes in real-world dynamic scene factors. To address the challenges posed by changing driving perspectives, we propose an amorphous trigger pattern composed of shapeless pixels. This trigger design allows the backdoor to be activated by various forms or shapes of mud spots or pollution on the road or lens, enabling adaptation to changes in vehicle observation viewpoints during driving. To mitigate the effects of environmental changes, we design a meta-learning framework to train meta-generators tailored to different environmental conditions. These generators produce meta-triggers that incorporate diverse environmental information, such as weather or lighting conditions, as the initialization of the trigger patterns for backdoor implantation, thus enabling adaptation to dynamic environments. Extensive experiments on various commonly used LD models in both digital and physical domains validate the effectiveness of our attacks, which significantly outperform other baselines (+25.15% on average in Attack Success Rate). Our code will be available upon paper publication.
Submitted 9 May, 2024;
originally announced May 2024.
-
A Survey on Personalized Content Synthesis with Diffusion Models
Authors:
Xulu Zhang,
Xiao-Yong Wei,
Wengyu Zhang,
Jinlin Wu,
Zhaoxiang Zhang,
Zhen Lei,
Qing Li
Abstract:
Recent advancements in generative models have significantly impacted content creation, leading to the emergence of Personalized Content Synthesis (PCS). With a small set of user-provided examples, PCS aims to customize the subject of interest to specific user-defined prompts. Over the past two years, more than 150 methods have been proposed. However, existing surveys mainly focus on text-to-image generation, with few providing up-to-date summaries on PCS. This paper offers a comprehensive survey of PCS, with a particular focus on the diffusion models. Specifically, we introduce the generic frameworks of PCS research, which can be broadly classified into optimization-based and learning-based approaches. We further categorize and analyze these methodologies, discussing their strengths, limitations, and key techniques. Additionally, we delve into specialized tasks within the field, such as personalized object generation, face synthesis, and style personalization, highlighting their unique challenges and innovations. Despite encouraging progress, we also present an analysis of the challenges such as overfitting and the trade-off between subject fidelity and text alignment. Through this detailed overview and analysis, we propose future directions to advance the development of PCS.
Submitted 9 May, 2024;
originally announced May 2024.
-
Channel Capacity of Near-Field Multiuser Communications
Authors:
Boqun Zhao,
Chongjun Ouyang,
Xingqi Zhang,
Yuanwei Liu
Abstract:
The channel capacity of near-field (NF) communications is characterized by considering three types of multiuser channels: i) multiple access channel (MAC), ii) broadcast channel (BC), and iii) multicast channel (MC). For NF MAC and BC, closed-form expressions are derived for the sum-rate capacity as well as the capacity region under a two-user scenario. These results are further extended to scenarios with an arbitrary number of users. For NF MC, closed-form expressions are derived for the two-user channel capacity and the capacity upper bound with more users. Further insights are gleaned by exploring special cases, including scenarios with infinitely large array apertures, co-directional users, and linear arrays. Theoretical and numerical results are presented and compared with far-field communications to demonstrate that: i) the NF capacity of these three channels converges to finite values rather than growing unboundedly as the number of array elements increases; ii) the capacity of the MAC and BC with co-directional users can be improved by using the additional range dimensions in NF channels to reduce inter-user interference (IUI); and iii) the MC capacity benefits less from the NF effect compared to the MAC and BC, as multicasting is less sensitive to IUI.
Submitted 8 May, 2024;
originally announced May 2024.
-
Pedestrian Attribute Recognition as Label-balanced Multi-label Learning
Authors:
Yibo Zhou,
Hai-Miao Hu,
Yirong Xiang,
Xiaokang Zhang,
Haotian Wu
Abstract:
Rooted in the scarcity of most attributes, realistic pedestrian attribute datasets exhibit unduly skewed data distributions, which lead to two types of model failure: (1) label imbalance: model predictions lean greatly towards the majority labels; (2) semantics imbalance: the model easily overfits the under-represented attributes due to their insufficient semantic diversity. To render perfect label balancing, we propose a novel framework that successfully decouples label-balanced data re-sampling from the curse of attribute co-occurrence, i.e., we equalize the sampling prior of an attribute while not biasing that of the other co-occurring attributes. To diversify the attribute semantics and mitigate the feature noise, we propose a Bayesian feature augmentation method to introduce true in-distribution novelty. Handling both imbalances jointly, our work achieves the best accuracy on various popular benchmarks and, importantly, does so with a minimal computational budget.
Submitted 8 May, 2024;
originally announced May 2024.
-
CourseGPT-zh: an Educational Large Language Model Based on Knowledge Distillation Incorporating Prompt Optimization
Authors:
Zheyan Qu,
Lu Yin,
Zitong Yu,
Wenbo Wang,
Xing zhang
Abstract:
Large language models (LLMs) have demonstrated astonishing capabilities in natural language processing (NLP) tasks, sparking interest in their application to professional domains with more specialized requirements. However, restricted access to closed-source LLMs via APIs and the difficulty of collecting massive high-quality datasets pose obstacles to the development of large language models for education across various courses. Given these challenges, we propose CourseGPT-zh, a course-oriented education LLM that supports customization and low-cost deployment. To address the comprehensiveness and diversity requirements of course-specific corpora, we design a high-quality question-answering corpus distillation framework incorporating prompt optimization, which effectively mines textbook knowledge and enhances its diversity. Moreover, considering the alignment of LLM responses with user needs, a novel method for discrete prompt optimization based on LLM-as-Judge is introduced. During optimization, this framework leverages the LLM's ability to reflect on and exploit error feedback and patterns, allowing for prompts that meet user needs and preferences while reducing response length. Lastly, we obtain CourseGPT-zh based on the open-source LLM using parameter-efficient fine-tuning. Experimental results show that our discrete prompt optimization framework effectively improves the response quality of ChatGPT, and CourseGPT-zh exhibits strong professional capabilities in specialized knowledge question-answering, significantly outperforming comparable open-source models.
Submitted 7 May, 2024;
originally announced May 2024.
-
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
Authors:
Shudan Zhang,
Hanlin Zhao,
Xiao Liu,
Qinkai Zheng,
Zehan Qi,
Xiaotao Gu,
Xiaohan Zhang,
Yuxiao Dong,
Jie Tang
Abstract:
Large language models (LLMs) have manifested a strong ability to generate code for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks on algorithms and data science, and they insufficiently cover the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty in creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Compared with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfactory on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.
Submitted 7 May, 2024;
originally announced May 2024.
-
GOVERN: Gradient Orientation Vote Ensemble for Multi-Teacher Reinforced Distillation
Authors:
Wenjie Zhou,
Zhenxin Ding,
Xiaodong Zhang,
Haibo Shi,
Junfeng Wang,
Dawei Yin
Abstract:
Pre-trained language models have become an integral component of question-answering systems, achieving remarkable performance. For practical deployment, it is critical to carry out knowledge distillation to preserve high performance under computational constraints. In this paper, we address a key question: given the importance of unsupervised distillation for student performance, how does one effectively ensemble knowledge from multiple teachers at this stage without the guidance of ground-truth labels? We propose a novel algorithm, GOVERN, to tackle this issue. GOVERN has demonstrated significant improvements in both offline and online experiments. The proposed algorithm has been successfully deployed in a real-world commercial question-answering system.
Submitted 6 May, 2024;
originally announced May 2024.
-
GLHF: General Learned Evolutionary Algorithm Via Hyper Functions
Authors:
Xiaobin Li,
Kai Wu,
Yujian Betterest Li,
Xiaoyu Zhang,
Handing Wang,
Jing Liu
Abstract:
Pretrained Optimization Models (POMs) leverage knowledge gained from optimizing various tasks, providing efficient solutions for new optimization challenges through direct usage or fine-tuning. Despite the inefficiencies and limited generalization abilities observed in current POMs, our proposed model, the general pre-trained optimization model (GPOM), addresses these shortcomings. GPOM constructs a population-based pretrained Black-Box Optimization (BBO) model tailored for continuous optimization. Evaluation on the BBOB benchmark and two robot control tasks demonstrates that GPOM outperforms other pretrained BBO models significantly, especially for high-dimensional tasks. Its direct optimization performance exceeds that of state-of-the-art evolutionary algorithms and POMs. Furthermore, GPOM exhibits robust generalization capabilities across diverse task distributions, dimensions, population sizes, and optimization horizons.
Submitted 6 May, 2024;
originally announced May 2024.
-
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
Authors:
Shang Shang,
Xinqiang Zhao,
Zhongjiang Yao,
Yepeng Yao,
Liya Su,
Zijing Fan,
Xiaodan Zhang,
Zhengwei Jiang
Abstract:
To demonstrate and address the underlying maliciousness, we propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting this identified flaw by obfuscating the true intentions behind user prompts. This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to evade malicious intent detection effectively. We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which claims 100 million weekly active users, achieved a remarkable success rate of 83.65%. We also extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further proving the substantial impact of our findings on enhancing 'Red Team' strategies against LLM content security frameworks.
Submitted 7 May, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
-
Liberating Seen Classes: Boosting Few-Shot and Zero-Shot Text Classification via Anchor Generation and Classification Reframing
Authors:
Han Liu,
Siyang Zhao,
Xiaotong Zhang,
Feng Zhang,
Wei Wang,
Fenglong Ma,
Hongyang Chen,
Hong Yu,
Xianchao Zhang
Abstract:
Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance via transferring knowledge from seen classes to unseen classes, they are still limited in two ways: (1) inherent dissimilarities among classes make the transformation of features learned from seen classes to unseen classes both difficult and inefficient; (2) rare labeled novel samples usually cannot provide enough supervision signals to enable the model to adjust from the source distribution to the target distribution, especially for complicated scenarios. To alleviate the above issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, thereby enabling it to predict unseen categories without the necessity of training on seen classes. Specifically, for mining more related unseen category knowledge, we utilize a large pre-trained language model to generate pseudo novel samples, and select the most representative ones as category anchors. After that, we convert the multi-class classification task into a binary classification task and use the similarities of query-anchor pairs for prediction to fully leverage the limited supervision signals. Extensive experiments on six widely used public datasets show that our proposed method can outperform other strong baselines significantly in few-shot and zero-shot tasks, even without using any seen class samples.
Submitted 6 May, 2024;
originally announced May 2024.
-
ID-centric Pre-training for Recommendation
Authors:
Yiqing Wu,
Ruobing Xie,
Zhao Zhang,
Fuzhen Zhuang,
Xu Zhang,
Leyu Lin,
Zhanhui Kang,
Yongjun Xu
Abstract:
Classical sequential recommendation models generally adopt ID embeddings to store knowledge learned from user historical behaviors and represent items. However, these unique IDs are challenging to transfer to new domains. With the rise of pre-trained language models (PLMs), some pioneering works adopt PLMs for pre-trained recommendation, where modality information (e.g., text) is considered universal across domains via the PLM. Unfortunately, the behavioral information in ID embeddings has still been shown to dominate modality information in PLM-based recommendation models, which limits these models' performance. In this work, we propose a novel ID-centric recommendation pre-training paradigm (IDP), which directly transfers informative ID embeddings learned in pre-training domains to item representations in new domains. Specifically, in the pre-training stage, besides the ID-based sequential model for recommendation, we also build a Cross-domain ID-matcher (CDIM) learned from both behavioral and modality information. In the tuning stage, modality information of new domain items is regarded as a cross-domain bridge built by CDIM. We first leverage the textual information of downstream domain items to retrieve behaviorally and semantically similar items from pre-training domains using CDIM. Next, these retrieved pre-trained ID embeddings, rather than certain textual embeddings, are directly adopted to generate downstream new items' embeddings. Through extensive experiments on real-world datasets, both in cold and warm settings, we demonstrate that our proposed model significantly outperforms all baselines. Code will be released upon acceptance.
Submitted 7 May, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
-
E2GNN: Efficient Graph Neural Network Ensembles for Semi-Supervised Classification
Authors:
Xin Zhang,
Daochen Zha,
Qiaoyu Tan
Abstract:
This work studies ensemble learning for graph neural networks (GNNs) under the popular semi-supervised setting. Ensemble learning has shown superiority in improving the accuracy and robustness of traditional machine learning by combining the outputs of multiple weak learners. However, adopting a similar idea to integrate different GNN models is challenging for two reasons. First, GNN is notorious for its poor inference ability, so naively assembling multiple GNN models would deteriorate the inference efficiency. Second, when GNN models are trained with few labeled nodes, their performance is limited. In this case, the vanilla ensemble approach, e.g., majority vote, may be sub-optimal since most base models, i.e., GNNs, may make the wrong predictions. To this end, in this paper, we propose an efficient ensemble learner, E2GNN, to assemble multiple GNNs in a learnable way by leveraging both labeled and unlabeled nodes. Specifically, we first pre-train different GNN models on a given data scenario according to the labeled nodes. Next, instead of directly combining their outputs for label inference, we train a simple multi-layer perceptron (MLP) model to mimic their predictions on both labeled and unlabeled nodes. Then the unified MLP model is deployed to infer labels for unlabeled or new nodes. Since the predictions of unlabeled nodes from different GNN models may be incorrect, we develop a reinforced discriminator to effectively filter out those wrongly predicted nodes to boost the performance of MLP. By doing this, we suggest a principled approach to tackle the inference issues of GNN ensembles and maintain the merit of ensemble learning: improved performance. Comprehensive experiments over both transductive and inductive settings, across different GNN backbones and 8 benchmark datasets, demonstrate the superiority of E2GNN.
Submitted 6 May, 2024;
originally announced May 2024.
-
StyleSeg V2: Towards Robust One-shot Segmentation of Brain Tissue via Optimization-free Registration Error Perception
Authors:
Zhiwei Wang,
Xiaoyu Zeng,
Chongwei Wu,
Jinxin lv,
Xu Zhang,
Wei Fang,
Qiang Li
Abstract:
One-shot segmentation of brain tissue requires iteratively training a registration-segmentation (reg-seg) dual-model, where the reg-model aims to provide pseudo masks of unlabeled images for the seg-model by warping a carefully-labeled atlas. However, the imperfect reg-model induces image-mask misalignment, subsequently poisoning the seg-model. The recent StyleSeg bypasses this bottleneck by replacing the unlabeled images with their warped copies of the atlas, but needs to borrow the diverse image patterns via style transformation. Here, we present StyleSeg V2, inherited from StyleSeg but granted the ability to perceive the registration errors. The motivation is that good registration behaves in a mirrored fashion for mirrored images. Therefore, at almost no cost, StyleSeg V2 can have the reg-model itself "speak out" incorrectly-aligned regions by simply mirroring (symmetrically flipping the brain) its input, and the registration errors are symmetric inconsistencies between the outputs of original and mirrored inputs. Consequently, StyleSeg V2 allows the seg-model to make use of correctly-aligned regions of unlabeled images and also enhances the fidelity of the style-transformed warped atlas image by weighting the local transformation strength according to registration errors. The experimental results on three public datasets demonstrate that our proposed StyleSeg V2 outperforms other state-of-the-art methods by considerable margins, and exceeds StyleSeg by increasing the average Dice by at least 2.4%.
Submitted 6 May, 2024;
originally announced May 2024.
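The mirror-consistency idea in the abstract above can be illustrated with a one-dimensional toy. This is only a sketch under stated assumptions, not the authors' implementation: displacement fields are plain lists of signed shifts along one axis, and the names `mirror_consistency_error` and `reliable_mask` are hypothetical. How StyleSeg V2 handles the atlas, 3-D volumes, and the weighting step is not specified in the abstract.

```python
def mirror_consistency_error(field, field_of_mirrored):
    """Per-position disagreement between the field predicted for the original
    input and the un-mirrored field predicted for the flipped input."""
    if len(field) != len(field_of_mirrored):
        raise ValueError("fields must have the same length")
    # Undo the flip: reverse the positions and negate the shifts, because a
    # shift to the right becomes a shift to the left after mirroring.
    unmirrored = [-d for d in reversed(field_of_mirrored)]
    return [abs(a - b) for a, b in zip(field, unmirrored)]


def reliable_mask(errors, threshold):
    """Mark positions whose mirror inconsistency stays within the threshold."""
    return [e <= threshold for e in errors]


# Toy data (assumed values): the field for the original input ...
field = [0.5, 1.0, -0.5, 0.0]
# ... and a field for the mirrored input that disagrees at the first position.
field_of_mirrored = [0.0, 0.5, -1.0, 1.5]

errors = mirror_consistency_error(field, field_of_mirrored)
mask = reliable_mask(errors, threshold=0.25)
```

In this toy, positions with a large error would be treated as misaligned and excluded, while the remaining positions would be kept as usable supervision.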
-
CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
Authors:
Zhizhao Duan,
Hao Cheng,
Duo Xu,
Xi Wu,
Xiangxie Zhang,
Xi Ye,
Zhen Xie
Abstract:
In the vast and dynamic landscape of urban settings, Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention. This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios. CityLLaVA enhances model comprehension and prediction accuracy through (1) employing bounding boxes for optimal visual data preprocessing, including video best-view selection and visual prompt engineering during both training and testing phases; (2) constructing concise Question-Answer sequences and designing textual prompts to refine instruction comprehension; (3) implementing block expansion to fine-tune large VLMs efficiently; and (4) advancing prediction accuracy via a unique sequential questioning-based prediction augmentation. Demonstrating top-tier performance, our method achieved a benchmark score of 33.4308, securing the leading position on the leaderboard. The code can be found at: https://github.com/alibaba/AICITY2024_Track2_AliOpenTrek_CityLLaVA
Submitted 6 May, 2024;
originally announced May 2024.
-
Structure-Preserving Network Compression Via Low-Rank Induced Training Through Linear Layers Composition
Authors:
Xitong Zhang,
Ismail R. Alkhouri,
Rongrong Wang
Abstract:
Deep Neural Networks (DNNs) have achieved remarkable success in addressing many previously unsolvable tasks. However, the storage and computational requirements associated with DNNs pose a challenge for deploying these trained models on resource-limited devices. Therefore, a plethora of compression and pruning techniques have been proposed in recent years. Low-rank decomposition techniques are among the approaches most utilized to address this problem. Compared to post-training compression, compression-promoted training is still under-explored. In this paper, we present a theoretically-justified novel approach, termed Low-Rank Induced Training (LoRITa), that promotes low-rankness through the composition of linear layers and compresses by using singular value truncation. This is achieved without the need to change the structure at inference time or require constrained and/or additional optimization, other than the standard weight decay regularization. Moreover, LoRITa eliminates the need to (i) initialize with pre-trained models and (ii) specify rank selection prior to training. Our experimental results (i) demonstrate the effectiveness of our approach using MNIST on Fully Connected Networks, CIFAR10 on Vision Transformers, and CIFAR10/100 on Convolutional Neural Networks, and (ii) illustrate that we achieve either competitive or SOTA results when compared to leading structured pruning methods in terms of FLOPs and parameters drop.
Submitted 5 May, 2024;
originally announced May 2024.
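The two ingredients named in the abstract above, composing linear layers and compressing by singular value truncation, can be sketched with NumPy. This is a minimal illustration under assumptions, not the LoRITa training procedure: the layer shapes, the random weights, the chosen rank, and the helper names `collapse_layers` and `truncate_rank` are all hypothetical, and no training or weight decay is performed here.

```python
import numpy as np


def collapse_layers(factors):
    """Multiply a chain of bias-free linear maps, given in application order,
    into the single matrix they are equivalent to."""
    product = factors[0]
    for weight in factors[1:]:
        product = weight @ product
    return product


def truncate_rank(weight, rank):
    """Best rank-`rank` approximation via SVD, returned as two thin factors."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    left = u[:, :rank] * s[:rank]   # scale the kept columns by singular values
    right = vt[:rank, :]
    return left, right


rng = np.random.default_rng(0)
w1 = rng.standard_normal((64, 32))  # maps 32 -> 64
w2 = rng.standard_normal((64, 64))  # maps 64 -> 64
w3 = rng.standard_normal((16, 64))  # maps 64 -> 16

merged = collapse_layers([w1, w2, w3])      # one 16 x 32 matrix
left, right = truncate_rank(merged, rank=4)
approx = left @ right

x = rng.standard_normal(32)
same_output = np.allclose(merged @ x, w3 @ (w2 @ (w1 @ x)))
params_full = merged.size                   # 16 * 32 entries
params_low_rank = left.size + right.size    # 16 * 4 + 4 * 32 entries
```

Because the composed layers have no nonlinearity between them, they collapse to one matrix at inference time, which is why the network structure does not need to change. Storing the two thin factors instead of the merged matrix is where the parameter saving comes from.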
-
Exploring Text-based Realistic Building Facades Editing Application
Authors:
Jing Wang,
Xin Zhang
Abstract:
This paper explores the utilization of diffusion models and textual guidance for achieving localized editing of building facades, addressing the escalating demand for sophisticated editing methodologies in architectural design and urban planning. Leveraging the robust generative capabilities of diffusion models, this study presents a promising avenue for realistically synthesizing and modifying architectural facades. Through iterative diffusion and text descriptions, these models adeptly capture both the intricate global and local structures inherent in architectural facades, thus effectively navigating the complexity of such designs. Additionally, the paper examines the expansive potential of diffusion models in various facets, including the generation of novel facade designs, the enhancement of existing facades, and the realization of personalized customization. Despite their promise, diffusion models encounter obstacles such as computational resource constraints and data imbalances. To address these challenges, the study introduces the innovative Blended Latent Diffusion method for architectural facade editing, accompanied by a comprehensive visual analysis of its viability and efficacy. Through these endeavors, we aim to propel forward the field of architectural facade editing, contributing to its advancement and practical application.
Submitted 5 May, 2024;
originally announced May 2024.
-
Adaptive Guidance Learning for Camouflaged Object Detection
Authors:
Zhennan Chen,
Xuying Zhang,
Tian-Zhu Xiang,
Ying Tai
Abstract:
Camouflaged object detection (COD) aims to segment objects visually embedded in their surroundings, which is a very challenging task due to the high similarity between the objects and the background. To address it, most methods incorporate additional information (e.g., boundary, texture, and frequency clues) to guide feature learning for better detecting camouflaged objects from the background. Although progress has been made, these methods are basically individually tailored to specific auxiliary cues, thus lacking adaptability and not consistently achieving high segmentation performance. To this end, this paper proposes an adaptive guidance learning network, dubbed AGLNet, which is a unified end-to-end learnable model for exploring and adapting different additional cues in CNN models to guide accurate camouflaged feature learning. Specifically, we first design a straightforward additional information generation (AIG) module to learn additional camouflaged object cues, which can be adapted for the exploration of effective camouflaged features. Then we present a hierarchical feature combination (HFC) module to deeply integrate additional cues and image features to guide camouflaged feature learning in a multi-level fusion manner. Followed by a recalibration decoder (RD), different features are further aggregated and refined for accurate object prediction. Extensive experiments on three widely used COD benchmark datasets demonstrate that the proposed method achieves significant performance improvements under different additional cues, and outperforms 20 recent state-of-the-art methods by a large margin. Our code will be made publicly available at: https://github.com/ZNan-Chen/AGLNet.
Submitted 6 May, 2024; v1 submitted 5 May, 2024;
originally announced May 2024.
-
A Survey on Privacy-Preserving Caching at Network Edge: Classification, Solutions, and Challenges
Authors:
Xianzhi Zhang,
Yipeng Zhou,
Di Wu,
Shazia Riaz,
Quan Z. Sheng,
Miao Hu,
Linchang Xiao
Abstract:
Caching content at the network edge is a popular and effective technique widely deployed to alleviate the burden of network backhaul, shorten service delay and improve service quality. However, there has been some controversy over privacy violations in caching content at the network edge. On the one hand, the multi-access open edge network provides an ideal surface for external attackers to obtain private data from the edge cache by extracting sensitive information. On the other hand, privacy can be infringed by curious edge caching providers through caching trace analysis targeting to achieve better caching performance or higher profits. Therefore, an in-depth understanding of privacy issues in edge caching networks is vital and indispensable for creating a privacy-preserving caching service at the network edge. In this article, we are among the first to fill in this gap by examining privacy-preserving techniques for caching content at the network edge. Firstly, we provide an introduction to the background of Privacy-Preserving Edge Caching (PPEC). Next, we summarize the key privacy issues and present a taxonomy for caching at the network edge from the perspective of private data. Additionally, we conduct a retrospective review of the state-of-the-art countermeasures against privacy leakage from content caching at the network edge. Finally, we conclude the survey and envision challenges for future research.
Submitted 3 May, 2024;
originally announced May 2024.
-
SoftMCL: Soft Momentum Contrastive Learning for Fine-grained Sentiment-aware Pre-training
Authors:
Jin Wang,
Liang-Chih Yu,
Xuejie Zhang
Abstract:
The pre-training for language models captures general language understanding but fails to distinguish the affective impact of a particular context on a specific word. Recent works have sought to introduce contrastive learning (CL) for sentiment-aware pre-training in acquiring affective information. Nevertheless, these methods present two significant limitations. First, the capacity of the GPU memory often limits the number of negative samples, hindering the opportunities to learn good representations. In addition, using only a few sentiment polarities as hard labels, e.g., positive, neutral, and negative, to supervise CL will force all representations to converge to a few points, leading to the issue of latent space collapse. This study proposes a soft momentum contrastive learning (SoftMCL) for fine-grained sentiment-aware pre-training. Instead of hard labels, we introduce valence ratings as soft-label supervision for CL to measure the sentiment similarities between samples at a fine-grained level. The proposed SoftMCL is conducted on both the word- and sentence-level to enhance the model's ability to learn affective information. A momentum queue was introduced to expand the contrastive samples, allowing storing and involving more negatives to overcome the limitations of hardware platforms. Extensive experiments were conducted on four different sentiment-related tasks, which demonstrate the effectiveness of the proposed SoftMCL method. The code and data of the proposed SoftMCL are available at: https://www.github.com/wangjin0818/SoftMCL/.
Submitted 2 May, 2024;
originally announced May 2024.
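The abstract above does not spell out how its momentum queue works. The sketch below shows the general pattern from momentum contrastive learning (a fixed-size first-in, first-out store of past embeddings plus a slowly updated key encoder), assuming SoftMCL follows a similar scheme. The names `momentum_update` and `NegativeQueue`, the flat parameter lists, and the toy values are all hypothetical.

```python
from collections import deque


def momentum_update(key_params, query_params, momentum=0.99):
    """Move the key-encoder parameters slowly toward the query encoder."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """Fixed-capacity FIFO store of past embeddings used as extra negatives."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest items when full.
        self._items = deque(maxlen=capacity)

    def enqueue(self, embeddings):
        self._items.extend(embeddings)

    def negatives(self):
        return list(self._items)

    def __len__(self):
        return len(self._items)


queue = NegativeQueue(capacity=4)
queue.enqueue([[1.0], [2.0], [3.0]])   # embeddings from an earlier batch
queue.enqueue([[4.0], [5.0]])          # newer batch evicts the oldest entry
updated = momentum_update([1.0, 2.0], [0.0, 0.0], momentum=0.5)
```

The queue decouples the number of negatives from the batch size, which is the property the abstract relies on to work around GPU memory limits.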
-
Non-linear Welfare-Aware Strategic Learning
Authors:
Tian Xie,
Xueru Zhang
Abstract:
This paper studies algorithmic decision-making in the presence of strategic individual behaviors, where an ML model is used to make decisions about human agents and the latter can adapt their behavior strategically to improve their future data. Existing results on strategic learning have largely focused on the linear setting where agents with linear labeling functions best respond to a (noisy) linear decision policy. Instead, this work focuses on general non-linear settings where agents respond to the decision policy with only "local information" of the policy. Moreover, we simultaneously consider the objectives of maximizing decision-maker welfare (model prediction accuracy), social welfare (agent improvement caused by strategic behaviors), and agent welfare (the extent to which ML underestimates the agents). We first generalize the agent best response model in previous works to the non-linear setting, then reveal the compatibility of welfare objectives. We show that the three welfare objectives can attain the optimum simultaneously only under restrictive conditions which are challenging to achieve in non-linear settings. The theoretical results imply that existing works solely maximizing the welfare of a subset of parties inevitably diminish the welfare of the others. We thus claim the necessity of balancing the welfare of each party in non-linear settings and propose an irreducible optimization algorithm suitable for general strategic learning. Experiments on synthetic and real data validate the proposed algorithm.
Submitted 2 May, 2024;
originally announced May 2024.
-
Algorithmic Decision-Making under Agents with Persistent Improvement
Authors:
Tian Xie,
Xuwei Tan,
Xueru Zhang
Abstract:
This paper studies algorithmic decision-making under human's strategic behavior, where a decision maker uses an algorithm to make decisions about human agents, and the latter with information about the algorithm may exert effort strategically and improve to receive favorable decisions. Unlike prior works that assume agents benefit from their efforts immediately, we consider realistic scenarios where the impacts of these efforts are persistent and agents benefit from efforts by making improvements gradually. We first develop a dynamic model to characterize persistent improvements and based on this construct a Stackelberg game to model the interplay between agents and the decision-maker. We analytically characterize the equilibrium strategies and identify conditions under which agents have incentives to improve. With the dynamics, we then study how the decision-maker can design an optimal policy to incentivize the largest improvements inside the agent population. We also extend the model to settings where 1) agents may be dishonest and game the algorithm into making favorable but erroneous decisions; 2) honest efforts are forgettable and not sufficient to guarantee persistent improvements. With the extended models, we further examine conditions under which agents prefer honest efforts over dishonest behavior and the impacts of forgettable efforts.
Submitted 2 May, 2024;
originally announced May 2024.
-
Learning under Imitative Strategic Behavior with Unforeseeable Outcomes
Authors:
Tian Xie,
Zhiqun Zuo,
Mohammad Mahdi Khalili,
Xueru Zhang
Abstract:
Machine learning systems have been widely used to make decisions about individuals who may best respond and behave strategically to receive favorable outcomes, e.g., they may genuinely improve the true labels or manipulate observable features directly to game the system without changing labels. Although both behaviors have been studied (often as two separate problems) in the literature, most works assume individuals can (i) perfectly foresee the outcomes of their behaviors when they best respond; (ii) change their features arbitrarily as long as it is affordable, and the costs they need to pay are deterministic functions of feature changes. In this paper, we consider a different setting and focus on imitative strategic behaviors with unforeseeable outcomes, i.e., individuals manipulate/improve by imitating the features of those with positive labels, but the induced feature changes are unforeseeable. We first propose a Stackelberg game to model the interplay between individuals and the decision-maker, under which we examine how the decision-maker's ability to anticipate individual behavior affects its objective function and the individual's best response. We show that the objective difference between the two can be decomposed into three interpretable terms, with each representing the decision-maker's preference for a certain behavior. By exploring the roles of each term, we further illustrate how a decision-maker with adjusted preferences can simultaneously disincentivize manipulation, incentivize improvement, and promote fairness.
Submitted 2 May, 2024;
originally announced May 2024.
-
A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law
Authors:
Zhiyu Zoey Chen,
Jing Ma,
Xinlu Zhang,
Nan Hao,
An Yan,
Armineh Nourbakhsh,
Xianjun Yang,
Julian McAuley,
Linda Petzold,
William Yang Wang
Abstract:
In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications, challenges, and forward-looking opportunities of LLMs within these high-stakes sectors. We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies. Moreover, we critically examine the ethics for LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems that respect regulatory norms. By presenting a thorough review of current literature and practical applications, we showcase the transformative impact of LLMs, and outline the imperative for interdisciplinary cooperation, methodological advancements, and ethical vigilance. Through this lens, we aim to spark dialogue and inspire future research dedicated to maximizing the benefits of LLMs while mitigating their risks in these precision-dependent sectors. To facilitate future research on LLMs in these critical societal domains, we also initiate a reading list that tracks the latest advancements under this topic, which will be continually updated: https://github.com/czyssrs/LLM_X_papers.
Submitted 2 May, 2024;
originally announced May 2024.
-
S4: Self-Supervised Sensing Across the Spectrum
Authors:
Jayanth Shenoy,
Xinjian Davis Zhang,
Shlok Mehrotra,
Bill Tao,
Rem Yang,
Han Zhao,
Deepak Vasisht
Abstract:
Satellite image time series (SITS) segmentation is crucial for many applications like environmental monitoring, land cover mapping and agricultural crop type classification. However, training models for SITS segmentation remains a challenging task due to the lack of abundant training data, which requires fine-grained annotation. We propose S4, a new self-supervised pre-training approach that significantly reduces the requirement for labeled training data by utilizing two new insights: (a) Satellites capture images in different parts of the spectrum such as radio frequencies and visible frequencies. (b) Satellite imagery is geo-registered, allowing for fine-grained spatial alignment. We use these insights to formulate pre-training tasks in S4. We also curate m2s2-SITS, a large-scale dataset of unlabeled, spatially-aligned, multi-modal and geographic-specific SITS that serves as representative pre-training data for S4. Finally, we evaluate S4 on multiple SITS segmentation datasets and demonstrate its efficacy against competing baselines while using limited labeled data.
Submitted 2 May, 2024;
originally announced May 2024.
-
MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion
Authors:
Pengcheng Li,
Jianzong Wang,
Xulong Zhang,
Yong Zhang,
Jing Xiao,
Ning Cheng
Abstract:
One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulties in satisfactory speech representation disentanglement and suffer from sizable networks as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC to effectively disentangle via a concise neural network. The proposed model utilizes Siamese encoders to learn clean representations, further enhanced by the designed mutual information estimator. The Siamese structure and the newly designed convolution module contribute to the lightweight of our model while ensuring performance in diverse voice conversion tasks. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.
Submitted 1 May, 2024;
originally announced May 2024.
-
Joint Signal Detection and Automatic Modulation Classification via Deep Learning
Authors:
Huijun Xing,
Xuhui Zhang,
Shuo Chang,
Jinke Ren,
Zixun Zhang,
Jie Xu,
Shuguang Cui
Abstract:
Signal detection and modulation classification are two crucial tasks in various wireless communication systems. Different from prior works that investigate them independently, this paper studies the joint signal detection and automatic modulation classification (AMC) by considering a realistic and complex scenario, in which multiple signals with different modulation schemes coexist at different carrier frequencies. We first generate a coexisting RADIOML dataset (CRML23) to facilitate the joint design. Different from the publicly available AMC dataset ignoring the signal detection step and containing only one signal, our synthetic dataset covers the more realistic multiple-signal coexisting scenario. Then, we present a joint framework for detection and classification (JDM) for such a multiple-signal coexisting environment, which consists of two modules for signal detection and AMC, respectively. In particular, these two modules are interconnected using a designated data structure called "proposal". Finally, we conduct extensive simulations over the newly developed dataset, which demonstrate the effectiveness of our designs. Our code and dataset are now available as open-source (https://github.com/Singingkettle/ChangShuoRadioData).
Submitted 29 April, 2024;
originally announced May 2024.
-
Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation
Authors:
Yimin Deng,
Jianzong Wang,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Voice conversion is the task of transforming the voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage, while the prosodic information of hidden units is underused. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC" based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then the prosody is implicitly modeled on soft speech units with knowledge distillation. Experimental results show that the intelligibility and naturalness of converted speech outperform previous work.
Submitted 1 May, 2024;
originally announced May 2024.
-
MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation
Authors:
Xujie Zhang,
Ente Lin,
Xiu Li,
Yuxuan Luo,
Michael Kampffmeyer,
Xin Dong,
Xiaodan Liang
Abstract:
This paper introduces MMTryon, a multi-modal multi-reference VIrtual Try-ON (VITON) framework, which can generate high-quality compositional try-on results by taking as inputs a text instruction and multiple garment images. Our MMTryon mainly addresses two problems overlooked in prior literature: 1) Support of multiple try-on items and dressing style. Existing methods are commonly designed for single-item try-on tasks (e.g., upper/lower garments, dresses) and fall short on customizing dressing styles (e.g., zipped/unzipped, tuck-in/tuck-out, etc.). 2) Segmentation Dependency. They further heavily rely on category-specific segmentation models to identify the replacement regions, with segmentation errors directly leading to significant artifacts in the try-on results. For the first issue, our MMTryon introduces a novel multi-modality and multi-reference attention mechanism to combine the garment information from reference images and dressing-style information from text instructions. Besides, to remove the segmentation dependency, MMTryon uses a parsing-free garment encoder and leverages a novel scalable data generation pipeline to convert existing VITON datasets to a form that allows MMTryon to be trained without requiring any explicit segmentation. Extensive experiments on high-resolution benchmarks and in-the-wild test sets demonstrate MMTryon's superiority over existing SOTA methods both qualitatively and quantitatively. Besides, MMTryon's impressive performance on multi-item and style-controllable virtual try-on scenarios and its ability to try on any outfit in a large variety of scenarios from any source image open up a new avenue for future investigation in the fashion community.
Submitted 1 May, 2024;
originally announced May 2024.
-
Multi-Scale Heterogeneity-Aware Hypergraph Representation for Histopathology Whole Slide Images
Authors:
Minghao Han,
Xukun Zhang,
Dingkang Yang,
Tao Liu,
Haopeng Kuang,
Jinghui Feng,
Lihua Zhang
Abstract:
Survival prediction is a complex ordinal regression task that aims to predict the survival coefficient ranking among a cohort of patients, typically achieved by analyzing patients' whole slide images. Existing deep learning approaches mainly adopt multiple instance learning or graph neural networks under weak supervision. Most of them are unable to uncover the diverse interactions between different types of biological entities (e.g., cell cluster and tissue block) across multiple scales, while such interactions are crucial for patient survival prediction. In light of this, we propose a novel multi-scale heterogeneity-aware hypergraph representation framework. Specifically, our framework first constructs a multi-scale heterogeneity-aware hypergraph and assigns each node with its biological entity type. It then mines diverse interactions between nodes on the graph structure to obtain a global representation. Experimental results demonstrate that our method outperforms state-of-the-art approaches on three benchmark datasets. Code is publicly available at https://github.com/Hanminghao/H2GT.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
QLSC: A Query Latent Semantic Calibrator for Robust Extractive Question Answering
Authors:
Sheng Ouyang,
Jianzong Wang,
Yong Zhang,
Zhitao Li,
Ziqi Liang,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Extractive Question Answering (EQA) in Machine Reading Comprehension (MRC) often faces the challenge of dealing with semantically identical but format-variant inputs. Our work introduces a novel approach, called the "Query Latent Semantic Calibrator (QLSC)", designed as an auxiliary module for existing MRC models. We propose a unique scaling strategy to capture latent semantic center features of queries. These features are then seamlessly integrated into traditional query and passage embeddings using an attention mechanism. By deepening the comprehension of the semantic query-passage relationship, our approach diminishes sensitivity to variations in text format and boosts the model's capability in pinpointing accurate answers. Experimental results on robust Question-Answer datasets confirm that our approach effectively handles format-variant but semantically identical queries, highlighting the effectiveness and adaptability of our proposed method.
Submitted 30 April, 2024;
originally announced April 2024.
-
EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization
Authors:
Jianzong Wang,
Ziqi Liang,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3% and 0.2% in Character Error Rate (CER) on the Aishell-1 and HKUST datasets, respectively.
Submitted 29 April, 2024;
originally announced April 2024.
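The EfficientASR abstract above reports its results as Character Error Rate (CER). As a reminder of what that metric measures, here is a minimal sketch under the usual definition (character-level edit distance divided by the reference length). The function names and sample strings are illustrative assumptions and are not taken from the paper, whose exact scoring script is not described in the abstract.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four.
    print(cer("abcd", "abxd"))
```

Under this definition a reported improvement of 0.3% means the ratio above dropped by 0.003 in absolute terms, assuming the abstract quotes absolute rather than relative differences.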
-
EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning
Authors:
Ziqi Liang,
Jianzong Wang,
Xulong Zhang,
Yong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally disentangle speech components through human-crafted bottleneck features, which cannot achieve sufficient information disentanglement, so pitch and rhythm may still be mixed together. There is also a risk of information overlap in the disentangling process, which results in less natural speech. To overcome such limits, we propose a two-stage model to disentangle speech representations in a self-supervised manner without a human-crafted bottleneck design, which uses Mutual Information (MI) with the designed upper bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage issues. Experiments show that our model achieves better performance than the baseline in terms of disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at https://largeaudiomodel.com/eadvc.
Submitted 29 April, 2024;
originally announced April 2024.
-
CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition
Authors:
Jianzong Wang,
Pengcheng Li,
Xulong Zhang,
Ning Cheng,
Jing Xiao
Abstract:
Singing voice beautifying is a novel task that has application value in people's daily life, aiming to correct the pitch of the singing voice and improve the expressiveness without changing the original timbre and content. Existing methods rely on paired data or only concentrate on the correction of pitch. However, professional songs and amateur songs from the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects like emotion and rhythm. We therefore propose a fast and high-fidelity singing voice beautifying system called ConTuner, a diffusion model combined with the modified condition to generate the beautified Mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping relationship from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose the expressiveness enhancer in the latent space to convert amateur vocal tone to professional. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and generator-based acceleration method in ConTuner are effective.
Submitted 29 April, 2024;
originally announced April 2024.
-
Overcoming Knowledge Barriers: Online Imitation Learning from Observation with Pretrained World Models
Authors:
Xingyuan Zhang,
Philip Becker-Ehmck,
Patrick van der Smagt,
Maximilian Karl
Abstract:
Incorporating the successful paradigm of pretraining and finetuning from Computer Vision and Natural Language Processing into decision-making has become increasingly popular in recent years. In this paper, we study Imitation Learning from Observation with pretrained models and find that existing approaches such as BCO and AIME face knowledge barriers, specifically the Embodiment Knowledge Barrier (EKB) and the Demonstration Knowledge Barrier (DKB), greatly limiting their performance. The EKB arises when pretrained models lack knowledge about unseen observations, leading to errors in action inference. The DKB results from policies trained on limited demonstrations, hindering adaptability to diverse scenarios. We thoroughly analyse the underlying mechanism of these barriers and propose AIME-v2, built upon AIME, as a solution. AIME-v2 uses online interactions with a data-driven regulariser to alleviate the EKB and mitigates the DKB by introducing a surrogate reward function to enhance policy training. Experimental results on tasks from the DeepMind Control Suite and Meta-World benchmarks demonstrate the effectiveness of these modifications in improving both sample-efficiency and converged performance. The study contributes valuable insights into resolving knowledge barriers for enhanced decision-making in pretraining-based approaches. Code will be available at https://github.com/argmax-ai/aime-v2.
Submitted 29 April, 2024;
originally announced April 2024.
-
S$^2$Mamba: A Spatial-spectral State Space Model for Hyperspectral Image Classification
Authors:
Guanchun Wang,
Xiangrong Zhang,
Zelin Peng,
Tianyang Zhang,
Xiuping Jia,
Licheng Jiao
Abstract:
Land cover analysis using hyperspectral images (HSI) remains an open problem due to their low spatial resolution and complex spectral information. Recent studies are primarily dedicated to designing Transformer-based architectures for modeling spatial-spectral long-range dependencies, which is computationally expensive with quadratic complexity. The selective structured state space model (Mamba), which is efficient for modeling long-range dependencies with linear complexity, has recently shown promising progress. However, its potential in hyperspectral image processing that requires handling numerous spectral bands has not yet been explored. In this paper, we propose S$^2$Mamba, a spatial-spectral state space model for hyperspectral image classification, to excavate spatial-spectral contextual features, resulting in more efficient and accurate land cover analysis. In S$^2$Mamba, two selective structured state space models through different dimensions are designed for feature extraction, one for spatial, and the other for spectral, along with a spatial-spectral mixture gate for optimal fusion. More specifically, S$^2$Mamba first captures spatial contextual relations by interacting each pixel with its adjacent pixels through a Patch Cross Scanning module and then explores semantic information from continuous spectral bands through a Bi-directional Spectral Scanning module. Considering the distinct expertise of the two attributes in homogeneous and complicated texture scenes, we realize the Spatial-spectral Mixture Gate by a group of learnable matrices, allowing for the adaptive incorporation of representations learned across different dimensions. Extensive experiments conducted on HSI classification benchmarks demonstrate the superiority and prospect of S$^2$Mamba. The code will be available at: https://github.com/PURE-melo/S2Mamba.
Submitted 28 April, 2024;
originally announced April 2024.
-
Binary duadic codes and their related codes with a square-root-like lower bound
Authors:
Tingting Wu,
Lanqiang Li,
Xiuyu Zhang,
Shixin Zhu
Abstract:
Binary cyclic codes have been a hot topic for many years, and significant progress has been made in the study of this type of code. As is well known, it is hard to construct infinite families of binary cyclic codes [n, (n+1)/2] with good minimum distance. In this paper, by using the BCH bound on cyclic codes, one of the open problems proposed by Liu et al. about binary cyclic codes (Finite Fields Appl. 91:102270, 2023) is settled. Specifically, we present several families of binary duadic codes with length 2^m-1 and dimension 2^(m-1), and the minimum distances have a square-root-like lower bound. As a by-product, the parameters of their dual codes and extended codes are provided, where the latter are self-dual and doubly-even.
Submitted 27 April, 2024;
originally announced April 2024.
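The parameters quoted in the abstract above are consistent with each other, which a one-line calculation makes explicit. The worked case m = 3 is an illustrative assumption chosen here (the [7, 4, 3] Hamming code, a classical binary duadic code) and is not claimed to be one of the paper's families; the exact form of the paper's "square-root-like" bound is not given in the abstract, so only the classical square-root bound on the minimum odd-like weight $d_o$ of a duadic code is shown for orientation.

```latex
% Length and dimension as stated in the abstract:
n = 2^m - 1
\quad\Longrightarrow\quad
k = \frac{n+1}{2} = \frac{2^m}{2} = 2^{m-1}.

% Illustrative case m = 3: n = 7, k = 4, and the [7,4,3] Hamming code
% satisfies the classical square-root bound d_o^2 \ge n:
3^2 = 9 \ \ge\ 7 .
```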
-
Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models
Authors:
Zhongzhen Huang,
Kui Xue,
Yongqi Fan,
Linjie Mu,
Ruoyu Liu,
Tong Ruan,
Shaoting Zhang,
Xiaofan Zhang
Abstract:
Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment. To mitigate these shortcomings, Retrieval-augmented generation (RAG) has been utilized to provide external knowledge to facilitate the answer generation. However, applying such models to the medical domain faces several challenges due to the lack of domain-specific knowledge and the intricacy of real-world scenarios. In this study, we explore LLMs with the RAG framework for knowledge-intensive tasks in the medical field. To evaluate the capabilities of LLMs, we introduce MedicineQA, a multi-round dialogue benchmark that simulates the real-world medication consultation scenario and requires LLMs to answer with retrieved evidence from the medicine database. MedicineQA contains 300 multi-round question-answering pairs, each embedded within a detailed dialogue history, highlighting the challenge posed by this knowledge-intensive task to current LLMs. We further propose a new Distill-Retrieve-Read framework instead of the previous Retrieve-then-Read. Specifically, the distillation and retrieval process utilizes a tool calling mechanism to formulate search queries that emulate the keyword-based inquiries used by search engines. With experimental results, we show that our framework brings notable performance improvements and surpasses the previous counterparts in evidence retrieval accuracy. This advancement sheds light on applying RAG to the medical domain.
Submitted 27 April, 2024;
originally announced April 2024.
-
A Survey of Deep Learning Library Testing Methods
Authors:
Xiaoyu Zhang,
Weipeng Jiang,
Chao Shen,
Qi Li,
Qian Wang,
Chenhao Lin,
Xiaohong Guan
Abstract:
In recent years, software systems powered by deep learning (DL) techniques have significantly facilitated people's lives in many aspects. As the backbone of these DL systems, various DL libraries undertake the underlying optimization and computation. However, like traditional software, DL libraries are not immune to bugs, which can pose serious threats to users' personal property and safety. Studying the characteristics of DL libraries, their associated bugs, and the corresponding testing methods is crucial for enhancing the security of DL systems and advancing the widespread application of DL technology. This paper provides an overview of the testing research related to various DL libraries, discusses the strengths and weaknesses of existing methods, and provides guidance and reference for the application of the DL library. This paper first introduces the workflow of DL underlying libraries and the characteristics of three kinds of DL libraries involved, namely DL framework, DL compiler, and DL hardware library. It then provides definitions for DL underlying library bugs and testing. Additionally, this paper summarizes the existing testing methods and tools tailored to these DL libraries separately and analyzes their effectiveness and limitations. It also discusses the existing challenges of DL library testing and outlines potential directions for future research.
Submitted 27 April, 2024;
originally announced April 2024.
-
Federated Transfer Component Analysis Towards Effective VNF Profiling
Authors:
Xunzheng Zhang,
Shadi Moazzeni,
Juan Marcelo Parra-Ullauri,
Reza Nejabati,
Dimitra Simeonidou
Abstract:
The increasing concerns of knowledge transfer and data privacy challenge the traditional gather-and-analyse paradigm in networks. Specifically, the intelligent orchestration of Virtual Network Functions (VNFs) requires understanding and profiling the resource consumption. However, profiling all kinds of VNFs is time-consuming. It is important to consider transferring the well-profiled VNF knowledge to other poorly profiled VNF types while keeping data private. To this end, this paper proposes a Federated Transfer Component Analysis (FTCA) method between the source and target VNFs. FTCA first trains Generative Adversarial Networks (GANs) based on the source VNF profiling data, and the trained GAN model is sent to the target VNF domain. Then, FTCA realizes federated domain adaptation by using the generated source VNF data and a smaller amount of target VNF profiling data, while keeping the raw data local. Experiments show that the proposed FTCA can effectively predict the required resources for the target VNF. Specifically, the RMSE index of the regression model decreases by 38.5% and the R-squared metric advances up to 68.6%.
Submitted 1 May, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
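The FTCA abstract above summarises its regression quality with RMSE and R-squared. For readers who want the definitions at hand, the following is a minimal sketch using the standard textbook formulas; the function names and the toy numbers are assumptions made here for illustration and do not come from the paper.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for constant targets")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical resource measurements and predictions (toy values).
    measured = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.0, 2.0, 3.0, 6.0]
    print(rmse(measured, predicted), r_squared(measured, predicted))
```

Lower RMSE and higher R-squared both indicate a better fit, which is the direction of change the abstract reports.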
-
A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation
Authors:
Xin Zhang,
Liangxiu Han,
Tam Sobeih,
Lianghao Han,
Darren Dancey
Abstract:
Depth estimation is crucial for interpreting complex environments, especially in areas such as autonomous vehicle navigation and robotics. Nonetheless, obtaining accurate depth readings from event camera data remains a formidable challenge. Event cameras operate differently from traditional digital cameras, continuously capturing data and generating asynchronous binary spikes that encode time, location, and light intensity. Yet, the unique sampling mechanisms of event cameras render standard image-based algorithms inadequate for processing spike data. This necessitates the development of innovative, spike-aware algorithms tailored for event cameras, a task compounded by the irregularity, continuity, noise, and spatial and temporal characteristics inherent in spiking data. Harnessing the strong generalization capabilities of transformer neural networks for spatiotemporal data, we propose a purely spike-driven spike transformer network for depth estimation from spiking camera data. To address performance limitations with Spiking Neural Networks (SNN), we introduce a novel single-stage cross-modality knowledge transfer framework leveraging knowledge from a large vision foundational model of artificial neural networks (ANN) (DINOv2) to enhance the performance of SNNs with limited data. Our experimental results on both synthetic and real datasets show substantial improvements over existing models, with notable gains in Absolute Relative and Square Relative errors (49% and 39.77% improvements over the benchmark model Spike-T, respectively). Besides accuracy, the proposed model also demonstrates reduced power consumption, a critical factor for practical applications.
Submitted 1 May, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
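The spike transformer abstract above quotes gains in Absolute Relative and Square Relative errors. A minimal sketch of these two metrics follows, assuming the definitions commonly used in the depth estimation literature (per-pixel error normalised by ground-truth depth, averaged over pixels with valid ground truth). The abstract itself does not spell out the formulas, and the function names and sample depths are hypothetical.

```python
from typing import List, Sequence, Tuple


def _valid_pairs(pred: Sequence[float], gt: Sequence[float]) -> List[Tuple[float, float]]:
    """Keep only pixels whose ground-truth depth is positive (assumed validity rule)."""
    if len(pred) != len(gt):
        raise ValueError("prediction and ground truth must have equal length")
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with valid ground-truth depth")
    return pairs


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Absolute Relative error: mean of |pred - gt| / gt."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def sq_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Square Relative error: mean of (pred - gt)^2 / gt."""
    pairs = _valid_pairs(pred, gt)
    return sum((p - g) ** 2 / g for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    # Flattened toy depth maps in metres (hypothetical values).
    predicted = [1.0, 5.0]
    ground_truth = [2.0, 4.0]
    print(abs_rel(predicted, ground_truth), sq_rel(predicted, ground_truth))
```

Both metrics are errors, so the improvements reported in the abstract correspond to lower values.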
-
Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM
Authors:
Xuan Zhang,
Wei Gao
Abstract:
Retrieval-augmented language models have exhibited promising performance across various areas of natural language processing (NLP), including fact-critical tasks. However, due to the black-box nature of advanced large language models (LLMs) and the non-retrieval-oriented supervision signal of specific tasks, the training of the retrieval model faces significant challenges in the black-box LLM setting. We propose an approach leveraging Fine-grained Feedback with Reinforcement Retrieval (FFRR) to enhance fact-checking on news claims using a black-box LLM. FFRR adopts a two-level strategy to gather fine-grained feedback from the LLM, which serves as a reward for optimizing the retrieval policy, by rating the retrieved documents based on the non-retrieval ground truth of the task. We evaluate our model on two public datasets for real-world news claim verification, and the results demonstrate that FFRR achieves significant improvements over strong LLM-enabled and non-LLM baselines.
Submitted 26 April, 2024;
originally announced April 2024.
-
FairGT: A Fairness-aware Graph Transformer
Authors:
Renqiang Luo,
Huafei Huang,
Shuo Yu,
Xiuzhen Zhang,
Feng Xia
Abstract:
The design of Graph Transformers (GTs) generally neglects considerations for fairness, resulting in biased outcomes against certain sensitive subgroups. Since GTs encode graph information without relying on message-passing mechanisms, conventional fairness-aware graph learning methods cannot be directly applied to address these issues. To tackle this challenge, we propose FairGT, a Fairness-aware Graph Transformer explicitly crafted to mitigate fairness concerns inherent in GTs. FairGT incorporates a meticulous structural feature selection strategy and a multi-hop node feature integration method, ensuring independence of sensitive features and bolstering fairness considerations. These fairness-aware graph information encodings seamlessly integrate into the Transformer framework for downstream tasks. We also prove that the proposed fair structural topology encoding with adjacency matrix eigenvector selection and multi-hop integration are theoretically effective. Empirical evaluations conducted across five real-world datasets demonstrate FairGT's superiority in fairness metrics over existing graph transformers, graph neural networks, and state-of-the-art fairness-aware graph learning approaches.
Submitted 26 April, 2024;
originally announced April 2024.
-
An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder
Authors:
Yicheng Gu,
Xueyao Zhang,
Liumeng Xue,
Haizhou Li,
Zhizheng Wu
Abstract:
Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible with signals like singing voices that require dynamic attention for different frequency bands and different time intervals. Motivated by that, we propose a Multi-Scale Sub-Band Constant-Q Transform (CQT) discriminator, MS-SB-CQT, and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (CWT) discriminator, MS-TC-CWT. Both CQT and CWT have a dynamic TF resolution for different frequency bands. Of the two, CQT has a better modeling ability in pitch information, and CWT has a better modeling ability in short-time transients. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed discriminators. Moreover, the STFT, CQT, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
Submitted 26 April, 2024;
originally announced April 2024.
-
Simple Network Mechanism Leads to Quasi-Real Brain Activation Patterns with Drosophila Connectome
Authors:
Xiaoyu Zhang,
Pengcheng Yang,
Jiawei Feng,
Qiang Luo,
Wei Lin,
Xin Lu
Abstract:
Considering the high computational demands of most methods, using network communication models to simulate the brain is a more economical approach. However, despite numerous brain network communication models, there is still insufficient evidence that they can effectively replicate the real activation patterns of the brain. Moreover, it remains unclear whether actual network structures are crucial in simulating intelligence. Addressing these issues, we propose a large-scale network communication model based on simple rules and design criteria to assess the differences between network models and real situations. We conduct research on the largest adult Drosophila connectome data set. Experimental results show significant activation in neurons that should respond to the stimulus and slight activation in irrelevant ones, which we call a quasi-real activation pattern. Besides, when we change the network structure, the quasi-real activation patterns disappear. Interestingly, activation regions have shorter network distances to their input neurons, implying that the network structure (not spatial distance) is central to forming brain functionality. In addition, giving the input neurons a unilateral stimulus, we observe a bilateral response, which is consistent with reality. We then find that both hemispheres have extremely similar statistical indicators. We also develop real-time 3D large spatial network visualization software to observe and document experimental phenomena, filling the software gap. This research reveals the power of network models: they can reach the quasi-real activation pattern even with simple propagation rules. Besides, it provides evidence that network structure matters in brain activity pattern generation. Future research could fully simulate brain behavior through network models, paving the way for artificial intelligence by developing new propagation rules and optimizing link weights.
Submitted 25 April, 2024;
originally announced April 2024.
-
Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations
Authors:
Shen Zhang,
Haojie Zhang,
Jing Zhang,
Xudong Zhang,
Yimeng Zhuang,
Jinting Wu
Abstract:
In human-computer interaction, it is crucial for agents to respond to humans by understanding their emotions. Unraveling the causes of emotions is more challenging. A new task named Multimodal Emotion-Cause Pair Extraction in Conversations is responsible for recognizing emotion and identifying causal expressions. In this study, we propose a multi-stage framework to generate emotion and extract the emotion causal pairs given the target emotion. In the first stage, Llama-2-based InstructERC is utilized to extract the emotion category of each utterance in a conversation. After emotion recognition, a two-stream attention model is employed to extract the emotion causal pairs given the target emotion for subtask 2, while MuTEC is employed to extract the causal span for subtask 1. Our approach achieved first place in both subtasks of the competition.
Submitted 25 April, 2024;
originally announced April 2024.
-
The Third Monocular Depth Estimation Challenge
Authors:
Jaime Spencer,
Fabio Tosi,
Matteo Poggi,
Ripudaman Singh Arora,
Chris Russell,
Simon Hadfield,
Richard Bowden,
GuangYuan Zhou,
ZhengXin Li,
Qiang Rao,
YiPing Bao,
Xiao Liu,
Dohyeong Kim,
Jinseong Kim,
Myunghyun Kim,
Mykola Lavreniuk,
Rui Li,
Qing Mao,
Jiang Wu,
Yu Zhu,
Jinqiu Sun,
Yanning Zhang,
Suraj Patni,
Aradhye Agarwal,
Chetan Arora
, et al. (16 additional authors not shown)
Abstract:
This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e., supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set; 10 of them submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their methods. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.
Submitted 27 April, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection
Authors:
Xuanyu Zhang,
Youmin Xu,
Runyi Li,
Jiwen Yu,
Weiqi Li,
Zhipei Xu,
Jian Zhang
Abstract:
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making local video editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To address this urgent issue, we propose V2A-Mark, which targets the limitations of current video tampering forensics, such as poor generalizability, singular function, and single-modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance the localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, which are crucial for the sustainable development of video editing in the AIGC video era.
Submitted 25 April, 2024;
originally announced April 2024.