Showing 1–50 of 300 results for author: Ge, Y

  1. arXiv:2405.04007  [pdf, other]

    cs.CV

    SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

    Authors: Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan

    Abstract: In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario da… ▽ More

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Technical Report; Dataset released in https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit

  2. arXiv:2405.03119  [pdf, ps, other]

    cs.IT eess.SP

    DAFT-Spread Affine Frequency Division Multiple Access for Downlink Transmission

    Authors: Yiwei Tao, Miaowen Wen, Yao Ge, Tianqi Mao, Lixia Xiao, Jun Li

    Abstract: Affine frequency division multiplexing (AFDM) and orthogonal AFDM access (O-AFDMA) are promising techniques based on chirp signals, which are able to suppress the performance deterioration caused by Doppler shifts in high-mobility scenarios. However, the high peak-to-average power ratio (PAPR) in AFDM or O-AFDMA is still a crucial problem, which severely limits their practical applications. In thi… ▽ More

    Submitted 5 May, 2024; originally announced May 2024.

  3. arXiv:2405.02604  [pdf, ps, other]

    cs.IT eess.SP

    Interleave Frequency Division Multiplexing

    Authors: Yuhao Chi, Lei Liu, Yao Ge, Xuehui Chen, Ying Li, Zhaoyang Zhang

    Abstract: In this letter, we study interleave frequency division multiplexing (IFDM) for multicarrier modulation in static multipath and mobile time-varying channels, which outperforms orthogonal frequency division multiplexing (OFDM), orthogonal time frequency space (OTFS), and affine frequency division multiplexing (AFDM) by considering practical advanced detectors. The fundamental principle underlying ex… ▽ More

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: Accepted by IEEE Wireless Communications Letters

  4. arXiv:2405.01312  [pdf, other]

    cs.DB cs.CR

    Privacy-Enhanced Database Synthesis for Benchmark Publishing

    Authors: Yongrui Zhong, Yunqing Ge, Jianbin Qin, Shuyuan Zheng, Bo Tang, Yu-Xuan Qiu, Rui Mao, Ye Yuan, Makoto Onizuka, Chuan Xiao

    Abstract: Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward creating databases that incorporate real-world user data to more accurately mirror business environments. However, privacy concerns deter users from directly sharing their data, underscoring the importance of creating syn… ▽ More

    Submitted 2 May, 2024; originally announced May 2024.

  5. arXiv:2404.19752  [pdf, other]

    cs.CV

    Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

    Authors: Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

    Abstract: Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning model… ▽ More

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  6. arXiv:2404.16957  [pdf, other]

    cs.AI cs.CY

    Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability

    Authors: Yunfei Ge, Quanyan Zhu

    Abstract: The pervasive integration of Artificial Intelligence (AI) has introduced complex challenges in the responsibility and accountability in the event of incidents involving AI-enabled systems. The interconnectivity of these systems, ethical concerns of AI-induced incidents, coupled with uncertainties in AI technology and the absence of corresponding regulations, have made traditional responsibility at… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

  7. arXiv:2404.16790  [pdf, other]

    cs.CV

    SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

    Authors: Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

    Abstract: Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their pro… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

  8. arXiv:2404.14396  [pdf, other]

    cs.CV

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Authors: Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

    Abstract: The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. I… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: Project released at: https://github.com/AILab-CVC/SEED-X

  9. arXiv:2404.13884  [pdf]

    eess.IV cs.CV

    MambaUIE&SR: Unraveling the Ocean's Secrets with Only 2.8 FLOPs

    Authors: Zhihao Chen, Yiyuan Ge

    Abstract: Underwater Image Enhancement (UIE) techniques aim to address the problem of underwater image degradation due to light absorption and scattering. In recent years, both Convolution Neural Network (CNN)-based and Transformer-based methods have been widely explored. In addition, combining CNN and Transformer can effectively combine global and local information for enhancement. However, this approach i… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.

  10. arXiv:2404.13600  [pdf, other]

    cs.RO

    Are We Ready for Planetary Exploration Robots? The TAIL-Plus Dataset for SLAM in Granular Environments

    Authors: Zirui Wang, Chen Yao, Yangtao Ge, Guowei Shi, Ningbo Yang, Zheng Zhu, Kewei Dong, Hexiang Wei, Zhenzhong Jia, Jing Wu

    Abstract: So far, planetary surface exploration depends on various mobile robot platforms. The autonomous navigation and decision-making of these mobile robots in complex terrains largely rely on their terrain-aware perception, localization and mapping capabilities. In this paper we release the TAIL-Plus dataset, a new challenging dataset in deformable granular environments for planetary exploration robots,… ▽ More

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: Accepted to the IEEE ICRA Workshop on Field Robotics 2024

  11. arXiv:2404.07855  [pdf, other]

    cs.CV

    Resolve Domain Conflicts for Generalizable Remote Physiological Measurement

    Authors: Weiyu Sun, Xinyu Zhang, Hao Lu, Ying Chen, Yun Ge, Xiaolin Huang, Jie Yuan, Yingcong Chen

    Abstract: Remote photoplethysmography (rPPG) technology has become increasingly popular due to its non-invasive monitoring of various physiological indicators, making it widely applicable in multimedia interaction, healthcare, and emotion analysis. Existing rPPG methods utilize multiple datasets for training to enhance the generalizability of models. However, they often overlook the underlying conflict issu… ▽ More

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Accepted by ACM MM 2023

  12. arXiv:2404.06835  [pdf, other]

    cs.CV

    Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer

    Authors: Yanqi Ge, Jiaqi Liu, Qingnan Fan, Xi Jiang, Ye Huang, Shuai Qin, Hong Gu, Wen Li, Lixin Duan

    Abstract: In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. The past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this w… ▽ More

    Submitted 10 April, 2024; originally announced April 2024.

  13. arXiv:2404.03443  [pdf, ps, other]

    cs.CV

    Part-Attention Based Model Make Occluded Person Re-Identification Stronger

    Authors: Zhihao Chen, Yiyuan Ge

    Abstract: The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limits model performance. In our research, we introduce a new framework called PAB-ReID, which is a novel ReID model incorporating part-attention mechanisms to tackle… ▽ More

    Submitted 1 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted By International Joint Conference on Neural Networks 2024

  14. arXiv:2404.00308  [pdf, other]

    cs.CV

    ST-LLM: Large Language Models Are Effective Temporal Learners

    Authors: Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

    Abstract: Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains to be solved. In this paper, we investigate a straightforward yet unexplored question: Can we fe… ▽ More

    Submitted 30 March, 2024; originally announced April 2024.

  15. arXiv:2403.19021  [pdf, other]

    cs.IR cs.AI cs.CL cs.LG

    Towards LLM-RecSys Alignment with Textual ID Learning

    Authors: Juntao Tan, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Zelong Li, Yongfeng Zhang

    Abstract: Generative recommendation based on Large Language Models (LLMs) have transformed the traditional ranking-based recommendation style into a text-to-text generation paradigm. However, in contrast to standard NLP tasks that inherently operate on human vocabulary, current research in generative recommendations struggles to effectively encode recommendation items within the text-to-text framework using… ▽ More

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted in SIGIR 2024

  16. arXiv:2403.17664  [pdf, other]

    cs.CV

    DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation

    Authors: Qilin Wang, Jiangning Zhang, Chengming Xu, Weijian Cao, Ying Tai, Yue Han, Yanhao Ge, Hong Gu, Chengjie Wang, Yanwei Fu

    Abstract: Facial Appearance Editing (FAE) aims to modify physical attributes, such as pose, expression and lighting, of human facial images while preserving attributes like identity and background, showing great importance in photograph. In spite of the great progress in this area, current researches generally meet three challenges: low generation fidelity, poor attribute preservation, and inefficient infer… ▽ More

    Submitted 26 March, 2024; originally announced March 2024.

  17. arXiv:2403.16971  [pdf, other]

    cs.OS cs.AI cs.CL

    AIOS: LLM Agent Operating System

    Authors: Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, Yongfeng Zhang

    Abstract: The integration and deployment of large language model (LLM)-based intelligent agents have been fraught with challenges that compromise their efficiency and efficacy. Among these issues are sub-optimal scheduling and resource allocation of agent requests over the LLM, the difficulties in maintaining context during interactions between agent and LLM, and the complexities inherent in integrating het… ▽ More

    Submitted 25 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: 14 pages, 5 figures, 5 tables; comments and suggestions are appreciated

  18. arXiv:2403.16875  [pdf, other]

    cs.RO

    TAIL: A Terrain-Aware Multi-Modal SLAM Dataset for Robot Locomotion in Deformable Granular Environments

    Authors: Chen Yao, Yangtao Ge, Guowei Shi, Zirui Wang, Ningbo Yang, Zheng Zhu, Hexiang Wei, Yuntian Zhao, Jing Wu, Zhenzhong Jia

    Abstract: Terrain-aware perception holds the potential to improve the robustness and accuracy of autonomous robot navigation in the wilds, thereby facilitating effective off-road traversals. However, the lack of multi-modal perception across various motion patterns hinders the solutions of Simultaneous Localization And Mapping (SLAM), especially when confronting non-geometric hazards in demanding landscapes… ▽ More

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE Robotics and Automation Letters

  19. arXiv:2403.13408  [pdf, other]

    cs.CV cs.AI

    S2DM: Sector-Shaped Diffusion Models for Video Generation

    Authors: Haoran Lang, Yuxuan Ge, Zheng Tian

    Abstract: Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining the consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align frames of videos with desired temporal features while preserving consistent semantic and stochastic features.… ▽ More

    Submitted 22 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: 17 pages, 6 figures

  20. arXiv:2403.12450  [pdf, other]

    cs.CV

    Intention Action Anticipation Model with Guide-Feedback Loop Mechanism

    Authors: Zongnan Ma, Fuchun Zhang, Zhixiong Nan, Yao Ge

    Abstract: Anticipating human intention from videos has broad applications, such as automatic driving, robot assistive technology, and virtual reality. This study addresses the problem of intention action anticipation using egocentric video sequences to estimate actions that indicate human intention. We propose a Hierarchical Complete-Recent (HCR) information fusion model that makes full use of the features… ▽ More

    Submitted 19 March, 2024; originally announced March 2024.

  21. arXiv:2403.12373  [pdf, other]

    cs.CL

    RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners

    Authors: Chi Hu, Yuan Ge, Xiangnan Ma, Hang Cao, Qiang Li, Yonghua Yang, Tong Xiao, Jingbo Zhu

    Abstract: Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Existing solutions, such as deploying task-specific verifiers or voting over multiple reasoning paths, either require extensive human annotations or fail in scenarios with inconsistent res… ▽ More

    Submitted 22 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: LREC-Coling 2024 Long Paper

  22. arXiv:2403.11111  [pdf, other]

    cs.CV

    3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models

    Authors: Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen

    Abstract: In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images a… ▽ More

    Submitted 11 April, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

    Comments: project page: https://yongtaoge.github.io/projects/humanwild

  23. arXiv:2403.08557  [pdf]

    cs.CV

    Occluded Cloth-Changing Person Re-Identification

    Authors: Zhihao Chen, Yiyuan Ge

    Abstract: Cloth-changing person re-identification aims to retrieve and identify specific pedestrians by using cloth-unrelated features in person cloth-changing scenarios. However, pedestrian images captured by surveillance probes usually contain occlusions in real-world scenarios. The performance of existing cloth-changing person re-identification methods is significantly degraded due to the reduction of… ▽ More

    Submitted 14 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  24. arXiv:2403.06090  [pdf, other]

    cs.CV

    Diffusion Models Trained with Large Data Are Transferable Visual Models

    Authors: Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen

    Abstract: We show that, simply initializing image understanding models using a pre-trained UNet (or transformer) of diffusion models, it is possible to achieve remarkable transferable performance on fundamental vision perception tasks using a moderate amount of target data (even synthetic data only), including monocular depth, surface normal, image segmentation, matting, human pose estimation, among virtual… ▽ More

    Submitted 15 March, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

  25. arXiv:2403.00691  [pdf, other]

    cs.CV cs.AI

    Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

    Authors: Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian

    Abstract: Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data especially in online acquirement has led to a surge in human motion research works. Prior works have mainly concentrated on dual-modality learning, such as text and motion tasks, but three-modality learning has been rarely explored. Intuitively, an extra introduced modal… ▽ More

    Submitted 1 March, 2024; originally announced March 2024.

  26. arXiv:2402.18191  [pdf, other]

    cs.CL

    Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

    Authors: Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Hao Yang, Tong Xiao

    Abstract: With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required by training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being… ▽ More

    Submitted 28 February, 2024; originally announced February 2024.

  27. arXiv:2402.10071  [pdf, other]

    eess.SP cs.IT

    Approximate Message Passing-Enhanced Graph Neural Network for OTFS Data Detection

    Authors: Wenhao Zhuang, Yuyi Mao, Hengtao He, Lei Xie, Shenghui Song, Yao Ge, Zhi Ding

    Abstract: Orthogonal time frequency space (OTFS) modulation has emerged as a promising solution to support high-mobility wireless communications, for which, cost-effective data detectors are critical. Although graph neural network (GNN)-based data detectors can achieve decent detection accuracy at reasonable computational cost, they fail to best harness prior information of transmitted data. To further mini… ▽ More

    Submitted 14 April, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

    Comments: 8 pages, 7 figures, and 3 tables. Part of this article was submitted to IEEE for possible publication

  28. arXiv:2402.07140  [pdf, other]

    cs.AI

    Graph Descriptive Order Improves Reasoning with Large Language Model

    Authors: Yuyao Ge, Shenghua Liu, Wenjie Feng, Lingrui Mei, Lizhe Chen, Xueqi Cheng

    Abstract: In recent years, large language models have achieved state-of-the-art performance across multiple domains. However, the progress in the field of graph reasoning with LLM remains limited. Our work delves into this gap by thoroughly investigating graph reasoning with LLMs. In this work, we reveal the impact of the order of graph description on LLMs' graph reasoning performance, which significantly a… ▽ More

    Submitted 24 February, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

  29. arXiv:2402.00455  [pdf, ps, other]

    cs.IT eess.SP

    New Lower Bounds on Aperiodic Ambiguity Function of Unimodular Sequences

    Authors: Lingsheng Meng, Yong Liang Guan, Yao Ge, Zilong Liu, Pingzhi Fan

    Abstract: This paper presents new aperiodic ambiguity function (AF) lower bounds of unimodular sequences under certain low ambiguity zone. Our key idea, motivated by the Levenshtein correlation bound, is to introduce two weight vectors associated to the delay and Doppler shifts, respectively, and then exploit the upper and lower bounds on the Frobenius norm of the weighted auto- and cross-AF matrices to der… ▽ More

    Submitted 1 February, 2024; originally announced February 2024.

    Comments: 5 pages, 1 figure

  30. arXiv:2402.00284  [pdf, other]

    cs.IR cs.AI cs.LG

    PAP-REC: Personalized Automatic Prompt for Recommendation Language Model

    Authors: Zelong Li, Jianchao Ji, Yingqiang Ge, Wenyue Hua, Yongfeng Zhang

    Abstract: Recently emerged prompt-based Recommendation Language Models (RLM) can solve multiple recommendation tasks uniformly. The RLMs make full use of the inherited knowledge learned from the abundant pre-training data to solve the downstream recommendation tasks by prompts, without introducing additional parameters or network training. However, handcrafted prompts require significant expertise and human… ▽ More

    Submitted 31 January, 2024; originally announced February 2024.

  31. arXiv:2401.17270  [pdf, other]

    cs.CV

    YOLO-World: Real-Time Open-Vocabulary Object Detection

    Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

    Abstract: The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling an… ▽ More

    Submitted 22 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Work still in progress. Code & models are available at: https://github.com/AILab-CVC/YOLO-World

  32. arXiv:2401.14686  [pdf, other]

    cs.CV

    SSR: SAM is a Strong Regularizer for domain adaptive semantic segmentation

    Authors: Yanqi Ge, Ye Huang, Wen Li, Lixin Duan

    Abstract: We introduced SSR, which utilizes SAM (segment-anything) as a strong regularizer during training, to greatly enhance the robustness of the image encoder for handling various domains. Specifically, given the fact that SAM is pre-trained with a large number of images over the internet, which cover a diverse variety of domains, the feature encoding extracted by the SAM is obviously less dependent on… ▽ More

    Submitted 26 January, 2024; originally announced January 2024.

  33. arXiv:2401.14405  [pdf, other]

    cs.CV cs.AI cs.LG

    Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

    Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

    Abstract: We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalit… ▽ More

    Submitted 18 March, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: CVPR 2024. Code and models are available at https://github.com/AILab-CVC/M2PT

  34. arXiv:2401.13672  [pdf, other]

    cs.DB cs.AI cs.IR

    Transforming Agriculture with Intelligent Data Management and Insights

    Authors: Yu Pan, Jianxin Sun, Hongfeng Yu, Geng Bai, Yufeng Ge, Joe Luck, Tala Awada

    Abstract: Modern agriculture faces grand challenges to meet increased demands for food, fuel, feed, and fiber with population growth under the constraints of climate change and dwindling natural resources. Data innovation is urgently required to secure and improve the productivity, sustainability, and resilience of our agroecosystems. As various sensors and Internet of Things (IoT) instrumentation become mo… ▽ More

    Submitted 7 November, 2023; originally announced January 2024.

  35. arXiv:2401.10222  [pdf, other]

    cs.CV cs.AI

    Supervised Fine-tuning in turn Improves Visual Foundation Models

    Authors: Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

    Abstract: Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuni… ▽ More

    Submitted 11 April, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: 23 pages, 3 figures, Project page: https://github.com/TencentARC/ViSFT/tree/main

  36. arXiv:2401.09740  [pdf, other]

    cs.CR

    Hijacking Attacks against Neural Networks by Analyzing Training Data

    Authors: Yunjie Ge, Qian Wang, Huayang Huang, Qi Li, Cong Wang, Chao Shen, Lingchen Zhao, Peipei Jiang, Zheng Fang, Shenyi Zhang

    Abstract: Backdoors and adversarial examples are the two primary threats currently faced by deep neural networks (DNNs). Both attacks attempt to hijack the model behaviors with unintended outputs by introducing (small) perturbations to the inputs. Backdoor attacks, despite the high success rates, often require a strong assumption, which is not always easy to achieve in reality. Adversarial example attacks,… ▽ More

    Submitted 19 January, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: Full version with major polishing, compared to the Usenix Security 2024 edition

  37. arXiv:2401.07781  [pdf, other]

    cs.CV

    Towards A Better Metric for Text-to-Video Generation

    Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

    Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However… ▽ More

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Project page: https://showlab.github.io/T2VScore/

  38. arXiv:2401.04150  [pdf, other]

    cs.CV

    Two-stream joint matching method based on contrastive learning for few-shot action recognition

    Authors: Long Deng, Ziqiang Li, Bingxin Zhou, Zhongming Chen, Ao Li, Yongxin Ge

    Abstract: Although few-shot action recognition based on metric learning paradigm has achieved significant success, it fails to address the following issues: (1) inadequate action relation modeling and underutilization of multi-modal information; (2) challenges in handling video matching problems with different lengths and speeds, and video matching problems with misalignment of video sub-actions. To address… ▽ More

    Submitted 8 January, 2024; originally announced January 2024.

  39. arXiv:2401.03158  [pdf, other]

    cs.CL cs.AI

    Quartet Logic: A Four-Step Reasoning (QLFR) framework for advancing Short Text Classification

    Authors: Hui Wu, Yuanben Zhang, Zhonghe Han, Yingyan Hou, Lei Wang, Siye Liu, Qihang Gong, Yunping Ge

    Abstract: Short Text Classification (STC) is crucial for processing and comprehending the brief but substantial content prevalent on contemporary digital platforms. The STC encounters difficulties in grasping semantic and syntactic intricacies, an issue that is apparent in traditional pre-trained language models. Although Graph Convolutional Networks enhance performance by integrating external knowledge bas… ▽ More

    Submitted 6 January, 2024; originally announced January 2024.

  40. arXiv:2401.02674  [pdf, other]

    cs.IT eess.SP

    Message Feedback Interference Cancellation Aided UAMP Iterative Detector for OTFS Systems

    Authors: Xiangxiang Li, Haiyan Wang, Yao Ge, Xiaohong Shen, Jiarui Zhao

    Abstract: The design of efficient signal detectors is important yet challenging for orthogonal time frequency space (OTFS) systems in high-mobility scenarios. In this letter, we develop an efficient message feedback interference cancellation aided unitary approximate message passing (denoted as UAMPMFIC) iterative detector, where the latest feedback messages from variable nodes are utilized for more re… ▽ More

    Submitted 5 January, 2024; originally announced January 2024.

  41. arXiv:2401.02415  [pdf, other]

    cs.CL

    LLaMA Pro: Progressive LLaMA with Block Expansion

    Authors: Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan

    Abstract: Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forge… ▽ More

    Submitted 4 January, 2024; originally announced January 2024.

  42. DGDNN: Decoupled Graph Diffusion Neural Network for Stock Movement Prediction

    Authors: Zinuo You, Zijian Shi, Hongbo Bo, John Cartlidge, Li Zhang, Yan Ge

    Abstract: Forecasting future stock trends remains challenging for academia and industry due to stochastic inter-stock dynamics and hierarchical intra-stock dynamics influencing stock prices. In recent years, graph neural networks have achieved remarkable performance in this problem by formulating multiple stocks as graph-structured data. However, most of these approaches rely on artificially defined factors… ▽ More

    Submitted 3 January, 2024; originally announced January 2024.

    Comments: 12 pages, 5 figures, author manuscript accepted for ICAART 2024 (International Conference on Agents and Artificial Intelligence)

    Journal ref: 16th International Conference on Agents and Artificial Intelligence (ICAART), Volume 2, Feb. 2024, pp. 431-442

  43. arXiv:2401.00551  [pdf, other]

    cs.CV

    A Generalist FaceX via Learning Unified Facial Representation

    Authors: Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, Ying Tai

    Abstract: This work presents the FaceX framework, a novel facial generalist model capable of handling diverse facial tasks simultaneously. To achieve this goal, we initially formulate a unified facial representation for a broad spectrum of facial editing tasks, which macroscopically decomposes a face into fundamental identity, intra-personal variation, and environmental factors. Based on this, we introduce Faci…

    Submitted 31 December, 2023; originally announced January 2024.

    Comments: Project page: https://diffusion-facex.github.io/

  44. arXiv:2312.14388  [pdf, other]

    cs.CR math.CO

    A Generalized Shuffle Framework for Privacy Amplification: Strengthening Privacy Guarantees and Enhancing Utility

    Authors: E Chen, Yang Cao, Yifei Ge

    Abstract: The shuffle model of local differential privacy is an advanced method of privacy amplification designed to enhance privacy protection with high utility. It achieves this by randomly shuffling sensitive data, making linking individual data points to specific individuals more challenging. However, most existing studies have focused on the shuffle model based on $(ε_0,0)$-Locally Differentially Priva…

    Submitted 1 March, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: Correct some typos

  45. arXiv:2312.14216  [pdf, other]

    cs.CV

    DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models

    Authors: Brian Nlong Zhao, Yuhang Xiao, Jiashu Xu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Laurent Itti, Vibhav Vineet, Yunhao Ge

    Abstract: The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating…

    Submitted 21 December, 2023; originally announced December 2023.

  46. arXiv:2312.12742  [pdf, other]

    cs.CV

    Cached Transformers: Improving Transformers with Differentiable Memory Cache

    Authors: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo

    Abstract: This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: AAAI 2024

  47. arXiv:2312.11872  [pdf, other]

    cs.CV

    Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning

    Authors: Yanqi Ge, Qiang Nie, Ye Huang, Yong Liu, Chengjie Wang, Feng Zheng, Wen Li, Lixin Duan

    Abstract: One of the ultimate goals of representation learning is to achieve compactness within a class and well-separability between classes. Many outstanding metric-based and prototype-based methods following the Expectation-Maximization paradigm have been proposed for this objective. However, they inevitably introduce biases into the learning process, particularly with long-tail distributed training dat…

    Submitted 4 February, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: AAAI 2024

  48. arXiv:2312.09251  [pdf, other]

    cs.CV

    VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

    Authors: Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

    Abstract: In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective, thereby enabling the model to process image and text as seamlessly as…

    Submitted 14 December, 2023; originally announced December 2023.

  49. arXiv:2312.06739  [pdf, other]

    cs.CV

    SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

    Authors: Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan

    Abstract: Current instruction-based editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance their understanding an…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: Project page: https://yuzhou914.github.io/SmartEdit/

  50. arXiv:2312.06722  [pdf, other]

    cs.CV cs.CL cs.RO

    EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models

    Authors: Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu

    Abstract: Multimodal Large Language Models, combining the remarkable reasoning and generalization capabilities of Large Language Models (LLMs) with the ability to comprehend visual inputs, have opened up new avenues for embodied task planning. Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner is expected…

    Submitted 17 April, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

    Comments: Project released at: https://github.com/ChenYi99/EgoPlan