Skip to main content

Showing 1–50 of 531 results for author: Qiao, Y

Searching in archive cs. Search in all archives.
.
  1. arXiv:2405.05945  [pdf, other

    cs.CV

    Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

    Authors: Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

    Abstract: Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified f… ▽ More

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: Technical Report; Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  2. arXiv:2405.00622  [pdf, other

    cs.CL cs.AI cs.LG

    Causal Evaluation of Language Models

    Authors: Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu

    Abstract: Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive ben… ▽ More

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 315 pages, 230 figures, 21 tables. Project website: https://opencausalab.github.io/CaLM

  3. arXiv:2404.19500  [pdf, other

    cs.CV cs.AI cs.MM eess.IV

    Towards Real-world Video Face Restoration: A New Benchmark

    Authors: Ziyan Chen, Jingwen He, Xinqi Lin, Yu Qiao, Chao Dong

    Abstract: Blind face restoration (BFR) on images has significantly progressed over the last several years, while real-world video face restoration (VFR), which is more challenging for more complex face motions such as moving gaze directions and facial orientations involved, remains unsolved. Typical BFR methods are evaluated on privately synthesized datasets or self-collected real-world low-quality face ima… ▽ More

    Submitted 4 May, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

    Comments: Project page: https://ziyannchen.github.io/projects/VFRxBenchmark/

  4. arXiv:2404.16821  [pdf, other

    cs.CV

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai , et al. (10 additional authors not shown)

    Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual… ▽ More

    Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Technical report

  5. arXiv:2404.16006  [pdf, other

    cs.CV

    MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

    Authors: Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

    Abstract: Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to… ▽ More

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 77 pages, 41 figures

  6. arXiv:2404.10928  [pdf, other

    cs.DC

    GPU-Based Parallel Computing Methods for Medical Photoacoustic Image Reconstruction

    Authors: Xinyao Yi, Yuxin Qiao

    Abstract: Recent years have witnessed a rapid advancement in GPU technology, establishing it as a formidable high-performance parallel computing technology with superior floating-point computational capabilities compared to traditional CPUs. This paper explores the application of this technology in the field of photoacoustic imaging, an emerging non-destructive testing technique in biomedical engineering ch… ▽ More

    Submitted 16 April, 2024; originally announced April 2024.

  7. arXiv:2404.09259  [pdf, other

    cs.CV cs.AI

    FedCCL: Federated Dual-Clustered Feature Contrast Under Domain Heterogeneity

    Authors: Yu Qiao, Huy Q. Le, Mengchun Zhang, Apurba Adhikary, Chaoning Zhang, Choong Seon Hong

    Abstract: Federated learning (FL) facilitates a privacy-preserving neural network training paradigm through collaboration between edge clients and a central server. One significant challenge is that the distributed data is not independently and identically distributed (non-IID), typically including both intra-domain and inter-domain heterogeneity. However, recent research is limited to simply using averaged… ▽ More

    Submitted 14 April, 2024; originally announced April 2024.

  8. arXiv:2404.08896  [pdf, other

    cs.IR

    Approximate Cluster-Based Sparse Document Retrieval with Segmented Maximum Term Weights

    Authors: Yifan Qiao, Shanxiu He, Yingrui Yang, Parker Carlson, Tao Yang

    Abstract: This paper revisits cluster-based retrieval that partitions the inverted index into multiple groups and skips the index partially at cluster and document levels during online inference using a learned sparse representation. It proposes an approximate search scheme with two parameters to control the rank-safeness competitiveness of pruning with segmented maximum term weights within each cluster. Cl… ▽ More

    Submitted 13 April, 2024; originally announced April 2024.

  9. arXiv:2404.06776  [pdf, other

    cs.LG cs.AI cs.CV

    Logit Calibration and Feature Contrast for Robust Federated Learning on Non-IID Data

    Authors: Yu Qiao, Chaoning Zhang, Apurba Adhikary, Choong Seon Hong

    Abstract: Federated learning (FL) is a privacy-preserving distributed framework for collaborative model training on devices in edge networks. However, challenges arise due to vulnerability to adversarial examples (AEs) and the non-independent and identically distributed (non-IID) nature of data distribution among devices, hindering the deployment of adversarially robust and accurate learning models at the e… ▽ More

    Submitted 10 April, 2024; originally announced April 2024.

  10. arXiv:2404.06512  [pdf, other

    cs.CV cs.CL

    InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow reso… ▽ More

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Code and models are publicly available at https://github.com/InternLM/InternLM-XComposer

  11. arXiv:2404.02882  [pdf, other

    cs.LG cs.CL

    Linear Attention Sequence Parallelism

    Authors: Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

    Abstract: Sequence Parallel (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single GPU. However, existing SP methods do not take advantage of linear attention features, resulting in sub-optimal parallelism efficiency and usability for linear attention-based language models. In this paper, we introduce Linear Attention Sequence Parallel (LASP), an efficient SP m… ▽ More

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: Technical Report. Weigao Sun and Zhen Qin contribute equally to this paper. Yiran Zhong is the corresponding author. The code is available at https://github.com/OpenNLPLab/LASP

  12. arXiv:2404.02755  [pdf, other

    cs.CV cs.AI cs.MM

    DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

    Authors: Hao Wu, Huabin Liu, Yu Qiao, Xiao Sun

    Abstract: We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC), that elaborates on improving the quality of the generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs), we generate rich DVC-oriented caption candidates and optimize the corresponding… ▽ More

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024

  13. arXiv:2404.01342  [pdf, other

    cs.CL cs.AI

    DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

    Authors: Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

    Abstract: Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example, the Civitai community, a platform for T2I innovation, currently hosts an impressive array of 74,492 distinct models. However, this diversity presents a formidable challenge in selecting the most appropriate model and parameters, a process tha… ▽ More

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: Published as a conference paper at CVPR 2024

  14. arXiv:2404.00973  [pdf, other

    cs.CV

    VideoDistill: Language-aware Vision Distillation for Video Question Answering

    Authors: Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

    Abstract: Significant advancements in video question answering (VideoQA) have been made thanks to thriving large image-language pretraining frameworks. Although these image-language models can efficiently represent both video and language branches, they typically employ a goal-free vision perception process and do not interact vision with language well during the answer generation, thus omitting crucial vis… ▽ More

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: This paper is accepted by CVPR2024

  15. arXiv:2404.00913  [pdf, other

    cs.CV cs.AI cs.CL

    LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

    Authors: Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

    Abstract: Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, which introduce extra modules or additional input sequences to inject new skills or knowledge, may compromise the innate abilities of LLMs. In this paper, we propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile informat… ▽ More

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: This paper is accepted by CVPR 2024

  16. arXiv:2404.00271  [pdf, other

    cs.LG cs.AI

    TG-NAS: Leveraging Zero-Cost Proxies with Transformer and Graph Convolution Networks for Efficient Neural Architecture Search

    Authors: Ye Qiao, Haocheng Xu, Sitao Huang

    Abstract: Neural architecture search (NAS) is an effective method for discovering new convolutional neural network (CNN) architectures. However, existing approaches often require time-consuming training or intensive sampling and evaluations. Zero-shot NAS aims to create training-free proxies for architecture performance prediction. However, existing proxies have suboptimal performance, and are often outperf… ▽ More

    Submitted 30 March, 2024; originally announced April 2024.

  17. arXiv:2403.20330  [pdf, other

    cs.CV

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

    Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomeno… ▽ More

    Submitted 9 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: Project page: https://mmstar-benchmark.github.io/

  18. arXiv:2403.20194  [pdf, other

    cs.MM

    ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

    Authors: Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang

    Abstract: This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on… ▽ More

    Submitted 25 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

  19. arXiv:2403.19622  [pdf, other

    cs.RO cs.CV

    RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

    Authors: Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Hao Shu Fang, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng

    Abstract: The ultimate goals of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments. Recent progress in utilizing language models as high-level planners has demonstrated that the complexity of tasks can be reduced through decomposing them into primitive-level plans, mak… ▽ More

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: 24 pages, 12 figures, 6 tables

  20. arXiv:2403.19160  [pdf, other

    cs.CV

    Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

    Authors: Yutong Chen, Yifan Zhan, Zhihang Zhong, Wei Wang, Xiao Sun, Yu Qiao, Yinqiang Zheng

    Abstract: Neural rendering techniques have significantly advanced 3D human body modeling. However, previous approaches often overlook dynamics induced by factors such as motion inertia, leading to challenges in scenarios like abrupt stops after rotation, where the pose remains static while the appearance changes. This limitation arises from reliance on a single pose as conditional input, resulting in ambigu… ▽ More

    Submitted 28 March, 2024; originally announced March 2024.

  21. arXiv:2403.17830  [pdf, other

    cs.CV

    Assessment of Multimodal Large Language Models in Alignment with Human Values

    Authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

    Abstract: Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, in terms of Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, their alignment with human values remains largely unexplored, given the complexity of defining h… ▽ More

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: arXiv admin note: text overlap with arXiv:2311.02692

  22. arXiv:2403.17442  [pdf, other

    cs.IR

    Touch the Core: Exploring Task Dependence Among Hybrid Targets for Recommendation

    Authors: Xing Tang, Yang Qiao, Fuyuan Lyu, Dugang Liu, Xiuqiang He

    Abstract: As user behaviors become complicated on business platforms, online recommendations focus more on how to touch the core conversions, which are highly related to the interests of platforms. These core conversions are usually continuous targets, such as \textit{watch time}, \textit{revenue}, and so on, whose predictions can be enhanced by previous discrete conversion actions. Therefore, multi-task le… ▽ More

    Submitted 26 March, 2024; originally announced March 2024.

  23. arXiv:2403.17297  [pdf, other

    cs.CL cs.AI

    InternLM2 Technical Report

    Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang , et al. (75 additional authors not shown)

    Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m… ▽ More

    Submitted 25 March, 2024; originally announced March 2024.

  24. arXiv:2403.16209  [pdf

    cs.CV cs.AI

    Image Captioning in news report scenario

    Authors: Tianrui Liu, Qi Cai, Changxin Xu, Bo Hong, Jize Xiong, Yuxin Qiao, Tsungwei Yang

    Abstract: Image captioning strives to generate pertinent captions for specified images, situating itself at the crossroads of Computer Vision (CV) and Natural Language Processing (NLP). This endeavor is of paramount importance with far-reaching applications in recommendation systems, news outlets, social media, and beyond. Particularly within the realm of news reporting, captions are expected to encompass d… ▽ More

    Submitted 1 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

    Comments: 10 pages, 4 figures

  25. arXiv:2403.16206  [pdf

    cs.AI

    Rumor Detection with a novel graph neural network approach

    Authors: Tianrui Liu, Qi Cai, Changxin Xu, Bo Hong, Fanghao Ni, Yuxin Qiao, Tsungwei Yang

    Abstract: The wide spread of rumors on social media has caused a negative impact on people's daily life, leading to potential panic, fear, and mental health problems for the public. How to debunk rumors as early as possible remains a challenging problem. Existing studies mainly leverage information propagation structure to detect rumors, while very few works focus on correlation among users that they may co… ▽ More

    Submitted 1 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

    Comments: 10 pages, 5 figures

  26. arXiv:2403.16182  [pdf, other

    cs.CV

    EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

    Authors: Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, Yu Qiao

    Abstract: Being able to map the activities of others into one's own point of view is one fundamental human skill even from a very early age. Taking a step toward understanding this human ability, we introduce EgoExoLearn, a large-scale dataset that emulates the human demonstration following process, in which individuals record egocentric videos as they execute tasks guided by demonstration videos. Focusing… ▽ More

    Submitted 24 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  27. arXiv:2403.16159  [pdf, other

    cs.HC

    Designing Child-Centric AI Learning Environments: Insights from LLM-Enhanced Creative Project-Based Learning

    Authors: Siyu Zha, Yuehan Qiao, Qingyu Hu, Zhongsheng Li, Jiangtao Gong, Yingqing Xu

    Abstract: Project-based learning (PBL) is an instructional method that is very helpful in nurturing students' creativity, but it requires significant time and energy from both students and teachers. Large language models (LLMs) have been proven to assist in creative tasks, yet much controversy exists regarding their role in fostering creativity. This paper explores the potential of LLMs in PBL settings, wit… ▽ More

    Submitted 5 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

  28. arXiv:2403.15377  [pdf, other

    cs.CV

    InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

    Authors: Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

    Abstract: We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies the different self- or weakly-supervised learning frameworks of masked video token reconstruction, cross-modal contrastive learning, and next token predict… ▽ More

    Submitted 22 March, 2024; originally announced March 2024.

    Comments: a technical report about video understanding

  29. arXiv:2403.13600  [pdf, other

    cs.CV

    VL-Mamba: Exploring State Space Models for Multimodal Learning

    Authors: Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu

    Abstract: Multimodal large language models (MLLMs) have attracted widespread interest and have rich applications. However, the inherent attention mechanism in its Transformer structure requires quadratic complexity and results in expensive computational overhead. Therefore, in this work, we propose VL-Mamba, a multimodal large language model based on state space models, which have been shown to have great p… ▽ More

    Submitted 20 March, 2024; originally announced March 2024.

  30. arXiv:2403.12803  [pdf, other

    cs.CV

    DreamDA: Generative Data Augmentation with Diffusion Models

    Authors: Yunxiang Fu, Chaoqi Chen, Yu Qiao, Yizhou Yu

    Abstract: The acquisition of large-scale, high-quality data is a resource-intensive and time-consuming endeavor. Compared to conventional Data Augmentation (DA) techniques (e.g. cropping and rotation), exploiting prevailing diffusion models for data generation has received scant attention in classification tasks. Existing generative DA methods either inadequately bridge the domain gap between real-world and… ▽ More

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: 14 pages, 8 tables, 3 figures

  31. A Multi-loudspeaker Binaural Room Impulse Response Dataset with High-Resolution Translational and Rotational Head Coordinates in a Listening Room

    Authors: Yue Qiao, Ryan Miguel Gonzales, Edgar Choueiri

    Abstract: Data report for the 3D3A Lab Binaural Room Impulse Response (BRIR) Dataset (https://doi.org/10.34770/6gc9-5787).

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Submitted to Frontiers in Signal Processing

  32. arXiv:2403.12037  [pdf, other

    cs.CV

    MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

    Authors: Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao

    Abstract: It is a long-lasting goal to design a generalist-embodied agent that can follow diverse instructions in human-like ways. However, existing approaches often fail to steadily follow instructions due to difficulties in understanding abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator… ▽ More

    Submitted 19 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: Project page: https://sites.google.com/view/minedreamer/main

  33. arXiv:2403.09630  [pdf, other

    cs.CV

    Generalized Predictive Model for Autonomous Driving

    Authors: Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li

    Abstract: In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model, we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning ar… ▽ More

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  34. arXiv:2403.09346  [pdf, other

    cs.CV cs.AI

    AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions

    Authors: Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang

    Abstract: Large Vision-Language Models (LVLMs) have shown significant progress in well responding to visual-instructions from users. However, these instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks. Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited. To bridge this gap, we introduce AV… ▽ More

    Submitted 14 March, 2024; originally announced March 2024.

  35. arXiv:2403.09093  [pdf, other

    cs.CV

    Desigen: A Pipeline for Controllable Design Template Generation

    Authors: Haohan Weng, Danqing Huang, Yu Qiao, Zheng Hu, Chin-Yew Lin, Tong Zhang, C. L. Philip Chen

    Abstract: Templates serve as a good starting point to implement a design (e.g., banner, slide) but it takes great effort from designers to manually create. In this paper, we present Desigen, an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background. Different from natural images, a background image should preserve enough non-salient s… ▽ More

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for publication in CVPR2024

  36. arXiv:2403.07865  [pdf, other

    cs.CL cs.AI cs.CR cs.LG cs.SE

    Exploring Safety Generalization Challenges of Large Language Models via Code

    Authors: Qibing Ren, Chang Gao, Jing Shao, Junchi Yan, Xin Tan, Yu Qiao, Wai Lam, Lizhuang Ma

    Abstract: The rapid advancement of Large Language Models (LLMs) has brought about remarkable generative capabilities but also raised concerns about their potential misuse. While strategies like supervised fine-tuning and reinforcement learning from human feedback have enhanced their safety, these methods primarily focus on natural languages, which may not generalize to other domains. This paper introduces C… ▽ More

    Submitted 7 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  37. arXiv:2403.06977  [pdf, other

    cs.CV

    VideoMamba: State Space Model for Efficient Video Understanding

    Authors: Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao

    Abstract: Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain. The proposed VideoMamba overcomes the limitations of existing 3D convolution neural networks and video transformers. Its linear-complexity operator enables efficient long-term modeling, which is crucial for high-resolution long video unders… ▽ More

    Submitted 12 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: 19 Pages, 7 Figures, 8 Tables

  38. arXiv:2403.04593  [pdf, other

    cs.CV

    Embodied Understanding of Driving Scenarios

    Authors: Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li

    Abstract: Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios. Such understanding is typically founded upon Vision-Language Models (VLMs). Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial awareness and long-horizon extrapolation proficiencies. We revisit the key aspects of autonomous driving… ▽ More

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: 43 pages, 16 figures

  39. arXiv:2403.02803  [pdf, other

    cs.CV

    Towards Robust Federated Learning via Logits Calibration on Non-IID Data

    Authors: Yu Qiao, Apurba Adhikary, Chaoning Zhang, Choong Seon Hong

    Abstract: Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks. However, recent studies have shown that FL is vulnerable to adversarial examples (AEs), leading to a significant drop in its performance. Meanwhile, the non-independent and identically distributed (non-IID) challenge of data distribution be… ▽ More

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: Accepted by IEEE NOMS 2024

  40. arXiv:2403.02308  [pdf, other

    cs.CV

    Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

    Authors: Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

    Abstract: Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the NLP field with necessary modifications for vision tasks. Similar to the Vision Transformer (ViT), o… ▽ More

    Submitted 7 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  41. arXiv:2403.02118  [pdf, other

    cs.CY cs.AI cs.CV

    Towards Implicit Prompt For Text-To-Image Models

    Authors: Yue Yang, Yuqi lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

    Abstract: Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (hint at a target without explicitly mentioning it). These prompts may get rid of safety constraints and pose potential threats to the applications of these models. This position pap… ▽ More

    Submitted 8 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  42. arXiv:2403.01742  [pdf, other

    cs.LG cs.AI

    Diffusion-TS: Interpretable Diffusion for General Time Series Generation

    Authors: Xinyu Yuan, Yan Qiao

    Abstract: Denoising diffusion probabilistic models (DDPMs) are becoming the leading paradigm for generative models. It has recently shown breakthroughs in audio synthesis, time series imputation and forecasting. In this paper, we propose Diffusion-TS, a novel diffusion-based framework that generates multivariate time series samples of high quality by using an encoder-decoder transformer with disentangled te… ▽ More

    Submitted 14 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  43. arXiv:2403.01543  [pdf, other

    cs.CV

    Efficient Action Counting with Dynamic Queries

    Authors: Zishi Li, Xiaoxuan Ma, Qiuyan Shang, Wentao Zhu, Hai Ci, Yu Qiao, Yizhou Wang

    Abstract: Temporal repetition counting aims to quantify the repeated action cycles within a video. The majority of existing methods rely on the similarity correlation matrix to characterize the repetitiveness of actions, but their scalability is hindered due to the quadratic computational complexity. In this work, we introduce a novel approach that employs an action query representation to localize repeated… ▽ More

    Submitted 5 March, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

    Comments: code: https://github.com/lizishi/DeTRC, proj page: https://shirleymaxx.github.io/DeTRC/

  44. arXiv:2402.19474  [pdf, other

    cs.CV

    The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

    Authors: Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai

    Abstract: We present the All-Seeing Project V2: a new model and dataset designed for understanding object relations in images. Specifically, we propose the All-Seeing Model V2 (ASMv2) that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation (ReC) task. Leveraging this unified task, our model excels not only in perceiving and recognizing… ▽ More

    Submitted 17 April, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: Technical Report

  45. arXiv:2402.19465  [pdf, other

    cs.CL cs.AI

    Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

    Authors: Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao

    Abstract: Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robu… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

  46. arXiv:2402.19282  [pdf, other

    cs.CL

    WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

    Authors: Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, Jia Yu, ChaoBin Zhang, Zhenxiang Li, Pei Chu, Yuan Qu, Jin Shi, Lindong Lu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Zhikai Lei, Jiawei Hong, Keyu Chen, Zhaoye Fei, Ruiliang Xu, Wei Li, Zhongying Tu, Lin Dahua, Yu Qiao, Hang Yan , et al. (1 additional authors not shown)

    Abstract: This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy… ▽ More

    Submitted 17 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

  47. arXiv:2402.18951  [pdf, other

    cs.CV

    Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition

    Authors: Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang

    Abstract: Open-world video recognition is challenging since traditional networks are not generalized well on complex environment variations. Alternatively, foundation models with rich knowledge have recently shown their generalization power. However, how to apply such knowledge has not been fully explored for open-world video recognition. To this end, we propose a generic knowledge transfer pipeline, which… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 35 pages, 6 figures, 8 tables

  48. arXiv:2402.17511  [pdf, other

    cs.RO cs.AI

    Rethinking Mutual Information for Language Conditioned Skill Discovery on Imitation Learning

    Authors: Zhaoxun Ju, Chao Yang, Hongbo Wang, Yu Qiao, Fuchun Sun

    Abstract: Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and actions. The ability to compose long-horizon tasks based on unconstrained language instructions necessitates the acquisition of a diverse set of general-purpose skills. However, acquiring inherent primitive skills in a coupled and long-horizon environm… ▽ More

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 16 pages

    ACM Class: I.2.6

  49. arXiv:2402.16880  [pdf, other

    cs.LG cs.AI cs.CL

    BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

    Authors: Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

    Abstract: Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, and etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer… ▽ More

    Submitted 19 April, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

  50. Infrared and visible Image Fusion with Language-driven Loss in CLIP Embedding Space

    Authors: Yuhao Wang, Lingjuan Miao, Zhiqiang Zhou, Lei Zhang, Yajun Qiao

    Abstract: Infrared-visible image fusion (IVIF) has attracted much attention owing to the highly-complementary properties of the two image modalities. Due to the lack of ground-truth fused images, the fusion output of current deep-learning based methods heavily depends on the loss functions defined mathematically. As it is hard to well mathematically define the fused image without ground truth, the performan… ▽ More

    Submitted 25 February, 2024; originally announced February 2024.