Showing 1–50 of 100 results for author: Shou, M Z

  1. arXiv:2404.18930

    cs.CV

    Hallucination of Multimodal Large Language Models: A Survey

    Authors: Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, Mike Zheng Shou

    Abstract: This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in multimodal tasks. Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content, a challenge k…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: 140 references

  2. arXiv:2404.15909

    cs.CV

    Learning Long-form Video Prior via Generative Pre-Training

    Authors: Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou

    Abstract: Concepts involved in long-form videos such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and continue to pose challenges to be comprehensively learned. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling any kind of text content even visual locations. Can this manner work for learning lon…

    Submitted 24 April, 2024; originally announced April 2024.

  3. arXiv:2404.14055

    cs.CV

    RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification

    Authors: Hai Ci, Pei Yang, Yiren Song, Mike Zheng Shou

    Abstract: We revisit Tree-Ring Watermarking, a recent diffusion model watermarking method that demonstrates great robustness to various attacks. We conduct an in-depth study on it and reveal that the distribution shift unintentionally introduced by the watermarking process, apart from watermark pattern matching, contributes to its exceptional robustness. Our investigation further exposes inherent flaws in i…

    Submitted 23 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 25 pages, 8 figures

  4. arXiv:2404.02747

    cs.CV

    Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

    Authors: Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, Jürgen Schmidhuber

    Abstract: This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which, the model relies on cross-attention to plan text-…

    Submitted 3 April, 2024; originally announced April 2024.

  5. arXiv:2403.12728

    cs.CV

    Diffusion-Driven Self-Supervised Learning for Shape Reconstruction and Pose Estimation

    Authors: Jingtao Sun, Yaonan Wang, Mingtao Feng, Chao Ding, Mike Zheng Shou, Ajmal Saeed Mian

    Abstract: Fully-supervised category-level pose estimation aims to determine the 6-DoF poses of unseen instances from known categories, requiring expensive manual labeling costs. Recently, various self-supervised category-level pose estimation methods have been proposed to reduce the requirement of the annotated datasets. However, most methods rely on synthetic data or 3D CAD model for self-supervised train…

    Submitted 19 March, 2024; originally announced March 2024.

  6. arXiv:2403.07420

    cs.CV

    DragAnything: Motion Control for Anything using Entity Representation

    Authors: Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang

    Abstract: We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation. Compared to existing motion control methods, DragAnything offers several advantages. Firstly, trajectory-based control is more user-friendly for interaction, when acquiring other guidance signals (e.g., masks, depth maps) is labor-intensive. Users only need to draw…

    Submitted 15 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: The project website is at: https://weijiawu.github.io/draganything_page/ . The code is at: https://github.com/showlab/DragAnything

  7. arXiv:2402.13724

    cs.HC cs.CV

    Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters

    Authors: Zechen Bai, Peng Chen, Xiaolan Peng, Lu Liu, Hui Chen, Mike Zheng Shou, Feng Tian

    Abstract: Animating virtual characters has always been a fundamental research problem in virtual reality (VR). Facial animations play a crucial role as they effectively convey emotions and attitudes of virtual humans. However, creating such facial animations can be challenging, as current methods often involve utilization of expensive motion capture devices or significant investments of time and effort from…

    Submitted 21 February, 2024; originally announced February 2024.

    Comments: 9 pages. To appear in IEEE-VR

  8. arXiv:2402.01345

    cs.CV cs.AI cs.CL cs.LG

    Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models

    Authors: Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, Mike Zheng Shou

    Abstract: Recent advancements in large vision-language models (LVLMs) have demonstrated impressive capability in visual information understanding with human language. Despite these advances, LVLMs still face challenges with multimodal hallucination, such as generating text descriptions of objects that are not present in the visual information. However, the underlying fundamental reasons of multimodal halluc…

    Submitted 6 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

  9. arXiv:2401.13516

    cs.CV cs.CR

    Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces

    Authors: Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou

    Abstract: Deepfake videos are becoming increasingly realistic, showing few tampering traces on facial areas that vary between frames. Consequently, existing Deepfake detection methods struggle to detect unknown domain Deepfake videos while accurately locating the tampered region. To address this limitation, we propose Delocate, a novel Deepfake detection model that can both recognize and localize unknown domai…

    Submitted 5 May, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2308.09921, arXiv:2305.05943

  10. arXiv:2401.07781

    cs.CV

    Towards A Better Metric for Text-to-Video Generation

    Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

    Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Project page: https://showlab.github.io/T2VScore/

  11. arXiv:2401.01827

    cs.CV

    Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

    Authors: David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo

    Abstract: Most existing video diffusion models (VDMs) are limited to mere text conditions. Thereby, they are usually lacking in control over visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called multimodal video block (MVB),…

    Submitted 3 January, 2024; originally announced January 2024.

    Comments: project page: https://showlab.github.io/Moonshot/

  12. arXiv:2401.00849

    cs.CV

    COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

    Authors: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

    Abstract: In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introd…

    Submitted 1 January, 2024; originally announced January 2024.

    Comments: 16 pages; Website: http://fingerrec.github.io/cosmo

  13. arXiv:2312.14232

    cs.CV cs.AI

    Parrot Captions Teach CLIP to Spot Text

    Authors: Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou

    Abstract: Despite CLIP being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to 'Parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset LAION-2B, the captions also densely parrot (spell) the text embedded in images. O…

    Submitted 1 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: project page: https://linyq17.github.io/CLIP-Parrot-Bias/. Add more analysis and ablation studies. Update Figure 3 with a more precise metric

  14. arXiv:2312.13324

    cs.CV

    ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors

    Authors: Weijia Mao, Yan-Pei Cao, Jia-Wei Liu, Zhongcong Xu, Mike Zheng Shou

    Abstract: We introduce ShowRoom3D, a three-stage approach for generating high-quality 3D room-scale scenes from texts. Previous methods using 2D diffusion priors to optimize neural radiance fields for generating room-scale scenes have shown unsatisfactory quality. This is primarily attributed to the limitations of 2D priors lacking 3D awareness and constraints in the training methodology. In this paper, we…

    Submitted 20 December, 2023; originally announced December 2023.

  15. arXiv:2312.13108

    cs.CV

    ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

    Authors: Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou

    Abstract: Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Model (LLM) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper…

    Submitted 1 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: Project Page: https://showlab.github.io/assistgui/

  16. arXiv:2312.11396

    cs.CV cs.AI

    MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance

    Authors: Qi Mao, Lan Chen, Yuchao Gu, Zhen Fang, Mike Zheng Shou

    Abstract: Recent diffusion-based image editing approaches have exhibited impressive editing capabilities in images with simple compositions. However, localized editing in complex scenarios has not been well-studied in the literature, despite its growing real-world demands. Existing mask-based inpainting methods fall short of retaining the underlying structure within the edit region. Meanwhile, mask-free att…

    Submitted 21 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: for project page, see https://mag-edit.github.io/

  17. arXiv:2312.06731

    cs.CV cs.AI

    Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

    Authors: Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou

    Abstract: Instruction tuning data is essential for training Multimodal Large Language Models (MLLMs). However, the creation of high-quality instruction tuning data presents significant challenges. Asking humans to label the instruction tuning data is labor-intensive and time-consuming. Some works that prompted GPT-4 for data generation were not only costly but also lacked satisfactory performance in co…

    Submitted 24 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Technical report

  18. arXiv:2312.02238

    cs.CV cs.AI cs.MM

    X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

    Authors: Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou

    Abstract: We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the o…

    Submitted 23 April, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: Project page: https://showlab.github.io/X-Adapter/

  19. arXiv:2312.02087

    cs.CV

    VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

    Authors: Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang

    Abstract: Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However, these approaches are often ineffective when the target edit involves a shape change. To embark on video editing with shape change, we explore customized video subject swapping in this work, where we aim to re…

    Submitted 5 December, 2023; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: Project page at https://videoswap.github.io

  20. arXiv:2312.02015

    cs.CV

    ColonNeRF: High-Fidelity Neural Reconstruction of Long Colonoscopy

    Authors: Yufei Shi, Beijia Lu, Jia-Wei Liu, Ming Li, Mike Zheng Shou

    Abstract: Colonoscopy reconstruction is pivotal for diagnosing colorectal cancer. However, accurate long-sequence colonoscopy reconstruction faces three major challenges: (1) dissimilarity among segments of the colon due to its meandering and convoluted shape; (2) co-existence of simple and intricately folded geometry structures; (3) sparse viewpoints due to constrained camera trajectories. To tackle these…

    Submitted 21 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: for Project Page, see https://showlab.github.io/ColonNeRF/

  21. arXiv:2312.01987

    cs.CV

    Bootstrapping SparseFormers from Vision Foundation Models

    Authors: Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

    Abstract: The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this p…

    Submitted 4 April, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  22. arXiv:2312.00583

    cs.CV cs.RO

    MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes

    Authors: Bardienus P. Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Mike Zheng Shou, Shuran Song, Jeffrey Ichnowski

    Abstract: Accurate 3D tracking in highly deformable scenes with occlusions and shadows can facilitate new applications in robotics, augmented reality, and generative AI. However, tracking under these conditions is extremely challenging due to the ambiguity that arises with large deformations, shadows, and occlusions. We introduce MD-Splatting, an approach for simultaneous 3D tracking and novel view synthesi…

    Submitted 30 November, 2023; originally announced December 2023.

  23. arXiv:2311.18765

    cs.CV cs.AI cs.CL cs.LG

    MLLMs-Augmented Visual-Language Representation Learning

    Authors: Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You

    Abstract: Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning by establishing richer image-text associations for image-text datasets. Our approach is simple, utilizing MLL…

    Submitted 13 March, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

  24. arXiv:2311.18259

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  25. arXiv:2311.17450

    cs.CV

    Continual Learning for Image Segmentation with Dynamic Query

    Authors: Weijia Wu, Yuzhong Zhao, Zhuang Li, Lianlei Shan, Hong Zhou, Mike Zheng Shou

    Abstract: Image segmentation based on continual learning exhibits a critical drop of performance, mainly due to catastrophic forgetting and background shift, as they are required to incorporate new classes continually. In this paper, we propose a simple, yet effective Continual Image Segmentation method with incremental Dynamic Query (CISDQ), which decouples the representation learning of both old and new k…

    Submitted 29 November, 2023; originally announced November 2023.

    Comments: Code: https://github.com/weijiawu/CisDQ

    Journal ref: TCSVT 2023

  26. arXiv:2311.16498

    cs.CV cs.GR

    MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

    Authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou

    Abstract: This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout…

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: Project Page at https://showlab.github.io/magicanimate

  27. arXiv:2311.16081

    cs.CV cs.AI

    ViT-Lens: Towards Omni-modal Representations

    Authors: Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

    Abstract: Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT…

    Submitted 26 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: This work is a follow-up of arXiv:2308.10185. Accepted to CVPR2024

  28. arXiv:2311.14284

    cs.CV

    Paragraph-to-Image Generation with Information-Enriched Diffusion Model

    Authors: Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang

    Abstract: Text-to-image (T2I) models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusi…

    Submitted 29 November, 2023; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: The project website is at: https://weijiawu.github.io/ParaDiffusionPage/. Code: https://github.com/weijiawu/ParaDiffusion

  29. arXiv:2311.13574

    cs.CV

    XAGen: 3D Expressive Human Avatars Generation

    Authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Jiashi Feng, Mike Zheng Shou

    Abstract: Recent advances in 3D-aware GAN models have enabled the generation of realistic and controllable human body images. However, existing methods focus on the control of major body joints, neglecting the manipulation of expressive attributes, such as facial expressions, jaw poses, hand poses, and so on. In this work, we present XAGen, the first 3D generative model for human avatars capable of expressi…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: Accepted to NeurIPS 2023, Project Page at https://showlab.github.io/xagen

  30. arXiv:2310.16003

    cs.CV

    CVPR 2023 Text Guided Video Editing Competition

    Authors: Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuanxi Sun, Rui He, Feng Hu, Junhua Hu, Hai Huang, Hanyu Zhu, Xu Cheng, Jie Tang, Mike Zheng Shou, Kurt Keutzer, Forrest Iandola

    Abstract: Humans watch more than a billion hours of video per day. Most of this video was edited manually, which is a tedious process. However, AI-enabled video-generation and video-editing is on the rise. Building on text-to-image models like Stable Diffusion and Imagen, generative AI has improved dramatically on video tasks. But it's hard to evaluate progress in these video tasks because there is no stand…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Project page: https://sites.google.com/view/loveucvpr23/track4

  31. arXiv:2310.10624

    cs.CV

    DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

    Authors: Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, Yuchao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou

    Abstract: Despite recent progress in diffusion-based video editing, existing methods are limited to short-length videos due to the contradiction between long-range consistency and frame-wise editing. Prior attempts to address this challenge by introducing video-2D representations encounter significant difficulties with large-scale motion- and view-change videos, especially in human-centric scenarios. To ove…

    Submitted 7 December, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: Project Page: https://showlab.github.io/DynVideo-E/

  32. arXiv:2310.08465

    cs.CV

    MotionDirector: Motion Customization of Text-to-Video Diffusion Models

    Authors: Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou

    Abstract: Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generations. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. For example, generating a video with a car moving in a prescribed manner under specific camera movements to make…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: Project Page: https://showlab.github.io/MotionDirector/

  33. arXiv:2309.15818

    cs.CV

    Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

    Authors: David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou

    Abstract: Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marri…

    Submitted 17 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: project page is https://showlab.github.io/Show-1

  34. arXiv:2309.12865

    cs.CV

    Bridging Sensor Gaps via Single-Direction Tuning for Hyperspectral Image Classification

    Authors: Xizhe Xue, Haokui Zhang, Ying Li, Liuwei Wan, Zongwen Bai, Mike Zheng Shou

    Abstract: Recently, some researchers started exploring the use of ViTs in tackling HSI classification and achieved remarkable results. However, the training of ViT models requires a considerable number of training samples, while hyperspectral data, due to its high annotation costs, typically has a relatively small number of training samples. This contradiction has not been effectively addressed. In this pap…

    Submitted 22 September, 2023; originally announced September 2023.

  35. arXiv:2309.09858

    cs.CV

    Unsupervised Open-Vocabulary Object Localization in Videos

    Authors: Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He

    Abstract: In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized seman…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted by ICCV 2023

  36. arXiv:2309.09469

    cs.SD cs.NE eess.AS

    Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks

    Authors: Zeyang Song, Jibin Wu, Malu Zhang, Mike Zheng Shou, Haizhou Li

    Abstract: Brain-inspired spiking neural networks (SNNs) have demonstrated great potential for temporal signal processing. However, their performance in speech processing remains limited due to the lack of an effective auditory front-end. To address this limitation, we introduce Spiking-LEAF, a learnable auditory front-end meticulously designed for SNN-based speech processing. Spiking-LEAF combines a learnab…

    Submitted 23 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP2024

  37. arXiv:2309.08513

    cs.CV cs.AI

    SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels

    Authors: Henry Hengyuan Zhao, Pichao Wang, Yuyang Zhao, Hao Luo, Fan Wang, Mike Zheng Shou

    Abstract: Pre-trained vision transformers have strong representation benefits to various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% extra parameters could surpass full fine-tuning in low-data resource scenarios. However, these methods overlook the task-specific information when fine-tuning diverse…

    Submitted 29 April, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: This work has been accepted by IJCV

  38. arXiv:2309.07698

    cs.CV

    Dataset Condensation via Generative Model

    Authors: David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou

    Abstract: Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into pixel format. However, this suffers from slow optimization speed and a large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation meth…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: old work, done in 2022

  39. arXiv:2308.10185

    cs.CV

    ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

    Authors: Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

    Abstract: Despite the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited to large-scale data, which is expensive or even inapplicable for rare modalities. In this paper, we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a p…

    Submitted 26 March, 2024; v1 submitted 20 August, 2023; originally announced August 2023.

    Comments: 19 pages, 4 figures and 9 tables

  40. arXiv:2308.09921  [pdf, other

    cs.CV cs.AI

    Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces

    Authors: Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou

    Abstract: The exploitation of Deepfake techniques for malicious intentions has driven significant research interest in Deepfake detection. Deepfake manipulations frequently introduce random tampered traces, leading to unpredictable outcomes in different facial regions. However, existing detection methods heavily rely on specific forgery indicators, and as the forgery mode improves, these traces become incre…

    Submitted 19 August, 2023; originally announced August 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2305.05943

  41. arXiv:2308.06739  [pdf, other

    cs.CV

    Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

    Authors: David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou

    Abstract: Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, there has been…

    Submitted 13 August, 2023; originally announced August 2023.

  42. arXiv:2308.06548  [pdf, other

    cs.CV

    Revisiting Vision Transformer from the View of Path Ensemble

    Authors: Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou

    Abstract: Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs showing that they can be seen as ensemble networks containing multiple parallel paths with different lengths. Specifically, we equivalently transform the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths in ea…

    Submitted 12 August, 2023; originally announced August 2023.

    Comments: Accepted by ICCV 2023, oral presentation

  43. arXiv:2308.06160  [pdf, other

    cs.CV

    DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

    Authors: Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen

    Abstract: Current deep networks are very data-hungry and benefit from training on large-scale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthet…

    Submitted 9 October, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

    Journal ref: Proc. Advances In Neural Information Processing Systems (NeurIPS 2023)

  44. arXiv:2307.16715  [pdf, other

    cs.CV

    UniVTG: Towards Unified Video-Language Temporal Grounding

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

    Abstract: Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detect…

    Submitted 18 August, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023. 16 pages, 10 figures, 13 tables. Code: https://github.com/showlab/UniVTG

  45. arXiv:2307.10816  [pdf, other

    cs.CV

    BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

    Authors: Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou

    Abstract: Recent text-to-image diffusion models have demonstrated an astonishing capacity to generate high-quality images. However, researchers mainly studied the way of synthesizing images with only text prompts. While some works have explored using other modalities as conditions, considerable paired data, e.g., box/mask-image pairs, and fine-tuning time are required for nurturing models. As such paired da…

    Submitted 21 August, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023. Code is available at: https://github.com/showlab/BoxDiff

  46. arXiv:2307.05463  [pdf, other

    cs.CV

    EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

    Authors: Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang

    Abstract: Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of e…

    Submitted 18 August, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

    Comments: Published in ICCV 2023

  47. arXiv:2306.15255  [pdf, other

    cs.CV cs.CL

    GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

    Authors: Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou

    Abstract: In this report, we present our champion solution for Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2023. Essentially, to accurately ground in a video, an effective egocentric feature extractor and a powerful grounding model are required. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: 5 pages, 2 figures, 4 tables, the champion solution for Ego4D Natural Language Queries Challenge in CVPR 2023

  48. arXiv:2306.12642  [pdf, other

    cs.CV

    TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter

    Authors: Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou

    Abstract: Visual foundation models like CLIP excel in learning feature representations from extensive datasets through self-supervised methods, demonstrating remarkable transfer learning and generalization capabilities. A growing number of applications based on visual foundation models are emerging, including innovative solutions such as BLIP-2. These applications employ pre-trained CLIP models as upstream…

    Submitted 21 June, 2023; originally announced June 2023.

  49. arXiv:2306.08640  [pdf, other

    cs.CV

    AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

    Authors: Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou

    Abstract: Recent research on Large Language Models (LLMs) has led to remarkable advancements in general NLP AI assistants. Some studies have further explored the use of LLMs for planning and invoking models or APIs to address more general multi-modal user queries. Despite this progress, complex visual-based tasks still remain challenging due to the diverse nature of visual tasks. This diversity is reflected…

    Submitted 28 June, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: Project page: https://showlab.github.io/assistgpt/

  50. arXiv:2305.20087  [pdf, other

    cs.CV

    Too Large; Data Reduction for Vision-Language Pre-Training

    Authors: Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou

    Abstract: This paper examines the problems of severe image-text misalignment and high redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which aims to compress the existing large VLP data into a small, high-quality set. Our approach consists of two major s…

    Submitted 18 August, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: ICCV 2023. Code: https://github.com/showlab/datacentric.vlp