
Showing 1–50 of 243 results for author: Shan, Y

Searching in archive cs.
  1. arXiv:2405.04007  [pdf, other]

    cs.CV

    SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

    Authors: Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan

    Abstract: In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario da…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Technical Report; Dataset released in https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit

  2. arXiv:2405.00351  [pdf, other]

    cs.HC cs.AI cs.CV cs.MM

    Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

    Authors: Zidong Cao, Zhan Wang, Yexin Liu, Yan-Pei Cao, Ying Shan, Wei Zeng, Lin Wang

    Abstract: Viewing omnidirectional images (ODIs) in virtual reality (VR) represents a novel form of media that provides immersive experiences for users to navigate and interact with digital content. Nonetheless, this sense of immersion can be greatly compromised by a blur effect that masks details and hampers the user's ability to engage with objects of interest. In this paper, we present a novel system, cal…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 11 pages

  3. arXiv:2404.18392  [pdf, other]

    cs.DC

    Dflow, a Python framework for constructing cloud-native AI-for-Science workflows

    Authors: Xinzijian Liu, Yanbo Han, Zhuoyuan Li, Jiahao Fan, Chengqian Zhang, Jinzhe Zeng, Yifan Shan, Yannan Yuan, Wei-Hong Xu, Yun-Pei Liu, Yuzhi Zhang, Tongqi Wen, Darrin M. York, Zhicheng Zhong, Hang Zheng, Jun Cheng, Linfeng Zhang, Han Wang

    Abstract: In the AI-for-science era, scientific computing scenarios such as concurrent learning and high-throughput computing demand a new generation of infrastructure that supports scalable computing resources and automated workflow management on both cloud and high-performance supercomputers. Here we introduce Dflow, an open-source Python toolkit designed for scientists to construct workflows with simple…

    Submitted 28 April, 2024; originally announced April 2024.

  4. arXiv:2404.16790  [pdf, other]

    cs.CV

    SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

    Authors: Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

    Abstract: Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their pro…

    Submitted 25 April, 2024; originally announced April 2024.

  5. arXiv:2404.14396  [pdf, other]

    cs.CV

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Authors: Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

    Abstract: The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. I…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: Project released at: https://github.com/AILab-CVC/SEED-X

  6. arXiv:2404.07191  [pdf, other]

    cs.CV

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Authors: Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, Ying Shan

    Abstract: We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To…

    Submitted 14 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

    Comments: Technical report. Project: https://github.com/TencentARC/InstantMesh

  7. arXiv:2404.00308  [pdf, other]

    cs.CV

    ST-LLM: Large Language Models Are Effective Temporal Learners

    Authors: Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

    Abstract: Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains to be solved. In this paper, we investigate a straightforward yet unexplored question: Can we fe…

    Submitted 30 March, 2024; originally announced April 2024.

  8. arXiv:2403.19098  [pdf, other]

    cs.CV

    GraphAD: Interaction Scene Graph for End-to-end Autonomous Driving

    Authors: Yunpeng Zhang, Deheng Qian, Ding Li, Yifeng Pan, Yong Chen, Zhenbao Liang, Zhiyao Zhang, Shurui Zhang, Hongxu Li, Maolei Fu, Yun Ye, Zhujin Liang, Yi Shan, Dalong Du

    Abstract: Modeling complicated interactions among the ego-vehicle, road agents, and map elements has been a crucial part for safety-critical autonomous driving. Previous works on end-to-end autonomous driving rely on the attention mechanism for handling heterogeneous interactions, which fails to capture the geometric priors and is also computationally intensive. In this paper, we propose the Interaction Sce…

    Submitted 6 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: project page: https://github.com/zhangyp15/GraphAD

  9. arXiv:2403.11589  [pdf, other]

    cs.CV

    UV Gaussians: Joint Learning of Mesh Deformation and Gaussian Textures for Human Avatar Modeling

    Authors: Yujiao Jiang, Qingmin Liao, Xiaoyu Li, Li Ma, Qi Zhang, Chaopeng Zhang, Zongqing Lu, Ying Shan

    Abstract: Reconstructing photo-realistic drivable human avatars from multi-view image sequences has been a popular and challenging topic in the field of computer vision and graphics. While existing NeRF-based methods can achieve high-quality novel view rendering of human models, both training and inference processes are time-consuming. Recent approaches have utilized 3D Gaussians to represent the human body…

    Submitted 18 March, 2024; originally announced March 2024.

  10. arXiv:2403.10050  [pdf, other]

    cs.CV

    Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing

    Authors: Tian-Xing Xu, Wenbo Hu, Yu-Kun Lai, Ying Shan, Song-Hai Zhang

    Abstract: 3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering. However, it couples the appearance and geometry of the scene within the Gaussian attributes, which hinders the flexibility of editing operations, such as texture swapping. To address this issue, we propose a novel approach, namel…

    Submitted 15 March, 2024; originally announced March 2024.

  11. arXiv:2403.10044  [pdf, other]

    cs.CV

    SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

    Authors: Tao Wu, Xuewei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

    Abstract: Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework of SphereDiffusion to address these unique challenges, for better generating high-…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted by AAAI2024

  12. arXiv:2403.09326  [pdf, other]

    cs.GR cs.AI

    HeadEvolver: Text to Head Avatars via Locally Learnable Mesh Deformation

    Authors: Duotun Wang, Hengyu Meng, Zeyu Cai, Zhijing Shao, Qianxi Liu, Lin Wang, Mingming Fan, Ying Shan, Xiaohang Zhan, Zeyu Wang

    Abstract: We present HeadEvolver, a novel framework to generate stylized head avatars from text guidance. HeadEvolver uses locally learnable mesh deformation from a template head mesh, producing high-quality digital assets for detail-preserving editing and animation. To tackle the challenges of lacking fine-grained and semantic-aware local shape control in global deformation through Jacobians, we introduce…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: 12 pages, 15 figures

    ACM Class: I.2.6; I.3.8

  13. arXiv:2403.08309  [pdf, other]

    cs.LG cs.AI

    HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback

    Authors: Ang Li, Qiugen Xiao, Peng Cao, Jian Tang, Yi Yuan, Zijie Zhao, Xiaoyuan Chen, Liang Zhang, Xiangyang Li, Kaitong Yang, Weidong Guo, Yukang Gan, Xu Yu, Daniell Wang, Ying Shan

    Abstract: Reinforcement Learning from AI Feedback (RLAIF) has the advantages of shorter annotation cycles and lower costs over Reinforcement Learning from Human Feedback (RLHF), making it highly efficient during the rapid strategy iteration periods of large language model (LLM) training. Using ChatGPT as a labeler to provide feedback on open-domain prompts in RLAIF training, we observe an increase in human…

    Submitted 14 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: 18 pages, 7 figures

  14. arXiv:2403.06976  [pdf, other]

    cs.CV

    BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

    Authors: Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu

    Abstract: Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs). Despite these advancements, current DM adaptations for inpainting, which involve modifications to the sampling strategy or the development of inpainting-specific DMs, frequently suffer from semantic inconsistencies and reduced image quality. Addressing these cha…

    Submitted 11 March, 2024; originally announced March 2024.

  15. arXiv:2403.05895  [pdf, other]

    cs.CV

    DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos

    Authors: Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi

    Abstract: Although considerable advancements have been attained in self-supervised depth estimation from monocular videos, most existing methods often treat all objects in a video as static entities, which however violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. In this paper, we propose a self-supervised method to jointly learn 3D motion and dep…

    Submitted 9 March, 2024; originally announced March 2024.

    Comments: 24 pages, 14 figures, Tech Report

  16. arXiv:2402.18146  [pdf, ps, other]

    cs.CV

    3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling

    Authors: Chaokang Jiang, Guangming Wang, Jiuming Liu, Hesheng Wang, Zhuang Ma, Zhenqiang Liu, Zhujin Liang, Yi Shan, Dalong Du

    Abstract: Learning 3D scene flow from LiDAR point clouds presents significant difficulties, including poor generalization from synthetic datasets to real scenes, scarcity of real-world 3D labels, and poor performance on real sparse LiDAR point clouds. We present a novel approach from the perspective of auto-labelling, aiming to generate a large number of 3D scene flow pseudo labels for real-world LiDAR poin…

    Submitted 29 February, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: Accepted by CVPR2024! 10 pages, 6 figures

  17. arXiv:2402.10491  [pdf, other]

    cs.CV

    Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

    Authors: Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen

    Abstract: Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: Project Page: https://guolanqing.github.io/Self-Cascade/

  18. arXiv:2402.02772  [pdf, other]

    cs.LG

    Contrastive Diffuser: Planning Towards High Return States via Contrastive Learning

    Authors: Yixiang Shan, Zhengbang Zhu, Ting Long, Qifan Liang, Yi Chang, Weinan Zhang, Liang Yin

    Abstract: Applying diffusion models in reinforcement learning for long-term planning has gained much attention recently. Several diffusion-based methods have successfully leveraged the modeling capabilities of diffusion for arbitrary distributions. These methods generate subsequent trajectories for planning and have demonstrated significant improvement. However, these methods are limited by their plain base…

    Submitted 6 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: 13 pages with appendix and references, 10 figures, 3 tables

  19. arXiv:2402.02583  [pdf, other]

    cs.CV cs.LG

    DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

    Authors: Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang

    Abstract: Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years. Although owning diverse and high-quality generation capabilities, translating these abilities to fine-grained image editing remains challenging. In this paper, we propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing: (1) in complex scenarios, editing resu…

    Submitted 4 February, 2024; originally announced February 2024.

  20. arXiv:2402.02439  [pdf, other]

    cs.LG cs.AI

    DiffStitch: Boosting Offline Reinforcement Learning with Diffusion-based Trajectory Stitching

    Authors: Guanghe Li, Yixiang Shan, Zhengbang Zhu, Ting Long, Weinan Zhang

    Abstract: In offline reinforcement learning (RL), the performance of the learned policy highly depends on the quality of offline datasets. However, in many cases, the offline dataset contains very limited optimal trajectories, which poses a challenge for offline RL algorithms as agents must acquire the ability to transit to high-reward regions. To address this issue, we introduce Diffusion-based Trajectory…

    Submitted 21 February, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

  21. arXiv:2401.17807  [pdf, other]

    cs.CV cs.GR

    Advances in 3D Generation: A Survey

    Authors: Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, Ying Shan

    Abstract: Generating 3D models lies at the core of computer graphics and has been the focus of decades of research. With the emergence of advanced neural representations and generative models, the field of 3D content generation is developing rapidly, enabling the creation of increasingly high-quality and diverse 3D models. The rapid growth of this field makes it difficult to stay abreast of all recent devel…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: 33 pages, 12 figures

  22. arXiv:2401.17270  [pdf, other]

    cs.CV

    YOLO-World: Real-Time Open-Vocabulary Object Detection

    Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

    Abstract: The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling an…

    Submitted 22 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Work still in progress. Code & models are available at: https://github.com/AILab-CVC/YOLO-World

  23. RecDCL: Dual Contrastive Learning for Recommendation

    Authors: Dan Zhang, Yangliao Geng, Wenwen Gong, Zhongang Qi, Zhiyu Chen, Xing Tang, Ying Shan, Yuxiao Dong, Jie Tang

    Abstract: Self-supervised learning (SSL) has recently achieved great success in mining the user-item interactions for collaborative filtering. As a major paradigm, contrastive learning (CL) based SSL helps address data sparsity in Web platforms by contrasting the embeddings between raw and augmented data. However, existing CL-based methods mostly focus on contrasting in a batch-wise way, failing to exploit…

    Submitted 18 February, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

    Comments: Accepted to WWW 2024

    Journal ref: Proceedings of TheWebConf 2024 (WWW '24), May 13–17, 2024, Singapore

  24. arXiv:2401.14828  [pdf, other]

    cs.CV

    TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts

    Authors: Jingyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan

    Abstract: Text-driven 3D scene editing has gained significant attention owing to its convenience and user-friendliness. However, existing methods still lack accurate control of the specified appearance and location of the editing result due to the inherent limitations of the text description. To this end, we propose a 3D scene editing framework, TIPEditor, that accepts both text and image prompts and a 3D b…

    Submitted 25 April, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

    Comments: Accepted by Siggraph 2024 & ACM Transactions on Graphics

  25. arXiv:2401.14405  [pdf, other]

    cs.CV cs.AI cs.LG

    Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

    Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

    Abstract: We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalit…

    Submitted 18 March, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: CVPR 2024. Code and models are available at https://github.com/AILab-CVC/M2PT

  26. arXiv:2401.11240  [pdf, other]

    cs.DC

    CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference

    Authors: Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang

    Abstract: Pre-trained large language models (LLMs) often need specialization for domain-specific tasks. Low-Rank Adaptation (LoRA) is a popular approach that adapts a base model to multiple tasks by adding lightweight trainable adapters. In this paper, we present CaraServe, a system that efficiently serves many LoRA adapters derived from a common base model. CaraServe maintains the base model on GPUs and dy…

    Submitted 20 January, 2024; originally announced January 2024.

  27. arXiv:2401.11181  [pdf, other]

    cs.DC

    Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

    Authors: Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

    Abstract: Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference request…

    Submitted 20 January, 2024; originally announced January 2024.

  28. arXiv:2401.10222  [pdf, other]

    cs.CV cs.AI

    Supervised Fine-tuning in turn Improves Visual Foundation Models

    Authors: Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

    Abstract: Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuni…

    Submitted 11 April, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: 23 pages, 3 figures, Project page: https://github.com/TencentARC/ViSFT/tree/main

  29. arXiv:2401.09047  [pdf, other]

    cs.CV

    VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

    Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan

    Abstract: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using…

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: Homepage: https://ailab-cvc.github.io/videocrafter; Github: https://github.com/AILab-CVC/VideoCrafter

  30. arXiv:2401.07781  [pdf, other]

    cs.CV

    Towards A Better Metric for Text-to-Video Generation

    Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

    Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Project page: https://showlab.github.io/T2VScore/

  31. arXiv:2401.02415  [pdf, other]

    cs.CL

    LLaMA Pro: Progressive LLaMA with Block Expansion

    Authors: Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan

    Abstract: Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forge…

    Submitted 4 January, 2024; originally announced January 2024.

  32. arXiv:2312.09251  [pdf, other]

    cs.CV

    VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

    Authors: Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

    Abstract: In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective, thereby enabling the model to process image and text as seamlessly as…

    Submitted 14 December, 2023; originally announced December 2023.

  33. arXiv:2312.06739  [pdf, other]

    cs.CV

    SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

    Authors: Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan

    Abstract: Current instruction-based editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance their understanding an…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: Project page: https://yuzhou914.github.io/SmartEdit/

  34. arXiv:2312.06722  [pdf, other]

    cs.CV cs.CL cs.RO

    EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models

    Authors: Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu

    Abstract: Multimodal Large Language Models, combining the remarkable reasoning and generalization capabilities of Large Language Models (LLMs) with the ability to comprehend visual inputs, have opened up new avenues for embodied task planning. Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner is expected…

    Submitted 17 April, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

    Comments: Project released at: https://github.com/ChenYi99/EgoPlan

  35. arXiv:2312.05730  [pdf, other]

    cs.MM

    AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild

    Authors: Yongkang Yin, Xu Li, Ying Shan, Yuexian Zou

    Abstract: Speaker diarization in real-world videos presents significant challenges due to varying acoustic conditions, diverse scenes, the presence of off-screen speakers, etc. This paper builds upon a previous study (AVR-Net) and introduces a novel multi-modal speaker diarization system, AFL-Net. The proposed AFL-Net incorporates dynamic lip movement as an additional modality to enhance the identity distin…

    Submitted 13 March, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

  36. arXiv:2312.04919  [pdf, other]

    cs.SD eess.AS

    Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

    Authors: Binzhu Sha, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

    Abstract: Any-to-any singing voice conversion (SVC) is confronted with the challenge of "timbre leakage" issue caused by inadequate disentanglement between the content and the speaker timbre. To address this issue, this study introduces NeuCoSVC, a novel neural concatenative SVC framework. It consists of a self-supervised learning (SSL) representation extractor, a neural harmonic signal generator, and a w…

    Submitted 8 January, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

  37. arXiv:2312.04461  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

    Authors: Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, Ying Shan

    Abstract: Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However, existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency, promising identity (ID) fidelity, and flexible text controllability. In this work, we introduce PhotoMaker, an efficient personalized t…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Tech report; Project page: https://photo-maker.github.io/

  38. arXiv:2312.03793  [pdf, other]

    cs.CV

    AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

    Authors: Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang

    Abstract: Large-scale text-to-video (T2V) diffusion models have great progress in recent years in terms of visual quality, motion and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance, motion) are learned and generated jointly without precise control ability other than rough text descriptions. Inspired by image animation which decouples the vi…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: Project Page: https://vvictoryuki.github.io/animatezero.github.io/

  39. arXiv:2312.03641  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

    Authors: Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan

    Abstract: Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement. Accurate control of both camera and object motion is essential for video generation. However, existing works either mainly focus on one type of motion or do not clearly distinguish between the two, limiting their control capabilities and diversity. Therefore, this p…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: Project Page: https://wzhouxiff.github.io/projects/MotionCtrl/

  40. arXiv:2312.03047  [pdf, other]

    cs.CV

    MagicStick: Controllable Video Editing via Control Handle Transformations

    Authors: Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen

    Abstract: Text-based video editing has recently attracted considerable interest in changing the style or replacing the objects with a similar structure. Beyond this, we demonstrate that properties such as shape, size, location, motion, etc., can also be edited in videos. Our key insight is that the keyframe transformations of the specific internal feature (e.g., edge maps of objects or human pose), can easi…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: Project page: https://magic-stick-edit.github.io/ Github repository: https://github.com/mayuelala/MagicStick

  41. arXiv:2312.00330  [pdf, other]

    cs.CV cs.AI

    StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

    Authors: Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Xintao Wang, Yujiu Yang, Ying Shan

    Abstract: Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style contro…

    Submitted 30 November, 2023; originally announced December 2023.

    Comments: Project page: https://gongyeliu.github.io/StyleCrafter.github.io/ ; GitHub repository: https://github.com/GongyeLiu/StyleCrafter

  42. arXiv:2311.17123  [pdf, other]

    cs.CV cs.AI

    ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

    Authors: Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang, Qi Zhang, Yanpei Cao, Ying Shan, Long Quan

    Abstract: In this work, we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner. Some existing approaches could achieve this by using generalizable pixel-aligned implicit fields to reconstruct a textured mesh of a human or by employing a 2D diffusion model as guidance with the Score Distillation Sampling (SDS) method, to lift the 2D image into 3D space.…

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: see project page: https://gaoxiangjun.github.io/contex_human/

  43. arXiv:2311.17092  [pdf, other]

    cs.CV

    SEED-Bench-2: Benchmarking Multimodal Large Language Models

    Authors: Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan

    Abstract: Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However, existing MLLM benchmarks remain limited to assessing only models' comprehension ability of si…

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: Project released at: https://github.com/AILab-CVC/SEED-Bench. arXiv admin note: text overlap with arXiv:2307.16125

  44. arXiv:2311.17061  [pdf, other]

    cs.CV

    HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

    Authors: Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, Ziwei Liu

    Abstract: Realistic 3D human generation from text prompts is a desirable yet challenging task. Existing methods optimize 3D representations like mesh or neural fields via score distillation sampling (SDS), which suffers from inadequate fine details or excessive training time. In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-graine…

    Submitted 14 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: Accepted by CVPR 2024, camera-ready version. Project Page: https://alvinliu0.github.io/projects/HumanGaussian

  45. arXiv:2311.16961  [pdf, other]

    cs.CV

    HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

    Authors: Jingbo Zhang, Xiaoyu Li, Qi Zhang, Yanpei Cao, Ying Shan, Jing Liao

    Abstract: Generating a 3D human model from a single reference image is challenging because it requires inferring textures and geometries in invisible views while maintaining consistency with the reference image. Previous methods utilizing 3D generative models are limited by the availability of 3D training data. Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail t…

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: Homepage: https://eckertzhang.github.io/HumanRef.github.io/

  46. arXiv:2311.16473  [pdf, other]

    cs.CV

    GS-IR: 3D Gaussian Splatting for Inverse Rendering

    Authors: Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, Kui Jia

    Abstract: We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that leverages forward mapping volume rendering to achieve photorealistic novel view synthesis and relighting results. Unlike previous works that use implicit neural representations and volume rendering (e.g. NeRF), which suffer from low expressive power and high computational complexity, we extend GS, a top-p…

    Submitted 28 March, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

  47. arXiv:2311.16081  [pdf, other]

    cs.CV cs.AI

    ViT-Lens: Towards Omni-modal Representations

    Authors: Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

    Abstract: Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT…

    Submitted 26 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: This work is a follow-up of arXiv:2308.10185. Accepted to CVPR2024

  48. arXiv:2311.15599  [pdf, other]

    cs.CV cs.AI cs.LG

    UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

    Authors: Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

    Abstract: Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As tr…

    Submitted 18 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: CVPR 2024. Code, all the models, reproducible training scripts at https://github.com/AILab-CVC/UniRepLKNet

  49. arXiv:2311.12040  [pdf]

    q-bio.QM cs.AI cs.LG

    TransCDR: a deep learning model for enhancing the generalizability of cancer drug response prediction through transfer learning and multimodal data fusion for drug representation

    Authors: Xiaoqiong Xia, Chaoyu Zhu, Yuqi Shan, Fan Zhong, Lei Liu

    Abstract: Accurate and robust drug response prediction is of utmost importance in precision medicine. Although many models have been developed to utilize the representations of drugs and cancer cell lines for predicting cancer drug responses (CDR), their performances can be improved by addressing issues such as insufficient data modality, suboptimal fusion algorithms, and poor generalizability for novel dru…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: 8 figures

  50. arXiv:2311.11255  [pdf, other]

    cs.SD cs.MM eess.AS

    M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

    Authors: Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan

    Abstract: The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. They also utilize LLMs to understand human intention and generate desired outputs like images, videos, and music. However, research that combines both un…

    Submitted 4 March, 2024; v1 submitted 19 November, 2023; originally announced November 2023.