
Showing 1–50 of 364 results for author: Shan, Y

  1. arXiv:2507.06457  [pdf, ps, other]

    cs.CL

    A Systematic Analysis of Hybrid Linear Attention

    Authors: Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian

    Abstract: Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component…

    Submitted 8 July, 2025; originally announced July 2025.

  2. arXiv:2507.06203  [pdf, ps, other]

    cs.CL

    A Survey on Latent Reasoning

    Authors: Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, et al. (8 additional authors not shown)

    Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inferen…

    Submitted 10 July, 2025; v1 submitted 8 July, 2025; originally announced July 2025.

  3. arXiv:2507.01926  [pdf, ps, other]

    cs.CV

    IC-Custom: Diverse Image Customization via In-Context Learning

    Authors: Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, Ying Shan

    Abstract: Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome t…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Project page: https://liyaowei-stu.github.io/project/IC_Custom

  4. arXiv:2507.01603  [pdf, ps, other]

    cs.CV

    DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

    Authors: Yue-Jiang Dong, Wang Zhao, Jiale Xu, Ying Shan, Song-Hai Zhang

    Abstract: Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods r…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  5. arXiv:2506.17074  [pdf, ps, other]

    cs.CV

    Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion

    Authors: Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan

    Abstract: We present Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image. Unlike prior approaches that mostly rely on deterministic part pose prediction and category-specific training, Assembler is designed to handle diverse, in-the-wild objects with varying part counts, geometries, and structures. It addresse…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: Technical Report. Project page: https://assembler3d.github.io

  6. arXiv:2506.16141  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

    Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu

    Abstract: Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasonin…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Code released at: https://github.com/TencentARC/GRPO-CARE

  7. arXiv:2506.14851  [pdf, ps, other]

    cs.DC cs.AI cs.LG

    Efficient Serving of LLM Applications with Probabilistic Demand Modeling

    Authors: Yifei Liu, Zuo Gan, Zhenghao Gan, Weiye Wang, Chen Chen, Yizhou Shan, Xusheng Chen, Zhenhua Han, Yifei Zhu, Shixuan Sun, Minyi Guo

    Abstract: Applications based on Large Language Models (LLMs) contain a series of tasks to address real-world problems with boosted capability, which have dynamic demand volumes on diverse backends. Existing serving systems treat the resource demands of LLM applications as a blackbox, compromising end-to-end efficiency due to improper queuing order and backend warm up latency. We find that the resource dema…

    Submitted 16 June, 2025; originally announced June 2025.

  8. SimSpark: Interactive Simulation of Social Media Behaviors

    Authors: Ziyue Lin, Yi Shan, Lin Gao, Xinghua Jia, Siming Chen

    Abstract: Understanding user behaviors on social media has garnered significant scholarly attention, enhancing our comprehension of how virtual platforms impact society and empowering decision-makers. Simulating social media behaviors provides a robust tool for capturing the patterns of social media behaviors, testing hypotheses, and predicting the effects of various interventions, ultimately contributing t…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: 32 pages, 7 figures

    Journal ref: Proc. ACM Hum.-Comput. Interact. 9, 2, Article CSCW168 (April 2025), 32 pages

  9. arXiv:2506.13497  [pdf, ps, other]

    cs.DC

    DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving

    Authors: Heyang Huang, Cunchen Hu, Jiaqi Zhu, Ziyuan Gao, Liangliang Xu, Yizhou Shan, Yungang Bao, Sun Ninghui, Tianwei Zhang, Sa Wang

    Abstract: The Text-to-Video (T2V) model aims to generate dynamic and expressive videos from textual prompts. The generation pipeline typically involves multiple modules, such as language encoder, Diffusion Transformer (DiT), and Variational Autoencoders (VAE). Existing serving systems often rely on monolithic model deployment, while overlooking the distinct characteristics of each module, leading to ineffic…

    Submitted 16 June, 2025; originally announced June 2025.

  10. arXiv:2506.11638  [pdf, ps, other]

    cs.CL cs.AI

    LoRA-Gen: Specializing Large Language Model via Online LoRA Generation

    Authors: Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Yixiao Ge, Xiu Li, Ying Shan

    Abstract: Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for ed…

    Submitted 13 June, 2025; originally announced June 2025.

  11. arXiv:2506.11106  [pdf, ps, other]

    cs.CL cs.AI cs.IR

    Graph-based RAG Enhancement via Global Query Disambiguation and Dependency-Aware Reranking

    Authors: Ningyuan Li, Junrui Liu, Yi Shan, Minghui Huang, Tong Li

    Abstract: Contemporary graph-based retrieval-augmented generation (RAG) methods typically begin by extracting entities from user queries and then leverage pre-constructed knowledge graphs to retrieve related relationships and metadata. However, this pipeline's exclusive reliance on entity-level extraction can lead to the misinterpretation or omission of latent yet critical information and relations. As a re…

    Submitted 7 June, 2025; originally announced June 2025.

  12. arXiv:2506.07672  [pdf, ps, other]

    cs.AI

    MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents

    Authors: Yunhe Yan, Shihe Wang, Jiajun Du, Yexuan Yang, Yuxuan Shan, Qichen Qiu, Xianqing Jia, Xinge Wang, Xin Yuan, Xu Han, Mao Qin, Yinxiao Chen, Chen Peng, Shangguang Wang, Mengwei Xu

    Abstract: (M)LLM-powered computer use agents (CUA) are emerging as a transformative technique to automate human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents, whose evaluation methods are susceptible to UI changes and ignore function interactions exposed by application APIs, e.g., Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA…

    Submitted 9 June, 2025; originally announced June 2025.

  13. arXiv:2506.05240  [pdf, ps, other]

    cs.LG cs.CV

    Aligning Latent Spaces with Flow Priors

    Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo

    Abstract: This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective…

    Submitted 5 June, 2025; originally announced June 2025.

  14. arXiv:2506.03714  [pdf, other]

    cs.CV

    FSHNet: Fully Sparse Hybrid Network for 3D Object Detection

    Authors: Shuai Liu, Mingyue Cui, Boyang Li, Quanmin Liang, Tinghe Hong, Kai Huang, Yunxiao Shan, Kai Huang

    Abstract: Fully sparse 3D detectors have recently gained significant attention due to their efficiency in long-range detection. However, sparse 3D detectors extract features only from non-empty voxels, which impairs long-range interactions and causes the center feature missing. The former weakens the feature extraction capability, while the latter hinders network optimization. To address these challenges, w…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Accepted by CVPR2025

  15. arXiv:2506.03126  [pdf, ps, other]

    cs.CV

    AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

    Authors: Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu

    Abstract: Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: Project released at: https://qiulu66.github.io/animeshooter/

  16. arXiv:2506.02975  [pdf, ps, other]

    cs.CV cs.AI

    HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

    Authors: Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Ying Shan

    Abstract: With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strat…

    Submitted 3 June, 2025; originally announced June 2025.

  17. arXiv:2505.21374  [pdf, ps, other]

    cs.CV

    Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    Authors: Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, Ying Shan

    Abstract: Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on ex…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Homepage: https://github.com/TencentARC/Video-Holmes

  18. arXiv:2505.21205  [pdf, ps, other]

    cs.CV

    Sci-Fi: Symmetric Constraint for Frame Inbetweening

    Authors: Liuhan Chen, Xiaodong Cun, Xiaoyu Li, Xianyi He, Shenghai Yuan, Jie Chen, Ying Shan, Li Yuan

    Abstract: Frame inbetweening aims to synthesize intermediate video sequences conditioned on the given start and end frames. Current state-of-the-art methods mainly extend large-scale pre-trained Image-to-Video Diffusion models (I2V-DMs) by incorporating end-frame constraints via directly fine-tuning or omitting training. We identify a critical limitation in their design: Their injections of the end-frame co…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 22 pages, 9 figures

  19. arXiv:2505.21153  [pdf]

    cs.MM cs.HC

    THE WASTIVE: An Interactive Ebb and Flow of Digital Fabrication Waste

    Authors: Yifan Shan, Bo Liu, Sebastian Bidegain, Thijs Roumen

    Abstract: What if digital fabrication waste could observe the world? What would they see? What would they say? "THE WASTIVE" reimagines digital fabrication waste as sentient observers, giving them a poetic voice through interactive art. As viewers approach, the installation awakens, mimicking the rhythmic ebb and flow of ocean waves - a silent dialogue where discarded materials "observe" and respond to huma…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: video demo: https://youtu.be/Yh3dmKYNP-8

  20. arXiv:2505.20480  [pdf, ps, other]

    eess.SP cs.CL q-bio.NC

    BrainStratify: Coarse-to-Fine Disentanglement of Intracranial Neural Dynamics

    Authors: Hui Zheng, Hai-Teng Wang, Yi-Tao Jing, Pei-Yang Lin, Han-Qing Zhao, Wei Chen, Peng-Hu Wei, Yong-Zhi Shan, Guo-Guang Zhao, Yun-Zhe Liu

    Abstract: Decoding speech directly from neural activity is a central goal in brain-computer interface (BCI) research. In recent years, exciting advances have been made through the growing use of intracranial field potential recordings, such as stereo-ElectroEncephaloGraphy (sEEG) and ElectroCorticoGraphy (ECoG). These neural signals capture rich population-level activity but present key challenges: (i) task…

    Submitted 26 May, 2025; originally announced May 2025.

  21. arXiv:2505.17997   

    cs.LG cs.CL

    Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

    Authors: Jintian Shao, Yiming Cheng, Hongyi Huang, Beiwen Zhang, Zhiyu Wu, You Shan, Mingkai Zheng

    Abstract: The VAPO framework has demonstrated significant empirical success in enhancing the efficiency and reliability of reinforcement learning for long chain-of-thought (CoT) reasoning tasks with large language models (LLMs). By systematically addressing challenges such as value model bias, heterogeneous sequence lengths, and sparse reward signals, VAPO achieves state-of-the-art performance. While its pr…

    Submitted 27 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: We are withdrawing this submission as the underlying experiment is currently incomplete. We require additional time to gather more data and supplement the existing findings to ensure a comprehensive and robust presentation. We intend to resubmit once these additions are finalized

  22. arXiv:2505.16324  [pdf, ps, other]

    cs.CV

    TensorAR: Refinement is All You Need in Autoregressive Image Generation

    Authors: Cheng Cheng, Lin Song, Yicheng Xiao, Yuxin Chen, Xuchong Zhang, Hongbin Sun, Ying Shan

    Abstract: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token predictio…

    Submitted 22 May, 2025; originally announced May 2025.

  23. arXiv:2505.13031  [pdf, ps, other]

    cs.AI

    MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

    Authors: Yicheng Xiao, Lin Song, Yukang Chen, Yingmin Luo, Yuxin Chen, Yukang Gan, Wei Huang, Xiu Li, Xiaojuan Qi, Ying Shan

    Abstract: Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni leverages a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion m…

    Submitted 11 June, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Code: https://github.com/TencentARC/MindOmni

  24. arXiv:2505.12511  [pdf, other]

    cs.CL

    DS-ProGen: A Dual-Structure Deep Language Model for Functional Protein Design

    Authors: Yanting Li, Jiyue Jiang, Zikang Wang, Ziqian Lin, Dongchen He, Yuheng Shan, Yanruisheng Shao, Jiayi Li, Xiangyu Shi, Jiuming Wang, Yanyu Chen, Yimin Fan, Han Li, Yu Li

    Abstract: Inverse Protein Folding (IPF) is a critical subtask in the field of protein design, aiming to engineer amino acid sequences capable of folding correctly into a specified three-dimensional (3D) conformation. Although substantial progress has been achieved in recent years, existing methods generally rely on either backbone coordinates or molecular surface features alone, which restricts their abilit…

    Submitted 18 May, 2025; originally announced May 2025.

  25. arXiv:2505.10222   

    cs.LG cs.CL

    ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention

    Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng

    Abstract: Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while allowing multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences disparately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces Compl…

    Submitted 27 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

    Comments: We are withdrawing this submission as the underlying experiment is currently incomplete. We require additional time to gather more data and supplement the existing findings to ensure a comprehensive and robust presentation. We intend to resubmit once these additions are finalized

  26. arXiv:2505.10202  [pdf, other]

    cs.CL

    VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

    Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, YiMing Cheng, ZhiYu Wu, You Shan, MingKai Zheng

    Abstract: Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to vocabulary-sized logits, often constitutes a substantial portion of the model's parameters and computational cost during inference. Existing methods like adaptive…

    Submitted 15 May, 2025; originally announced May 2025.

  27. arXiv:2505.06702  [pdf, ps, other]

    cs.HC

    Do Language Model Agents Align with Humans in Rating Visualizations? An Empirical Study

    Authors: Zekai Shao, Yi Shan, Yixuan He, Yuxuan Yao, Junhong Wang, Xiaolong Zhang, Yu Zhang, Siming Chen

    Abstract: Large language models encode knowledge in various domains and demonstrate the ability to understand visualizations. They may also capture visualization design knowledge and potentially help reduce the cost of formative studies. However, it remains a question whether large language models are capable of predicting human feedback on visualizations. To investigate this question, we conducted three st…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: 14 pages, 8 figures

  28. arXiv:2505.05422  [pdf, other]

    cs.CV cs.AI cs.CL

    TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

    Authors: Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan

    Abstract: Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLI…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Technical Report

  29. arXiv:2505.03756  [pdf, other]

    cs.AR cs.AI cs.LG cs.PF

    Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

    Authors: Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, Minyi Guo

    Abstract: Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in high bandwidth memory of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance like Time-To-First-Token (TTFT), neglecting usage dep…

    Submitted 19 April, 2025; originally announced May 2025.

  30. arXiv:2505.03730  [pdf, other]

    cs.CV cs.AI cs.MM

    FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

    Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang

    Abstract: Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose Fle…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Accepted by Siggraph2025, Project Page: https://shiyi-zh0408.github.io/projectpages/FlexiAct/

  31. arXiv:2504.20461  [pdf, other]

    cs.DC

    Efficient Graph-Based Approximate Nearest Neighbor Search Achieving: Low Latency Without Throughput Loss

    Authors: Jingjia Luo, Mingxing Zhang, Kang Chen, Xia Liao, Yingdi Shan, Jinlei Jiang, Yongwei Wu

    Abstract: The increase in the dimensionality of neural embedding models has enhanced the accuracy of semantic search capabilities but also amplified the computational demands for Approximate Nearest Neighbor Searches (ANNS). This complexity poses significant challenges in online and interactive services, where query latency is a critical performance metric. Traditional graph-based ANNS methods, while effect…

    Submitted 30 April, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

  32. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  33. arXiv:2504.12240  [pdf, other]

    cs.CV

    Cobra: Efficient Line Art COlorization with BRoAder References

    Authors: Junhao Zhuang, Lingen Li, Xuan Ju, Zhaoyang Zhang, Chun Yuan, Ying Shan

    Abstract: The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing c…

    Submitted 6 May, 2025; v1 submitted 16 April, 2025; originally announced April 2025.

    Comments: Project page with code: https://zhuang2002.github.io/Cobra/

  34. arXiv:2504.06684  [pdf, other]

    cs.RO cs.MA

    SDHN: Skewness-Driven Hypergraph Networks for Enhanced Localized Multi-Robot Coordination

    Authors: Delin Zhao, Yanbo Shan, Chang Liu, Shenghang Lin, Yingxin Shou, Bin Xu

    Abstract: Multi-Agent Reinforcement Learning is widely used for multi-robot coordination, where simple graphs typically model pairwise interactions. However, such representations fail to capture higher-order collaborations, limiting effectiveness in complex tasks. While hypergraph-based approaches enhance cooperation, existing methods often generate arbitrary hypergraph structures and lack adaptability to e…

    Submitted 9 April, 2025; originally announced April 2025.

  35. arXiv:2504.01506  [pdf, other]

    cs.LG

    MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage

    Authors: Yongjun He, Roger Waleffe, Zhichao Han, Johnu George, Binhang Yuan, Zitao Zhang, Yinan Shan, Yang Zhao, Debojyoti Dutta, Theodoros Rekatsinas, Ce Zhang

    Abstract: Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted to train large embedding models for spe…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: To appear in ICDE 2025

  36. arXiv:2504.01016  [pdf, other]

    cs.GR cs.AI cs.CV

    GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

    Authors: Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, Ying Shan

    Abstract: Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through the affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from ope…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Project webpage: https://geometrycrafter.github.io/

  37. arXiv:2504.01014  [pdf, ps, other]

    cs.CV

    AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

    Authors: Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan

    Abstract: Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as infinit…

    Submitted 30 May, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

    Comments: Project released at: https://howe125.github.io/AnimeGamer.github.io/

  38. arXiv:2503.24376  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

    Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu

    Abstract: Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this,…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: Technical Report (In Progress); Code released at: https://github.com/TencentARC/SEED-Bench-R1

  39. arXiv:2503.23791  [pdf, other]

    cs.PL cs.SE

    LLMigrate: Transforming "Lazy" Large Language Models into Efficient Source Code Migrators

    Authors: Yuchen Liu, Junhao Hu, Yingdi Shan, Ge Li, Yanzhen Zou, Yihong Dong, Tao Xie

    Abstract: Rewriting C code in Rust provides stronger memory safety, yet migrating large codebases such as the 32-million-line Linux kernel remains challenging. While rule-based translators (e.g., C2Rust) provide accurate yet largely unsafe Rust programs, recent Large Language Model (LLM) approaches produce more idiomatic, safe Rust programs but frequently exhibit "laziness", omitting significant portions of…

    Submitted 31 March, 2025; originally announced March 2025.

  40. arXiv:2503.22262  [pdf, other]

    cs.CV

    Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion

    Authors: Songsong Yu, Yuxin Chen, Zhongang Qi, Zeke Xie, Yifan Wang, Lijun Wang, Ying Shan, Huchuan Lu

    Abstract: With the rapid proliferation of 3D devices and the shortage of 3D content, stereo conversion is attracting increasing attention. Recent works introduce pretrained Diffusion Models (DMs) into this task. However, due to the scarcity of large-scale training data and comprehensive benchmarks, the optimal methodologies for employing DMs in stereo conversion and the accurate evaluation of stereo effects…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025. Project webpage: https://mono2stereo-bench.github.io/

  41. arXiv:2503.19480  [pdf, ps, other]

    cs.CV

    GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

    Authors: Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan

    Abstract: The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle rema…

    Submitted 30 May, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: Project released at: https://mashijie1028.github.io/GenHancer/

  42. arXiv:2503.17407  [pdf, other]

    cs.CL cs.LG

    A Comprehensive Survey on Long Context Language Modeling

    Authors: Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, et al. (12 additional authors not shown)

    Abstract: Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-c…

    Submitted 20 March, 2025; originally announced March 2025.

  43. arXiv:2503.14694  [pdf, other]

    cs.CL cs.CV

    HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

    Authors: Rui Yang, Lin Song, Yicheng Xiao, Runhui Huang, Yixiao Ge, Ying Shan, Hengshuang Zhao

    Abstract: Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-in…

    Submitted 12 March, 2025; originally announced March 2025.

  44. arXiv:2503.13434  [pdf, other]

    cs.CV cs.AI cs.MM

    BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

    Authors: Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou

    Abstract: Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and rep…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: Project Webpage: https://liyaowei-stu.github.io/project/BlobCtrl/

  45. arXiv:2503.08703  [pdf, ps, other]

    cs.NE cs.CV

    SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks

    Authors: Yimeng Shan, Zhenbang Ren, Haodi Wu, Wenjie Wei, Rui-Jie Zhu, Shuai Wang, Dehao Zhang, Yichen Xiao, Jieyuan Zhang, Kexin Shi, Jingzhinan Wang, Jason K. Eshraghian, Haicheng Qu, Jiqing Zhang, Malu Zhang, Yang Yang

    Abstract: Event cameras provide superior temporal resolution, dynamic range, power efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches that combine Artificial Neural Networks (ANNs) and SNNs, along with suboptimal architectures, compromise energy efficiency and…

    Submitted 17 June, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

    Comments: 11 pages, 7 figures, 4 tables

  46. arXiv:2503.06237  [pdf, other]

    cs.CV

    Rethinking Lanes and Points in Complex Scenarios for Monocular 3D Lane Detection

    Authors: Yifan Chang, Junjie Huang, Xiaofeng Wang, Yun Ye, Zhujin Liang, Yi Shan, Dalong Du, Xingang Wang

    Abstract: Monocular 3D lane detection is a fundamental task in autonomous driving. Although sparse-point methods lower computational load and maintain high accuracy in complex lane geometries, current methods fail to fully leverage the geometric structure of lanes in both lane geometry representations and model design. In lane geometry representations, we present a theoretical analysis alongside experimenta…

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  47. arXiv:2503.05639  [pdf, other]

    cs.CV cs.AI cs.MM

    VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

    Authors: Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu

    Abstract: Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context pre…

    Submitted 8 April, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

    Comments: Project page available at https://yxbian23.github.io/project/video-painter

  48. arXiv:2503.05638  [pdf, other]

    cs.CV cs.AI cs.GR

    TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

    Authors: Mark YU, Wenbo Hu, Jinbo Xing, Ying Shan

    Abstract: We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as…

    Submitted 7 March, 2025; originally announced March 2025.

    Comments: Project webpage: https://trajectorycrafter.github.io/

  49. arXiv:2503.04135  [pdf, other]

    cs.CL

    Biological Sequence with Language Model Prompting: A Survey

    Authors: Jiyue Jiang, Zikang Wang, Yuheng Shan, Heyan Chai, Jiayi Li, Zixian Ma, Xinrui Zhang, Yu Li

    Abstract: Large Language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. Notably, recent studies have demonstrated that large language models significantly enhance the efficiency of biomolecular analysis and synthesis, attracting widespread attention from academics and medicine. In this paper, we systematically investigate the application of prompt-based method…

    Submitted 6 March, 2025; originally announced March 2025.

  50. arXiv:2503.00040  [pdf, other]

    cs.NE cs.CV

    Memory-Free and Parallel Computation for Quantized Spiking Neural Networks

    Authors: Dehao Zhang, Shuai Wang, Yichen Xiao, Wenjie Wei, Yimeng Shan, Malu Zhang, Yang Yang

    Abstract: Quantized Spiking Neural Networks (QSNNs) offer superior energy efficiency and are well-suited for deployment on resource-limited edge devices. However, limited bit-width weight and membrane potential result in a notable performance decline. In this study, we first identify a new underlying cause for this decline: the loss of historical information due to the quantized membrane potential. To tackl…

    Submitted 25 February, 2025; originally announced March 2025.