Showing 1–50 of 431 results for author: Shi, H

Searching in archive cs.
  1. arXiv:2405.05949  [pdf, other]

    cs.CV

    CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

    Authors: Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen

    Abstract: Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Exp…

    Submitted 9 May, 2024; originally announced May 2024.

  2. arXiv:2405.05858  [pdf, other]

    cs.CV cs.AI cs.GR cs.RO

    Free-Moving Object Reconstruction and Pose Estimation with Virtual Camera

    Authors: Haixin Shi, Yinlin Hu, Daniel Koguciuk, Juan-Ting Lin, Mathieu Salzmann, David Ferstl

    Abstract: We propose an approach for reconstructing free-moving object from a monocular RGB video. Most existing methods either assume scene prior, hand pose prior, object category pose prior, or rely on local optimization with multiple sequence segments. We propose a method that allows free interaction with the object in front of a moving camera without relying on any prior, and optimizes the sequence glob…

    Submitted 9 May, 2024; originally announced May 2024.

  3. arXiv:2405.05518  [pdf, other]

    cs.CV cs.RO eess.IV

    DTCLMapper: Dual Temporal Consistent Learning for Vectorized HD Map Construction

    Authors: Siyu Li, Jiacheng Lin, Hao Shi, Jiaming Zhang, Song Wang, You Yao, Zhiyong Li, Kailun Yang

    Abstract: Temporal information plays a pivotal role in Bird's-Eye-View (BEV) driving scene understanding, which can alleviate the visual information sparsity. However, the indiscriminate temporal fusion method will cause the barrier of feature redundancy when constructing vectorized High-Definition (HD) maps. In this paper, we revisit the temporal fusion of vectorized HD maps, focusing on temporal instance…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: The source code will be made publicly available at https://github.com/lynn-yu/DTCLMapper

  4. arXiv:2405.04950  [pdf, other]

    cs.CV cs.AI cs.CL

    VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context

    Authors: Yunxin Li, Baotian Hu, Haoyuan Shi, Wei Wang, Longyue Wang, Min Zhang

    Abstract: Large Multimodal Models (LMMs) have achieved impressive success in visual understanding and reasoning, remarkably improving the performance of mathematical reasoning in a visual context. Yet, a challenging type of visual math lies in the multimodal graph theory problem, which demands that LMMs understand the graphical structures accurately and perform multi-step reasoning on the visual graph. Addi…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 17 pages; Accepted by ICML 2024

  5. arXiv:2405.03882  [pdf, other]

    cs.CV cs.AI

    Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

    Authors: Huihong Shi, Haikuo Shao, Wendong Mao, Zhongfeng Wang

    Abstract: Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unf…

    Submitted 6 May, 2024; originally announced May 2024.

  6. arXiv:2405.03764  [pdf, other]

    cs.CL cs.IR

    GOVERN: Gradient Orientation Vote Ensemble for Multi-Teacher Reinforced Distillation

    Authors: Wenjie Zhou, Zhenxin Ding, Xiaodong Zhang, Haibo Shi, Junfeng Wang, Dawei Yin

    Abstract: Pre-trained language models have become an integral component of question-answering systems, achieving remarkable performance. For practical deployment, it is critical to carry out knowledge distillation to preserve high performance under computational constraints. In this paper, we address a key question: given the importance of unsupervised distillation for student performance, how does one effe…

    Submitted 6 May, 2024; originally announced May 2024.

  7. arXiv:2405.02713  [pdf, other]

    cs.IT

    Set Transformation: Trade-off Between Repair Bandwidth and Sub-packetization

    Authors: Hao Shi, Zhengyi Jiang, Zhongyi Huang, Bo Bai, Gong Zhang, Hanxu Hou

    Abstract: Maximum distance separable (MDS) codes facilitate the achievement of elevated levels of fault tolerance in storage systems while incurring minimal redundancy overhead. Reed-Solomon (RS) codes are typical MDS codes with the sub-packetization level being one, however, they require large repair bandwidth defined as the total amount of symbols downloaded from other surviving nodes during single-node f…

    Submitted 4 May, 2024; originally announced May 2024.

  8. arXiv:2405.00578  [pdf, other]

    cs.CL cs.AI

    The Real, the Better: Aligning Large Language Models with Online Human Behaviors

    Authors: Guanying Jiang, Lingyong Yan, Haibo Shi, Dawei Yin

    Abstract: Large language model alignment is widely used and studied to avoid LLM producing unhelpful and harmful responses. However, the lengthy training process and predefined preference bias hinder adaptation to online diverse human preferences. To this end, this paper proposes an alignment framework, called Reinforcement Learning with Human Behavior (RLHB), to align LLMs by directly leveraging real onlin…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 11 pages, 6 figures

  9. arXiv:2404.17196  [pdf, other]

    cs.CR cs.AI

    Human-Imperceptible Retrieval Poisoning Attacks in LLM-Powered Applications

    Authors: Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, Yu Jiang

    Abstract: Presently, with the assistance of advanced LLM application development frameworks, more and more LLM-powered applications can effortlessly augment the LLMs' knowledge with external content using the retrieval augmented generation (RAG) technique. However, these frameworks' designs do not have sufficient consideration of the risk of external content, thereby allowing attackers to undermine the appl…

    Submitted 26 April, 2024; originally announced April 2024.

  10. arXiv:2404.16789  [pdf, other]

    cs.LG cs.AI cs.CL

    Continual Learning of Large Language Models: A Comprehensive Survey

    Authors: Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Hao Wang

    Abstract: The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 57 pages, 2 figures, 4 tables. Work in progress

  11. arXiv:2404.16205  [pdf, other]

    cs.CV cs.MM

    AIS 2024 Challenge on Video Quality Assessment of User-Generated Content: Methods and Results

    Authors: Marcos V. Conde, Saman Zadtootaghaj, Nabajeet Barman, Radu Timofte, Chenlong He, Qi Zheng, Ruoxi Zhu, Zhengzhong Tu, Haiqiang Wang, Xiangguang Chen, Wenhui Meng, Xiang Pan, Huiying Shi, Han Zhu, Xiaozhong Xu, Lei Sun, Zhenzhong Chen, Shan Liu, Zicheng Zhang, Haoning Wu, Yingjie Zhou, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, et al. (11 additional authors not shown)

    Abstract: This paper reviews the AIS 2024 Video Quality Assessment (VQA) Challenge, focused on User-Generated Content (UGC). The aim of this challenge is to gather deep learning-based methods capable of estimating the perceptual quality of UGC videos. The user-generated videos from the YouTube UGC Dataset include diverse content (sports, games, lyrics, anime, etc.), quality and resolutions. The proposed met…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Workshop -- AI for Streaming (AIS) Video Quality Assessment Challenge

  12. arXiv:2404.14568  [pdf, other]

    cs.CV

    UVMap-ID: A Controllable and Personalized UV Map Generative Model

    Authors: Weijie Wang, Jichao Zhang, Chang Liu, Xia Li, Xingqian Xu, Humphrey Shi, Nicu Sebe, Bruno Lepri

    Abstract: Recently, diffusion models have made significant strides in synthesizing realistic 2D human images based on provided text prompts. Building upon this, researchers have extended 2D text-to-image diffusion models into the 3D domain for generating human textures (UV Maps). However, some important problems about UV Map Generative models are still not solved, i.e., how to generate personalized texture…

    Submitted 22 April, 2024; originally announced April 2024.

  13. arXiv:2404.12794  [pdf, other]

    cs.CV cs.MM cs.RO eess.IV

    MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model

    Authors: Kang Zeng, Hao Shi, Jiacheng Lin, Siyu Li, Jintao Cheng, Kaiwei Wang, Zhiyong Li, Kailun Yang

    Abstract: LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment moving objects in point clouds of the current scan using motion information from previous scans. Despite the promising results achieved by previous MOS methods, several key issues, such as the weak coupling of temporal and spatial information, still need further study. In this paper, we propose a novel LiDAR-based 3D Moving Ob…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: The source code will be made publicly available at https://github.com/Terminal-K/MambaMOS

  14. arXiv:2404.11313  [pdf, other]

    eess.IV cs.AI

    NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

    Authors: Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai, Jianhui Sun, Tianyi Wang, Lei Li, Han Kong, Wenxuan Wang, Bing Li, Cheng Luo, et al. (43 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

  15. arXiv:2404.10573  [pdf, other]

    cs.AI cs.CE q-bio.BM

    AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation

    Authors: Lijun Liu, Jiali Yang, Jianfei Song, Xinglin Yang, Lele Niu, Zeqi Cai, Hui Shi, Tingjun Hou, Chang-yu Hsieh, Weiran Shen, Yafeng Deng

    Abstract: Recombinant adeno-associated virus (rAAV) vectors have revolutionized gene therapy, but their broad tropism and suboptimal transduction efficiency limit their clinical applications. To overcome these limitations, researchers have focused on designing and screening capsid libraries to identify improved vectors. However, the large sequence space and limited resources present challenges in identifyin…

    Submitted 17 April, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

  16. arXiv:2404.07990  [pdf, other]

    cs.CV cs.AI

    OpenBias: Open-set Bias Detection in Text-to-Image Generative Models

    Authors: Moreno D'Incà, Elia Peruzzo, Massimiliano Mancini, Dejia Xu, Vidit Goel, Xingqian Xu, Zhangyang Wang, Humphrey Shi, Nicu Sebe

    Abstract: Text-to-image generative models are becoming increasingly popular and accessible to the general public. As these models see large-scale deployments, it is necessary to deeply investigate their safety and fairness to not disseminate and perpetuate any kind of biases. However, existing works focus on detecting closed sets of biases defined a priori, limiting the studies to well-known concepts. In th…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Highlight - Code: https://github.com/Picsart-AI-Research/OpenBias

  17. arXiv:2404.04629  [pdf, other]

    cs.CV

    DifFUSER: Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation

    Authors: Duy-Tho Le, Hengcan Shi, Jianfei Cai, Hamid Rezatofighi

    Abstract: Diffusion models have recently gained prominence as powerful deep generative models, demonstrating unmatched performance across various domains. However, their potential in multi-sensor fusion remains largely unexplored. In this work, we introduce DifFUSER, a novel approach that leverages diffusion models for multi-modal fusion in 3D object detection and BEV map segmentation. Benefiting from the i…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: 23 pages

  18. arXiv:2404.04554  [pdf, other]

    math.QA cs.CE

    A quantum algorithm for the Kalman filter using block encoding

    Authors: Hao Shi, Guofeng Zhang, Ming Zhang

    Abstract: Quantum algorithms offer significant speed-ups over their classical counterparts in various applications. In this paper, we develop quantum algorithms for the Kalman filter widely used in classical control engineering using the block encoding method. The entire calculation process is achieved by performing matrix operations on Hamiltonians based on the block encoding framework, including addition,…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: 23 pages, 20 figures, 3 tables

  19. arXiv:2404.03575  [pdf, other]

    cs.CV

    DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling

    Authors: Haoran Li, Haolin Shi, Wenli Zhang, Wenjun Wu, Yong Liao, Lin Wang, Lik-hang Lee, Pengyuan Zhou

    Abstract: Text-to-3D scene generation holds immense potential for the gaming, film, and architecture sectors. Despite significant progress, existing methods struggle with maintaining high quality, consistency, and editing flexibility. In this paper, we propose DreamScene, a 3D Gaussian-based novel text-to-3D scene generation framework, to tackle the aforementioned three challenges mainly via two strategies.…

    Submitted 4 April, 2024; originally announced April 2024.

  20. arXiv:2404.01975  [pdf, other]

    cs.LG

    DSGNN: A Dual-View Supergrid-Aware Graph Neural Network for Regional Air Quality Estimation

    Authors: Xin Zhang, Ling Chen, Xing Tang, Hongyu Shi

    Abstract: Air quality estimation can provide air quality for target regions without air quality stations, which is useful for the public. Existing air quality estimation methods divide the study area into disjointed grid regions, and apply 2D convolution to model the spatial dependencies of adjacent grid regions based on the first law of geography, failing to model the spatial dependencies of distant grid r…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Submitted to TKDE, 12 pages and 8 figures

  21. arXiv:2404.01686  [pdf, other]

    cs.CV

    JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments

    Authors: Duy-Tho Le, Chenhui Gou, Stavya Datta, Hengcan Shi, Ian Reid, Jianfei Cai, Hamid Rezatofighi

    Abstract: Autonomous robot systems have attracted increasing research attention in recent years, where environment understanding is a crucial step for robot navigation, human-robot interaction, and decision. Real-world robot systems usually collect visual data from multiple sensors and are required to recognize numerous objects and their movements in complex human-crowded settings. Traditional benchmarks, w…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  22. arXiv:2404.00335  [pdf, other]

    cs.CV

    Learning Trimaps via Clicks for Image Matting

    Authors: Chenyi Zhang, Yihan Hu, Henghui Ding, Humphrey Shi, Yao Zhao, Yunchao Wei

    Abstract: Despite significant advancements in image matting, existing models heavily depend on manually-drawn trimaps for accurate results in natural image scenarios. However, the process of obtaining trimaps is time-consuming, lacking user-friendliness and device compatibility. This reliance greatly limits the practical application of all trimap-based matting methods. To address this issue, we introduce Cl…

    Submitted 6 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

  23. arXiv:2403.20230  [pdf, other]

    cs.AR cs.LG

    An FPGA-Based Reconfigurable Accelerator for Convolution-Transformer Hybrid EfficientViT

    Authors: Haikuo Shao, Huihong Shi, Wendong Mao, Zhongfeng Wang

    Abstract: Vision Transformers (ViTs) have achieved significant success in computer vision. However, their intensive computations and massive memory footprint challenge ViTs' deployment on embedded devices, calling for efficient ViTs. Among them, EfficientViT, the state-of-the-art one, features a Convolution-Transformer hybrid architecture, enhancing both accuracy and hardware efficiency. Unfortunately, exis…

    Submitted 29 March, 2024; originally announced March 2024.

    Comments: To appear in the 2024 IEEE International Symposium on Circuits and Systems (ISCAS 2024)

  24. arXiv:2403.20058  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Revolutionizing Disease Diagnosis with simultaneous functional PET/MR and Deeply Integrated Brain Metabolic, Hemodynamic, and Perfusion Networks

    Authors: Luoyu Wang, Yitian Tao, Qing Yang, Yan Liang, Siwei Liu, Hongcheng Shi, Dinggang Shen, Han Zhang

    Abstract: Simultaneous functional PET/MR (sf-PET/MR) presents a cutting-edge multimodal neuroimaging technique. It provides an unprecedented opportunity for concurrently monitoring and integrating multifaceted brain networks built by spatiotemporally covaried metabolic activity, neural activity, and cerebral blood flow (perfusion). Albeit high scientific/clinical values, short in hardware accessibility of P…

    Submitted 29 March, 2024; originally announced March 2024.

    Comments: 11 pages

  25. arXiv:2403.18819  [pdf, other]

    cs.CV

    Benchmarking Object Detectors with COCO: A New Path Forward

    Authors: Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, Karan Desai

    Abstract: The Common Objects in Context (COCO) dataset has been instrumental in benchmarking object detectors over the past decade. Like every dataset, COCO contains subtle errors and imperfections stemming from its annotation procedure. With the advent of high-performing models, we ask whether these errors of COCO are hindering its utility in reliably benchmarking further progress. In search for an answer,…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Technical report. Dataset website: https://cocorem.xyz and code: https://github.com/kdexd/coco-rem

  26. arXiv:2403.14773  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM eess.IV

    StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

    Authors: Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

    Abstract: Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video generation (typically 16 or 24 frames), ending up with hard-cuts when naively extended to the case of long video synthesis. To overcome these limitations, we introduc…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: https://github.com/Picsart-AI-Research/StreamingT2V

  27. arXiv:2403.11331  [pdf]

    physics.geo-ph cs.LG physics.data-an

    Potential of Domain Adaptation in Machine Learning in Ecology and Hydrology to Improve Model Extrapolability

    Authors: Haiyang Shi

    Abstract: Due to the heterogeneity of the global distribution of ecological and hydrological ground-truth observations, machine learning models can have limited adaptability when applied to unknown locations, which is referred to as weak extrapolability. Domain adaptation techniques have been widely used in machine learning domains such as image classification, which can improve the model generalization abi…

    Submitted 17 March, 2024; originally announced March 2024.

  28. arXiv:2403.10012  [pdf, other]

    cs.CV cs.RO eess.IV physics.optics

    Real-World Computational Aberration Correction via Quantized Domain-Mixing Representation

    Authors: Qi Jiang, Zhonghua Yi, Shaohua Gao, Yao Gao, Xiaolong Qian, Hao Shi, Lei Sun, Zhijie Xu, Kailun Yang, Kaiwei Wang

    Abstract: Relying on paired synthetic data, existing learning-based Computational Aberration Correction (CAC) methods are confronted with the intricate and multifaceted synthetic-to-real domain gap, which leads to suboptimal performance in real-world applications. In this paper, in contrast to improving the simulation pipeline, we deliver a novel insight into real-world CAC from the perspective of Unsupervi…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Codes and datasets will be made publicly available at https://github.com/zju-jiangqi/QDMR

  29. arXiv:2403.08504  [pdf, other]

    cs.CV cs.RO eess.IV

    OccFiner: Offboard Occupancy Refinement with Hybrid Propagation

    Authors: Hao Shi, Song Wang, Jiaming Zhang, Xiaoting Yin, Zhongdao Wang, Zhijian Zhao, Guangming Wang, Jianke Zhu, Kailun Yang, Kaiwei Wang

    Abstract: Vision-based occupancy prediction, also known as 3D Semantic Scene Completion (SSC), presents a significant challenge in computer vision. Previous methods, confined to onboard processing, struggle with simultaneous geometric and semantic estimation, continuity across varying viewpoints, and single-view occlusion. Our paper introduces OccFiner, a novel offboard framework designed to enhance the acc…

    Submitted 15 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  30. arXiv:2403.08164  [pdf, other]

    cs.SD cs.LG eess.AS

    EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech

    Authors: Ziqi Liang, Haoxiang Shi, Jiawei Wang, Keda Lu

    Abstract: Recently, deep learning-based Text-to-Speech (TTS) systems have achieved high-quality speech synthesis results. Recurrent neural networks have become a standard modeling technique for sequential data in TTS systems and are widely used. However, training a TTS model which includes RNN components requires powerful GPU performance and takes a long time. In contrast, CNN-based sequence synthesis techn…

    Submitted 17 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: Accepted by the 27th IEEE International Conference on Computer Supported Cooperative Work in Design (IEEE CSCWD 2024). arXiv admin note: substantial text overlap with arXiv:2211.01948

  31. arXiv:2403.07788  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation

    Authors: Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, C. Karen Liu

    Abstract: Imitation learning from human hand motion data presents a promising avenue for imbuing robots with human-like dexterity in real-world manipulation tasks. Despite this potential, substantial challenges persist, particularly with the portability of existing hand motion capture (mocap) systems and the difficulty of translating mocap data into effective control policies. To tackle these issues, we int…

    Submitted 12 March, 2024; originally announced March 2024.

  32. arXiv:2403.07518  [pdf, other]

    cs.CV

    Open-Vocabulary Scene Text Recognition via Pseudo-Image Labeling and Margin Loss

    Authors: Xuhua Ren, Hengcan Shi, Jin Li

    Abstract: Scene text recognition is an important and challenging task in computer vision. However, most prior works focus on recognizing pre-defined words, while there are various out-of-vocabulary (OOV) words in real-world applications. In this paper, we propose a novel open-vocabulary text recognition framework, Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack of OOV traini…

    Submitted 12 March, 2024; originally announced March 2024.

  33. arXiv:2403.06529  [pdf, other]

    cs.CV

    Confidence-Aware RGB-D Face Recognition via Virtual Depth Synthesis

    Authors: Zijian Chen, Mei Wang, Weihong Deng, Hongzhi Shi, Dongchao Wen, Yingjie Zhang, Xingchen Cui, Jian Zhao

    Abstract: 2D face recognition encounters challenges in unconstrained environments due to varying illumination, occlusion, and pose. Recent studies focus on RGB-D face recognition to improve robustness by incorporating depth information. However, collecting sufficient paired RGB-D training data is expensive and time-consuming, hindering wide deployment. In this work, we first construct a diverse depth datase…

    Submitted 16 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: 9 pages, 5 figures

  34. arXiv:2403.05912  [pdf, other]

    eess.IV cs.CV

    Mask-Enhanced Segment Anything Model for Tumor Lesion Semantic Segmentation

    Authors: Hairong Shi, Songhao Han, Shaofei Huang, Yue Liao, Guanbin Li, Xiangxing Kong, Hua Zhu, Xiaomu Wang, Si Liu

    Abstract: Tumor lesion segmentation on CT or MRI images plays a critical role in cancer diagnosis and treatment planning. Considering the inherent differences in tumor lesion segmentation data across various medical imaging modalities and equipment, integrating medical knowledge into the Segment Anything Model (SAM) presents promising capability due to its versatility and generalization potential. Recent st…

    Submitted 9 March, 2024; originally announced March 2024.

  35. arXiv:2403.04690  [pdf, other]

    cs.CV cs.AI cs.LG

    Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level

    Authors: Ali Hassani, Wen-Mei Hwu, Humphrey Shi

    Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infra…

    Submitted 22 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Project page: https://github.com/SHI-Labs/NATTEN

  36. arXiv:2403.03031  [pdf, other]

    cs.CL

    Learning to Use Tools via Cooperative and Interactive Agents

    Authors: Zhengliang Shi, Shen Gao, Xiuyi Chen, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Pengjie Ren, Suzan Verberne, Zhaochun Ren

    Abstract: Tool learning empowers large language models (LLMs) as agents to use external tools to extend their capability. Existing methods employ one single LLM-based agent to iteratively select and execute tools, thereafter incorporating the result into the next action prediction. However, they still suffer from potential performance degradation when addressing complex tasks due to: (1) the limitation of t…

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: 20 pages

  37. arXiv:2403.03017  [pdf, other]

    cs.AI

    OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following

    Authors: Haochen Shi, Zhiyuan Sun, Xingdi Yuan, Marc-Alexandre Côté, Bang Liu

    Abstract: Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these e…

    Submitted 5 March, 2024; originally announced March 2024.

  38. arXiv:2403.01962  [pdf, other]

    cs.RO

    An Efficient Model-Based Approach on Learning Agile Motor Skills without Reinforcement

    Authors: Haojie Shi, Tingguang Li, Qingxu Zhu, Jiapeng Sheng, Lei Han, Max Q.-H. Meng

    Abstract: Learning-based methods have improved locomotion skills of quadruped robots through deep reinforcement learning. However, the sim-to-real gap and low sample efficiency still limit the skill transfer. To address this issue, we propose an efficient model-based learning framework that combines a world model with a policy network. We train a differentiable world model to predict future states and use i…

    Submitted 18 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted by ICRA2024

  39. arXiv:2402.18275  [pdf, other]

    cs.SD cs.CL eess.AS

    Investigation of Adapter for Automatic Speech Recognition in Noisy Environment

    Authors: Hao Shi, Tatsuya Kawahara

    Abstract: Adapting an automatic speech recognition (ASR) system to unseen noise environments is crucial. Integrating adapters into neural networks has emerged as a potent technique for transfer learning. This study thoroughly investigates adapter-based ASR adaptation in noisy environments. We conducted experiments using the CHiME-4 dataset. The results show that inserting the adapter in the shallow layer y…

    Submitted 29 February, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

  40. arXiv:2402.18169  [pdf, ps, other]

    cs.CL

    MIKO: Multimodal Intention Knowledge Distillation from Large Language Models for Social-Media Commonsense Discovery

    Authors: Feihong Lu, Weiqi Wang, Yangyifei Luo, Ziqin Zhu, Qingyun Sun, Baixuan Xu, Haochen Shi, Shiqi Gao, Qian Li, Yangqiu Song, Jianxin Li

    Abstract: Social media has become a ubiquitous tool for connecting with others, staying updated with news, expressing opinions, and finding entertainment. However, understanding the intention behind social media posts remains challenging due to the implicitness of intentions in social media posts, the need for cross-modality understanding of both text and images, and the presence of noisy information such a…

    Submitted 29 February, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: 11 pages, 5 figures

  41. arXiv:2402.14034  [pdf, other]

    cs.MA cs.AI

    AgentScope: A Flexible yet Robust Multi-Agent Platform

    Authors: Dawei Gao, Zitao Li, Weirui Kuang, Xuchen Pan, Daoyuan Chen, Zhijian Ma, Bingchen Qian, Liuyi Yao, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, Jingren Zhou

    Abstract: With the rapid advancement of Large Language Models (LLMs), significant progress has been made in multi-agent applications. However, the complexities in coordinating agents' cooperation and LLMs' erratic performance pose notable challenges in developing robust and efficient multi-agent applications. To tackle these challenges, we propose AgentScope, a developer-centric multi-agent platform with me…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: We have released code on https://github.com/modelscope/agentscope

  42. arXiv:2402.13572  [pdf, other]

    cs.LG cs.AI math.NA

    On the Expressive Power of a Variant of the Looped Transformer

    Authors: Yihang Gao, Chuanyang Zheng, Enze Xie, Han Shi, Tianyang Hu, Yu Li, Michael K. Ng, Zhenguo Li, Zhaoqiang Liu

    Abstract: Besides natural language processing, transformers exhibit extraordinary performance in solving broader applications, including scientific computing and computer vision. Previous works try to explain this from the expressive power and capability perspectives that standard transformers are capable of performing some algorithms. To empower transformers with algorithmic capabilities and motivated by t…

    Submitted 21 February, 2024; originally announced February 2024.

  43. arXiv:2402.13561  [pdf, other]

    cs.CL cs.CV

    Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

    Authors: Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang

    Abstract: Evaluating and Rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely-used visual-language projection approaches (e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet ignore the visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and…

    Submitted 21 February, 2024; originally announced February 2024.

    Comments: work in progress, under review

  44. arXiv:2402.12950  [pdf, other]

    cs.SE cs.AI

    QuanTest: Entanglement-Guided Testing of Quantum Neural Network Systems

    Authors: Jinjing Shi, Zimeng Xiao, Heyuan Shi, Yu Jiang, Xuelong Li

    Abstract: Quantum Neural Network (QNN) combines the Deep Learning (DL) principle with the fundamental theory of quantum mechanics to achieve machine learning tasks with quantum acceleration. Recently, QNN systems have been found to manifest robustness issues similar to classical DL systems. There is an urgent need for ways to test their correctness and security. However, QNN systems differ significantly fro…

    Submitted 20 February, 2024; originally announced February 2024.

  45. arXiv:2402.11468  [pdf]

    cs.MA

    Online Physical Enhanced Residual Learning for Connected Autonomous Vehicles Platoon Centralized Control

    Authors: Hang Zhou, Heye Huang, Peng Zhang, Haotian Shi, Keke Long, Xiaopeng Li

    Abstract: This paper introduces an online physical enhanced residual learning (PERL) framework for Connected Autonomous Vehicles (CAVs) platoon, aimed at addressing the challenges posed by the dynamic and unpredictable nature of traffic environments. The proposed framework synergistically combines a physical model, represented by Model Predictive Control (MPC), with data-driven online Q-learning. The MPC co…

    Submitted 18 February, 2024; originally announced February 2024.

  46. arXiv:2402.11176  [pdf, other]

    cs.CL cs.AI

    KnowTuning: Knowledge-aware Fine-tuning for Large Language Models

    Authors: Yougang Lyu, Lingyong Yan, Shuaiqiang Wang, Haibo Shi, Dawei Yin, Pengjie Ren, Zhumin Chen, Maarten de Rijke, Zhaochun Ren

    Abstract: Despite their success at many natural language processing (NLP) tasks, large language models still struggle to effectively leverage knowledge for knowledge-intensive tasks, manifesting limitations such as generating incomplete, non-factual, or illogical answers. These limitations stem from inadequate knowledge awareness of LLMs during vanilla fine-tuning. To address these problems, we propose a kn…

    Submitted 17 April, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  47. arXiv:2402.09872  [pdf, other]

    cs.CV

    Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community

    Authors: Arman Isajanyan, Artur Shatveryan, David Kocharyan, Zhangyang Wang, Humphrey Shi

    Abstract: Social reward as a form of community recognition provides a strong source of motivation for users of online platforms to engage and contribute with content. The recent progress of text-conditioned image synthesis has ushered in a collaborative era where AI empowers users to craft original visual artworks seeking community validation. Nevertheless, assessing these models in the context of collectiv…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: 16 pages with 10 figures, accepted at ICLR 2024 as a spotlight, codes can be accessed at https://github.com/Picsart-AI-Research/Social-Reward

  48. arXiv:2402.07754  [pdf, other]

    cs.CL cs.AI cs.LG

    Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models

    Authors: Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Zhenguo Li, Wei Bi, Lingpeng Kong

    Abstract: Diffusion models have gained attention in text processing, offering many potential advantages over traditional autoregressive models. This work explores the integration of diffusion models and Chain-of-Thought (CoT), a well-established technique to improve the reasoning ability in autoregressive language models. We propose Diffusion-of-Thought (DoT), allowing reasoning steps to diffuse over time t…

    Submitted 12 February, 2024; originally announced February 2024.

  49. arXiv:2401.16700  [pdf, other]

    cs.CV cs.RO eess.IV

    Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

    Authors: Jianbin Jiao, Xina Cheng, Weijie Chen, Xiaoting Yin, Hao Shi, Kailun Yang

    Abstract: 3D human pose estimation captures the human joint points in three-dimensional space while keeping the depth information and physical structure. That is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Due to the challenges in data collection, mainstream datasets of 3D human pose estimation are pr…

    Submitted 25 March, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted to IJCNN 2024. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose

  50. arXiv:2401.15569  [pdf, ps, other]

    cs.CL

    Efficient Tuning and Inference for Large Language Models on Textual Graphs

    Authors: Yun Zhu, Yaoke Wang, Haizhou Shi, Siliang Tang

    Abstract: Rich textual and topological information of textual graphs needs to be modeled in real-world applications such as webpages, e-commerce, and academic articles. Practitioners have long been following the path of adopting a shallow text encoder and a subsequent graph neural network (GNN) to solve this problem. In light of recent advancements in large language models (LLMs), it is apparent that integra…

    Submitted 28 January, 2024; originally announced January 2024.