Showing 1–50 of 159 results for author: Hu, K

  1. arXiv:2404.16821  [pdf, other]

    cs.CV

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, et al. (10 additional authors not shown)

    Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual…

    Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Technical report

  2. arXiv:2404.06756  [pdf, other]

    cs.LG cs.AI

    CrimeAlarm: Towards Intensive Intent Dynamics in Fine-grained Crime Prediction

    Authors: Kaixi Hu, Lin Li, Qing Xie, Xiaohui Tao, Guandong Xu

    Abstract: Granularity and accuracy are two crucial factors for crime event prediction. Within fine-grained event classification, multiple criminal intents may alternately appear in preceding sequential events and progress differently in subsequent ones. Such intensive intent dynamics make it hard for trained models to capture unobserved intents, and thus lead to sub-optimal generalization performance, especially in the…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: Accepted by DASFAA 2024

  3. arXiv:2404.00364  [pdf, other]

    cs.RO cs.AI

    Accurate Cutting-point Estimation for Robotic Lychee Harvesting through Geometry-aware Learning

    Authors: Gengming Zhang, Hao Cao, Kewei Hu, Yaoqiang Pan, Yuqin Deng, Hongjun Wang, Hanwen Kang

    Abstract: Accurately identifying lychee-picking points in unstructured orchard environments and obtaining their coordinate locations is critical to the success of lychee-picking robots. However, traditional two-dimensional (2D) image-based object detection methods often struggle due to the complex geometric structures of branches, leaves and fruits, leading to incorrect determination of lychee picking point…

    Submitted 30 March, 2024; originally announced April 2024.

  4. arXiv:2403.16124  [pdf, other]

    cs.CV

    Enhancing Visual Continual Learning with Language-Guided Supervision

    Authors: Bolin Ni, Hongbo Zhao, Chenghao Zhang, Ke Hu, Gaofeng Meng, Zhaoxiang Zhang, Shiming Xiang

    Abstract: Continual learning (CL) aims to empower models to learn new tasks without forgetting previously acquired knowledge. Most prior works concentrate on the techniques of architectures, replay data, regularization, etc. However, the category name of each class is largely neglected. Existing methods commonly utilize the one-hot labels and randomly initialize the classifier head. We argue that the scarc…

    Submitted 24 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  5. arXiv:2403.15981  [pdf, other]

    cs.CV

    Exploring Accurate 3D Phenotyping in Greenhouse through Neural Radiance Fields

    Authors: Junhong Zhao, Wei Ying, Yaoqiang Pan, Zhenfeng Yi, Chao Chen, Kewei Hu, Hanwen Kang

    Abstract: Accurate collection of plant phenotyping is critical to optimising sustainable farming practices in precision agriculture. Traditional phenotyping in controlled laboratory environments, while valuable, falls short in understanding plant growth under real-world conditions. Emerging sensor and digital technologies offer a promising approach for direct phenotyping of plants in farm environments. This…

    Submitted 28 March, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

  6. arXiv:2402.09685  [pdf, other]

    cs.RO

    Pheno-Robot: An Auto-Digital Modelling System for In-Situ Phenotyping in the Field

    Authors: Yaoqiang Pan, Kewei Hu, Tianhao Liu, Chao Chen, Hanwen Kang

    Abstract: Accurate reconstruction of plant models for phenotyping analysis is critical for optimising sustainable agricultural practices in precision agriculture. Traditional laboratory-based phenotyping, while valuable, falls short of understanding how plants grow under uncontrolled conditions. Robotic technologies offer a promising avenue for large-scale, direct phenotyping in real-world environments. Thi…

    Submitted 14 February, 2024; originally announced February 2024.

  7. arXiv:2402.03093  [pdf, other]

    cs.CV cs.HC

    AI-Enhanced Virtual Reality in Medicine: A Comprehensive Survey

    Authors: Yixuan Wu, Kaiyuan Hu, Danny Z. Chen, Jian Wu

    Abstract: With the rapid advance of computer graphics and artificial intelligence technologies, the ways we interact with the world have undergone a transformative shift. Virtual Reality (VR) technology, aided by artificial intelligence (AI), has emerged as a dominant interaction medium in multiple application areas, thanks to its advantage of providing users with immersive experiences. Among those applicati…

    Submitted 5 February, 2024; originally announced February 2024.

  8. arXiv:2401.12789  [pdf, other]

    cs.CL cs.SD eess.AS

    Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

    Authors: W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath

    Abstract: In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  9. arXiv:2401.12433  [pdf, other]

    cs.CV

    A Novel Garment Transfer Method Supervised by Distilled Knowledge of Virtual Try-on Model

    Authors: Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Kerui Hu, Jianrong Tan

    Abstract: This paper proposes a novel garment transfer method supervised with knowledge distillation from virtual try-on. Our method first reasons the transfer parsing to provide shape prior to downstream tasks. We employ a multi-phase teaching strategy to supervise the training of the transfer parsing reasoning model, learning the response and feature knowledge from the try-on parsing reasoning model. To c…

    Submitted 4 April, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

  10. arXiv:2401.11874  [pdf, other]

    cs.CV

    Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis

    Authors: Jiawei Wang, Kai Hu, Zhuoyao Zhong, Lei Sun, Qiang Huo

    Abstract: Document structure analysis (aka document layout analysis) is crucial for understanding the physical layout and logical structure of documents, with applications in information retrieval, document summarization, knowledge extraction, etc. In this paper, we concentrate on Hierarchical Document Structure Analysis (HDSA) to explore hierarchical relationships within structured documents created using…

    Submitted 28 March, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Submitted to Pattern Recognition

  11. arXiv:2401.09232  [pdf, other]

    cs.CV

    Dynamic Relation Transformer for Contextual Text Block Detection

    Authors: Jiawei Wang, Shunchi Zhang, Kai Hu, Chixiang Ma, Zhuoyao Zhong, Lei Sun, Qiang Huo

    Abstract: Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks within the complexity of natural scenes. Previous methodologies have treated CTBD as either a visual relation extraction challenge within computer vision or as a sequence modeling problem from the perspective of natural language processing. We introduce a new framework that frames CTBD as a graph generation prob…

    Submitted 17 January, 2024; originally announced January 2024.

  12. arXiv:2401.09220  [pdf, other]

    cs.CL

    UniVIE: A Unified Label Space Approach to Visual Information Extraction from Form-like Documents

    Authors: Kai Hu, Jiawei Wang, Weihong Lin, Zhuoyao Zhong, Lei Sun, Qiang Huo

    Abstract: Existing methods for Visual Information Extraction (VIE) from form-like documents typically fragment the process into separate subtasks, such as key information extraction, key-value pair extraction, and choice group extraction. However, these approaches often overlook the hierarchical structure of form documents, including hierarchical key-value pairs and hierarchical choice groups. To address th…

    Submitted 17 January, 2024; originally announced January 2024.

  13. arXiv:2401.07487  [pdf, other]

    cs.RO cs.CV

    Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation

    Authors: Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, Huazhe Xu

    Abstract: Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in the understanding of semantic correspondence among objects, which naturally transfers the interaction experience of familiar objects to novel ones. Although robots lack such a reservoir of interaction experience, the vas…

    Submitted 15 January, 2024; originally announced January 2024.

  14. arXiv:2401.04325  [pdf, other]

    cs.CV

    RadarCam-Depth: Radar-Camera Fusion for Depth Estimation with Learned Metric Scale

    Authors: Han Li, Yukai Ma, Yaqing Gu, Kewei Hu, Yong Liu, Xingxing Zuo

    Abstract: We present a novel approach for metric dense depth estimation based on the fusion of a single-view image and a sparse, noisy Radar point cloud. The direct fusion of heterogeneous Radar and image data, or their encodings, tends to yield dense depth maps with significant artifacts, blurred boundaries, and suboptimal accuracy. To circumvent this issue, we learn to augment versatile and robust monocul…

    Submitted 19 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

  15. Developing Flying Explorer for Autonomous Digital Modelling in Wild Unknowns

    Authors: Naizhong Zhang, Yaoqiang Pan, Yangwen Jin, Peiqi Jin, Kewei Hu, Xiao Huang, Hanwen Kang

    Abstract: This work presents an innovative solution for robotic odometry, path planning and exploration in wild unknown environments, focusing on digital modelling. The approach uses a minimum cost formulation with pseudo-randomly generated objectives, integrating multi-path planning and evaluation, with emphasis on full coverage of unknown maps based on feasible boundaries of interest. The evaluation carri…

    Submitted 29 December, 2023; originally announced December 2023.

  16. arXiv:2312.14481  [pdf, other]

    cs.CV cs.AI cs.RO

    SurgicalPart-SAM: Part-to-Whole Collaborative Prompting for Surgical Instrument Segmentation

    Authors: Wenxi Yue, Jing Zhang, Kun Hu, Qiuxia Wu, Zongyuan Ge, Yong Xia, Jiebo Luo, Zhiyong Wang

    Abstract: The Segment Anything Model (SAM) exhibits promise in generic object segmentation and offers potential for various applications. Existing methods have applied SAM to surgical instrument segmentation (SIS) by tuning SAM-based frameworks with surgical data. However, they fall short in two crucial aspects: (1) Straightforward model tuning with instrument masks treats each instrument as a single entity…

    Submitted 22 March, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: Technical Report. The source code will be released at https://github.com/wenxi-yue/SurgicalPart-SAM

  17. arXiv:2312.10073  [pdf, other]

    cs.IR cs.AI

    Data Scarcity in Recommendation Systems: A Survey

    Authors: Zefeng Chen, Wensheng Gan, Jiayang Wu, Kaixia Hu, Hong Lin

    Abstract: The prevalence of online content has led to the widespread adoption of recommendation systems (RSs), which serve diverse purposes such as news, advertisements, and e-commerce recommendations. Despite their significance, data scarcity issues have significantly impaired the effectiveness of existing RS models and hindered their progress. To address this challenge, the concept of knowledge transfer,…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: ACM Transactions on Recommender Systems, 32 pages

  18. arXiv:2312.06951  [pdf, other]

    cs.LG cs.DC

    Feature Norm Regularized Federated Learning: Transforming Skewed Distributions into Global Insights

    Authors: Ke Hu, WeiDong Qiu, Peng Tang

    Abstract: In the field of federated learning, addressing non-independent and identically distributed (non-i.i.d.) data remains a quintessential challenge for improving global model performance. This work introduces the Feature Norm Regularized Federated Learning (FNR-FL) algorithm, which uniquely incorporates class average feature norms to enhance model accuracy and convergence in non-i.i.d. scenarios. Our…

    Submitted 11 December, 2023; originally announced December 2023.

  19. arXiv:2312.00710  [pdf, other]

    cs.LG stat.ME stat.ML

    SpaCE: The Spatial Confounding Environment

    Authors: Mauricio Tec, Ana Trisovic, Michelle Audirac, Sophie Woodward, Jie Kate Hu, Naeem Khoshnevis, Francesca Dominici

    Abstract: Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating caus…

    Submitted 5 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

  20. High-fidelity 3D Reconstruction of Plants using Neural Radiance Field

    Authors: Kewei Hu, Ying Wei, Yaoqiang Pan, Hanwen Kang, Chao Chen

    Abstract: Accurate reconstruction of plant phenotypes plays a key role in optimising sustainable farming practices in the field of Precision Agriculture (PA). Currently, optical sensor-based approaches dominate the field, but the need for high-fidelity 3D reconstruction of crops and plants in unstructured agricultural environments remains challenging. Recently, a promising development has emerged in the for…

    Submitted 7 November, 2023; originally announced November 2023.

  21. arXiv:2311.03351  [pdf, other]

    cs.LG cs.RO

    Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

    Authors: Kun Lei, Zhengmao He, Chenhao Lu, Kaizhe Hu, Yang Gao, Huazhe Xu

    Abstract: Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: Can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we…

    Submitted 17 March, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: Our website: https://lei-kun.github.io/uni-o4/

  22. arXiv:2311.01444  [pdf, other]

    cs.CV cs.RO

    LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds

    Authors: Anqi Joyce Yang, Sergio Casas, Nikita Dvornik, Sean Segal, Yuwen Xiong, Jordan Sir Kwang Hu, Carter Fang, Raquel Urtasun

    Abstract: A major bottleneck to scaling up training of self-driving perception systems is the human annotations required for supervision. A promising alternative is to leverage "auto-labelling" offboard perception models that are trained to automatically generate annotations from raw LiDAR point clouds at a fraction of the cost. Auto-labels are most commonly generated via a two-stage approach -- first obje…

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: 20 pages, 8 figures, 7 tables

    Journal ref: CoRL 2023

  23. arXiv:2310.09361  [pdf, other]

    cs.LG

    Is Certifying $\ell_p$ Robustness Still Worthwhile?

    Authors: Ravi Mangal, Klas Leino, Zifan Wang, Kai Hu, Weicheng Yu, Corina Pasareanu, Anupam Datta, Matt Fredrikson

    Abstract: Over the years, researchers have developed myriad attacks that exploit the ubiquity of adversarial examples, as well as defenses that aim to guard against the security vulnerabilities posed by such attacks. Of particular interest to this paper are defenses that provide provable guarantees against the class of $\ell_p$-bounded attacks. Certified defenses have made significant progress, taking robus…

    Submitted 13 October, 2023; originally announced October 2023.

  24. arXiv:2310.08205  [pdf, other]

    cs.MM cs.HC

    LiveVV: Human-Centered Live Volumetric Video Streaming System

    Authors: Kaiyuan Hu, Yongting Chen, Kaiying Han, Junhua Liu, Haowen Yang, Yili Jin, Boyan Li, Fangxin Wang

    Abstract: Volumetric video has emerged as a prominent medium within the realm of eXtended Reality (XR) with the advancements in computer graphics and depth capture hardware. Users can fully immerse themselves in volumetric video with the ability to switch their viewport in six degrees of freedom (DoF), including three rotational dimensions (yaw, pitch, roll) and three translational dimensions (X, Y, Z). Di…

    Submitted 18 October, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

  25. arXiv:2310.04673  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

    Authors: Jiaming Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

    Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks. However, there has been limited research on applying similar frameworks to audio tasks. Previously proposed large language models for audio tasks either lack sufficient quantitative evaluations, or are limited to tasks for recognizing and understanding audio content, o…

    Submitted 10 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: 10 pages, under review

  26. arXiv:2310.02513  [pdf, other]

    cs.LG

    A Recipe for Improved Certifiable Robustness: Capacity and Data

    Authors: Kai Hu, Klas Leino, Zifan Wang, Matt Fredrikson

    Abstract: A key challenge, supported both theoretically and empirically, is that robustness demands greater network capacity and more data than standard training. However, effectively adding capacity under stringent Lipschitz constraints has proven more difficult than it may seem, as evidenced by the fact that state-of-the-art approaches tend more towards underfitting than overfitting. Moreover, we posit th…

    Submitted 3 October, 2023; originally announced October 2023.

  27. arXiv:2310.00808  [pdf, other]

    cs.CV

    Completing Visual Objects via Bridging Generation and Segmentation

    Authors: Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu

    Abstract: This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components. Our method, named MaskComp, delineates the completion process through iterative stages of generation and segmentation. In each iteration, the object mask is provided as an additional condition to boost image generation, and, in return, the gene…

    Submitted 2 February, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

  28. arXiv:2309.17059  [pdf, other]

    cs.CV

    GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation

    Authors: Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Zheyuan Zhou, Kerui Hu

    Abstract: Depth estimation provides an alternative approach for perceiving 3D information in autonomous driving. Monocular depth estimation, whether with single-frame or multi-frame inputs, has achieved significant success by learning various types of cues and specializing in either static or dynamic scenes. Recently, the fusion of these cues has become an attractive topic, aiming to enable the combined cues to perfo…

    Submitted 4 December, 2023; v1 submitted 29 September, 2023; originally announced September 2023.

  29. arXiv:2309.10218  [pdf, other]

    cs.CY cs.HC

    A Hierarchy-based Analysis Approach for Blended Learning: A Case Study with Chinese Students

    Authors: Yu Ye, Gongjin Zhang, Hongbiao Si, Liang Xu, Shenghua Hu, Yong Li, Xulong Zhang, Kaiyu Hu, Fangzhou Ye

    Abstract: Blended learning is generally defined as the combination of traditional face-to-face learning and online learning. This learning mode has been widely used in advanced education across the globe due to the COVID-19 pandemic's social distancing restrictions as well as the development of technology. Online learning plays an important role in blended learning, and as it requires more student autonomy, th…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted by the 7th APWeb-WAIM International Joint Conference on Web and Big Data. (APWeb 2023)

  30. arXiv:2309.09039  [pdf, other]

    cs.CV

    Microscale 3-D Capacitance Tomography with a CMOS Sensor Array

    Authors: Manar Abdelatty, Joseph Incandela, Kangping Hu, Joseph W. Larkin, Sherief Reda, Jacob K. Rosenstein

    Abstract: Electrical capacitance tomography (ECT) is a nonoptical imaging technique in which a map of the interior permittivity of a volume is estimated by making capacitance measurements at its boundary and solving an inverse problem. While previous ECT demonstrations have often been at centimeter scales, ECT is not limited to macroscopic systems. In this paper, we demonstrate ECT imaging of polymer micros…

    Submitted 2 December, 2023; v1 submitted 16 September, 2023; originally announced September 2023.

  31. arXiv:2309.07405  [pdf, other]

    cs.SD cs.AI eess.AS

    FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

    Authors: Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng

    Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as…

    Submitted 6 October, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: 5 pages, 3 figures, submitted to ICASSP 2024

  32. arXiv:2309.05658  [pdf, other]

    cs.MM cs.NI eess.IV

    From Capture to Display: A Survey on Volumetric Video

    Authors: Yili Jin, Kaiyuan Hu, Junhua Liu, Fangxin Wang, Xue Liu

    Abstract: Volumetric video, which offers immersive viewing experiences, is gaining increasing prominence. With its six degrees of freedom, it provides viewers with greater immersion and interactivity compared to traditional videos. Despite their potential, volumetric video services pose significant challenges. This survey conducts a comprehensive review of the existing literature on volumetric video. We fi…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: Submitted

  33. arXiv:2309.03467  [pdf, other]

    cs.CV cs.AI

    Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation

    Authors: Zhuqiang Lu, Kun Hu, Chaoyue Wang, Lei Bai, Zhiyong Wang

    Abstract: A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been an increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, for providing immersive experiences in various scenarios such as virtual reality. Yet, existing methods typically fall short i…

    Submitted 8 April, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: Accepted by AAAI 24

    ACM Class: I.4.0

  34. arXiv:2308.16763  [pdf, other]

    cs.CL cs.AI

    Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection

    Authors: Kairui Hu, Ming Yan, Joey Tianyi Zhou, Ivor W. Tsang, Wen Haw Chong, Yong Keong Yap

    Abstract: Stance detection aims to identify the attitude expressed in a document towards a given target. Techniques such as Chain-of-Thought (CoT) prompting have advanced this task, enhancing a model's reasoning capabilities through the derivation of intermediate rationales. However, CoT relies primarily on a model's pre-trained internal knowledge during reasoning, thereby neglecting the valuable external i…

    Submitted 7 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: 5 pages, 2 figures, 2 tables

  35. A Novel Perception and Semantic Mapping Method for Robot Autonomy in Orchards

    Authors: Yaoqiang Pan, Hao Cao, Kewei Hu, Hanwen Kang, Xing Wang

    Abstract: Agricultural robots must navigate challenging dynamic and semi-structured environments. Recently, environmental modeling using LiDAR-based SLAM has shown promise in providing highly accurate geometry. However, how this chaotic environmental information can be used to achieve effective robot automation in the agricultural sector remains unexplored. In this study, we propose a novel semantic mapping…

    Submitted 23 November, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

  36. arXiv:2308.16725  [pdf, other]

    cs.CV cs.AI cs.MM

    Terrain Diffusion Network: Climatic-Aware Terrain Generation with Geological Sketch Guidance

    Authors: Zexin Hu, Kun Hu, Clinton Mo, Lei Pan, Zhiyong Wang

    Abstract: Sketch-based terrain generation seeks to create realistic landscapes for virtual environments in various applications such as computer games, animation and virtual reality. Recently, deep learning based terrain generation has emerged, notably the ones based on generative adversarial networks (GAN). However, these methods often struggle to fulfill the requirements of flexible user control and maint…

    Submitted 31 August, 2023; originally announced August 2023.

  37. arXiv:2308.13273  [pdf, other]

    cs.CV cs.MM

    Bridging the Gap: Fine-to-Coarse Sketch Interpolation Network for High-Quality Animation Sketch Inbetweening

    Authors: Jiaming Shen, Kun Hu, Wei Bao, Chang Wen Chen, Zhiyong Wang

    Abstract: The 2D animation workflow is typically initiated with the creation of keyframes using sketch-based drawing. Subsequent inbetweens (i.e., intermediate sketch frames) are crafted through manual interpolation for smooth animations, which is a labor-intensive process. Thus, the prospect of automatic animation sketch interpolation has become highly appealing. However, existing video interpolation metho…

    Submitted 25 August, 2023; originally announced August 2023.

    Comments: 7 pages, 6 figures

  38. arXiv:2308.09302  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms

    Authors: Penghui Wen, Kun Hu, Wenxi Yue, Sen Zhang, Wanlei Zhou, Zhiyong Wang

    Abstract: Robust audio anti-spoofing has been increasingly challenging due to the recent advancements in deepfake techniques. While spectrograms have demonstrated their capability for anti-spoofing, complementary information presented in multi-order spectral patterns has not been well explored, which limits their effectiveness for varying spoofing attacks. Therefore, we propose a novel deep learning method…

    Submitted 18 August, 2023; originally announced August 2023.

  39. arXiv:2308.08746  [pdf, other]

    cs.CV cs.AI cs.RO

    SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

    Authors: Wenxi Yue, Jing Zhang, Kun Hu, Yong Xia, Jiebo Luo, Zhiyong Wang

    Abstract: The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgic…

    Submitted 21 December, 2023; v1 submitted 16 August, 2023; originally announced August 2023.

    Comments: AAAI 2024. The source code is available at https://github.com/wenxi-yue/SurgicalSAM

  40. arXiv:2308.08717  [pdf, other]

    cs.CV cs.AI

    EdgeMA: Model Adaptation System for Real-Time Video Analytics on Edge Devices

    Authors: Liang Wang, Nan Zhang, Xiaoyang Qu, Jianzong Wang, Jiguang Wan, Guokuan Li, Kaiyu Hu, Guilin Jiang, Jing Xiao

    Abstract: Real-time video analytics on edge devices for changing scenes remains a difficult task. As edge devices are usually resource-constrained, edge deep neural networks (DNNs) have fewer weights and shallower architectures than general DNNs. As a result, they only perform well in limited scenarios and are sensitive to data drift. In this paper, we introduce EdgeMA, a practical and efficient video analy…

    Submitted 16 August, 2023; originally announced August 2023.

    Comments: Accepted by 30th International Conference on Neural Information Processing (ICONIP 2023)

  41. arXiv:2308.07578  [pdf]

    cs.MM

    Understanding User Behavior in Volumetric Video Watching: Dataset, Analysis and Prediction

    Authors: Kaiyuan Hu, Haowen Yang, Yili Jin, Junhua Liu, Yongting Chen, Miao Zhang, Fangxin Wang

    Abstract: Volumetric video has emerged as a new attractive video paradigm in recent years since it provides an immersive and interactive 3D viewing experience with six degrees of freedom (DoF). Unlike traditional 2D or panoramic videos, volumetric videos require dense point clouds, voxels, meshes, or huge neural models to depict volumetric scenes, which results in a prohibitively high bandwidth burden for video…

    Submitted 16 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

    Comments: Accepted by ACM MM'23

  42. arXiv:2308.06125  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Improving Joint Speech-Text Representations Without Alignment

    Authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

    Abstract: The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these metho…

    Submitted 11 August, 2023; originally announced August 2023.

    Journal ref: INTERSPEECH 2023

  43. arXiv:2307.10224  [pdf, other]

    cs.AI

    RL-ViGen: A Reinforcement Learning Benchmark for Visual Generalization

    Authors: Zhecheng Yuan, Sizhe Yang, Pu Hua, Can Chang, Kaizhe Hu, Huazhe Xu

    Abstract: Visual Reinforcement Learning (Visual RL), coupled with high-dimensional observations, has consistently confronted the long-standing challenge of out-of-distribution generalization. Despite the focus on algorithms aimed at resolving visual generalization problems, we argue that the devil is in the existing benchmarks as they are restricted to isolated tasks and generalization categories, undermini…

    Submitted 26 September, 2023; v1 submitted 15 July, 2023; originally announced July 2023.

  44. arXiv:2307.07837  [pdf, other]

    cs.RO

    Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning?

    Authors: Jialu Gao, Kaizhe Hu, Guowei Xu, Huazhe Xu

    Abstract: Pre-trained text-to-image generative models can produce diverse, semantically rich, and realistic images from natural language descriptions. Compared with language, images usually convey information with more details and less ambiguity. In this study, we propose Learning from the Void (LfVoid), a method that leverages the power of pre-trained text-to-image models and advanced image editing techniq…

    Submitted 15 July, 2023; originally announced July 2023.

  45. arXiv:2305.20006  [pdf, other]

    eess.IV cs.CV

    Physics-Informed Ensemble Representation for Light-Field Image Super-Resolution

    Authors: Manchang Jin, Gaosheng Liu, Kunshu Hu, Xin Luo, Kun Li, Jingyu Yang

    Abstract: Recent learning-based approaches have achieved significant progress in light field (LF) image super-resolution (SR) by exploring convolution-based or transformer-based network structures. However, LF imaging has many intrinsic physical priors that have not been fully exploited. In this paper, we analyze the coordinate transformation of the LF imaging process to reveal the geometric relationship in…

    Submitted 31 May, 2023; originally announced May 2023.

  46. arXiv:2305.15663  [pdf, other]

    cs.CL cs.SD eess.AS

    Mixture-of-Expert Conformer for Streaming Multilingual ASR

    Authors: Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays

    Abstract: End-to-end models with large capacity have significantly improved multilingual automatic speech recognition, but their computation cost poses challenges for on-device applications. We propose a streaming truly multilingual Conformer incorporating mixture-of-expert (MoE) layers that learn to only activate a subset of parameters in training and inference. The MoE layer consists of a softmax gate whi…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  47. arXiv:2304.08956  [pdf, other]

    cs.CV

    PG-VTON: A Novel Image-Based Virtual Try-On Method via Progressive Inference Paradigm

    Authors: Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Kerui Hu

    Abstract: Virtual try-on is a promising computer vision topic with a high commercial value wherein a new garment is visually worn on a person with a photo-realistic effect. Previous studies conduct their shape and content inference at one stage, employing a single-scale warping mechanism and a relatively unsophisticated content inference mechanism. These approaches have led to suboptimal results in terms of…

    Submitted 4 December, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

  48. arXiv:2304.07957  [pdf, other]

    cs.CL

    A Question-Answering Approach to Key Value Pair Extraction from Form-like Document Images

    Authors: Kai Hu, Zhuoyuan Wu, Zhuoyao Zhong, Weihong Lin, Lei Sun, Qiang Huo

    Abstract: In this paper, we present a new question-answering (QA) based key-value pair extraction approach, called KVPFormer, to robustly extract key-value relationships between entities from form-like document images. Specifically, KVPFormer first identifies key entities from all entities in an image with a Transformer encoder, then takes these key entities as questions and feeds them into a Tr…

    Submitted 16 April, 2023; originally announced April 2023.

    Comments: AAAI 2023

  49. arXiv:2304.03650  [pdf, other]

    cs.CV

    A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation

    Authors: Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Kerui Hu, Kang Wang

    Abstract: As bird's-eye-view (BEV) semantic segmentation is simple to visualize and easy to handle, it has been applied in autonomous driving to provide the surrounding information to downstream tasks. Inferring BEV semantic segmentation conditioned on multi-camera-view images is a popular scheme in the community because it needs only cheap devices and supports real-time processing. The recent work implemented this task by learning th…

    Submitted 17 August, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

  50. arXiv:2303.15293  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    A Deliberation-based Joint Acoustic and Text Decoder

    Authors: Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu

    Abstract: We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass dec…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: Interspeech 2021