Showing 1–50 of 2,061 results for author: Wu, J

Searching in archive cs.
  1. arXiv:2405.05941  [pdf, other]

    cs.RO cs.CV cs.LG

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Authors: Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, Ted Xiao

    Abstract: The field of robotics has made significant advances towards generalist robot manipulation policies. However, real-world evaluation of such policies is not scalable and faces reproducibility challenges, which are likely to worsen as policies broaden the spectrum of tasks they can perform. We identify control and visual disparities between real and simulated environments as key challenges for reliab…

    Submitted 9 May, 2024; originally announced May 2024.

  2. arXiv:2405.05876  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Composable Part-Based Manipulation

    Authors: Weiyu Liu, Jiayuan Mao, Joy Hsu, Tucker Hermans, Animesh Garg, Jiajun Wu

    Abstract: In this paper, we propose composable part-based manipulation (CPM), a novel approach that leverages object-part decomposition and part-part correspondences to improve learning and generalization of robotic manipulation skills. By considering the functional correspondences between object parts, we conceptualize functional actions, such as pouring and constrained placing, as combinations of differen…

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: Presented at CoRL 2023. For videos and additional results, see our website: https://cpmcorl2023.github.io/

  3. arXiv:2405.05741  [pdf, ps, other]

    cs.CL cs.AI

    Can large language models understand uncommon meanings of common words?

    Authors: Jinyang Wu, Feihu Che, Xinxin Zheng, Shuai Zhang, Ruihan Jin, Shuai Nie, Pengpeng Shao, Jianhua Tao

    Abstract: Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, lacking widely acknowledged testing mechanisms, answering "whether LLMs are stochastic parrots or genuinely comprehend the world" remains unclear, fostering numerous studies and sparking heated debates. P…

    Submitted 9 May, 2024; originally announced May 2024.

  4. arXiv:2405.05538  [pdf, other]

    cs.CV

    A Survey on Personalized Content Synthesis with Diffusion Models

    Authors: Xulu Zhang, Xiao-Yong Wei, Wengyu Zhang, Jinlin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li

    Abstract: Recent advancements in generative models have significantly impacted content creation, leading to the emergence of Personalized Content Synthesis (PCS). With a small set of user-provided examples, PCS aims to customize the subject of interest to specific user-defined prompts. Over the past two years, more than 150 methods have been proposed. However, existing surveys mainly focus on text-to-image…

    Submitted 9 May, 2024; originally announced May 2024.

  5. arXiv:2405.05497  [pdf, other]

    cs.CV

    Multi-Level Feature Fusion Network for Lightweight Stereo Image Super-Resolution

    Authors: Yunxiang Li, Wenbin Zou, Qiaomu Wei, Feng Huang, Jing Wu

    Abstract: Stereo image super-resolution utilizes the cross-view complementary information brought by the disparity effect of left and right perspective images to reconstruct higher-quality images. Cascading feature extraction modules and cross-view feature interaction modules to make use of the information from stereo images is the focus of numerous methods. However, this adds a great deal of network parame…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 10 pages, 7 figures, CVPRWorkshop NTIRE2024

  6. arXiv:2405.04332  [pdf, other]

    cs.CR

    WALLETRADAR: Towards Automating the Detection of Vulnerabilities in Browser-based Cryptocurrency Wallets

    Authors: Pengcheng Xia, Yanhui Guo, Zhaowen Lin, Jun Wu, Pengbo Duan, Ningyu He, Kailong Wang, Tianming Liu, Yinliang Yue, Guoai Xu, Haoyu Wang

    Abstract: Cryptocurrency wallets, acting as fundamental infrastructure to the blockchain ecosystem, have seen significant user growth, particularly among browser-based wallets (i.e., browser extensions). However, this expansion accompanies security challenges, making these wallets prime targets for malicious activities. Despite a substantial user base, there is not only a significant gap in comprehensive se…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Just accepted by the Automated Software Engineering Journal

  7. arXiv:2405.04286  [pdf, other]

    cs.CL

    Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore

    Authors: Junchao Wu, Runzhe Zhan, Derek F. Wong, Shu Yang, Xuebo Liu, Lidia S. Chao, Min Zhang

    Abstract: The efficacy of a large language model (LLM) generated text detector depends substantially on the availability of sizable training data. White-box zero-shot detectors, which require no such data, are nonetheless limited by the accessibility of the source model of the LLM-generated text. In this paper, we propose a simple but effective black-box zero-shot detection approach, predicated on the obs…

    Submitted 7 May, 2024; originally announced May 2024.

  8. arXiv:2405.04167  [pdf, other]

    cs.CV eess.IV

    Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment

    Authors: Aobo Li, Jinjian Wu, Yongxu Liu, Leida Li

    Abstract: The annotation of blind image quality assessment (BIQA) is labor-intensive and time-consuming, especially for authentic images. Training on synthetic data is expected to be beneficial, but synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that introducing more distortion types in the synthetic dataset may…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR2024

  9. arXiv:2405.03977  [pdf, other]

    cs.DL cs.AI cs.LG

    Can citations tell us about a paper's reproducibility? A case study of machine learning papers

    Authors: Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu

    Abstract: The iterative character of work in machine learning (ML) and artificial intelligence (AI) and reliance on comparisons against benchmark datasets emphasize the importance of reproducibility in that literature. Yet, resource constraints and inadequate documentation can make running replications particularly challenging. Our work explores the potential of using downstream citation contexts as a signa…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 9 pages, 4 figures

  10. arXiv:2405.03864  [pdf, other]

    cs.RO cs.AI

    Learning Planning Abstractions from Language

    Authors: Weiyu Liu, Geng Chen, Joy Hsu, Jiayuan Mao, Jiajun Wu

    Abstract: This paper presents a framework for learning state and action abstractions in sequential decision-making domains. Our framework, planning abstraction from language (PARL), utilizes language-annotated demonstrations to automatically discover a symbolic and abstract action space and induce a latent state abstraction based on it. PARL consists of three stages: 1) recovering object-level and action co…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: The first two authors contributed equally. The last two authors provide equal advising. Project website: https://parl2024.github.io/

  11. arXiv:2405.02288  [pdf, other]

    cs.CV cs.AI cs.RO

    Prospective Role of Foundation Models in Advancing Autonomous Vehicles

    Authors: Jianhua Wu, Bingzhao Gao, Jincheng Gao, Jianhao Yu, Hongqing Chu, Qiankun Yu, Xun Gong, Yi Chang, H. Eric Tseng, Hong Chen, Jie Chen

    Abstract: With the development of artificial intelligence and breakthroughs in deep learning, large-scale Foundation Models (FMs), such as GPT, CLIP, etc., have achieved remarkable results in many fields including natural language processing and computer vision. The application of FMs in autonomous driving holds considerable promise. For example, they can contribute to enhance scene understanding and reason…

    Submitted 8 December, 2023; originally announced May 2024.

    Comments: 36 pages, 5 figures

  12. arXiv:2405.02023  [pdf, other]

    cs.CV

    IFNet: Deep Imaging and Focusing for Handheld SAR with Millimeter-wave Signals

    Authors: Yadong Li, Dongheng Zhang, Ruixu Geng, Jincheng Wu, Yang Hu, Qibin Sun, Yan Chen

    Abstract: Recent advancements have showcased the potential of handheld millimeter-wave (mmWave) imaging, which applies synthetic aperture radar (SAR) principles in portable settings. However, existing studies addressing handheld motion errors either rely on costly tracking devices or employ simplified imaging models, leading to impractical deployment or limited performance. In this paper, we present IFNet,…

    Submitted 5 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  13. arXiv:2405.00461  [pdf, other]

    cs.RO cs.AI cs.CL cs.HC

    Enhancing Surgical Robots with Embodied Intelligence for Autonomous Ultrasound Scanning

    Authors: Huan Xu, Jinlin Wu, Guanglin Cao, Zhen Lei, Zhen Chen, Hongbin Liu

    Abstract: Ultrasound robots are increasingly used in medical diagnostics and early disease screening. However, current ultrasound robots lack the intelligence to understand human intentions and instructions, hindering autonomous ultrasound scanning. To solve this problem, we propose a novel Ultrasound Embodied Intelligence system that equips ultrasound robots with the large language model (LLM) and domain k…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: ICRA 2024 Full-day Workshop: C4SR+: Continuum, Compliant, Cooperative, Cognitive

  14. arXiv:2404.19696  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

    Authors: Chun Feng, Joy Hsu, Weiyu Liu, Jiajun Wu

    Abstract: 3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting that learns from only 3D scene and QA pairs, where prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regulariz…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: CVPR 2024. The first two authors contributed equally

  15. arXiv:2404.19401  [pdf, other]

    cs.CV

    UniFS: Universal Few-shot Instance Perception with Point Representations

    Authors: Sheng Jin, Ruijie Yao, Lumin Xu, Wentao Liu, Chen Qian, Ji Wu, Ping Luo

    Abstract: Instance perception tasks (object detection, instance segmentation, pose estimation, counting) play a key role in industrial applications of visual models. As supervised learning methods suffer from high labeling cost, few-shot learning methods which effectively learn from a limited number of labeled examples are desired. Existing few-shot learning methods primarily focus on a restricted set of ta…

    Submitted 30 April, 2024; originally announced April 2024.

  16. arXiv:2404.19289  [pdf, other]

    cs.CV cs.LG

    On Improving the Algorithm-, Model-, and Data- Efficiency of Self-Supervised Learning

    Authors: Yun-Hao Cao, Jianxin Wu

    Abstract: Self-supervised learning (SSL) has developed rapidly in recent years. However, most of the mainstream methods are computationally expensive and rely on two (or more) augmentations for each image to construct positive pairs. Moreover, they mainly focus on large models and large-scale datasets, which lack flexibility and feasibility in many practical applications. In this paper, we propose an effici…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: 13 pages, 7 figures

  17. arXiv:2404.18816  [pdf, other]

    cs.CR cs.SE

    AppPoet: Large Language Model based Android malware detection via multi-view prompt engineering

    Authors: Wenxiang Zhao, Juntao Wu, Zhaoyi Meng

    Abstract: Due to the vast array of Android applications, their multifarious functions and intricate behavioral semantics, attackers can adopt various tactics to conceal their genuine attack intentions within legitimate functions. However, numerous feature engineering based methods suffer from a limitation in mining behavioral semantic information, thus impeding the accuracy and efficiency of Android malware…

    Submitted 29 April, 2024; originally announced April 2024.

  18. arXiv:2404.18758  [pdf, other]

    cs.CV cs.LG

    Transitive Vision-Language Prompt Learning for Domain Generalization

    Authors: Liyuan Wang, Yan Jin, Zhen Chen, Jinlin Wu, Mengke Li, Yang Lu, Hanzi Wang

    Abstract: The vision-language pre-training has enabled deep models to make a huge step forward in generalizing across unseen domains. The recent learning method based on the vision-language pre-training model is a great tool for domain generalization and can solve this problem to a large extent. However, there are still some issues that an advancement still suffers from trading-off between domain invariance…

    Submitted 29 April, 2024; originally announced April 2024.

  19. arXiv:2404.18428  [pdf, other]

    cs.DB

    Geospatial Big Data: Survey and Challenges

    Authors: Jiayang Wu, Wensheng Gan, Han-Chieh Chao, Philip S. Yu

    Abstract: In recent years, geospatial big data (GBD) has obtained attention across various disciplines, categorized into big earth observation data and big human behavior data. Identifying geospatial patterns from GBD has been a vital research focus in the fields of urban management and environmental sustainability. This paper reviews the evolution of GBD mining and its integration with advanced artificial…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: IEEE JSTARS. 14 pages, 5 figures

  20. arXiv:2404.18359  [pdf, other]

    cs.CL cs.AI

    FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

    Authors: Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, Jinyang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He

    Abstract: In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choi…

    Submitted 28 April, 2024; originally announced April 2024.

  21. arXiv:2404.17297  [pdf, ps, other]

    cs.PL

    Denotation-based Compositional Compiler Verification

    Authors: Zheng Cheng, Jiyang Wu, Di Wang, Qinxiang Cao

    Abstract: A desired but challenging property of compiler verification is compositionality in the sense that the compilation correctness of a program can be deduced from that of its substructures ranging from statements, functions, and modules incrementally. Previously proposed approaches have devoted extensive effort to module-level compositionality based on small-step semantics and simulation theories. Thi…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: 38 pages, 8 figures

  22. arXiv:2404.16905  [pdf, other]

    cs.CL cs.SD eess.AS

    Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations

    Authors: Shen Zhang, Haojie Zhang, Jing Zhang, Xudong Zhang, Yimeng Zhuang, Jinting Wu

    Abstract: In human-computer interaction, it is crucial for agents to respond to humans by understanding their emotions. Unraveling the causes of emotions is more challenging. A new task named Multimodal Emotion-Cause Pair Extraction in Conversations is responsible for recognizing emotion and identifying causal expressions. In this study, we propose a multi-stage framework to generate emotion and extract the…

    Submitted 25 April, 2024; originally announced April 2024.

  23. arXiv:2404.16831  [pdf, other]

    cs.CV

    The Third Monocular Depth Estimation Challenge

    Authors: Jaime Spencer, Fabio Tosi, Matteo Poggi, Ripudaman Singh Arora, Chris Russell, Simon Hadfield, Richard Bowden, GuangYuan Zhou, ZhengXin Li, Qiang Rao, YiPing Bao, Xiao Liu, Dohyeong Kim, Jinseong Kim, Myunghyun Kim, Mykola Lavreniuk, Rui Li, Qing Mao, Jiang Wu, Yu Zhu, Jinqiu Sun, Yanning Zhang, Suraj Patni, Aradhye Agarwal, Chetan Arora, et al. (16 additional authors not shown)

    Abstract: This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 su…

    Submitted 27 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: To appear in CVPRW2024

  24. arXiv:2404.16375  [pdf, other]

    cs.CV cs.AI cs.CL

    List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

    Authors: An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang

    Abstract: Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance from GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these vis…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Preprint

  25. arXiv:2404.16323  [pdf, other]

    cs.CV

    DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction

    Authors: Jiamin Wu, Kenkun Liu, Han Gao, Xiaoke Jiang, Lei Zhang

    Abstract: In this paper, we study the problem of 3D reconstruction from a single-view RGB image and propose a novel approach called DIG3D for 3D object reconstruction and novel view synthesis. Our method utilizes an encoder-decoder framework which generates 3D Gaussians in decoder with the guidance of depth-aware image features from encoder. In particular, we introduce the use of deformable transformer, all…

    Submitted 25 April, 2024; originally announced April 2024.

  26. arXiv:2404.16077  [pdf, other]

    cs.PL cs.LG

    Supercompiler Code Optimization with Zero-Shot Reinforcement Learning

    Authors: Jialong Wu, Chaoyi Deng, Jianmin Wang, Mingsheng Long

    Abstract: Effective code optimization in compilers plays a central role in computer and software engineering. While compilers can be made to automatically search the optimization space without the need for user interventions, this is not a standard practice since the search is slow and cumbersome. Here we present CodeZero, an artificial intelligence agent trained extensively on large data to produce effecti…

    Submitted 24 April, 2024; originally announced April 2024.

  27. arXiv:2404.15806  [pdf, other]

    cs.LG

    Where to Mask: Structure-Guided Masking for Graph Masked Autoencoders

    Authors: Chuang Liu, Yuyao Wang, Yibing Zhan, Xueqi Ma, Dapeng Tao, Jia Wu, Wenbin Hu

    Abstract: Graph masked autoencoders (GMAE) have emerged as a significant advancement in self-supervised pre-training for graph-structured data. Previous GMAE models primarily utilize a straightforward random masking strategy for nodes or edges during training. However, this strategy fails to consider the varying significance of different nodes within the graph structure. In this paper, we investigate the po…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 9 pages, 3 Figures. Accepted by IJCAI 2024

  28. arXiv:2404.15660  [pdf, other]

    cs.CL

    KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering

    Authors: Xinxin Zheng, Feihu Che, Jinyang Wu, Shuai Zhang, Shuai Nie, Kang Liu, Jianhua Tao

    Abstract: Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks. A promising approach is to leverage evidence documents as extra supporting knowledge, which can be obtained through retrieval or generation. However, existing methods directly leverage the entire contents of the evidence document, which may introduce noise i…

    Submitted 24 April, 2024; originally announced April 2024.

  29. arXiv:2404.15451  [pdf, other]

    cs.CV

    CFPFormer: Feature-pyramid like Transformer Decoder for Segmentation and Detection

    Authors: Hongyi Cai, Mohammad Mahdinur Rahman, Jingyu Wu, Yulun Deng

    Abstract: Feature pyramids have been widely adopted in convolutional neural networks (CNNs) and transformers for tasks like medical image segmentation and object detection. However, the currently existing models generally focus on the Encoder-side Transformer to extract features, from which decoder improvement can bring further potential with well-designed architecture. We propose CFPFormer, a novel decoder…

    Submitted 23 April, 2024; originally announced April 2024.

  30. arXiv:2404.15449  [pdf, other]

    cs.CV cs.AI

    ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning

    Authors: Weifeng Chen, Jiacheng Zhang, Jie Wu, Hefeng Wu, Xuefeng Xiao, Liang Lin

    Abstract: The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) particularly has received significant attention due to its wide range of application scenarios like AI portrait and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) It is hard to maintain the identity…

    Submitted 23 April, 2024; originally announced April 2024.

  31. arXiv:2404.14716  [pdf, other]

    cs.CL cs.AI cs.CV cs.SD eess.AS

    Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities

    Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

    Abstract: Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayes…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: 16 pages, 6 figures

  32. arXiv:2404.14032  [pdf, other]

    cs.CV

    1st Place Solution to the 1st SkatingVerse Challenge

    Authors: Tao Sun, Yuanzi Fu, Kaicheng Yang, Jian Wu, Ziyong Feng

    Abstract: This paper presents the winning solution for the 1st SkatingVerse Challenge. We propose a method that involves several steps. To begin, we leverage the DINO framework to extract the Region of Interest (ROI) and perform precise cropping of the raw video footage. Subsequently, we employ three distinct models, namely Unmasked Teacher, UniformerV2, and InfoGCN, to capture different aspects of the data…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: 3 pages, 1st SkatingVerse Challenge, 18th IEEE International Conference on Automatic Face and Gesture Recognition workshop

  33. arXiv:2404.13686  [pdf, other]

    cs.CV

    Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis

    Authors: Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, Xuefeng Xiao

    Abstract: Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Current distillation techniques often dichotomize into two distinct aspects: i) ODE Trajectory Preservation; and ii) ODE Trajectory Reformulation. However, these approaches suffer from severe performance degra…

    Submitted 21 April, 2024; originally announced April 2024.

  34. arXiv:2404.13600  [pdf, other]

    cs.RO

    Are We Ready for Planetary Exploration Robots? The TAIL-Plus Dataset for SLAM in Granular Environments

    Authors: Zirui Wang, Chen Yao, Yangtao Ge, Guowei Shi, Ningbo Yang, Zheng Zhu, Kewei Dong, Hexiang Wei, Zhenzhong Jia, Jing Wu

    Abstract: So far, planetary surface exploration depends on various mobile robot platforms. The autonomous navigation and decision-making of these mobile robots in complex terrains largely rely on their terrain-aware perception, localization and mapping capabilities. In this paper we release the TAIL-Plus dataset, a new challenging dataset in deformable granular environments for planetary exploration robots,…

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: Accepted to the IEEE ICRA Workshop on Field Robotics 2024

  35. arXiv:2404.13192  [pdf, other]

    cs.CL cs.AI

    Heterogeneous Subgraph Transformer for Fake News Detection

    Authors: Yuchen Zhang, Xiaoxiao Ma, Jia Wu, Jian Yang, Hao Fan

    Abstract: Fake news is pervasive on social media, inflicting substantial harm on public discourse and societal well-being. We investigate the explicit structural information and textual features of news pieces by constructing a heterogeneous graph concerning the relations among news topics, entities, and content. Through our study, we reveal that fake news can be effectively detected in terms of the atypica…

    Submitted 19 April, 2024; originally announced April 2024.

  36. arXiv:2404.13059  [pdf, other]

    math.OC cs.CE cs.GR

    Regularization in Space-Time Topology Optimization for Multi-Axis Additive Manufacturing

    Authors: Weiming Wang, Kai Wu, Fred van Keulen, Jun Wu

    Abstract: In additive manufacturing, the fabrication sequence has a large influence on the quality of manufactured components. While planning of the fabrication sequence is typically performed after the component has been designed, recent developments have demonstrated the possibility and benefits of simultaneous optimization of both the structural layout and the corresponding fabrication sequence. The simu…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: 23 pages, 14 figures

  37. arXiv:2404.13026  [pdf, other]

    cs.CV cs.AI

    PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

    Authors: Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y. Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, William T. Freeman

    Abstract: Realistic object interactions are crucial for creating immersive virtual experiences, yet synthesizing realistic 3D object dynamics in response to novel interactions remains a significant challenge. Unlike unconditional or text-conditioned dynamics generation, action-conditioned dynamics requires perceiving the physical material properties of objects and grounding the 3D motion prediction on these…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: Project website at: https://physdreamer.github.io/

  38. arXiv:2404.13013  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

    Authors: Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi

    Abstract: We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encode…

    Submitted 19 April, 2024; originally announced April 2024.

  39. arXiv:2404.12702  [pdf, other]

    cs.CV

    Modeling Multi-Granularity Context Information Flow for Pavement Crack Detection

    Authors: Junbiao Pang, Baocheng Xiong, Jiaqi Wu

    Abstract: Crack detection has become an indispensable, interesting yet challenging task in the computer vision community. Specially, pavement cracks have a highly complex spatial structure, a low contrasting background and a weak spatial continuity, posing a significant challenge to an effective crack detection method. In this paper, we address these problems from a view that utilizes contexts of the cracks…

    Submitted 19 April, 2024; originally announced April 2024.

  40. arXiv:2404.12500  [pdf, other]

    cs.HC cs.CL cs.CV

    UIClip: A Data-driven Model for Assessing User Interface Design

    Authors: Jason Wu, Yi-Hao Peng, Amanda Li, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols

    Abstract: User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmenta…

    Submitted 18 April, 2024; originally announced April 2024.

  41. arXiv:2404.12372  [pdf, other]

    cs.CV

    MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

    Authors: Xiaotang Gai, Chenyi Zhou, Jiaxiang Liu, Yang Feng, Jian Wu, Zuozhu Liu

    Abstract: Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and significant advancement in healthcare. It assists medical experts to swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited,…

    Submitted 18 April, 2024; originally announced April 2024.

  42. arXiv:2404.11871  [pdf, other]

    cs.CV

    Group-On: Boosting One-Shot Segmentation with Supportive Query

    Authors: Hanjing Zhou, Mingze Yin, JinTai Chen, Danny Chen, Jian Wu

    Abstract: One-shot semantic segmentation aims to segment query images given only ONE annotated support image of the same class. This task is challenging because target objects in the support and query images can be largely different in appearance and pose (i.e., intra-class variation). Prior works suggested that incorporating more annotated support images in few-shot settings boosts performances but increas…

    Submitted 17 April, 2024; originally announced April 2024.

  43. arXiv:2404.11171  [pdf, other]

    cs.LG cs.AI eess.SP

    Personalized Heart Disease Detection via ECG Digital Twin Generation

    Authors: Yaojun Hu, Jintai Chen, Lianting Hu, Dantong Li, Jiahuan Yan, Haochao Ying, Huiying Liang, Jian Wu

    Abstract: Heart diseases rank among the leading causes of global mortality, demonstrating a crucial need for early diagnosis and intervention. Most traditional electrocardiogram (ECG) based automated diagnosis methods are trained at population level, neglecting the customization of personalized ECGs to enhance individual healthcare management. A potential solution to address this limitation is to employ dig…

    Submitted 17 April, 2024; originally announced April 2024.

  44. arXiv:2404.11095  [pdf, other]

    cs.CL cs.AI

    Inductive-Deductive Strategy Reuse for Multi-Turn Instructional Dialogues

    Authors: Jiao Ou, Jiayu Wu, Che Liu, Fuzheng Zhang, Di Zhang, Kun Gai

    Abstract: Aligning large language models (LLMs) with human expectations requires high-quality instructional dialogues, which can be achieved by raising diverse, in-depth, and insightful instructions that deepen interactions. Existing methods target instructions from real instruction dialogues as a learning goal and fine-tune a user simulator for posing instructions. However, the user simulator struggles to…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: 27 pages, 3 figures, 12 tables

  45. arXiv:2404.10556  [pdf, other]

    cs.NI eess.SP

    Generative AI for Advanced UAV Networking

    Authors: Geng Sun, Wenwen Xie, Dusit Niyato, Hongyang Du, Jiawen Kang, Jing Wu, Sumei Sun, Ping Zhang

    Abstract: With the impressive achievements of chatGPT and Sora, generative artificial intelligence (GAI) has received increasing attention. Not limited to the field of content generation, GAI is also widely used to solve the problems in wireless communication scenarios due to its powerful learning and generalization capabilities. Therefore, we discuss key applications of GAI in improving unmanned aerial veh…

    Submitted 16 April, 2024; originally announced April 2024.

  46. arXiv:2404.10498  [pdf, other]

    cs.AI cs.CV cs.DC

    LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration for IoT-based Perception System

    Authors: Shijing Hu, Ruijun Deng, Xin Du, Zhihui Lu, Qiang Duan, Yi He, Shih-Chia Huang, Jie Wu

    Abstract: Recent large vision models (e.g., SAM) enjoy great potential to facilitate intelligent perception with high accuracy. Yet, the resource constraints in the IoT environment tend to limit such large vision models to be locally deployed, incurring considerable inference latency thereby making it difficult to support real-time applications, such as autonomous driving and robotics. Edge-cloud collaborat…

    Submitted 16 April, 2024; originally announced April 2024.

  47. arXiv:2404.10282  [pdf, other]

    cs.LG cs.CV

    Tripod: Three Complementary Inductive Biases for Disentangled Representation Learning

    Authors: Kyle Hsu, Jubayer Ibn Hamid, Kaylee Burns, Chelsea Finn, Jiajun Wu

    Abstract: Inductive biases are crucial in disentangled representation learning for narrowing down an underspecified solution set. In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature: data compression into a grid-like latent space via quantization, collective independence amongst latents, and minimal functional influence of any latent on how…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: 22 pages, 10 figures, code available at https://github.com/kylehkhsu/tripod

  48. arXiv:2404.10229  [pdf, other]

    cs.CL

    Generative Text Steganography with Large Language Model

    Authors: Jiaxuan Wu, Zhengxian Wu, Yiming Xue, Juan Wen, Wanli Peng

    Abstract: Recent advances in large language models (LLMs) have blurred the boundary of high-quality text generation between humans and machines, which is favorable for generative text steganography. While, current advanced steganographic mapping is not suitable for LLMs since most users are restricted to accessing only the black-box API or user interface of the LLMs, thereby lacking access to the training v…

    Submitted 15 April, 2024; originally announced April 2024.

  49. arXiv:2404.09654  [pdf, other]

    cs.CV cs.MM

    Do LLMs Understand Visual Anomalies? Uncovering LLM Capabilities in Zero-shot Anomaly Detection

    Authors: Jiaqi Zhu, Shaofeng Cai, Fang Deng, Junran Wu

    Abstract: Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomal…

    Submitted 15 April, 2024; originally announced April 2024.

  50. arXiv:2404.09204  [pdf, other]

    cs.CV cs.AI

    TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

    Authors: Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng

    Abstract: Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited for document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, a MLLM that is specifically designed for document-oriented tasks, while preserving the general capabilities of ML…

    Submitted 14 April, 2024; originally announced April 2024.