
Showing 1–50 of 164 results for author: Gao, D

  1. arXiv:2404.18106  [pdf, other]

    cs.CV

    Semi-supervised Text-based Person Search

    Authors: Daming Gao, Yang Bai, Min Cao, Hao Dou, Mang Ye, Min Zhang

    Abstract: Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning. It poses a significant challenge in practice, as acquiring person images from surveillance videos is relatively easy, while obtain…

    Submitted 28 April, 2024; originally announced April 2024.

    Comments: 13 pages

  2. arXiv:2404.14676  [pdf, other]

    cs.CV cs.GR

    DreamPBR: Text-driven Generation of High-resolution SVBRDF with Multi-modal Guidance

    Authors: Linxuan Xin, Zheng Zhang, Jinfu Wei, Ge Li, Duan Gao

    Abstract: Prior material creation methods had limitations in producing diverse results mainly because reconstruction-based methods relied on real-world measurements and generation-based methods were trained on relatively small material datasets. To address these challenges, we propose DreamPBR, a novel diffusion-based generative framework designed to create spatially-varying appearance properties guided by…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: 16 pages, 17 figures

    ACM Class: I.3.0, I.4.9

  3. AI Ethics: A Bibliometric Analysis, Critical Issues, and Key Gaps

    Authors: Di Kevin Gao, Andrew Haverly, Sudip Mittal, Jiming Wu, Jingdao Chen

    Abstract: Artificial intelligence (AI) ethics has emerged as a burgeoning yet pivotal area of scholarly research. This study conducts a comprehensive bibliometric analysis of the AI ethics literature over the past two decades. The analysis reveals a discernible tripartite progression, characterized by an incubation phase, followed by a subsequent phase focused on imbuing AI with human-like attributes, culmi…

    Submitted 12 March, 2024; originally announced March 2024.

    Journal ref: International Journal of Business Analytics (IJBAN), 2024, 11(1), 1-19

  4. arXiv:2403.11789  [pdf, other]

    cs.CV

    EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding

    Authors: Wenhua Wu, Qi Wang, Guangming Wang, Junping Wang, Tiankun Zhao, Yang Liu, Dongchao Gao, Zhe Liu, Hesheng Wang

    Abstract: Road surface reconstruction plays a vital role in autonomous driving systems, enabling road lane perception and high-precision mapping. Recently, neural implicit encoding has achieved remarkable results in scene representation, particularly in the realistic rendering of scene textures. However, it faces challenges in directly representing geometric information for large-scale scenes. To address th…

    Submitted 18 March, 2024; originally announced March 2024.

  5. arXiv:2403.10014  [pdf, other]

    cs.NI cs.AI

    NNCTC: Physical Layer Cross-Technology Communication via Neural Networks

    Authors: Haoyu Wang, Jiazhao Wang, Demin Gao, Wenchao Jiang

    Abstract: Cross-technology communication (CTC) enables seamless interactions between diverse wireless technologies. Most existing work is based on reversing the transmission path to identify the appropriate payload to generate the waveform that the target devices can recognize. However, this method suffers from many limitations, including dependency on specific technologies and the necessity for intricate al…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: 12 pages

    ACM Class: C.2.2

  6. arXiv:2403.09861  [pdf, other]

    cs.ET cs.AI

    NN-Defined Modulator: Reconfigurable and Portable Software Modulator on IoT Gateways

    Authors: Jiazhao Wang, Wenchao Jiang, Ruofeng Liu, Bin Hu, Demin Gao, Shuai Wang

    Abstract: A physical-layer modulator is a vital component for an IoT gateway to map the symbols to signals. However, due to the soldered hardware chipsets on the gateway's motherboards or the diverse toolkits on different platforms for the software radio, the existing solutions either have limited extensibility or are platform-specific. Such limitation is hard to ignore when modulation schemes and hardware…

    Submitted 14 March, 2024; originally announced March 2024.

    Journal ref: NSDI 2024

  7. arXiv:2403.09559  [pdf, other]

    cs.CL cs.CV

    Less is More: Data Value Estimation for Visual Instruction Tuning

    Authors: Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen

    Abstract: Visual instruction tuning is the key to building multimodal large language models (MLLMs), which greatly improves the reasoning capabilities of large language models (LLMs) in vision scenario. However, existing MLLMs mostly rely on a mixture of multiple highly diverse visual instruction datasets for training (even more than a million instructions), which may introduce data redundancy. To investiga…

    Submitted 21 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  8. arXiv:2403.05551  [pdf]

    cs.CY

    A Bibliometric View of AI Ethics Development

    Authors: Di Kevin Gao, Andrew Haverly, Sudip Mittal, Jingdao Chen

    Abstract: Artificial Intelligence (AI) Ethics is a nascent yet critical research field. Recent developments in generative AI and foundational models necessitate a renewed look at the problem of AI Ethics. In this study, we perform a bibliometric analysis of AI Ethics literature for the last 20 years based on keyword search. Our study reveals a three-phase development in AI Ethics, namely an incubation phase…

    Submitted 8 February, 2024; originally announced March 2024.

  9. arXiv:2403.03689  [pdf, other]

    cs.CL cs.AI

    General2Specialized LLMs Translation for E-commerce

    Authors: Kaidi Chen, Ben Chen, Dehong Gao, Huangyu Dai, Wen Jiang, Wei Ning, Shanqing Yu, Libin Yang, Xiaoyan Cai

    Abstract: Existing Neural Machine Translation (NMT) models mainly handle translation in the general domain, while overlooking domains with special writing formulas, such as e-commerce and legal documents. Taking e-commerce as an example, the texts usually include amounts of domain-related words and have more grammar problems, which leads to inferior performances of current NMT methods. To address these prob…

    Submitted 6 April, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

    Comments: 4 pages, 1 figure, WWW2024 accepted

  10. arXiv:2402.14034  [pdf, other]

    cs.MA cs.AI

    AgentScope: A Flexible yet Robust Multi-Agent Platform

    Authors: Dawei Gao, Zitao Li, Weirui Kuang, Xuchen Pan, Daoyuan Chen, Zhijian Ma, Bingchen Qian, Liuyi Yao, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, Jingren Zhou

    Abstract: With the rapid advancement of Large Language Models (LLMs), significant progress has been made in multi-agent applications. However, the complexities in coordinating agents' cooperation and LLMs' erratic performance pose notable challenges in developing robust and efficient multi-agent applications. To tackle these challenges, we propose AgentScope, a developer-centric multi-agent platform with me…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: We have released code on https://github.com/modelscope/agentscope

  11. arXiv:2402.01750  [pdf, other]

    cs.CL cs.AI

    PACE: A Pragmatic Agent for Enhancing Communication Efficiency Using Large Language Models

    Authors: Jiaxuan Li, Minxi Yang, Dahua Gao, Wenlong Xu, Guangming Shi

    Abstract: Current communication technologies face limitations in terms of theoretical capacity, spectrum availability, and power resources. Pragmatic communication, leveraging terminal intelligence for selective data transmission, offers resource conservation. Existing research lacks universal intention resolution tools, limiting applicability to specific tasks. This paper proposes an image pragmatic commun…

    Submitted 30 January, 2024; originally announced February 2024.

    Comments: 11 pages, 11 figures, submitted to IJCAI 2024

  12. arXiv:2401.13516  [pdf, other]

    cs.CV cs.CR

    Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces

    Authors: Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou

    Abstract: Deepfake videos are becoming increasingly realistic, showing few tampering traces on facial areas that vary between frames. Consequently, existing Deepfake detection methods struggle to detect unknown domain Deepfake videos while accurately locating the tampered region. To address this limitation, we propose Delocate, a novel Deepfake detection model that can both recognize and localize unknown domai…

    Submitted 5 May, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2308.09921, arXiv:2305.05943

  13. arXiv:2401.03563  [pdf, other]

    cs.CL cs.IR

    Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning

    Authors: Yingqian Min, Kun Zhou, Dawei Gao, Wayne Xin Zhao, He Hu, Yaliang Li

    Abstract: Recently, multi-task instruction tuning has been applied into sentence representation learning, which endows the capability of generating specific representations with the guidance of task instruction, exhibiting strong generalization ability on new tasks. However, these methods mostly neglect the potential interference problems across different tasks and instances, which may affect the training a…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: 14 pages, working in progress

  14. arXiv:2401.03407  [pdf, other]

    cs.CV

    Bilateral Reference for High-Resolution Dichotomous Image Segmentation

    Authors: Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, Nicu Sebe

    Abstract: We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS). It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef). The LM aids in object localization using global semantic information. Within the RM, we utilize BiRef for the reconstruction proce…

    Submitted 6 March, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Add the link to codes

  15. arXiv:2401.01659  [pdf, other]

    cs.CV

    DiffYOLO: Object Detection for Anti-Noise via YOLO and Diffusion Models

    Authors: Yichen Liu, Huajian Zhang, Daqing Gao

    Abstract: Object detection models represented by YOLO series have been widely used and have achieved great results on the high quality datasets, but not all the working conditions are ideal. To settle down the problem of locating targets on low quality datasets, the existing methods either train a new object detection network, or need a large collection of low-quality datasets to train. However, we propose…

    Submitted 3 January, 2024; originally announced January 2024.

    MSC Class: 68T45 ACM Class: I.2.10

  16. arXiv:2312.13108  [pdf, other]

    cs.CV

    ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

    Authors: Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou

    Abstract: Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Model (LLM) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper…

    Submitted 1 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: Project Page: https://showlab.github.io/assistgui/

  17. arXiv:2312.07009  [pdf, other]

    cs.CV

    Vision-language Assisted Attribute Learning

    Authors: Kongming Liang, Xinran Wang, Rui Wang, Donghui Gao, Ling Jin, Weidong Liu, Xiatian Zhu, Zhanyu Ma, Jun Guo

    Abstract: Attribute labeling at large scale is typically incomplete and partial, posing significant challenges to model optimization. Existing attribute learning methods often treat the missing labels as negative or simply ignore them all during training, either of which could hamper the model performance to a great extent. To overcome these limitations, in this paper we leverage the available vision-langua…

    Submitted 14 December, 2023; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: Accepted by IEEE IC-NIDC 2023

  18. arXiv:2312.06947  [pdf, other]

    cs.CV

    MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing

    Authors: Kangneng Zhou, Daiheng Gao, Xuan Wang, Jie Zhang, Peng Zhang, Xusen Sun, Longhao Zhang, Shiqi Yang, Bang Zhang, Liefeng Bo, Yaxing Wang, Ming-Ming Cheng

    Abstract: 3D-aware portrait editing has a wide range of applications in multiple fields. However, current approaches are limited due that they can only perform mask-guided or text-based editing. Even by fusing the two procedures into a model, the editing quality and stability cannot be ensured. To address this limitation, we propose MaTe3D: mask-guided text-based 3D-aware portrait editing. In this…

    Submitted 3 May, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: 13 pages, 13 figures

  19. arXiv:2312.02473  [pdf, other]

    cs.LG cs.DC

    NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams

    Authors: Chaoyi Chen, Dechao Gao, Yanfeng Zhang, Qiange Wang, Zhenbo Fu, Xuecang Zhang, Junhua Zhu, Yu Gu, Ge Yu

    Abstract: Existing Graph Neural Network (GNN) training frameworks have been designed to help developers easily create performant GNN implementations. However, most existing GNN frameworks assume that the input graphs are static, but ignore that most real-world graphs are constantly evolving. Though many dynamic GNN models have emerged to learn from evolving graphs, the training process of these dynamic GNNs…

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: 12 pages, 15 figures

  20. arXiv:2312.02141  [pdf, other]

    cs.CV

    iMatching: Imperative Correspondence Learning

    Authors: Zitong Zhan, Dasong Gao, Yun-Jou Lin, Youjie Xia, Chen Wang

    Abstract: Learning feature correspondence is a foundational task in computer vision, holding immense importance for downstream applications such as visual odometry and 3D reconstruction. Despite recent progress in data-driven models, feature correspondence learning is still limited by the lack of accurate per-pixel correspondence labels. To overcome this difficulty, we introduce a new self-supervised scheme…

    Submitted 4 December, 2023; originally announced December 2023.

  21. arXiv:2312.01841  [pdf, other]

    cs.CV

    VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

    Authors: Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang, Xinya Ji, Kangneng Zhou, Daiheng Gao, Liefeng Bo, Xun Cao

    Abstract: Audio-driven talking head generation has drawn much attention in recent years, and many efforts have been made in lip-sync, expressive facial expressions, natural head pose generation, and high video quality. However, no model has yet led or tied on all these metrics due to the one-to-many mapping between audio and motion. In this paper, we propose VividTalk, a two-stage generic framework that sup…

    Submitted 6 December, 2023; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: 10 pages, 8 figures

  22. arXiv:2312.01105  [pdf, other]

    cs.CV

    S2P3: Self-Supervised Polarimetric Pose Prediction

    Authors: Patrick Ruhkamp, Daoyi Gao, Nassir Navab, Benjamin Busam

    Abstract: This paper proposes the first self-supervised 6D object pose prediction from multimodal RGB+polarimetric images. The novel training paradigm comprises 1) a physical model to extract geometric information of polarized light, 2) a teacher-student knowledge distillation scheme and 3) a self-supervised loss formulation through differentiable rendering and an invertible physical constraint. Both networ…

    Submitted 2 December, 2023; originally announced December 2023.

    Comments: Accepted at IJCV

  23. arXiv:2311.18610  [pdf, other]

    cs.CV

    DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

    Authors: Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger, Angela Dai

    Abstract: Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task -- both in depth-scale ambiguity in monocular perception, as well…

    Submitted 30 November, 2023; originally announced November 2023.

    Comments: Project page: https://daoyig.github.io/DiffCAD/ Video: https://www.youtube.com/watch?v=PCursyPosMY

  24. arXiv:2311.16081  [pdf, other]

    cs.CV cs.AI

    ViT-Lens: Towards Omni-modal Representations

    Authors: Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

    Abstract: Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT…

    Submitted 26 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: This work is a follow-up of arXiv:2308.10185. Accepted to CVPR2024

  25. arXiv:2311.15672  [pdf, other]

    cs.CV

    HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images

    Authors: Xihe Yang, Xingyu Chen, Daiheng Gao, Shaohui Wang, Xiaoguang Han, Baoyuan Wang

    Abstract: As for human avatar reconstruction, contemporary techniques commonly necessitate the acquisition of costly data and struggle to achieve satisfactory results from a small number of casual images. In this paper, we investigate this task from a few-shot unconstrained photo album. The reconstruction of human avatars from such data sources is challenging because of limited data amount and dynamic artic…

    Submitted 31 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  26. arXiv:2311.07599  [pdf, other]

    cs.SE cs.AI

    Testing LLMs on Code Generation with Varying Levels of Prompt Specificity

    Authors: Lincoln Murr, Morgan Grainger, David Gao

    Abstract: Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing. Among the myriad of applications that benefit from LLMs, automated code generation is increasingly promising. The potential to transform natural language prompts into executable code promises a major shift in software development practices and paves the way for significant re…

    Submitted 10 November, 2023; originally announced November 2023.

  27. arXiv:2310.16003  [pdf, other]

    cs.CV

    CVPR 2023 Text Guided Video Editing Competition

    Authors: Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuanxi Sun, Rui He, Feng Hu, Junhua Hu, Hai Huang, Hanyu Zhu, Xu Cheng, Jie Tang, Mike Zheng Shou, Kurt Keutzer, Forrest Iandola

    Abstract: Humans watch more than a billion hours of video per day. Most of this video was edited manually, which is a tedious process. However, AI-enabled video-generation and video-editing is on the rise. Building on text-to-image models like Stable Diffusion and Imagen, generative AI has improved dramatically on video tasks. But it's hard to evaluate progress in these video tasks because there is no stand…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Project page: https://sites.google.com/view/loveucvpr23/track4

  28. arXiv:2309.15818  [pdf, other]

    cs.CV

    Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

    Authors: David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou

    Abstract: Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marri…

    Submitted 17 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: project page is https://showlab.github.io/Show-1

  29. arXiv:2309.15796  [pdf, other]

    eess.AS cs.CL cs.LG

    Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition

    Authors: Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel Povey, Sanjeev Khudanpur

    Abstract: Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. Thi…

    Submitted 26 September, 2023; originally announced September 2023.

  30. arXiv:2309.02033  [pdf, other]

    cs.LG cs.DB cs.DC

    Data-Juicer: A One-Stop Data Processing System for Large Language Models

    Authors: Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou

    Abstract: The immense evolution in Large Language Models (LLMs) has underscored the importance of massive, heterogeneous, and high-quality data. A data recipe is a mixture of data from different sources for training LLMs, which plays a vital role in LLMs' performance. Existing open-source tools for LLM data processing are mostly tailored for specific data recipes. To continuously uncover the potential of LL…

    Submitted 20 December, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: 20 Pages, 10 figures, 9 tables. The system, data recipes, and demos are continuously maintained at https://github.com/alibaba/data-juicer

  31. arXiv:2309.00363  [pdf, other]

    cs.LG

    FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning

    Authors: Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou

    Abstract: LLMs have demonstrated great capabilities in various NLP tasks. Different entities can further improve the performance of those LLMs on their specific downstream tasks by fine-tuning LLMs. When several entities have similar interested tasks, but their data cannot be shared because of privacy concerns regulations, federated learning (FL) is a mainstream solution to leverage the data of different en…

    Submitted 1 September, 2023; originally announced September 2023.

    Comments: Source code: https://github.com/alibaba/FederatedScope/tree/llm

  32. arXiv:2308.15990  [pdf, other]

    cs.SD eess.AS

    Dual-path Transformer Based Neural Beamformer for Target Speech Extraction

    Authors: Aoqi Guo, Sichong Qian, Baoxiang Li, Dazhi Gao

    Abstract: Neural beamformers, which integrate both pre-separation and beamforming modules, have demonstrated impressive effectiveness in target speech extraction. Nevertheless, the performance of these beamformers is inherently limited by the predictive accuracy of the pre-separation module. In this paper, we introduce a neural beamformer supported by a dual-path transformer. Initially, we employ the cross-…

    Submitted 7 September, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

  33. arXiv:2308.15363  [pdf, other]

    cs.DB cs.CL cs.LG

    Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation

    Authors: Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, Jingren Zhou

    Abstract: Large language models (LLMs) have emerged as a new paradigm for Text-to-SQL task. However, the absence of a systematical benchmark inhibits the development of designing effective, efficient and economic LLM-based Text-to-SQL solutions. To address this challenge, in this paper, we first conduct a systematical and extensive comparison over existing prompt engineering methods, including question repr…

    Submitted 20 November, 2023; v1 submitted 29 August, 2023; originally announced August 2023.

    Comments: We have released code on https://github.com/BeachWang/DAIL-SQL

  34. arXiv:2308.10627  [pdf, other]

    cs.CV

    Polarimetric Information for Multi-Modal 6D Pose Estimation of Photometrically Challenging Objects with Limited Data

    Authors: Patrick Ruhkamp, Daoyi Gao, HyunJun Jung, Nassir Navab, Benjamin Busam

    Abstract: 6D pose estimation pipelines that rely on RGB-only or RGB-D data show limitations for photometrically challenging objects with e.g. textureless surfaces, reflections or transparency. A supervised learning-based method utilising complementary polarisation information as input modality is proposed to overcome such limitations. This supervised approach is then extended to a self-supervised paradigm b…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023 TRICKY Workshop

  35. arXiv:2308.09921  [pdf, other]

    cs.CV cs.AI

    Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces

    Authors: Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou

    Abstract: The exploitation of Deepfake techniques for malicious intentions has driven significant research interest in Deepfake detection. Deepfake manipulations frequently introduce random tampered traces, leading to unpredictable outcomes in different facial regions. However, existing detection methods heavily rely on specific forgery indicators, and as the forgery mode improves, these traces become incre…

    Submitted 19 August, 2023; originally announced August 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2305.05943

  36. arXiv:2308.08236   

    cs.ET physics.optics

    Redundancy-free integrated optical convolver for optical neural networks based on arrayed waveguide grating

    Authors: Shiji Zhang, Haojun Zhou, Bo Wu, Xueyi Jiang, Dingshan Gao, Jing Xu, Jianji Dong, Xinliang Zhang

    Abstract: Optical neural networks (ONNs) have gained significant attention due to their potential for high-speed and energy-efficient computation in artificial intelligence. The implementation of optical convolutions plays a vital role in ONNs, as they are fundamental operations within neural network architectures. However, state-of-the-art convolution architectures often suffer from redundant inputs, leadi…

    Submitted 22 August, 2023; v1 submitted 16 August, 2023; originally announced August 2023.

    Comments: The data are not sufficiently detailed and need to be supplemented with some detailed data

  37. arXiv:2308.06547  [pdf, other]

    eess.AS cs.CL cs.SD

    Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

    Authors: Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan

    Abstract: When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens. Taking noisy labels as ground-truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either fi…

    Submitted 12 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2023

  38. arXiv:2308.04288  [pdf, other]

    cs.CV

    Cloth2Tex: A Customized Cloth Texture Generation Pipeline for 3D Virtual Try-On

    Authors: Daiheng Gao, Xu Chen, Xindi Zhang, Qi Wang, Ke Sun, Bang Zhang, Liefeng Bo, Qixing Huang

    Abstract: Fabricating and designing 3D garments has become extremely demanding with the increasing need for synthesizing realistic dressed persons for a variety of applications, e.g. 3D virtual try-on, digitalization of 2D clothes into 3D apparel, and cloth animation. It thus necessitates a simple and straightforward pipeline to obtain high-quality texture from simple input, such as 2D reference images. Sin…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: 15 pages, 15 figures

  39. arXiv:2307.16715  [pdf, other]

    cs.CV

    UniVTG: Towards Unified Video-Language Temporal Grounding

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

    Abstract: Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detect…

    Submitted 18 August, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023. 16 pages, 10 figures, 13 tables. Code: https://github.com/showlab/UniVTG

  40. arXiv:2307.08072  [pdf, other]

    cs.CL cs.AI

    Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study

    Authors: Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, Ji-Rong Wen

    Abstract: Despite the superior performance, Large Language Models (LLMs) require significant computational resources for deployment and use. To overcome this issue, quantization methods have been widely applied to reduce the memory footprint of LLMs as well as increasing the inference rate. However, a major challenge is that low-bit quantization methods often lead to performance degradation. It is important…

    Submitted 26 July, 2023; v1 submitted 16 July, 2023; originally announced July 2023.

    Comments: 15 pages, 4 figures

  41. arXiv:2306.15942  [pdf, other]

    cs.SD cs.AI eess.AS

    Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction

    Authors: Aoqi Guo, Junnan Wu, Peng Gao, Wenbo Zhu, Qinwen Guo, Dazhi Gao, Yujun Wang

    Abstract: Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of neural beamformer. To achieve this, we first use the UNet-TCN structure to model input fea…

    Submitted 28 June, 2023; originally announced June 2023.

  42. arXiv:2306.15255  [pdf, other]

    cs.CV cs.CL

    GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

    Authors: Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou

    Abstract: In this report, we present our champion solution for Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2023. Essentially, to accurately ground in a video, an effective egocentric feature extractor and a powerful grounding model are required. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: 5 pages, 2 figures, 4 tables, the champion solution for Ego4D Natural Language Queries Challenge in CVPR 2023

  43. arXiv:2306.11252  [pdf, other]

    cs.CL cs.LG

    HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

    Authors: Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur

    Abstract: We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verb…

    Submitted 19 June, 2023; originally announced June 2023.

  44. arXiv:2306.08640  [pdf, other]

    cs.CV

    AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

    Authors: Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou

    Abstract: Recent research on Large Language Models (LLMs) has led to remarkable advancements in general NLP AI assistants. Some studies have further explored the use of LLMs for planning and invoking models or APIs to address more general multi-modal user queries. Despite this progress, complex visual-based tasks still remain challenging due to the diverse nature of visual tasks. This diversity is reflected…

    Submitted 28 June, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: Project page: https://showlab.github.io/assistgpt/

  45. arXiv:2306.08200  [pdf, other]

    cs.CV cs.LG

    POP: Prompt Of Prompts for Continual Learning

    Authors: Zhiyuan Hu, Jiancheng Lyu, Dashan Gao, Nuno Vasconcelos

    Abstract: Continual learning (CL) has attracted increasing attention in the recent past. It aims to mimic the human ability to learn new concepts without catastrophic forgetting. While existing CL methods accomplish this to some extent, they are still prone to semantic drift of the learned feature space. Foundation models, which are endowed with a robust feature representation, learned from very large datas…

    Submitted 13 June, 2023; originally announced June 2023.

  46. arXiv:2306.01031  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts

    Authors: Dongji Gao, Matthew Wiesner, Hainan Xu, Leibny Paola Garcia, Daniel Povey, Sanjeev Khudanpur

    Abstract: This paper presents a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. Imperfectly transcribed speech is a prevalent issue in human-annotated speech corpora, which degrades the performance of ASR models. To address this problem, we propose Bypass Temporal Classification (BTC) as an expansion of the Connectionist Temporal Classification (CTC) cr…

    Submitted 1 June, 2023; originally announced June 2023.

  47. RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search

    Authors: Yang Bai, Min Cao, Daming Gao, Ziqiang Cao, Chen Chen, Zhenfeng Fan, Liqiang Nie, Min Zhang

    Abstract: Text-based person search aims to retrieve the specified person images given a textual description. The key to tackling such a challenging task is to learn powerful multi-modal representations. Towards this, we propose a Relation and Sensitivity aware representation learning method (RaSa), including two novel tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA). For one thing, ex…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted by IJCAI 2023. Code is available at https://github.com/Flame-Chasers/RaSa

  48. arXiv:2305.06883  [pdf, other]

    cs.GT cs.SI

    Cross-channel Budget Coordination for Online Advertising System

    Authors: Guangyuan Shen, Shenjie Sun, Dehong Gao, Shaolei Li, Libin Yang, Yongping Shi, Wei Ning

    Abstract: In online advertising (Ad), advertisers are always eager to know how to globally optimize their budget allocation strategies across different channels for more conversions such as orders, payments, etc. Ignoring competition among different advertisers causes objective inconsistency, that is, a single advertiser locally optimizes the conversions only based on its own historical statistics, which is…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: under review

  49. arXiv:2305.06272  [pdf, other]

    cs.IR cs.CR cs.DC cs.LG

    FedPDD: A Privacy-preserving Double Distillation Framework for Cross-silo Federated Recommendation

    Authors: Sheng Wan, Dashan Gao, Hanlin Gu, Daning Hu

    Abstract: Cross-platform recommendation aims to improve recommendation accuracy by gathering heterogeneous features from different platforms. However, such cross-silo collaborations between platforms are restricted by increasingly stringent privacy protection regulations, thus data cannot be aggregated for training. Federated learning (FL) is a practical solution to deal with the data silo problem in recomm…

    Submitted 30 January, 2024; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted by IJCNN2023

  50. arXiv:2305.06158  [pdf, other]

    cs.IR cs.AI cs.LG

    EdgeNet : Encoder-decoder generative Network for Auction Design in E-commerce Online Advertising

    Authors: Guangyuan Shen, Shengjie Sun, Dehong Gao, Libin Yang, Yongping Shi, Wei Ning

    Abstract: We present a new encoder-decoder generative network dubbed EdgeNet, which introduces a novel encoder-decoder framework for data-driven auction design in online e-commerce advertising. We break the neural auction paradigm of Generalized-Second-Price (GSP), and improve the utilization efficiency of data while ensuring the economic characteristics of the auction mechanism. Specifically, EdgeNet introd…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: under review. arXiv admin note: substantial text overlap with arXiv:2106.03593 by other authors