Showing 1–50 of 419 results for author: Dong, X

Searching in archive cs.
  1. arXiv:2405.05134  [pdf, other]

    cs.CY cs.AI cs.LG

    Enhancing Deep Knowledge Tracing via Diffusion Models for Personalized Adaptive Learning

    Authors: Ming Kuo, Shouvon Sarker, Lijun Qian, Yujian Fu, Xiangfang Li, Xishuang Dong

    Abstract: In contrast to pedagogies like evidence-based teaching, personalized adaptive learning (PAL) distinguishes itself by closely monitoring the progress of individual students and tailoring the learning path to their unique knowledge and requirements. A crucial technique for effective PAL implementation is knowledge tracing, which models students' evolving knowledge to predict their future performance…

    Submitted 24 April, 2024; originally announced May 2024.

  2. arXiv:2405.03110  [pdf, other]

    cs.IR

    Vector Quantization for Recommender Systems: A Review and Outlook

    Authors: Qijiong Liu, Xiaoyu Dong, Jiaren Xiao, Nuo Chen, Hengchang Hu, Jieming Zhu, Chenxu Zhu, Tetsuya Sakai, Xiao-Ming Wu

    Abstract: Vector quantization, renowned for its unparalleled feature compression capabilities, has been a prominent topic in signal processing and machine learning research for several decades and remains widely utilized today. With the emergence of large models and generative AI, vector quantization has gained popularity in recommender systems, establishing itself as a preferred solution. This paper starts…

    Submitted 5 May, 2024; originally announced May 2024.

  3. arXiv:2405.00448  [pdf, other]

    cs.CV

    MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation

    Authors: Xujie Zhang, Ente Lin, Xiu Li, Yuxuan Luo, Michael Kampffmeyer, Xin Dong, Xiaodan Liang

    Abstract: This paper introduces MMTryon, a multi-modal multi-reference VIrtual Try-ON (VITON) framework, which can generate high-quality compositional try-on results by taking as inputs a text instruction and multiple garment images. Our MMTryon mainly addresses two problems overlooked in prior literature: 1) Support of multiple try-on items and dressing style. Existing methods are commonly designed for singl…

    Submitted 1 May, 2024; originally announced May 2024.

  4. arXiv:2404.19364  [pdf, other]

    cs.CL

    Navigating Brain Language Representations: A Comparative Analysis of Neural Language Models and Psychologically Plausible Models

    Authors: Yunhao Zhang, Shaonan Wang, Xinyi Dong, Jiajun Yu, Chengqing Zong

    Abstract: Neural language models, particularly large-scale ones, have been consistently proven to be most effective in predicting brain neural activity across a range of studies. However, previous research overlooked the comparison of these models with psychologically plausible ones. Moreover, evaluations were reliant on limited, single-modality, and English cognitive datasets. To address these questions, w…

    Submitted 30 April, 2024; originally announced April 2024.

  5. arXiv:2404.19048  [pdf, other]

    cs.CL cs.AI

    A Framework for Real-time Safeguarding the Text Generation of Large Language Model

    Authors: Ximing Dong, Dayi Lin, Shaowei Wang, Ahmed E. Hassan

    Abstract: Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. To address this, various approaches have been developed to safeguard LLMs from producing unsafe content. However, existing methods have limitations, including the need for training specific control models and…

    Submitted 1 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

  6. arXiv:2404.19019  [pdf, other]

    cs.DS cs.DC

    Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering

    Authors: Laxman Dhulipala, Xiaojun Dong, Kishen N Gowda, Yan Gu

    Abstract: Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree $T$, the SLD of $T$ is a binary dendrogram that summarizes the $n-1$ clusterings obtained by contracting the edges of $T$ in order of weight. Existing algorithms for computing the SLD all require $\Omega(n \log n)$ work where $n = |T|$. Furthermore,…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: To appear at SPAA 2024

  7. PASGAL: Parallel And Scalable Graph Algorithm Library

    Authors: Xiaojun Dong, Yan Gu, Yihan Sun, Letong Wang

    Abstract: In this paper, we introduce PASGAL (Parallel And Scalable Graph Algorithm Library), a parallel graph library that scales to a variety of graph types, many processors, and large graph sizes. One special focus of PASGAL is the efficiency on large-diameter graphs, which is a common challenge for many existing parallel graph processing systems: many existing graph processing systems can be ev…

    Submitted 25 April, 2024; originally announced April 2024.

  8. arXiv:2404.16821  [pdf, other]

    cs.CV

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, et al. (10 additional authors not shown)

    Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual…

    Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Technical report

  9. arXiv:2404.16771  [pdf, other]

    cs.CV cs.AI

    ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving

    Authors: Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang

    Abstract: Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive strategy for ID preservation by fully considering intricate facial de…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Project page: https://ssugarwh.github.io/consistentid.github.io/

  10. arXiv:2404.16452  [pdf, other]

    cs.CV

    PAD: Patch-Agnostic Defense against Adversarial Patch Attacks

    Authors: Lihua Jing, Rui Wang, Wenqi Ren, Xin Dong, Cong Zou

    Abstract: Adversarial patch attacks present a significant threat to real-world object detectors due to their practical feasibility. Existing defense methods, which rely on attack data or prior knowledge, struggle to effectively address a wide range of adversarial patches. In this paper, we show two inherent characteristics of adversarial patches, semantic independence and spatial heterogeneity, independent…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024

  11. arXiv:2404.14066  [pdf, other]

    cs.CV cs.IR

    SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval

    Authors: Xuzheng Yu, Chen Jiang, Xingning Dong, Tian Gan, Ming Yang, Qingpei Guo

    Abstract: The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing appr…

    Submitted 6 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

  12. arXiv:2404.13792  [pdf, other]

    cs.MM cs.AI cs.CL cs.HC

    Counterfactual Reasoning Using Predicted Latent Personality Dimensions for Optimizing Persuasion Outcome

    Authors: Donghuo Zeng, Roberto S. Legaspi, Yuewen Sun, Xinshuai Dong, Kazushi Ikeda, Peter Spirtes, Kun Zhang

    Abstract: Customizing persuasive conversations related to the outcome of interest for specific users achieves better persuasion results. However, existing persuasive conversation systems rely on persuasive strategies and encounter challenges in dynamically adjusting dialogues to suit the evolving states of individual users during interactions. This limitation restricts the system's ability to deliver flexib…

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: 14 pages, 10 figures, Accepted by Persuasive Technology 2024

  13. arXiv:2404.13044  [pdf, other]

    cs.CV

    Unified Scene Representation and Reconstruction for 3D Large Language Models

    Authors: Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang

    Abstract: Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to point clouds, which serve as inputs for LLMs. However, this solution lacks the establishment of 3D point-to-point connections…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: Project Page: https://chtsy.github.io/uni3drr-page/

  14. arXiv:2404.12138  [pdf, other]

    cs.AI

    Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing?

    Authors: Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, Yanghua Xiao

    Abstract: Can Large Language Models substitute humans in making important decisions? Recent research has unveiled the potential of LLMs to role-play assigned personas, mimicking their knowledge and linguistic habits. However, imitative decision-making requires a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investi…

    Submitted 18 April, 2024; originally announced April 2024.

  15. arXiv:2404.10584  [pdf, other]

    cs.CV

    ReWiTe: Realistic Wide-angle and Telephoto Dual Camera Fusion Dataset via Beam Splitter Camera Rig

    Authors: Chunli Peng, Xuan Dong, Tiantian Cao, Zhengqing Li, Kun Dong, Weixin Li

    Abstract: The fusion of images from dual camera systems featuring a wide-angle and a telephoto camera has become a hotspot problem recently. By integrating simultaneously captured wide-angle and telephoto images from these systems, the resulting fused image achieves a wide field of view (FOV) coupled with high-definition quality. Existing approaches are mostly deep learning methods, and predominantly rely o…

    Submitted 29 April, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

  16. arXiv:2404.08237  [pdf, other]

    cs.CV cs.AI

    IFViT: Interpretable Fixed-Length Representation for Fingerprint Matching via Vision Transformer

    Authors: Yuhang Qiu, Honghui Chen, Xingbo Dong, Zheng Lin, Iman Yi Liao, Massimo Tistarelli, Zhe Jin

    Abstract: Determining dense feature points on fingerprints used in constructing deep fixed-length representations for accurate matching, particularly at the pixel level, is of significant interest. To explore the interpretability of fingerprint matching, we propose a multi-stage interpretable fingerprint matching network, namely Interpretable Fixed-length Representation for Fingerprint Matching via Vision T…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: ready to submit to IEEE Transactions on Information Forensics and Security (TIFS)

  17. arXiv:2404.06512  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow reso…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Code and models are publicly available at https://github.com/InternLM/InternLM-XComposer

  18. arXiv:2404.01059  [pdf, ps, other]

    cs.IT eess.SP

    STAR-RIS Aided Secure MIMO Communication Systems

    Authors: Xiequn Dong, Zesong Fei, Xinyi Wang, Meng Hua, Qingqing Wu

    Abstract: This paper investigates simultaneous transmission and reflection reconfigurable intelligent surface (STAR-RIS) aided physical layer security (PLS) in multiple-input multiple-output (MIMO) systems, where the base station (BS) transmits secrecy information with the aid of STAR-RIS against multiple eavesdroppers equipped with multiple antennas. We aim to maximize the secrecy rate by jointly optimizin…

    Submitted 1 April, 2024; originally announced April 2024.

  19. arXiv:2403.20330  [pdf, other]

    cs.CV

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

    Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomeno…

    Submitted 9 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: Project page: https://mmstar-benchmark.github.io/

  20. arXiv:2403.19866  [pdf, other]

    cs.CV cs.AI

    Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization

    Authors: Yuhang Li, Xin Dong, Chen Chen, Jingtao Li, Yuxin Wen, Michael Spranger, Lingjuan Lyu

    Abstract: Synthetic image data generation represents a promising avenue for training deep learning models, particularly in the realm of transfer learning, where obtaining real images within a specific domain can be prohibitively expensive due to privacy and intellectual property considerations. This work delves into the generation and utilization of synthetic images derived from text-to-image generative mod…

    Submitted 2 April, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: ICLR24 Score 6865 https://openreview.net/forum?id=CjPt1AC6w0

  21. arXiv:2403.19516  [pdf, ps, other]

    stat.ML cs.LG cs.SI math.ST

    Maximum Likelihood Estimation on Stochastic Blockmodels for Directed Graph Clustering

    Authors: Mihai Cucuringu, Xiaowen Dong, Ning Zhang

    Abstract: This paper studies the directed graph clustering problem through the lens of statistics, where we formulate clustering as estimating underlying communities in the directed stochastic block model (DSBM). We conduct the maximum likelihood estimation (MLE) on the DSBM and thereby ascertain the most probable community assignment given the observed graph structure. In addition to the statistical point…

    Submitted 28 March, 2024; originally announced March 2024.

  22. arXiv:2403.17297  [pdf, other]

    cs.CL cs.AI

    InternLM2 Technical Report

    Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, et al. (75 additional authors not shown)

    Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m…

    Submitted 25 March, 2024; originally announced March 2024.

  23. arXiv:2403.16212  [pdf, other]

    eess.IV cs.CV cs.LG

    Leveraging Deep Learning and Xception Architecture for High-Accuracy MRI Classification in Alzheimer Diagnosis

    Authors: Shaojie Li, Haichen Qu, Xinqi Dong, Bo Dang, Hengyi Zang, Yulu Gong

    Abstract: Exploring the application of deep learning technologies in the field of medical diagnostics, Magnetic Resonance Imaging (MRI) provides a unique perspective for observing and diagnosing complex neurodegenerative diseases such as Alzheimer Disease (AD). With advancements in deep learning, particularly in Convolutional Neural Networks (CNNs) and the Xception network architecture, we are now able to a…

    Submitted 24 March, 2024; originally announced March 2024.

  24. arXiv:2403.15955  [pdf, other]

    cs.CV cs.AI

    Finding needles in a haystack: A Black-Box Approach to Invisible Watermark Detection

    Authors: Minzhou Pan, Zhenting Wang, Xin Dong, Vikash Sehwag, Lingjuan Lyu, Xue Lin

    Abstract: In this paper, we propose WaterMark Detection (WMD), the first invisible watermark detection method under a black-box and annotation-free setting. WMD is capable of detecting arbitrary watermarks within a given reference dataset using a clean non-watermarked dataset as a reference, without relying on specific decoding methods or prior knowledge of the watermarking techniques. We develop WMD using…

    Submitted 30 March, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

  25. arXiv:2403.15885  [pdf, other]

    cs.CL

    STEntConv: Predicting Disagreement with Stance Detection and a Signed Graph Convolutional Network

    Authors: Isabelle Lorge, Li Zhang, Xiaowen Dong, Janet B. Pierrehumbert

    Abstract: The rise of social media platforms has led to an increase in polarised online discussions, especially on political and socio-cultural topics such as elections and climate change. We propose a simple and novel unsupervised method to predict whether the authors of two posts agree or disagree, leveraging user stances about named entities obtained from their posts. We present STEntConv, a model which…

    Submitted 26 March, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

    Comments: Accepted for the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

  26. arXiv:2403.15378  [pdf, other]

    cs.CV

    Long-CLIP: Unlocking the Long-Text Capability of CLIP

    Authors: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

    Abstract: Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The length of the text token is restricted to 77, and an empirical study shows the actual effective…

    Submitted 22 March, 2024; originally announced March 2024.

    Comments: All codes and models are publicly available at https://github.com/beichenzbc/Long-CLIP

  27. arXiv:2403.14483  [pdf, other]

    cs.LG cs.AI q-fin.ST

    Utilizing the LightGBM Algorithm for Operator User Credit Assessment Research

    Authors: Shaojie Li, Xinqi Dong, Danqing Ma, Bo Dang, Hengyi Zang, Yulu Gong

    Abstract: Mobile Internet user credit assessment is an important way for communication operators to establish decisions and formulate measures, and it is also a guarantee for operators to obtain expected benefits. However, credit evaluation methods have long been monopolized by financial industries such as banks and credit. As supporters and providers of platform network technology and network resources, co…

    Submitted 21 March, 2024; originally announced March 2024.

  28. arXiv:2403.14374  [pdf, other]

    cs.CL cs.IR

    FIT-RAG: Black-Box RAG with Factual Information and Token Reduction

    Authors: Yuren Mao, Xuemei Dong, Wenyi Xu, Yunjun Gao, Bin Wei, Ying Zhang

    Abstract: Due to the extraordinarily large number of parameters, fine-tuning Large Language Models (LLMs) to update long-tail or out-of-date knowledge is impractical in lots of applications. To avoid fine-tuning, we can alternatively treat a LLM as a black-box (i.e., freeze the parameters of the LLM) and augment it with a Retrieval-Augmented Generation (RAG) system, namely black-box RAG. Recently, black-box…

    Submitted 21 March, 2024; originally announced March 2024.

  29. arXiv:2403.13805  [pdf, other]

    cs.CV cs.AI cs.LG

    RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

    Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project: https://github.com/Liuziyu77/RAR

  30. arXiv:2403.13703  [pdf]

    cs.CV cs.AI

    Fostc3net:A Lightweight YOLOv5 Based On the Network Structure Optimization

    Authors: Danqing Ma, Shaojie Li, Bo Dang, Hengyi Zang, Xinqi Dong

    Abstract: Transmission line detection technology is crucial for automatic monitoring and ensuring the safety of electrical facilities. The YOLOv5 series is currently one of the most advanced and widely used methods for object detection. However, it faces inherent challenges, such as high computational load on devices and insufficient detection accuracy. To address these concerns, this paper presents an enha…

    Submitted 20 March, 2024; originally announced March 2024.

  31. arXiv:2403.10288  [pdf, other]

    stat.ML cs.AI cs.LG

    Rough Transformers for Continuous and Efficient Time-Series Modelling

    Authors: Fernando Moreno-Pino, Álvaro Arroyo, Harrison Waldon, Xiaowen Dong, Álvaro Cartea

    Abstract: Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In such contexts, traditional sequence-based recurrent models struggle. To overcome this, researchers replace recurrent architectures with Neural ODE-based models to model irregularly sampled data and use Transformer-based architectures to account for long-range depe…

    Submitted 15 March, 2024; originally announced March 2024.

  32. arXiv:2403.07033  [pdf, other]

    cs.LG cs.AI

    Interpreting What Typical Fault Signals Look Like via Prototype-matching

    Authors: Qian Chen, Xingjian Dong, Zhike Peng

    Abstract: Neural networks, with powerful nonlinear mapping and classification capabilities, are widely applied in mechanical fault diagnosis to ensure safety. However, being typical black-box models, their application is limited in high-reliability-required scenarios. To understand the classification logic and explain what typical fault signals look like, the prototype matching network (PMN) is proposed by…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: 17 pages, 12 figures, 6 tables

  33. arXiv:2403.06535  [pdf, other]

    cs.LG cs.AI cs.MA

    Decentralized and Lifelong-Adaptive Multi-Agent Collaborative Learning

    Authors: Shuo Tang, Rui Ye, Chenxin Xu, Xiaowen Dong, Siheng Chen, Yanfeng Wang

    Abstract: Decentralized and lifelong-adaptive multi-agent collaborative learning aims to enhance collaboration among multiple agents without a central server, with each agent solving varied tasks over time. To achieve efficient collaboration, agents should: i) autonomously identify beneficial collaborative relationships in a decentralized manner; and ii) adapt to dynamically changing task observations. In t…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: 23 pages, 15 figures

  34. arXiv:2403.04735  [pdf, other]

    cs.CV

    SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM

    Authors: Jielin Qiu, Andrea Madotto, Zhaojiang Lin, Paul A. Crook, Yifan Ethan Xu, Xin Luna Dong, Christos Faloutsos, Lei Li, Babak Damavandi, Seungwhan Moon

    Abstract: Vision-extended LLMs have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric V…

    Submitted 7 March, 2024; originally announced March 2024.

  35. arXiv:2402.17645  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

    Authors: Shuangrui Ding, Zihan Liu, Xiaoyi Dong, Pan Zhang, Rui Qian, Conghui He, Dahua Lin, Jiaqi Wang

    Abstract: We present SongComposer, an innovative LLM designed for song composition. It could understand and generate melodies and lyrics in symbolic song representations, by leveraging the capability of LLM. Existing music-related LLM treated the music as quantized audio signals, while such implicit encoding leads to inefficient encoding and poor flexibility. In contrast, we resort to symbolic song represen…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: project page: https://pjlab-songcomposer.github.io/ code: https://github.com/pjlab-songcomposer/songcomposer

  36. arXiv:2402.14767  [pdf, other]

    cs.CV

    DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

    Authors: Yuhang Cao, Pan Zhang, Xiaoyi Dong, Dahua Lin, Jiaqi Wang

    Abstract: We present DualFocus, a novel framework for integrating macro and micro perspectives within multi-modal large language models (MLLMs) to enhance vision-language task performance. Current MLLMs typically singularly focus on inputs at a predefined resolution, resulting in deficiencies in detailed questions involving local regions. We introduced a DualFocus mechanism where the model concentrates on t…

    Submitted 22 February, 2024; originally announced February 2024.

  37. "It Must Be Gesturing Towards Me": Gesture-Based Interaction between Autonomous Vehicles and Pedestrians

    Authors: Xiang Chang, Zihe Chen, Xiaoyan Dong, Yuxin Cai, Tingmin Yan, Haolin Cai, Zherui Zhou, Guyue Zhou, Jiangtao Gong

    Abstract: Interacting with pedestrians understandably and efficiently is one of the toughest challenges faced by autonomous vehicles (AVs) due to the limitations of current algorithms and external human-machine interfaces (eHMIs). In this paper, we design eHMIs based on gestures inspired by the most popular method of interaction between pedestrians and human drivers. Eight common gestures were selected to c…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: 26 pages, 22 figures

    MSC Class: H.5.2

    Journal ref: CHI2024

  38. arXiv:2402.11190  [pdf, other]

    cs.CL

    Disclosure and Mitigation of Gender Bias in LLMs

    Authors: Xiangjue Dong, Yibo Wang, Philip S. Yu, James Caverlee

    Abstract: Large Language Models (LLMs) can generate biased responses. Yet previous direct probing techniques contain either gender mentions or predefined gender stereotypes, which are challenging to comprehensively collect. Hence, we propose an indirect probing framework based on conditional generation. This approach aims to induce LLMs to disclose their gender bias even without explicit gender or stereotyp…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: The first two authors contribute equally

  39. arXiv:2402.10466  [pdf, other]

    cs.CL cs.AI

    Large Language Models as Zero-shot Dialogue State Tracker through Function Calling

    Authors: Zekun Li, Zhiyu Zoey Chen, Mike Ross, Patrick Huber, Seungwhan Moon, Zhaojiang Lin, Xin Luna Dong, Adithya Sagar, Xifeng Yan, Paul A. Crook

    Abstract: Large language models (LLMs) are increasingly prevalent in conversational systems due to their advanced understanding and generative capabilities in general contexts. However, their effectiveness in task-oriented dialogues (TOD), which requires not only response generation but also effective dialogue state tracking (DST) within specific tasks and domains, remains less satisfying. In this work, we…

    Submitted 1 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: More results in the next version. Code available at: https://github.com/facebookresearch/FnCTOD

  40. arXiv:2402.08017  [pdf, other]

    cs.CV cs.CL cs.LG

    Lumos : Empowering Multimodal LLMs with Scene Text Recognition

    Authors: Ashish Shenoy, Yichao Lu, Srihari Jayakumar, Debojeet Chatterjee, Mohsen Moslehpour, Pierce Chuang, Abhay Harpale, Vikas Bhardwaj, Di Xu, Shicong Zhao, Longfang Zhao, Ankit Ramchandani, Xin Luna Dong, Anuj Kumar

    Abstract: We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first person point-of-view images, the output of which is used to augment input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to…

    Submitted 12 February, 2024; originally announced February 2024.

    Comments: Submitted to KDD 2024 (ADS Track)

  41. arXiv:2402.07577  [pdf, other]

    cs.CL

    Topic Modeling as Multi-Objective Contrastive Optimization

    Authors: Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

    Abstract: Recent representation learning approaches enhance neural topic models by optimizing the weighted linear combination of the evidence lower bound (ELBO) of the log-likelihood and the contrastive learning objective that contrasts pairs of input documents. However, document-level contrastive learning might capture low-level mutual information, such as word ratio, which disturbs topic modeling. Moreove…

    Submitted 9 March, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: Accepted at ICLR 2024 (poster)

  42. arXiv:2402.05569  [pdf, other]

    cs.LG cs.AI eess.SP stat.ML

    Hypergraph Node Classification With Graph Neural Networks

    Authors: Bohan Tang, Zexi Liu, Keyue Jiang, Siheng Chen, Xiaowen Dong

    Abstract: Hypergraphs, with hyperedges connecting more than two nodes, are key for modelling higher-order interactions in real-world data. The success of graph neural networks (GNNs) reveals the capability of neural networks to process data with pairwise interactions. This inspires the usage of neural networks for data with higher-order interactions, thereby leading to the development of hypergraph neural n…

    Submitted 8 February, 2024; originally announced February 2024.

  43. arXiv:2402.00740  [pdf, other]

    cs.CV

    DRSM: efficient neural 4d decomposition for dynamic reconstruction in stationary monocular cameras

    Authors: Weixing Xie, Xiao Dong, Yong Yang, Qiqin Lin, Jingze Chen, Junfeng Yao, Xiaohu Guo

    Abstract: With the popularity of monocular videos generated by video sharing and live broadcasting applications, reconstructing and editing dynamic scenes in stationary monocular cameras has become a special but anticipated technology. In contrast to scene reconstructions that exploit multi-view observations, the problem of modeling a dynamic scene from a single view is significantly more under-constrained…

    Submitted 1 February, 2024; originally announced February 2024.

  44. arXiv:2401.17797  [pdf, other]

    cs.CV

    M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

    Authors: Xingning Dong, Zipeng Feng, Chunluan Zhou, Xuzheng Yu, Ming Yang, Qingpei Guo

    Abstract: We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP. Upon popular image-text models like CLIP, most current adaptation-based video-text pre-training methods are confronted by three major issues, i.e., noisy data corpus, time-consuming pre-training, and limited performance gain. Towards this end,…

    Submitted 31 January, 2024; originally announced January 2024.

  45. SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

    Authors: Xingning Dong, Qingpei Guo, Tian Gan, Qing Wang, Jianlong Wu, Xiangyuan Ren, Yuan Cheng, Wei Chu

    Abstract: We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, based on the shortcomings of two mainstream pixel-level pre-training architectures (limited applications or less efficient), we propose Shared Network Pre-traini…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted by TCSVT (IEEE Transactions on Circuits and Systems for Video Technology)

  46. arXiv:2401.16420  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XCo…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Code and models are available at https://github.com/InternLM/InternLM-XComposer

  47. arXiv:2401.16268  [pdf]

    cs.CY

    A.I. In All The Wrong Places

    Authors: Marc Böhlen, Ruolin Chen, Xiaoxu Dong, Srikar Gopaladinne, Hemanth Gorla, Divya Kandukuri, Sean Mansfield

    Abstract: This text describes experiences gained across a two-year test period during which two generations of Generative Artificial Intelligence (A.I.) systems were incorporated into an interdisciplinary, university level course on A.I. for art and design practices. The text uses the results from the courses to reflect on new opportunities for generative systems in art and design, while considering traps a…

    Submitted 17 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: 20 pages, 3 tables, 4 images

  48. arXiv:2401.12425  [pdf, other]

    cs.CV cs.CL cs.LG

    The Neglected Tails of Vision-Language Models

    Authors: Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong

    Abstract: Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in…

    Submitted 1 February, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Project Page: https://shubhamprshr27.github.io/neglected-tails-of-vlms/

  49. arXiv:2401.09235  [pdf, ps, other]

    cs.LG cs.AI

    A Characterization Theorem for Equivariant Networks with Point-wise Activations

    Authors: Marco Pacini, Xiaowen Dong, Bruno Lepri, Gabriele Santin

    Abstract: Equivariant neural networks have shown improved performance, expressiveness and sample complexity on symmetrical domains. But for some specific symmetries, representations, and choice of coordinates, the most common point-wise activations, such as ReLU, are not equivariant, hence they cannot be employed in the design of equivariant neural networks. The theorem we present in this paper describes al…

    Submitted 17 January, 2024; originally announced January 2024.

    Comments: Accepted at the 12th International Conference on Learning Representations (ICLR 2024)

  50. arXiv:2401.07762  [pdf]

    cs.CE

    Auto-Regressive Model with Exogenous Input--ARX--based traffic-flow prediction

    Authors: Jun Ying, Xin Dong, Bowei Li, Zihan Tian

    Abstract: Traffic flow prediction is widely used in travel decision making, traffic control, roadway system planning, business sectors, and government agencies. ARX models have proved to be highly effective and versatile. In this research, we investigated the applications of ARX models in prediction for real traffic flow in New York City. The ARX models were constructed by linear/polynomial or neural networ…

    Submitted 15 January, 2024; originally announced January 2024.