Showing 1–50 of 220 results for author: Peng, N

Searching in archive cs.
  1. arXiv:2506.12383  [pdf, ps, other]

    cs.LG stat.ML

    Scaling Probabilistic Circuits via Monarch Matrices

    Authors: Honghua Zhang, Meihua Dang, Benjie Wang, Stefano Ermon, Nanyun Peng, Guy Van den Broeck

    Abstract: Probabilistic Circuits (PCs) are tractable representations of probability distributions allowing for exact and efficient computation of likelihoods and marginals. Recent advancements have improved the scalability of PCs either by leveraging their sparse properties or through the use of tensorized operations for better hardware utilization. However, no existing method fully exploits both aspects si…

    Submitted 14 June, 2025; originally announced June 2025.

  2. arXiv:2506.05128  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning

    Authors: Tanmay Parekh, Kartik Mehta, Ninareh Mehrabi, Kai-Wei Chang, Nanyun Peng

    Abstract: Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED.…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Submitted at ACL ARR May 2025

  3. arXiv:2506.02175  [pdf, ps, other]

    cs.CL

    AI Debate Aids Assessment of Controversial Claims

    Authors: Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, Yejin Choi, Julian Michael, Liwei Jiang, Saadia Gabriel

    Abstract: As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides, especially on consequential topics like public health where factual accuracy directly impacts well-being. Scalable Oversight aims to ensure AI truthfulness by enabling humans to supervise systems that may exceed human ca…

    Submitted 2 June, 2025; originally announced June 2025.

  4. arXiv:2506.00319  [pdf, ps, other]

    cs.CL

    SkillVerse: Assessing and Enhancing LLMs with Tree Evaluation

    Authors: Yufei Tian, Jiao Sun, Nanyun Peng, Zizhao Zhang

    Abstract: As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities.…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: Accepted to ACL 2025

  5. arXiv:2505.22657  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

    Authors: Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang

    Abstract: Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench,…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: demos at: https://3dllm-mem.github.io

  6. arXiv:2505.20759  [pdf, ps, other]

    cs.CV cs.AI

    PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

    Authors: Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji

    Abstract: Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and ou…

    Submitted 15 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: 18 pages

  7. arXiv:2505.18356  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs

    Authors: Lucas Bandarkar, Nanyun Peng

    Abstract: Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overl…

    Submitted 23 May, 2025; originally announced May 2025.

    ACM Class: I.2.7

  8. arXiv:2505.14972  [pdf, other]

    cs.CL

    Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies

    Authors: Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng

    Abstract: Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. To address this gap, we intr…

    Submitted 20 May, 2025; originally announced May 2025.

  9. arXiv:2505.06827  [pdf, ps, other]

    cs.CR cs.AI

    Sandcastles in the Storm: Revisiting the (Im)possibility of Strong Watermarking

    Authors: Fabrice Y Harel-Canada, Boran Erol, Connor Choi, Jason Liu, Gary Jiarui Song, Nanyun Peng, Amit Sahai

    Abstract: Watermarking AI-generated text is critical for combating misuse. Yet recent theoretical work argues that any watermark can be erased via random walk attacks that perturb text while preserving quality. However, such attacks rely on two key assumptions: (1) rapid mixing (watermarks dissolve quickly under perturbations) and (2) reliable quality preservation (automated quality oracles perfectly guide…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: In Review @ ACL 2025

  10. arXiv:2504.09737  [pdf, other]

    cs.AI cs.CL cs.HC cs.LG

    Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025

    Authors: Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, James Zou

    Abstract: Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, a…

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: 30 pages, 7 figures

  11. arXiv:2504.06647  [pdf, other]

    cs.CV

    Uni-PrevPredMap: Extending PrevPredMap to a Unified Framework of Prior-Informed Modeling for Online Vectorized HD Map Construction

    Authors: Nan Peng, Xun Zhou, Mingming Wang, Guisong Chen, Wenqi Xu

    Abstract: Safety constitutes a foundational imperative for autonomous driving systems, necessitating the maximal incorporation of accessible external prior information. This study establishes that temporal perception buffers and cost-efficient maps inherently form complementary prior sources for online vectorized high-definition (HD) map construction. We present Uni-PrevPredMap, a unified prior-informed fra…

    Submitted 9 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

  12. arXiv:2504.03786  [pdf, other]

    cs.CL

    Do "New Snow Tablets" Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs

    Authors: Sifan Li, Yujun Cai, Bryan Hooi, Nanyun Peng, Yiwei Wang

    Abstract: Traditional Chinese Medicine (TCM) has seen increasing adoption in healthcare, with specialized Large Language Models (LLMs) emerging to support clinical applications. A fundamental requirement for these models is accurate identification of TCM drug ingredients. In this paper, we evaluate how general and TCM-specialized LLMs perform when identifying ingredients of Chinese drugs. Our systematic ana…

    Submitted 15 April, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

  13. arXiv:2504.01018  [pdf, other]

    cs.CL

    Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization

    Authors: Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng

    Abstract: Selective retrieval improves retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals and improving efficiency. However, existing approaches under-utilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework t…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Work in Progress

  14. arXiv:2503.17352  [pdf, other]

    cs.CV cs.CL

    OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

    Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang

    Abstract: Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved by RL with verifiable rewards and significantly improves model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: 23 pages, 11 figures, 8 tables

  15. arXiv:2503.17136  [pdf, other]

    cs.CL

    CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

    Authors: Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang

    Abstract: Evaluating creative text such as human-written stories using language models has always been a challenging task -- owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) generates free-text explanations that help guide a model's predictions and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this…

    Submitted 21 March, 2025; originally announced March 2025.

  16. arXiv:2503.16356  [pdf, other]

    cs.CL cs.AI cs.CV cs.IR cs.LG

    CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

    Authors: Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng

    Abstract: Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they struggle to generalize these updates to multi-hop reasoning tasks that depend on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference, we observe tha…

    Submitted 20 March, 2025; originally announced March 2025.

    Comments: Work in progress

  17. arXiv:2503.05037  [pdf, ps, other]

    cs.CL cs.IR

    Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence

    Authors: Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, Nanyun Peng

    Abstract: Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid downstream failures. In this work, we repurpose a relation extraction dataset (e.g., Re-DocRED) to design controlled experiments that quantify the impact of heuristic biase…

    Submitted 2 June, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: ACL 2025 Main Conference

  18. arXiv:2503.03194  [pdf, other]

    cs.CL cs.AI

    Structured Outputs Enable General-Purpose LLMs to be Medical Experts

    Authors: Guangfu Guo, Kai Zhang, Bryan Hoo, Yujun Cai, Xiaoqian Lu, Nanyun Peng, Yiwei Wang

    Abstract: Medical question-answering (QA) is a critical task for evaluating how effectively large language models (LLMs) encode clinical knowledge and assessing their potential applications in medicine. Despite showing promise on multiple-choice tests, LLMs frequently struggle with open-ended medical questions, producing responses with dangerous hallucinations or lacking comprehensive coverage of critical a…

    Submitted 5 March, 2025; originally announced March 2025.

  19. arXiv:2502.17832  [pdf, other]

    cs.LG cs.AI cs.CR cs.CV

    MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

    Authors: Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji

    Abstract: Multimodal large language models (MLLMs) equipped with Retrieval Augmented Generation (RAG) leverage both their rich parametric knowledge and the dynamic, external knowledge to excel in tasks such as Question Answering. While RAG enhances MLLMs by grounding responses in query-relevant external knowledge, this reliance poses a critical yet underexplored safety risk: knowledge poisoning attacks, whe…

    Submitted 8 March, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: Code is available at https://github.com/HyeonjeongHa/MM-PoisonRAG

  20. arXiv:2502.17793  [pdf, ps, other]

    cs.CV cs.AI

    SYNTHIA: Novel Concept Design with Affordance Composition

    Authors: Hyeonjeong Ha, Xiaomeng Jin, Jeonghwan Kim, Jiateng Liu, Zhenhailong Wang, Khanh Duy Nguyen, Ansel Blume, Nanyun Peng, Kai-Wei Chang, Heng Ji

    Abstract: Text-to-image (T2I) models enable rapid concept design, making them widely used in AI-driven design. While recent studies focus on generating semantic and stylistic variations of given design concepts, functional coherence--the integration of multiple affordances into a single coherent concept--remains largely overlooked. In this paper, we introduce SYNTHIA, a framework for generating novel, funct…

    Submitted 6 June, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: ACL 2025 Main, Code is available https://github.com/HyeonjeongHa/SYNTHIA

  21. arXiv:2502.17710  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures

    Authors: Akhila Yerukola, Saadia Gabriel, Nanyun Peng, Maarten Sap

    Abstract: Gestures are an integral part of non-verbal communication, with meanings that vary across cultures, and misinterpretations that can have serious social and diplomatic consequences. As AI systems become more integrated into global applications, ensuring they do not inadvertently perpetuate cultural offenses is critical. To this end, we introduce Multi-Cultural Set of Inappropriate Gestures and Nonv…

    Submitted 24 February, 2025; originally announced February 2025.

    Comments: 40 pages, 49 figures

  22. arXiv:2502.17709  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Contrastive Visual Data Augmentation

    Authors: Yu Zhou, Bingxuan Li, Mohan Tang, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng

    Abstract: Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their a…

    Submitted 4 June, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Journal ref: ICML 2025

  23. arXiv:2502.17651  [pdf, other]

    cs.CV cs.AI cs.CL

    METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling

    Authors: Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, Nanyun Peng

    Abstract: Chart generation aims to generate code to produce charts satisfying the desired visual properties, e.g., texts, layout, color, and type. It has great potential to empower the automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart…

    Submitted 5 March, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

  24. arXiv:2502.17394  [pdf, ps, other]

    cs.CL cs.AI

    SNaRe: Domain-aware Data Generation for Low-Resource Event Detection

    Authors: Tanmay Parekh, Yuxuan Dong, Lucas Bandarkar, Artin Kim, I-Hung Hsu, Kai-Wei Chang, Nanyun Peng

    Abstract: Event Detection (ED) -- the task of identifying event mentions from natural language text -- is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. Data generation has proven to be effective in broadening its utility to wider applications without requiring expensive expert annotations. However, when existing generation approaches are applied to…

    Submitted 5 June, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: Under review at ACL ARR May 2025

  25. arXiv:2502.14275  [pdf, other]

    cs.CL cs.LG

    Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment

    Authors: Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu

    Abstract: Large language models (LLMs) have been widely adopted in various downstream task domains. However, their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. Given the high-stakes na…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: 15 pages, 11 figures

  26. arXiv:2502.08180  [pdf, other]

    cs.CL cs.AI

    Enhancing LLM Character-Level Manipulation via Divide and Conquer

    Authors: Zhen Xiong, Yujun Cai, Bryan Hooi, Nanyun Peng, Zhecheng Li, Yiwei Wang

    Abstract: Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. These challenges stem primarily from tokenization constraints, despite the cr…

    Submitted 27 March, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

  27. arXiv:2501.00052  [pdf, other]

    cs.LG cs.GT cs.MA

    Efficient and Scalable Deep Reinforcement Learning for Mean Field Control Games

    Authors: Nianli Peng, Yilin Wang

    Abstract: Mean Field Control Games (MFCGs) provide a powerful theoretical framework for analyzing systems of infinitely many interacting agents, blending elements from Mean Field Games (MFGs) and Mean Field Control (MFC). However, solving the coupled Hamilton-Jacobi-Bellman and Fokker-Planck equations that characterize MFCG equilibria remains a significant computational challenge, particularly in high-dimen…

    Submitted 27 December, 2024; originally announced January 2025.

  28. arXiv:2412.06483  [pdf, other]

    cs.CL cs.AI

    SafeWorld: Geo-Diverse Safety Alignment

    Authors: Da Yin, Haoyi Qiu, Kung-Hsiang Huang, Kai-Wei Chang, Nanyun Peng

    Abstract: In the rapidly evolving field of Large Language Models (LLMs), ensuring safety is a crucial and widely discussed topic. However, existing works often overlook the geo-diversity of cultural and legal standards across the world. To demonstrate the challenges posed by geo-diverse safety standards, we introduce SafeWorld, a novel benchmark specifically designed to evaluate LLMs' ability to generate re…

    Submitted 9 December, 2024; originally announced December 2024.

    Comments: Accepted by NeurIPS 2024

  29. arXiv:2412.02172  [pdf, other]

    cs.CV cs.AI cs.CL

    VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

    Authors: Xueqing Wu, Yuheng Ding, Bingxuan Li, Pan Lu, Da Yin, Kai-Wei Chang, Nanyun Peng

    Abstract: The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement. However, a systematic analysis of such capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Compared to existing work that uses a sin…

    Submitted 18 March, 2025; v1 submitted 3 December, 2024; originally announced December 2024.

    Comments: CVPR 2025. https://visco-benchmark.github.io/

  30. arXiv:2411.18651  [pdf, other]

    cs.CV cs.CL cs.LG

    Verbalized Representation Learning for Interpretable Few-Shot Generalization

    Authors: Cheng-Fu Yang, Da Yin, Wenbo Hu, Nanyun Peng, Bolei Zhou, Kai-Wei Chang

    Abstract: Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representation can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracti…

    Submitted 26 November, 2024; originally announced November 2024.

  31. arXiv:2411.17993  [pdf, other]

    cs.CL

    DRS: Deep Question Reformulation With Structured Output

    Authors: Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang

    Abstract: Question answering represents a core capability of large language models (LLMs). However, when individuals encounter unfamiliar knowledge in texts, they often formulate questions that the text itself cannot answer due to insufficient understanding of the underlying information. Recent studies reveal that while LLMs can detect unanswerable questions, they struggle to assist users in reformulating t…

    Submitted 25 March, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

  32. arXiv:2411.05192  [pdf, other]

    cs.CL cs.AI

    Explaining Mixtures of Sources in News Articles

    Authors: Alexander Spangher, James Youn, Matt DeButts, Nanyun Peng, Emilio Ferrara, Jonathan May

    Abstract: Human writers plan, then write. For large language models (LLMs) to play a role in longer-form article generation, we must understand the planning steps humans make before writing. We explore one kind of planning, source-selection in news, as a case-study for evaluating plans in long-form generation. We ask: why do specific stories call for specific kinds of sources? We imagine a generative proces…

    Submitted 7 November, 2024; originally announced November 2024.

    Comments: 9 pages

  33. arXiv:2411.02688  [pdf, other]

    cs.CL cs.LG

    On the Loss of Context-awareness in General Instruction Fine-tuning

    Authors: Yihan Wang, Andrew Bai, Nanyun Peng, Cho-Jui Hsieh

    Abstract: Pre-trained Large Language Models (LLMs) require post-training methods such as supervised fine-tuning (SFT) on instruction-response pairs to enable instruction following. However, this process can potentially harm existing capabilities learned during pre-training. In this paper, we investigate the loss of context awareness after SFT, where context awareness is defined as the ability to extract and…

    Submitted 2 February, 2025; v1 submitted 4 November, 2024; originally announced November 2024.

  34. arXiv:2411.01610  [pdf, other]

    cs.CL

    Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM

    Authors: Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, Tagyoung Chung

    Abstract: Contrastive decoding (CD) (Li et al., 2023) improves the next-token distribution of a large expert language model (LM) using a small amateur LM. Although CD is applied to various LMs and domains to enhance open-ended text generation, it is still unclear why CD often works well, when it could fail, and how we can make it better. To deepen our understanding of CD, we first theoretically prove that C…

    Submitted 3 November, 2024; originally announced November 2024.

    Comments: EMNLP 2024 Oral

  35. arXiv:2410.23252  [pdf, other]

    cs.CL

    Evaluating Cultural and Social Awareness of LLM Web Agents

    Authors: Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu

    Abstract: As large language models (LLMs) expand into performing as agents for real-world applications beyond traditional NLP tasks, evaluating their robustness becomes increasingly important. However, existing benchmarks often overlook critical dimensions like cultural and social awareness. To address these, we introduce CASA, a benchmark designed to assess LLM agents' sensitivity to cultural and social no…

    Submitted 8 March, 2025; v1 submitted 30 October, 2024; originally announced October 2024.

    Comments: NAACL 2025 Findings

  36. arXiv:2410.20533  [pdf, other]

    cs.LG cs.CL

    Guiding Through Complexity: What Makes Good Supervision for Hard Math Reasoning Tasks?

    Authors: Xuan He, Da Yin, Nanyun Peng

    Abstract: How can "weak teacher models", such as average human annotators or existing AI systems, effectively supervise LLMs to improve performance on hard reasoning tasks, especially those that challenge and require expertise or daily practice from the teacher models? In this paper, we seek empirical answers to this question by investigating various data-driven strategies that offer supervision data at…

    Submitted 25 February, 2025; v1 submitted 27 October, 2024; originally announced October 2024.

    Comments: NAACL 2025 Main

  37. arXiv:2410.20021  [pdf, other]

    cs.CL cs.AI

    Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization

    Authors: Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Naifan Cheung, Nanyun Peng, Kai-wei Chang

    Abstract: Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. However, unlike languages such as English, Chinese or Spanish, for those relatively low-resource languages with limited usage or data, recent studies have shown that LLMs' performance on CLS tasks…

    Submitted 25 March, 2025; v1 submitted 25 October, 2024; originally announced October 2024.

  38. arXiv:2410.20016  [pdf, other]

    cs.CL

    Vulnerability of LLMs to Vertically Aligned Text Manipulations

    Authors: Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Zhen Xiong, Nanyun Peng, Kai-wei Chang

    Abstract: Text classification involves categorizing a given text, such as determining its sentiment or identifying harmful content. With the advancement of large language models (LLMs), these models have become highly effective at performing text classification tasks. However, they still show vulnerabilities to variations in text formatting. Recent research demonstrates that modifying input formats, such as…

    Submitted 25 March, 2025; v1 submitted 25 October, 2024; originally announced October 2024.

  39. arXiv:2410.18393  [pdf, other]

    cs.CL cs.SI

    SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness

    Authors: Tanmay Parekh, Jeffrey Kwan, Jiarui Yu, Sparsh Johri, Hyosang Ahn, Sreya Muppalla, Kai-Wei Chang, Wei Wang, Nanyun Peng

    Abstract: Social media is often the first place where communities discuss the latest societal trends. Prior works have utilized this platform to extract epidemic-related information (e.g. infections, preventive measures) to provide early warnings for epidemic prediction. However, these works only focused on English posts, while epidemics can occur anywhere in the world, and early discussions are often in th…

    Submitted 23 October, 2024; originally announced October 2024.

    Comments: Accepted at EMNLP 2024

  40. arXiv:2410.15277  [pdf, other]

    cs.CL

    BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression

    Authors: Yuankai Li, Jia-Chen Gu, Di Wu, Kai-Wei Chang, Nanyun Peng

    Abstract: Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across docu…

    Submitted 15 February, 2025; v1 submitted 20 October, 2024; originally announced October 2024.

    Comments: Accepted by NAACL 2025 Findings. Project page: https://jasonforjoy.github.io/BRIEF/

  41. arXiv:2410.09988  [pdf, other]

    cs.LG cs.AI

    HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

    Authors: Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner

    Abstract: Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring challenging applied mathematics problems that require analytical approximation techniques. These problems demand a combination of mathematical reasoning, computational t…

    Submitted 13 December, 2024; v1 submitted 13 October, 2024; originally announced October 2024.

    Comments: Code and the HARDMath dataset is available at https://github.com/sarahmart/HARDMath

  42. arXiv:2410.08182  [pdf, other]

    cs.CV cs.AI cs.CL

    MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

    Authors: Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, Nanyun Peng

    Abstract: Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we s…

    Submitted 19 March, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

    Comments: ICLR 2025

  43. arXiv:2410.06458  [pdf, other]

    cs.CL cs.AI cs.LG

    LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

    Authors: Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng

    Abstract: Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs'…

    Submitted 8 October, 2024; originally announced October 2024.

    Comments: To appear at EMNLP 2024

  44. arXiv:2410.04628  [pdf, other]

    cs.CL

    Control Large Language Models via Divide and Conquer

    Authors: Bingxuan Li, Yiwei Wang, Tao Meng, Kai-Wei Chang, Nanyun Peng

    Abstract: This paper investigates controllable generation for large language models (LLMs) with prompt-based control, focusing on Lexically Constrained Generation (LCG). We systematically evaluate the performance of LLMs on satisfying lexical constraints with prompt-based control, as well as their efficacy in downstream applications. We conclude that LLMs face significant challenges in consistently satisfyi…

    Submitted 6 October, 2024; originally announced October 2024.

    Comments: EMNLP 2024

  45. arXiv:2410.03856  [pdf, other]

    cs.CL cs.LG

    Detecting Machine-Generated Long-Form Content with Latent-Space Variables

    Authors: Yufei Tian, Zeyu Pan, Nanyun Peng

    Abstract: The increasing capability of large language models (LLMs) to generate fluent long-form texts is presenting new challenges in distinguishing machine-generated outputs from human-written ones, which is crucial for ensuring authenticity and trustworthiness of expressions. Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts, inclu…

    Submitted 4 October, 2024; originally announced October 2024.

  46. arXiv:2409.03363  [pdf, other]

    cs.CL

    Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding

    Authors: Cheng Wang, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang

    Abstract: The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member a…

    Submitted 14 January, 2025; v1 submitted 5 September, 2024; originally announced September 2024.

  47. arXiv:2409.00292  [pdf, other]

    cs.CL cs.SD eess.AS

    REFFLY: Melody-Constrained Lyrics Editing Model

    Authors: Songyan Zhao, Bingxuan Li, Yufei Tian, Nanyun Peng

    Abstract: Automatic melody-to-lyric (M2L) generation aims to create lyrics that align with a given melody. While most previous approaches generate lyrics from scratch, revision, editing plain text draft to fit it into the melody, offers a much more flexible and practical alternative. This enables broad applications, such as generating lyrics from flexible inputs (keywords, themes, or full text that needs re…

    Submitted 2 May, 2025; v1 submitted 30 August, 2024; originally announced September 2024.

  48. arXiv:2408.10086  [pdf, other]

    cs.AI

    ARMADA: Attribute-Based Multimodal Data Augmentation

    Authors: Xiaomeng Jin, Jeonghwan Kim, Yu Zhou, Kuan-Hao Huang, Te-Lin Wu, Nanyun Peng, Heng Ji

    Abstract: In Multimodal Language Models (MLMs), the cost of manually annotating high-quality image-text pair data for fine-tuning and alignment is extremely high. While existing multimodal data augmentation frameworks propose ways to augment image-text pairs, they either suffer from semantic inconsistency between texts and images, or generate unrealistic images, causing knowledge gap with real world example…

    Submitted 19 August, 2024; originally announced August 2024.

  49. arXiv:2408.03567  [pdf, other]

    cs.CV cs.CL

    Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

    Authors: Zi-Yi Dou, Xitong Yang, Tushar Nagarajan, Huiyu Wang, Jing Huang, Nanyun Peng, Kris Kitani, Fu-Jen Chu

    Abstract: We present EMBED (Egocentric Models Built with Exocentric Data), a method designed to transform exocentric video-language data for egocentric video representation learning. Large-scale exocentric data covers diverse activities with significant potential for egocentric learning, but inherent disparities between egocentric and exocentric data pose challenges in utilizing one view for the other seaml…

    Submitted 7 August, 2024; originally announced August 2024.

  50. arXiv:2408.01046  [pdf, other]

    cs.CL

    QUDSELECT: Selective Decoding for Questions Under Discussion Parsing

    Authors: Ashima Suvarna, Xiao Liu, Tanmay Parekh, Kai-Wei Chang, Nanyun Peng

    Abstract: Question Under Discussion (QUD) is a discourse framework that uses implicit questions to reveal discourse relationships between sentences. In QUD parsing, each sentence is viewed as an answer to a question triggered by an anchor sentence in prior context. The resulting QUD structure is required to conform to several theoretical criteria like answer compatibility (how well the question is answered)…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: 11 Pages, 5 figures