-
FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation
Authors:
Xuehai He,
Jian Zheng,
Jacob Zhiyuan Fang,
Robinson Piramuthu,
Mohit Bansal,
Vicente Ordonez,
Gunnar A Sigurdsson,
Nanyun Peng,
Xin Eric Wang
Abstract:
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexibl…
▽ More
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
Medical Vision-Language Pre-Training for Brain Abnormalities
Authors:
Masoud Monajatipoor,
Zi-Yi Dou,
Aichi Chien,
Nanyun Peng,
Kai-Wei Chang
Abstract:
Vision-language models have become increasingly powerful for tasks that require an understanding of both visual and linguistic elements, bridging the gap between these modalities. In the context of multimodal clinical AI, there is a growing need for models that possess domain-specific knowledge, as existing models often lack the expertise required for medical applications. In this paper, we take b…
▽ More
Vision-language models have become increasingly powerful for tasks that require an understanding of both visual and linguistic elements, bridging the gap between these modalities. In the context of multimodal clinical AI, there is a growing need for models that possess domain-specific knowledge, as existing models often lack the expertise required for medical applications. In this paper, we take brain abnormalities as an example to demonstrate how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset from case reports and published journals and subsequently constructing a high-performance vision-language model tailored to specific medical tasks. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain. We evaluated the resulting model with quantitative and qualitative intrinsic evaluations. The resulting dataset and our code can be found here https://github.com/masoud-monajati/MedVL_pretraining_pipeline
△ Less
Submitted 27 April, 2024;
originally announced April 2024.
-
Weak-to-Strong Extrapolation Expedites Alignment
Authors:
Chujie Zheng,
Ziqi Wang,
Heng Ji,
Minlie Huang,
Nanyun Peng
Abstract:
Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., trained to align with human preference) in hand, can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO…
▽ More
Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., trained to align with human preference) in hand, can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be interpolated between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, thereby directly obtaining this stronger model by extrapolating from the weights of the former two relatively weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to reach and even surpass the fully-trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
Authors:
Haoyi Qiu,
Wenbo Hu,
Zi-Yi Dou,
Nanyun Peng
Abstract:
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability. A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucina…
▽ More
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability. A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucinations. Furthermore, current evaluation methods struggle to effectively address the subtle semantic distinctions between model outputs and reference data, as well as the balance between hallucination and informativeness. To address these issues, we introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases. Moreover, we propose an large language model (LLM)-based two-stage evaluation framework that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. Experiments on 10 established LVLMs demonstrate that our evaluation metric is more comprehensive and better correlated with humans than existing work when evaluating on our challenging human annotated benchmark dataset. Our work also highlights the critical balance between faithfulness and coverage of model outputs, and encourages future works to address hallucinations in LVLMs while keeping their outputs informative.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
GenEARL: A Training-Free Generative Framework for Multimodal Event Argument Role Labeling
Authors:
Hritik Bansal,
Po-Nien Kung,
P. Jeffrey Brantingham,
Kai-Wei Chang,
Nanyun Peng
Abstract:
Multimodal event argument role labeling (EARL), a task that assigns a role for each event participant (object) in an image is a complex challenge. It requires reasoning over the entire image, the depicted event, and the interactions between various objects participating in the event. Existing models heavily rely on high-quality event-annotated training data to understand the event semantics and st…
▽ More
Multimodal event argument role labeling (EARL), a task that assigns a role for each event participant (object) in an image is a complex challenge. It requires reasoning over the entire image, the depicted event, and the interactions between various objects participating in the event. Existing models heavily rely on high-quality event-annotated training data to understand the event semantics and structures, and they fail to generalize to new event types and domains. In this paper, we propose GenEARL, a training-free generative framework that harness the power of the modern generative models to understand event task descriptions given image contexts to perform the EARL task. Specifically, GenEARL comprises two stages of generative prompting with a frozen vision-language model (VLM) and a frozen large language model (LLM). First, a generative VLM learns the semantics of the event argument roles and generates event-centric object descriptions based on the image. Subsequently, a LLM is prompted with the generated object descriptions with a predefined template for EARL (i.e., assign an object with an event argument role). We show that GenEARL outperforms the contrastive pretraining (CLIP) baseline by 9.4% and 14.2% accuracy for zero-shot EARL on the M2E2 and SwiG datasets, respectively. In addition, we outperform CLIP-Event by 22% precision on M2E2 dataset. The framework also allows flexible adaptation and generalization to unseen domains.
△ Less
Submitted 6 April, 2024;
originally announced April 2024.
-
PhonologyBench: Evaluating Phonological Skills of Large Language Models
Authors:
Ashima Suvarna,
Harshita Khandelwal,
Nanyun Peng
Abstract:
Phonology, the study of speech's structure and pronunciation rules, is a critical yet often overlooked component in Large Language Model (LLM) research. LLMs are widely used in various downstream applications that leverage phonology such as educational tools and poetry generation. Moreover, LLMs can potentially learn imperfect associations between orthographic and phonological forms from the train…
▽ More
Phonology, the study of speech's structure and pronunciation rules, is a critical yet often overlooked component in Large Language Model (LLM) research. LLMs are widely used in various downstream applications that leverage phonology such as educational tools and poetry generation. Moreover, LLMs can potentially learn imperfect associations between orthographic and phonological forms from the training data. Thus, it is imperative to benchmark the phonological skills of LLMs. To this end, we present PhonologyBench, a novel benchmark consisting of three diagnostic tasks designed to explicitly test the phonological skills of LLMs in English: grapheme-to-phoneme conversion, syllable counting, and rhyme word generation. Despite having no access to speech data, LLMs showcased notable performance on the PhonologyBench tasks. However, we observe a significant gap of 17% and 45% on Rhyme Word Generation and Syllable counting, respectively, when compared to humans. Our findings underscore the importance of studying LLM performance on phonological tasks that inadvertently impact real-world applications. Furthermore, we encourage researchers to choose LLMs that perform well on the phonological task that is closely related to the downstream application since we find that no single model consistently outperforms the others on all the tasks.
△ Less
Submitted 5 April, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
Event Detection from Social Media for Epidemic Prediction
Authors:
Tanmay Parekh,
Anh Mac,
Jiarui Yu,
Yuxuan Dong,
Syed Shahriar,
Bonnie Liu,
Eric Yang,
Kuan-Hao Huang,
Wei Wang,
Nanyun Peng,
Kai-Wei Chang
Abstract:
Social media is an easy-to-access platform providing timely updates about societal trends and events. Discussions regarding epidemic-related events such as infections, symptoms, and social interactions can be crucial for informing policymaking during epidemic outbreaks. In our work, we pioneer exploiting Event Detection (ED) for better preparedness and early warnings of any upcoming epidemic by de…
▽ More
Social media is an easy-to-access platform providing timely updates about societal trends and events. Discussions regarding epidemic-related events such as infections, symptoms, and social interactions can be crucial for informing policymaking during epidemic outbreaks. In our work, we pioneer exploiting Event Detection (ED) for better preparedness and early warnings of any upcoming epidemic by developing a framework to extract and analyze epidemic-related events from social media posts. To this end, we curate an epidemic event ontology comprising seven disease-agnostic event types and construct a Twitter dataset SPEED with human-annotated events focused on the COVID-19 pandemic. Experimentation reveals how ED models trained on COVID-based SPEED can effectively detect epidemic events for three unseen epidemics of Monkeypox, Zika, and Dengue; while models trained on existing ED datasets fail miserably. Furthermore, we show that reporting sharp increases in the extracted events by our framework can provide warnings 4-9 weeks earlier than the WHO epidemic declaration for Monkeypox. This utility of our framework lays the foundations for better preparedness against emerging epidemics.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
Authors:
Hritik Bansal,
Ashima Suvarna,
Gantavya Bhatt,
Nanyun Peng,
Kai-Wei Chang,
Aditya Grover
Abstract:
A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This only leverages the pairwise comparisons when the generations are placed in an identical context. However, such conditional rankings often fail to capture the complex and multidimensional aspects of human preferences. In this work,…
▽ More
A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This only leverages the pairwise comparisons when the generations are placed in an identical context. However, such conditional rankings often fail to capture the complex and multidimensional aspects of human preferences. In this work, we revisit the traditional paradigm of preference acquisition and propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs. While prior preference optimizations are designed for conditional ranking protocols (e.g., DPO), our proposed preference acquisition protocol introduces DOVE, a new preference optimization objective that upweights the joint probability of the chosen instruction-response pair over the rejected instruction-response pair. Interestingly, we find that the LLM trained with joint instruction-response preference data using DOVE outperforms the LLM trained with DPO by 5.2% and 3.3% win-rate for the summarization and open-ended dialogue datasets, respectively. Our findings reveal that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs by tapping into a broader spectrum of human preference elicitation. The data and code is available at https://github.com/Hritikbansal/dove.
△ Less
Submitted 30 March, 2024;
originally announced April 2024.
-
Argument-Aware Approach To Event Linking
Authors:
I-Hung Hsu,
Zihan Xue,
Nilay Pochh,
Sahil Bansal,
Premkumar Natarajan,
Jayanth Srinivasa,
Nanyun Peng
Abstract:
Event linking connects event mentions in text with relevant nodes in a knowledge base (KB). Prior research in event linking has mainly borrowed methods from entity linking, overlooking the distinct features of events. Compared to the extensively explored entity linking task, events have more complex structures and can be more effectively distinguished by examining their associated arguments. Moreo…
▽ More
Event linking connects event mentions in text with relevant nodes in a knowledge base (KB). Prior research in event linking has mainly borrowed methods from entity linking, overlooking the distinct features of events. Compared to the extensively explored entity linking task, events have more complex structures and can be more effectively distinguished by examining their associated arguments. Moreover, the information-rich nature of events leads to the scarcity of event KBs. This emphasizes the need for event linking models to identify and classify event mentions not in the KB as ``out-of-KB,'' an area that has received limited attention. In this work, we tackle these challenges by introducing an argument-aware approach. First, we improve event linking models by augmenting input text with tagged event argument information, facilitating the recognition of key information about event mentions. Subsequently, to help the model handle ``out-of-KB'' scenarios, we synthesize out-of-KB training examples from in-KB instances through controlled manipulation of event arguments. Our experiment across two test datasets showed significant enhancements in both in-KB and out-of-KB scenarios, with a notable 22% improvement in out-of-KB evaluations.
△ Less
Submitted 22 March, 2024;
originally announced March 2024.
-
Chain of Thought Explanation for Dialogue State Tracking
Authors:
Lin Xu,
Ningxin Peng,
Daquan Zhou,
See-Kiong Ng,
Jinlan Fu
Abstract:
Dialogue state tracking (DST) aims to record user queries and goals during a conversational interaction achieved by maintaining a predefined set of slots and their corresponding values. Current approaches decide slot values opaquely, while humans usually adopt a more deliberate approach by collecting information from relevant dialogue turns and then reasoning the appropriate values. In this work,…
▽ More
Dialogue state tracking (DST) aims to record user queries and goals during a conversational interaction achieved by maintaining a predefined set of slots and their corresponding values. Current approaches decide slot values opaquely, while humans usually adopt a more deliberate approach by collecting information from relevant dialogue turns and then reasoning the appropriate values. In this work, we focus on the steps needed to figure out slot values by proposing a model named Chain-of-Thought-Explanation (CoTE) for the DST task. CoTE, which is built on the generative DST framework, is designed to create detailed explanations step by step after determining the slot values. This process leads to more accurate and reliable slot values. More-over, to improve the reasoning ability of the CoTE, we further construct more fluent and high-quality explanations with automatic paraphrasing, leading the method CoTE-refined. Experimental results on three widely recognized DST benchmarks-MultiWOZ 2.2, WoZ 2.0, and M2M-demonstrate the remarkable effectiveness of the CoTE. Furthermore, through a meticulous fine-grained analysis, we observe significant benefits of our CoTE on samples characterized by longer dialogue turns, user responses, and reasoning steps.
△ Less
Submitted 9 March, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
Improving Event Definition Following For Zero-Shot Event Detection
Authors:
Zefan Cai,
Po-Nien Kung,
Ashima Suvarna,
Mingyu Derek Ma,
Hritik Bansal,
Baobao Chang,
P. Jeffrey Brantingham,
Wei Wang,
Nanyun Peng
Abstract:
Existing approaches on zero-shot event detection usually train models on datasets annotated with known event types, and prompt them with unseen event definitions. These approaches yield sporadic successes, yet generally fall short of expectations. In this work, we aim to improve zero-shot event detection by training models to better follow event definitions. We hypothesize that a diverse set of ev…
▽ More
Existing approaches on zero-shot event detection usually train models on datasets annotated with known event types, and prompt them with unseen event definitions. These approaches yield sporadic successes, yet generally fall short of expectations. In this work, we aim to improve zero-shot event detection by training models to better follow event definitions. We hypothesize that a diverse set of event types and definitions are the key for models to learn to follow event definitions while existing event extraction datasets focus on annotating many high-quality examples for a few event types. To verify our hypothesis, we construct an automatically generated Diverse Event Definition (DivED) dataset and conduct comparative studies. Our experiments reveal that a large number of event types (200) and diverse event definitions can significantly boost event extraction performance; on the other hand, the performance does not scale with over ten examples per event type. Beyond scaling, we incorporate event ontology information and hard-negative samples during training, further boosting the performance. Based on these findings, we fine-tuned a LLaMA-2-7B model on our DivED dataset, yielding performance that surpasses SOTA large language models like GPT-3.5 across three open benchmarks on zero-shot event detection.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation
Authors:
Xueqing Wu,
Rui Zheng,
Jingzhen Sha,
Te-Lin Wu,
Hanyu Zhou,
Mohan Tang,
Kai-Wei Chang,
Nanyun Peng,
Haoran Huang
Abstract:
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights to comprehensively answer a given user query for tabular data. In this work, we aim to propose new resources and benchmarks to inspire future research on this crucial yet challenging and under-explored task. However, collecting data analysis annotations curated by experts can be prohibitively expensi…
▽ More
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights to comprehensively answer a given user query for tabular data. In this work, we aim to propose new resources and benchmarks to inspire future research on this crucial yet challenging and under-explored task. However, collecting data analysis annotations curated by experts can be prohibitively expensive. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs with a multi-turn prompting technique. We construct the DACO dataset, containing (1) 440 databases (of tabular data) collected from real-world scenarios, (2) ~2k query-answer pairs that can serve as weak supervision for model training, and (3) a concentrated but high-quality test set with human refined annotations that serves as our main evaluation benchmark. We train a 6B supervised fine-tuning (SFT) model on DACO dataset, and find that the SFT model learns reasonable data analysis capabilities. To further align the models with human preference, we use reinforcement learning to encourage generating analysis perceived by human as helpful, and design a set of dense rewards to propagate the sparse human preference reward to intermediate code generation steps. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases, validating the effectiveness of our proposed algorithm. Data and code are released at https://github.com/shirley-wu/daco
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
On Prompt-Driven Safeguarding for Large Language Models
Authors:
Chujie Zheng,
Fan Yin,
Hao Zhou,
Fandong Meng,
Jie Zhou,
Kai-Wei Chang,
Minlie Huang,
Nanyun Peng
Abstract:
Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs) from complying with queries that contain harmful intents. However, the working mechanisms of safety prompts have not been revealed yet, which hinders the potential for automatically optimizing them to improve LLM safety. To this end, we investigate the impact of safety prompts from the pe…
▽ More
Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs) from complying with queries that contain harmful intents. However, the working mechanisms of safety prompts have not been revealed yet, which hinders the potential for automatically optimizing them to improve LLM safety. To this end, we investigate the impact of safety prompts from the perspective of model representations. We find that in models' representation space, harmful and harmless queries can be largely distinguished, but this is not noticeably enhanced by safety prompts. Instead, the queries' representations are moved by safety prompts in similar directions where models become more prone to refusal (i.e., refusing to provide assistance) even when the queries are harmless. Inspired by these findings, we propose a method called DRO (Directed Representation Optimization) for automatic safety prompt optimization. It treats safety prompts as continuous, trainable embeddings and learns to move the representations of harmful/harmless queries along/opposite the direction in which the model's refusal probability increases. Experiments with eight LLMs on out-of-domain benchmarks demonstrate that DRO remarkably improves the safeguarding performance of human-crafted safety prompts and outperforms strong baselines, without compromising the general model capability.
△ Less
Submitted 4 March, 2024; v1 submitted 31 January, 2024;
originally announced January 2024.
-
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Authors:
Rohan Wadhawan,
Hritik Bansal,
Kai-Wei Chang,
Nanyun Peng
Abstract:
Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visua…
▽ More
Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design. https://con-textual.github.io/
△ Less
Submitted 24 January, 2024;
originally announced January 2024.
-
DeepEdit: Knowledge Editing as Decoding with Constraints
Authors:
Yiwei Wang,
Muhao Chen,
Nanyun Peng,
Kai-Wei Chang
Abstract:
We propose a new perspective of knowledge editing (KE) for large language models (LLMs) that treats it as a constrained decoding problem. We design decoding constraints to regulate LLMs, ensuring coherence between reasoning steps when incorporating new knowledge. To enforce these constraints, we utilize a depth-first search to adaptively substitute new knowledge for the LLMs' original reasoning st…
▽ More
We propose a new perspective of knowledge editing (KE) for large language models (LLMs) that treats it as a constrained decoding problem. We design decoding constraints to regulate LLMs, ensuring coherence between reasoning steps when incorporating new knowledge. To enforce these constraints, we utilize a depth-first search to adaptively substitute new knowledge for the LLMs' original reasoning steps, greedily seeking the optimal path of multi-hop reasoning with new knowledge. From this vantage, we propose DEEPEDIT: Depth-first Search-based Decoding for Knowledge Editing. DEEPEDIT improves the KE of LLMs by enhancing the conciseness, coherence, pertinence, and receptiveness of reasoning with new knowledge. DEEPEDIT is flexibly applicable to any black-box LLM without requiring access to model parameters or token-wise distributions. In addition to DEEPEDIT, we propose two new KE benchmarks: MQuAKE-2002 and MQuAKE-hard, which are designed to provide more precise and challenging assessments of KE approaches. Qualitatively, DEEPEDIT enables LLMs to produce more succinct reasoning outputs in accordance with new knowledge. Quantitatively, it yields significant improvements on multiple KE benchmarks.
△ Less
Submitted 1 April, 2024; v1 submitted 18 January, 2024;
originally announced January 2024.
-
Model Editing Can Hurt General Abilities of Large Language Models
Authors:
Jia-Chen Gu,
Hao-Xiang Xu,
Jun-Yu Ma,
Pan Lu,
Zhen-Hua Ling,
Kai-Wei Chang,
Nanyun Peng
Abstract:
One critical challenge that has emerged is the presence of hallucinations in the output of large language models (LLMs) due to false or outdated knowledge. Since retraining LLMs with updated information is resource-intensive, there has been a growing interest in model editing. However, current model editing methods, while effective in improving editing performance in various scenarios, often overl…
▽ More
One critical challenge that has emerged is the presence of hallucinations in the output of large language models (LLMs) due to false or outdated knowledge. Since retraining LLMs with updated information is resource-intensive, there has been a growing interest in model editing. However, current model editing methods, while effective in improving editing performance in various scenarios, often overlook potential side effects on the general abilities of LLMs. In this paper, we raise concerns that model editing inherently improves the factuality of the model, but may come at the cost of a significant degradation of these general abilities. Systematically, we analyze side effects by evaluating four popular editing methods on three LLMs across eight representative task categories. Extensive empirical research reveals that current model editing methods are difficult to couple well with LLMs to simultaneously improve the factuality and maintain the general abilities such as reasoning, question answering, etc. Strikingly, the use of a specific method to edit LLaMA-1 (7B) resulted in a drastic performance degradation to nearly 0 on all selected tasks with just a single edit. Therefore, we advocate for more research efforts to minimize the loss of general abilities acquired during LLM pre-training and to ultimately preserve them during model editing.
△ Less
Submitted 4 February, 2024; v1 submitted 9 January, 2024;
originally announced January 2024.
-
New Job, New Gender? Measuring the Social Bias in Image Generation Models
Authors:
Wenxuan Wang,
Haonan Bai,
Jen-tse Huang,
Yuxuan Wan,
Youliang Yuan,
Haoyi Qiu,
Nanyun Peng,
Michael R. Lyu
Abstract:
Image generation models can generate or edit images from a given text. Recent advancements in image generation technology, exemplified by DALL-E and Midjourney, have been groundbreaking. These advanced models, despite their impressive capabilities, are often trained on massive Internet datasets, making them susceptible to generating content that perpetuates social stereotypes and biases, which can…
▽ More
Image generation models can generate or edit images from a given text. Recent advancements in image generation technology, exemplified by DALL-E and Midjourney, have been groundbreaking. These advanced models, despite their impressive capabilities, are often trained on massive Internet datasets, making them susceptible to generating content that perpetuates social stereotypes and biases, which can lead to severe consequences. Prior research on assessing bias within image generation models suffers from several shortcomings, including limited accuracy, reliance on extensive human labor, and lack of comprehensive analysis. In this paper, we propose BiasPainter, a novel metamorphic testing framework that can accurately, automatically and comprehensively trigger social bias in image generation models. BiasPainter uses a diverse range of seed images of individuals and prompts the image generation models to edit these images using gender, race, and age-neutral queries. These queries span 62 professions, 39 activities, 57 types of objects, and 70 personality traits. The framework then compares the edited images to the original seed images, focusing on any changes related to gender, race, and age. BiasPainter adopts a testing oracle that these characteristics should not be modified when subjected to neutral prompts. Built upon this design, BiasPainter can trigger the social bias and evaluate the fairness of image generation models. To evaluate the effectiveness of BiasPainter, we use BiasPainter to test five widely-used commercial image generation software and models, such as stable diffusion and Midjourney. Experimental results show that 100\% of the generated test cases can successfully trigger social bias in image generation models.
△ Less
Submitted 1 January, 2024;
originally announced January 2024.
-
Tracking the Newsworthiness of Public Documents
Authors:
Alexander Spangher,
Emilio Ferrara,
Ben Welsh,
Nanyun Peng,
Serdar Tumgoren,
Jonathan May
Abstract:
Journalists must find stories in huge amounts of textual data (e.g. leaks, bills, press releases) as part of their jobs: determining when and why text becomes news can help us understand coverage patterns and help us build assistive tools. Yet, this is challenging because very few labelled links exist, language use between corpora is very different, and text may be covered for a variety of reasons…
▽ More
Journalists must find stories in huge amounts of textual data (e.g. leaks, bills, press releases) as part of their jobs: determining when and why text becomes news can help us understand coverage patterns and help us build assistive tools. Yet, this is challenging because very few labelled links exist, language use between corpora is very different, and text may be covered for a variety of reasons. In this work we focus on news coverage of local public policy in the San Francisco Bay Area by the San Francisco Chronicle. First, we gather news articles, public policy documents and meeting recordings and link them using probabilistic relational modeling, which we show is a low-annotation linking methodology that outperforms other retrieval-based baselines. Second, we define a new task: newsworthiness prediction, to predict if a policy item will get covered. We show that different aspects of public policy discussion yield different newsworthiness signals. Finally we perform human evaluation with expert journalists and show our systems identify policies they consider newsworthy with 68% F1 and our coverage recommendations are helpful with an 84% win-rate.
△ Less
Submitted 16 November, 2023;
originally announced November 2023.
-
MacGyver: Are Large Language Models Creative Problem Solvers?
Authors:
Yufei Tian,
Abhilasha Ravichander,
Lianhui Qin,
Ronan Le Bras,
Raja Marjieh,
Nanyun Peng,
Yejin Choi,
Thomas L. Griffiths,
Faeze Brahman
Abstract:
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their…
▽ More
We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically-infeasible actions. Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking.
This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.
△ Less
Submitted 27 March, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction
Authors:
Kuan-Hao Huang,
I-Hung Hsu,
Tanmay Parekh,
Zhiyu Xie,
Zixuan Zhang,
Premkumar Natarajan,
Kai-Wei Chang,
Nanyun Peng,
Heng Ji
Abstract:
Event extraction has gained considerable interest due to its wide-ranging applications. However, recent studies draw attention to evaluation issues, suggesting that reported scores may not accurately reflect the true performance. In this work, we identify and address evaluation challenges, including inconsistency due to varying data assumptions or preprocessing steps, the insufficiency of current…
▽ More
Event extraction has gained considerable interest due to its wide-ranging applications. However, recent studies draw attention to evaluation issues, suggesting that reported scores may not accurately reflect the true performance. In this work, we identify and address evaluation challenges, including inconsistency due to varying data assumptions or preprocessing steps, the insufficiency of current evaluation frameworks that may introduce dataset or data split bias, and the low reproducibility of some previous approaches. To address these challenges, we present TextEE, a standardized, fair, and reproducible benchmark for event extraction. TextEE comprises standardized data preprocessing scripts and splits for 14 datasets spanning seven diverse domains and includes 14 recent methodologies, conducting a comprehensive benchmark reevaluation. We also evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance. Inspired by our reevaluation results and findings, we discuss the role of event extraction in the current NLP era, as well as future challenges and insights derived from TextEE. We believe TextEE, the first standardized comprehensive benchmarking tool, will significantly facilitate future event extraction research.
△ Less
Submitted 16 February, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation
Authors:
Haoyi Qiu,
Kung-Hsiang Huang,
Jingnong Qu,
Nanyun Peng
Abstract:
Ensuring factual consistency is crucial for natural language generation tasks, particularly in abstractive summarization, where preserving the integrity of information is paramount. Prior works on evaluating factual consistency of summarization often take the entailment-based approaches that first generate perturbed (factual inconsistent) summaries and then train a classifier on the generated data…
▽ More
Ensuring factual consistency is crucial for natural language generation tasks, particularly in abstractive summarization, where preserving the integrity of information is paramount. Prior works on evaluating factual consistency of summarization often take the entailment-based approaches that first generate perturbed (factual inconsistent) summaries and then train a classifier on the generated data to detect the factually inconsistencies during testing time. However, previous approaches generating perturbed summaries are either of low coherence or lack error-type coverage. To address these issues, we propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs). Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage. Additionally, we present a data selection module NegFilter based on natural language inference and BARTScore to ensure the quality of the generated negative samples. Experimental results demonstrate our approach significantly outperforms previous systems on the AggreFact-SOTA benchmark, showcasing its efficacy in evaluating factuality of abstractive summarization.
△ Less
Submitted 2 April, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Nonlinear Multi-objective Reinforcement Learning with Provable Guarantees
Authors:
Nianli Peng,
Brandon Fain
Abstract:
We describe RA-E3 (Reward-Aware Explicit Explore or Exploit), an algorithm with provable guarantees for solving a single or multi-objective Markov Decision Process (MDP) where we want to maximize the expected value of a nonlinear function over accumulated rewards. This allows us to model fairness-aware welfare optimization for multi-objective reinforcement learning as well as risk-aware reinforcem…
▽ More
We describe RA-E3 (Reward-Aware Explicit Explore or Exploit), an algorithm with provable guarantees for solving a single or multi-objective Markov Decision Process (MDP) where we want to maximize the expected value of a nonlinear function over accumulated rewards. This allows us to model fairness-aware welfare optimization for multi-objective reinforcement learning as well as risk-aware reinforcement learning with nonlinear Von Neumann-Morgenstern utility functions in the single objective setting. RA-E3 extends the classic E3 algorithm that solves MDPs with scalar rewards and linear preferences. We first state a distinct reward-aware version of value iteration that calculates a non-stationary policy that is approximately optimal for a given model of the environment. This sub-procedure is based on an extended form of Bellman optimality for nonlinear optimization that explicitly considers time and current accumulated reward. We then describe how to use this optimization procedure in a larger algorithm that must simultaneously learn a model of the environment. The algorithm learns an approximately optimal policy in time that depends polynomially on the MDP size, desired approximation, and smoothness of the nonlinear function, and exponentially on the number of objectives.
△ Less
Submitted 14 December, 2023; v1 submitted 4 November, 2023;
originally announced November 2023.
-
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos
Authors:
Te-Lin Wu,
Zi-Yi Dou,
Qingyuan Hu,
Yu Hou,
Nischal Reddy Chandra,
Marjorie Freedman,
Ralph M. Weischedel,
Nanyun Peng
Abstract:
Multimodal counterfactual reasoning is a vital yet challenging ability for AI systems. It involves predicting the outcomes of hypothetical circumstances based on vision and language inputs, which enables AI models to learn from failures and explore hypothetical scenarios. Despite its importance, there are only a few datasets targeting the counterfactual reasoning abilities of multimodal models. Am…
▽ More
Multimodal counterfactual reasoning is a vital yet challenging ability for AI systems. It involves predicting the outcomes of hypothetical circumstances based on vision and language inputs, which enables AI models to learn from failures and explore hypothetical scenarios. Despite its importance, there are only a few datasets targeting the counterfactual reasoning abilities of multimodal models. Among them, they only cover reasoning over synthetic environments or specific types of events (e.g. traffic collisions), making them hard to reliably benchmark the model generalization ability in diverse real-world scenarios and reasoning dimensions. To overcome these limitations, we develop a video question answering dataset, ACQUIRED: it consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints, which ensures a focus on real-world diversity. In addition, each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal, which can comprehensively evaluate the model counterfactual abilities along multiple aspects. We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap (>13%) between models and humans. The findings suggest that multimodal counterfactual reasoning remains an open challenge and ACQUIRED is a comprehensive and reliable benchmark for inspiring future research in this direction.
△ Less
Submitted 2 November, 2023;
originally announced November 2023.
-
Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks
Authors:
Po-Nien Kung,
Fan Yin,
Di Wu,
Kai-Wei Chang,
Nanyun Peng
Abstract:
Instruction tuning (IT) achieves impressive zero-shot generalization results by training large language models (LLMs) on a massive amount of diverse tasks with instructions. However, how to select new tasks to improve the performance and generalizability of IT models remains an open question. Training on all existing tasks is impractical due to prohibiting computation requirements, and randomly se…
▽ More
Instruction tuning (IT) achieves impressive zero-shot generalization results by training large language models (LLMs) on a massive amount of diverse tasks with instructions. However, how to select new tasks to improve the performance and generalizability of IT models remains an open question. Training on all existing tasks is impractical due to prohibiting computation requirements, and randomly selecting tasks can lead to suboptimal performance. In this work, we propose active instruction tuning based on prompt uncertainty, a novel framework to identify informative tasks, and then actively tune the models on the selected tasks. We represent the informativeness of new tasks with the disagreement of the current model outputs over perturbed prompts. Our experiments on NIV2 and Self-Instruct datasets demonstrate that our method consistently outperforms other baseline strategies for task selection, achieving better out-of-distribution generalization with fewer training tasks. Additionally, we introduce a task map that categorizes and diagnoses tasks based on prompt uncertainty and prediction probability. We discover that training on ambiguous (prompt-uncertain) tasks improves generalization while training on difficult (prompt-certain and low-probability) tasks offers no benefit, underscoring the importance of task selection for instruction tuning.
△ Less
Submitted 1 November, 2023;
originally announced November 2023.
-
BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation
Authors:
Yufei Tian,
Felix Zhang,
Nanyun Peng
Abstract:
Large language models (LLMs) such as GPT-3 have demonstrated a strong capability to generate coherent and contextually relevant text. However, amidst their successes, a crucial issue persists: their generated outputs still lack commonsense at times. Moreover, fine-tuning the entire LLM towards more commonsensical outputs is computationally expensive if not infeasible. In this paper, we present a c…
▽ More
Large language models (LLMs) such as GPT-3 have demonstrated a strong capability to generate coherent and contextually relevant text. However, amidst their successes, a crucial issue persists: their generated outputs still lack commonsense at times. Moreover, fine-tuning the entire LLM towards more commonsensical outputs is computationally expensive if not infeasible. In this paper, we present a computation-efficient framework that steers a frozen Pre-Trained Language Model (PTLM) towards more commonsensical generation (i.e., producing a plausible output that incorporates a list of concepts in a meaningful way). Specifically, we first construct a reference-free evaluator that assigns a sentence with a commonsensical score by grounding the sentence to a dynamic commonsense knowledge base from four different relational aspects. We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head that guides a fixed PTLM to better satisfy the oracle. We test our framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two constrained concept-to-sentence benchmarks. Human evaluation results demonstrate that our method consistently leads to the most commonsensical outputs.
△ Less
Submitted 25 October, 2023;
originally announced October 2023.
-
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
Authors:
Te-Lin Wu,
Yu Zhou,
Nanyun Peng
Abstract:
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually. One important step towards this goal is to localize and track key active objects that undergo major state change as a consequence of human actions/interactions to the environment without being told exactly what/where to ground (e.g., localizing and track…
▽ More
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually. One important step towards this goal is to localize and track key active objects that undergo major state change as a consequence of human actions/interactions to the environment without being told exactly what/where to ground (e.g., localizing and tracking the `sponge` in video from the instruction "Dip the `sponge` into the bucket."). While existing works approach this problem from a pure vision perspective, we investigate to which extent the textual modality (i.e., task instructions) and their interaction with visual modality can be beneficial. Specifically, we propose to improve phrase grounding models' ability on localizing the active objects by: (1) learning the role of `objects undergoing change` and extracting them accurately from the instructions, (2) leveraging pre- and post-conditions of the objects during actions, and (3) recognizing the objects more robustly with descriptional knowledge. We leverage large language models (LLMs) to extract the aforementioned action-object knowledge, and design a per-object aggregation masking technique to effectively perform joint inference on object phrases and symbolic knowledge. We evaluate our framework on Ego4D and Epic-Kitchens datasets. Extensive experiments demonstrate the effectiveness of our proposed framework, which leads to>54% improvements in all standard metrics on the TREK-150-OPE-Det localization + tracking task, >7% improvements in all standard metrics on the TREK-150-OPE tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD task.
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
Evaluating Large Language Models on Controlled Generation Tasks
Authors:
Jiao Sun,
Yufei Tian,
Wangchunshu Zhou,
Nan Xu,
Qian Hu,
Rahul Gupta,
John Frederick Wieting,
Nanyun Peng,
Xuezhe Ma
Abstract:
While recent studies have looked into the abilities of large language models in various benchmark tasks, including question generation, reading comprehension, multilingual and etc, there have been few studies looking into the controllability of large language models on generation tasks. We present an extensive analysis of various benchmarks including a sentence planning benchmark with different gr…
▽ More
While recent studies have looked into the abilities of large language models in various benchmark tasks, including question generation, reading comprehension, multilingual and etc, there have been few studies looking into the controllability of large language models on generation tasks. We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities. After comparing large language models against state-of-the-start finetuned smaller models, we present a spectrum showing large language models falling behind, are comparable, or exceed the ability of smaller models. We conclude that **large language models struggle at meeting fine-grained hard constraints**.
△ Less
Submitted 22 October, 2023;
originally announced October 2023.
-
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters
Authors:
Yixin Wan,
George Pu,
Jiao Sun,
Aparna Garimella,
Kai-Wei Chang,
Nanyun Peng
Abstract:
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content, including professional documents such as recommendation letters. Though bringing convenience, this application also introduces unprecedented fairness concerns. Model-generated reference letters might be directly used by users in professional scenarios. If underlying bi…
▽ More
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content, including professional documents such as recommendation letters. Though bringing convenience, this application also introduces unprecedented fairness concerns. Model-generated reference letters might be directly used by users in professional scenarios. If underlying biases exist in these model-constructed letters, using them without scrutinization could lead to direct societal harms, such as sabotaging application success rates for female applicants. In light of this pressing issue, it is imminent and necessary to comprehensively study fairness issues and associated harms in this real-world use case. In this paper, we critically examine gender biases in LLM-generated reference letters. Drawing inspiration from social science findings, we design evaluation methods to manifest biases through 2 dimensions: (1) biases in language style and (2) biases in lexical content. We further investigate the extent of bias propagation by analyzing the hallucination bias of models, a term that we define to be bias exacerbation in model-hallucinated contents. Through benchmarking evaluation on 2 popular LLMs- ChatGPT and Alpaca, we reveal significant gender biases in LLM-generated recommendation letters. Our findings not only warn against using LLMs for this application without scrutinization, but also illuminate the importance of thoroughly studying hidden biases and harms in LLM-generated professional documents.
△ Less
Submitted 1 December, 2023; v1 submitted 13 October, 2023;
originally announced October 2023.
-
Mitigating Bias for Question Answering Models by Tracking Bias Influence
Authors:
Mingyu Derek Ma,
Jiun-Yu Kao,
Arpit Gupta,
Yu-Hsiang Lin,
Wenbo Zhao,
Tagyoung Chung,
Wei Wang,
Kai-Wei Chang,
Nanyun Peng
Abstract:
Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in the question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation technique for the QA models is still under-explored. In this work, we propose BMBI, an approach to mitigate the bi…
▽ More
Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in the question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation technique for the QA models is still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance is more biased, we derive that the query instance is biased. We then use the bias level detected as an optimization objective to form a multi-task learning setting in addition to the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method could be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.
△ Less
Submitted 12 October, 2023;
originally announced October 2023.
-
Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems
Authors:
Yixin Wan,
Jieyu Zhao,
Aman Chadha,
Nanyun Peng,
Kai-Wei Chang
Abstract:
Recent advancements in Large Language Models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. We define generic personas to represent demographic groups, such as "an Asian person", whereas specific personas may take the form of specific popular Asian names like "Yumi". While the adoption of personas enriches user experienc…
▽ More
Recent advancements in Large Language Models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. We define generic personas to represent demographic groups, such as "an Asian person", whereas specific personas may take the form of specific popular Asian names like "Yumi". While the adoption of personas enriches user experiences by making dialogue systems more engaging and approachable, it also casts a shadow of potential risk by exacerbating social biases within model responses, thereby causing societal harm through interactions with users. In this paper, we systematically study "persona biases", which we define to be the sensitivity of dialogue models' harmful behaviors contingent upon the personas they adopt. We categorize persona biases into biases in harmful expression and harmful agreement, and establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement. Additionally, we propose to investigate persona biases by experimenting with UNIVERSALPERSONA, a systematically constructed persona dataset encompassing various types of both generic and specific model personas. Through benchmarking on four different models -- including Blender, ChatGPT, Alpaca, and Vicuna -- our study uncovers significant persona biases in dialogue systems. Our findings also underscore the pressing need to revisit the use of personas in dialogue agents to ensure safe application.
△ Less
Submitted 2 November, 2023; v1 submitted 8 October, 2023;
originally announced October 2023.
-
MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways
Authors:
Mingyu Derek Ma,
Alexander K. Taylor,
Nuan Wen,
Yanchen Liu,
Po-Nien Kung,
Wenna Qin,
Shicheng Wen,
Azure Zhou,
Diyi Yang,
Xuezhe Ma,
Nanyun Peng,
Wei Wang
Abstract:
We present MIDDAG, an intuitive, interactive system that visualizes the information propagation paths on social media triggered by COVID-19-related news articles accompanied by comprehensive insights, including user/community susceptibility level, as well as events and popular opinions raised by the crowd while propagating the information. Besides discovering information flow patterns among users,…
▽ More
We present MIDDAG, an intuitive, interactive system that visualizes the information propagation paths on social media triggered by COVID-19-related news articles accompanied by comprehensive insights, including user/community susceptibility level, as well as events and popular opinions raised by the crowd while propagating the information. Besides discovering information flow patterns among users, we construct communities among users and develop the propagation forecasting capability, enabling tracing and understanding of how information is disseminated at a higher level.
△ Less
Submitted 20 February, 2024; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Contextual Label Projection for Cross-Lingual Structured Prediction
Authors:
Tanmay Parekh,
I-Hung Hsu,
Kuan-Hao Huang,
Kai-Wei Chang,
Nanyun Peng
Abstract:
Label projection, which involves obtaining translated labels and texts jointly, is essential for leveraging machine translation to facilitate cross-lingual transfer in structured prediction tasks. Prior research exploring label projection often compromise translation accuracy by favoring simplified label translation or relying solely on word-level alignments. In this paper, we introduce a novel la…
▽ More
Label projection, which involves obtaining translated labels and texts jointly, is essential for leveraging machine translation to facilitate cross-lingual transfer in structured prediction tasks. Prior research exploring label projection often compromise translation accuracy by favoring simplified label translation or relying solely on word-level alignments. In this paper, we introduce a novel label projection approach, CLaP, which translates text to the target language and performs contextual translation on the labels using the translated text as the context, ensuring better accuracy for the translated labels. We leverage instruction-tuned language models with multilingual capabilities as our contextual translator, imposing the constraint of the presence of translated labels in the translated text via instructions. We benchmark CLaP with other label projection techniques on zero-shot cross-lingual transfer across 39 languages on two representative structured prediction tasks - event argument extraction (EAE) and named entity recognition (NER), showing over 2.4 F1 improvement for EAE and 1.4 F1 improvement for NER. We further explore the applicability of CLaP on ten extremely low-resource languages to showcase its potential for cross-lingual structured prediction.
△ Less
Submitted 14 April, 2024; v1 submitted 16 September, 2023;
originally announced September 2023.
-
RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment
Authors:
Kevin Yang,
Dan Klein,
Asli Celikyilmaz,
Nanyun Peng,
Yuandong Tian
Abstract:
We propose Reinforcement Learning from Contrastive Distillation (RLCD), a method for aligning language models to follow principles expressed in natural language (e.g., to be more harmless) without using human feedback. RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt d…
▽ More
We propose Reinforcement Learning from Contrastive Distillation (RLCD), a method for aligning language models to follow principles expressed in natural language (e.g., to be more harmless) without using human feedback. RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them. Using two different prompts causes model outputs to be more differentiated on average, resulting in cleaner preference labels in the absence of human annotations. We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning. Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks--harmlessness, helpfulness, and story outline generation--and when using both 7B and 30B model scales for simulating preference data.
△ Less
Submitted 16 March, 2024; v1 submitted 24 July, 2023;
originally announced July 2023.
-
Next Steps for Human-Centered Generative AI: A Technical Perspective
Authors:
Xiang 'Anthony' Chen,
Jeff Burke,
Ruofei Du,
Matthew K. Hong,
Jennifer Jacobs,
Philippe Laban,
Dingzeyu Li,
Nanyun Peng,
Karl D. D. Willis,
Chien-Sheng Wu,
Bolei Zhou
Abstract:
Through iterative, cross-disciplinary discussions, we define and propose next-steps for Human-centered Generative AI (HGAI). We contribute a comprehensive research agenda that lays out future directions of Generative AI spanning three levels: aligning with human values; assimilating human intents; and augmenting human abilities. By identifying these next-steps, we intend to draw interdisciplinary…
▽ More
Through iterative, cross-disciplinary discussions, we define and propose next-steps for Human-centered Generative AI (HGAI). We contribute a comprehensive research agenda that lays out future directions of Generative AI spanning three levels: aligning with human values; assimilating human intents; and augmenting human abilities. By identifying these next-steps, we intend to draw interdisciplinary research teams to pursue a coherent set of emergent ideas in HGAI, focusing on their interested topics while maintaining a coherent big picture of the future work landscape.
△ Less
Submitted 22 December, 2023; v1 submitted 27 June, 2023;
originally announced June 2023.
-
DesCo: Learning Object Recognition with Rich Language Descriptions
Authors:
Liunian Harold Li,
Zi-Yi Dou,
Nanyun Peng,
Kai-Wei Chang
Abstract:
Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g. "a photo of a cat") and improve the models' adaptability to identify novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions th…
▽ More
Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g. "a photo of a cat") and improve the models' adaptability to identify novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions that include specifications of fine-grained semantic details, such as attributes, shapes, textures, and relations. However, simply incorporating language descriptions as queries does not guarantee accurate interpretation by the models. In fact, our experiments show that GLIP, the state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle the challenges, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions consisting of two major innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects based on object names and the raw image-text caption; 2) we design context-sensitive queries to improve the model's ability in deciphering intricate nuances embedded within descriptions and enforce the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OminiLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
△ Less
Submitted 24 June, 2023;
originally announced June 2023.
-
Open-Domain Text Evaluation via Contrastive Distribution Modeling
Authors:
Sidi Lu,
Hongyi Liu,
Asli Celikyilmaz,
Tianlu Wang,
Nanyun Peng
Abstract:
Recent advancements in open-domain text generation, driven by the power of large pre-trained language models (LLMs), have demonstrated remarkable performance. However, assessing these models' generation quality remains a challenge. In this paper, we introduce a novel method for evaluating open-domain text generation called Contrastive Distribution Methods (CDM). Leveraging the connection between i…
▽ More
Recent advancements in open-domain text generation, driven by the power of large pre-trained language models (LLMs), have demonstrated remarkable performance. However, assessing these models' generation quality remains a challenge. In this paper, we introduce a novel method for evaluating open-domain text generation called Contrastive Distribution Methods (CDM). Leveraging the connection between increasing model parameters and enhanced LLM performance, CDM creates a mapping from the _contrast_ of two probabilistic distributions -- one known to be superior to the other -- to quality measures. We investigate CDM for open-domain text generation evaluation under two paradigms: 1) _Generative_ CDM, which harnesses the contrast of two language models' distributions to generate synthetic examples for training discriminator-based metrics; 2) _Discriminative_ CDM, which directly uses distribution disparities between two language models for evaluation. Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate CDM's superior correlate with human judgment than existing automatic evaluation metrics, highlighting the strong performance and generalizability of our approach.
△ Less
Submitted 3 May, 2024; v1 submitted 20 June, 2023;
originally announced June 2023.
-
On Compositionality and Improved Training of NADO
Authors:
Sidi Lu,
Wenbo Zhao,
Chenyang Tao,
Arpit Gupta,
Shanchan Wu,
Tagyoung Chung,
Nanyun Peng
Abstract:
NeurAlly-Decomposed Oracle (NADO) is a powerful approach for controllable generation with large language models. Differentiating from finetuning/prompt tuning, it has the potential to avoid catastrophic forgetting of the large base model and achieve guaranteed convergence to an entropy-maximized closed-form solution without significantly limiting the model capacity. Despite its success, several ch…
▽ More
NeurAlly-Decomposed Oracle (NADO) is a powerful approach for controllable generation with large language models. Differentiating from finetuning/prompt tuning, it has the potential to avoid catastrophic forgetting of the large base model and achieve guaranteed convergence to an entropy-maximized closed-form solution without significantly limiting the model capacity. Despite its success, several challenges arise when applying NADO to more complex scenarios. First, the best practice of using NADO for the composition of multiple control signals is under-explored. Second, vanilla NADO suffers from gradient vanishing for low-probability control signals and is highly reliant on the forward-consistency regularization. In this paper, we study the aforementioned challenges when using NADO theoretically and empirically. We show we can achieve guaranteed compositional generalization of NADO with a certain practice, and propose a novel alternative parameterization of NADO to perfectly guarantee the forward-consistency. We evaluate the improved training of NADO, i.e. NADO++, on CommonGen. Results show that NADO++ improves the effectiveness of the algorithm in multiple aspects.
△ Less
Submitted 20 June, 2023;
originally announced June 2023.
-
Unsupervised Melody-to-Lyric Generation
Authors:
Yufei Tian,
Anjali Narayan-Chen,
Shereen Oraby,
Alessandra Cervone,
Gunnar Sigurdsson,
Chenyang Tao,
Wenbo Zhao,
Yiwen Chen,
Tagyoung Chung,
Jing Huang,
Nanyun Peng
Abstract:
Automatic melody-to-lyric generation is a task in which song lyrics are generated to go with a given melody. It is of significant practical interest and more challenging than unconstrained lyric generation as the music imposes additional constraints onto the lyrics. The training data is limited as most songs are copyrighted, resulting in models that underfit the complicated cross-modal relationshi…
▽ More
Automatic melody-to-lyric generation is a task in which song lyrics are generated to go with a given melody. It is of significant practical interest and more challenging than unconstrained lyric generation as the music imposes additional constraints onto the lyrics. The training data is limited as most songs are copyrighted, resulting in models that underfit the complicated cross-modal relationship between melody and lyrics. In this work, we propose a method for generating high-quality lyrics without training on any aligned melody-lyric data. Specifically, we design a hierarchical lyric generation framework that first generates a song outline and second the complete lyrics. The framework enables disentanglement of training (based purely on text) from inference (melody-guided text generation) to circumvent the shortage of parallel data.
We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints as guidance during inference. The two-step hierarchical design also enables content control via the lyric outline, a much-desired feature for democratizing collaborative song creation. Experimental results show that our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines, for example SongMASS, a SOTA model trained on a parallel dataset, with a 24% relative overall quality improvement based on human ratings.
△ Less
Submitted 22 December, 2023; v1 submitted 30 May, 2023;
originally announced May 2023.
-
BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation
Authors:
Liyan Kang,
Luyang Huang,
Ningxin Peng,
Peihao Zhu,
Zewei Sun,
Shanbo Cheng,
Mingxuan Wang,
Degen Huang,
Jinsong Su
Abstract:
We present a large-scale video subtitle translation dataset, BigVideo, to facilitate the study of multi-modality machine translation. Compared with the widely used How2 and VaTeX datasets, BigVideo is more than 10 times larger, consisting of 4.5 million sentence pairs and 9,981 hours of videos. We also introduce two deliberately designed test sets to verify the necessity of visual information: Amb…
▽ More
We present a large-scale video subtitle translation dataset, BigVideo, to facilitate the study of multi-modality machine translation. Compared with the widely used How2 and VaTeX datasets, BigVideo is more than 10 times larger, consisting of 4.5 million sentence pairs and 9,981 hours of videos. We also introduce two deliberately designed test sets to verify the necessity of visual information: Ambiguous with the presence of ambiguous words, and Unambiguous in which the text context is self-contained for translation. To better model the common semantics shared across texts and videos, we introduce a contrastive learning method in the cross-modal encoder. Extensive experiments on the BigVideo show that: a) Visual information consistently improves the NMT model in terms of BLEU, BLEURT, and COMET on both Ambiguous and Unambiguous test sets. b) Visual information helps disambiguation, compared to the strong text baseline on terminology-targeted scores and human evaluation. Dataset and our implementations are available at https://github.com/DeepLearnXMU/BigVideo-VMT.
△ Less
Submitted 3 July, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
AMPERE: AMR-Aware Prefix for Generation-Based Event Argument Extraction Model
Authors:
I-Hung Hsu,
Zhiyu Xie,
Kuan-Hao Huang,
Prem Natarajan,
Nanyun Peng
Abstract:
Event argument extraction (EAE) identifies event arguments and their specific roles for a given event. Recent advancement in generation-based EAE models has shown great performance and generalizability over classification-based models. However, existing generation-based EAE models mostly focus on problem re-formulation and prompt design, without incorporating additional information that has been s…
▽ More
Event argument extraction (EAE) identifies event arguments and their specific roles for a given event. Recent advancement in generation-based EAE models has shown great performance and generalizability over classification-based models. However, existing generation-based EAE models mostly focus on problem re-formulation and prompt design, without incorporating additional information that has been shown to be effective for classification-based models, such as the abstract meaning representation (AMR) of the input passages. Incorporating such information into generation-based models is challenging due to the heterogeneous nature of the natural language form prevalently used in generation-based models and the structured form of AMRs. In this work, we study strategies to incorporate AMR into generation-based EAE models. We propose AMPERE, which generates AMR-aware prefixes for every layer of the generation model. Thus, the prefix introduces AMR information to the generation-based EAE model and then improves the generation. We also introduce an adjusted copy mechanism to AMPERE to help overcome potential noises brought by the AMR graph. Comprehensive experiments and analyses on ACE2005 and ERE datasets show that AMPERE can get 4% - 10% absolute F1 score improvements with reduced training data and it is in general powerful across different training sizes.
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
Code-Switched Text Synthesis in Unseen Language Pairs
Authors:
I-Hung Hsu,
Avik Ray,
Shubham Garg,
Nanyun Peng,
Jing Huang
Abstract:
Existing efforts on text synthesis for code-switching mostly require training on code-switched texts in the target language pairs, limiting the deployment of the models to cases lacking code-switched data. In this work, we study the problem of synthesizing code-switched texts for language pairs absent from the training data. We introduce GLOSS, a model built on top of a pre-trained multilingual ma…
▽ More
Existing efforts on text synthesis for code-switching mostly require training on code-switched texts in the target language pairs, limiting the deployment of the models to cases lacking code-switched data. In this work, we study the problem of synthesizing code-switched texts for language pairs absent from the training data. We introduce GLOSS, a model built on top of a pre-trained multilingual machine translation model (PMMTM) with an additional code-switching module. This module, either an adapter or extra prefixes, learns code-switching patterns from code-switched data during training, while the primary component of GLOSS, i.e., the PMMTM, is frozen. The design of only adjusting the code-switching module prevents our model from overfitting to the constrained training data for code-switching. Hence, GLOSS exhibits the ability to generalize and synthesize code-switched texts across a broader spectrum of language pairs. Additionally, we develop a self-training algorithm on target language pairs further to enhance the reliability of GLOSS. Automatic evaluations on four language pairs show that GLOSS achieves at least 55% relative BLEU and METEOR scores improvements compared to strong baselines. Human evaluations on two language pairs further validate the success of GLOSS.
△ Less
Submitted 7 July, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
Are Fairy Tales Fair? Analyzing Gender Bias in Temporal Narrative Event Chains of Children's Fairy Tales
Authors:
Paulina Toro Isaza,
Guangxuan Xu,
Akintoye Oloko,
Yufang Hou,
Nanyun Peng,
Dakuo Wang
Abstract:
Social biases and stereotypes are embedded in our culture in part through their presence in our stories, as evidenced by the rich history of humanities and social science literature analyzing such biases in children stories. Because these analyses are often conducted manually and at a small scale, such investigations can benefit from the use of more recent natural language processing methods that…
▽ More
Social biases and stereotypes are embedded in our culture in part through their presence in our stories, as evidenced by the rich history of humanities and social science literature analyzing such biases in children stories. Because these analyses are often conducted manually and at a small scale, such investigations can benefit from the use of more recent natural language processing methods that examine social bias in models and data corpora. Our work joins this interdisciplinary effort and makes a unique contribution by taking into account the event narrative structures when analyzing the social bias of stories. We propose a computational pipeline that automatically extracts a story's temporal narrative verb-based event chain for each of its characters as well as character attributes such as gender. We also present a verb-based event annotation scheme that can facilitate bias analysis by including categories such as those that align with traditional stereotypes. Through a case study analyzing gender bias in fairy tales, we demonstrate that our framework can reveal bias in not only the unigram verb-based events in which female and male characters participate but also in the temporal narrative order of such event participation.
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models
Authors:
Mingyu Derek Ma,
Xiaoxuan Wang,
Po-Nien Kung,
P. Jeffrey Brantingham,
Nanyun Peng,
Wei Wang
Abstract:
Information extraction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They heavily rely on task-specific training data in the form of (passage, target structure) pairs to obtain reasonable performance. However, obtaining such data through human annotation is costly, leading to a pressing need for low-resource information extracti…
▽ More
Information extraction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They heavily rely on task-specific training data in the form of (passage, target structure) pairs to obtain reasonable performance. However, obtaining such data through human annotation is costly, leading to a pressing need for low-resource information extraction approaches that require minimal human labeling for real-world applications. Fine-tuning supervised models with synthesized training data would be a generalizable method, but the existing data generation methods either still rely on large-scale ground-truth data or cannot be applied to complicated IE tasks due to their poor performance. To address these challenges, we propose STAR, a data generation method that leverages Large Language Models (LLMs) to synthesize data instances given limited seed demonstrations, thereby boosting low-resource information extraction performance. Our approach involves generating target structures (Y) followed by generating passages (X), all accomplished with the aid of LLMs. We design fine-grained step-by-step instructions to obtain the initial data instances. We further reduce errors and improve data quality through self-reflection error identification and self-refinement with iterative revision. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data. Human assessment of the data quality shows STAR-generated data exhibits higher passage quality and better align with the task definitions compared with the human-curated data.
△ Less
Submitted 20 February, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Identifying Informational Sources in News Articles
Authors:
Alexander Spangher,
Nanyun Peng,
Jonathan May,
Emilio Ferrara
Abstract:
News articles are driven by the informational sources journalists use in reporting. Modeling when, how and why sources get used together in stories can help us better understand the information we consume and even help journalists with the task of producing it. In this work, we take steps toward this goal by constructing the largest and widest-ranging annotated dataset, to date, of informational s…
▽ More
News articles are driven by the informational sources journalists use in reporting. Modeling when, how and why sources get used together in stories can help us better understand the information we consume and even help journalists with the task of producing it. In this work, we take steps toward this goal by constructing the largest and widest-ranging annotated dataset, to date, of informational sources used in news writing. We show that our dataset can be used to train high-performing models for information detection and source attribution. We further introduce a novel task, source prediction, to study the compositionality of sources in news articles. We show good performance on this task, which we argue is an important proof for narrative science exploring the internal structure of news articles and aiding in planning-based language generation, and an important step towards a source-recommendation system to aid journalists.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
Gender Biases in Automatic Evaluation Metrics for Image Captioning
Authors:
Haoyi Qiu,
Zi-Yi Dou,
Tianlu Wang,
Asli Celikyilmaz,
Nanyun Peng
Abstract:
Model-based evaluation metrics (e.g., CLIPScore and GPTScore) have demonstrated decent correlations with human judgments in various language generation tasks. However, their impact on fairness remains largely unexplored. It is widely recognized that pretrained models can inadvertently encode societal biases, thus employing these models for evaluation purposes may inadvertently perpetuate and ampli…
▽ More
Model-based evaluation metrics (e.g., CLIPScore and GPTScore) have demonstrated decent correlations with human judgments in various language generation tasks. However, their impact on fairness remains largely unexplored. It is widely recognized that pretrained models can inadvertently encode societal biases, thus employing these models for evaluation purposes may inadvertently perpetuate and amplify biases. For example, an evaluation metric may favor the caption "a woman is calculating an account book" over "a man is calculating an account book," even if the image only shows male accountants. In this paper, we conduct a systematic study of gender biases in model-based automatic evaluation metrics for image captioning tasks. We start by curating a dataset comprising profession, activity, and object concepts associated with stereotypical gender associations. Then, we demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations, as well as the propagation of biases to generation models through reinforcement learning. Finally, we present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments. Our dataset and framework lay the foundation for understanding the potential harm of model-based evaluation metrics, and facilitate future works to develop more inclusive evaluation metrics.
△ Less
Submitted 2 November, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Masked Path Modeling for Vision-and-Language Navigation
Authors:
Zi-Yi Dou,
Feng Gao,
Nanyun Peng
Abstract:
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions. A major challenge in VLN is the limited availability of training data, which hinders the models' ability to generalize effectively. Previous approaches have attempted to address this issue by introducing additional supervision during training, often requiring c…
▽ More
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions. A major challenge in VLN is the limited availability of training data, which hinders the models' ability to generalize effectively. Previous approaches have attempted to address this issue by introducing additional supervision during training, often requiring costly human-annotated data that restricts scalability. In this paper, we introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks. Our proposed method involves allowing the agent to actively explore navigation environments without a specific goal and collect the paths it traverses. Subsequently, we train the agent on this collected data to reconstruct the original path given a randomly masked subpath. This way, the agent can actively accumulate a diverse and substantial amount of data while learning conditional action generation. To evaluate the effectiveness of our technique, we conduct experiments on various VLN datasets and demonstrate the versatility of MPM across different levels of instruction complexity. Our results exhibit significant improvements in success rates, with enhancements of 1.32\%, 1.05\%, and 1.19\% on the val-unseen split of the Room-to-Room, Room-for-Room, and Room-across-Room datasets, respectively. Furthermore, we conduct an analysis that highlights the potential for additional improvements when the agent is allowed to explore unseen environments prior to testing.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning
Authors:
Po-Nien Kung,
Nanyun Peng
Abstract:
Recent works on instruction tuning (IT) have achieved great performance with zero-shot generalizability to unseen tasks. With additional context (e.g., task definition, examples) provided to models for fine-tuning, they achieved much higher performance than untuned models. Despite impressive performance gains, what models learn from IT remains understudied. In this work, we analyze how models util…
▽ More
Recent works on instruction tuning (IT) have achieved great performance with zero-shot generalizability to unseen tasks. With additional context (e.g., task definition, examples) provided to models for fine-tuning, they achieved much higher performance than untuned models. Despite impressive performance gains, what models learn from IT remains understudied. In this work, we analyze how models utilize instructions during IT by comparing model training with altered vs. original instructions. Specifically, we create simplified task definitions by removing all semantic components and only leaving the output space information, and delusive examples that contain incorrect input-output mapping. Our experiments show that models trained on simplified task definition or delusive examples can achieve comparable performance to the ones trained on the original instructions and examples. Furthermore, we introduce a random baseline to perform zeroshot classification tasks, and find it achieves similar performance (42.6% exact-match) as IT does (43% exact-match) in low resource setting, while both methods outperform naive T5 significantly (30% per exact-match). Our analysis provides evidence that the impressive performance gain of current IT models can come from picking up superficial patterns, such as learning the output format and guessing. Our study highlights the urgent need for more reliable IT methods and evaluation.
△ Less
Submitted 25 May, 2023; v1 submitted 18 May, 2023;
originally announced May 2023.
-
ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems
Authors:
Sarik Ghazarian,
Yijia Shao,
Rujun Han,
Aram Galstyan,
Nanyun Peng
Abstract:
Commonsense reasoning is omnipresent in human communications and thus is an important feature for open-domain dialogue systems. However, evaluating commonsense in dialogue systems is still an open challenge. We take the first step by focusing on event commonsense that considers events and their relations, and is crucial in both dialogues and general commonsense reasoning. We propose ACCENT, an eve…
▽ More
Commonsense reasoning is omnipresent in human communications and thus is an important feature for open-domain dialogue systems. However, evaluating commonsense in dialogue systems is still an open challenge. We take the first step by focusing on event commonsense that considers events and their relations, and is crucial in both dialogues and general commonsense reasoning. We propose ACCENT, an event commonsense evaluation metric empowered by commonsense knowledge bases (CSKBs). ACCENT first extracts event-relation tuples from a dialogue, and then evaluates the response by scoring the tuples in terms of their compatibility with the CSKB. To evaluate ACCENT, we construct the first public event commonsense evaluation dataset for open-domain dialogues. Our experiments show that ACCENT is an efficient metric for event commonsense evaluation, which achieves higher correlations with human judgments than existing baselines.
△ Less
Submitted 3 November, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Unsupervised Melody-Guided Lyrics Generation
Authors:
Yufei Tian,
Anjali Narayan-Chen,
Shereen Oraby,
Alessandra Cervone,
Gunnar Sigurdsson,
Chenyang Tao,
Wenbo Zhao,
Tagyoung Chung,
Jing Huang,
Nanyun Peng
Abstract:
Automatic song writing is a topic of significant practical interest. However, its research is largely hindered by the lack of training data due to copyright concerns and challenged by its creative nature. Most noticeably, prior works often fall short of modeling the cross-modal correlation between melody and lyrics due to limited parallel data, hence generating lyrics that are less singable. Exist…
▽ More
Automatic song writing is a topic of significant practical interest. However, its research is largely hindered by the lack of training data due to copyright concerns and challenged by its creative nature. Most noticeably, prior works often fall short of modeling the cross-modal correlation between melody and lyrics due to limited parallel data, hence generating lyrics that are less singable. Existing works also lack effective mechanisms for content control, a much desired feature for democratizing song creation for people with limited music background. In this work, we propose to generate pleasantly listenable lyrics without training on melody-lyric aligned data. Instead, we design a hierarchical lyric generation framework that disentangles training (based purely on text) from inference (melody-guided text generation). At inference time, we leverage the crucial alignments between melody and lyrics and compile the given melody into constraints to guide the generation process. Evaluation results show that our model can generate high-quality lyrics that are more singable, intelligible, coherent, and in rhyme than strong baselines including those supervised on parallel data.
△ Less
Submitted 25 May, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Tractable Control for Autoregressive Language Generation
Authors:
Honghua Zhang,
Meihua Dang,
Nanyun Peng,
Guy Van den Broeck
Abstract:
Despite the success of autoregressive large language models in text generation, it remains a major challenge to generate text that satisfies complex constraints: sampling from the conditional distribution ${\Pr}(\text{text} | α)$ is intractable for even the simplest lexical constraints $α$. To overcome this challenge, we propose to use tractable probabilistic models (TPMs) to impose lexical constr…
▽ More
Despite the success of autoregressive large language models in text generation, it remains a major challenge to generate text that satisfies complex constraints: sampling from the conditional distribution ${\Pr}(\text{text} | α)$ is intractable for even the simplest lexical constraints $α$. To overcome this challenge, we propose to use tractable probabilistic models (TPMs) to impose lexical constraints in autoregressive text generation models, which we refer to as GeLaTo (Generating Language with Tractable Constraints). To demonstrate the effectiveness of this framework, we use distilled hidden Markov models, where we can efficiently compute ${\Pr}(\text{text} | α)$, to guide autoregressive generation from GPT2. GeLaTo achieves state-of-the-art performance on challenging benchmarks for constrained text generation (e.g., CommonGen), beating various strong baselines by a large margin. Our work not only opens up new avenues for controlling large language models but also motivates the development of more expressive TPMs.
△ Less
Submitted 15 November, 2023; v1 submitted 14 April, 2023;
originally announced April 2023.