-
Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions
Authors:
Divyansh Agarwal,
Alexander R. Fabbri,
Philippe Laban,
Ben Risher,
Shafiq Joty,
Caiming Xiong,
Chien-Sheng Wu
Abstract:
Prompt leakage in large language models (LLMs) poses a significant security and privacy threat, particularly in retrieval-augmented generation (RAG) systems. However, leakage in multi-turn LLM interactions along with mitigation strategies has not been studied in a standardized manner. This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and ope…
▽ More
Prompt leakage in large language models (LLMs) poses a significant security and privacy threat, particularly in retrieval-augmented generation (RAG) systems. However, leakage in multi-turn LLM interactions along with mitigation strategies has not been studied in a standardized manner. This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs. Our unique multi-turn threat model leverages the LLM's sycophancy effect and our analysis dissects task instruction and knowledge leakage in the LLM response. In a multi-turn setting, our threat model elevates the average attack success rate (ASR) to 86.2%, including a 99% leakage with GPT-4 and claude-1.3. We find that some black-box LLMs like Gemini show variable susceptibility to leakage across domains - they are more likely to leak contextual knowledge in the news domain compared to the medical domain. Our experiments measure specific effects of 6 black-box defense strategies, including a query-rewriter in the RAG scenario. Our proposed multi-tier combination of defenses still has an ASR of 5.3% for black-box LLMs, indicating room for enhancement and future direction for LLM security research.
△ Less
Submitted 26 April, 2024; v1 submitted 24 April, 2024;
originally announced April 2024.
-
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents
Authors:
Liyan Tang,
Philippe Laban,
Greg Durrett
Abstract:
Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many call…
▽ More
Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment
Authors:
Philippe Laban,
Lidiya Murakhovs'ka,
Caiming Xiong,
Chien-Sheng Wu
Abstract:
The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop experiment: in the first round of the conversation, an LLM completes a classification task. In a second round, the LLM is challenged with a follow-up phrase like "Ar…
▽ More
The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop experiment: in the first round of the conversation, an LLM completes a classification task. In a second round, the LLM is challenged with a follow-up phrase like "Are you sure?", offering an opportunity for the model to reflect on its initial answer, and decide whether to confirm or flip its answer. A systematic study of ten LLMs on seven classification tasks reveals that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect). We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate - reducing performance deterioration by 60% - but not resolve sycophantic behavior entirely. The FlipFlop experiment illustrates the universality of sycophantic behavior in LLMs and provides a robust framework to analyze model behavior and evaluate future models.
△ Less
Submitted 21 February, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Salespeople vs SalesBot: Exploring the Role of Educational Value in Conversational Recommender Systems
Authors:
Lidiya Murakhovs'ka,
Philippe Laban,
Tian Xie,
Caiming Xiong,
Chien-Sheng Wu
Abstract:
Making big purchases requires consumers to research or consult a salesperson to gain domain expertise. However, existing conversational recommender systems (CRS) often overlook users' lack of background knowledge, focusing solely on gathering preferences. In this work, we define a new problem space for conversational agents that aim to provide both product recommendations and educational value thr…
▽ More
Making big purchases requires consumers to research or consult a salesperson to gain domain expertise. However, existing conversational recommender systems (CRS) often overlook users' lack of background knowledge, focusing solely on gathering preferences. In this work, we define a new problem space for conversational agents that aim to provide both product recommendations and educational value through mixed-type mixed-initiative dialog. We introduce SalesOps, a framework that facilitates the simulation and evaluation of such systems by leveraging recent advancements in large language models (LLMs). We build SalesBot and ShopperBot, a pair of LLM-powered agents that can simulate either side of the framework. A comprehensive human study compares SalesBot against professional salespeople, revealing that although SalesBot approaches professional performance in terms of fluency and informativeness, it lags behind in recommendation quality. We emphasize the distinct limitations both face in providing truthful information, highlighting the challenges of ensuring faithfulness in the CRS context. We release our code and make all data available.
△ Less
Submitted 26 October, 2023;
originally announced October 2023.
-
Automatic and Human-AI Interactive Text Generation
Authors:
Yao Dou,
Philippe Laban,
Claire Gardent,
Wei Xu
Abstract:
In this tutorial, we focus on text-to-text generation, a class of natural language generation (NLG) tasks, that takes a piece of text as input and then generates a revision that is improved according to some specific criteria (e.g., readability or linguistic styles), while largely retaining the original meaning and the length of the text. This includes many useful applications, such as text simpli…
▽ More
In this tutorial, we focus on text-to-text generation, a class of natural language generation (NLG) tasks, that takes a piece of text as input and then generates a revision that is improved according to some specific criteria (e.g., readability or linguistic styles), while largely retaining the original meaning and the length of the text. This includes many useful applications, such as text simplification, paraphrase generation, style transfer, etc. In contrast to text summarization and open-ended text completion (e.g., story), the text-to-text generation tasks we discuss in this tutorial are more constrained in terms of semantic consistency and targeted language styles. This level of control makes these tasks ideal testbeds for studying the ability of models to generate text that is both semantically adequate and stylistically appropriate. Moreover, these tasks are interesting from a technical standpoint, as they require complex combinations of lexical and syntactical transformations, stylistic control, and adherence to factual knowledge, -- all at once. With a special focus on text simplification and revision, this tutorial aims to provide an overview of the state-of-the-art natural language generation research from four major aspects -- Data, Models, Human-AI Collaboration, and Evaluation -- and to discuss and showcase a few significant and recent advances: (1) the use of non-retrogressive approaches; (2) the shift from fine-tuning to prompting with large language models; (3) the development of new learnable metric and fine-grained human evaluation framework; (4) a growing body of studies and datasets on non-English languages; (5) the rise of HCI+NLP+Accessibility interdisciplinary research to create real-world writing assistant systems.
△ Less
Submitted 5 October, 2023;
originally announced October 2023.
-
Beyond the Chat: Executable and Verifiable Text-Editing with LLMs
Authors:
Philippe Laban,
Jesse Vig,
Marti A. Hearst,
Caiming Xiong,
Chien-Sheng Wu
Abstract:
Conversational interfaces powered by Large Language Models (LLMs) have recently become a popular way to obtain feedback during document editing. However, standard chat-based conversational interfaces do not support transparency and verifiability of the editing changes that they suggest. To give the author more agency when editing with an LLM, we present InkSync, an editing interface that suggests…
▽ More
Conversational interfaces powered by Large Language Models (LLMs) have recently become a popular way to obtain feedback during document editing. However, standard chat-based conversational interfaces do not support transparency and verifiability of the editing changes that they suggest. To give the author more agency when editing with an LLM, we present InkSync, an editing interface that suggests executable edits directly within the document being edited. Because LLMs are known to introduce factual errors, Inksync also supports a 3-stage approach to mitigate this risk: Warn authors when a suggested edit introduces new information, help authors Verify the new information's accuracy through external search, and allow an auditor to perform an a-posteriori verification by Auditing the document via a trace of all auto-generated content. Two usability studies confirm the effectiveness of InkSync's components when compared to standard LLM-based chat interfaces, leading to more accurate, more efficient editing, and improved user experience.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
Art or Artifice? Large Language Models and the False Promise of Creativity
Authors:
Tuhin Chakrabarty,
Philippe Laban,
Divyansh Agarwal,
Smaranda Muresan,
Chien-Sheng Wu
Abstract:
Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, evaluating objectively the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writ…
▽ More
Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, evaluating objectively the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass 3-10X less TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.
△ Less
Submitted 8 March, 2024; v1 submitted 25 September, 2023;
originally announced September 2023.
-
Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles
Authors:
Kung-Hsiang Huang,
Philippe Laban,
Alexander R. Fabbri,
Prafulla Kumar Choubey,
Shafiq Joty,
Caiming Xiong,
Chien-Sheng Wu
Abstract:
Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, the summarization of diverse information dispersed across multiple articles about an event remains underexplored. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To…
▽ More
Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, the summarization of diverse information dispersed across multiple articles about an event remains underexplored. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Next, to enable consistent automatic evaluation, we conducted a comprehensive analysis to pinpoint the position and verbosity biases when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of summaries. Through correlation analyses, we outline the best practices for effectively using automatic LLM-based metrics on the DiverseSumm dataset. Finally, we study how LLMs summarize multiple news articles by analyzing which type of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them mainly due to their limited coverage, with GPT-4 only able to cover under 40% of the diverse information on average.
△ Less
Submitted 22 March, 2024; v1 submitted 17 September, 2023;
originally announced September 2023.
-
XGen-7B Technical Report
Authors:
Erik Nijkamp,
Tian Xie,
Hiroaki Hayashi,
Bo Pang,
Congying Xia,
Chen Xing,
Jesse Vig,
Semih Yavuz,
Philippe Laban,
Ben Krause,
Senthil Purushwalkam,
Tong Niu,
Wojciech Kryściński,
Lidiya Murakhovs'ka,
Prafulla Kumar Choubey,
Alex Fabbri,
Ye Liu,
Rui Meng,
Lifu Tu,
Meghana Bhat,
Chien-Sheng Wu,
Silvio Savarese,
Yingbo Zhou,
Shafiq Joty,
Caiming Xiong
Abstract:
Large Language Models (LLMs) have become ubiquitous across various domains, transforming the way we interact with information and conduct research. However, most high-performing LLMs remain confined behind proprietary walls, hindering scientific progress. Most open-source LLMs, on the other hand, are limited in their ability to support longer sequence lengths, which is a key requirement for many t…
▽ More
Large Language Models (LLMs) have become ubiquitous across various domains, transforming the way we interact with information and conduct research. However, most high-performing LLMs remain confined behind proprietary walls, hindering scientific progress. Most open-source LLMs, on the other hand, are limited in their ability to support longer sequence lengths, which is a key requirement for many tasks that require inference over an input context. To address this, we have trained XGen, a series of 7B parameter models on up to 8K sequence length for up to 1.5T tokens. We have also finetuned the XGen models on public-domain instructional data, creating their instruction-tuned counterparts (XGen-Inst). We open-source our models for both research advancements and commercial applications. Our evaluation on standard benchmarks shows that XGen models achieve comparable or better results when compared with state-of-the-art open-source LLMs. Our targeted evaluation on long sequence modeling tasks shows the benefits of our 8K-sequence models over 2K-sequence open-source LLMs.
△ Less
Submitted 6 September, 2023;
originally announced September 2023.
-
Next Steps for Human-Centered Generative AI: A Technical Perspective
Authors:
Xiang 'Anthony' Chen,
Jeff Burke,
Ruofei Du,
Matthew K. Hong,
Jennifer Jacobs,
Philippe Laban,
Dingzeyu Li,
Nanyun Peng,
Karl D. D. Willis,
Chien-Sheng Wu,
Bolei Zhou
Abstract:
Through iterative, cross-disciplinary discussions, we define and propose next-steps for Human-centered Generative AI (HGAI). We contribute a comprehensive research agenda that lays out future directions of Generative AI spanning three levels: aligning with human values; assimilating human intents; and augmenting human abilities. By identifying these next-steps, we intend to draw interdisciplinary…
▽ More
Through iterative, cross-disciplinary discussions, we define and propose next-steps for Human-centered Generative AI (HGAI). We contribute a comprehensive research agenda that lays out future directions of Generative AI spanning three levels: aligning with human values; assimilating human intents; and augmenting human abilities. By identifying these next-steps, we intend to draw interdisciplinary research teams to pursue a coherent set of emergent ideas in HGAI, focusing on their interested topics while maintaining a coherent big picture of the future work landscape.
△ Less
Submitted 22 December, 2023; v1 submitted 27 June, 2023;
originally announced June 2023.
-
Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning
Authors:
Fan Yin,
Jesse Vig,
Philippe Laban,
Shafiq Joty,
Caiming Xiong,
Chien-Sheng Jason Wu
Abstract:
Large language models (LLMs) have shown impressive performance in following natural language instructions to solve unseen tasks. However, it remains unclear whether models truly understand task definitions and whether the human-written definitions are optimal. In this paper, we systematically study the role of task definitions in instruction learning. We first conduct an ablation analysis informed…
▽ More
Large language models (LLMs) have shown impressive performance in following natural language instructions to solve unseen tasks. However, it remains unclear whether models truly understand task definitions and whether the human-written definitions are optimal. In this paper, we systematically study the role of task definitions in instruction learning. We first conduct an ablation analysis informed by human annotations to understand which parts of a task definition are most important, and find that model performance only drops substantially when removing contents describing the task output, in particular label information. Next, we propose an automatic algorithm to compress task definitions to a minimal supporting set of tokens, and find that 60\% of tokens can be removed while maintaining or even improving model performance. Based on these results, we propose two strategies to help models better leverage task instructions: (1) providing only key information for tasks in a common structured format, and (2) adding a meta-tuning stage to help the model better understand the definitions. With these two strategies, we achieve a 4.2 Rouge-L improvement over 119 unseen test tasks.
△ Less
Submitted 1 June, 2023;
originally announced June 2023.
-
SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages
Authors:
Philippe Laban,
Jesse Vig,
Wojciech Kryscinski,
Shafiq Joty,
Caiming Xiong,
Chien-Sheng Wu
Abstract:
Text simplification research has mostly focused on sentence-level simplification, even though many desirable edits - such as adding relevant background information or reordering content - may require document-level context. Prior work has also predominantly framed simplification as a single-step, input-to-output task, only implicitly modeling the fine-grained, span-level edits that elucidate the s…
▽ More
Text simplification research has mostly focused on sentence-level simplification, even though many desirable edits - such as adding relevant background information or reordering content - may require document-level context. Prior work has also predominantly framed simplification as a single-step, input-to-output task, only implicitly modeling the fine-grained, span-level edits that elucidate the simplification process. To address both gaps, we introduce the SWiPE dataset, which reconstructs the document-level editing process from English Wikipedia (EW) articles to paired Simple Wikipedia (SEW) articles. In contrast to prior work, SWiPE leverages the entire revision history when pairing pages in order to better identify simplification edits. We work with Wikipedia editors to annotate 5,000 EW-SEW document pairs, labeling more than 40,000 edits with proposed 19 categories. To scale our efforts, we propose several models to automatically label edits, achieving an F-1 score of up to 70.6, indicating that this is a tractable but challenging NLU task. Finally, we categorize the edits produced by several simplification models and find that SWiPE-trained models generate more complex edits while reducing unwanted edits.
△ Less
Submitted 30 May, 2023;
originally announced May 2023.
-
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond
Authors:
Philippe Laban,
Wojciech Kryściński,
Divyansh Agarwal,
Alexander R. Fabbri,
Caiming Xiong,
Shafiq Joty,
Chien-Sheng Wu
Abstract:
With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency de…
▽ More
With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8\% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
Designing and Evaluating Interfaces that Highlight News Coverage Diversity Using Discord Questions
Authors:
Philippe Laban,
Chien-Sheng Wu,
Lidiya Murakhovs'ka,
Xiang 'Anthony' Chen,
Caiming Xiong
Abstract:
Modern news aggregators do the hard work of organizing a large news stream, creating collections for a given news story with tens of source options. This paper shows that navigating large source collections for a news story can be challenging without further guidance. In this work, we design three interfaces -- the Annotated Article, the Recomposed Article, and the Question Grid -- aimed at accomp…
▽ More
Modern news aggregators do the hard work of organizing a large news stream, creating collections for a given news story with tens of source options. This paper shows that navigating large source collections for a news story can be challenging without further guidance. In this work, we design three interfaces -- the Annotated Article, the Recomposed Article, and the Question Grid -- aimed at accompanying news readers in discovering coverage diversity while they read. A first usability study with 10 journalism experts confirms the designed interfaces all reveal coverage diversity and determine each interface's potential use cases and audiences. In a second usability study, we developed and implemented a reading exercise with 95 novice news readers to measure exposure to coverage diversity. Results show that Annotated Article users are able to answer questions 34% more completely than with two existing interfaces while finding the interface equally easy to use.
△ Less
Submitted 17 February, 2023;
originally announced February 2023.
-
Discord Questions: A Computational Approach To Diversity Analysis in News Coverage
Authors:
Philippe Laban,
Chien-Sheng Wu,
Lidiya Murakhovs'ka,
Xiang 'Anthony' Chen,
Caiming Xiong
Abstract:
There are many potential benefits to news readers accessing diverse sources. Modern news aggregators do the hard work of organizing the news, offering readers a plethora of source options, but choosing which source to read remains challenging. We propose a new framework to assist readers in identifying source differences and gaining an understanding of news coverage diversity. The framework is bas…
▽ More
There are many potential benefits to news readers accessing diverse sources. Modern news aggregators do the hard work of organizing the news, offering readers a plethora of source options, but choosing which source to read remains challenging. We propose a new framework to assist readers in identifying source differences and gaining an understanding of news coverage diversity. The framework is based on the generation of Discord Questions: questions with a diverse answer pool, explicitly illustrating source differences. To assemble a prototype of the framework, we focus on two components: (1) discord question generation, the task of generating questions answered differently by sources, for which we propose an automatic scoring method, and create a model that improves performance from current question generation (QG) methods by 5%, (2) answer consolidation, the task of grouping answers to a question that are semantically similar, for which we collect data and repurpose a method that achieves 81% balanced accuracy on our realistic test set. We illustrate the framework's feasibility through a prototype interface. Even though model performance at discord QG still lags human performance by more than 15%, generated questions are judged to be more interesting than factoid questions and can reveal differences in the level of detail, sentiment, and reasoning of sources in news coverage.
△ Less
Submitted 9 November, 2022;
originally announced November 2022.
-
Marvista: Exploring the Design of a Human-AI Collaborative News Reading Tool
Authors:
Xiang 'Anthony' Chen,
Chien-Sheng Wu,
Lidiya Murakhovs'ka,
Philippe Laban,
Tong Niu,
Wenhao Liu,
Caiming Xiong
Abstract:
We explore the design of Marvista -- a human-AI collaborative tool that employs a suite of natural language processing models to provide end-to-end support for reading online news articles. Before reading an article, Marvista helps a user plan what to read by filtering text based on how much time one can spend and what questions one is interested to find out from the article. During reading, Marvi…
▽ More
We explore the design of Marvista -- a human-AI collaborative tool that employs a suite of natural language processing models to provide end-to-end support for reading online news articles. Before reading an article, Marvista helps a user plan what to read by filtering text based on how much time one can spend and what questions one is interested to find out from the article. During reading, Marvista helps the user reflect on their understanding of each paragraph with AI-generated questions. After reading, Marvista generates an explainable human-AI summary that combines both AI's processing of the text, the user's reading behavior, and user-generated data in the reading process. In contrast to prior work that offered (content-independent) interaction techniques or devices for reading, Marvista takes a human-AI collaborative approach that contributes text-specific guidance (content-aware) to support the entire reading process.
△ Less
Submitted 23 June, 2023; v1 submitted 18 July, 2022;
originally announced July 2022.
-
Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors
Authors:
Liyan Tang,
Tanya Goyal,
Alexander R. Fabbri,
Philippe Laban,
Jiacheng Xu,
Semih Yavuz,
Wojciech Kryściński,
Justin F. Rousseau,
Greg Durrett
Abstract:
The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has be…
▽ More
The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model. We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models. Critically, our analysis shows that much of the recent improvement in the factuality detection space has been on summaries from older (pre-Transformer) models instead of more relevant recent summarization models. We further perform a finer-grained analysis per error-type and find similar performance variance across error types for different factuality metrics. Our results show that no one metric is superior in all settings or for all error types, and we provide recommendations for best practices given these insights.
△ Less
Submitted 25 May, 2023; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets
Authors:
Philippe Laban,
Chien-Sheng Wu,
Wenhao Liu,
Caiming Xiong
Abstract:
Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) tha…
▽ More
Precisely assessing the progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish a preference in a model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place a higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves a higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grain model analysis, or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
△ Less
Submitted 9 November, 2022; v1 submitted 13 May, 2022;
originally announced May 2022.
-
Quiz Design Task: Helping Teachers Create Quizzes with Automated Question Generation
Authors:
Philippe Laban,
Chien-Sheng Wu,
Lidiya Murakhovs'ka,
Wenhao Liu,
Caiming Xiong
Abstract:
Question generation (QGen) models are often evaluated with standardized NLG metrics that are based on n-gram overlap. In this paper, we measure whether these metric improvements translate to gains in a practical setting, focusing on the use case of helping teachers automate the generation of reading comprehension quizzes. In our study, teachers building a quiz receive question suggestions, which t…
▽ More
Question generation (QGen) models are often evaluated with standardized NLG metrics that are based on n-gram overlap. In this paper, we measure whether these metric improvements translate to gains in a practical setting, focusing on the use case of helping teachers automate the generation of reading comprehension quizzes. In our study, teachers building a quiz receive question suggestions, which they can either accept or refuse with a reason. Even though we find that recent progress in QGen leads to a significant increase in question acceptance rates, there is still large room for improvement, with the best model having only 68.4% of its questions accepted by the ten teachers who participated in our study. We then leverage the annotations we collected to analyze standard NLG metrics and find that model performance has reached projected upper-bounds, suggesting new automatic metrics are needed to guide QGen research forward.
△ Less
Submitted 3 May, 2022;
originally announced May 2022.
-
NewsPod: Automatic and Interactive News Podcasts
Authors:
Philippe Laban,
Elicia Ye,
Srujay Korlakunta,
John Canny,
Marti A. Hearst
Abstract:
News podcasts are a popular medium to stay informed and dive deep into news topics. Today, most podcasts are handcrafted by professionals. In this work, we advance the state-of-the-art in automatically generated podcasts, making use of recent advances in natural language processing and text-to-speech technology. We present NewsPod, an automatically generated, interactive news podcast. The podcast…
▽ More
News podcasts are a popular medium to stay informed and dive deep into news topics. Today, most podcasts are handcrafted by professionals. In this work, we advance the state-of-the-art in automatically generated podcasts, making use of recent advances in natural language processing and text-to-speech technology. We present NewsPod, an automatically generated, interactive news podcast. The podcast is divided into segments, each centered on a news event, with each segment structured as a Question and Answer conversation, whose goal is to engage the listener. A key aspect of the design is the use of distinct voices for each role (questioner, responder), to better simulate a conversation. Another novel aspect of NewsPod allows listeners to interact with the podcast by asking their own questions and receiving automatically generated answers. We validate the soundness of this system design through two usability studies, focused on evaluating the narrative style and interactions with the podcast, respectively. We find that NewsPod is preferred over a baseline by participants, with 80% claiming they would use the system in the future.
△ Less
Submitted 14 February, 2022;
originally announced February 2022.
-
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
Authors:
Philippe Laban,
Tobias Schnabel,
Paul N. Bennett,
Marti A. Hearst
Abstract:
In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between…
▽ More
In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level). We provide a highly effective and light-weight method called SummaCConv that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. On our newly introduced benchmark called SummaC (Summary Consistency) consisting of six large inconsistency detection datasets, SummaCConv obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% point improvement compared to prior work. We make the models and datasets available: https://github.com/tingofurro/summac
△ Less
Submitted 18 November, 2021;
originally announced November 2021.
-
MixQG: Neural Question Generation with Mixed Answer Types
Authors:
Lidiya Murakhovs'ka,
Chien-Sheng Wu,
Philippe Laban,
Tong Niu,
Wenhao Liu,
Caiming Xiong
Abstract:
Asking good questions is an essential ability for both human and machine intelligence. However, existing neural question generation approaches mainly focus on the short factoid type of answers. In this paper, we propose a neural question generator, MixQG, to bridge this gap. We combine 9 question answering datasets with diverse answer types, including yes/no, multiple-choice, extractive, and abstr…
▽ More
Asking good questions is an essential ability for both human and machine intelligence. However, existing neural question generation approaches mainly focus on the short factoid type of answers. In this paper, we propose a neural question generator, MixQG, to bridge this gap. We combine 9 question answering datasets with diverse answer types, including yes/no, multiple-choice, extractive, and abstractive answers, to train a single generative model. We show with empirical results that our model outperforms existing work in both seen and unseen domains and can generate questions with different cognitive levels when conditioned on different answer types. Our code is released and well-integrated with the Huggingface library to facilitate various downstream applications.
△ Less
Submitted 31 May, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Can Transformer Models Measure Coherence In Text? Re-Thinking the Shuffle Test
Authors:
Philippe Laban,
Luke Dai,
Lucas Bandarkar,
Marti A. Hearst
Abstract:
The Shuffle Test is the most common task to evaluate whether NLP models can measure coherence in text. Most recent work uses direct supervision on the task; we show that by simply finetuning a RoBERTa model, we can achieve a near perfect accuracy of 97.8%, a state-of-the-art. We argue that this outstanding performance is unlikely to lead to a good model of text coherence, and suggest that the Shuf…
▽ More
The Shuffle Test is the most common task to evaluate whether NLP models can measure coherence in text. Most recent work uses direct supervision on the task; we show that by simply finetuning a RoBERTa model, we can achieve a near perfect accuracy of 97.8%, a state-of-the-art. We argue that this outstanding performance is unlikely to lead to a good model of text coherence, and suggest that the Shuffle Test should be approached in a Zero-Shot setting: models should be evaluated without being trained on the task itself. We evaluate common models in this setting, such as Generative and Bi-directional Transformers, and find that larger architectures achieve high-performance out-of-the-box. Finally, we suggest the k-Block Shuffle Test, a modification of the original by increasing the size of blocks shuffled. Even though human reader performance remains high (around 95% accuracy), model performance drops from 94% to 78% as block size increases, creating a conceptually simple challenge to benchmark NLP models. Code available: https://github.com/tingofurro/shuffle_test/
△ Less
Submitted 7 July, 2021;
originally announced July 2021.
-
Keep it Simple: Unsupervised Simplification of Multi-Paragraph Text
Authors:
Philippe Laban,
Tobias Schnabel,
Paul Bennett,
Marti A. Hearst
Abstract:
This work presents Keep it Simple (KiS), a new approach to unsupervised text simplification which learns to balance a reward across three properties: fluency, salience and simplicity. We train the model with a novel algorithm to optimize the reward (k-SCST), in which the model proposes several candidate simplifications, computes each candidate's reward, and encourages candidates that outperform th…
▽ More
This work presents Keep it Simple (KiS), a new approach to unsupervised text simplification which learns to balance a reward across three properties: fluency, salience and simplicity. We train the model with a novel algorithm to optimize the reward (k-SCST), in which the model proposes several candidate simplifications, computes each candidate's reward, and encourages candidates that outperform the mean reward. Finally, we propose a realistic text comprehension task as an evaluation method for text simplification. When tested on the English news domain, the KiS model outperforms strong supervised baselines by more than 4 SARI points, and can help people complete a comprehension task an average of 18% faster while retaining accuracy, when compared to the original text. Code available: https://github.com/tingofurro/keep_it_simple
△ Less
Submitted 7 July, 2021;
originally announced July 2021.
-
What's The Latest? A Question-driven News Chatbot
Authors:
Philippe Laban,
John Canny,
Marti A. Hearst
Abstract:
This work describes an automatic news chatbot that draws content from a diverse set of news articles and creates conversations with a user about the news. Key components of the system include the automatic organization of news articles into topical chatrooms, integration of automatically generated questions into the conversation, and a novel method for choosing which questions to present which avo…
▽ More
This work describes an automatic news chatbot that draws content from a diverse set of news articles and creates conversations with a user about the news. Key components of the system include the automatic organization of news articles into topical chatrooms, integration of automatically generated questions into the conversation, and a novel method for choosing which questions to present which avoids repetitive suggestions. We describe the algorithmic framework and present the results of a usability study that shows that news readers using the system successfully engage in multi-turn conversations about specific news stories.
△ Less
Submitted 11 May, 2021;
originally announced May 2021.
-
News Headline Grouping as a Challenging NLU Task
Authors:
Philippe Laban,
Lucas Bandarkar,
Marti A. Hearst
Abstract:
Recent progress in Natural Language Understanding (NLU) has seen the latest models outperform human performance on many standard tasks. These impressive results have led the community to introspect on dataset limitations, and iterate on more nuanced challenges. In this paper, we introduce the task of HeadLine Grouping (HLG) and a corresponding dataset (HLGD) consisting of 20,056 pairs of news head…
▽ More
Recent progress in Natural Language Understanding (NLU) has seen the latest models outperform human performance on many standard tasks. These impressive results have led the community to introspect on dataset limitations, and iterate on more nuanced challenges. In this paper, we introduce the task of HeadLine Grouping (HLG) and a corresponding dataset (HLGD) consisting of 20,056 pairs of news headlines, each labeled with a binary judgement as to whether the pair belongs within the same group. On HLGD, human annotators achieve high performance of around 0.9 F-1, while current state-of-the art Transformer models only reach 0.75 F-1, opening the path for further improvements. We further propose a novel unsupervised Headline Generator Swap model for the task of HeadLine Grouping that achieves within 3 F-1 of the best supervised model. Finally, we analyze high-performing models with consistency tests, and find that models are not consistent in their predictions, revealing modeling limits of current architectures.
△ Less
Submitted 11 May, 2021;
originally announced May 2021.
-
The Summary Loop: Learning to Write Abstractive Summaries Without Examples
Authors:
Philippe Laban,
Andrew Hsi,
John Canny,
Marti A. Hearst
Abstract:
This work presents a new approach to unsupervised abstractive summarization based on maximizing a combination of coverage and fluency for a given length constraint. It introduces a novel method that encourages the inclusion of key terms from the original document into the summary: key terms are masked out of the original document and must be filled in by a coverage model using the current generate…
▽ More
This work presents a new approach to unsupervised abstractive summarization based on maximizing a combination of coverage and fluency for a given length constraint. It introduces a novel method that encourages the inclusion of key terms from the original document into the summary: key terms are masked out of the original document and must be filled in by a coverage model using the current generated summary. A novel unsupervised training procedure leverages this coverage model along with a fluency model to generate and score summaries. When tested on popular news summarization datasets, the method outperforms previous unsupervised methods by more than 2 R-1 points, and approaches results of competitive supervised methods. Our model attains higher levels of abstraction with copied passages roughly two times shorter than prior work, and learns to compress and merge sentences without supervision.
△ Less
Submitted 11 May, 2021;
originally announced May 2021.