Showing 1–20 of 20 results for author: Roark, B

  1. XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

    Authors: Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, et al. (2 additional authors not shown)

    Abstract: Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot;…

    Submitted 24 May, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

  2. arXiv:2303.03457

    cs.CL

    Spelling convention sensitivity in neural language models

    Authors: Elizabeth Nielsen, Christo Kirov, Brian Roark

    Abstract: We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistently one or the other within model-generated strings. In contrast to long-distance dependencies in non-surface underlying structure (e.g., syntax), spelling consis…

    Submitted 6 March, 2023; originally announced March 2023.

    Journal ref: EACL Findings 2023

  3. Beyond Arabic: Software for Perso-Arabic Script Manipulation

    Authors: Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat

    Abstract: This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalizati…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Preprint to appear in the Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP 2022) at EMNLP, Abu Dhabi, United Arab Emirates, December 7-11, 2022. 7 pages

    ACM Class: I.2.7; I.7.2; I.7.1

  4. arXiv:2210.12273

    cs.CL

    Graphemic Normalization of the Perso-Arabic Script

    Authors: Raiomond Doctor, Alexander Gutkin, Cibu Johny, Brian Roark, Richard Sproat

    Abstract: Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions. This paper documents the challenges that Perso-Arabic presents beyond the best-…

    Submitted 29 January, 2024; v1 submitted 21 October, 2022; originally announced October 2022.

    Comments: Pre-print to appear in the Proceedings of Grapholinguistics in the 21st Century (G21C), 2022. Telecom Paris, Palaiseau, France, June 8-10, 2022. 41 pages, 38 tables, 3 figures

    ACM Class: I.2.7; I.7.2; I.7.1

  5. arXiv:2110.01140

    cs.CL

    Structured abbreviation expansion in context

    Authors: Kyle Gorman, Christo Kirov, Brian Roark, Richard Sproat

    Abstract: Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages. We consider the task of reversing these abbreviations in context to recover normalized, expanded versions of abbreviated messages. The problem is related to, but distinct from, spelling correction, in that ad hoc abbreviations are intentional and may involve substantial differences from the orig…

    Submitted 3 October, 2021; originally announced October 2021.

    Comments: Accepted to Findings of EMNLP 2021

  6. arXiv:2104.06325

    cs.CL

    Finding Concept-specific Biases in Form--Meaning Associations

    Authors: Tiago Pimentel, Brian Roark, Søren Wichmann, Ryan Cotterell, Damián Blasi

    Abstract: This work presents an information-theoretic operationalisation of cross-linguistic non-arbitrariness. It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words. For instance, it has been claimed (Blasi et al., 2016) that the word for "tongue" is more likely than chance to contain the phone [l]. By controlling for the influence of language fami…

    Submitted 29 April, 2021; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted at NAACL 2021. This is the camera ready version. Code is available in https://github.com/rycolab/form-meaning-associations

  7. arXiv:2102.02183

    cs.CL

    Disambiguatory Signals are Stronger in Word-initial Positions

    Authors: Tiago Pimentel, Ryan Cotterell, Brian Roark

    Abstract: Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower). This has led to the conjecture -- as in Wedel et al. (2019b), but common elsewhere -- that languages have evolved to provide more in…

    Submitted 3 February, 2021; originally announced February 2021.

    Comments: Accepted at EACL 2021. Code is available in https://github.com/tpimentelms/frontload-disambiguation

  8. arXiv:2007.01176

    cs.CL

    Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

    Authors: Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, Keith Hall

    Abstract: This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and s…

    Submitted 2 July, 2020; originally announced July 2020.

    Comments: Published at LREC 2020

  9. Phonotactic Complexity and its Trade-offs

    Authors: Tiago Pimentel, Brian Roark, Ryan Cotterell

    Abstract: We present methods for calculating a measure of phonotactic complexity---bits per phoneme---that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the international phonetic alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme usi…

    Submitted 7 May, 2020; originally announced May 2020.

    Comments: Published in TACL: https://doi.org/10.1162/tacl_a_00296

    Journal ref: Transactions of the Association for Computational Linguistics, Vol. 8, 1-18

  10. arXiv:2004.09571

    eess.AS cs.SD stat.ML

    Language-agnostic Multilingual Modeling

    Authors: Arindrima Datta, Bhuvana Ramabhadran, Jesse Emond, Anjuli Kannan, Brian Roark

    Abstract: Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the data-scarce languages. However, most state-of-the-art multilingual models require the encoding of language information and therefore are not as flexible or scal…

    Submitted 20 April, 2020; originally announced April 2020.

  11. arXiv:1906.05906

    cs.CL

    Meaning to Form: Measuring Systematicity as Information

    Authors: Tiago Pimentel, Arya D. McCarthy, Damián E. Blasi, Brian Roark, Ryan Cotterell

    Abstract: A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram \textit{gl} have any systematic relationship to the meaning of words like \textit{glisten}, \textit{gleam} and \textit{gl…

    Submitted 26 July, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: Accepted for publication at ACL 2019

  12. arXiv:1906.04726

    cs.CL

    What Kind of Language Is Hard to Language-Model?

    Authors: Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner

    Abstract: How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl cor…

    Submitted 25 February, 2020; v1 submitted 11 June, 2019; originally announced June 2019.

    Comments: Published at ACL 2019

  13. arXiv:1905.08701

    cs.CL cs.FL cs.IT

    Approximating probabilistic models as weighted finite automata

    Authors: Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol

    Abstract: Weighted finite automata (WFA) are often used to represent probabilistic models, such as $n$-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such tha…

    Submitted 29 January, 2021; v1 submitted 21 May, 2019; originally announced May 2019.

  14. arXiv:1806.03743

    cs.CL

    Are All Languages Equally Hard to Language-Model?

    Authors: Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, Brian Roark

    Abstract: For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a s…

    Submitted 25 February, 2020; v1 submitted 10 June, 2018; originally announced June 2018.

    Comments: Published at NAACL 2018

  15. arXiv:cs/0105019

    cs.CL

    Robust Probabilistic Predictive Syntactic Processing

    Authors: Brian Roark

    Abstract: This thesis presents a broad-coverage probabilistic top-down parser, and its application to the problem of language modeling for speech recognition. The parser builds fully connected derivations incrementally, in a single pass from left-to-right across the string. We argue that the parsing approach that we have adopted is well-motivated from a psycholinguistic perspective, as a model that captur…

    Submitted 9 May, 2001; originally announced May 2001.

    Comments: Ph.D. Thesis, Brown University, Advisor: Mark Johnson. 140 pages, 40 figures, 27 tables

    ACM Class: I.2.7

  16. arXiv:cs/0105016

    cs.CL

    Probabilistic top-down parsing and language modeling

    Authors: Brian Roark

    Abstract: This paper describes the functioning of a broad-coverage probabilistic top-down parser, and its application to the problem of language modeling for speech recognition. The paper first introduces key notions in language modeling and probabilistic parsing, and briefly reviews some previous approaches to using syntactic structure for language modeling. A lexicalized probabilistic top-down parser is…

    Submitted 8 May, 2001; originally announced May 2001.

    Comments: 28 pages, 6 tables, 8 figures. To appear in Computational Linguistics 27(2), June 2001

    ACM Class: I.2.7

  17. arXiv:cs/0008027

    cs.CL

    Measuring efficiency in high-accuracy, broad-coverage statistical parsing

    Authors: Brian Roark, Eugene Charniak

    Abstract: Very little attention has been paid to the comparison of efficiency between high accuracy statistical parsers. This paper proposes one machine-independent metric that is general enough to allow comparisons across very different parsing architectures. This metric, which we call "events considered", measures the number of "events", however they are defined for a particular parser, for which a…

    Submitted 24 August, 2000; originally announced August 2000.

    Comments: 8 pages, 4 figures, 2 tables

    ACM Class: I.2.7

    Journal ref: Proceedings of the COLING 2000 Workshop on Efficiency in Large-Scale Parsing Systems, 2000, pages 29-36

  18. arXiv:cs/0008026

    cs.CL

    Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction

    Authors: Brian Roark, Eugene Charniak

    Abstract: Generating semantic lexicons semi-automatically could be a great time saver, relative to creating them by hand. In this paper, we present an algorithm for extracting potential entries for a category from an on-line corpus, based upon a small set of exemplars. Our algorithm finds more correct terms and fewer incorrect ones than previous work in this area. Additionally, the entries that are genera…

    Submitted 24 August, 2000; originally announced August 2000.

    Comments: 7 pages, 1 figure, 5 tables

    ACM Class: I.2.7

    Journal ref: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL), 1998, pages 1110-1116

  19. arXiv:cs/0008021

    cs.CL

    Compact non-left-recursive grammars using the selective left-corner transform and factoring

    Authors: Mark Johnson, Brian Roark

    Abstract: The left-corner transform removes left-recursion from (probabilistic) context-free grammars and unification grammars, permitting simple top-down parsing techniques to be used. Unfortunately the grammars produced by the standard left-corner transform are usually much larger than the original. The selective left-corner transform described in this paper produces a transformed grammar which simulate…

    Submitted 22 August, 2000; originally announced August 2000.

    Comments: 7 pages, 5 tables, 2 figures

    ACM Class: I.2.7

    Journal ref: Proceedings of the 18th International Conference on Computational Linguistics (COLING), 2000, pages 355-361

  20. arXiv:cs/0008017

    cs.CL

    Efficient probabilistic top-down and left-corner parsing

    Authors: Brian Roark, Mark Johnson

    Abstract: This paper examines efficient predictive broad-coverage parsing without dynamic programming. In contrast to bottom-up methods, depth-first top-down parsing produces partial parses that are fully connected trees spanning the entire left context, from which any kind of non-local dependency or partial semantic interpretation can in principle be read. We contrast two predictive parsing approaches, t…

    Submitted 21 August, 2000; originally announced August 2000.

    Comments: 8 pages, 3 tables, 3 figures

    ACM Class: I.2.7

    Journal ref: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999, pages 421-428