no code implementations • CL (ACL) 2021 • Richard Sproat, Alexander Gutkin
Our work provides the first quantifiable measure of the notion of logography that accords with linguistic intuition and, we argue, provides better insight into what this notion means.
no code implementations • SLPAT (ACL) 2022 • Brian Roark, Alexander Gutkin
We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library.
no code implementations • LREC 2022 • Isin Demirsahin, Cibu Johny, Alexander Gutkin, Brian Roark
This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script, a process known as romanization.
no code implementations • NAACL (SIGTYP) 2022 • Christo Kirov, Richard Sproat, Alexander Gutkin
For reflex generation, the missing reflexes are treated as “masked pixels” in an “image” which is a representation of an entire cognate set across a language family.
no code implementations • LREC 2022 • Alexander Gutkin, Cibu Johny, Raiomond Doctor, Lawrence Wolf-Sonkin, Brian Roark
The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems.
1 code implementation • 19 May 2023 • Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, Dmitry Panteleev, Partha Talukdar
We evaluate commonly used models on the benchmark.
1 code implementation • 26 Jan 2023 • Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat
This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
1 code implementation • 21 Oct 2022 • Raiomond Doctor, Alexander Gutkin, Cibu Johny, Brian Roark, Richard Sproat
Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctuation for the original Arabic and numerous other regional orthographic traditions.
1 code implementation • 18 Oct 2022 • Llion Jones, Richard Sproat, Haruko Ishikawa, Alexander Gutkin
If one sees the place name Houston Mercer Dog Run in New York, how does one know how to pronounce it?
no code implementations • 9 May 2022 • Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, Macduff Hughes
In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages.
no code implementations • EACL 2021 • Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, Brian Roark
This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts.
1 code implementation • 14 Oct 2020 • Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu Johny, Anna Katanova, Oddur Kjartansson, Chenfang Li, Tatiana Merkulova, Yin May Oo, Knot Pipatsrisawat, Clara Rivera, Supheakmungkol Sarin, Pasindu De Silva, Keshan Sodimana, Richard Sproat, Theeraphol Wattanavekin, Jaka Aris Eko Wibawa
This paper presents an overview of a program designed to address the growing need for developing freely available speech resources for under-represented languages.
1 code implementation • EMNLP (SIGTYP) 2020 • Alexander Gutkin, Richard Sproat
This paper describes the NEMO submission to the SIGTYP 2020 shared task, which deals with predicting linguistic typological features for multiple languages using data derived from the World Atlas of Language Structures (WALS).
no code implementations • 12 Oct 2020 • Alexander Gutkin, Martin Jansche, Lucy Skidmore
This extended abstract surveying the work on phonological typology was prepared for "SIGTYP 2020: The Second Workshop on Computational Research in Linguistic Typology" to be held at EMNLP 2020.
no code implementations • 30 Apr 2020 • Alexander Gutkin, Tatiana Merkulova, Martin Jansche
In this paper we investigate whether the various linguistic features from the World Atlas of Language Structures (WALS) can be reliably inferred from multilingual text.
2 code implementations • 21 May 2019 • Martin Jansche, Alexander Gutkin
We consider the problem of efficient sampling: drawing random string variates from the probability distribution represented by stochastic automata and transformations of those.
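To make "drawing random string variates" from a stochastic automaton concrete, here is a minimal sketch of plain ancestral sampling on a toy automaton. It is not the method of the paper, which concerns efficient sampling and transformations of automata; the `ARCS` table, its probabilities, and the `sample_string` helper are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy automaton: state -> list of (probability, symbol, next_state).
# A symbol of None means "stop here"; the probabilities at each state sum to 1.
ARCS = {
    0: [(0.6, "a", 0), (0.3, "b", 1), (0.1, None, None)],
    1: [(0.5, "b", 1), (0.5, None, None)],
}

def sample_string(arcs, start=0, rng=random):
    """Draw one string by walking the automaton from the start state."""
    state, out = start, []
    while True:
        options = arcs[state]
        weights = [p for p, _, _ in options]
        # Pick one outgoing option in proportion to its probability.
        _, symbol, nxt = rng.choices(options, weights=weights, k=1)[0]
        if symbol is None:  # the stop option was chosen
            return "".join(out)
        out.append(symbol)
        state = nxt

if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make runs repeatable
    print([sample_string(ARCS, rng=rng) for _ in range(5)])
```

Each call returns one string of zero or more `a` symbols followed by zero or more `b` symbols, drawn with the probability the toy automaton assigns to it. The walk costs one weighted choice per emitted symbol, which is the naive baseline that more efficient schemes aim to improve on.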