1 code implementation • NAACL 2022 • Ramit Sawhney, Ritesh Soun, Shrey Pandit, Megh Thakkar, Sarvagya Malaviya, Yuval Pinter
CIAug outperforms existing interpolative augmentation methods, achieving state-of-the-art results on 10 benchmark datasets across 4 languages on text classification and named entity recognition tasks.
1 code implementation • 20 Apr 2024 • Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella
Our empirical findings show that UniMorph Labeller reaches 98% accuracy, and that in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalization than morphological tokenization on the semantic compositionality of word meanings.
no code implementations • 30 Mar 2024 • Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.
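As a rough sketch of the idea (not the paper's implementation; the merge table, counts, and threshold below are invented), trimming removes rare merged subwords from the vocabulary and re-expresses them through the component pair they were built from:

```python
# Minimal sketch of threshold vocabulary trimming over a BPE vocabulary.
# `merges` maps each merged subword to the pair it was built from; `counts`
# holds corpus frequencies. Both are toy, hypothetical inputs.

def trim_vocabulary(merges, counts, threshold):
    """Return the subwords kept after removing rare merged tokens."""
    kept = set(counts)
    for token, freq in counts.items():
        # Only merged tokens can be trimmed; their components remain available.
        if token in merges and freq < threshold:
            kept.discard(token)
    return kept

def resegment(token, merges, kept):
    """Recursively replace a trimmed subword with its component subwords."""
    if token in kept or token not in merges:
        return [token]
    left, right = merges[token]
    return resegment(left, merges, kept) + resegment(right, merges, kept)

# Toy example: "ization" is rare, so it decomposes back into "iz" + "ation".
merges = {"ization": ("iz", "ation")}
counts = {"token": 500, "iz": 120, "ation": 300, "ization": 3}
kept = trim_vocabulary(merges, counts, threshold=10)
print(resegment("ization", merges, kept))  # ['iz', 'ation']
```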
no code implementations • 6 Mar 2024 • Carinne Cherf, Yuval Pinter
Evaluation of this task is crucial to determine the quality of the translation.
1 code implementation • 2 Mar 2024 • Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter
While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the way in which they were constructed.
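As one concrete example of such a decoding method, greedy longest-prefix matching segments text against a fixed vocabulary; the sketch below uses a toy vocabulary and illustrates the general idea only, not any specific tokenizer's implementation:

```python
# Sketch of greedy longest-prefix tokenization against a fixed subword vocabulary.
# The vocabulary is a toy example; real vocabularies come from BPE/WordPiece training.

def greedy_tokenize(word, vocab):
    """Repeatedly take the longest vocabulary item that prefixes the remaining text."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character when nothing in the vocabulary matches.
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"un", "lock", "able", "unlock"}
print(greedy_tokenize("unlockable", vocab))  # ['unlock', 'able']
```

A different inference method over the same vocabulary (for example, applying BPE merges in training order) can yield a different segmentation, which is the kind of mismatch this line of work examines.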
no code implementations • 28 Feb 2024 • Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner
Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models.
1 code implementation • 14 Feb 2024 • Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren
This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs.
2 code implementations • 20 Dec 2023 • Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren
Specifically, we start off with HPC as a domain and build an HPC-specific LM, named MonoCoder, that is orders of magnitude smaller than existing LMs but delivers similar, if not better, performance on non-HPC and HPC tasks.
no code implementations • 19 Dec 2023 • Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Yuval Pinter, Rahul Gupta
Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgendering.
2 code implementations • arXiv 2023 • Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages.
Ranked #1 on Named Entity Recognition (NER) on UNER v1 (Danish)
no code implementations • 20 Oct 2023 • Lisa Beinborn, Yuval Pinter
Subword tokenization has become the de facto standard, although comparative evaluations of subword vocabulary quality across languages are scarce.
no code implementations • 18 Oct 2023 • Yuval Pinter, Michael Elhadad
We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations.
2 code implementations • 18 Aug 2023 • Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren
With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks.
2 code implementations • 16 May 2023 • Tal Kadosh, Nadav Schneider, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren
Specifically, we propose a novel approach, called OMPify, to detect and predict the OpenMP pragmas and shared-memory attributes in parallel code, given its serial version.
2 code implementations • 16 May 2023 • Nadav Schneider, Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren
Message Passing Interface (MPI) plays a crucial role in distributed memory parallelization across multiple nodes.
1 code implementation • 13 Oct 2022 • Shaked Yehezkel, Yuval Pinter
Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context.
no code implementations • 2 Aug 2022 • Cassandra L. Jacobs, Yuval Pinter
We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one.
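As a hypothetical illustration (the segmentation and markers below are invented for this example), the same split can carry its special mark on the word-initial piece, as in SentencePiece-style '▁', or on the word-final piece, as in '</w>'-style vocabularies:

```python
# Toy illustration of the two marking conventions for subword boundaries.
# '▁' marks word-initial pieces; '</w>' marks word-final pieces. The
# segmentation itself is invented for the example.

subwords = [["token", "ization"], ["matters"]]

initial_marked = [
    ("▁" if i == 0 else "") + piece
    for word in subwords
    for i, piece in enumerate(word)
]
final_marked = [
    piece + ("</w>" if i == len(word) - 1 else "")
    for word in subwords
    for i, piece in enumerate(word)
]

print(initial_marked)  # ['▁token', 'ization', '▁matters']
print(final_marked)    # ['token', 'ization</w>', 'matters</w>']
```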
no code implementations • LREC 2022 • Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova
The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.
2 code implementations • 27 Apr 2022 • Re'em Harel, Yuval Pinter, Gal Oren
As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes to software applications.
no code implementations • 10 Sep 2021 • Yuval Pinter
The problem of representing the atomic elements of language in modern neural learning systems is one of the central challenges of the field of natural language processing.
no code implementations • 1 Aug 2021 • Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein
Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch.
1 code implementation • Findings (NAACL) 2022 • Elazar Gershuni, Yuval Pinter
We demonstrate that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text.
1 code implementation • SCiL 2021 • Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein
Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data.
no code implementations • LREC 2020 • Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, David Yarowsky
The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.
2 code implementations • ACL 2020 • Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, Byron C. Wallace
In NLP this often entails extracting snippets of an input text 'responsible for' corresponding model output; when such a snippet comprises tokens that indeed informed the model's prediction, it is a faithful explanation.
1 code implementation • COLING 2020 • Yuval Pinter, Cassandra L. Jacobs, Max Bittker
We present baseline results for both uncontextual and contextual prediction of novelty class, showing that there is room for improvement even for state-of-the-art NLP systems.
no code implementations • 14 Dec 2019 • Nicolas Garneau, Jean-Samuel Leboeuf, Yuval Pinter, Luc Lamontagne
We propose a new contextual-compositional neural network layer that handles out-of-vocabulary (OOV) words in natural language processing (NLP) tagging tasks.
2 code implementations • IJCNLP 2019 • Sarah Wiegreffe, Yuval Pinter
We show that even when reliable adversarial distributions can be found, they don't perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.
1 code implementation • WS 2019 • Yuval Pinter, Marc Marone, Jacob Eisenstein
Character-level models have been used extensively in recent years in NLP tasks as both supplements and replacements for closed-vocabulary token-level word representations.
1 code implementation • EMNLP 2018 • Yuval Pinter, Jacob Eisenstein
Semantic graphs, such as WordNet, are resources which curate natural language on two distinguishable layers.
Ranked #14 on Link Prediction on WN18RR
no code implementations • NAACL 2018 • Ian Stewart, Yuval Pinter, Jacob Eisenstein
We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation.
1 code implementation • 13 Apr 2018 • Ian Stewart, Yuval Pinter, Jacob Eisenstein
We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation.
2 code implementations • EMNLP 2017 • Yuval Pinter, Robert Guthrie, Jacob Eisenstein
In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings.
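A minimal sketch of the general idea, with an assumed architecture and hyperparameters rather than the authors' exact model, is a character-level encoder trained to regress onto pretrained word embeddings, so that an unseen spelling can still be mapped to a vector:

```python
# Sketch of a MIMICK-style model: a character-level encoder trained to map a
# word's spelling to its pretrained embedding, so OOV words can be embedded
# from spelling alone. Architecture and sizes here are assumptions.

import torch
import torch.nn as nn

class SpellingToEmbedding(nn.Module):
    def __init__(self, n_chars, char_dim=32, hidden=64, emb_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, char_ids):               # char_ids: (batch, max_word_len)
        x = self.char_emb(char_ids)
        _, (h, _) = self.encoder(x)             # final hidden state of each direction
        return self.proj(torch.cat([h[0], h[1]], dim=-1))

# Training sketch: minimize the distance to the pretrained embedding of each
# in-vocabulary word; at test time, run the model on any OOV spelling.
model = SpellingToEmbedding(n_chars=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
char_ids = torch.randint(0, 100, (8, 12))       # toy batch of 8 spellings
targets = torch.randn(8, 100)                   # their pretrained embeddings
loss = nn.functional.mse_loss(model(char_ids), targets)
loss.backward()
optimizer.step()
```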
no code implementations • 10 May 2016 • Yuval Pinter, Roi Reichart, Idan Szpektor
A description and annotation guidelines for the Yahoo Webscope release of Query Treebank, Version 1.0, May 2016.