1 code implementation • NAACL 2022 • Ramit Sawhney, Ritesh Soun, Shrey Pandit, Megh Thakkar, Sarvagya Malaviya, Yuval Pinter
CIAug outperforms existing interpolative augmentation methods, achieving state-of-the-art results on 10 benchmark datasets across 4 languages on text classification and named entity recognition tasks.
1 code implementation • 20 Apr 2024 • Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella
Our empirical findings show that UniMorph Labeller reaches 98% accuracy, and that in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalization than morphological tokenization on the semantic compositionality of word meanings.
no code implementations • 30 Mar 2024 • Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.
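As a rough sketch of the idea (not the paper's implementation; the merge table, counts, and threshold below are invented), trimming removes rare merged subwords from the vocabulary and re-expresses them through the component pair they were built from:

```python
# Minimal sketch of threshold vocabulary trimming over a BPE vocabulary.
# `merges` maps each merged subword to the pair it was built from; `counts`
# holds corpus frequencies. Both are toy, hypothetical inputs.

def trim_vocabulary(merges, counts, threshold):
    """Return the subwords kept after removing rare merged tokens."""
    kept = set(counts)
    for token, freq in counts.items():
        # Only merged tokens can be trimmed; their components remain available.
        if token in merges and freq < threshold:
            kept.discard(token)
    return kept

def resegment(token, merges, kept):
    """Recursively replace a trimmed subword with its component subwords."""
    if token in kept or token not in merges:
        return [token]
    left, right = merges[token]
    return resegment(left, merges, kept) + resegment(right, merges, kept)

# Toy example: "ization" is rare, so it decomposes back into "iz" + "ation".
merges = {"ization": ("iz", "ation")}
counts = {"token": 500, "iz": 120, "ation": 300, "ization": 3}
kept = trim_vocabulary(merges, counts, threshold=10)
print(resegment("ization", merges, kept))  # ['iz', 'ation']
```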
no code implementations • 6 Mar 2024 • Carinne Cherf, Yuval Pinter
Evaluation of this task is crucial to determine the quality of the translation.
1 code implementation • 2 Mar 2024 • Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter
While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the way in which they were constructed.
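As one concrete example of such a decoding method, greedy longest-prefix matching segments text against a fixed vocabulary; the sketch below uses a toy vocabulary and illustrates the general idea only, not any specific tokenizer's implementation:

```python
# Sketch of greedy longest-prefix tokenization against a fixed subword vocabulary.
# The vocabulary is a toy example; real vocabularies come from BPE/WordPiece training.

def greedy_tokenize(word, vocab):
    """Repeatedly take the longest vocabulary item that prefixes the remaining text."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character when nothing in the vocabulary matches.
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"un", "lock", "able", "unlock"}
print(greedy_tokenize("unlockable", vocab))  # ['unlock', 'able']
```

A different inference method over the same vocabulary (for example, applying BPE merges in training order) can yield a different segmentation, which is the kind of mismatch this line of work examines.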
no code implementations • 28 Feb 2024 • Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner
Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models.
1 code implementation • 14 Feb 2024 • Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren
This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs.
2 code implementations • 20 Dec 2023 • Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren
Specifically, we start off with HPC as a domain and build an HPC-specific LM, named MonoCoder, that is orders of magnitude smaller than existing LMs but delivers similar, if not better, performance on non-HPC and HPC tasks.
no code implementations • 19 Dec 2023 • Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Yuval Pinter, Rahul Gupta
Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgendering.
2 code implementations • arXiv 2023 • Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages.
Ranked #1 on Named Entity Recognition (NER) on UNER v1 (Danish)
no code implementations • 20 Oct 2023 • Lisa Beinborn, Yuval Pinter
Subword tokenization has become the de facto standard, although comparative evaluations of subword vocabulary quality across languages are scarce.
no code implementations • 18 Oct 2023 • Yuval Pinter, Michael Elhadad
We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations.
2 code implementations • 18 Aug 2023 • Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren
With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks.
2 code implementations • 16 May 2023 • Tal Kadosh, Nadav Schneider, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren
Specifically, we propose a novel approach, called OMPify, to detect and predict the OpenMP pragmas and shared-memory attributes in parallel code, given its serial version.
2 code implementations • 16 May 2023 • Nadav Schneider, Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren
Message Passing Interface (MPI) plays a crucial role in distributed memory parallelization across multiple nodes.
1 code implementation • 13 Oct 2022 • Shaked Yehezkel, Yuval Pinter
Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context.
no code implementations • 2 Aug 2022 • Cassandra L. Jacobs, Yuval Pinter
We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one.
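As a hypothetical illustration (the segmentation and markers below are invented for this example), the same split can carry its special mark on the word-initial piece, as in SentencePiece-style '▁', or on the word-final piece, as in '</w>'-style vocabularies:

```python
# Toy illustration of the two marking conventions for subword boundaries.
# '▁' marks word-initial pieces; '</w>' marks word-final pieces. The
# segmentation itself is invented for the example.

subwords = [["token", "ization"], ["matters"]]

initial_marked = [
    ("▁" if i == 0 else "") + piece
    for word in subwords
    for i, piece in enumerate(word)
]
final_marked = [
    piece + ("</w>" if i == len(word) - 1 else "")
    for word in subwords
    for i, piece in enumerate(word)
]

print(initial_marked)  # ['▁token', 'ization', '▁matters']
print(final_marked)    # ['token', 'ization</w>', 'matters</w>']
```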
no code implementations • LREC 2022 • Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova
The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.
2 code implementations • 27 Apr 2022 • Re'em Harel, Yuval Pinter, Gal Oren
As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes to software applications.
no code implementations • 10 Sep 2021 • Yuval Pinter
The problem of representing the atomic elements of language in modern neural learning systems is one of the central challenges of the field of natural language processing.
no code implementations • 1 Aug 2021 • Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein
Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch.
1 code implementation • Findings (NAACL) 2022 • Elazar Gershuni, Yuval Pinter
We demonstrate that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text.
1 code implementation • SCiL 2021 • Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein
Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data.
no code implementations • LREC 2020 • Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, David Yarowsky
The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema.
2 code implementations • ACL 2020 • Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, Byron C. Wallace
In NLP this often entails extracting snippets of an input text 'responsible for' corresponding model output; when such a snippet comprises tokens that indeed informed the model's prediction, it is a faithful explanation.
1 code implementation • COLING 2020 • Yuval Pinter, Cassandra L. Jacobs, Max Bittker
We present baseline results for both uncontextual and contextual prediction of novelty class, showing that there is room for improvement even for state-of-the-art NLP systems.
no code implementations • 14 Dec 2019 • Nicolas Garneau, Jean-Samuel Leboeuf, Yuval Pinter, Luc Lamontagne
We propose a new contextual-compositional neural network layer that handles out-of-vocabulary (OOV) words in natural language processing (NLP) tagging tasks.
2 code implementations • IJCNLP 2019 • Sarah Wiegreffe, Yuval Pinter
We show that even when reliable adversarial distributions can be found, they don't perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.
1 code implementation • WS 2019 • Yuval Pinter, Marc Marone, Jacob Eisenstein
Character-level models have been used extensively in recent years in NLP tasks as both supplements and replacements for closed-vocabulary token-level word representations.
1 code implementation • EMNLP 2018 • Yuval Pinter, Jacob Eisenstein
Semantic graphs, such as WordNet, are resources which curate natural language on two distinguishable layers.
Ranked #14 on Link Prediction on WN18RR
no code implementations • NAACL 2018 • Ian Stewart, Yuval Pinter, Jacob Eisenstein
We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation.
1 code implementation • 13 Apr 2018 • Ian Stewart, Yuval Pinter, Jacob Eisenstein
We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation.
2 code implementations • EMNLP 2017 • Yuval Pinter, Robert Guthrie, Jacob Eisenstein
In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings.
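A minimal sketch of the general idea, with an assumed architecture and hyperparameters rather than the authors' exact model, is a character-level encoder trained to regress onto pretrained word embeddings, so that an unseen spelling can still be mapped to a vector:

```python
# Sketch of a MIMICK-style model: a character-level encoder trained to map a
# word's spelling to its pretrained embedding, so OOV words can be embedded
# from spelling alone. Architecture and sizes here are assumptions.

import torch
import torch.nn as nn

class SpellingToEmbedding(nn.Module):
    def __init__(self, n_chars, char_dim=32, hidden=64, emb_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, char_ids):               # char_ids: (batch, max_word_len)
        x = self.char_emb(char_ids)
        _, (h, _) = self.encoder(x)             # final hidden state of each direction
        return self.proj(torch.cat([h[0], h[1]], dim=-1))

# Training sketch: minimize the distance to the pretrained embedding of each
# in-vocabulary word; at test time, run the model on any OOV spelling.
model = SpellingToEmbedding(n_chars=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
char_ids = torch.randint(0, 100, (8, 12))       # toy batch of 8 spellings
targets = torch.randn(8, 100)                   # their pretrained embeddings
loss = nn.functional.mse_loss(model(char_ids), targets)
loss.backward()
optimizer.step()
```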
no code implementations • 10 May 2016 • Yuval Pinter, Roi Reichart, Idan Szpektor
A description and annotation guidelines for the Yahoo Webscope release of Query Treebank, Version 1.0, May 2016.