SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting the byte-pair-encoding (BPE) algorithm and the unigram language model, and converts the segmented text into an id sequence, guaranteeing perfect reproducibility of the normalization and subword segmentation.
Source: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

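
The pipeline described above (subword segmentation, conversion to an id sequence, and lossless detokenization) can be illustrated with a toy BPE sketch. This is not SentencePiece's implementation or its API: the function names (`to_symbols`, `learn_bpe`, `apply_merge`, `encode`, `decode`, `build_vocab`) and the two-sentence corpus are hypothetical, and normalization is omitted. The one detail borrowed from SentencePiece is the use of the meta symbol "▁" to escape whitespace, which is what lets the pieces be joined back into the original text.

```python
from collections import Counter

WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def to_symbols(text):
    """Escape spaces with the meta symbol, then split into characters."""
    return list(WS + text.replace(" ", WS))


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly merging the most frequent pair."""
    seqs = [to_symbols(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Break frequency ties by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def encode(text, merges):
    """Segment `text` into subword pieces by applying the merges in learned order."""
    seq = to_symbols(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


def decode(pieces):
    """Join pieces, restore spaces, and drop the leading escaped space."""
    return "".join(pieces).replace(WS, " ")[1:]


def build_vocab(corpus, merges):
    """Map every base character and merged piece to an integer id."""
    chars = sorted({ch for s in corpus for ch in to_symbols(s)})
    merged = [a + b for a, b in merges]
    return {piece: i for i, piece in enumerate(dict.fromkeys(chars + merged))}


# Hypothetical toy corpus, for illustration only.
corpus = ["low lower lowest", "new newer newest"]
merges = learn_bpe(corpus, num_merges=10)
vocab = build_vocab(corpus, merges)

pieces = encode("lower newest", merges)          # subword pieces
ids = [vocab[p] for p in pieces]                 # id sequence
id_to_piece = {i: p for p, i in vocab.items()}

# Detokenization recovers the input exactly.
assert decode([id_to_piece[i] for i in ids]) == "lower newest"
```

Because this toy vocabulary contains only characters seen in the corpus, encoding text with an unseen character raises a `KeyError` at the id lookup; a real tokenizer needs an unknown-token or byte-fallback strategy for that case.
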
Task | Papers | Share |
---|---|---|
Language Modelling | 98 | 9.82% |
Question Answering | 59 | 5.91% |
Sentence | 48 | 4.81% |
Text Generation | 40 | 4.01% |
Translation | 31 | 3.11% |
Retrieval | 30 | 3.01% |
Machine Translation | 27 | 2.71% |
Natural Language Understanding | 21 | 2.10% |
Sentiment Analysis | 18 | 1.80% |