Subword Segmentation

GBST (Gradient-Based Subword Tokenization) is a soft, gradient-based subword tokenization module that learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them position-wise with a block scoring network.

The scores induce a position-wise soft selection over the candidate subword blocks, yielding latent subword representations. In contrast to prior tokenization-free methods, the latent subwords GBST learns are interpretable, which makes its lexical representations easy to inspect, and the module is more efficient than other byte-based models.

Source: Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
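The following is a minimal PyTorch sketch of this idea, assuming mean pooling for block formation and a single linear layer as the block scoring network; the class and parameter names (GBSTSketch, max_block, downsample) are illustrative and do not reflect the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBSTSketch(nn.Module):
    """Sketch of Gradient-Based Subword Tokenization (GBST).

    For each position, candidate subword blocks of sizes 1..max_block
    are formed by mean-pooling character embeddings; a linear block
    scoring network scores each candidate, and a softmax over block
    sizes gives a position-wise soft selection.
    """

    def __init__(self, dim: int, max_block: int = 4, downsample: int = 2):
        super().__init__()
        self.max_block = max_block
        self.downsample = downsample
        self.score = nn.Linear(dim, 1)  # block scoring network

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        # chars: (batch, seq_len, dim) character embeddings
        B, L, D = chars.shape
        candidates = []
        for size in range(1, self.max_block + 1):
            # Pool non-overlapping blocks of `size` characters, then
            # broadcast each block's representation back over its
            # positions so every position has one candidate per size.
            pad = (size - L % size) % size
            x = F.pad(chars, (0, 0, 0, pad))
            blocks = x.view(B, -1, size, D).mean(dim=2)        # (B, L/size, D)
            upsampled = blocks.repeat_interleave(size, dim=1)  # (B, L+pad, D)
            candidates.append(upsampled[:, :L])
        cand = torch.stack(candidates, dim=2)       # (B, L, M, D)
        weights = self.score(cand).softmax(dim=2)   # (B, L, M, 1)
        mixed = (weights * cand).sum(dim=2)         # (B, L, D)
        # Downsample the latent subword sequence by mean pooling.
        pad = (self.downsample - L % self.downsample) % self.downsample
        mixed = F.pad(mixed, (0, 0, 0, pad))
        return mixed.view(B, -1, self.downsample, D).mean(dim=2)

# Usage: a length-100 character sequence is reduced to 50 latent subwords.
gbst = GBSTSketch(dim=64)
out = gbst(torch.randn(2, 100, 64))  # -> (2, 50, 64)
```

The final downsampling step is what makes the module cheaper than running a transformer directly over characters: the downstream model sees a sequence shorter by the downsampling factor, while the soft selection keeps the whole pipeline differentiable.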

Tasks


Task Papers Share
Decoder 2 15.38%
NMT 2 15.38%
Denoising 1 7.69%
Image Denoising 1 7.69%
Translation 1 7.69%
Toxic Comment Classification 1 7.69%
Linguistic Acceptability 1 7.69%
Natural Language Inference 1 7.69%
Paraphrase Identification 1 7.69%
