Subword Segmentation

GBST (Gradient-Based Subword Tokenization) is a soft, gradient-based subword tokenization module that learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them position-wise with a block scoring network.

The scores induce a position-wise soft selection over the candidate subword blocks, yielding latent subword representations. In contrast to prior tokenization-free methods, the latent subwords GBST learns are interpretable, which makes its lexical representations easy to inspect, and the module is more efficient than other byte-based models.

Source: Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
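The following is a minimal PyTorch sketch of this idea, assuming mean pooling for block formation and a single linear layer as the block scoring network; the class and parameter names (GBSTSketch, max_block, downsample) are illustrative and do not reflect the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBSTSketch(nn.Module):
    """Sketch of Gradient-Based Subword Tokenization (GBST).

    For each position, candidate subword blocks of sizes 1..max_block
    are formed by mean-pooling character embeddings; a linear block
    scoring network scores each candidate, and a softmax over block
    sizes gives a position-wise soft selection.
    """

    def __init__(self, dim: int, max_block: int = 4, downsample: int = 2):
        super().__init__()
        self.max_block = max_block
        self.downsample = downsample
        self.score = nn.Linear(dim, 1)  # block scoring network

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        # chars: (batch, seq_len, dim) character embeddings
        B, L, D = chars.shape
        candidates = []
        for size in range(1, self.max_block + 1):
            # Pool non-overlapping blocks of `size` characters, then
            # broadcast each block's representation back over its
            # positions so every position has one candidate per size.
            pad = (size - L % size) % size
            x = F.pad(chars, (0, 0, 0, pad))
            blocks = x.view(B, -1, size, D).mean(dim=2)        # (B, L/size, D)
            upsampled = blocks.repeat_interleave(size, dim=1)  # (B, L+pad, D)
            candidates.append(upsampled[:, :L])
        cand = torch.stack(candidates, dim=2)       # (B, L, M, D)
        weights = self.score(cand).softmax(dim=2)   # (B, L, M, 1)
        mixed = (weights * cand).sum(dim=2)         # (B, L, D)
        # Downsample the latent subword sequence by mean pooling.
        pad = (self.downsample - L % self.downsample) % self.downsample
        mixed = F.pad(mixed, (0, 0, 0, pad))
        return mixed.view(B, -1, self.downsample, D).mean(dim=2)

# Usage: a length-100 character sequence is reduced to 50 latent subwords.
gbst = GBSTSketch(dim=64)
out = gbst(torch.randn(2, 100, 64))  # -> (2, 50, 64)
```

The final downsampling step is what makes the module cheaper than running a transformer directly over characters: the downstream model sees a sequence shorter by the downsampling factor, while the soft selection keeps the whole pipeline differentiable.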

Tasks


Task Papers Share
Decoder 2 15.38%
NMT 2 15.38%
Denoising 1 7.69%
Image Denoising 1 7.69%
Translation 1 7.69%
Toxic Comment Classification 1 7.69%
Linguistic Acceptability 1 7.69%
Natural Language Inference 1 7.69%
Paraphrase Identification 1 7.69%
