DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
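The triple loss mentioned in the abstract is a weighted sum of a soft-target distillation term, the usual masked language modeling term, and a cosine term that aligns the student's hidden states with the teacher's. The snippet below is a minimal PyTorch sketch of that combination, assuming the student and teacher MLM logits and last hidden states are already computed; the function name, temperature, and loss weights are illustrative placeholders and are not taken from the authors' released training code.

```python
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                labels, temperature=2.0, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    """Sketch of a DistilBERT-style triple loss (weights/temperature are illustrative).

    student_logits, teacher_logits: (batch, seq_len, vocab) MLM logits
    student_hidden, teacher_hidden: (batch, seq_len, dim) last hidden states
    labels: (batch, seq_len) MLM targets, -100 at non-masked positions
    """
    # 1) Distillation loss: KL divergence between temperature-softened distributions.
    t = temperature
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t ** 2)

    # 2) Masked language modeling loss on the student's own predictions.
    loss_mlm = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

    # 3) Cosine-distance loss pulling student hidden states toward the teacher's.
    flat_student = student_hidden.reshape(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    loss_cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    return w_ce * loss_ce + w_mlm * loss_mlm + w_cos * loss_cos
```

In practice the three terms are not weighted equally; the 1.0 defaults above are placeholders for whatever weighting a given training setup uses.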


Results from the Paper


Ranked #2 on Only Connect Walls Dataset Task 1 (Grouping) on OCW (Wasserstein Distance (WD) metric, using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | DistilBERT 66M | Accuracy | 49.1% | #38 |
| Sentiment Analysis | IMDb | DistilBERT 66M | Accuracy | 92.82 | #29 |
| Semantic Textual Similarity | MRPC | DistilBERT 66M | Accuracy | 90.2% | #14 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | # Correct Groups | 49 ± 4 | #19 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Fowlkes Mallows Score (FMS) | 29.1 ± .2 | #18 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Adjusted Rand Index (ARI) | 11.3 ± .3 | #18 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Adjusted Mutual Information (AMI) | 14.0 ± .3 | #18 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | # Solved Walls | 0 ± 0 | #10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Wasserstein Distance (WD) | 86.7 ± .6 | #2 |
| Natural Language Inference | QNLI | DistilBERT 66M | Accuracy | 90.2% | #36 |
| Question Answering | Quora Question Pairs | DistilBERT 66M | Accuracy | 89.2% | #13 |
| Natural Language Inference | RTE | DistilBERT 66M | Accuracy | 62.9% | #68 |
| Question Answering | SQuAD1.1 dev | DistilBERT | EM | 77.7 | #20 |
| Question Answering | SQuAD1.1 dev | DistilBERT 66M | F1 | 85.8 | #22 |
| Sentiment Analysis | SST-2 Binary classification | DistilBERT 66M | Accuracy | 91.3 | #54 |
| Semantic Textual Similarity | STS Benchmark | DistilBERT 66M | Pearson Correlation | 0.907 | #16 |
| Natural Language Inference | WNLI | DistilBERT 66M | Accuracy | 44.4 | #23 |

Methods