DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
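The triple loss mentioned in the abstract is a weighted sum of a soft-target distillation term, the usual masked language modeling term, and a cosine term that aligns the student's hidden states with the teacher's. The snippet below is a minimal PyTorch sketch of that combination, assuming the student and teacher MLM logits and last hidden states are already computed; the function name, temperature, and loss weights are illustrative placeholders and are not taken from the authors' released training code.

```python
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                labels, temperature=2.0, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    """Sketch of a DistilBERT-style triple loss (weights/temperature are illustrative).

    student_logits, teacher_logits: (batch, seq_len, vocab) MLM logits
    student_hidden, teacher_hidden: (batch, seq_len, dim) last hidden states
    labels: (batch, seq_len) MLM targets, -100 at non-masked positions
    """
    # 1) Distillation loss: KL divergence between temperature-softened distributions.
    t = temperature
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t ** 2)

    # 2) Masked language modeling loss on the student's own predictions.
    loss_mlm = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

    # 3) Cosine-distance loss pulling student hidden states toward the teacher's.
    flat_student = student_hidden.reshape(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    loss_cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    return w_ce * loss_ce + w_mlm * loss_mlm + w_cos * loss_cos
```

In practice the three terms are not weighted equally; the 1.0 defaults above are placeholders for whatever weighting a given training setup uses.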


Results from the Paper


Ranked #2 on Only Connect Walls Dataset Task 1 (Grouping) on OCW (Wasserstein Distance (WD) metric, using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | DistilBERT 66M | Accuracy | 49.1% | #38 |
| Sentiment Analysis | IMDb | DistilBERT 66M | Accuracy | 92.82 | #29 |
| Semantic Textual Similarity | MRPC | DistilBERT 66M | Accuracy | 90.2% | #14 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | # Correct Groups | 49 ± 4 | #19 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Fowlkes Mallows Score (FMS) | 29.1 ± .2 | #18 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Adjusted Rand Index (ARI) | 11.3 ± .3 | #18 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Adjusted Mutual Information (AMI) | 14.0 ± .3 | #18 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | # Solved Walls | 0 ± 0 | #10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Wasserstein Distance (WD) | 86.7 ± .6 | #2 |
| Natural Language Inference | QNLI | DistilBERT 66M | Accuracy | 90.2% | #36 |
| Question Answering | Quora Question Pairs | DistilBERT 66M | Accuracy | 89.2% | #13 |
| Natural Language Inference | RTE | DistilBERT 66M | Accuracy | 62.9% | #68 |
| Question Answering | SQuAD1.1 dev | DistilBERT | EM | 77.7 | #20 |
| Question Answering | SQuAD1.1 dev | DistilBERT 66M | F1 | 85.8 | #22 |
| Sentiment Analysis | SST-2 Binary classification | DistilBERT 66M | Accuracy | 91.3 | #54 |
| Semantic Textual Similarity | STS Benchmark | DistilBERT 66M | Pearson Correlation | 0.907 | #16 |
| Natural Language Inference | WNLI | DistilBERT 66M | Accuracy | 44.4 | #23 |

Methods