Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT
Fine-tuned variants of BERT achieve state-of-the-art accuracy on many natural language processing tasks, though at significant computational cost. In this paper, we verify BERT's effectiveness for document classification and investigate the extent to which BERT-level effectiveness can be obtained by simpler baselines, combined with knowledge distillation, a popular model compression method. The results show that BERT-level effectiveness can be achieved by a single-layer LSTM with at least $40\times$ fewer FLOPs and only ${\sim}3\%$ of the parameters. More importantly, this study analyzes the limits of knowledge distillation as we distill BERT's knowledge all the way down to linear models, a relevant baseline for the task. We report substantial improvements in effectiveness for even the simplest models, as they capture the knowledge learned by BERT.
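As a rough illustration of the compression method named above (not the paper's exact objective, whose details are in the full text), knowledge distillation typically trains the student to match the teacher's temperature-softened output distribution. The sketch below, with an assumed temperature of 2.0, computes such a distillation loss from raw logits:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher distribution (soft
    targets) and the softened student distribution, scaled by T^2 so
    gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return temperature ** 2 * kl

# A student whose logits match the teacher's incurs zero loss;
# any mismatch yields a positive penalty.
teacher = np.array([1.0, 2.0, 3.0])
student = np.array([3.0, 2.0, 1.0])
loss = distillation_loss(teacher, student)
```

Higher temperatures flatten the teacher's distribution, exposing the relative probabilities it assigns to incorrect classes, which is precisely the "dark knowledge" that lets a small student (here, a single-layer LSTM or a linear model) benefit from the teacher.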