
David over Goliath: towards smaller models for cheaper, faster, and greener NLP

By Manuel Tonneau
March 27, 2020

Is bigger always better? For some time, driven by the will to dominate performance leaderboards, the answer of big NLP players to this question seems to have been yes. From the release of BERT by Google in October 2018 (0.11 billion parameters in its base version), to that of CTRL (1.63 billion parameters) by Salesforce in September 2019, and on to the recent release of T-NLG (17 billion parameters) by Microsoft in February 2020, the growth in language model size appeared unstoppable.

Yet, an opposite trend pushing for smaller models has also been observed in the NLP community, led by smaller players such as Hugging Face with the release of DistilBERT (0.066 billion parameters) in October 2019 but also bigger players such as Google with ALBERT (0.025 billion in its base version), released in September 2019. Now, what is the motivation behind this trend favouring David over Goliath?

The first major answer is cost. Training these monsters can cost up to tens of thousands of dollars, which big tech companies can afford but smaller companies cannot. At a time of rising climate change awareness, the environmental cost is also non-negligible. The second, no less relevant, answer is speed. Who wants to wait several seconds when searching on Google? Increasing the model size may lead to better performance but it also tends to slow the model down, to the great displeasure of users.

In this blog post, we discuss this new trend favoring smaller models and present in detail three of the latter, namely DistilBERT from Hugging Face, PD-BERT from Google, and BERT-of-Theseus from Microsoft Asia.

Squeezing models is reasonable: the lottery ticket hypothesis

The main narrative behind the growth in the number of parameters for language models is that more parameters will contain more information which will lead to better model performance. In this regard, some could worry that squeezing models by arbitrarily taking away parameters from trained models could seriously harm their performance. In a nutshell: are some parameters more important than others for prediction?

In a few words, the answer is yes and has been conceptualized as the “lottery ticket hypothesis” in an interesting paper from 2018 by Jonathan Frankle and Michael Carbin from MIT.

The idea is the following: when a neural network is initialized, that is when values are assigned to the network’s parameters before training, these values are drawn randomly. In this big lottery, some parts of the network get luckier than others with their assigned values. The subnetworks with lucky tickets have initialization values that allow them to train more effectively, that is to reach parameter values that yield good results in less time than the rest of the network. The paper shows that these subnetworks can even be trained separately from the rest of the network, with the same initialization values, and still achieve a similar accuracy while having far fewer parameters.

How to squeeze models in practice?

Now that we know that compressing models makes sense, how can we do it in practice? It turns out there are three common methods to do this.

The first one is called quantization and aims at reducing the precision of the parameters. To give an example, when asked what the temperature is in Berlin, you could say 10 degrees Celsius even though this is not perfectly accurate: we round the real value for ease of use. In the case of neural networks, storing each parameter with fewer digits (fewer bits) helps lower computation and memory needs.
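To make the rounding idea concrete, here is a minimal sketch of one simple quantization scheme, assuming a single symmetric scale per weight matrix and 8-bit integers. The function names and the toy weight matrix are hypothetical and not taken from any of the models discussed in this post.

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights to int8 using one symmetric scale for the whole matrix."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from their int8 representation."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one layer of a model (illustrative values only).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(4, 4)).astype(np.float32)

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = float(np.max(np.abs(weights - restored)))

print("bytes before:", weights.nbytes, "bytes after:", q_weights.nbytes)
print("largest rounding error:", max_error)
```

The int8 copy takes a quarter of the memory of the float32 original, at the price of a rounding error of at most half a quantization step per weight.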

The second method is called pruning and aims at removing part of the model. It can for instance be applied to remove connections or neurons.
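As an illustration, the sketch below removes connections by magnitude, which is one common pruning criterion; the criterion, the function name and the toy matrix are assumptions made for this example rather than a method described in this post.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest magnitude is the cut-off: everything at or below it is removed.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Toy weight matrix (illustrative values only).
rng = np.random.default_rng(0)
weights = rng.normal(size=(10, 10))

pruned, mask = magnitude_prune(weights, sparsity=0.5)
print("connections kept:", int(mask.sum()), "out of", weights.size)
```

The returned mask records which connections survive, so it can be reapplied after further training to keep the removed connections at zero.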

The third method is called knowledge distillation and consists of using an existing pre-trained model (the teacher) to train a smaller model (the student) to mimic the teacher. In this case, the teacher itself is not compressed, but the goal, which is to obtain a smaller and faster model, is the same. For more information on these methods, we refer the readers to this very thorough blog post on the subject by Rasa.

Here, we choose to focus on three recent models to give an example of how squeezing models translates in practice and the results one can hope for when using these methods.

Knowledge distillation at work: DistilBERT

In recent months, the third method listed, knowledge distillation, has received a lot of attention, especially after Hugging Face published the DistilBERT model in October 2019. The authors apply knowledge distillation to the famous BERT model developed by Google. DistilBERT’s architecture is similar to that of the original BERT base, which is a stack of 12 Transformer encoder layers, but DistilBERT keeps only 6 of them and has 40% fewer parameters.

The original BERT model was pre-trained on two tasks: masked language modelling (MLM), that is removing a word in a sentence and asking the model to predict it, and next sentence prediction which, for two sentences A and B, boils down to predicting whether sentence B comes after sentence A or not.

Regarding MLM, we force the model to assign probabilities to words that are close to the true value \(y = ( y_1, ..., y_V ) \), \( 1 \) for the word that was actually masked and \( 0 \) for all of the other ones. In practice, we minimize the following MLM loss function:

\( - \sum_{w=1}^V y_{m,w} \log (p_{m,w})\)

with \(y_{m,w} \) equal to \( 1 \) if masked word \(m\) is the \(w\)-th word from the vocabulary list and \(0\) otherwise; \(p_{m,w}\) is the predicted probability that masked word \(m\) is the \(w\)-th word from the vocabulary list. \(V\) is the vocabulary size or, in other words, the number of possible words. The objective is to minimize this loss function, pushing \(p_{m,w}\) towards \( 1 \) when the masked word is the \(w\)-th word from the vocabulary and otherwise pushing it towards \( 0 \).
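The MLM loss for a single masked position can be computed in a few lines. This is a minimal sketch with a hypothetical four-word vocabulary and made-up probabilities, only meant to show how the formula behaves.

```python
import numpy as np

def mlm_loss(probs, true_index):
    """Cross-entropy for one masked word: -sum_w y_w * log(p_w), with y one-hot."""
    y = np.zeros_like(probs)
    y[true_index] = 1.0
    return float(-np.sum(y * np.log(probs)))

# Hypothetical vocabulary; the masked word is "great" (index 0).
vocab = ["great", "fun", "hard", "boring"]
confident = np.array([0.7, 0.1, 0.1, 0.1])    # most probability on the true word
unsure = np.array([0.25, 0.25, 0.25, 0.25])   # probability spread evenly

loss_confident = mlm_loss(confident, true_index=0)
loss_unsure = mlm_loss(unsure, true_index=0)
print("loss when confident:", loss_confident)
print("loss when unsure:", loss_unsure)
```

Because \(y\) is one-hot, only the term for the true word survives, so the loss reduces to minus the log of the probability assigned to that word and shrinks as this probability approaches \( 1 \).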

If we were only sticking to the MLM loss, we would basically only train a smaller student architecture but there would not be any learning between the teacher and the student. Therefore, in addition to the MLM loss, the authors add a distillation loss that has this purpose. When masking a word, it forces the student to mimic the output probability distribution of the teacher model. To give you an example of what we are talking about, when the teacher model is trained, we can then ask it to predict which word was masked, as shown below:

Top 5 guesses of BERT base for the masked token. Made using the Pipeline tool from Hugging Face. Here, ‘score’ equals probability.

Above, we present, in descending order of probability, the five words that best describe what it is like to work at Creatext according to the teacher model BERT base. Though we present just the five most probable ones, every word in the vocabulary is assigned a probability. This output probability distribution represents the knowledge the teacher model has learned during training.

We transmit this knowledge from the teacher to the student by using the following distillation loss:

\( - \sum_{w=1}^V p^{T}_{m,w} \log (p^S_{m,w})\)

with \(p^T_{m,w}\) the probability assigned by the teacher model \(T\) that masked word \(m\) is the \(w\)-th word from the vocabulary and \(p^S_{m,w}\) the equivalent for student model \(S\) . Notice that here, instead of forcing the probability assigned to each word by the model to resemble the true value \(y\), we force the probability assigned to each word by the student model to resemble the probability assigned to each word by the teacher model. That way, by minimizing this loss, the probability distribution of the student model tends towards the one from the teacher. The final training objective is a linear combination of the MLM loss and the distillation loss, having the student both learn like the teacher in the case of MLM and from the teacher with the distillation loss.
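The two losses and their linear combination can be sketched as follows. The three-word distributions and the weight `alpha` are hypothetical values chosen for illustration; the post does not state the weights used by the DistilBERT authors.

```python
import numpy as np

def distillation_loss(teacher_probs, student_probs):
    """-sum_w p^T_w * log(p^S_w): small when the student matches the teacher."""
    return float(-np.sum(teacher_probs * np.log(student_probs)))

def mlm_loss(student_probs, true_index):
    """-log of the probability the student assigns to the true masked word."""
    return float(-np.log(student_probs[true_index]))

def combined_loss(teacher_probs, student_probs, true_index, alpha=0.5):
    """Linear combination of both losses; alpha is a hypothetical weight."""
    return (alpha * distillation_loss(teacher_probs, student_probs)
            + (1.0 - alpha) * mlm_loss(student_probs, true_index))

# Made-up distributions over a three-word vocabulary.
teacher = np.array([0.6, 0.3, 0.1])
student_close = np.array([0.6, 0.3, 0.1])   # mimics the teacher exactly
student_far = np.array([0.1, 0.3, 0.6])     # disagrees with the teacher

loss_close = distillation_loss(teacher, student_close)
loss_far = distillation_loss(teacher, student_far)
total = combined_loss(teacher, student_far, true_index=0)
print("distillation loss, close student:", loss_close)
print("distillation loss, far student:", loss_far)
print("combined loss, far student:", total)
```

For a fixed teacher distribution, the distillation loss is lowest when the student reproduces that distribution exactly, which is what pulls the student towards the teacher during training.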

Regarding the choice of the training data, DistilBERT is trained on a subset of BERT base’s training set. Yet, a recent paper by Google shows that even nonsensical training sentences (random lists of words) are enough for the student to reach good performance, as the information given by the teacher model through its output probability distribution is strong enough. For instance, if the training sentence is “love trash guitar” and the masked word is “trash”, the trained teacher model will probably predict “playing” instead of “trash”, which is a legitimate prediction that will then be learned by the student. Note that this only works for the distillation loss and not the MLM loss since, for the latter, it would amount to teaching your model nonsense.

When it comes to initialization, the authors take advantage of the similar architecture between the teacher and the student models and take every other layer from the teacher model (BERT base) to initialize the student. This is an additional way to benefit from the teacher’s learnings and has a significant impact on performance (drop of 3-4% accuracy on the General Language Understanding Evaluation (GLUE) benchmark, a reference in the NLP field, without initialization).
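The initialization described above can be sketched with plain Python lists standing in for Transformer layers. Which exact layers get copied is an assumption here (every second layer, starting from the first), and the dictionary-based "layers" are hypothetical placeholders for real weight tensors.

```python
import copy

def init_student_from_teacher(teacher_layers, step=2):
    """Copy every `step`-th teacher layer to serve as the student's starting layers."""
    return [copy.deepcopy(layer) for layer in teacher_layers[::step]]

# Stand-in for the 12 encoder layers of BERT base (placeholder weights).
teacher_layers = [
    {"name": f"teacher_layer_{i}", "weights": [float(i)] * 3} for i in range(12)
]

student_layers = init_student_from_teacher(teacher_layers)
print(len(student_layers), "student layers initialized from:",
      [layer["name"] for layer in student_layers])
```

The deep copy matters: the student's layers start from the teacher's values but are then trained independently, without modifying the teacher.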

PD-BERT: where pre-training and distillation meet

DistilBERT is a great example of how knowledge distillation can help build smaller models with satisfactory performance. Yet, does that mean the good old pre-training used for BERT, that is training a language model on a massive amount of unlabeled text data, is to be discarded? No, according to a recent paper by Google. Indeed, this paper shows that pre-training smaller architectures without initialization from bigger models leads to good results with significant speed gains.

The process for this novel method the authors call pre-trained distillation (PD) goes as follows. A small version of a BERT model is randomly initialized and trained just like the big one, with the masked language modeling (MLM) approach described above. The authors then apply knowledge distillation in the same way as for DistilBERT, helping the smaller model learn from the bigger one. The resulting student can finally be fine-tuned on a labelled dataset, specializing therefore on a specific task (e.g. sentiment analysis). The value of this paper also lies in the fact that they compare different approaches (pre-training and fine-tuning with and without distillation). The authors show that, while classic pre-training and fine-tuning of smaller models already lead to better results than DistilBERT on average, adding distillation between pre-training and fine-tuning helps reach even better results.

The fact that student models are not initialized with parts of their teachers allows for more flexibility in the model size. In total, the authors released 4 models, all smaller than BERT base. The smallest one, namely BERT Tiny, has 4.4 million parameters and is therefore 15x smaller than DistilBERT and 25x smaller than BERT base.

Model size (in millions of parameters) depending on the number of layers L and the embedding size H. As a reference, BERT base has L=12 and H=768. Source: Turc et al. (2019)

This speeds up the training time compared to a big architecture such as BERT Large by up to 65 times and we can imagine that the speed-up in inference is also high, though not mentioned by the authors.

BERT on the operating table: BERT-of-Theseus

What if, instead of having the student mimic the teacher after the teacher has learned, we had the student and the teacher learn together, with the student progressively replacing the teacher? This is the intuition behind a recent model called BERT-of-Theseus, released in February 2020 by Microsoft Asia.

Source: Xu et al. (2020)

The name of this new model comes from the ship of Theseus problem in philosophy: when gradually replacing elements from an object, is the final object still the initial object or a different one?

In practice, an original BERT model is first fine-tuned, that is adapted to a specific task (e.g. sentence classification). The first step is then to fine-tune this initial model again on the same task while gradually replacing (with an increasing probability \(p\)) its elements (named predecessors, where each predecessor is a stack of 2 encoder layers from the initial model) with new elements (successors, where each successor is equivalent in size to \( 1 \) layer of the initial model). The model size therefore shrinks and the successors (the students, in knowledge distillation terms) learn in interaction with the predecessors (the teachers). When \(p\) reaches \(1\), the second step is to fine-tune the stack of all successors on the same task. Note that, similarly to DistilBERT, layers of the original model are used to initialize the successors (in this case, the first six layers).

A 6-layer predecessor \(P = \{ prd_1, … , prd_3 \} \) is compressed to a 3-layer successor \(S = \{ scc_1, …, scc_3 \} \). During part a, each predecessor \(prd_i\) is replaced by successor \(scc_i\) with probability \(p\). In part b, all successors are stacked together and fine-tuned. Source: Xu et al. (2020)
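The replacement step can be sketched as a coin flip per module. The linear schedule for \(p\), its starting value of 0.3 and the string placeholders for modules are assumptions made for this illustration, not details given in this post.

```python
import random

def replacement_rate(step, total_steps, start=0.3):
    """Hypothetical schedule: raise p linearly from `start` to 1.0 over training."""
    frac = min(1.0, step / total_steps)
    return (1.0 - frac) * start + frac

def sample_modules(predecessors, successors, p, rng):
    """At each position, use the successor with probability p, else the predecessor."""
    return [scc if rng.random() < p else prd
            for prd, scc in zip(predecessors, successors)]

# Placeholder names for the modules shown in the figure.
predecessors = ["prd_1", "prd_2", "prd_3"]
successors = ["scc_1", "scc_2", "scc_3"]

rng = random.Random(0)
for step in (0, 500, 1000):
    p = replacement_rate(step, total_steps=1000)
    print(step, round(p, 2), sample_modules(predecessors, successors, p, rng))
```

Early in training most positions are still filled by predecessors, so each successor learns while surrounded by already fine-tuned modules; once \(p\) reaches \(1\), only successors remain and they can be fine-tuned together.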

General results: lighter and faster

Here comes the time for a comparison.

Experimental results on the dev set of the General Language Understanding Evaluation (GLUE) benchmark. Source: Xu et al. (2020)

The table above summarizes the performance of different models including the teacher model BERT-base and the three models we presented here (DistilBERT, Pre-trained distillation named PD-BERT and BERT-of-Theseus). Each column refers to a specific task from the General Language Understanding Evaluation (GLUE). For instance, CoLA stands for Corpus of Linguistic Acceptability and consists of a binary classification task where the model is evaluated on its ability to tell if a sentence is grammatically correct.

When looking at the performance, DistilBERT retains 97% of BERT-base’s performance with 40% fewer parameters, while BERT-of-Theseus retains 98.35%, also with 40% fewer parameters. DistilBERT and BERT-of-Theseus are respectively 1.6x and 1.94x faster than BERT-base at inference. BERT-of-Theseus presents some additional advantages. Like PD-BERT, it is model-agnostic and could also be applied to other kinds of neural-network-based models. It is also faster and less costly to train since it relies only on fine-tuning, which is far less costly than the pre-training from scratch needed to train DistilBERT.

In the end, BERT-of-Theseus seems to be a clear winner though it must be noted that it was published 4 months after DistilBERT and 6 months after PD-BERT, which means ages in NLP time! To the credit of DistilBERT, it was also one of the pioneer models in this research field focusing on smaller models and most probably influenced researchers to look more in this direction.


In this blog post, we presented three recent models, namely DistilBERT, PD-BERT and BERT-of-Theseus, which are smaller than their teacher BERT-base, significantly faster and performing almost as well on most tasks. Does this mean David has defeated Goliath? Probably not! It is actually more likely that both research trends, pushing respectively for smaller models and bigger models, co-exist as their objectives are different. While the big tech companies will keep on pushing for the best possible model using lots of resources, other smaller companies and their research teams will focus more on models they can apply to solve business problems. In an ideal world, we could even envision smaller models performing better than big ones, ALBERT from Google being an example of this. In any case, we have reasons to be excited about what the future brings!

We hope you had fun and learned new things reading this blog post. Please reach out to us, as we would love to get feedback on these posts and answer any questions you might have on how to use this technology for your business.


Keskar, Nitish Shirish, et al. "Ctrl: A conditional transformer language model for controllable generation." arXiv preprint arXiv:1909.05858 (2019).
Turing-NLG blog post

Lan, Zhenzhong, et al. "Albert: A lite bert for self-supervised learning of language representations." arXiv preprint arXiv:1909.11942 (2019).

The Staggering Cost of Training SOTA AI models, Synced

Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and policy considerations for deep learning in NLP." arXiv preprint arXiv:1906.02243 (2019).

Learn how to make BERT smaller and faster, Rasa blog

Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).

Turc, Iulia, et al. "Well-read students learn better: The impact of student initialization on knowledge distillation." arXiv preprint arXiv:1908.08962 (2019).

Xu, Canwen, et al. "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing." arXiv preprint arXiv:2002.02925 (2020).

DistilBERT, the four versions of PD-BERT and BERT-of-Theseus can be downloaded and used in a few lines of code using the Hugging Face transformers repository