Keeping up with the BERTs: a review of the main NLP benchmarks

To say that NLP has gained momentum in the past few years is an understatement. The increasing performance of NLP models on an ever-growing number of tasks, and the rising attention the field has attracted, led Sebastian Ruder from DeepMind to talk about NLP’s ImageNet moment in July 2018, in reference to the similar boom that computer vision went through in 2012. Clément Delangue from Hugging Face even described NLP as the most important field of machine learning (ML) at Data Driven NYC in January of this year.

Now, it is one thing to say that models are improving, but what really matters is knowing which tasks the current state-of-the-art (SOTA) models can handle and how model and human performance compare. In this blog post, we answer these questions by presenting the General Language Understanding Evaluation (GLUE) benchmark and by analyzing performance on each task to separate the easy tasks from the hard ones for current SOTA models. Finally, we present the most recent benchmarks, SuperGLUE and XTREME, which were set up to keep pace with rising model performance and to evaluate new models on languages other than English.

The necessity of a benchmark: GLUE

Transfer learning in the NLP context can be defined as a two-step process: a model is first trained on massive amounts of text to learn fundamental aspects of language, and this accumulated knowledge is then reused when the model is trained on a specific language-related task such as sentence classification. The development of this learning method has allowed NLP models, such as BERT from Google, to perform very well on a wide range of tasks, from question answering to sentiment analysis.

In the meantime, the growing effort put into NLP research has led to the development of several new models, from ALBERT to ERNIE and more. The need to compare the performance of these models sparked the creation of numerous benchmarks, such as SentEval (Conneau et al., 2017) and, more recently, the General Language Understanding Evaluation benchmark (GLUE) (Wang et al., 2019). The latter has grown to become one of the most important benchmarks in NLP, because of the variety of tasks it contains but also, and most importantly, because it provides a point of comparison between human and model performance.

The GLUE benchmark consists of 9 English sentence understanding tasks, which can be divided into three categories: single-sentence tasks, similarity tasks and inference tasks. The first category includes acceptability judgments, for which the model is asked to determine whether a sentence is grammatically correct, and sentiment analysis, where the model learns to determine whether a sentence is positive or negative. The second category is built on sentence pairs, for which the model is asked to determine whether the two sentences are paraphrases of each other and how semantically similar they are to one another. Finally, the inference tasks include logical problems (e.g. determining whether two sentences contradict each other), question answering and reading comprehension, in which the model reads a sentence with a pronoun and selects the referent of that pronoun from a list of choices. The GLUE score of a model is then the average of its scores on the individual tasks.
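
To make the averaging step concrete, here is a minimal Python sketch. The task abbreviations are the usual GLUE short names, but the scores are placeholder values chosen for illustration only, not real leaderboard results, and the function name glue_style_score is our own. We also assume that each task has already been reduced to a single number.

```python
from statistics import mean

# Placeholder scores for illustration only; these are NOT real leaderboard results.
# One (assumed, already combined) score per GLUE task.
task_scores = {
    "CoLA": 60.0,
    "SST-2": 90.0,
    "MRPC": 85.0,
    "STS-B": 85.0,
    "QQP": 80.0,
    "MNLI": 85.0,
    "QNLI": 90.0,
    "RTE": 70.0,
    "WNLI": 65.0,
}


def glue_style_score(scores):
    """Return the unweighted average of the per-task scores."""
    return mean(scores.values())


overall = glue_style_score(task_scores)
print(f"Overall score: {overall:.1f}")
```

Because the average is unweighted, a small dataset counts as much as a large one in the final score.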

Source: Wang et al. (2019)

As mentioned earlier, one important advantage of GLUE is that it provides a human performance baseline, calculated by averaging the scores of non-expert crowd workers on each of the tasks described above. This human baseline was beaten for the first time by the Multi-Task Deep Neural Network (MT-DNN) ensemble model from Microsoft in June 2019. Several other models followed, which confirms how fast model performance is increasing in the NLP field. As of today (April 27, 2020), 11 models score above the GLUE human baseline.

Source: GLUE benchmark leaderboard (as of April 27th 2020)

One thing worth noticing, as Thomas Wolf mentions in his video “The Future of NLP”, is that most models at the top of the leaderboard were trained by big tech firms such as Google, Facebook, Alibaba or Baidu. This can be explained by the growth in model size: bigger models require more computing power to train, which only big tech firms can afford, and the field of competitors narrows as a result.

Hard and easy tasks for current SOTA NLP

The great overall performance of the different models listed in the table above hides differences in performance across tasks. Now, at which tasks do NLP models beat humans and are there tasks for which humans still have the lead?

Starting with the tasks that are easy for current SOTA NLP models, here are the easiest ones, from the easiest down:

  • Recognizing paraphrases: this task is based on the Microsoft Research Paraphrase Corpus, and the model has to tell whether one sentence is a paraphrase of another. It is the easiest task, with the current best model on GLUE (ALBERT + DAAF + NAS) outperforming the human baseline by 11.2 percentage points in terms of accuracy.
  • Semantic similarity between questions: this task is based on the Quora Question Pairs dataset, and the model has to determine whether two questions are semantically similar or not. To give you a sense of what this semantic similarity means, the questions “how do you start a bakery?” and “how can one start a bakery business?” are considered semantically similar while “what are natural numbers?” and “what is a least natural number?” are not. On this task, the current best model on GLUE outperforms the human baseline by 10.6 percentage points in terms of accuracy.
  • Acceptability judgements: on the Corpus of Linguistic Acceptability, where the model must tell whether a sentence is grammatically correct, the current best GLUE model scores 7.1 points above the human baseline.
  • Extractive question answering: when the model is asked to identify the sentence in a corpus that answers a specific question, the current best GLUE model reaches an accuracy 6.3 points above the human baseline.

When it comes to harder tasks, here are the two hardest, starting with the most difficult:

  • Textual entailment: this task is part of the logical problems mentioned before. The model is asked to recognize whether the meaning of one text can be inferred, or entailed, from the other text. The textual entailment is either positive (text entails hypothesis), contradictory (text contradicts hypothesis), or neutral. An example of a positive textual entailment would be:
    • Text: If you help the needy, God will reward you
    • Hypothesis: Giving money to a poor man has good consequences
    For this task, on the Recognizing Textual Entailment (RTE) corpus, the accuracy of the best GLUE model is 1.9 points below the human baseline.
  • Natural language inference: in the Winograd Natural Language Inference task (WNLI), the model is asked to determine what a given pronoun refers to. For instance, in the sentence “The trophy would not fit in the brown suitcase because it was too big”, the model is asked what was “too big”, or in other words what “it” refers to, and the answer is “the trophy”. For this task, the accuracy of the best GLUE model is 1.4 points below the human baseline.
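
As an illustration of how such a task can be represented and scored, here is a minimal Python sketch built on the textual entailment example from the list above. The class name EntailmentExample, the label strings and the accuracy helper are our own assumptions for illustration; they do not reproduce the data format or the evaluation code of GLUE.

```python
from dataclasses import dataclass

# Label names are an assumption made for this sketch.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class EntailmentExample:
    """One text/hypothesis pair with its gold label (hypothetical structure)."""
    text: str
    hypothesis: str
    label: str

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def accuracy(predictions, examples):
    """Share of examples whose predicted label matches the gold label."""
    if len(predictions) != len(examples):
        raise ValueError("predictions and examples must have the same length")
    correct = sum(pred == ex.label for pred, ex in zip(predictions, examples))
    return correct / len(examples)


example = EntailmentExample(
    text="If you help the needy, God will reward you",
    hypothesis="Giving money to a poor man has good consequences",
    label="entailment",
)
print(accuracy(["entailment"], [example]))
```

A model is then compared to the human baseline by computing the same accuracy for both on the same set of examples.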

To conclude this comparison between NLP models and the human baseline, a trend emerges: the easy tasks mostly belong to the single-sentence tasks (grammar check) or the sentence-pair tasks (recognizing paraphrases or semantic similarity). On the other hand, NLP models are still lagging behind on natural language inference, though not by much.
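
The comparison above can be reproduced with a few lines of Python. The gaps below are the figures quoted in this post (best GLUE model minus human baseline, in points); the variable names and the task labels are our own.

```python
# Gap between the best GLUE model and the human baseline, in points,
# as quoted in this post (positive means the model is ahead).
gap_vs_human = {
    "Recognizing paraphrases": 11.2,
    "Semantic similarity between questions": 10.6,
    "Acceptability judgements": 7.1,
    "Extractive question answering": 6.3,
    "Textual entailment (RTE)": -1.9,
    "Natural language inference (WNLI)": -1.4,
}

# Sort from the largest model lead to the largest human lead.
ranked = sorted(gap_vs_human.items(), key=lambda item: item[1], reverse=True)

model_ahead = [task for task, gap in ranked if gap > 0]
human_ahead = [task for task, gap in ranked if gap < 0]

for task, gap in ranked:
    leader = "model" if gap > 0 else "human"
    print(f"{task:<40} {gap:+5.1f}  ({leader} ahead)")
```

Sorting by the gap rather than by the raw score matters here, since a task can have a high raw score and still be one where humans keep the lead.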

The necessity to set the bar higher and further: from GLUE to SuperGLUE and XTREME

The fact that the human baseline was beaten on GLUE raised discussions about the need for a new benchmark that would include more challenging tasks and on which the human baseline would not be reached right away. This was the motivation behind SuperGLUE, which was presented in the Proceedings of NeurIPS 2019. The authors defined it as a “new benchmark designed to pose a more rigorous test of language understanding”. The two hardest GLUE tasks presented above (RTE and WNLI) are kept, and new tasks are added under the condition that they go beyond the scope of the current SOTA systems but are solvable by most college-educated English speakers. These tasks include boolean question answering, where the answer is either yes or no, causal reasoning, and reading comprehension with multiple-choice questions. As of today, the best performing model on SuperGLUE is T5 from Google, published in October 2019. It still ranks below the SuperGLUE human baseline, though by just 0.6 points in terms of overall SuperGLUE score.

Source: SuperGLUE benchmark leaderboard (April 27th 2020)

One thing to be noted is that the overall number of models on SuperGLUE (4 without the baselines) is still way lower than for GLUE (30 models without the baselines). It’s hard to tell why but we hope that new architectures will be uploaded on SuperGLUE so we can observe how they perform on more challenging tasks. 

Now, the success of these models on English tasks has prompted NLP practitioners from non-English speaking countries to use them on text in their own languages. Following this trend, many new versions of BERT, among other architectures, were trained for other languages, from German to Vietnamese, some of them with funny names, like CamemBERT for French. The need to evaluate these models motivated the creation of a new multilingual benchmark entitled XTREME, published on March 24, 2020 by practitioners from different institutions (Carnegie Mellon, Google, DeepMind). This benchmark includes four categories of tasks: sentence classification, structured prediction (e.g. named entity recognition, part-of-speech tagging), sentence retrieval (finding a sentence in a text based on a query) and question answering. The overall score is built as an average of the scores on the different tasks, just like in GLUE. The benchmark covers 40 different languages, including many under-studied languages such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka and Singapore), Telugu and Malayalam (spoken in southern India). One interesting finding is that models achieve close to human performance on most tasks in English, but performance remains significantly lower for many of the other languages, which leaves a lot of room for improvement.
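
One simple way to quantify this difference between English and the other languages is to subtract the average score on the other languages from the English score. The Python sketch below does this for a single hypothetical task. The scores are toy values made up for illustration, not real XTREME results, and the function name transfer_gap is our own.

```python
from statistics import mean

# Toy per-language scores for one hypothetical task; NOT real XTREME results.
# Keys are ISO 639-1 codes: English, German, Tamil, Telugu, Malayalam.
scores_by_language = {
    "en": 85.0,
    "de": 78.0,
    "ta": 60.0,
    "te": 58.0,
    "ml": 62.0,
}


def transfer_gap(scores, source_lang="en"):
    """Score on the source language minus the mean score on all other languages."""
    others = [score for lang, score in scores.items() if lang != source_lang]
    return scores[source_lang] - mean(others)


gap = transfer_gap(scores_by_language)
print(f"Gap between English and the other languages: {gap:.1f} points")
```

The smaller this gap, the better a model carries what it learned on English over to the other languages.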


After reviewing the evolution of the main NLP benchmarks in the past year, there is no doubt that NLP systems are improving very fast. For some tasks in English, such as recognizing paraphrases or judging whether a sentence is grammatically correct, the current best models are already significantly better than the human baseline. There is also no doubt that benchmarks such as GLUE or SuperGLUE played a role in the improvement of monolingual NLP models, by providing a point of comparison between models.

Yet, the road is still long before models beat humans on all tasks, with models still lagging behind on tasks involving logical or causal reasoning. This great improvement has also been focused on English, and performance on other languages remains lower, especially for low-resource languages. We hope that the new multilingual benchmark XTREME will motivate research in the multilingual field and allow practitioners to use NLP technology across languages.

Finally, while leaderboards are important to keep track of the general progress of NLP technology, they should not concentrate all of the community’s attention. For instance, smaller models are almost as performant as the big architectures on GLUE but are most importantly faster and greener, as described in our past blog post. Ideally, these two research trends should work together to make the best of the technology easily accessible for companies to run it in production.

We hope you had fun and learned new things reading this blog post. Please reach out to us, as we would love to get feedback on these posts and answer any questions you might have on how to use this technology for your business.


“NLP’s ImageNet moment has arrived”, Sebastian Ruder, Jul 12, 2018

NLP—The Most Important Field of ML // Clement Delangue, Hugging Face (FirstMark's Data Driven NYC)
General Language Understanding Evaluation (GLUE) website
Wang, Alex, et al. "Superglue: A stickier benchmark for general-purpose language understanding systems." Advances in Neural Information Processing Systems. 2019.
SuperGLUE website
The Future of NLP - Thomas Wolf (Hugging Face)
Hu, Junjie, et al. "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization." arXiv preprint arXiv:2003.11080 (2020).