### How do machines write text?

The past year has seen a tremendous rise in the text-writing capabilities of language models. The most impressive breakthrough, GPT-2 (the second version of the Generative Pre-Training model, GPT), came from OpenAI and is now one year old. It is the model that has received the most attention in the field, with coverage in major newspapers such as The Economist, varied applications ranging from generating scripts for card games to powering Twitter bots, and even a popular web app from Hugging Face that lets anyone test the model directly.

One important barrier limiting the use of language models in many fields is that people outside the NLP community have little insight into how these models write text. This post aims to empower those readers to use these models and to better understand how their text output is generated.

We start off this blog post by defining language modeling and briefly describing how text data is processed by Transformer models. We then introduce the beam search method, which builds on word probabilities to determine what to write next. Finally, we introduce temperature, top-$$k$$ sampling, and nucleus sampling, which bring in the magic of randomness to overcome beam search’s pitfalls.

#### What are language models?

A language model is a statistical tool that represents sequences of words in a probabilistic way. It is usually trained on large text corpora, such as Wikipedia, to predict the next word in a sentence based on the words that precede it. These models already play an important role in our daily lives, from smartphone keyboards to Gmail’s Smart Compose, which predict which words will come next in text messages and emails.

Now, how exactly is the next word chosen when using a trained language model? Picture a language model, such as GPT-2, given a text input prompt, for instance “Yesterday, I went to the”. This input is split into individual words and subwords called tokens, turned into embeddings (vectors representing these tokens), and fed into the model. As an output, the model assigns a score to each token in the model’s vocabulary, the latter being the list of words and subwords seen in the model’s training data. These scores are finally turned into probabilities, using for instance a softmax function.
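As a rough sketch of this last step, here is a minimal softmax over a hypothetical three-word vocabulary. The scores are made up for illustration; a real model like GPT-2 outputs one score per token in a vocabulary of tens of thousands.

```python
import math

def softmax(scores):
    """Turn raw model scores into a probability distribution summing to 1.
    The maximum score is subtracted before exponentiating for numerical
    stability; this does not change the resulting probabilities."""
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Made-up scores for a tiny three-word vocabulary.
scores = {"beach": 5.0, "pool": 3.7, "supermarket": 2.1}
probs = softmax(scores)
```

Higher-scoring words end up with higher probabilities, and all probabilities sum to one.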

#### Greedy and beam search: the most probable word as the next word

We now know the probability, according to the model, for each word in the vocabulary to follow “Yesterday, I went to the”. The most straightforward way to predict the next word is called greedy search and consists of choosing the most probable word at each step. In our example, if the most probable words are “beach” with probability $$0.7$$ and “pool” with probability $$0.2$$, the word “beach” is chosen to continue the sentence. This process is repeated to select each following word, and ends when a predefined maximum length is reached or when an end-of-sequence token, such as a full stop, is produced.
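Greedy search can be sketched in a few lines. The `toy_model` below is a hypothetical stand-in for a trained model, with made-up probabilities; a real model would compute a distribution over its whole vocabulary.

```python
def greedy_search(next_word_probs, prompt, max_len=10, eos="."):
    """Repeatedly append the single most probable next word until an
    end-of-sequence token or the maximum length is reached."""
    words = prompt.split()
    for _ in range(max_len):
        probs = next_word_probs(words)        # {word: probability}
        best = max(probs, key=probs.get)      # greedy choice
        words.append(best)
        if best == eos:
            break
    return " ".join(words)

# Toy stand-in for a trained model, with made-up probabilities.
def toy_model(words):
    if words[-1] == "the":
        return {"beach": 0.7, "pool": 0.2, ".": 0.1}
    return {".": 1.0}

sentence = greedy_search(toy_model, "Yesterday, I went to the")
```

With these toy probabilities, the model deterministically continues the prompt with “beach” and then stops.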

This approach is a special case of a more general method called beam search, where scoring no longer takes place at the word level but at the sequence level. For a given beam width $$k$$, the $$k$$ most probable sequences are kept at each step. To discriminate between sequences, each candidate is scored by multiplying the probabilities of its component words, and only the $$k$$ sequences with the highest probabilities are kept at each step; greedy search corresponds to a beam width of $$1$$.

Going back to our example, the input prompt is “Yesterday, I went to the” and the beam width $$k$$ equals $$2$$. At step 1, the two most probable words to follow the prompt are identified, namely “beach” with probability $$0.7$$ and “pool” with probability $$0.2$$. At step 2, we determine the probabilities of the candidate sequences extending these two beams, keep the two most probable, namely “Yesterday, I went to the beach and” and “Yesterday, I went to the beach with”, and drop the others. The process then goes on at step 3, again keeping only the two most probable sequences among the candidates extending the two sequences kept at step 2. The process ends in the same way as in greedy search, and the most probable of the two sequences kept at the last step is chosen.
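Here is a minimal sketch of beam search, again using a hypothetical toy model in place of a real one. Summing log-probabilities is the standard numerically stable equivalent of multiplying the word probabilities.

```python
import math

def beam_search(next_word_probs, prompt, k=2, steps=3):
    """Keep the k most probable sequences at each step. Log-probabilities
    are summed (equivalent to multiplying raw probabilities) to avoid
    numerical underflow on long sequences."""
    beams = [(prompt.split(), 0.0)]           # (words, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for words, logp in beams:
            for word, p in next_word_probs(words).items():
                candidates.append((words + [word], logp + math.log(p)))
        # Keep only the k highest-scoring candidate sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return " ".join(beams[0][0])

# Toy stand-in for a trained model, with made-up probabilities.
def toy_model(words):
    return {"beach": 0.7, "pool": 0.3}

best = beam_search(toy_model, "Yesterday, I went to the", k=2, steps=2)
```

Because the toy model always prefers “beach”, the winning beam here repeats it, which already hints at the repetition problem discussed below.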

Despite its popularity, this method has faced some criticism in the past years. An interesting paper from Holtzman et al. (2019) entitled “The Curious Case of Neural Text Degeneration” mentions some of these critiques.

One of the main problems with beam search is its exclusive focus on word probabilities. The authors show that human-written text (the blue curve in the graph, with the respective text below) is far from exhibiting the probability-maximization behaviour observed with beam search (the orange curve). On the contrary, humans optimize against stating the obvious and maximize the text’s originality, leading to high variations in the probability of the chosen next words. The example of text generated with beam search also exhibits a problematic repetition pattern, a byproduct of this probability focus, which the authors name “the gravitational force of repetition”. They observe that when the generation gets stuck in a repetition loop (in our case, “He looks at the cloud. He looks” and so on), the probability that the same sentence follows goes up with each repetition, which is why the loop never ends. This can be partly explained by the Transformer architecture’s focus on the immediate previous context.

#### Temperature, top-$$k$$, and nucleus sampling: introducing the magic of randomness

Upon the realization of the shortcomings of beam search, researchers have focused on introducing more flexibility in the way sentences are generated, namely by using randomness in the selection process.

One first result of this trend is called sampling with temperature. The main idea of this method is to modify the probabilities assigned to the words at a given step by introducing a temperature parameter into the softmax function. We obtain a modified probability distribution containing a probability $$p_i$$ for each word $$i$$ to come next. We then sample from this modified distribution to determine the next word, picking word $$i$$ with probability $$p_i$$. When the temperature gets very small, the word that received the highest score from the model is assigned a very high probability, which maximizes its chance of being chosen and resembles greedy search. On the contrary, when the temperature gets very large, the distribution tends towards uniform, with the probabilities of picking each word becoming equal. Setting the temperature to $$1$$ gives the same probability distribution as a softmax function without temperature.
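A minimal sketch of a softmax with temperature, using made-up scores, shows both extremes described above:

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax where each score is divided by the temperature first.
    Low temperatures sharpen the distribution towards the top-scoring
    word; high temperatures flatten it towards uniform."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)                 # subtracted for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [5.0, 3.7, 2.1]                        # made-up model scores
sharp = softmax_with_temperature(scores, 0.1)   # close to greedy search
flat = softmax_with_temperature(scores, 100.0)  # close to uniform
```

With temperature $$0.1$$, nearly all probability mass lands on the top-scoring word; with temperature $$100$$, the three probabilities are almost equal.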

The top-$$k$$ sampling method is another result of this trend and is the method used by GPT-2 to generate the next word and therefore the next sentence. Instead of keeping the $$k$$ most probable sequences at each step as in beam search, we consider the $$k$$ most probable words at each step and choose the next word by randomly sampling from these $$k$$ words, weighted by their probabilities. The top $$k$$ most probable words to follow this chosen word are then determined, sampled from, and so on. Going back to our original example with $$k=2$$, the words “beach” and “pool” follow “Yesterday, I went to the” with probabilities of $$0.7$$ and $$0.2$$ respectively. After renormalizing these two probabilities so that they sum to $$1$$, “beach” is chosen with probability $$0.7 / 0.9 \approx 0.78$$ and “pool” with probability $$0.2 / 0.9 \approx 0.22$$ during the sampling process.
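Top-$$k$$ sampling for a single step can be sketched as follows, again with hypothetical probabilities (`random.choices` handles the renormalization of the weights internally):

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample the next word from only the k most probable words,
    weighted by their probabilities; random.choices renormalizes
    the weights so they behave like a probability distribution."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words = [w for w, _ in top]
    weights = [p for _, p in top]
    return rng.choices(words, weights=weights)[0]

# Hypothetical next-word probabilities after "Yesterday, I went to the".
probs = {"beach": 0.7, "pool": 0.2, "supermarket": 0.1}
next_word = top_k_sample(probs, k=2)   # always "beach" or "pool"
```

With $$k=2$$, “supermarket” is never chosen, no matter how many times we sample.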

Though this method generally increases the output text quality, it can give too much importance to edge cases. In our example, let’s now say that the four most probable words to follow “Yesterday, I went to the” are “beach”, “pool”, “supermarket”, and “library”, with respective probabilities of $$0.2$$, $$0.1$$, $$0.05$$, and $$0.05$$. If we perform top-$$k$$ sampling with $$k=4$$, the combined probability of picking one of the less probable words (“pool”, “supermarket”, or “library”) is the same as the probability of picking the most probable word (“beach”). This leads to cases where we distrust the model even though it is confident about the next word. In another setting, if the probability distribution is close to uniform across many words, keeping only the top $$k$$ words eliminates others that would have been just as relevant as the selected ones. To solve this issue, Holtzman et al. (2019) introduce nucleus sampling. The essential idea is that instead of choosing a fixed number of words $$k$$ to sample from, we set a probability threshold that determines how many words to keep based on their respective probabilities. In the past example, if the probability threshold $$p$$ equals $$0.3$$, we look at the probability of each word, sum these probabilities starting from the most probable word, and stop as soon as the sum reaches the threshold value. Here, we will therefore only sample from the first two words, “beach” and “pool”, since the sum of their probabilities is equal to the threshold $$p = 0.3$$.
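Nucleus sampling for a single step can be sketched with the probabilities from the example above (this is a simplified reading of Holtzman et al.'s method, handling one step in isolation):

```python
import random

def nucleus_sample(probs, p, rng=random):
    """Keep the smallest set of most probable words whose cumulative
    probability reaches the threshold p, then sample among them,
    weighted by their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    words = [w for w, _ in nucleus]
    weights = [pr for _, pr in nucleus]
    return rng.choices(words, weights=weights)[0]

# Probabilities from the example; the remaining mass is on other words.
probs = {"beach": 0.2, "pool": 0.1, "supermarket": 0.05, "library": 0.05}
next_word = nucleus_sample(probs, p=0.3)   # nucleus is {"beach", "pool"}
```

With a sharper distribution (say “beach” at $$0.9$$), the same threshold would keep only one word, so the size of the nucleus adapts to the model’s confidence.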

#### Conclusion

To conclude, there are many ways to make a language model speak. The shortcomings of methods such as greedy search and beam search have been overcome by introducing randomness into the choice of the next word, and nucleus sampling offers an intermediate solution between flexibility and confidence in the model’s predictions. The example above summarizes the methods presented in this post by showing texts generated with each method following the same input text (Context). Compared with the original text continuation (Gold), greedy and beam search get stuck in repetition loops, while nucleus and top-$$k$$ sampling are clear winners.

Yet, there is still room for improvement. When developing new language models in the future, a special focus will have to be placed on how text is actually written, rather than just building bigger models and hoping they get the writing right.