“Mind your language, GPT-2”: how to control style and content in automatic text writing
By Manuel Tonneau
March 11, 2020
How cool would it be to have machines do the repetitive writing for us, lightening our workload and allowing us to focus on more important tasks? For a while, this was not technically feasible. With the advent of new models such as GPT-2, writing original texts automatically is not science fiction anymore. Nevertheless, despite the impressive results, it is still hard to control the text generation to make the machine write following our wishes. Indeed, it is easy to have GPT-2 ramble about some random topic but it is more difficult to make sure it stays on topic and mentions specific information, which is fundamental for business applications.
In this blog post, we discuss the latest improvements in language modeling that allow for controllable text generation, namely for controlling text type and content, and discuss the perspectives in the field.
Write like Shakespeare, not like Wikipedia: controlling text type
In a previous piece, we explained the way machines write text, namely by training them to predict the next word given an input text, and discussed different methods to influence which word comes next. Yet, these methods are not of great help when making sure the text is written in a specific language style (e.g. youth slang) or with a specific structure (e.g. a one-sentence title followed by two paragraphs).
A first possible solution is to adapt the data the model is trained on. The model’s training can be divided into two parts: pre-training and fine-tuning. Pre-training means training the model from scratch on a large amount of text data, allowing it to learn a general understanding of language. Fine-tuning, the second part, means training the pre-trained model further on a smaller, specific text dataset so that it learns a particular text structure or language style.
Pre-training a model from scratch on a specific text type, e.g. emails, birthday cards or scientific articles, is one possibility to control the structure and language style of the generated text. However, this solution is very costly in terms of computing power and requires a large training set. Another easier possibility is to fine-tune a pre-trained model on a smaller specific dataset. For instance, the great folks at Hugging Face have followed this second path, fine-tuning OpenAI’s GPT-2 on a small dataset containing research papers from the arXiv library, hence the research and deep learning orientation in the text generated by their model (in bold in the figure below).
Now, having to train one separate model for each desired text type can be tedious. One solution, proposed by Salesforce with their recent model CTRL published in September 2019, is to add a mention of the desired text type to the input text. They define control codes specific to each type of text (e.g. “Wikipedia” for Wikipedia articles or “books” for novel-type texts) and include one at the beginning of each text during the training phase. That way, the model learns the relationship between the control codes and the text that follows and the user only has to specify a desired control code to determine the text type the machine needs to generate.
In the example above, where the control codes are outlined in red and the input written by the user in blue, the control codes refer to the domain style and can be called domain codes. Depending on the domain code, the same prompt (e.g. “a knife”) can lead to very different generated texts, from a horror story (learned from the Reddit community nosleep) to Amazon-like reviews.
The challenge of controlling text content
While controlling text type seems feasible, controlling text content is more challenging. Defining the text type certainly helps in making sure a specific vocabulary is used, but it is not enough to control the text structure or to make sure specific information is mentioned in the text. Following the same path, the CTRL authors propose adding more specific control codes after the domain codes to constrain the text generation.
In the example above, extra control codes are added to the domain codes and influence the generated text’s structure and content. For instance, “Title” implies that the first sentence in the generated text will be the text’s title. In the case of reviews, one can also specify the rating’s value to influence sentiment in the generated text. Funnily enough, the model is even able to pick up the structure of hyperlinks and generate fake ones, such as the National Geographic link in the first generated text.
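To make the control code mechanism concrete, here is a minimal sketch of how a domain code and optional extra codes could be placed in front of a text, both when preparing training examples and when writing a prompt. The function name `add_control_codes` and the exact formatting (codes separated by single spaces, a rating written as "Rating: 5.0") are assumptions made for illustration and may differ from what CTRL actually expects.

```python
from typing import Sequence


def add_control_codes(text: str, domain_code: str,
                      extra_codes: Sequence[str] = ()) -> str:
    """Prefix `text` with a domain code followed by optional extra codes.

    Hypothetical helper: the separator and code spelling are assumptions,
    not the official CTRL input format.
    """
    prefix = " ".join([domain_code, *extra_codes])
    return f"{prefix} {text}".strip()


# Training time: every raw text is stored with its domain code in front,
# so the model can learn which style follows which code.
training_example = add_control_codes("The knife lay on the kitchen table.", "Horror")

# Generation time: the user picks the codes and writes only the short prompt.
prompt = add_control_codes("A knife", "Reviews", ["Rating: 5.0"])
print(prompt)  # Reviews Rating: 5.0 A knife
```

The same short prompt can thus be combined with different codes, which is what lets a single model produce several text types.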
One drawback of the CTRL model is its size (1.63 billion parameters), which makes it hard to train or adapt without massive computing power. This is the motivation for the Plug-and-Play Language Model (PPLM), published in December 2019 by Uber AI.
The idea is the following: instead of training a gigantic language model from scratch again, we use the knowledge accumulated by existing available ones such as GPT-2 and train much smaller attribute models to influence the generation. Attribute models are essentially models that estimate the probability that a text sequence \( x \) has a certain pre-defined attribute (=control code) \( a \) (e.g. Positive or Negative). For each word to be generated, the attribute model estimates the probability that the sequence \( x \) has the desired attribute \( a \), and the language model’s internal activations (not its trained weights, which stay fixed) are nudged in the direction that increases this probability. That way, the sentence that is finally generated likely has the pre-defined attribute.
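The steering step can be illustrated with a toy sketch. In the PPLM paper the gradient of the attribute model is applied to the language model's hidden activations while its weights stay untouched; the code below mimics only that core idea with a two-dimensional hidden state, a two-word vocabulary, and a logistic attribute model. All names, weights, and the step size are made up for illustration, and the additional terms the authors use to keep the text fluent are left out.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def attribute_prob(h, v):
    """Toy attribute model: p(a | h) as a logistic function of the hidden state h."""
    return sigmoid(sum(hi * vi for hi, vi in zip(h, v)))


def steer_hidden_state(h, v, step_size=0.5, n_steps=3):
    """Gradient ascent on log p(a | h) with respect to h; the weights v stay fixed."""
    h = list(h)
    for _ in range(n_steps):
        p = attribute_prob(h, v)
        # d/dh log sigmoid(v . h) = (1 - sigmoid(v . h)) * v
        h = [hi + step_size * (1.0 - p) * vi for hi, vi in zip(h, v)]
    return h


def next_word_probs(h, output_weights):
    """Softmax over a toy vocabulary, computed from the hidden state."""
    logits = [sum(hi * wi for hi, wi in zip(h, w)) for w in output_weights]
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["great", "awful"]
output_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row per word in vocab
v_positive = [1.0, -1.0]                   # toy "Positive" attribute direction
h = [0.0, 0.0]                             # hidden state before steering

h_steered = steer_hidden_state(h, v_positive)
before = next_word_probs(h, output_weights)
after = next_word_probs(h_steered, output_weights)
print(dict(zip(vocab, before)))
print(dict(zip(vocab, after)))
```

After steering, the toy model assigns more probability to "great" than before, even though neither the output weights nor the attribute weights were changed.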
The authors of PPLM follow the control code approach and increase the number of control codes to constrain the text generation even more, as in the example below. It is important to choose control codes that make sense when grouped together, which is not necessarily the case in the first example, where the text generation moves from Putin and state oil companies to frozen food. The input prompt also needs to be related to the control codes, which is hardly the case in the second example and might explain why the model stops talking about pizza after a few generated words.
In addition to the control code approach, another possible solution to control the way text is generated is to predefine the end of the sentence and let the model fill in the blanks. One of the latest and currently best performing language models, T5 by Google, can be fine-tuned to do such a task, namely fill-in-the-blank text generation.
As seen above, the user can define where the blank is and the number of words \( N \) to be generated in this blank. Compared to the control code approach, it does not give control over text type, but it still keeps the model from going in unexpected directions.
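As a rough sketch of the input side of this approach, the snippet below builds a prompt with a blank of \( N \) words and later substitutes the model's answer into it. The `_N_` marker mirrors the style of the example discussed here, but the helper names `make_blank_prompt` and `fill_blank` are hypothetical, and the filler string is typed in by hand rather than produced by a model.

```python
def make_blank_prompt(before: str, after: str, n_words: int) -> str:
    """Build a fill-in-the-blank prompt asking for roughly `n_words` words."""
    if n_words < 1:
        raise ValueError("n_words must be at least 1")
    return f"{before} _{n_words}_ {after}"


def fill_blank(prompt: str, n_words: int, filler: str) -> str:
    """Replace the first blank marker with the text proposed for the blank."""
    return prompt.replace(f"_{n_words}_", filler, 1)


prompt = make_blank_prompt("I love peanut butter and", "sandwiches.", 4)
print(prompt)  # I love peanut butter and _4_ sandwiches.

# Hand-written stand-in for a model's answer, for illustration only.
print(fill_blank(prompt, 4, "jelly, which makes great"))
```

Because the text after the blank is fixed in advance, the generated words have to connect to it, which is where the control comes from.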
Conclusion and perspectives
In conclusion, we presented several ways to control text type and content during automatic text generation, including control codes that guide the generation towards specific contexts and fill-in-the-blank text generation where the end of the text is pre-defined.
These methods are obviously not perfect. Stacking up too many control codes lowers the value of automatic text generation: once numerous keywords have to be supplied, writing the text oneself costs less effort. Also, only defining the end of the text limits the control over the overall text generation. Combining both approaches could be interesting, allowing for enough control over text type and content while avoiding nonsensical endings. Adding extra information, if available, could also help the generation take the right direction, as in this recent paper from February 2020, where the authors generate biomedical abstracts automatically from a title, an intended publication year, and a set of keywords. In a more distant future, one could even imagine using data from brain sensors, collected when users think about a text they want to write, to produce a draft of this text.
As of now, despite its imperfections, the current technology can be used to create value for companies, for instance by guiding the model to use specific SEO keywords, generating original texts with high SEO value at scale. In this regard, we at Creatext leverage this technology to power our SEO content generator.
We hope you had fun and learned new things reading this blog post. Please reach out to us at email@example.com as we would love to get feedback on these posts and answer any questions you might have on how to use this technology for your business.
Blog post on GPT-2 (OpenAI)
Write with Transformer, Hugging Face
Keskar, Nitish Shirish, et al. "CTRL: A conditional transformer language model for controllable generation." arXiv preprint arXiv:1909.05858 (2019).
Dathathri, Sumanth, et al. "Plug and play language models: a simple approach to controlled text generation." arXiv preprint arXiv:1912.02164 (2019).
Blog post on PPLM (Uber AI)
Blog post on T5 (Google AI)
Sybrandt, Justin, and Ilya Safro. "CBAG: Conditional Biomedical Abstract Generation." arXiv preprint arXiv:2002.05637 (2020).