Conversing with chatbots: DialoGPT

Akíntúndé Ọládípọ̀
Published in Chatbots Life · Jul 18, 2021

Source: What is a Chatbot and is it Better than a Human? | Tulie Finley-Moise

Introduction

In a previous module, we examined language models and explored n-gram and neural approaches. We found that n-gram models generally improve with larger values of N, but higher orders are constrained by the available compute resources, and n-grams absent from the training corpus receive no representation at all. Recent neural approaches, by applying subword tokenization methods such as Byte Pair Encoding and WordPiece, resolve these issues and show impressive results.

We also traced the development of neural language models from feedforward networks, which rely on word embeddings and a fixed input length, to recurrent neural networks, which allow variable-length input but struggle to capture long-term dependencies. We explored the concept of attention and its importance in transformer models using Jay Alammar’s awesome The Illustrated Transformer, and learned that transformer models have the added benefit of being parallelizable during training.

Building on our knowledge of transformer models, we explored OpenAI’s GPT-2, which has 1.5 billion parameters and was trained on 40GB of text. We then looked at Microsoft’s DialoGPT, which extends this language model to conversational response generation.

All these provide context for this article. Here, we will play around with the pretrained DialoGPT model and generate responses using a variety of strategies.

Loading the DialoGPT model

Microsoft makes variants of the pretrained DialoGPT checkpoints available both through download links listed in the GitHub repository and through Hugging Face’s 🤗 Transformers. The 🤗 Transformers library offers a unified API through which many models can be loaded, trained, saved and shared, so that is what we will use here.
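The loading step is embedded as a gist in the original post; a minimal sketch using the microsoft/DialoGPT-small checkpoint from the Hugging Face Hub might look like this:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

checkpoint = "microsoft/DialoGPT-small"  # medium and large variants also exist
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
model = GPT2LMHeadModel.from_pretrained(checkpoint)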

A tqdm progress bar appears when you run this code as 🤗 Transformers downloads and caches the checkpoints for the model and tokenizer. Do not worry; this download only happens the first time a particular model or tokenizer is loaded. When it completes, we should see a 12-layer model with an embedding dimension of 768. We can interact with this model just as we would with any other PyTorch model. For example, we can check the number of parameters in the DialoGPT-small model.
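One way to do that (this is just standard PyTorch, nothing DialoGPT-specific; DialoGPT-small follows the GPT-2 small architecture, so expect a figure in the region of 124 million):

# Count parameters just as for any other PyTorch module
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters")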

If you would prefer to interact with a TensorFlow model, simply import TFGPT2LMHeadModel instead.
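A hedged sketch of the TensorFlow route; if the checkpoint only ships PyTorch weights, from_pt=True converts them on the fly:

from transformers import TFGPT2LMHeadModel

# from_pt=True is only needed when no TensorFlow weights are published
tf_model = TFGPT2LMHeadModel.from_pretrained("microsoft/DialoGPT-small", from_pt=True)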


Chatting with DialoGPT

To interact with the model we loaded, we have to prompt it with text, and that text must be tokenized in a form the model understands. To do this, we use the tokenizer we loaded earlier. We will also use the tokenizer to decode the model’s output into our response. We can write a simple function to handle all of this neatly, taking advantage of 🤗 Transformers’ generate method.
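The helper itself is embedded as a gist in the original post; a minimal sketch consistent with the calls and outputs below (the author’s exact implementation may differ) could look like this:

import torch

def generate(prompt, model, tokenizer, **generate_kwargs):
    # Layers such as dropout behave differently at inference time
    model.eval()
    # Encode the prompt and append the end-of-sequence token DialoGPT expects
    input_ids = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors="pt")
    with torch.no_grad():
        history_ids = model.generate(input_ids, **generate_kwargs)
    # history_ids echoes the prompt, so decode only the newly generated tokens
    return [
        tokenizer.decode(output[input_ids.shape[-1]:], skip_special_tokens=True)
        for output in history_ids
    ]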

So that we can focus on just the parameters passed to our model, we can bind the model and tokenizer using partial. Then, we can prompt our model.

from functools import partial

generator = partial(generate, model=model, tokenizer=tokenizer)
generator("Try this cake. I baked it myself.")

Output:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
["I'm not sure if I'm missing something,"]

There are two other things to note about our generate function. First, we switch the model to evaluation mode. This is necessary because some layers, such as dropout, behave differently during training and inference. Second, we only decode part of the history_ids our model returns, because the text we prompted the model with is also returned. We can confirm this by decoding all of history_ids. Try it out!
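For instance, reusing the pieces of the helper above, decoding the full tensor shows the prompt echoed back before the reply:

input_ids = tokenizer.encode("Try this cake. I baked it myself." + tokenizer.eos_token, return_tensors="pt")
history_ids = model.generate(input_ids)
# Prints something along the lines of:
# Try this cake. I baked it myself.<|endoftext|>I'm not sure if I'm missing something,<|endoftext|>
print(tokenizer.decode(history_ids[0]))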

Decoding Strategies

We have already demonstrated one decoding strategy above: as used so far, our generate function performs Greedy Search. At each step of generation, the greedy search algorithm simply selects the word/token with the highest probability as the model’s next output.

Source: How to generate text: using different decoding methods for language generation with Transformers | Patrick von Platen

Following the image above, the algorithm generates “The nice woman” by choosing the most probable word at each turn. Prompt the model with different contexts to get a feel for the kind of responses it generates.
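For example (these prompts are purely illustrative; the replies you get will depend on the prompt):

# With no beam or sampling arguments, generate defaults to greedy search
generator("What do you do for a living?")
generator("I am thinking of adopting a dog.")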

1. Beam Search

We may quickly discover that greedy search often produces generic responses. Consider the scenario below (Source: Decoding Strategies that You Need to Know for Response Generation | Vitou Phy)

Context: Try this cake. I baked it myself.

Optimal response: This tastes great!

Model response: This is okay.

The model generates a suboptimal response even though it starts with the same token as the optimal response, “This”. This may happen because ‘is’ is a more popular token than ‘tastes’ following ‘This’ in the training data. To speak in terms of probabilities, a more probable sequence may be ‘hiding’ behind a low-probability token. In the tree diagram above, for example, “The dog has” (0.4 × 0.9 = 0.36) has a higher overall probability than the greedily chosen “The nice woman” (0.5 × 0.4 = 0.2), but greedy search never considers it because “dog” is less probable than “nice” at the first step.

Beam search circumvents this issue by keeping track of a predefined number of the most likely hypotheses (beams) at each step before eventually choosing the sequence with the highest overall probability. We can employ beam search through our `generate` function as follows:

generator(
    "Would you like to have dinner with me?",
    num_beams=5,
    early_stopping=True,
    num_return_sequences=5
)

Output:
['How did you make it?', 'I baked it myself.', 'How did you make this?', 'How did you make the cake?', 'I baked it myself']

These responses are great. We can improve them further by constraining the search: no_repeat_ngram_size ensures that n-grams already generated are not repeated later in the model’s response, while min_length and max_length keep the generated responses within a predefined length range. We can also retrieve more than one sequence by setting num_return_sequences to a value less than or equal to num_beams.

generator(
    "Try this cake. I baked it myself.",
    max_length=50,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=2,
    min_length=25,
    num_return_sequences=5
)

Output:
['How did you make the cake? I want to make one too!!',
 'How did you make the cake? I want to make one too! :D',
 'How did you make the cake? I want to make one myself!!',
 'How did you make the cake? I want to make one too. It looks amazing!',
 'How did you make the cake? I want to make one too. It looks amazing.']

Now, these responses feel a lot more natural. However, striking a balance between all these penalties we have imposed may require some tuning.

2. Random Sampling

Human conversation involves quite a bit of unpredictability: we do not simply choose the most likely word or sentence in reply to a friend. We can introduce this randomness by sampling, that is, selecting the next token at random according to the model’s conditional probability distribution.

import torch

torch.manual_seed(42)
generator(
    "Try this cake. I baked it myself.",
    do_sample=True,
    top_k=0,
    max_length=50
)

Output:
['Surely it was caused by some sort of cake']

We set top_k to zero for now. We will visit Top-K sampling next.

This response is certainly more surprising than the others so far. It is also not very coherent; it is unclear what exactly was “caused by some sort of cake”. We can improve this by reducing the likelihood of low-probability words using the softmax temperature.

generator(
    "Try this cake. I baked it myself.",
    do_sample=True,
    top_k=0,
    max_length=50,
    temperature=0.7
)

Output:
["Can you post the recipe? I'm really interested in trying it."]

Now we obtain a response that connects better with the prompt. However, random sampling sometimes produces errors. Fan et al. (2018) point out that for words like can’t, which is tokenized into ca and n’t, the model may produce the first token but miss the second. Random sampling can also hurt longer generations, since sampling interrupts coherence across consecutive sentences.

3. Top-K Sampling

This strategy is employed by GPT-2 and improves story generation. The K most likely next words are filtered and become the sampling pool, which keeps very unlikely tokens out of the pool and improves the quality of the model’s generations.

generator(
    "Try this cake. I baked it myself.",
    do_sample=True,
    top_k=50,
    max_length=50
)

Output:
["I had to google what a dutch oven is because i couldn't figure it out. I'm not the expert. Who knew?"]

This response sounds even more natural than the response we obtained by random sampling.

4. Nucleus Sampling

Top-K sampling always draws from a pool of exactly K candidate tokens. However, the probability distribution over the next word may vary from sharp distributions (left side of the image below) to much flatter ones.

Source: How to generate text: using different decoding methods for language generation with Transformers | Patrick von Platen

With a fixed K, flatter distributions may have reasonable candidates cut out of the pool, while sharper distributions may let low-probability tokens in, hurting the naturalness of the generated sentence. Holtzman et al. (2019) introduced Nucleus (Top-p) Sampling, where the model samples from the smallest possible set of words whose cumulative probability exceeds a predefined value p.

We can generate using this strategy as follows

generator(
    "Try this cake. I baked it myself.",
    do_sample=True,
    top_p=0.9,
    top_k=0,
    max_length=50
)

Output:
['Then my work here is done.']

Conclusion

That’s it! We have examined greedy search and four alternative decoding strategies you can use to generate text, along with what to watch out for when working with each one. Still, this is an open research area, so it is best to experiment with them all and decide which works best for your use case.

Resources

  1. Colab Notebook
  2. Hierarchical Neural Story Generation | Fan et al. (2018)
  3. The Curious Case of Text Degeneration | Holtzman et al. (2019)
