Ultimate Guide to Leveraging NLP & Machine Learning for your Chatbot

Stefan Kojouharov
Published in Chatbots Life
24 min read · Sep 18, 2016

Code Snippets and Github Included


For the Love of Chatbots

Over the past few months I have been collecting the best resources on NLP and how to apply NLP and Deep Learning to Chatbots.

Every once in a while, I would run across an exceptional piece of content, and I quickly started putting together a master list. Soon I found myself sharing this list and some of the most useful articles with developers and other people in the bot community.

In the process, my list became a guide, and after some urging, I have decided to share it (or at least a condensed version, for length reasons).

This guide is mostly based on the work done by Denny Britz who has done a phenomenal job exploring the depths of Deep Learning for Bots. Code Snippets and Github included!

Without further ado… Let us Begin!

DEEP LEARNING FOR CHATBOTS OVERVIEW


Chatbots are a hot topic. Many companies are hoping to develop bots that can hold natural conversations indistinguishable from human ones, and many claim to be using NLP and Deep Learning techniques to make this possible. But with all the hype around AI, it’s sometimes difficult to tell fact from fiction.

In this series I want to go over some of the Deep Learning techniques that are used to build conversational agents, starting off by explaining where we are right now, what’s possible, and what will stay nearly impossible for at least a little while.

If you find this article interesting you can let me know here.

A TAXONOMY OF MODELS

RETRIEVAL-BASED VS. GENERATIVE MODELS

Retrieval-based models (easier) use a repository of predefined responses and some kind of heuristic to pick an appropriate response based on the input and context. The heuristic could be as simple as a rule-based expression match, or as complex as an ensemble of Machine Learning classifiers. These systems don’t generate any new text, they just pick a response from a fixed set.
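
To make the retrieval-based idea concrete, here is a minimal sketch of the simplest possible heuristic, a rule-based expression match over a fixed response table. The patterns, responses, and fallback text are made up for illustration; real systems use far richer heuristics or ensembles of ML classifiers to rank candidates.

import re

# Hypothetical keyword patterns mapped to canned responses (made up for illustration).
RESPONSES = {
    r"\bhours?\b|\bopen\b": "We are open from 9am to 5pm, Monday to Friday.",
    r"\brefund\b|\breturn\b": "You can request a refund within 30 days of purchase.",
}
FALLBACK = "Sorry, I didn't catch that. Could you rephrase?"

def retrieve_response(user_input):
    # Pick the first predefined response whose pattern matches the input;
    # no new text is ever generated.
    for pattern, response in RESPONSES.items():
        if re.search(pattern, user_input.lower()):
            return response
    return FALLBACK

print(retrieve_response("What are your opening hours?"))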

Generative models (harder) don’t rely on pre-defined responses. They generate new responses from scratch. Generative models are typically based on Machine Translation techniques, but instead of translating from one language to another, we “translate” from an input to an output (response).


Both approaches have some obvious pros and cons. Due to the repository of handcrafted responses, retrieval-based methods don’t make grammatical mistakes. However, they may be unable to handle unseen cases for which no appropriate predefined response exists. For the same reasons, these models can’t refer back to contextual entity information like names mentioned earlier in the conversation. Generative models are “smarter”. They can refer back to entities in the input and give the impression that you’re talking to a human. However, these models are hard to train, are quite likely to make grammatical mistakes (especially on longer sentences), and typically require huge amounts of training data.

Deep Learning techniques can be used for both retrieval-based and generative models, but research seems to be moving in the generative direction. Deep Learning architectures like Sequence to Sequence are uniquely suited for generating text, and researchers are hoping to make rapid progress in this area. However, we’re still at the early stages of building generative models that work reasonably well. Production systems are more likely to be retrieval-based for now.

LONG VS. SHORT CONVERSATIONS


The longer the conversation, the more difficult it is to automate. On one side of the spectrum are Short-Text Conversations (easier), where the goal is to create a single response to a single input. For example, you may receive a specific question from a user and reply with an appropriate answer. Then there are long conversations (harder), where you go through multiple turns and need to keep track of what has been said. Customer support conversations are typically long conversational threads with multiple questions.

OPEN DOMAIN VS. CLOSED DOMAIN

[Image: Chatbot Conversation Framework]

In an open domain (harder) setting the user can take the conversation anywhere. There isn’t necessarily a well-defined goal or intention. Conversations on social media sites like Twitter and Reddit are typically open domain; they can go in all kinds of directions. The infinite number of topics and the fact that a certain amount of world knowledge is required to create reasonable responses make this a hard problem.

“Open Domain: I can ask a question about any topic… and expect a relevant response. (Harder) Think of a long conversation around refinancing my mortgage where I could ask anything.” Mark Clark

In a closed domain (easier) setting the space of possible inputs and outputs is somewhat limited because the system is trying to achieve a very specific goal. Technical Customer Support or Shopping Assistants are examples of closed domain problems. These systems don’t need to be able to talk about politics, they just need to fulfill their specific task as efficiently as possible. Sure, users can still take the conversation anywhere they want, but the system isn’t required to handle all these cases — and the users don’t expect it to.

“Closed Domain: You can ask a limited set of questions on specific topics. (Easier). What is the Weather in Miami?”

“Square 1 is a great first step for a chatbot because it is contained, may not require the complexity of smart machines and can deliver both business and user value.

Square 2, questions are asked and the Chatbot has smart machine technology that generates responses. Generated responses allow the Chatbot to handle both the common questions and some unforeseen cases for which there are no predefined responses. The smart machine can handle longer conversations and appear to be more human-like. But generative response increases complexity, often by a lot.

The way we get around this problem in the contact center today is when there is an unforeseen case for which there is no predefined responses in self-service, we pass the call to an agent.” Mark Clark

COMMON CHALLENGES

There are some obvious and not-so-obvious challenges when building conversational agents, most of which are active research areas.

INCORPORATING CONTEXT


To produce sensible responses, systems may need to incorporate both linguistic context and physical context. In long dialogs people keep track of what has been said and what information has been exchanged. That’s an example of linguistic context. The most common approach is to embed the conversation into a vector, but doing that with long conversations is challenging. Experiments in Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models and Attention with Intention for a Neural Network Conversation Model both go in that direction. One may also need to incorporate other kinds of contextual data such as date/time, location, or information about a user.

COHERENT PERSONALITY


When generating responses, the agent should ideally produce consistent answers to semantically identical inputs. For example, you want to get the same reply to “How old are you?” and “What is your age?”. This may sound simple, but incorporating such fixed knowledge or “personality” into models is very much a research problem. Many systems learn to generate linguistically plausible responses, but they are not trained to generate semantically consistent ones. Usually that’s because they are trained on a lot of data from multiple different users. Models like the one in A Persona-Based Neural Conversation Model are taking first steps toward explicitly modeling a personality.

EVALUATION OF MODELS

The ideal way to evaluate a conversational agent is to measure whether or not it is fulfilling its task, e.g. solving a customer support problem, in a given conversation. But such labels are expensive to obtain because they require human judgment and evaluation. Sometimes there is no well-defined goal, as is the case with open-domain models. Common metrics such as BLEU, which are used for Machine Translation and are based on text matching, aren’t well suited because sensible responses can contain completely different words or phrases. In fact, in How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation, researchers find that none of the commonly used metrics really correlate with human judgment.


INTENTION AND DIVERSITY

A common problem with generative systems is that they tend to produce generic responses like “That’s great!” or “I don’t know” that work for a lot of input cases. Early versions of Google’s Smart Reply tended to respond with “I love you” to almost anything. That’s partly a result of how these systems are trained, both in terms of data and in terms of actual training objective/algorithm. Some researchers have tried to artificially promote diversity through various objective functions. However, humans typically produce responses that are specific to the input and carry an intention. Because generative systems (and particularly open-domain systems) aren’t trained to have specific intentions they lack this kind of diversity.

HOW WELL DOES IT ACTUALLY WORK?

Given all the cutting edge research right now, where are we and how well do these systems actually work? Let’s consider our taxonomy again. A retrieval-based open domain system is obviously impossible because you can never handcraft enough responses to cover all cases. A generative open-domain system is almost Artificial General Intelligence (AGI) because it needs to handle all possible scenarios. We’re very far away from that as well (but a lot of research is going on in that area).

This leaves us with problems in restricted domains where both generative and retrieval based methods are appropriate. The longer the conversations and the more important the context, the more difficult the problem becomes.

In a recent interview, Andrew Ng, now chief scientist of Baidu, puts it well:

Most of the value of deep learning today is in narrow domains where you can get a lot of data. Here’s one example of something it cannot do: have a meaningful conversation. There are demos, and if you cherry-pick the conversation, it looks like it’s having a meaningful conversation, but if you actually try it yourself, it quickly goes off the rails.

Many companies start off by outsourcing their conversations to human workers and promise that they can “automate” it once they’ve collected enough data. That’s likely to happen only if they are operating in a pretty narrow domain — like a chat interface to call an Uber for example. Anything that’s a bit more open domain (like sales emails) is beyond what we can currently do. However, we can also use these systems to assist human workers by proposing and correcting responses. That’s much more feasible.

Grammatical mistakes in production systems are very costly and may drive away users. That’s why most systems are probably best off using retrieval-based methods that are free of grammatical errors and offensive responses. If companies can somehow get their hands on huge amounts of data then generative models become feasible — but they must be assisted by other techniques to prevent them from going off the rails like Microsoft’s Tay did.

IMPLEMENTING A RETRIEVAL-BASED MODEL IN TENSORFLOW


The code and data for this tutorial are on Github.

RETRIEVAL-BASED BOTS

The vast majority of production systems today are retrieval-based, or a combination of retrieval-based and generative. Google’s Smart Reply is a good example. Generative models are an active area of research, but we’re not quite there yet. If you want to build a conversational agent today your best bet is most likely a retrieval-based model.

If you want me to write more articles like this, please let me know here.

THE UBUNTU DIALOG CORPUS

In this post we’ll work with the Ubuntu Dialog Corpus (paper, github). The Ubuntu Dialog Corpus (UDC) is one of the largest public dialog datasets available. It’s based on chat logs from the Ubuntu channels on a public IRC network. The paper goes into detail on how exactly the corpus was created, so I won’t repeat that here. However, it’s important to understand what kind of data we’re working with, so let’s do some exploration first.

The training data consists of 1,000,000 examples, 50% positive (label 1) and 50% negative (label 0). Each example consists of a context, the conversation up to this point, and an utterance, a response to the context. A positive label means that an utterance was an actual response to a context, and a negative label means that the utterance wasn’t — it was picked randomly from somewhere in the corpus. Here is some sample data.
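
If you want to sanity-check the label balance yourself, a quick pandas session does it. This is a sketch: the file path is illustrative, and the Context/Utterance/Label column names are assumed from how the data frames are used later in this post.

import pandas as pd

# Load the training split of the Ubuntu Dialog Corpus (path is illustrative).
train_df = pd.read_csv("data/train.csv")

print(train_df.columns.tolist())      # expect something like ['Context', 'Utterance', 'Label']
print(train_df.Label.value_counts())  # should be roughly 500k ones and 500k zeros
print(len(train_df))                  # ~1,000,000 examples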

[Image: Training your Chatbot]

Note that the dataset generation script has already done a bunch of preprocessing for us: it has tokenized, stemmed, and lemmatized the output using the NLTK tool. The script also replaced entities like names, locations, organizations, URLs, and system paths with special tokens. This preprocessing isn’t strictly necessary, but it’s likely to improve performance by a few percent. The average context is 86 words long and the average utterance is 17 words long. Check out the Jupyter notebook to see the data analysis.
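
To get a feel for what that preprocessing does, here is a rough sketch using NLTK directly. This is not the dataset’s generation script, just an illustration of tokenizing and lemmatizing a sentence; the entity replacement step is approximated with a single made-up special token.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
sentence = "I installed Ubuntu on /dev/sda1 and the drivers stopped working"

tokens = nltk.word_tokenize(sentence.lower())
# Roughly mimic the entity replacement step with a made-up special token for paths.
tokens = ["__path__" if t.startswith("/") else t for t in tokens]
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)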

The data set comes with test and validation sets. Their format is different from that of the training data. Each record in the test/validation set consists of a context, a ground truth utterance (the real response) and 9 incorrect utterances called distractors. The goal of the model is to assign the highest score to the true utterance and lower scores to the wrong utterances.

There are various ways to evaluate how well our model does. A commonly used metric is recall@k: we let the model pick the k best responses out of the 10 possible responses (1 true and 9 distractors). If the correct one is among the picked responses, we mark that test example as correct. So a larger k means the task becomes easier. If we set k=10 we get a recall of 100% because we only have 10 responses to pick from. If we set k=1 the model has only one chance to pick the right response.

At this point you may be wondering how the 9 distractors were chosen. In this data set the 9 distractors were picked at random. However, in the real world you may have millions of possible responses and you don’t know which one is correct. You can’t possibly evaluate a million potential responses to pick the one with the highest score — that’d be too expensive. Google’s Smart Reply uses clustering techniques to come up with a set of possible responses to choose from first. Or, if you only have a few hundred potential responses in total you could just evaluate all of them.

BASELINES

Before starting with fancy Neural Network models let’s build some simple baseline models to help us understand what kind of performance we can expect. We’ll use the following function to evaluate our recall@k metric:

def evaluate_recall(y, y_test, k=1):
    num_examples = float(len(y))
    num_correct = 0
    for predictions, label in zip(y, y_test):
        if label in predictions[:k]:
            num_correct += 1
    return num_correct / num_examples

Here, y is a list of our predictions sorted by score in descending order, and y_test is the actual label. For example, a y of [0,3,1,2,5,6,4,7,8,9] would mean that utterance number 0 got the highest score and utterance 9 got the lowest score. Remember that we have 10 utterances for each test example, and the first one (index 0) is always the correct one because the utterance column comes before the distractor columns in our data.

Intuitively, a completely random predictor should get a score of 10% for recall@1, a score of 20% for recall@2, and so on. Let’s see if that’s the case.

# Random Predictor
def predict_random(context, utterances):
    return np.random.choice(len(utterances), 10, replace=False)

# Evaluate Random predictor
y_random = [predict_random(test_df.Context[x], test_df.iloc[x, 1:].values) for x in range(len(test_df))]
y_test = np.zeros(len(y_random))
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y_random, y_test, n)))

Recall @ (1, 10): 0.0937632
Recall @ (2, 10): 0.194503
Recall @ (5, 10): 0.49297
Recall @ (10, 10): 1

Great, that seems to work. Of course we don’t just want a random predictor. Another baseline discussed in the original paper is a tf-idf predictor. tf-idf stands for “term frequency - inverse document frequency” and measures how important a word in a document is relative to the whole corpus. Without going into too much detail (you can find many tutorials about tf-idf on the web), documents with similar content will have similar tf-idf vectors. Intuitively, if a context and a response have similar words they are more likely to be a correct pair, or at least more likely than a random pair. Many libraries out there (such as scikit-learn) come with built-in tf-idf functions, so it’s very easy to use. Let’s build a tf-idf predictor and see how well it performs.

class TFIDFPredictor:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def train(self, data):
        self.vectorizer.fit(np.append(data.Context.values, data.Utterance.values))

    def predict(self, context, utterances):
        # Convert context and utterances into tf-idf vectors
        vector_context = self.vectorizer.transform([context])
        vector_doc = self.vectorizer.transform(utterances)
        # The dot product measures the similarity of the resulting vectors
        result = np.dot(vector_doc, vector_context.T).todense()
        result = np.asarray(result).flatten()
        # Sort by top results and return the indices in descending order
        return np.argsort(result, axis=0)[::-1]

# Evaluate TFIDF predictor
pred = TFIDFPredictor()
pred.train(train_df)
y = [pred.predict(test_df.Context[x], test_df.iloc[x, 1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y, y_test, n)))

Recall @ (1, 10): 0.495032
Recall @ (2, 10): 0.596882
Recall @ (5, 10): 0.766121
Recall @ (10, 10): 1

We can see that the tf-idf model performs significantly better than the random model. It’s far from perfect though. The assumptions we made aren’t that great. First of all, a response doesn’t necessarily need to be similar to the context to be correct. Secondly, tf-idf ignores word order, which can be an important signal. With a Neural Network model we can do a bit better.

DUAL ENCODER LSTM

The Deep Learning model we will build in this post is called a Dual Encoder LSTM network. This type of network is just one of many we could apply to this problem and it’s not necessarily the best one. You can come up with all kinds of Deep Learning architectures that haven’t been tried yet — it’s an active research area. For example, the seq2seq model often used in Machine Translation would probably do well on this task. The reason we are going for the Dual Encoder is because it has been reported to give decent performance on this data set. This means we know what to expect and can be sure that our implementation is correct. Applying other models to this problem would be an interesting project.

The Dual Encoder LSTM we’ll build looks like this (paper):

[Image: Dual Encoder LSTM]

It roughly works as follows:

  1. Both the context and the response text are split by words, and each word is embedded into a vector. The word embeddings are initialized with Stanford’s GloVe vectors and are fine-tuned during training (Side note: This is optional and not shown in the picture. I found that initializing the word embeddings with GloVe did not make a big difference in terms of model performance).
  2. Both the embedded context and response are fed into the same Recurrent Neural Network word-by-word. The RNN generates a vector representation that, loosely speaking, captures the “meaning” of the context and response (c and r in the picture). We can choose how large these vectors should be, but let’s say we pick 256 dimensions.
  3. We multiply c with a matrix M to “predict” a response r’. If c is a 256-dimensional vector, then M is a 256×256 dimensional matrix, and the result is another 256-dimensional vector, which we can interpret as a generated response. The matrix M is learned during training.
  4. We measure the similarity of the predicted response r’ and the actual response r by taking the dot product of these two vectors. A large dot product means the vectors are similar and that the response should receive a high score. We then apply a sigmoid function to convert that score into a probability. Note that steps 3 and 4 are combined in the figure.

To train the network, we also need a loss (cost) function. We’ll use the binary cross-entropy loss common for classification problems. Let’s call our true label for a context-response pair y. This can be either 1 (actual response) or 0 (incorrect response). Let’s call our predicted probability from step 4 above y'. Then, the cross-entropy loss is calculated as L = −y * ln(y') − (1 − y) * ln(1 − y'). The intuition behind this formula is simple. If y=1 we are left with L = −ln(y'), which penalizes a prediction far away from 1, and if y=0 we are left with L = −ln(1 − y'), which penalizes a prediction far away from 0.
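
Here are steps 3 and 4 plus the loss written out in plain numpy for a single context-response pair. This is just a sketch to make the formula concrete: the dimensions and the random vectors are made up, and the real model computes all of this over batches inside TensorFlow.

import numpy as np

dim = 256
c = np.random.randn(dim)        # encoded context (step 2)
r = np.random.randn(dim)        # encoded response (step 2)
M = np.random.randn(dim, dim)   # learned prediction matrix (step 3)

r_predicted = c.dot(M)          # step 3: "predict" a response
score = r_predicted.dot(r)      # step 4: dot-product similarity
y_prob = 1.0 / (1.0 + np.exp(-score))     # sigmoid turns the score into a probability
y_prob = np.clip(y_prob, 1e-7, 1 - 1e-7)  # avoid log(0) with these random values

y = 1  # true label: 1 if r was the actual response, 0 otherwise
loss = -y * np.log(y_prob) - (1 - y) * np.log(1 - y_prob)
print(loss)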

For our implementation we’ll use a combination of numpy, pandas, Tensorflow and TF Learn (a collection of high-level convenience functions for Tensorflow).

DATA PREPROCESSING

The dataset originally comes in CSV format. We could work directly with CSVs, but it’s better to convert our data into Tensorflow’s proprietary Example format. (Quick side note: There’s also tf.SequenceExample but it doesn’t seem to be supported by tf.learn yet). The main benefit of this format is that it allows us to load tensors directly from the input files and let Tensorflow handle all the shuffling, batching and queuing of inputs. As part of the preprocessing we also create a vocabulary. This means we map each word to an integer number, e.g. “cat” may become 2631. The TFRecord files we will generate store these integer numbers instead of the word strings. We will also save the vocabulary so that we can map back from integers to words later on.

Each Example contains the following fields:

  • context: A sequence of word ids representing the context text, e.g. [231, 2190, 737, 0, 912]
  • context_len: The length of the context, e.g. 5 for the above example
  • utterance: A sequence of word ids representing the utterance (response)
  • utterance_len: The length of the utterance
  • label: Only in the training data. 0 or 1.
  • distractor_[N]: Only in the test/validation data. N ranges from 0 to 8. A sequence of word ids representing the distractor utterance.
  • distractor_[N]_len: Only in the test/validation data. N ranges from 0 to 8. The length of the utterance.
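
To make the field list above concrete, here is a rough sketch of how one training record could be assembled and written. The word ids are made up, and this is not the actual generation script, which also handles the vocabulary and the validation/test splits.

import tensorflow as tf

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

# Made-up word ids for one context/utterance pair.
context_ids = [231, 2190, 737, 0, 912]
utterance_ids = [12, 88, 4052]

example = tf.train.Example(features=tf.train.Features(feature={
    "context": _int64_feature(context_ids),
    "context_len": _int64_feature([len(context_ids)]),
    "utterance": _int64_feature(utterance_ids),
    "utterance_len": _int64_feature([len(utterance_ids)]),
    "label": _int64_feature([1]),
}))

writer = tf.python_io.TFRecordWriter("train.tfrecords")
writer.write(example.SerializeToString())
writer.close()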

The preprocessing is done by the prepare_data.py Python script, which generates 3 files: train.tfrecords, validation.tfrecords and test.tfrecords. You can run the script yourself or download the data files here.

CREATING AN INPUT FUNCTION

In order to use Tensorflow’s built-in support for training and evaluation we need to create an input function — a function that returns batches of our input data. In fact, because our training and test data have different formats, we need different input functions for them. The input function should return a batch of features and labels (if available). Something along the lines of:

def input_fn():
    # TODO Load and preprocess data here
    return batched_features, labels

Because we need different input functions during training and evaluation, and because we hate code duplication, we create a wrapper called create_input_fn that creates an input function for the appropriate mode. It also takes a few other parameters. Here’s the definition we’re using:

def create_input_fn(mode, input_files, batch_size, num_epochs=None):
    def input_fn():
        # TODO Load and preprocess data here
        return batched_features, labels
    return input_fn

The complete code can be found in udc_inputs.py. On a high level, the function does the following:

  1. Create a feature definition that describes the fields in our Example file
  2. Read records from the input_files with tf.TFRecordReader
  3. Parse the records according to the feature definition
  4. Extract the training labels
  5. Batch multiple examples and training labels
  6. Return the batched examples and training labels
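
A minimal sketch of what those steps can look like with the queue-based input pipeline of that TensorFlow era follows. This is a simplification of the training-mode case only, not the actual udc_inputs.py: the padding lengths are made up, and it assumes the sequences were padded to a fixed size when the TFRecords were written.

import tensorflow as tf

MAX_CONTEXT_LEN = 160    # made-up padding lengths; assumes the sequences were
MAX_UTTERANCE_LEN = 80   # padded to a fixed size when the TFRecords were written

def create_input_fn_sketch(input_files, batch_size):
    def input_fn():
        # 1. Feature definition describing the fields in each Example
        features_def = {
            "context": tf.FixedLenFeature([MAX_CONTEXT_LEN], tf.int64),
            "context_len": tf.FixedLenFeature([], tf.int64),
            "utterance": tf.FixedLenFeature([MAX_UTTERANCE_LEN], tf.int64),
            "utterance_len": tf.FixedLenFeature([], tf.int64),
            "label": tf.FixedLenFeature([], tf.int64),
        }
        # 2. Read serialized records from the TFRecord files
        filename_queue = tf.train.string_input_producer(input_files)
        reader = tf.TFRecordReader()
        _, serialized = reader.read(filename_queue)
        # 3. Parse the record according to the feature definition
        parsed = tf.parse_single_example(serialized, features_def)
        # 4./5. Batch multiple parsed examples together
        batched = tf.train.batch(parsed, batch_size=batch_size)
        # 6. Split off the training labels and return features and labels
        labels = batched.pop("label")
        return batched, labels
    return input_fn

When run through tf.contrib.learn the required queue runners are started for you; if you execute a pipeline like this by hand you would also need to start them with tf.train.start_queue_runners.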

DEFINING EVALUATION METRICS

We already mentioned that we want to use the recall@k metric to evaluate our model. Luckily, Tensorflow already comes with many standard evaluation metrics that we can use, including recall@k. To use these metrics we need to create a dictionary that maps from a metric name to a function that takes the predictions and label as arguments:

def create_evaluation_metrics():
    eval_metrics = {}
    for k in [1, 2, 5, 10]:
        eval_metrics["recall_at_%d" % k] = functools.partial(
            tf.contrib.metrics.streaming_sparse_recall_at_k,
            k=k)
    return eval_metrics

Above, we use functools.partial to convert a function that takes 3 arguments to one that only takes 2 arguments. Don’t let the name streaming_sparse_recall_at_k confuse you. Streaming just means that the metric is accumulated over multiple batches, and sparse refers to the format of our labels.

This brings us to an important point: what exactly is the format of our predictions during evaluation? During training, we predict the probability of the example being correct. But during evaluation our goal is to score the utterance and 9 distractors and pick the best one; we don’t simply predict correct/incorrect. This means that during evaluation each example should result in a vector of 10 scores, e.g. [0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11], where the scores correspond to the true response and the 9 distractors respectively. Each utterance is scored independently, so the probabilities don’t need to add up to 1. Because the true response is always element 0 in the array, the label for each example is 0. The example above would be counted as classified incorrectly by recall@1 because the third distractor got a probability of 0.45 while the true response only got 0.34. It would be scored as correct by recall@2, however.


BOILERPLATE TRAINING CODE

Before writing the actual neural network code I like to write the boilerplate code for training and evaluating the model. That’s because, as long as you adhere to the right interfaces, it’s easy to swap out what kind of network you are using. Let’s assume we have a model function model_fn that takes as inputs our batched features, labels and mode (train or evaluation) and returns the predictions. Then we can write general-purpose code to train our model as follows:

estimator = tf.contrib.learn.Estimator(
    model_fn=model_fn,
    model_dir=MODEL_DIR,
    config=tf.contrib.learn.RunConfig())

input_fn_train = udc_inputs.create_input_fn(
    mode=tf.contrib.learn.ModeKeys.TRAIN,
    input_files=[TRAIN_FILE],
    batch_size=hparams.batch_size)

input_fn_eval = udc_inputs.create_input_fn(
    mode=tf.contrib.learn.ModeKeys.EVAL,
    input_files=[VALIDATION_FILE],
    batch_size=hparams.eval_batch_size,
    num_epochs=1)

eval_metrics = udc_metrics.create_evaluation_metrics()

# We need to subclass this manually for now. The next TF version will
# have support for ValidationMonitors with metrics built-in.
# It's already on the master branch.
class EvaluationMonitor(tf.contrib.learn.monitors.EveryN):
    def every_n_step_end(self, step, outputs):
        self._estimator.evaluate(
            input_fn=input_fn_eval,
            metrics=eval_metrics,
            steps=None)

eval_monitor = EvaluationMonitor(every_n_steps=FLAGS.eval_every)
estimator.fit(input_fn=input_fn_train, steps=None, monitors=[eval_monitor])

Here we create an estimator for our model_fn, two input functions for training and evaluation data, and our evaluation metrics dictionary. We also define a monitor that evaluates our model every FLAGS.eval_every steps during training. Finally, we train the model. The training runs indefinitely, but Tensorflow automatically saves checkpoint files in MODEL_DIR, so you can stop the training at any time. A fancier technique would be to use early stopping, which means you automatically stop training when a validation set metric stops improving (i.e. you are starting to overfit). You can see the full code in udc_train.py.
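
If you wanted to roll early stopping by hand, the logic might look something like this generic sketch. It is not part of the actual training script: train_some_steps and evaluate_metric are placeholders for whatever training and evaluation calls you use, and the patience value is arbitrary.

def train_with_early_stopping(train_some_steps, evaluate_metric, patience=3, max_rounds=100):
    # train_some_steps() runs a fixed number of training steps;
    # evaluate_metric() returns e.g. recall@1 on the validation set.
    best = 0.0
    rounds_without_improvement = 0
    for _ in range(max_rounds):
        train_some_steps()
        metric = evaluate_metric()
        if metric > best:
            best = metric
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                print("Validation metric stopped improving; stopping early.")
                break
    return best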

Two things I want to mention briefly are the usage of FLAGS and hparams. FLAGS is a way to give command line parameters to the program (similar to Python’s argparse). hparams is a custom object we create in hparams.py that holds the hyperparameters (knobs we can tweak) of our model. This hparams object is given to the model when we instantiate it.
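
For readers unfamiliar with these two mechanisms, here is roughly what the definitions can look like. This is a sketch only: the default values are made up, and the actual flag names and hyperparameter fields in the repository may differ.

import tensorflow as tf
from collections import namedtuple

# Command line flag, similar to argparse (the default value here is made up).
tf.flags.DEFINE_integer("eval_every", 2000, "Evaluate after this many training steps")
FLAGS = tf.flags.FLAGS

# Hyperparameters ("knobs") bundled into one object that gets passed to the model.
HParams = namedtuple("HParams", ["batch_size", "eval_batch_size", "rnn_dim"])
hparams = HParams(batch_size=128, eval_batch_size=16, rnn_dim=256)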

CREATING THE MODEL

Now that we have set up the boilerplate code around inputs, parsing, evaluation and training it’s time to write code for our Dual LSTM neural network. Because we have different formats of training and evaluation data I’ve written a create_model_fn wrapper that takes care of bringing the data into the right format for us. It takes a model_impl argument, which is a function that actually makes predictions. In our case it’s the Dual Encoder LSTM we described above, but we could easily swap it out for some other neural network. Let’s see what that looks like:

def dual_encoder_model(
    hparams,
    mode,
    context,
    context_len,
    utterance,
    utterance_len,
    targets):

    # Initialize embeddings randomly or with pre-trained vectors if available
    embeddings_W = get_embeddings(hparams)

    # Embed the context and the utterance
    context_embedded = tf.nn.embedding_lookup(
        embeddings_W, context, name="embed_context")
    utterance_embedded = tf.nn.embedding_lookup(
        embeddings_W, utterance, name="embed_utterance")

    # Build the RNN
    with tf.variable_scope("rnn") as vs:
        # We use an LSTM Cell
        cell = tf.nn.rnn_cell.LSTMCell(
            hparams.rnn_dim,
            forget_bias=2.0,
            use_peepholes=True,
            state_is_tuple=True)

        # Run the utterance and context through the RNN
        rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
            cell,
            tf.concat(0, [context_embedded, utterance_embedded]),
            sequence_length=tf.concat(0, [context_len, utterance_len]),
            dtype=tf.float32)
        encoding_context, encoding_utterance = tf.split(0, 2, rnn_states.h)

    with tf.variable_scope("prediction") as vs:
        M = tf.get_variable("M",
            shape=[hparams.rnn_dim, hparams.rnn_dim],
            initializer=tf.truncated_normal_initializer())

        # "Predict" a response: c * M
        generated_response = tf.matmul(encoding_context, M)
        generated_response = tf.expand_dims(generated_response, 2)
        encoding_utterance = tf.expand_dims(encoding_utterance, 2)

        # Dot product between generated response and actual response
        # (c * M) * r
        logits = tf.batch_matmul(generated_response, encoding_utterance, True)
        logits = tf.squeeze(logits, [2])

        # Apply sigmoid to convert logits to probabilities
        probs = tf.sigmoid(logits)

        # Calculate the binary cross-entropy loss
        losses = tf.nn.sigmoid_cross_entropy_with_logits(logits, tf.to_float(targets))

    # Mean loss across the batch of examples
    mean_loss = tf.reduce_mean(losses, name="mean_loss")
    return probs, mean_loss

The full code is in dual_encoder.py. Given this, we can now instantiate our model function in the main routine in udc_train.py that we defined earlier.

model_fn = udc_model.create_model_fn(
    hparams=hparams,
    model_impl=dual_encoder_model)

That’s it! We can now run python udc_train.py and it should start training our network, occasionally evaluating recall on our validation data (you can choose how often you want to evaluate using the --eval_every switch). To get a complete list of all available command line flags that we defined using tf.flags and hparams you can run python udc_train.py --help.

INFO:tensorflow:training step 20200, loss = 0.36895 (0.330 sec/batch).
INFO:tensorflow:Step 20201: mean_loss:0 = 0.385877
INFO:tensorflow:training step 20300, loss = 0.25251 (0.338 sec/batch).
INFO:tensorflow:Step 20301: mean_loss:0 = 0.405653
INFO:tensorflow:Results after 270 steps (0.248 sec/batch): recall_at_1 = 0.507581018519, recall_at_2 = 0.689699074074, recall_at_5 = 0.913020833333, recall_at_10 = 1.0, loss = 0.5383

EVALUATING THE MODEL

After you’ve trained the model you can evaluate it on the test set using python udc_test.py --model_dir=$MODEL_DIR_FROM_TRAINING, e.g. python udc_test.py --model_dir=~/github/chatbot-retrieval/runs/1467389151. This will run the recall@k evaluation metrics on the test set instead of the validation set. Note that you must call udc_test.py with the same parameters you used during training. So, if you trained with --embedding_size=128 you need to call the test script with the same.

After training for about 20,000 steps (around an hour on a fast GPU) our model gets the following results on the test set:

recall_at_1 = 0.507581018519
recall_at_2 = 0.689699074074
recall_at_5 = 0.913020833333

While recall@1 is close to our TFIDF model, recall@2 and recall@5 are significantly better, suggesting that our neural network assigns higher scores to the correct answers. The original paper reported 0.55, 0.72 and 0.92 for recall@1, recall@2, and recall@5 respectively, but I haven’t been able to reproduce scores quite as high. Perhaps additional data preprocessing or hyperparameter optimization may bump scores up a bit more.

MAKING PREDICTIONS

You can modify and run udc_predict.py to get probability scores for unseen data. For example python udc_predict.py --model_dir=./runs/1467576365/ outputs:

Context: Example context
Response 1: 0.44806
Response 2: 0.481638

You could imagine feeding in 100 potential responses to a context and then picking the one with the highest score.
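
In code, that selection step could be as simple as the following sketch. The context, candidate responses, and scores are placeholders; in practice each score would come from the trained model (for example via the udc_predict.py logic).

import numpy as np

context = "my ubuntu install hangs at the boot screen"
candidate_responses = [
    "try booting into recovery mode and check the logs",
    "i had pizza for lunch",
    "reinstall your graphics driver",
]

# In practice each score would come from the trained model (one probability per
# candidate); random numbers stand in for the model output here.
scores = np.random.rand(len(candidate_responses))

best = int(np.argmax(scores))
print("Best response:", candidate_responses[best])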

CONCLUSION

In this post we’ve implemented a retrieval-based neural network model that can assign scores to potential responses given a conversation context. There is still a lot of room for improvement, however. One can imagine that other neural networks do better on this task than a dual LSTM encoder. There is also a lot of room for hyperparameter optimization, or improvements to the preprocessing step. The code and data for this tutorial are on Github, so check it out.

Resources:

Denny’s Blogs: http://blog.dennybritz.com/ & http://www.wildml.com/

Mark Clark: https://www.linkedin.com/in/markwclark

Final Word

I hope you have found this condensed NLP guide helpful. I wanted to publish a longer version (imagine if this was 5x longer), but I didn’t want to scare readers away.

As someone who develops the front end of bots (user experience, personality, flow, etc.), I find it extremely helpful to understand the stack and know the technological pros and cons, so that I can effectively design around NLP/NLU limitations. Ultimately, a lot of the issues bots face today (e.g. context) can be designed around effectively.

If you have any suggestions regarding this article and how it can be improved, feel free to drop me a line.

About Stefan

Stefan Kojouharov is a pioneering figure in the AI and chatbot industry, with a rich history of contributing to its evolution since 2016. Through his influential publications, conferences, and workshops, Stefan has been at the forefront of shaping the landscape of conversational AI.

Current Focus: Currently, Stefan is channeling his expertise into developing AI agents within the mental health and wellbeing sector. These projects aim to revolutionize the way we approach wellness, merging cutting-edge AI with human-centric care.

Join the Journey on Substack: For exclusive insights into the development process, breakthrough experiments, and in-depth tutorials, follow Stefan’s journey on Substack. Join a community of forward-thinkers and be a part of the conversation shaping the future of AI in mental health and wellbeing.

Subscribe now to Stefan’s Substack and unlock the full potential of AI-driven transformation.


