Benchmarking Conversational AIs

Jesús Seijas
Published in Chatbots Life · Jul 27, 2019 · 5 min read


Introduction

When starting the development of a chatbot, one of the critical decisions is choosing which conversational artificial intelligence to use, and to make that decision we have to evaluate how the candidates behave. One of the most relevant papers on this topic is SIGDIAL22, which proposes 3 corpora (sets of data) to test them. The problem is that the amount of data used in that study is far from what a real bot handles. A description of the three corpora:

  • Chatbot: trained with 100 sentences classified into 2 intents.
  • Ask Ubuntu: trained with 53 sentences classified into 5 intents.
  • Web Applications: trained with 30 sentences classified into 8 intents.

But how do these conversational AIs work in a real situation?

We have evaluated them using a real chatbot with 854 sentences (in English) classified into 126 intents, or 127 if we count the intent “None”, which means the input sentence does not match any of the training intents. We also used a test dataset with 82 sentences asked to the bot by real users, none of which are part of the training set.
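To make the procedure concrete, here is a minimal sketch of how these error counts can be computed. It is an illustration only: the `LabeledSentence` shape and the `classify` callback are assumptions standing in for whichever provider API is being tested (LUIS, Dialogflow, Watson Assistant, SAP Conversational AI, or NLP.js), not code from the benchmark itself.

```typescript
// Minimal evaluation sketch. The dataset shape and classify() callback are
// hypothetical stand-ins for each provider's real API.
interface LabeledSentence {
  text: string;   // the user utterance
  intent: string; // expected intent, or "None" if it should not match anything
}

// A provider call that returns the predicted intent for a sentence.
type Classifier = (text: string) => Promise<string>;

// Count how many sentences are not classified to the expected intent.
async function countErrors(
  dataset: LabeledSentence[],
  classify: Classifier,
): Promise<number> {
  let errors = 0;
  for (const sample of dataset) {
    const predicted = await classify(sample.text);
    if (predicted !== sample.intent) {
      errors += 1;
    }
  }
  return errors;
}

// Usage: run once on the 854 training sentences and once on the 82 test sentences.
// const trainErrors = await countErrors(trainingSet, classifyWithProvider);
// const testErrors  = await countErrors(testSet, classifyWithProvider);
```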

Evaluated systems

The evaluated systems are Microsoft LUIS, Google Dialogflow, IBM Watson Assistant, SAP Conversational AI and NLP.js.

We also wanted to evaluate Amazon Lex, but its limit of 100 intents per bot means that we cannot compare its results with the other providers: for the tests we had to remove 26 intents and test it with only 723 sentences out of the total of 854.

Results from the Training Sentences

The first evaluation, after training with those 854 sentences classified into 126 intents, is to check every sentence of this training data, where we expect the system to return the same intent that the sentence was trained with.

These are the error counts (sentences not classified to the correct intent) on the training sentences:

  • Microsoft LUIS: fails to predict the intent of 20 sentences
  • Google Dialogflow: fails to predict the intent of 27 sentences
  • IBM Watson Assistant: fails to predict the intent of 27 sentences
  • SAP Conversational AI: fails to predict the intent of 10 sentences
  • NLP.js: fails to predict the intent of 1 sentence

Results of the Tests

The test data contains 82 sentences, said by real users, that are not in the training set, so the conversational AIs have never seen them but must still match them to the correct intent. This is the error count for each one:

  • Microsoft LUIS: fails to predict the intent of 18 sentences
  • Google Dialogflow: fails to predict the intent of 19 sentences
  • IBM Watson Assistant: fails to predict the intent of 6 sentences
  • SAP Conversational AI: fails to predict the intent of 20 sentences
  • NLP.js: fails to predict the intent of 3 sentences

Total Results

These are the total results for each one; a short snippet converting them to accuracy percentages follows the list:

  • Microsoft LUIS: fails to predict the intent of 20 training sentences and 18 test sentences
  • Google Dialogflow: fails to predict the intent of 27 training sentences and 19 test sentences
  • IBM Watson Assistant: fails to predict the intent of 27 training sentences and 6 test sentences
  • SAP Conversational AI: fails to predict the intent of 10 training sentences and 20 test sentences
  • NLP.js: fails to predict the intent of 1 training sentence and 3 test sentences
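To put these error counts in perspective, they can be turned into accuracy percentages over the 854 training sentences and 82 test sentences. The snippet below is just that arithmetic applied to the numbers reported above:

```typescript
// Convert the reported error counts into accuracy percentages.
const TRAIN_TOTAL = 854;
const TEST_TOTAL = 82;

const results: Record<string, { trainErrors: number; testErrors: number }> = {
  'Microsoft LUIS':        { trainErrors: 20, testErrors: 18 },
  'Google Dialogflow':     { trainErrors: 27, testErrors: 19 },
  'IBM Watson Assistant':  { trainErrors: 27, testErrors: 6 },
  'SAP Conversational AI': { trainErrors: 10, testErrors: 20 },
  'NLP.js':                { trainErrors: 1,  testErrors: 3 },
};

for (const [name, r] of Object.entries(results)) {
  const trainAcc = (100 * (TRAIN_TOTAL - r.trainErrors)) / TRAIN_TOTAL;
  const testAcc = (100 * (TEST_TOTAL - r.testErrors)) / TEST_TOTAL;
  console.log(`${name}: train ${trainAcc.toFixed(1)}%, test ${testAcc.toFixed(1)}%`);
}
// For example, NLP.js comes out at roughly 99.9% on training and 96.3% on test,
// while IBM Watson Assistant is roughly 96.8% on training and 92.7% on test.
```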

Conclusions

Looking at the total results above, we can see that SAP Conversational AI is very good at guessing intents from the training data, but not so good at generalizing to sentences the AI has never seen. This is called “overfitting”: the model fits the training data too closely and does not work as well with other data.

On the other hand, Watson Assistant is in the opposite situation: it is very good at generalizing to other data, but not so good with the data used in training.

And then we have Microsoft LUIS and Google Dialogflow, which are quite balanced, but on average worse than the other competitors.

As I explained in my previous experiment, NLP.js performs quite well with the SIGDIAL22 corpora, and as we can see in the results above, it also shows strong results in a real chatbot scenario.

NLP.js is an open-source project that is slowly but steadily gaining traction in the development community. The latest updates have improved its performance quite significantly. If you would like to help improve NLP.js, it is open to contributions, and if you want to use it for your own projects, I would love to hear how it performs.
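For readers who want to try it, here is a minimal sketch of training and querying an intent classifier with NLP.js through the node-nlp package. The intents and sentences are illustrative, not taken from the benchmark dataset, and the API usage reflects my understanding of the current NlpManager interface:

```typescript
// Minimal NLP.js sketch using the node-nlp package.
// The intents and sentences below are illustrative examples only.
import { NlpManager } from 'node-nlp';

async function main() {
  const manager = new NlpManager({ languages: ['en'] });

  // Add training sentences classified into intents, as in the benchmark.
  manager.addDocument('en', 'where can I see my invoices', 'billing.invoices');
  manager.addDocument('en', 'show me my last bill', 'billing.invoices');
  manager.addDocument('en', 'I want to talk to a human', 'support.agent');

  await manager.train();

  // Query with a sentence the model has never seen.
  const result = await manager.process('en', 'can I get a copy of my bill?');
  console.log(result.intent, result.score); // expected: billing.invoices with a confidence score
}

main();
```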


New Tech Team Lead at AXA Group Operations. Chatbot and voice agent expert, AI advocate.