Benchmarking Conversational AIs

Jesús Seijas
Published in Chatbots Life · Jul 27, 2019 · 5 min read


Introduction

When starting the development of a chatbot, one of the critical decisions is choosing which conversational artificial intelligence to use, and to make that decision we have to evaluate how the candidates behave. One of the most relevant papers on this topic is SIGDIAL22, which proposes 3 corpora (sets of data) to test them. The problem is that the amount of data used in that study is far from what a real bot handles. A description of the three corpora:

  • Chatbot: trained with 100 sentences classified into 2 intents.
  • Ask Ubuntu: trained with 53 sentences classified into 5 intents.
  • Web Applications: trained with 30 sentences classified into 8 intents.

But how do these conversational AIs work in a real situation?

We have evaluated them using a real chatbot with 854 sentences (in English) classified into 126 intents, or 127 if we count the intent “None”, which means the input sentence does not match any of the training intents. We also used a test dataset with 82 sentences asked to the bot by real users, none of which are part of the training set.
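To make the procedure concrete, here is a minimal sketch of how these error counts can be computed. It is an illustration only: the `LabeledSentence` shape and the `classify` callback are assumptions standing in for whichever provider API is being tested (LUIS, Dialogflow, Watson Assistant, SAP Conversational AI, or NLP.js), not code from the benchmark itself.

```typescript
// Minimal evaluation sketch. The dataset shape and classify() callback are
// hypothetical stand-ins for each provider's real API.
interface LabeledSentence {
  text: string;   // the user utterance
  intent: string; // expected intent, or "None" if it should not match anything
}

// A provider call that returns the predicted intent for a sentence.
type Classifier = (text: string) => Promise<string>;

// Count how many sentences are not classified to the expected intent.
async function countErrors(
  dataset: LabeledSentence[],
  classify: Classifier,
): Promise<number> {
  let errors = 0;
  for (const sample of dataset) {
    const predicted = await classify(sample.text);
    if (predicted !== sample.intent) {
      errors += 1;
    }
  }
  return errors;
}

// Usage: run once on the 854 training sentences and once on the 82 test sentences.
// const trainErrors = await countErrors(trainingSet, classifyWithProvider);
// const testErrors  = await countErrors(testSet, classifyWithProvider);
```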

Evaluated systems

The evaluated systems are Microsoft LUIS, Google Dialogflow, IBM Watson Assistant, SAP Conversational AI and NLP.js.

We also wanted to evaluate Amazon Lex, but its limit of 100 intents per bot means that we cannot compare its results with the other providers: for the tests we had to remove 26 intents and test it with only 723 sentences out of the total of 854.

Results from the Training Sentences

The first evaluation, after training with those 854 sentences classified into 126 intents, is to check every sentence of this training data, where we expect the system to return the same intent that the sentence was trained with.

These are the error counts (sentences not classified to the correct intent) on the training sentences:

  • Microsoft LUIS: fails to predict the intent of 20 sentences
  • Google Dialogflow: fails to predict the intent of 27 sentences
  • IBM Watson Assistant: fails to predict the intent of 27 sentences
  • SAP Conversational AI: fails to predict the intent of 10 sentences
  • NLP.js: fails to predict the intent of 1 sentence

Results of the Tests

The test data contains 82 sentences, said by real users, that are not in the training set, so the conversational AIs have never seen them but must still match them to the correct intent. This is the error count for each one:

  • Microsoft LUIS: fails to predict the intent of 18 sentences
  • Google Dialogflow: fails to predict the intent of 19 sentences
  • IBM Watson Assistant: fails to predict the intent of 6 sentences
  • SAP Conversational AI: fails to predict the intent of 20 sentences
  • NLP.js: fails to predict the intent of 3 sentences

Total Results

These are the total results for each one; a short snippet converting them to accuracy percentages follows the list:

  • Microsoft LUIS: fails to predict the intent of 20 training sentences and 18 test sentences
  • Google Dialogflow: fails to predict the intent of 27 training sentences and 19 test sentences
  • IBM Watson Assistant: fails to predict the intent of 27 training sentences and 6 test sentences
  • SAP Conversational AI: fails to predict the intent of 10 training sentences and 20 test sentences
  • NLP.js: fails to predict the intent of 1 training sentence and 3 test sentences
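To put these error counts in perspective, they can be turned into accuracy percentages over the 854 training sentences and 82 test sentences. The snippet below is just that arithmetic applied to the numbers reported above:

```typescript
// Convert the reported error counts into accuracy percentages.
const TRAIN_TOTAL = 854;
const TEST_TOTAL = 82;

const results: Record<string, { trainErrors: number; testErrors: number }> = {
  'Microsoft LUIS':        { trainErrors: 20, testErrors: 18 },
  'Google Dialogflow':     { trainErrors: 27, testErrors: 19 },
  'IBM Watson Assistant':  { trainErrors: 27, testErrors: 6 },
  'SAP Conversational AI': { trainErrors: 10, testErrors: 20 },
  'NLP.js':                { trainErrors: 1,  testErrors: 3 },
};

for (const [name, r] of Object.entries(results)) {
  const trainAcc = (100 * (TRAIN_TOTAL - r.trainErrors)) / TRAIN_TOTAL;
  const testAcc = (100 * (TEST_TOTAL - r.testErrors)) / TEST_TOTAL;
  console.log(`${name}: train ${trainAcc.toFixed(1)}%, test ${testAcc.toFixed(1)}%`);
}
// For example, NLP.js comes out at roughly 99.9% on training and 96.3% on test,
// while IBM Watson Assistant is roughly 96.8% on training and 92.7% on test.
```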

Conclusions

Looking at the total results above, we can see that SAP Conversational AI is very good at guessing intents from the training data, but not so good at generalizing to sentences the AI has never seen. This is called “overfitting”: the model fits the training data too closely and does not work as well with other data.

On the other hand, Watson Assistant is in the opposite situation: it is very good at generalizing to other data, but not so good with the data used in training.

And then we have Microsoft LUIS and Google Dialogflow, which are quite balanced, but on average worse than the other competitors.

As I explained in my previous experiment, NLP.js performs quite well with the SIGDIAL22 corpora, and as we can see in the results above, it also shows strong results in a real chatbot scenario.

NLP.js is an open-source project that is slowly but steadily gaining traction in the development community. The latest updates have improved its performance quite significantly. If you would like to help improve NLP.js, it is open to contributions, and if you want to use it for your own projects, I would love to hear how it performs.
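For readers who want to try it, here is a minimal sketch of training and querying an intent classifier with NLP.js through the node-nlp package. The intents and sentences are illustrative, not taken from the benchmark dataset, and the API usage reflects my understanding of the current NlpManager interface:

```typescript
// Minimal NLP.js sketch using the node-nlp package.
// The intents and sentences below are illustrative examples only.
import { NlpManager } from 'node-nlp';

async function main() {
  const manager = new NlpManager({ languages: ['en'] });

  // Add training sentences classified into intents, as in the benchmark.
  manager.addDocument('en', 'where can I see my invoices', 'billing.invoices');
  manager.addDocument('en', 'show me my last bill', 'billing.invoices');
  manager.addDocument('en', 'I want to talk to a human', 'support.agent');

  await manager.train();

  // Query with a sentence the model has never seen.
  const result = await manager.process('en', 'can I get a copy of my bill?');
  console.log(result.intent, result.score); // expected: billing.invoices with a confidence score
}

main();
```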


New Tech Team Lead at AXA Group Operations. Chatbot and voice agent expert, AI advocate.