Chatbot Challenge #2: Knowledge Extraction

Doris Chi
Published in Chatbots Life
Aug 16, 2019

Photo by Martin Adams on Unsplash

Information is everywhere.

It is common for a large corporation to have thousands of policy, procedure, and FAQ documents, as well as many databases storing information about employees, customers, vendors, tools, and projects.

It is also common for a large company with many products and services to have many user manuals and FAQs. For example, Amazon Web Services offered 165 services as of 2019, each with its own tutorials and FAQs.

Searching for a piece of information, however, can be really hard.

One major reason is that information is stored in many different systems: SharePoint, Confluence pages, ticketing systems, content management systems, customer relationship management systems, forums, someone’s local drive, and so on. Employees and customers don’t know where to find the information they need.

A chatbot can be a good starting place to search for information, but…

Creating knowledge articles automatically from existing documents, or so-called knowledge extraction, can be really challenging.

From the easiest to the hardest, there are three types of knowledge extraction:

1. Load Q&As from FAQ documents into a chatbot’s knowledge base.

FAQ files are documents with organized questions and answers. Usually, a question is immediately followed by its answer. However, getting the pairs of questions and answers out in the desired format is not easy.

First, FAQ files come in different formats: PDF, HTML, XML, TXT, Excel, Word, PowerPoint, etc. (Yes! I have seen people using PowerPoint to create FAQ files.) Second, even if two FAQ files have the same file format, they can have different internal structures. For example, one FAQ file in the HTML format may use <h3> for questions and <p> for answers while another may use <p> for questions and <ol> or <ul> for answers. Third, questions and answers in FAQ files inherit context from the document title, section headers, and previous Q&As. In a chatbot, however, each Q&A is a stand-alone intent or knowledge article, so that context has to be included explicitly.

To automatically extract questions and answers from FAQ documents, we need a tool that can handle different file formats and different internal structures, and that can also add the missing context to each question and answer.
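As a rough illustration of the second and third points, here is a minimal Python sketch that pulls Q&A pairs out of HTML when the caller says which tags hold questions and answers, and that prefixes each question with the most recent page heading as context. The class name `FAQExtractor`, the function `extract_faq`, and the choice of <h1> as the context tag are all assumptions made for this example; it also assumes that question and answer elements are not nested inside one another. Real FAQ pages are messier than this.

```python
from html.parser import HTMLParser


class FAQExtractor(HTMLParser):
    """Collect (question, answer) pairs from HTML using caller-supplied tag names."""

    def __init__(self, question_tag, answer_tags, context_tag="h1"):
        super().__init__()
        self.question_tag = question_tag
        self.answer_tags = set(answer_tags)
        self.context_tag = context_tag
        self.context = ""      # text of the most recent context heading
        self.pairs = []        # list of [question, answer]
        self._role = None      # "context", "question", "answer", or None
        self._open_tag = None  # the tag that started the current role
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._role is not None:
            # Nested element (for example <li> or <b>): keep its text separated.
            self._buffer.append(" ")
            return
        if tag == self.context_tag:
            role = "context"
        elif tag == self.question_tag:
            role = "question"
        elif tag in self.answer_tags:
            role = "answer"
        else:
            return
        self._role, self._open_tag, self._buffer = role, tag, []

    def handle_data(self, data):
        if self._role is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._role is None or tag != self._open_tag:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if self._role == "context":
            self.context = text
        elif self._role == "question":
            # Make the question stand alone by prefixing the section context.
            question = f"{self.context}: {text}" if self.context else text
            self.pairs.append([question, ""])
        elif self._role == "answer" and self.pairs:
            self.pairs[-1][1] = (self.pairs[-1][1] + " " + text).strip()
        self._role, self._open_tag, self._buffer = None, None, []


def extract_faq(html, question_tag, answer_tags, context_tag="h1"):
    parser = FAQExtractor(question_tag, answer_tags, context_tag)
    parser.feed(html)
    parser.close()
    return [tuple(pair) for pair in parser.pairs]


# Two pages with different internal structures, handled by the same extractor.
page_a = "<h1>Shipping</h1><h3>How long does it take?</h3><p>Three to five days.</p>"
page_b = "<h1>Returns</h1><p>What can I return?</p><ul><li>Unopened items</li><li>Gift cards</li></ul>"
print(extract_faq(page_a, "h3", ["p"]))
print(extract_faq(page_b, "p", ["ul", "ol"]))
```

Even in this toy version, someone still has to tell the extractor which tags mean what for each page, which is exactly the manual step a fully automatic tool would need to eliminate.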

Some existing tools can automatically load Q&As from HTML files. For example, Bold360 can import FAQs from a webpage or knowledge articles in ServiceNow into its knowledge base. Dialogflow Knowledge (Beta) can also create new knowledge documents from webpages. However, the import does not always work. For example, https://help.ipsy.com/ has about 136 FAQs, but Dialogflow loaded only 10 of them into the knowledge document.

Creating Knowledge Document in Dialogflow with FAQs from https://help.ipsy.com
Screenshot of https://help.ipsy.com/
The Result of Dialogflow parsing https://help.ipsy.com/

2. Ask questions of databases with chatbots.

Databases store data that are related to each other. For example, a “Product” table might have fields like product_id, name, price, etc. If an end user wants to know the price of a product, it would be really nice to have a chatbot that can query the database and return the answer. However, this is also very hard.

The first challenge is entity recognition. When a user asks “What is the price of product A?”, the chatbot needs to know that “price” and “product A” are two entities. Second, the chatbot needs to know the relation between the entities. In this case, “price” is a property of “product A”. Third, the chatbot needs to know where to find the information and how to query it. For example, product prices can be found in the “Product” table by searching for “product A” in the “name” field and then returning the value in the corresponding “price” field.
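The three steps above can be sketched in a few lines of Python with the standard sqlite3 module. This is only a toy under strong assumptions: the single regular expression `QUESTION_PATTERN`, the whitelist `ALLOWED_FIELDS`, the helper names, and the demo data are all invented for this example, and a real chatbot would need a proper entity recognizer rather than one hard-coded question template.

```python
import re
import sqlite3

# One hard-coded question template: "What is the <field> of <name>?"
QUESTION_PATTERN = re.compile(
    r"what is the (?P<field>\w+) of (?P<name>.+?)\??$", re.IGNORECASE
)
# Only these columns may be queried; this also keeps user text out of the SQL string.
ALLOWED_FIELDS = {"price", "product_id"}


def build_demo_db():
    """Create an in-memory Product table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Product (product_id INTEGER PRIMARY KEY, name TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO Product (name, price) VALUES (?, ?)",
        [("product A", 9.99), ("product B", 12.5)],
    )
    return conn


def answer(conn, question):
    # Step 1: entity recognition (here, just a regex match).
    match = QUESTION_PATTERN.match(question.strip())
    if match is None:
        return None
    # Step 2: the relation is implied by the template: <field> is a property of <name>.
    field = match.group("field").lower()
    name = match.group("name").strip()
    if field not in ALLOWED_FIELDS:
        return None
    # Step 3: look up the row by name and return the requested column.
    row = conn.execute(
        f"SELECT {field} FROM Product WHERE name = ? COLLATE NOCASE", (name,)
    ).fetchone()
    return row[0] if row else None


conn = build_demo_db()
print(answer(conn, "What is the price of product A?"))
```

The product name is passed as a bound parameter and the column name is checked against a whitelist, because text typed by a user should never be pasted directly into a query.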

One way to make this kind of extraction happen is to convert natural language into a query language, such as SQL for relational databases or Cypher for graph databases, or into API calls that use the entities as parameters. For example, Lymba has a “Natural Language query interface” that can turn user questions into database queries. Bold360 also has a function called “provider API”, which can be set up using a Google Doc, a CSV file, Node.js, etc., so that its chatbot can get information from data stored in the “provider API”. I have not yet tested how well these tools perform.

3. Ask questions of documents with chatbots.

If a chatbot can turn unstructured documents into structured data and store them in a database, then the rest of the task is similar to the second type of knowledge extraction. To convert unstructured natural language into structured data, a chatbot needs to understand all the information inside the documents, recognize the entities and the relationships between them, and then store the information logically so that it can be queried later. The chatbot needs to be pre-trained with lexical, syntactic, semantic, and discourse information. All layers of natural language processing are required for this type of knowledge extraction.
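To make “turning text into structured data” concrete, here is a deliberately naive Python sketch that uses one hand-written pattern per relation to pull facts out of sentences and store them by entity. The `PATTERNS` table, the relation names, and the sample text are invented for this example. It skips every layer of real language understanding, which is precisely why this approach does not scale beyond a tiny domain.

```python
import re
from collections import defaultdict

# One hand-written pattern per relation (hypothetical, for illustration only).
PATTERNS = {
    "price": re.compile(r"(?P<subject>[A-Z][\w ]*?) costs \$(?P<value>\d+(?:\.\d+)?)"),
    "maker": re.compile(r"(?P<subject>[A-Z][\w ]*?) is made by (?P<value>[A-Z]\w*)"),
}


def extract_facts(text):
    """Return {entity: {relation: value}} for every pattern that matches a sentence."""
    facts = defaultdict(dict)
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for relation, pattern in PATTERNS.items():
            match = pattern.search(sentence)
            if match:
                subject = match.group("subject").strip()
                facts[subject][relation] = match.group("value")
    return dict(facts)


sample = "Product A costs $9.99. Product A is made by Acme. Product B costs $12.50."
print(extract_facts(sample))
```

Once the facts are in this shape, they could be loaded into a table and queried like any other database. Rephrase a sentence slightly (“You can buy Product A for $9.99”) and the pattern misses it.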

Although this is incredibly challenging, some companies have made progress. For example, Watson, the question-answering system made by IBM, is famous for winning the Jeopardy! quiz show in 2011. Lymba also has a product called K-Extractor that can turn documents into knowledge. However, humans need to define the ontology and syntactic rules used to recognize entities and their relationships in the documents, which requires a lot of time and effort before the chatbot can be useful. Also, this approach may only work in small domains, because the amount of information needed to train chatbots for a general domain is incredibly large.

Final thoughts

There are hundreds of chatbot tools and platforms. However, most of them rely on knowledge managers to manually create knowledge articles and user intents without leveraging existing documents and databases. An end-to-end chatbot tool that can turn existing resources into queryable knowledge would be really helpful.

Thanks for reading and any feedback is appreciated!

I am a full-stack software engineer interested in NLP and ML. Opinions expressed are solely my own and do not represent my employer.