Creating a chatterbot (Part 1)

Ever since I first heard about the Turing Test, I’ve wanted to make my own chatbot. It started probably twenty-plus years ago, when the only language I could program in was QBASIC. At that time I never got further than:

Hi computer!
Hello.

… and now? …

Current state

Since that first try (aged 10 or something) I’ve never attempted another chatbot. But a couple of days ago I read a news article about the Loebner Prize, an annual competition that pits the best chatbots in the world against human judges, and it sparked my interest again.

I started researching the Loebner Prize winners, and there seem to be three distinct groups/types of chatbots and algorithms:

  1. Template based bots
  2. Crowdsourced bots
  3. Markov Chain bots

Let me quickly describe how they work.

Template based bots

A lot of popular bots (like A.L.I.C.E., winner of the 2000, 2001 and 2004 Loebner Prize) use pre-defined templates. Most of them use an XML-based template language called AIML. This is a short example:

<category>
  <pattern>* YOUR NAME?</pattern>
  <template>My name is Chatbot.</template>
</category>

When somebody talks to the bot, it quickly goes through all the templates looking for matches. This particular pattern will, for example, match:

What is your name?
My name is Chatbot.

These kinds of bots have enormous databases of AIML templates, mostly hand-crafted by skilled bot configurers. Although they work very well, they don’t seem very ‘smart’ or AI-like to me, so I’m not going to make another AIML-style bot.
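To make the matching idea concrete, here is a minimal sketch of how wildcard templates can be matched against input. This is not real AIML (a full engine handles priorities, `<star/>` substitution, recursion, and much more); the class and method names are my own invention for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TemplateBot {
    // Maps AIML-style patterns ("*" = wildcard) to canned replies.
    private final Map<String, String> templates = new LinkedHashMap<>();

    public void addTemplate(String pattern, String reply) {
        templates.put(pattern.toUpperCase(), reply);
    }

    // Returns the reply for the first matching pattern, or a fallback.
    public String respond(String input) {
        String normalized = input.toUpperCase().replaceAll("[^A-Z0-9 ]", "").trim();
        for (Map.Entry<String, String> e : templates.entrySet()) {
            // Turn the "*" wildcard into a regex that matches any words.
            String regex = e.getKey().replaceAll("[^A-Z0-9* ]", "").trim()
                            .replace("*", ".*");
            if (normalized.matches(regex)) {
                return e.getValue();
            }
        }
        return "I don't understand.";
    }
}
```

Loading the example template with `addTemplate("* YOUR NAME?", "My name is Chatbot.")` makes `respond("What is your name?")` return the canned answer, exactly as in the AIML sample above.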

Crowdsourced bots

One of the best (most human-like) chatbots is Cleverbot. But in my opinion it is also one of the simplest bots around. It uses a nice trick and a huge database. When it encounters a question it doesn’t know or understand, it just deflects and stores the question. In a later chat session it mimics and repeats that question to another human, and the human’s answer is stored in the huge database for future conversations. As a result, all the answers it gives are very human-like (because, well, they are human answers).

But there is an obvious drawback: one moment the bot is pretending to be an 18-year-old male, the next moment it claims to be a 40-year-old female. One moment it talks about how much it loves horses, the next it says it hates animals…
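The core of the trick fits in a few lines. This is only my guess at the mechanism as described above, with invented names (`CrowdBot`, `learn`), not Cleverbot’s actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class CrowdBot {
    // Answers harvested from earlier human conversations.
    private final Map<String, String> knownAnswers = new HashMap<>();
    // Questions we couldn't answer yet, waiting to be mirrored at a human.
    private final Deque<String> unanswered = new ArrayDeque<>();

    public String respond(String input) {
        String key = input.toLowerCase().trim();
        String answer = knownAnswers.get(key);
        if (answer != null) {
            return answer;
        }
        // Unknown input: remember it, and deflect by replaying an earlier
        // unanswered question back at the current (human) partner.
        unanswered.addLast(key);
        return unanswered.size() > 1 ? unanswered.pollFirst() : "Tell me more.";
    }

    // Called when a human replies to a mirrored question.
    public void learn(String question, String humanAnswer) {
        knownAnswers.put(question.toLowerCase().trim(), humanAnswer);
    }
}
```

This also makes the drawback obvious: the stored answers come from many different humans, so the bot has no consistent persona.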

Markov Chain bots

To keep this (long) blog post within reasonable length I’m not going to explain Markov Chains in detail. Markov Chain bots store words in Markov chains; let’s say, for example, that we store chains of length three.

"Let's store this sentence, let's store it in a markov chain."

1> Let's store this
2> store this sentence
3> this sentence let's
4> sentence let's store
5> let's store it
6> store it in
7> it in a
8> in a markov
9> a markov chain

Now let’s imagine we are building a valid reply to a question and we already have “let’s store”… what can we do next? We go into the chain and walk the nodes until we find the two matching results (1) and (5). So the sentence can continue with “this” or “it”.
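The walk described above amounts to a lookup table from two-word prefixes to possible next words. A minimal sketch (class and method names are mine, and a real bot would also pick among the continuations, e.g. randomly weighted by frequency):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MarkovChain {
    // Maps a two-word prefix to every word that followed it in training text.
    private final Map<String, List<String>> chain = new HashMap<>();

    public void train(String text) {
        String[] words = text.toLowerCase().replaceAll("[^a-z' ]", "").split("\\s+");
        // Slide a window of three words over the text (chains of length three).
        for (int i = 0; i + 2 < words.length; i++) {
            String prefix = words[i] + " " + words[i + 1];
            chain.computeIfAbsent(prefix, k -> new ArrayList<>()).add(words[i + 2]);
        }
    }

    // All words that can follow the given two-word prefix.
    public List<String> continuations(String prefix) {
        return chain.getOrDefault(prefix.toLowerCase(), List.of());
    }
}
```

Training it on the example sentence and asking for the continuations of “let’s store” yields exactly the two options from chains (1) and (5): “this” and “it”.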

Famous bots like MegaHAL work roughly this way (see their explanation). Although this feels much more like real AI/knowledge to me, it also has drawbacks. For example, you can’t say these bots are reasoning; they don’t understand the environment/context they are in.

A new attempt

I’ve made a list of what my ideal chatbot should be:

  • Learn through conversation/reading text
  • Not just repeat, but understand relations and concepts
  • Have different scopes: global knowledge and conversational scope

Two days ago these ideas started to take shape in my head and I started writing the first code. The first goal is to make the bot able to read text and extract ‘knowledge’ from it.

The first thing I had to do was break the input text into pieces. For this I found a great open-source framework called Apache OpenNLP. It recognises words and sentences, and it detects verbs, nouns, pronouns, adjectives, adverbs, etc.

The next thing I wanted to do was turn all the nouns and verbs into their ‘base’ form. When storing relations in the bot’s memory I want to avoid duplicate entries, so for example “fish are animals” and “fish is an animal” are the same. For that purpose I’m using WordNet® in combination with the Java library JWNL.
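To show why base forms matter, here is a deliberately naive suffix-stripping sketch. This is a toy for illustration only, not what WordNet/JWNL does: a dictionary-based lemmatizer also handles irregular forms (“geese”, “went”) that simple rules cannot.

```java
public class BaseForm {
    // Toy rules: map "is"/"are"/"am" to "be", strip simple plural suffixes.
    // The real bot uses WordNet + JWNL instead of hand-written rules.
    public static String base(String word) {
        String w = word.toLowerCase();
        if (w.equals("is") || w.equals("are") || w.equals("am")) return "be";
        if (w.endsWith("ies")) return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("s") && !w.endsWith("ss")) return w.substring(0, w.length() - 1);
        return w;
    }
}
```

With this, both “fish are animals” and “fish is an animal” reduce to the same (fish, be, animal) triple before anything is stored.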

Currently this is what the bot sees when given some input:

> The bot can parse sentences and understand their base form
Transformed: (The DT : WordPart) (bot NN : NounPart) (can MD : WordPart) (parse VB : VerbPart) (sentence NNS : NounPart) (and CC : WordPart) (understand VB : VerbPart) (their PRP$ : WordPart) (base form NN : NounPart) 
> It can also parse names like Roy van Rijn and numbers like five hundred million and 4 hundred
Transformed: (It PRP : WordPart) (can MD : WordPart) (also RB : AdverbPart) (parse VB : VerbPart) (name NNS : NounPart) (like IN : WordPart) (Roy van Rijn) (and CC : WordPart) (number NNS : NounPart) (like IN : WordPart) (500000000) (and CC : WordPart) (400)

Understanding what kind of words are used and being able to transform them into the base-form will make it easier to store ‘knowledge’ and make sense of the world in the future.

Graph database

Instead of learning how to use a real graph database (like Neo4j) I decided to build something myself. Normally this is a horrible choice, but I’m in it for the fun and to learn new skills. Although it is yet another distraction from the actual bot, after a couple of hours I got the following unit test working:

// Set up some fruity test data:
// [thing] [connection] [thing] [optional: inverse connection]
graph.connect("human", "eat", "food", "eaten by");
graph.connect("apple", "be", "fruit", "contains");
graph.connect("banana", "be", "fruit", "contains");
graph.connect("strawberry", "be", "fruit", "contains");
graph.connect("fruit", "be", "food", "contains");
graph.connect("fruit", "fall", "tree");

// True: Does human eat banana?
Assert.assertTrue(graph.validate("human", "eat", "banana"));
// True: Does human eat strawberry?
Assert.assertTrue(graph.validate("human", "eat", "strawberry"));
// True: Is banana eaten by human?
Assert.assertTrue(graph.validate("banana", "eaten by", "human"));
// True: Does the group food contain strawberry?
Assert.assertTrue(graph.validate("food", "contains", "strawberry"));
// True: Is apple a food?
Assert.assertTrue(graph.validate("apple", "be", "food"));
// True: Do apples fall from trees?
Assert.assertTrue(graph.validate("apple", "fall", "tree"));

// False: Is strawberry a human?
Assert.assertFalse(graph.validate("strawberry", "be", "human"));
// False: Is fruit a strawberry?
Assert.assertFalse(graph.validate("fruit", "be", "strawberry"));

More facts can already be extracted from this ‘knowledge database’ than just the plain information I put in; for example, “human eats banana” was never stated directly.
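A minimal sketch of a graph that passes this kind of test could look like the class below. This is my own illustration, not the actual implementation from the post; the key assumption is that “be” edges form an is-a hierarchy, and that a relation holding for a category holds for its members:

```java
import java.util.*;

public class Graph {
    // edges.get(from).get(relation) = set of target nodes.
    private final Map<String, Map<String, Set<String>>> edges = new HashMap<>();

    public void connect(String from, String relation, String to) {
        edges.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(relation, k -> new HashSet<>()).add(to);
    }

    // Also store the inverse edge, e.g. human -eat-> food, food -eaten by-> human.
    public void connect(String from, String relation, String to, String inverse) {
        connect(from, relation, to);
        connect(to, inverse, from);
    }

    // Upward closure over "be" (is-a): the node plus all its categories.
    private Set<String> closure(String node) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>(List.of(node));
        while (!todo.isEmpty()) {
            String current = todo.poll();
            if (seen.add(current)) {
                todo.addAll(edges.getOrDefault(current, Map.of())
                                 .getOrDefault("be", Set.of()));
            }
        }
        return seen;
    }

    public boolean validate(String subject, String relation, String object) {
        if (relation.equals("be")) {
            // "apple be food" holds if food is among apple's (transitive) categories.
            Set<String> ancestors = closure(subject);
            ancestors.remove(subject);
            return ancestors.contains(object);
        }
        // Otherwise: any direct edge between a category of the subject
        // and a category of the object, e.g. human -eat-> food ⊇ banana.
        for (String s : closure(subject)) {
            Set<String> targets = edges.getOrDefault(s, Map.of())
                                       .getOrDefault(relation, Set.of());
            for (String o : closure(object)) {
                if (targets.contains(o)) return true;
            }
        }
        return false;
    }
}
```

The inference is deliberately simple: everything reduces to walking “be” edges upward before checking for a direct edge, which is enough to derive “human eats banana” from “human eats food” and “banana is a fruit is a food”.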

Data extraction

The next step in making my bot is going to be data extraction. I’m probably going to make a small template language that (might!) look like this:

Pattern: [noun] {be} (adj) [noun]
Store: %1 is %2

The template should match sentences like “fish are animals”, “roses are red”, “Roy is a handsome dude” and “Roy is an obvious liar”.

The bot should be able to store all these ‘facts’ in my graph/relation database. With these data-extraction templates it should be possible to build a large knowledge base of facts for the bot, for example just by parsing a Wikipedia dump.
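As a first approximation of that template language, the “[noun] {be} (adj) [noun]” pattern can be written as a plain regex. This is a hypothetical stand-in (the names `FactExtractor` and `extract` are mine, and a real version would use the POS tags from OpenNLP instead of guessing from word shape):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FactExtractor {
    // Subject, a form of "to be", optional article and adjective, then object.
    private static final Pattern BE_FACT = Pattern.compile(
        "(?i)^(\\w+)\\s+(?:is|are)\\s+(?:an?\\s+)?(?:\\w+\\s+)?(\\w+)$");

    // Returns "subject is object" if the sentence matches, empty otherwise.
    public static Optional<String> extract(String sentence) {
        Matcher m = BE_FACT.matcher(sentence.trim().replaceAll("[.!?]$", ""));
        if (m.matches()) {
            return Optional.of(m.group(1).toLowerCase() + " is "
                    + m.group(2).toLowerCase());
        }
        return Optional.empty();
    }
}
```

It happily turns “Roy is a handsome dude” into “roy is dude”, while ignoring sentences that don’t fit the pattern; combining such extracted pairs with the base-form step above would feed the graph database directly.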

Now I’m going to dive back into the code and continue my chatbot adventure, keep an eye out for part 2!