Simple NLP - Token Matching

In this tutorial we will explore the simple NLP technique of token matching to perform natural language understanding and a basic form of parsing.

Let's set up our environment.

Let's say we have a simple house with a set of rooms, and each room has some attributes that can be retrieved from sensors. For our purposes here, let's say we just want to look up the temperature.

We define our set of rooms, and we define our attributes.

In [68]:
ROOMS = ["livingroom", "kitchen", "diningroom", "bedroom", "bathroom"] # As many or as few rooms as is appropriate
ATTRIBUTES = ["temperature"] # For now, this is just temperature, but it could be humidity, lighting, sound, anything you want

The first thing we need to do is get some input.

Here we have two elements:

  • A string that contains a question that we want the computer to answer
  • A dictionary that has two elements, the question that we just entered, and a conversation history (if there is one)

We put the data in this structure so that it is easier to use down the road.

In [69]:
inputString = "What is the temperature in the livingroom?" # The question that we want to answer
systemInput = {"question": inputString,
               "history" : []}                             # A simple data structure for holding that data

Now, this is where things get fun. Here we are going to actually look at the input and try to get some simple understanding of it.

The first thing we are going to do is clean the input so that it is in an understandable format. We are going to write a function to do just that.

In [70]:
def cleanText(inputString):
    #First we convert the input to lower case
    loweredInput = inputString.lower() 

    #Then we remove all the characters that are not alphanumeric, or spaces
    cleanedInput = ""
    for character in loweredInput:                   #For every character in the question that has been converted to lower case
        if(character.isalnum() or character == " "):     #Check to see if it is an alpha numeric character (A-Z, a-z, 0-9) or a space and if it is...
            cleanedInput += character                        #Then we add it to the cleaned input string, building it up character by character
        else:                                            #If it isn't alpha numeric, or a space
            pass                                             #Then we ignore it because we no longer need to keep track of it

    #This is what our input question looks like now...
    print "cleanedInput:", cleanedInput
    return cleanedInput

#Let's run it on some simple data so you can see what this actually does
exampleText = "There are 4 lights!"

print "dirtyInput:", exampleText
cleanedExampleText = cleanText(exampleText)
dirtyInput: There are 4 lights!
cleanedInput: there are 4 lights
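The same cleaning step can be sketched more compactly with a regular expression. This is only an alternative sketch, not a replacement for the function above: `cleanTextRegex` and `NON_TOKEN_CHARACTERS` are names made up for illustration, and the pattern assumes plain ASCII input. It avoids `print` so that it runs unchanged under Python 2 or 3.

```python
import re

# Anything that is not a lowercase ASCII letter, a digit, or a space
NON_TOKEN_CHARACTERS = re.compile(r"[^a-z0-9 ]")

def cleanTextRegex(inputString):
    # Lower-case first, then delete every character the pattern matches
    return NON_TOKEN_CHARACTERS.sub("", inputString.lower())

cleaned = cleanTextRegex("There are 4 lights!")
contraction = cleanTextRegex("What's up?")
```

Here `cleaned` is "there are 4 lights" and `contraction` is "whats up". Apostrophes are deleted rather than replaced with a space, which is why a stopword list used after this step needs entries such as "whats" and "dont".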

We are going to use what is called a "Bag of Words" (BOW) approach. This means that we are going to ignore the order of the words, as if they had just been dropped into a bag. While this does lose some information about what was actually said, for simple NLP like this it shouldn't matter too much.

So our next step is to split the text into tokens wherever there is a space. Let's write another function.

In [71]:
def makeBow(text):
    bagOfWords = set(text.split(" ")) #split the string that we have been given based on where the spaces are
    #This is what a bag of words looks like...
    print "bagOfWords:", bagOfWords
    return bagOfWords

#Let's run it on some simple data so you can see what this actually does
exampleText = "hello my baby hello my honey hello my rag time gal"

print "textInput:", exampleText
exampleBow = makeBow(exampleText)
textInput: hello my baby hello my honey hello my rag time gal
bagOfWords: set(['rag', 'honey', 'gal', 'time', 'baby', 'my', 'hello'])
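Two details of the function above are worth seeing in isolation: a set throws away how often each word occurred, and splitting on a single space produces an empty token whenever two spaces sit next to each other. The sketch below shows one way to keep the counts and avoid the empty token. `makeBowCounts` is a hypothetical name, and keeping counts is an assumption about what a later step might want; nothing in this tutorial needs them.

```python
from collections import Counter

def makeBowCounts(text):
    # split() with no argument splits on any run of whitespace,
    # so doubled spaces do not produce empty-string tokens
    return Counter(text.split())

counts = makeBowCounts("hello my baby hello my honey hello my rag time gal")

# For comparison, splitting on a single space keeps an empty token
tokensWithEmpty = set("whats  up".split(" "))
```

With this input `counts["hello"]` is 3 and `counts["gal"]` is 1, while `tokensWithEmpty` contains the empty string alongside "whats" and "up".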

Our next step is to remove all the unimportant words from this bag. This may sound complicated: how do you decide what is and isn't important? While that is still an open problem, one simple way is to maintain a list of the most common words in a given language and treat these words as the unimportant ones. We call this list of unimportant words "stopwords".

Included below is a simple list of stopwords. These words are typically chosen using statistical measures such as word frequency.

In [72]:
STOPWORDS = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "arent", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "cant", "cannot", "could", "couldnt", "did", "didnt", "do", "does", "doesnt", "doing", "dont", "down", "during", "each", "few", "for", "from", "further", "had", "hadnt", "has", "hasnt", "have", "havent", "having", "he", "hed", "hell", "hes", "her", "here", "heres", "hers", "herself", "him", "himself", "his", "how", "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", "is", "isnt", "it", "its", "its", "itself", "lets", "me", "more", "most", "mustnt", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shant", "she", "shed", "shell", "shes", "should", "shouldnt", "so", "some", "such", "than", "that", "thats", "the", "their", "theirs", "them", "themselves", "then", "there", "theres", "these", "they", "theyd", "theyll", "theyre", "theyve", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasnt", "we", "wed", "well", "were", "weve", "were", "werent", "what", "whats", "when", "whens", "where", "wheres", "which", "while", "who", "whos", "whom", "why", "whys", "with", "wont", "would", "wouldnt", "you", "youd", "youll", "youre", "youve", "your", "yours", "yourself", "yourselves"]

def removeStopWords(inputBagOfWords):
    coreWords = set()
    for word in inputBagOfWords:
        if(word not in STOPWORDS):       #Keep only the words that are not in the stopword list
            coreWords.add(word)

    #This is what our core set of tokens looks like now...
    print "coreWords:", coreWords
    return coreWords

#Let's run it on some simple data so you can see what this actually does
print "inputBagOfWords:", exampleBow
exampleCoreWords = removeStopWords(exampleBow)
inputBagOfWords: set(['rag', 'honey', 'gal', 'time', 'baby', 'my', 'hello'])
coreWords: set(['rag', 'honey', 'gal', 'time', 'baby', 'hello'])
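Since the bag of words is already a set, stopword removal can also be written as a set difference. Checking `word in STOPWORDS` against a list scans the list each time, whereas a set uses a hash lookup. The sketch below assumes a tiny stop list purely to stay short; `STOPWORD_SET` and `removeStopWordsFast` are hypothetical names.

```python
# Abbreviated stop list for illustration only; the full list would be used in practice
STOPWORD_SET = frozenset(["a", "in", "is", "my", "the", "what"])

def removeStopWordsFast(inputBagOfWords):
    # Set difference keeps every token that is not a stopword
    return set(inputBagOfWords) - STOPWORD_SET

core = removeStopWordsFast(set(["what", "livingroom", "temperature", "is", "in", "the"]))
```

For this input `core` holds only "livingroom" and "temperature".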

Now that we have cleaned our input, we need to start looking at its contents for information...

Let's write a function to extract the information we want from the text.

In [73]:
def extractAttributeAndRoom(inputBagOfWords):
    targetRoom = None
    targetAttribute = None

    for word in inputBagOfWords:
        if(word in ROOMS):
            targetRoom = word
        if(word in ATTRIBUTES):
            targetAttribute = word

    #Now we have extracted the attribute and the room...
    print "targetAttribute: \"" + str(targetAttribute) + "\" targetRoom: \"" + str(targetRoom) + "\""   #str() keeps this from failing when nothing was found
    return targetAttribute, targetRoom

#We will run an example of this function at the end of the tutorial
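The loop above keeps only one room and one attribute, and because sets have no defined order, a question that names two rooms gives whichever one happens to be visited last. One way to make that visible, sketched here under the assumption that the caller wants to see every match, is to intersect the bag with the known vocabulary. `extractMatches` is a hypothetical name.

```python
ROOMS = ["livingroom", "kitchen", "diningroom", "bedroom", "bathroom"]
ATTRIBUTES = ["temperature"]

def extractMatches(inputBagOfWords):
    """Return (attributes, rooms) as sorted lists of every match found."""
    words = set(inputBagOfWords)
    return sorted(words & set(ATTRIBUTES)), sorted(words & set(ROOMS))

attributes, rooms = extractMatches(set(["temperature", "kitchen", "bedroom"]))
nothingFound = extractMatches(set(["hello"]))
```

Here `rooms` holds both "bedroom" and "kitchen", so the caller can detect the ambiguity, and `nothingFound` is a pair of empty lists rather than a pair of `None` values.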

Next we are going to define a dummy function that returns a fake value when asked for the temperature of a specific room. While this function is simple, its contents could be replaced with anything you want: a call to a sensor that is plugged into your machine, a network request to a server that hosts sensing equipment, and so on.

In [74]:
def getAttributeValue(room, attribute):
    if (room is not None and attribute is not None):
        if(room == "livingroom" and attribute == "temperature"):
            return 72
        elif(room == "bathroom" and attribute == "temperature"):
            return 73
        elif(room == "kitchen" and attribute == "temperature"):
            return 81
        elif(room == "bedroom" and attribute == "temperature"):
            return 68
        elif(room == "diningroom" and attribute == "temperature"):
            return 79
        else:
            return 50        #Fallback value for any other room/attribute pair
    return None              #Nothing to look up if the room or the attribute is missing
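A chain of comparisons like this grows quickly as rooms and attributes are added. One possible alternative, sketched below, is a lookup table keyed on the (room, attribute) pair. The readings are the fake values from the dummy function; the names `FAKE_READINGS`, `DEFAULT_READING`, and `getAttributeValueFromTable` are made up for illustration, and treating 50 as the fallback for unknown pairs is an assumption.

```python
# Fake readings copied from the dummy function
FAKE_READINGS = {
    ("livingroom", "temperature"): 72,
    ("bathroom", "temperature"): 73,
    ("kitchen", "temperature"): 81,
    ("bedroom", "temperature"): 68,
    ("diningroom", "temperature"): 79,
}
DEFAULT_READING = 50  # Assumed fallback for pairs that are not in the table

def getAttributeValueFromTable(room, attribute):
    # Nothing to look up if either part of the key is missing
    if room is None or attribute is None:
        return None
    return FAKE_READINGS.get((room, attribute), DEFAULT_READING)

kitchenTemperature = getAttributeValueFromTable("kitchen", "temperature")
```

Adding a new attribute such as humidity would then only mean adding rows to the table.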

Let's put it all together now. We run the question through each step, then pass the extracted room and attribute to the function that looks up the information.

In [75]:
rawInputString = systemInput["question"]
print "Got input:", rawInputString

cleanedInput = cleanText(rawInputString)
cleanBow = makeBow(cleanedInput)
stoppedCleanBow = removeStopWords(cleanBow)
targetAttribute, targetRoom = extractAttributeAndRoom(stoppedCleanBow)

print "The", targetAttribute, "in the", targetRoom, "is", getAttributeValue(targetRoom, targetAttribute)
Got input: What is the temperature in the livingroom?
cleanedInput: what is the temperature in the livingroom
bagOfWords: set(['what', 'livingroom', 'temperature', 'is', 'in', 'the'])
coreWords: set(['livingroom', 'temperature'])
targetAttribute: "temperature" targetRoom: "livingroom"
The temperature in the livingroom is 72
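If no room or attribute is found, the final line above would print a sentence built from `None` values. As a rough sketch of how the whole pipeline might be wrapped so that it fails politely, the function below repeats each step in compressed form. `answerQuestion`, `READINGS`, and the shortened stop list are assumptions made to keep the example self-contained, and it returns a string instead of printing so that it runs under Python 2 or 3.

```python
ROOMS = ["livingroom", "kitchen", "diningroom", "bedroom", "bathroom"]
ATTRIBUTES = ["temperature"]
# Abbreviated stop list for illustration only
STOPWORDS = frozenset(["what", "is", "the", "in", "it", "how", "my"])
# Fake temperature readings, as in the dummy lookup function
READINGS = {"livingroom": 72, "bathroom": 73, "kitchen": 81,
            "bedroom": 68, "diningroom": 79}

def answerQuestion(question):
    # Clean: lower-case and keep only alphanumeric characters and spaces
    cleaned = "".join(c for c in question.lower() if c.isalnum() or c == " ")
    # Bag of words with the stopwords removed
    coreWords = set(cleaned.split()) - STOPWORDS
    # Token matching against the known vocabulary
    rooms = [word for word in coreWords if word in ROOMS]
    attributes = [word for word in coreWords if word in ATTRIBUTES]
    if not rooms or not attributes:
        return "Sorry, I could not find a room and an attribute in that question."
    room, attribute = rooms[0], attributes[0]
    return "The %s in the %s is %s" % (attribute, room, READINGS[room])

answer = answerQuestion("What is the temperature in the livingroom?")
```

`answer` is the same sentence as the output above, while a question that uses unknown vocabulary gets the apology instead.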

You have just written a simple, token-based NLP system.

But you should already be thinking about how this system would fail. Unfortunately this system requires that you use very specific vocabulary (You can't call your livingroom a familyroom, etc). Spellings of words matter greatly, and if a word is incorrectly spelled the system falls apart.

How would you handle an input such as "How hot is it in my family room?"
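One possible direction, offered only as a sketch and not as the answer, is to normalize tokens before matching: map known synonyms onto the vocabulary, try adjacent words joined together, and accept near misses in spelling. The synonym table, the name `normalizeTokens`, and the 0.8 similarity cutoff are all assumptions chosen for illustration; `difflib.get_close_matches` is part of the Python standard library.

```python
import difflib

ROOMS = ["livingroom", "kitchen", "diningroom", "bedroom", "bathroom"]
ATTRIBUTES = ["temperature"]
# Hypothetical synonym table: maps alternative wording onto the known vocabulary
SYNONYMS = {"familyroom": "livingroom", "lounge": "livingroom",
            "hot": "temperature", "cold": "temperature", "warm": "temperature"}

def normalizeTokens(cleanedText):
    words = cleanedText.split()
    # Also try each adjacent pair joined together, so "family room" can match "familyroom"
    candidates = words + [a + b for a, b in zip(words, words[1:])]
    vocabulary = ROOMS + ATTRIBUTES
    normalized = set()
    for token in candidates:
        token = SYNONYMS.get(token, token)
        # Accept near misses such as "kichen"; 0.8 is an arbitrary cutoff
        close = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.8)
        if close:
            normalized.add(close[0])
    return normalized

found = normalizeTokens("how hot is it in my family room")
misspelled = normalizeTokens("temperture in the kichen")
```

With these assumptions `found` contains "temperature" and "livingroom", and `misspelled` contains "temperature" and "kitchen". The trade-off is that a hand-written synonym table has to be maintained, and a looser cutoff starts to match words that are not rooms at all.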