Simple NLP - Jaccard Matching

In this tutorial we will explore the simple NLP technique of token matching, combined with the Jaccard index, in order to perform natural language understanding and a basic form of parsing.

Let's set up our environment.

Let's say we have a simple house with a set of rooms, and each room has some attributes that can be retrieved from sensors. For our purposes here, let's say we just want to look up the temperature and the humidity.

We define our set of rooms and our attributes, the same as in the last tutorial, but this time with a twist: we are also going to include a series of words that we can use for reference with each room and attribute. This will give us some context to better decide what the user is talking about.

In [10]:
#Here let's come up with a bunch of words that can be associated with each room
LIVINGROOM_WORDS = ["livingroom", "living", "familyroom", "family", "television", "tv", "media", "chill", "communal", "couch"]
KITCHEN_WORDS = ["kitchen", "cook", "cooking", "food", "meal", "family", "dinner", "lunch", "breakfast", "brunch", "snack", "fridge", "stove", "grill", "refrigerator", "eat"]
DININGROOM_WORDS = ["dining", "diningroom", "dinner", "lunch", "breakfast", "brunch", "table", "chair", "eat"]
BEDROOM_WORDS = ["bed", "bedroom", "sleep", "rest", "sheets", "personal", "mattress"]
BATHROOM_WORDS = ["bathroom", "bath", "shower", "sink", "toilet", "towel", "clean", "wash"]

#And a bunch of words that can be associated with each attribute
TEMPERATURE_WORDS = ["temperature", "temp", "hot", "cold", "chilly", "warm", "cool", "fresh", "roast", "freeze"]
HUMIDITY_WORDS = ["humidity", "humid", "wet", "moist", "dry", "steamy", "damp", "sultry", "clammy"]

#All the rooms that you want to keep track of
ROOM_DATA = {"livingroom":LIVINGROOM_WORDS,
             "kitchen":KITCHEN_WORDS,
             "diningroom":DININGROOM_WORDS, 
             "bedroom":BEDROOM_WORDS,
             "bathroom":BATHROOM_WORDS}

#All of the attributes that we want to keep track of, notice this time we are also looking at humidity
ATTRIBUTE_DATA = {"temperature":TEMPERATURE_WORDS, 
                  "humidity":HUMIDITY_WORDS}

Now that we have all of those words defined, we can play a kind of association game to try and see which bucket of words the input sentence is most associated with. Obviously the above lists could be extended as much as you want, but these work for the time being.

The first thing we need to do is get some input. This is the same as last time, but notice that the question we are asking is now a little more abstract.

In [11]:
inputString = "How cool is it in the place where I normally cook?" # The question that we want to answer
systemInput = {"question": inputString, 
               "history" : []}                             # A simple datastructure for controlling that data

Next we are going to repeat some of the steps from the previous tutorials: we are going to clean our input text and convert it to a bag of words. This code should mostly be review.

In [12]:
#Here is the "cleanText" function that we defined in the first tutorial (Token Matching)
def cleanText(inputString):
    #First we convert the input to lower case
    loweredInput = inputString.lower() 

    #Then we remove all the characters that are not alphanumeric, or spaces
    cleanedInput = ""
    for character in loweredInput:                   #For every character in the question that has been converted to lower case
        if(character.isalnum() or character == " "):     #Check to see if it is an alpha numeric character (A-Z, a-z, 0-9) or a space and if it is...
            cleanedInput += character                        #Then we add it to the cleaned input string, building it up character by character
        else:                                            #If it isn't alpha numeric, or a space
            pass                                             #Then we ignore it because we no longer need to keep track of it

    #This is what our input question looks like now...
    print("cleanedInput:", cleanedInput)
    return cleanedInput

#Here is the "makeBow" function that we also defined in the first tutorial (Token Matching)
def makeBow(text):
    bagOfWords = set(text.split()) #split the string on whitespace; unlike split(" "), this does not produce empty tokens when there are repeated spaces
    
    #This is what a bag of words looks like...
    print("bagOfWords:", bagOfWords)
    
    return bagOfWords

#Finally, here is the "removeStopWords" function that we also wrote in the first tutorial (Token Matching)
STOPWORDS = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "arent", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "cant", "cannot", "could", "couldnt", "did", "didnt", "do", "does", "doesnt", "doing", "dont", "down", "during", "each", "few", "for", "from", "further", "had", "hadnt", "has", "hasnt", "have", "havent", "having", "he", "hed", "hell", "hes", "her", "here", "heres", "hers", "herself", "him", "himself", "his", "how", "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", "is", "isnt", "it", "its", "its", "itself", "lets", "me", "more", "most", "mustnt", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shant", "she", "shed", "shell", "shes", "should", "shouldnt", "so", "some", "such", "than", "that", "thats", "the", "their", "theirs", "them", "themselves", "then", "there", "theres", "these", "they", "theyd", "theyll", "theyre", "theyve", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasnt", "we", "wed", "well", "were", "weve", "were", "werent", "what", "whats", "when", "whens", "where", "wheres", "which", "while", "who", "whos", "whom", "why", "whys", "with", "wont", "would", "wouldnt", "you", "youd", "youll", "youre", "youve", "your", "yours", "yourself", "yourselves"]

def removeStopWords(inputBagOfWords):
    coreWords = set()
    for word in inputBagOfWords:
        if word not in STOPWORDS:
            coreWords.add(word)

    #This is what our core set of tokens looks like now...
    print("coreWords:", coreWords)
    return coreWords

Now that we have all of our utility functions written we need to be able to decide which room in the house we are going to be talking about. This is where we get to do some simple math.

There is a simple scoring function called "Jaccard", sometimes called the "Jaccard Index" or the "Jaccard Coefficient". There is plenty of reading available on the subject, but it all boils down to a simple equation:

Jaccard Score = (|A Intersect B|)/(|A Union B|)

I know this can look a little intimidating, but this image from Wikipedia does a good job of explaining the concept: (image not included here)

For our purposes, this can be restated again to hopefully make it even more simple:

Jaccard Score = (The number of words that are in both A AND B)/(The number of unique words that are in either A OR B)

In the meantime, let's write a function to compute the Jaccard score.

In [13]:
def jaccard(setA, setB):
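    #Divide the size of the overlap by the size of the union; if both sets are empty, score 0.0 instead of dividing by zero
    union = setA.union(setB)
    return len(setA.intersection(setB)) / len(union) if union else 0.0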
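
To make the arithmetic concrete, here is a small worked example. It is only a sketch: it uses the core words from our example question and the temperature words from above, and counts the overlap and the union by hand with Python's set operators.

```python
# Worked example: the core words from the question versus the temperature words
questionWords = {"cool", "place", "cook", "normally"}
temperatureWords = {"temperature", "temp", "hot", "cold", "chilly",
                    "warm", "cool", "fresh", "roast", "freeze"}

shared = questionWords & temperatureWords      # words that are in both sets
combined = questionWords | temperatureWords    # unique words that are in either set

score = len(shared) / len(combined)

print("shared:", shared)                           # only "cool" appears in both sets
print("sizes:", len(shared), "/", len(combined))   # 1 / 13
print("score:", score)
```

Only one word is shared out of thirteen unique words, so the score is 1/13, or roughly 0.077. Scores this small are normal here because the reference lists are much longer than the question.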

Now the next step is to write a function that takes in a sentence and tries to find the set of words that most closely matches it based on the Jaccard index.

In [14]:
def jaccardMatch(inputTokens, referenceData):
    bestCandidate = None #First we are going to define a variable to store the current best candidate answer, based on our data
    bestScore = -1       #Next we are going to define a current best score to keep track of our best answers
    
    for candidateName, relatedWords in referenceData.items():       #Then we are going to look through all of the possible candidates (in our case, this will be rooms and attributes)
        
        currentScore = jaccard(set(relatedWords), set(inputTokens))       #Run jaccard to get a score
        
        if(currentScore > bestScore):                                     #If the score represents an improvement...
            bestCandidate = candidateName                                        #Save the current candidate answer as our best answer
            bestScore = currentScore                                             #And save the current score as our best score
    
    print("bestCandidate:", bestCandidate, "-", bestScore)          #Print the best candidate that we got, and the best score for that candidate
    return bestCandidate                                            #Finally, now that we have looked at all of the candidate answers, and scored them, we can return the answer that we have
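
One thing worth noticing about jaccardMatch: because bestScore starts at -1, a question that shares no words with any candidate still returns the first candidate, with a score of 0. The sketch below is one possible way to see that. The helper name rankCandidates and the trimmed-down exampleRooms data are made up for illustration; the helper scores every candidate and sorts them best-first, so you can see how close the runner-up was.

```python
def jaccard(setA, setB):
    # Same scoring idea as above, repeated so this sketch runs on its own
    union = setA | setB
    return len(setA & setB) / len(union) if union else 0.0

def rankCandidates(inputTokens, referenceData):
    # Hypothetical helper: score every candidate and sort them best-first
    scores = [(name, jaccard(set(words), set(inputTokens)))
              for name, words in referenceData.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# A trimmed-down version of the room data, just for illustration
exampleRooms = {"kitchen": ["kitchen", "cook", "food", "eat"],
                "diningroom": ["dining", "table", "chair", "eat"],
                "bedroom": ["bed", "sleep", "rest"]}

ranking = rankCandidates({"cook", "eat"}, exampleRooms)
print(ranking)   # kitchen first (0.5), then diningroom (0.2), then bedroom (0.0)

# With no overlapping words every score is 0.0, so no candidate really "won"
noMatch = rankCandidates({"garage"}, exampleRooms)
print(noMatch)
```

If you wanted the system to say "I don't know" instead of guessing, checking whether the top score is 0 would be one simple place to start.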

Next we are going to define the same dummy function that we have used in the past tutorials to return a fake value when asked for a temperature for a specific room. While this function is simple, its contents could be replaced with anything you want: a call to a sensor that is plugged into your machine, a network request to a server that hosts sensing equipment, or anything else.

In [15]:
def getAttributeValue(room, attribute):
    if room is not None and attribute is not None:
        if(room == "livingroom" and attribute == "temperature"):
            return 72
        elif(room == "bathroom" and attribute == "temperature"):
            return 73
        elif(room == "kitchen" and attribute == "temperature"):
            return 81
        elif(room == "bedroom" and attribute == "temperature"):
            return 68
        elif(room == "diningroom" and attribute == "temperature"):
            return 79
        else:
            return 50
    else: 
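        raise ValueError("A room and an attribute are both required") #"raise None" is not valid, so raise a real exception instead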
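
As one example of swapping out the contents of this function, here is a sketch that reads the same fake values from a lookup table instead of a chain of if statements. The names FAKE_READINGS and getAttributeValueFromTable are made up for this illustration, and the numbers are the same placeholder values used above, not real sensor data.

```python
# Hypothetical lookup table holding the same fake temperatures as above
FAKE_READINGS = {("livingroom", "temperature"): 72,
                 ("bathroom", "temperature"): 73,
                 ("kitchen", "temperature"): 81,
                 ("bedroom", "temperature"): 68,
                 ("diningroom", "temperature"): 79}

def getAttributeValueFromTable(room, attribute, default=50):
    # Refuse to guess when either piece of the question is missing
    if room is None or attribute is None:
        raise ValueError("A room and an attribute are both required")
    # Fall back to the same placeholder value (50) for anything that is not listed
    return FAKE_READINGS.get((room, attribute), default)

print(getAttributeValueFromTable("kitchen", "temperature"))   # 81
print(getAttributeValueFromTable("kitchen", "humidity"))      # 50, the fallback value
```

Adding a new room or attribute then only means adding a row to the table, and the table itself could later be filled in from real sensor readings.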

So now we have code that can clean our data, tokenize it, score it, and select the best answer based on those scores. All we need to do now is put these things together.

In [16]:
rawInputString = systemInput["question"]
print("Got input:", rawInputString)

#Preprocess our data
cleanedInput = cleanText(rawInputString)
cleanBow = makeBow(cleanedInput)
stoppedCleanBow = removeStopWords(cleanBow)

#Do some analysis
targetAttribute = jaccardMatch(stoppedCleanBow, ATTRIBUTE_DATA)
targetRoom = jaccardMatch(stoppedCleanBow, ROOM_DATA)

#Present the answer
print("The", targetAttribute, "in the", targetRoom, "is", getAttributeValue(targetRoom, targetAttribute))
Got input: How cool is it in the place where I normally cook?
cleanedInput: how cool is it in the place where i normally cook
bagOfWords: {'cook', 'is', 'cool', 'place', 'where', 'it', 'in', 'how', 'i', 'the', 'normally'}
coreWords: {'cool', 'place', 'cook', 'normally'}
bestCandidate: temperature - 0.07692307692307693
bestCandidate: kitchen - 0.05263157894736842
The temperature in the kitchen is 81

Congratulations

You have just written a slightly more intense, token-based NLP system.

But you should already be thinking about how this system would fail. While this system gives you a bit more freedom with your word choice, you still need to list all the different words that you want to check for when you construct the system. This can be fairly time-consuming when you are constructing a large-scale system. Spelling still matters greatly: if a word is misspelled, it no longer matches anything in the lists and the system falls apart.
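
To give a feel for the spelling problem, here is one possible sketch (not part of the system above) that uses get_close_matches from Python's standard difflib module to swap a misspelled token for the closest known word before scoring. The helper name correctSpelling, the short knownWords list, and the cutoff of 0.8 are all assumptions made for this example.

```python
import difflib

def correctSpelling(word, knownWords, cutoff=0.8):
    # Hypothetical helper: swap a token for its closest known word, if one is close enough
    matches = difflib.get_close_matches(word, knownWords, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# A short list of known words, just for illustration
knownWords = ["kitchen", "cook", "cooking", "temperature", "humidity"]

print(correctSpelling("kitchn", knownWords))       # kitchen
print(correctSpelling("temprature", knownWords))   # temperature
print(correctSpelling("garage", knownWords))       # garage (nothing is close enough, so it is left alone)
```

This only helps with spelling. It does nothing about the bigger problem of having to list every word by hand.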

What are some ways that you can think of to make it so you don't have to write every single word that you want to check for?