
Saturday, December 22, 2018

Building a chatbot using TF-IDF


We want to build a basic chatbot that learns from previous messages and their responses. In this tutorial we look at the math used to convert the messages and their associated responses into weights using term frequency and inverse document frequency (tf-idf).

Once we have the weights of the words present in the messages and responses, we write each message and response as a vector of those weights. We then measure how similar these vectors are using cosine similarity.

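Term frequency measures how often a word occurs in a message, while inverse document frequency down-weights words that occur in many messages. We multiply the term frequency by the inverse document frequency to obtain the final weight of each word, and these weights are used to construct the vector.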
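To make the multiplication concrete, here is a minimal sketch that computes the weights by hand. It assumes the textbook definitions tf = count / message length and idf = log(N / df); the toy messages and the helper name tf_idf are made up for illustration. scikit-learn uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each message is treated as one document.
messages = [
    "hello how are you",
    "hello what is your name",
    "how is the weather today",
]

tokenized = [m.split() for m in messages]
n_docs = len(tokenized)

# Document frequency: number of messages that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Return {word: weight} for one tokenized message."""
    counts = Counter(tokens)
    weights = {}
    for word, count in counts.items():
        tf = count / len(tokens)                 # term frequency
        idf = math.log(n_docs / doc_freq[word])  # inverse document frequency
        weights[word] = tf * idf                 # final weight of the word
    return weights

weights = [tf_idf(tokens) for tokens in tokenized]
print(weights[0])
```

In the first message, "are" appears in only one message and so gets a larger weight than "hello", which appears in two.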

Cosine Similarity:
This is a measure of orientation, not magnitude. We ignore the magnitude of the vectors because it grows with the length of the query or response, and length tells us nothing about how similar the query is to the messages in our training data.

The angle tells us the direction in which a vector points. So if the query contains only 5 words and a message contains 500, but their words carry weights in similar proportions, the two vectors point in the same direction and are considered similar.

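We choose cos(theta) because it is a monotonically decreasing function on [0, pi/2]: the smaller the angle between two vectors, the larger the value. Since tf-idf weights are never negative, the angle between any two of our vectors always lies in this range. We use the dot product to calculate it, cos(theta) = (A . B) / (|A| |B|), as shown in the figure.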
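The dot-product calculation can be sketched in a few lines of NumPy. The vectors and the helper name cosine are illustrative assumptions, not taken from the full code.

```python
import numpy as np

def cosine(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical weight vectors over a three-word vocabulary.
short_query = np.array([1.0, 2.0, 0.0])
long_message = 100 * short_query        # same direction, much larger magnitude
unrelated = np.array([0.0, 0.0, 3.0])   # shares no weighted words with the query

same_direction = cosine(short_query, long_message)
orthogonal = cosine(short_query, unrelated)
print(same_direction, orthogonal)
```

Scaling a vector changes its magnitude but not the angle, so the first pair scores close to 1, while vectors with no words in common score 0.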


In this tutorial we walk through the code. The libraries used are scikit-learn and NumPy.

The full code is available on GitHub.

First we import the following libraries.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
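
As a rough idea of how these imports could fit together, here is a minimal sketch that picks the response whose message is most similar to the query. The training pairs and the function name reply are made up for illustration, and CountVectorizer is left out; the full code on GitHub may be organized differently.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training data: messages[i] was answered with responses[i].
messages = [
    "hello how are you",
    "what is your name",
    "how is the weather today",
]
responses = [
    "I am fine, thanks",
    "I am a chatbot",
    "It is sunny",
]

# Learn the vocabulary and idf weights, and turn each message into a tf-idf vector.
vectorizer = TfidfVectorizer()
message_vectors = vectorizer.fit_transform(messages)

def reply(query):
    """Return the response paired with the most similar training message."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, message_vectors)[0]
    return responses[int(np.argmax(scores))]

answer = reply("what is the weather")
print(answer)
```

Here the query shares the most weighted words with the third message, so its response is returned. A query with no known words scores 0 against every message, and a real bot would probably want a fallback for that case.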
