Now that we understand how term frequency and inverse document frequency are calculated, let’s put it all together to calculate tf-idf!
Tf-idf scores are calculated on a term-document basis. That means there is a tf-idf score for each word, for each document. The tf-idf score for a term t in a document d in some corpus is calculated as follows:

tf-idf(t, d) = tf(t, d) * idf(t, corpus)

where:

- tf(t, d) is the term frequency of term t in document d
- idf(t, corpus) is the inverse document frequency of term t across the corpus
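To make the formula concrete, here is a minimal hand-rolled sketch on a made-up three-document corpus, using the textbook idf definition log(N / df). Note that scikit-learn's TfidfVectorizer applies a smoothed variant of idf by default, so its exact values will differ slightly.

```python
import math

# Made-up corpus for illustration only
corpus = [
    "the dog barked at the mailman",
    "the cat napped in the sun",
    "the dog chased the cat",
]

def tf(term, document):
    # term frequency: how many times the term appears in the document
    return document.split().count(term)

def idf(term, corpus):
    # inverse document frequency: log of the total number of documents
    # divided by the number of documents that contain the term
    doc_count = sum(1 for doc in corpus if term in doc.split())
    return math.log(len(corpus) / doc_count)

def tf_idf(term, document, corpus):
    # tf-idf is the product of the two quantities above
    return tf(term, document) * idf(term, corpus)

print(tf_idf("dog", corpus[0], corpus))      # common term, lower score
print(tf_idf("mailman", corpus[0], corpus))  # rarer term, higher score
```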
We can easily calculate the tf-idf values for each term-document pair in our corpus using scikit-learn's `TfidfVectorizer`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(corpus)
```
- A `TfidfVectorizer` object is initialized. The `norm=None` keyword argument prevents scikit-learn from normalizing the product of term frequency and inverse document frequency.
- The `TfidfVectorizer` object is fit and transformed on the corpus of data, returning the tf-idf scores for each term-document pair.
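To see what `fit_transform()` returns, here is a small self-contained sketch using a made-up three-document corpus (the real exercise uses the preprocessed poems instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration; the exercise uses the preprocessed poems
corpus = [
    "the dog barked at the mailman",
    "the cat napped in the sun",
    "the dog chased the cat",
]

vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(corpus)

# fit_transform returns a sparse matrix: one row per document,
# one column per unique term in the corpus
print(tfidf_scores.shape)       # (3, number of unique terms)
print(tfidf_scores.toarray())   # dense array of tf-idf scores
```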
Instructions
The same selection of 6 Emily Dickinson poems from the previous exercise is given in `poems.py`.
In `script.py`, the poems are preprocessed. Let’s calculate the tf-idf scores for each term-document pair.
Begin by creating a `TfidfVectorizer` object named `vectorizer` with the keyword argument `norm=None`.
Fit and transform your `vectorizer` on the corpus of preprocessed poems. Save the result to a variable named `tfidf_scores`.
Like `CountVectorizer` objects, `TfidfVectorizer` objects have a `.get_feature_names()` method, which returns a list of all the unique terms in the corpus.
Paste the line below into the “get vocabulary of terms” section of `script.py` to display the tf-idf matrix.

```python
feature_names = vectorizer.get_feature_names()
```
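(In newer versions of scikit-learn this method has been renamed `.get_feature_names_out()`, so use whichever your installed version provides.) One possible way to display the matrix with labeled rows and columns is sketched below; it assumes pandas is available and uses placeholder column labels (Poem 1 through Poem 6) rather than the actual poem titles.

```python
import pandas as pd

# Placeholder document labels; substitute the actual poem titles if desired
poem_labels = [f"Poem {i + 1}" for i in range(6)]

# Rows are terms, columns are poems, so each cell is one tf-idf score
df_tf_idf = pd.DataFrame(
    tfidf_scores.T.todense(),
    index=feature_names,
    columns=poem_labels,
)
print(df_tf_idf)
```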
Which term-document pairs have the highest tf-idf scores?