The first component of tf-idf is term frequency, or how often a word appears in a document within the corpus.
The value for the term frequency is the same as if applying the bag-of-words language model to a document. If you have previously studied bag-of-words, this will all be familiar! If not, have no fear.
Term frequency indicates how often each word appears in the document. The intuition for including term frequency in the tf-idf calculation is that the more frequently a word appears in a single document, the more important that term is to the document.
Consider the stanza from Emily Dickinson’s poem I’m Nobody! Who are you? below:
stanza = '''I'm nobody! Who are you? Are you nobody, too? Then there's a pair of us — don't tell! They'd banish us, you know.'''
The term frequency for “you” is
3, “nobody” is
2, “are” is
2, “us” is
2, and the rest of the terms have a frequency of
1. We can get a general sense of what this stanza is about by the most frequently used words.
Term frequency can be calculated in Python using scikit-learn’s
CountVectorizer, as shown below:
vectorizer = CountVectorizer() term_frequencies = vectorizer.fit_transform([stanza])
CountVectorizerobject is initialized
CountVectorizerobject is fit (trained) and transformed (applied) on the corpus of data, returning the term frequencies for each term-document pair
Provided in script.py is Emily Dickinson’s poem Success is counted sweetest, stored in the variable
poem. Read through
poem yourself and find the term-frequency of “clear”. Save your answer, as an integer, to a variable named
Reading through each word of a document to count how many times it appears isn’t too efficient.
The code in script.py preprocesses the poem and then uses scikit-learn’s
CountVectorizer to get the term-frequencies for all terms in
poem. It’s currently missing one line of code to display the term-document matrix of term-frequencies.
CountVectorizer objects have a
.get_feature_names() method which returns a list of all the unique terms in the corpus.
Paste the below line into the “get vocabulary of terms” section of script.py to display the term-frequencies matrix.
feature_names = vectorizer.get_feature_names()
Which term appears the most frequently?