The inverse document frequency component of the tf-idf score penalizes terms that appear more frequently across a corpus. The intuition is that words that appear more frequently in the corpus give less insight into the topic or meaning of an individual document, and should thus be deprioritized.
For example, terms like “the” or “go” are used all over the place, so in a bag-of-words model, they would be given priority even though they don’t provide much meaning; tf-idf would deprioritize these sorts of common words.
We can calculate the inverse document frequency for some term t across a corpus using the equation below. Don’t be scared if you aren’t a math person!
The important takeaway from the equation is that as the number of documents containing the term t increases, the inverse document frequency decreases (due to the nature of the log function). The more documents in the corpus a term appears in, the less important it becomes to any individual document.
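To make that takeaway concrete, here is a minimal sketch. It assumes the common textbook form of the equation, idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t. The toy corpus and the helper name inverse_document_frequency are made up for illustration and are not part of the lesson files.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["the", "bird", "sings"],
    ["the", "bee", "hums"],
    ["the", "bird", "flies"],
    ["the", "sea", "sleeps"],
]

def inverse_document_frequency(term, documents):
    """Assumed textbook form: log(total documents / documents containing term)."""
    docs_with_term = sum(1 for doc in documents if term in doc)
    if docs_with_term == 0:
        raise ValueError(f"{term!r} does not appear in any document")
    return math.log(len(documents) / docs_with_term)

# "the" is in 4 of 4 documents, "bird" in 2, "bee" in 1.
for term in ["the", "bird", "bee"]:
    print(term, round(inverse_document_frequency(term, corpus), 3))
```

Under this assumed form, a term found in every document ("the") scores 0, and the score grows as the term becomes rarer.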
Inverse document frequency can be calculated on a group of documents using scikit-learn’s TfidfTransformer:

    from sklearn.feature_extraction.text import TfidfTransformer

    transformer = TfidfTransformer(norm=None)
    transformer.fit(term_frequencies)
    inverse_doc_frequency = transformer.idf_

A TfidfTransformer object is initialized. Don’t worry about the norm=None keyword argument for now; we will dig into it in the next exercise.
The TfidfTransformer is fit (trained) on a term-document matrix of term frequencies.
The .idf_ attribute of the TfidfTransformer stores the inverse document frequencies of the terms as a NumPy array.
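The values stored in .idf_ may differ from a plain log ratio. With its default smooth_idf=True setting, scikit-learn documents a smoothed form, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The sketch below reproduces that formula in plain Python so the numbers can be checked by hand. The counts matrix and the helper name smoothed_idf are hypothetical, and the matrix uses one row per document and one column per term, the layout scikit-learn expects as input.

```python
import math

terms = ["the", "bird", "bee"]

# Hypothetical term frequencies: one row per document, one column per term.
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 2, 0],
    [3, 0, 0],
]

def smoothed_idf(matrix):
    """Smoothed idf as documented for scikit-learn's default smooth_idf=True."""
    n_documents = len(matrix)
    n_terms = len(matrix[0])
    idf = []
    for j in range(n_terms):
        # Document frequency: number of documents where term j occurs at all.
        doc_freq = sum(1 for row in matrix if row[j] > 0)
        idf.append(math.log((1 + n_documents) / (1 + doc_freq)) + 1)
    return idf

for term, value in zip(terms, smoothed_idf(counts)):
    print(term, round(value, 3))
```

Because of the added 1, a term that occurs in every document receives the minimum score of 1.0 rather than 0, so it is down-weighted but not ignored entirely.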
A selection of 6 Emily Dickinson poems is given in poems.py. The term frequencies of each term-document pair are calculated in term_frequency.py and stored in term_frequencies as a matrix and df_term_frequencies as a Pandas DataFrame.
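The contents of term_frequency.py are not shown here, so the following is only a rough sketch of how such a matrix of term frequencies might be built with the standard library. The sample sentences, the whitespace tokenization, and the variable names are all assumptions made for illustration; they are not the lesson's actual implementation.

```python
from collections import Counter

# Hypothetical stand-ins for documents; the real poems live in poems.py.
documents = [
    "the bird sings at dawn",
    "the bee hums",
    "the bird and the bee",
]

# Naive whitespace tokenization, an assumption made for brevity.
token_counts = [Counter(doc.split()) for doc in documents]

# Sorted list of every distinct term in the corpus.
vocabulary = sorted(set().union(*token_counts))

# One row per document, one column per term; missing terms count as 0.
term_frequency_matrix = [
    [counts[term] for term in vocabulary] for counts in token_counts
]

print(vocabulary)
for row in term_frequency_matrix:
    print(row)
```

Each row holds the raw counts for one document, in the same column order as vocabulary.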
In script.py, print df_term_frequencies to view the term-document matrix of term frequencies.
Let’s calculate the inverse document frequency for each term in the selection of Emily Dickinson’s poems! Begin by creating a TfidfTransformer object named
Store the calculated inverse document frequency values in a variable named
When you run the code, a table of inverse document frequency scores for all the terms will appear. Which terms are penalized for occurring across multiple documents?
Refer back to the poems in poems.py to see the original poems.