Amazing work! As is the case with many tasks in Python, there’s already a library that can do all of that work for you.
For text_to_bow(), you can approximate the functionality with the collections module's Counter class:

from collections import Counter

tokens = ['another', 'five', 'fish', 'find', 'another', 'faraway', 'fish']
print(Counter(tokens))
# Counter({'fish': 2, 'another': 2, 'find': 1, 'five': 1, 'faraway': 1})
For vectorization, you can use CountVectorizer from the machine learning library scikit-learn. You can use fit() to train the features dictionary and then transform() to transform text into a vector:

from sklearn.feature_extraction.text import CountVectorizer

training_documents = ["Five fantastic fish flew off to find faraway functions.",
                      "Maybe find another five fantastic fish?",
                      "Find my fish with a function please!"]
test_text = ["Another five fish find another faraway fish."]

bow_vectorizer = CountVectorizer()
bow_vectorizer.fit(training_documents)
bow_vector = bow_vectorizer.transform(test_text)

print(bow_vector.toarray())
# [[2 0 1 1 2 1 0 0 0 0 0 0 0 0 0]]
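If you're curious which word each column of that vector counts, you can inspect the fitted vectorizer's features dictionary. Here's a small sketch, assuming scikit-learn 1.0 or later (where the method is named get_feature_names_out(); older releases use get_feature_names()):

from sklearn.feature_extraction.text import CountVectorizer

training_documents = ["Five fantastic fish flew off to find faraway functions.",
                      "Maybe find another five fantastic fish?",
                      "Find my fish with a function please!"]

bow_vectorizer = CountVectorizer()
bow_vectorizer.fit(training_documents)

# Each column of a transformed vector corresponds to one word in the
# features dictionary, listed here in the vectorizer's column order.
print(bow_vectorizer.get_feature_names_out())
# ['another' 'fantastic' 'faraway' 'find' 'fish' 'five' 'flew' 'function'
#  'functions' 'maybe' 'my' 'off' 'please' 'to' 'with']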
Instructions
Now, let's see how scikit-learn stacks up with the same bag-of-words functionality! Import CountVectorizer from sklearn. (Check out the example we gave for how to import CountVectorizer.)
Define bow_vectorizer as our vectorizer using CountVectorizer().
Define training_vectors as bow_vectorizer.fit_transform() called on training_docs.

fit_transform() does two things: it creates the features dictionary and vectorizes the training data.
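As a quick sanity check (not part of the exercise itself), you can verify that fit_transform() gives the same result as calling fit() and then transform() on the same data:

from sklearn.feature_extraction.text import CountVectorizer

documents = ["Five fantastic fish flew off to find faraway functions.",
             "Maybe find another five fantastic fish?"]

# fit_transform() builds the features dictionary and vectorizes in one step...
vectorizer_a = CountVectorizer()
vectors_a = vectorizer_a.fit_transform(documents)

# ...which is equivalent to fit() followed by transform() on the same data.
vectorizer_b = CountVectorizer()
vectorizer_b.fit(documents)
vectors_b = vectorizer_b.transform(documents)

print((vectors_a.toarray() == vectors_b.toarray()).all())  # True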
Define test_vectors as bow_vectorizer.transform() called on test_docs.
Uncomment the code at the bottom of script.py. Run the code again to see why it makes sense to use sklearn's optimized functions!
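If you want to double-check your work, here's a rough sketch of how the steps above fit together. The training_docs and test_docs values below are placeholders; in the exercise, those lists are already defined for you in script.py:

from sklearn.feature_extraction.text import CountVectorizer

# Placeholder documents standing in for the ones script.py provides.
training_docs = ["Five fantastic fish flew off to find faraway functions.",
                 "Maybe find another five fantastic fish?"]
test_docs = ["Another five fish find another faraway fish."]

# Define the vectorizer.
bow_vectorizer = CountVectorizer()

# Build the features dictionary and vectorize the training data in one call.
training_vectors = bow_vectorizer.fit_transform(training_docs)

# Vectorize the test data using the same features dictionary.
test_vectors = bow_vectorizer.transform(test_docs)

print(training_vectors.toarray())
print(test_vectors.toarray())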