Skip to Content
Bag-of-Words Language Model
Spam A Lot No More

Amazing work! As is the case with many tasks in Python, there’s already a library that can do all of that work for you.

For text_to_bow(), you can approximate the functionality with the collections module’s Counter() function:

from collections import Counter tokens = ['another', 'five', 'fish', 'find', 'another', 'faraway', 'fish'] print(Counter(tokens)) # Counter({'fish': 2, 'another': 2, 'find': 1, 'five': 1, 'faraway': 1})

For vectorization, you can use CountVectorizer from the machine learning library scikit-learn. You can use fit() to train the features dictionary and then transform() to transform text into a vector:

from sklearn.feature_extraction.text import CountVectorizer training_documents = ["Five fantastic fish flew off to find faraway functions.", "Maybe find another five fantastic fish?", "Find my fish with a function please!"] test_text = ["Another five fish find another faraway fish."] bow_vectorizer = CountVectorizer() bow_vector = bow_vectorizer.transform(test_text) print(bow_vector.toarray()) # [[2 0 1 1 2 1 0 0 0 0 0 0 0 0 0]]



Now, let’s see how scikit-learn stacks up with the same bag-of-words functionality! Import CountVectorizer from sklearn. (Check out the example we gave for how to import CountVectorizer.)


Define bow_vectorizer as our vectorizer using CountVectorizer().


Define training_vectors as bow_vectorizer.fit_transform() called on training_docs.

fit_transform() does two things: creation of the features dictionary and the vectorization of the training data.

Define test_vectors as bow_vectorizer.transform() called on test_docs.


Uncomment the code at the bottom of Run the code again to see why it makes sense to use sklearn‘s optimized functions!

Folder Icon

Sign up to start coding

Already have an account?