Key Concepts

Review core concepts you need to learn to master this subject

Bag-of-words

# the sentence "Squealing suitcase squids are not like regular squids." could be changed into the following BoW dictionary: {'squeal': 1, 'like': 1, 'not': 1, 'suitcase': 1, 'be': 1, 'regular': 1, 'squid': 2}

Bag-of-words (BoW) is a statistical language model used to analyze text and documents based on word count. The model does not account for word order within a document. BoW can be implemented as a Python dictionary with each key set to a word and each value set to the number of times that word appears in a text.
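A minimal sketch of such an implementation, using the text_to_bow() name from the lesson and assuming the text has already been tokenized and lemmatized (e.g., "squids" becomes "squid"):

def text_to_bow(tokens):
    # Tally how many times each token appears in the text
    bow_dictionary = {}
    for token in tokens:
        bow_dictionary[token] = bow_dictionary.get(token, 0) + 1
    return bow_dictionary

tokens = ['squeal', 'suitcase', 'squid', 'be', 'not', 'like', 'regular', 'squid']
print(text_to_bow(tokens))
# {'squeal': 1, 'suitcase': 1, 'squid': 2, 'be': 1, 'not': 1, 'like': 1, 'regular': 1}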

Feature Extraction in NLP

Feature extraction (or vectorization) in NLP is the process of turning text into a BoW vector, in which features are unique words and feature values are word counts.
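A sketch of this step, using the tokens_to_bow_vector() name from the lesson; it assumes a features dictionary mapping words to indices (built as described under Features Dictionary in NLP below) and skips tokens outside the vocabulary:

def tokens_to_bow_vector(tokens, features_dictionary):
    # One slot per vocabulary word, positioned by its index in the mapping
    bow_vector = [0] * len(features_dictionary)
    for token in tokens:
        if token in features_dictionary:
            bow_vector[features_dictionary[token]] += 1
    return bow_vector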

Bag-of-words Test Data

Bag-of-words test data is new text that is converted to a BoW vector using a trained features dictionary: each test token is counted at the index assigned to it by the dictionary’s mapping.

For example, given the training data “There are many ways to success.” and the test data “many success ways”, the trained features dictionary and the test data BoW vector could be the following:
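A plausible reconstruction, assuming lowercased, lemmatized tokens (so "are" becomes "be" and "ways" becomes "way"):

# trained features dictionary, tokens indexed in order of first appearance
{'there': 0, 'be': 1, 'many': 2, 'way': 3, 'to': 4, 'success': 5}
# BoW vector for the test data "many success ways"
[0, 0, 1, 1, 0, 1]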

Feature Vector

In machine learning, a feature vector is a numeric representation of an object’s salient features. In the case of bag-of-words (BoW), the objects are text samples and the features are word counts.

For example, given the features dictionary mapping shown below, a BoW feature vector of “Another five fish find another faraway fish.” would be [1, 0, 2, 0, 0, 0, 1, 1, 0, 0, 2].
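The mapping could look like the following (a reconstruction consistent with that vector; the words at zero-count positions are illustrative assumptions):

{'five': 0, 'fancy': 1, 'fish': 2, 'fly': 3, 'off': 4, 'to': 5,
 'find': 6, 'faraway': 7, 'function': 8, 'maybe': 9, 'another': 10}
# "Another five fish find another faraway fish." lemmatizes to
# another: 2, five: 1, fish: 2, find: 1, faraway: 1
# -> [1, 0, 2, 0, 0, 0, 1, 1, 0, 0, 2]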

Language Smoothing in NLP

Language smoothing is a technique used to avoid overfitting in NLP. It takes a bit of probability from known words and allots it to unknown words, so that unknown words end up with a probability greater than 0.
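One common form is add-one (Laplace) smoothing; a minimal sketch, with illustrative function name and data:

from collections import Counter

def laplace_unigram_probability(word, training_tokens, vocabulary_size):
    # Adding 1 to every count reserves probability mass for unseen words
    counts = Counter(training_tokens)
    return (counts[word] + 1) / (len(training_tokens) + vocabulary_size)

training_tokens = ['the', 'squid', 'is', 'not', 'like', 'the', 'suitcase']
# 'octopus' never appears in training, yet its smoothed probability is > 0
print(laplace_unigram_probability('octopus', training_tokens, vocabulary_size=7))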

Features Dictionary in NLP

A features dictionary is a mapping of each unique word in the training data to a unique index. This is used to build out bag-of-words vectors.

For instance, given the training data “Squealing suitcase squids are not like regular squids”, the features dictionary could be the following:
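Assuming lemmatized tokens indexed in order of first appearance:

{'squeal': 0, 'suitcase': 1, 'squid': 2, 'be': 3, 'not': 4, 'like': 5, 'regular': 6}

A sketch of a function that builds such a mapping, using the create_features_dictionary() name from the lesson and assuming each document is a list of preprocessed tokens:

def create_features_dictionary(documents):
    # Assign each previously unseen token the next unused index
    features_dictionary = {}
    index = 0
    for tokens in documents:
        for token in tokens:
            if token not in features_dictionary:
                features_dictionary[token] = index
                index += 1
    return features_dictionary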

Bag-of-words Data Sparsity

# the sentence "Squealing suitcase squids are not like regular squids." could be changed into the following BoW dictionary: {'squeal': 1, 'like': 1, 'not': 1, 'suitcase': 1, 'be': 1, 'regular': 1, 'squid': 2}

Bag-of-words has less data sparsity than higher-order statistical models such as n-grams (i.e., it has more training knowledge to draw from, because each word is counted regardless of context). When vectorizing a document, the vector is considered sparse if the majority of its values are zero, meaning that most of the vocabulary words do not appear in that document. BoW also suffers less from overfitting (adapting a model too strongly to training data).

Perplexity in NLP

For text prediction tasks, the ideal language model is one that assigns the highest probability to an unseen test text. Such a model is said to have low perplexity.

Bag-of-words has higher perplexity (it is less predictive of natural language) than other models. For instance, using a Markov chain for text prediction with bag-of-words, you might get a resulting nonsensical sequence like: “we i there your your”.

A trigram model, meanwhile, might generate the far more coherent (though still strange): “i ascribe to his dreams to love beauty”.
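Concretely, perplexity is the inverse probability of the test text, normalized by its length: PP = P(w1, ..., wN) ** (-1/N). A sketch for a unigram (BoW) model, assuming every test token has a known probability:

import math

def perplexity(test_tokens, word_probabilities):
    # Summing log probabilities avoids numerical underflow on long texts
    log_probability = sum(math.log(word_probabilities[token]) for token in test_tokens)
    return math.exp(-log_probability / len(test_tokens))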

Bag-of-Words Language Model
Lesson 1 of 1
  1. “A bag-of-words is all you need,” some NLPers have decreed. The bag-of-words language model is a simple-yet-powerful tool to have up your sleeve when working on natural language processing (NLP)….
  2. Bag-of-words (BoW) is a statistical language model based on word count. Say what? Let’s start with that first part: a statistical language model is a way for computers to make sense o…
  3. One of the most common ways to implement the BoW model in Python is as a dictionary with each key set to a word and each value set to the number of times that word appears. Take the example below:…
  4. Sometimes a dictionary just won’t fit the bill. Topic modelling applications, for example, require an implementation of bag-of-words that is a bit more mathematical: feature vectors. A feat…
  5. Now that you know what a bag-of-words vector looks like, you can create a function that builds them! First, we need a way of generating a features dictionary from a list of training documents. We…
  6. Nice work! Time to put that dictionary of vocabulary to good use and build a bag-of-words vector from a new document. In Python, we can use a list to represent a vector. Each index in the list wil…
  7. Phew! That was a lot of work. It’s time to put create_features_dictionary() and tokens_to_bow_vector() together and use them in a spam filter we created that uses a Naive Bayes classifier. We’ve s…
  8. Amazing work! As is the case with many tasks in Python, there’s already a library that can do all of that work for you. For text_to_bow(), you can approximate the functionality with the collectio…
  9. As you can see, bag-of-words is pretty useful! BoW also has several advantages over other language models. For one, it’s an easier model to get started with and a few Python libraries already have …
  10. Alas, there is a trade-off for all the brilliance BoW brings to the table. Unless you want sentences that look like “the a but for the”, BoW is NOT a great primary model for text prediction. If t…
  11. You made it! And you’ve learned plenty about the bag-of-words language model along the way: - Bag-of-words (BoW) — also referred to as the unigram model — is a statistical language model based on w…
