Text Preprocessing

Lemmatization is a method for reducing words to their root forms, or lemmas. It is a more involved process than stemming because the method needs to know the part of speech of each word. That extra requirement also makes lemmatization less efficient than stemming.

In the next exercise, we will consider how to tag each word with a part of speech. In the meantime, let’s see how to use NLTK’s lemmatize operation.

We can use NLTK’s WordNetLemmatizer to lemmatize text:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

Once we have the lemmatizer initialized, we can use a list comprehension to apply the lemmatize operation to each word in a list:

tokenized = ["NBC", "was", "founded", "in", "1926"]
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
print(lemmatized) # ["NBC", "wa", "founded", "in", "1926"]

The result, saved to lemmatized, contains 'wa', while the rest of the words remain the same. Not too useful. This happened because lemmatize() treats every word as a noun by default, so "was" was handled like a plural noun rather than a form of the verb "be". To take advantage of the power of lemmatization, we need to tag each word in our text with its most likely part of speech. We’ll do that in the next exercise.
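To make the role of the part of speech concrete, here is a minimal toy sketch in plain Python. It is not NLTK's algorithm: the function toy_lemmatize, the VERB_EXCEPTIONS table, and the plural-stripping rule are all hypothetical stand-ins chosen only to show how the same token can map to a different lemma depending on the tag it is given.

```python
# Toy illustration only: hypothetical lookup data, not WordNet and not NLTK's logic.
VERB_EXCEPTIONS = {"was": "be", "were": "be", "founded": "found"}


def toy_lemmatize(word, pos="n"):
    """Return a lemma for `word`, given a part-of-speech tag ("n" or "v")."""
    lower = word.lower()
    if pos == "v":
        # Verbs: look the word up in the exception table, else leave it alone.
        return VERB_EXCEPTIONS.get(lower, word)
    # Nouns: naively strip a trailing plural "s" from alphabetic words.
    if lower.isalpha() and len(lower) > 2 and lower.endswith("s") and not lower.endswith("ss"):
        return word[:-1]
    return word


tokens = ["NBC", "was", "founded", "in", "1926"]

# Treating every token as a noun mangles "was" into "wa".
as_nouns = [toy_lemmatize(token) for token in tokens]
# Treating every token as a verb recovers "be" and "found".
as_verbs = [toy_lemmatize(token, pos="v") for token in tokens]

print(as_nouns)
print(as_verbs)
```

Tagging every token as a verb is no better in general than tagging every token as a noun. The point is only that a sensible lemma depends on choosing the right tag for each word.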



At the top of the file, import WordNetLemmatizer, then initialize an instance of it and save the result to lemmatizer.


Tokenize the string saved to populated_island. Save the result to tokenized_string.


Use a list comprehension to lemmatize every word in tokenized_string. Save the result to the variable lemmatized_words.
