Stopword Removal

Stopwords are words that we remove during preprocessing when we don’t care about sentence structure. They are usually the most common words in a language, such as “a”, “an”, and “the”, and they contribute little information about the tone of a statement.

NLTK provides a built-in corpus of these words (if the corpus is not yet installed locally, fetch it once with nltk.download('stopwords')). You can import them using the following statements:

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

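We store the stop words in a set because checking whether a word is in a set is much faster than checking whether it is in a list, and we will run that check once for every token below.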
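As a minimal sketch of why the container type matters, the snippet below compares a list and a set built from the same words. The names sample_stop_words and stop_word_set are hypothetical, and the six words are a hand-picked stand-in for NLTK's much longer English list, so the example runs without downloading any corpus.

```python
# Hand-picked sample standing in for NLTK's English stopword list (assumption).
sample_stop_words = ["a", "an", "the", "in", "it", "was"]
stop_word_set = set(sample_stop_words)

# Both containers give the same answer to a membership check...
in_list = "the" in sample_stop_words   # scans the list element by element
in_set = "the" in stop_word_set        # hash lookup, no scan needed

print(in_list, in_set)             # True True
print("network" in stop_word_set)  # False
```

The results are identical either way; the set only changes how much work each lookup takes, which adds up when every token in a long text is checked.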

Now that we have the words saved to stop_words, we can use tokenization and a list comprehension to remove them from a sentence:

from nltk.tokenize import word_tokenize

nbc_statement = "NBC was founded in 1926 making it the oldest major broadcast network in the USA"
word_tokens = word_tokenize(nbc_statement)  # tokenize nbc_statement
# keep only the tokens that are not in the stop word set
statement_no_stop = [word for word in word_tokens if word not in stop_words]
print(statement_no_stop)
# ['NBC', 'founded', '1926', 'making', 'oldest', 'major', 'broadcast', 'network', 'USA']

In this code, we first tokenized our string, nbc_statement, then used a list comprehension to return a list with all of the stopwords removed.
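One detail worth noting: NLTK's English stopwords are all lowercase, so a capitalized token such as "The" at the start of a sentence would pass through the filter above. The sketch below shows one possible way to handle this by comparing the lowercase form of each token. The helper remove_stop_words, the small sample_stop_words set, and the use of str.split in place of word_tokenize are all assumptions made to keep the example self-contained.

```python
# Hand-picked sample standing in for NLTK's English stopword list (assumption).
sample_stop_words = {"a", "an", "the", "in", "it", "was", "is"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word; keep original casing."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

sentence = "The network was founded in 1926"
tokens = sentence.split()  # whitespace split, a stand-in for word_tokenize

# Case-sensitive check: "The" is not in the lowercase set, so it survives.
case_sensitive = [tok for tok in tokens if tok not in sample_stop_words]
# Case-insensitive check: "The" is lowercased before the lookup and removed.
case_insensitive = remove_stop_words(tokens, sample_stop_words)

print(case_sensitive)    # ['The', 'network', 'founded', '1926']
print(case_insensitive)  # ['network', 'founded', '1926']
```

Lowercasing only for the comparison leaves the surviving tokens untouched, so acronyms such as "NBC" keep their original form.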

Instructions

1. At the top of your script, import stopwords from NLTK. Save all English stopwords, as a set, to a variable called stop_words.

2. Tokenize the text in survey_text and save the result to tokenized_survey.

3. Remove stop words from tokenized_survey and save the result to text_no_stops.
