Key Concepts

Review core concepts you need to learn to master this subject

Text Preprocessing

In natural language processing, text preprocessing is the practice of cleaning and preparing text data. The third-party NLTK package and Python's built-in re module are commonly used to handle many text preprocessing tasks.

Noise Removal

In natural language processing, noise removal is a text preprocessing task devoted to stripping text of formatting that carries no linguistic content, such as HTML tags or extra whitespace.
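
As a rough illustration, noise removal can be sketched with the standard-library re module. The function name remove_noise and the sample string are made up for this example, and the tag pattern is a simplification that would not cope with every kind of real-world HTML.

```python
import re

def remove_noise(text):
    # Delete anything that looks like an HTML tag, e.g. <p> or </b>.
    no_tags = re.sub(r"<[^>]+>", "", text)
    # Collapse the leftover runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", no_tags).strip()

print(remove_noise("<p>Hello,   <b>world</b>!</p>"))
# Hello, world!
```

What counts as noise depends on the project: punctuation, for instance, may be noise for one task and a useful signal for another.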

Tokenization

In natural language processing, tokenization is the text preprocessing task of breaking up text into smaller components of text (known as tokens).
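
A minimal sketch of word-level tokenization, assuming a simple regular expression is good enough for the text at hand. The function name tokenize is hypothetical. NLTK's word_tokenize is a more capable alternative, but it needs extra tokenizer data to be downloaded first.

```python
import re

def tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a word);
    # [^\w\s] matches a single punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Tokenization splits text, one piece at a time."))
# ['Tokenization', 'splits', 'text', ',', 'one', 'piece', 'at', 'a', 'time', '.']
```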

Text Normalization

In natural language processing, normalization encompasses many text preprocessing tasks, including stemming, lemmatization, uppercasing or lowercasing, and stopword removal.

Stemming

In natural language processing, stemming is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes).
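
To make "bluntly removing" concrete, here is a toy suffix stripper. It is not the Porter algorithm or any other published stemmer: the suffix list, the minimum stem length, and the name crude_stem are all assumptions chosen for illustration, and prefixes are ignored entirely. In practice, a ready-made stemmer such as NLTK's PorterStemmer is the usual choice.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["jumping", "jumped", "jumps", "flies", "sing"]])
# ['jump', 'jump', 'jump', 'fli', 'sing']
```

The result "fli" shows the bluntness: a stemmer chops characters without checking whether what is left is a real word.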

Lemmatization

In natural language processing, lemmatization is the text preprocessing normalization task concerned with reducing words to their root (dictionary) forms.

Stopword Removal

In natural language processing, stopword removal is the process of removing from a string the words that don’t provide any information about the tone of a statement.
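
A small sketch of stopword removal over an already tokenized sentence. The stopword set here is a tiny hand-picked assumption, and remove_stopwords is a hypothetical name. NLTK ships a fuller English list through nltk.corpus.stopwords, which needs a one-time data download.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_stopwords(tokens):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "is", "in", "the", "garden"]))
# ['cat', 'garden']
```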

Part-of-Speech Tagging

In natural language processing, part-of-speech tagging is the process of assigning a part of speech to every word in a string. Using the part of speech can improve the results of lemmatization.
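
The toy example below shows why the part of speech matters: the same surface word can map to different root forms. The lemma table, the hand-written tags, and the lemmatize function are all hypothetical stand-ins. With NLTK, the tags would typically come from nltk.pos_tag and be converted into the part-of-speech codes that WordNetLemmatizer.lemmatize accepts.

```python
# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("leaves", "noun"): "leaf",
    ("leaves", "verb"): "leave",
    ("saw", "noun"): "saw",
    ("saw", "verb"): "see",
}

def lemmatize(word, pos):
    # Fall back to the lowercased word when the pair is not in the table.
    return LEMMAS.get((word.lower(), pos), word.lower())

# Tags are supplied by hand here; a real tagger would produce them.
tagged = [("She", "pronoun"), ("leaves", "verb"),
          ("the", "determiner"), ("leaves", "noun")]
print([lemmatize(word, pos) for word, pos in tagged])
# ['she', 'leave', 'the', 'leaf']
```

Without the tag, a lemmatizer has to guess a single part of speech for "leaves" and would get one of the two occurrences wrong.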

Text Preprocessing
Lesson 1 of 1
  1. Text preprocessing is an approach for cleaning and preparing text data for use in a specific context. Developers use it in almost all natural language processing (NLP) pipelines, including voice re…
  2. Text cleaning is a technique that developers use in a variety of domains. Depending on the goal of your project and where you get your data from, you may want to remove unwanted information, such a…
  3. For many natural language processing tasks, we need access to each word in a string. To access each word, we first have to break the text into smaller components. The method for breaking text into …
  4. Tokenization and noise removal are staples of almost all text pre-processing pipelines. However, some data may require further processing through text normalization. Text normalization is a catch…
  5. Stopwords are words that we remove during preprocessing when we don’t care about sentence structure. They are usually the most common words in a language and don’t provide any information about the…
  6. In natural language processing, stemming is the text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes). For example, stemming would cast the wo…
  7. Lemmatization is a method for casting words to their root forms. This is a more involved process than stemming, because it requires the method to know the part-of-speech for each word. Since lemm…
  8. To improve the performance of lemmatization, we need to find the part of speech for each word in our string. In script.py, to the right, we created a part-of-speech tagging function. The functi…
  9. This lesson is not an exhaustive introduction to text preprocessing. However, it does show a few of the most common tricks for cleaning your data. Before building a text preprocessing pipeline, it’…

