Key Concepts

Review core concepts you need to learn to master this subject

gsub() R Function

# Remove the stray character "1" from each name in the teams vector
# to get the teams_clean vector with the correct names.
teams <- c("Fal1cons", "Cardinals", "Seah1awks", "Vikings", "Bro1nco", "Patrio1ts")
teams_clean <- gsub("1", "", teams)
print(teams_clean)
# Output:
# "Falcons" "Cardinals" "Seahawks" "Vikings" "Bronco" "Patriots"

The base R gsub() function searches a string for matches of a regular expression and replaces every match. It takes three main arguments: the pattern to match (a regular expression or a literal character string), the replacement value, and the string or character vector to search. We can use it to replace substrings within a single string or in each string in a vector.
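As a minimal sketch of both uses, the example below applies gsub() first to a single string and then to every string in a vector, this time with a regular-expression pattern. The variable names and sample values are made up for illustration. The base R function sub() is included for contrast, since it replaces only the first match.

```r
# Hypothetical sample values, not taken from the lesson.
single <- "2021-03-15"
prices <- c("$1,200", "$85", "$3,050")

# Single string: gsub() replaces every "-" it finds.
all_replaced <- gsub("-", "/", single)

# sub() stops after the first match, leaving the second "-" in place.
first_replaced <- sub("-", "/", single)

# Vector: the pattern "[^0-9]" matches any character that is not a digit,
# so each element is stripped down to its digits.
digits <- gsub("[^0-9]", "", prices)

print(all_replaced)
print(first_replaced)
print(digits)
```

Because gsub() is vectorized over its third argument, the same call works unchanged whether it receives one string or many.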

When combined with dplyr’s mutate() function, a column of a data frame can be cleaned to enable analysis.
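A minimal sketch of that combination follows, assuming the dplyr package is installed. The data frame, its column names, and its values are hypothetical and chosen only to show the pattern: gsub() cleans the strings inside mutate(), and as.numeric() then makes the column usable for arithmetic.

```r
library(dplyr)

# Hypothetical data frame; column names and values are illustrative only.
fruits <- data.frame(
  item  = c("apple", "banana", "cherry"),
  price = c("$1.20", "$0.50", "$3.00"),
  stringsAsFactors = FALSE
)

# Remove the dollar sign, then convert the cleaned strings to numbers.
# fixed = TRUE treats "$" as a literal character, not as a regex anchor.
fruits_clean <- fruits %>%
  mutate(price = as.numeric(gsub("$", "", price, fixed = TRUE)))

# The price column is now numeric, so summaries such as mean() work.
print(mean(fruits_clean$price))
```

Without fixed = TRUE, the pattern "$" would match the end of each string instead of the dollar sign, and the column would fail to convert.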

Data Cleaning in R
Lesson 1 of 1
  1. A huge part of data science involves acquiring raw data and getting it into a form ready for analysis. Some have estimated that data scientists spend 80% of their time cleaning and manipulating dat…
  2. We often describe data that is easy to analyze and visualize as “tidy data”. What does it mean to have tidy data? For data to be tidy, it must have: - Each variable as a separate column - Each row…
  3. Often, you have the same data separated out into multiple files. Let’s say that you have a ton of files following the filename structure: ‘file_1.csv’, ‘file_2.csv’, ‘file_3.csv’, and so on. The p…
  4. Since we want - Each variable as a separate column - Each row as a separate observation We would want to reshape a table like: |Account|Checking|Savings| |-|-|-| |“12456543”|8500|8900| |…
  5. Often we see duplicated rows of data in the data frames we are working with. This could happen due to errors in data collection or in saving and loading the data. To check for duplicates, we can u…
  6. In trying to get clean data, we want to make sure each column represents one type of measurement. Often, multiple measurements are recorded in the same column, and we want to separate these out so …
  7. Let’s say we have a column called “type” with data entries in the format “admin_US” or “user_Kenya”, as shown in the table below. |id|type| |-|-| |1011|“user_Kenya”| |1112|“admin_US”| |1113…
  8. Each column of a data frame can hold items of the same data type. The data types that R uses are: character, numeric (real or decimal), integer, logical, or complex. Often, we want to convert bet…
  9. Sometimes we need to modify strings in our data frames to help us transform them into more meaningful metrics. For example, in our fruits table from before: |item|price|calories| |-|-|-|…
  10. Great! We have looked at a number of different methods we may use to get data into the format we want for analysis. Specifically, we have covered: * diagnosing the “tidiness” of data * combining …
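Three of the steps named in the lesson summaries above (dropping duplicate rows, splitting a combined column such as "type", and converting between data types) can be sketched in base R as follows. This is an illustration under stated assumptions, not the lessons' own code: the data frame is hypothetical, the "moderator_UK" entry is invented, and the functions used here (duplicated(), strsplit(), as.numeric()) were chosen for this sketch and are not necessarily the ones the lessons teach.

```r
# Hypothetical data in the "admin_US" / "user_Kenya" format; the last row
# is an exact duplicate, and "moderator_UK" is an invented value.
users <- data.frame(
  id   = c("1011", "1112", "1113", "1113"),
  type = c("user_Kenya", "admin_US", "moderator_UK", "moderator_UK"),
  stringsAsFactors = FALSE
)

# 1. Drop exact duplicate rows, keeping the first occurrence of each.
users <- users[!duplicated(users), ]

# 2. Split "type" at the underscore into separate role and country columns.
parts <- strsplit(users$type, "_", fixed = TRUE)
users$role    <- sapply(parts, `[`, 1)
users$country <- sapply(parts, `[`, 2)
users$type    <- NULL

# 3. Convert id from character to numeric.
users$id <- as.numeric(users$id)

str(users)
```

After these steps each column holds a single kind of measurement and each row is a distinct observation, which matches the tidy-data criteria described in the summaries.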
