When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) over certain subsets of the data.
Suppose we have a grade book with columns student
, assignment_name
, and grade
:
student | assignment_name | grade |
---|---|---|
Amy | Assignment 1 | 96 |
Amy | Assignment 2 | 87 |
Bob | Assignment 1 | 91 |
Bob | Assignment 2 | 75 |
Chris | Assignment 1 | 83 |
Chris | Assignment 2 | 88 |
We want to get an average grade for each student across all assignments. We can do this using the helpful dplyr function group_by()
.
For this example, we’d use the following piece of code:
grades <- df %>% group_by(student) %>% summarize(mean_grade = mean(grade))
The output might look something like this:
student | mean_grade |
---|---|
Amy | 91.5 |
Bob | 83 |
Chris | 85.5 |
In general, we use the following syntax to calculate aggregates:
df %>% group_by(column_1) %>% summarize(aggregate_name = command(column_2))
column_1
(student
in our example) is the column that we want togroup_by()
column_2
(grade
in our example) is the column that we want to applycommand()
, a summary function, to usingsummarize()
aggregate_name
is the name assigned to the calculated aggregate
In addition to the summary functions discussed in the last exercise (mean()
, median()
, sd()
, var()
, min()
, max()
, IQR()
and n_distinct()
), another helpful summary function, especially for grouped data, is n()
. n()
will return the count of the rows within a group, and does not require a column as an argument. To get the count of the rows in each group of students from our example:
grades <- df %>% group_by(student) %>% summarize(count = n())
Instructions
Let’s return to our orders
data from ShoeFly.com.
In the previous exercise, our finance department wanted to know the most expensive shoe that we sold.
Now, they want to know the price of the most expensive shoe for each shoe_type
(i.e., the price of the most expensive boot, the price of the most expensive ballet flat, etc.).
Save your answer to the variable pricey_shoes
, and view it.
The inventory team wants to know how many of each shoe_type
has been sold so they can forecast inventory for the future.
Save your answer to the variable shoes_sold
, and view it.