Part 22 Grouping and summarizing data
22.1 summarize()
Like mutate()
, the summarize()
function also creates new columns,
but the calculations that make the new columns must reduce down to a single number.
For example, let’s compute the mean and standard deviation of life expectancy
in the gapminder
data set:
|>
gapminder summarize(
mean = mean(lifeExp),
sd = sd(lifeExp)
)
Notice that all other columns were dropped.
This is necessary, because there’s no obvious way to compress the other columns down to a single row.
This is unlike mutate()
, which keeps all columns,
and more like transmute()
, which drops all other columns.
As it is, this is useful for creating summary tables, but it’s more useful in the context of grouping, coming up next.
22.2 group_by()
The true power of dplyr lies in its ability to group a tibble, with the
group_by()
function. As usual, this function takes in a tibble and returns a
(grouped) tibble.
Let’s group the gapminder dataset by continent and year:
|>
gapminder group_by(continent, year)
The only thing different from a regular tibble is the indication of grouping variables above the tibble. This means that the tibble is recognized as having “chunks” defined by unique combinations of continent and year:
- Asia in 1952 is one chunk.
- Asia in 1957 is another chunk.
- Europe in 1952 is another chunk.
- etc…
Notice that the data frame isn’t rearranged by chunk! The grouping is something stored internally about the grouped tibble.
Now that the tibble is grouped, operations that you do on a grouped tibble will be done independently within each chunk, as if no other chunks exist.
You can also create new variables and group by that variable simultaneously. Try splitting life expectancy by “small” and “large” using 60 as a threshold:
|>
gapminder group_by(smallLifeExp = lifeExp < 60)
22.3 Grouped summarize()
Want to compute the mean and standard deviation for each year for every continent? No problem:
|>
gapminder group_by(continent, year) |>
summarize(mu = mean(lifeExp),
sigma = sd(lifeExp))
Notice:
- The grouping variables are kept in the tibble, because their values are unique within each chunk (by definition of the chunk!)
- With each call to
summarize()
, the grouping variables are “peeled back” from last grouping variable to first.
This means the above tibble is now only grouped by continent. What happens when we reverse the grouping?
|>
gapminder group_by(year, continent) |> # Different order
summarize(mu = mean(lifeExp),
sigma = sd(lifeExp))
The grouping columns are switched, and now the tibble is grouped by year instead of continent.
dplyr has a bunch of convenience functions that help us write code more
eloquently. We could use group_by()
and summarize()
with length()
to find
the number of entries each country has:
|>
gapminder group_by(country) |>
transmute(n = length(country))
Or, we can use the more elegant dplyr::n()
to count the number of rows in each
group:
|>
gapminder group_by(country) |>
summarize(n = n())
Or better yet, if this is all we want, just use dplyr::count()
:
|>
gapminder count(country)
22.4 Grouped mutate()
Want to get the increase in GDP per capita for each country? No problem:
<- gapminder |>
gap_inc arrange(year) |>
group_by(country) |>
mutate(gdpPercap_inc = gdpPercap - lag(gdpPercap))
print(gap_inc)
The tibble is still grouped by country.
Drop the NA
s with another convenience function, this time supplied by the
tidyr
package (another tidyverse package that we’ll see soon):
|>
gap_inc ::drop_na() tidyr
You can specify specific columns to drop NA
s from in the drop_na()
function.
22.5 Function types
We’ve seen cases of transforming variables using mutate()
and summarize()
,
both with and without group_by()
. How can you know what combination to use?
Here’s a summary based on one of three types of functions.
Function type | Explanation | Examples | In dplyr |
---|---|---|---|
Vectorized functions | These take a vector, and operate on each component independently to return a vector of the same length. In other words, they work element-wise. | cos() , sin() , log() , exp() , round() |
mutate() |
Aggregate functions | These take a vector, and return a vector of length 1 | mean() , sd() , length() |
summarize() , esp with group_by() . |
Window Functions | these take a vector, and return a vector of the same length that depends on the vector as a whole. | lag() , rank() , cumsum() |
mutate() , esp with group_by() |