Part 22 Grouping and summarizing data

22.1 summarize()

Like mutate(), the summarize() function creates new columns, but each calculation that makes a new column must reduce its input down to a single value.

For example, let’s compute the mean and standard deviation of life expectancy in the gapminder data set:

gapminder |> 
  summarize(mean = mean(lifeExp),
            sd   = sd(lifeExp))

Notice that all other columns were dropped. This is necessary, because there’s no obvious way to compress the other columns down to a single row. This is unlike mutate(), which keeps all columns, and more like transmute(), which drops all other columns.
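
To make the contrast with mutate() concrete, here is a minimal sketch on a small made-up tibble. The name toy and its values are assumptions for illustration only, not real gapminder data:

```r
library(dplyr)

# Toy data, made up for illustration (not real gapminder values)
toy <- tibble(
  country = c("A", "B", "C"),
  lifeExp = c(50, 60, 70)
)

# summarize() collapses everything to one row and keeps only the new columns
toy_summary <- toy |>
  summarize(mean = mean(lifeExp),
            sd   = sd(lifeExp))

# mutate() keeps every row and every column, recycling the single value
toy_mutated <- toy |>
  mutate(mean = mean(lifeExp))

dim(toy_summary)  # 1 row, 2 columns
dim(toy_mutated)  # 3 rows, 3 columns
```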

As it is, this is useful for creating summary tables, but it’s more useful in the context of grouping, coming up next.

22.2 group_by()

The true power of dplyr lies in its ability to group a tibble, with the group_by() function. As usual, this function takes in a tibble and returns a (grouped) tibble.

Let’s group the gapminder dataset by continent and year:

gapminder |> 
  group_by(continent, year)

The only thing different from a regular tibble is the indication of grouping variables above the tibble. This means that the tibble is recognized as having “chunks” defined by unique combinations of continent and year:

  • Asia in 1952 is one chunk.
  • Asia in 1957 is another chunk.
  • Europe in 1952 is another chunk.
  • etc…

Notice that the data frame isn’t rearranged by chunk! The grouping is something stored internally about the grouped tibble.

Now that the tibble is grouped, operations that you do on a grouped tibble will be done independently within each chunk, as if no other chunks exist.
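
One way to inspect the grouping, sketched on a small made-up tibble (toy and its values are assumptions, not real gapminder data), is with dplyr's group_vars() and n_groups(). The grouped mutate() at the end shows a calculation being done separately within each chunk:

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  continent = c("Asia", "Europe", "Asia", "Europe"),
  year      = c(1952, 1952, 1957, 1952),
  lifeExp   = c(40, 65, 45, 70)
)

grouped <- toy |> 
  group_by(continent, year)

group_vars(grouped)  # "continent" "year"
n_groups(grouped)    # 3 chunks: Asia 1952, Asia 1957, Europe 1952

# The rows are not rearranged by grouping
identical(grouped$lifeExp, toy$lifeExp)  # TRUE

# The mean is computed within each chunk, not over the whole column
with_chunk_mean <- grouped |> 
  mutate(chunk_mean = mean(lifeExp))
with_chunk_mean$chunk_mean  # 40.0 67.5 45.0 67.5
```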

You can also create new variables and group by that variable simultaneously. Try splitting life expectancy by “small” and “large” using 60 as a threshold:

gapminder |> 
  group_by(smallLifeExp = lifeExp < 60)
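
As a rough sketch of what this produces, the same idea on a small made-up tibble (toy is an assumption, not real data) shows that the new variable becomes both a column and the grouping variable:

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(lifeExp = c(45, 59.9, 60, 72))

by_size <- toy |> 
  group_by(smallLifeExp = lifeExp < 60)

names(by_size)       # "lifeExp" "smallLifeExp"
group_vars(by_size)  # "smallLifeExp"

# Count the rows in each of the two chunks (FALSE and TRUE)
size_counts <- by_size |> 
  summarize(n = n())
```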

22.3 Grouped summarize()

Want to compute the mean and standard deviation for each year for every continent? No problem:

gapminder |> 
  group_by(continent, year) |> 
  summarize(mu    = mean(lifeExp),
            sigma = sd(lifeExp))


Notice two things:

  • The grouping variables are kept in the tibble, because their values are unique within each chunk (by definition of the chunk!)
  • Each call to summarize() “peels back” one level of grouping, starting from the last grouping variable and working towards the first.

This means the above tibble is now only grouped by continent. What happens when we reverse the grouping?

gapminder |> 
  group_by(year, continent) |>    # Different order
  summarize(mu    = mean(lifeExp),
            sigma = sd(lifeExp))

The grouping columns are switched, and now the tibble is grouped by year instead of continent.
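
The peeling can be checked directly with group_vars(). This is a minimal sketch on a made-up tibble (toy and its values are assumptions). The .groups = "drop_last" argument, available in dplyr 1.0.0 and later, only spells out the behaviour described above and silences the message that summarize() otherwise prints:

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 1957, 1952, 1957),
  lifeExp   = c(40, 45, 65, 70)
)

by_cont_year <- toy |> 
  group_by(continent, year) |> 
  summarize(mu = mean(lifeExp), .groups = "drop_last")

by_year_cont <- toy |> 
  group_by(year, continent) |>    # Different order
  summarize(mu = mean(lifeExp), .groups = "drop_last")

group_vars(by_cont_year)  # "continent"
group_vars(by_year_cont)  # "year"

# A second summarize() peels back the remaining grouping variable
ungrouped <- by_cont_year |> 
  summarize(mu = mean(mu))
group_vars(ungrouped)     # character(0)
```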

dplyr has a bunch of convenience functions that help us write code more concisely. We could use group_by() and summarize() with length() to find the number of entries each country has:

gapminder |> 
  group_by(country) |> 
  summarize(n = length(country))

Or, we can use the more elegant dplyr::n() to count the number of rows in each group:

gapminder |> 
  group_by(country) |> 
  summarize(n = n())

Or better yet, if this is all we want, just use dplyr::count():

gapminder |> 
  count(country)
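
As a quick check that the two approaches agree, here is a sketch on a made-up tibble (toy is an assumption, not real gapminder data):

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(country = c("A", "A", "B", "A", "B", "C"))

via_summarize <- toy |> 
  group_by(country) |> 
  summarize(n = n())

via_count <- toy |> 
  count(country)

via_count$n                                # 3 2 1
identical(via_summarize$n, via_count$n)    # TRUE
```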

22.4 Grouped mutate()

Want to get the increase in GDP per capita for each country? No problem:

gap_inc <- gapminder |> 
  arrange(year) |> 
  group_by(country) |>
  mutate(gdpPercap_inc = gdpPercap - lag(gdpPercap))

The tibble is still grouped by country.
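
To see why the first year of each country comes out as NA, here is the same pipeline sketched on a made-up tibble (toy and its values are assumptions). Within each country, lag() has no previous value for the earliest year:

```r
library(dplyr)

# Toy data, made up for illustration; rows are deliberately out of order
toy <- tibble(
  country   = c("A", "B", "A", "B"),
  year      = c(1957, 1952, 1952, 1957),
  gdpPercap = c(110, 200, 100, 260)
)

toy_inc <- toy |> 
  arrange(year) |>      # lag() relies on row order, so sort by year first
  group_by(country) |>  # lag within each country, not across countries
  mutate(gdpPercap_inc = gdpPercap - lag(gdpPercap))

# Increases for country A (1952, then 1957): NA 10
inc_A <- toy_inc$gdpPercap_inc[toy_inc$country == "A"]
# Increases for country B (1952, then 1957): NA 60
inc_B <- toy_inc$gdpPercap_inc[toy_inc$country == "B"]
```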

Drop the NAs with another convenience function, this time supplied by the tidyr package (another tidyverse package that we’ll see soon):

gap_inc |> 
  drop_na()

You can also name specific columns in drop_na(), so that only rows with NAs in those columns are dropped.
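
A minimal sketch of the difference, on a made-up tibble (toy and its columns x and y are assumptions for illustration):

```r
library(dplyr)
library(tidyr)

# Toy data, made up for illustration
toy <- tibble(
  x = c(1, NA, 3),
  y = c("a", "b", NA)
)

# No columns named: drop rows with an NA in any column
any_na_dropped <- toy |> 
  drop_na()

# Column x named: drop only rows where x is NA
x_na_dropped <- toy |> 
  drop_na(x)

nrow(any_na_dropped)  # 1
nrow(x_na_dropped)    # 2
```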

22.5 Function types

We’ve seen cases of transforming variables using mutate() and summarize(), both with and without group_by(). How can you know what combination to use? Here’s a summary, organized by the three types of functions you might be applying.

| Function type | Explanation | Examples | In dplyr |
|---|---|---|---|
| Vectorized functions | These take a vector, and operate on each component independently to return a vector of the same length. In other words, they work element-wise. | cos(), sin(), log(), exp(), round() | mutate() |
| Aggregate functions | These take a vector, and return a vector of length 1. | mean(), sd(), length() | summarize(), especially with group_by() |
| Window functions | These take a vector, and return a vector of the same length that depends on the vector as a whole. | lag(), rank(), cumsum() | mutate(), especially with group_by() |
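
The three types can be told apart by what they return for the same input. A small base R sketch, where the vector x is made up for illustration:

```r
# Made-up input vector
x <- c(1, 4, 9, 16)

# Vectorized: each element is transformed on its own
vec_out <- sqrt(x)    # 1 2 3 4

# Aggregate: the whole vector is reduced to one value
agg_out <- mean(x)    # 7.5

# Window: same length as the input, but each value depends on the
# vector as a whole (here, on all the elements before it)
win_out <- cumsum(x)  # 1 5 14 30

length(vec_out)  # 4
length(agg_out)  # 1
length(win_out)  # 4
```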