Part 23 Advanced dplyr functions

23.1 recode()

recode() is useful for recoding categorical variables.

Unlike most other functions in dplyr, recode() is backwards in its syntax: the old value goes on the left of = and the new value on the right.

recode(.x, old = new)

Let’s take a look at recoding different variables using the psychTools::bfi dataset.

In the dataset, our gender variable has values 1 and 2.

This is a little vague, since the numbers alone don’t tell us which gender 1 or 2 refers to. Recoding them to labels makes the variable self-explanatory:

library(tidyverse)  # provides dplyr, tidyr, and tibble::rownames_to_column()

dat_bfi <- psychTools::bfi |> 
  rownames_to_column(var = ".id")

dat_bfi |>
  mutate(
    gender = recode(gender, "1" = "man", "2" = "woman")
  ) |>
  select(.id, gender, education) |>
  head()

Note that for numeric values on the left side of =, you need to wrap them in “quotes” or backticks; that’s not necessary for character values.
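As a minimal sketch of this quoting rule, assuming dplyr is installed and using made-up toy vectors (fruit and codes are illustrative names, not part of the bfi data):

```r
library(dplyr)

# Character values: the old value can be written as a bare name.
fruit <- c("a", "b", "a")
fruit_long <- recode(fruit, a = "apple", b = "banana")

# Numeric values: the old value must be quoted or wrapped in backticks.
codes <- c(1, 2, 1)
codes_quoted <- recode(codes, "1" = "man", "2" = "woman")
codes_backtick <- recode(codes, `1` = "man", `2` = "woman")
```

Both numeric forms give the same result; which one to use is a matter of taste.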

We can also specify a .default value within our recode().

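For example, say we want just two categories, “HS or less” versus “more than HS”. Codes 1 and 2 are recoded explicitly, and every other non-missing value falls through to .default: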

dat_bfi |>
  mutate(
    education = recode(education, "1" = "HS", "2" = "HS", .default = "More than HS")
  ) |>
  select(.id, gender, education) |>
  head()

Another neat feature of the recode() function is the .missing value.

If we would rather convert NA values to something more explicit, we can specify that in the .missing argument.

dat_bfi |>
  mutate(
    education = recode(
      education, 
      "1" = "HS", 
      "2" = "HS", 
      .default = "More than HS", 
      .missing = "(Unknown)"
    )
  ) |>
  select(.id, gender, education) |>
  head()
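To see how .default and .missing divide up the values, here is a small sketch on a toy vector (education_code is a made-up stand-in for the real column):

```r
library(dplyr)

education_code <- c(1, 2, 3, 5, NA)  # toy codes, not the real data

education_label <- recode(
  education_code,
  "1" = "HS",
  "2" = "HS",
  .default = "More than HS",  # any non-missing code not listed above
  .missing = "(Unknown)"      # NA values only
)

education_label
# "HS" "HS" "More than HS" "More than HS" "(Unknown)"
```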

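Or we can use tidyr::replace_na(). Because education is stored as integer codes, we convert it to character first so that the replacement has a compatible type:

dat_bfi |>
  mutate(
    education = replace_na(as.character(education), replace = "(Unknown)")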
  ) |>
  select(.id, gender, education) |>
  head()
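Recent versions of tidyr expect the replacement to have a type compatible with the vector being filled, so a character label cannot be dropped directly into an integer column. A minimal sketch on a toy vector (education_code is a made-up name):

```r
library(tidyr)

education_code <- c(1L, NA, 3L)  # toy integer codes

# Convert to character first so that "(Unknown)" is a compatible replacement.
education_chr <- replace_na(as.character(education_code), "(Unknown)")

education_chr
# "1" "(Unknown)" "3"
```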

23.2 across()

The across() function allows us to apply the same transformation to multiple columns at once.

Say we wanted to look at the mean of each agreeableness item within each gender group:

dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      mean,
      na.rm = TRUE
    )
  )

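Passing extra arguments such as na.rm = TRUE through across() in this way is deprecated as of dplyr 1.1.0 and produces a warning. The preferred form puts the function name mean, together with all of its arguments, in an anonymous function: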

dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      \(x) mean(x, na.rm = TRUE)
    )
  )
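Here is the same pattern as a self-contained sketch on a made-up toy data frame, small enough to check the group means by hand (toy and toy_means are illustrative names):

```r
library(dplyr)

toy <- tibble(
  gender = c(1, 1, 2, 2),
  A1 = c(2, 4, 1, NA),
  A2 = c(6, 4, 5, 3)
)

toy_means <- toy |>
  group_by(gender) |>
  summarize(
    # na.rm = TRUE lives inside the anonymous function, next to mean()
    across(A1:A2, \(x) mean(x, na.rm = TRUE))
  )

toy_means
# gender 1: A1 = 3, A2 = 5
# gender 2: A1 = 1, A2 = 4 (the NA in A1 is dropped before averaging)
```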

What if we wanted to include the standard deviation as well? We can pass a named list of functions into across(). The list names are appended to the column names in the output:

dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        sd = \(x) sd(x, na.rm = TRUE)
      )
    )
  )
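With a named list, across() by default names each output column as the column name, an underscore, and the function name. A minimal sketch on a made-up toy data frame shows the resulting names:

```r
library(dplyr)

toy <- tibble(
  gender = c(1, 1, 2, 2),
  A1 = c(2, 4, 1, 3),
  A2 = c(6, 4, 5, 3)
)

toy_summary <- toy |>
  group_by(gender) |>
  summarize(
    across(
      A1:A2,
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        sd = \(x) sd(x, na.rm = TRUE)
      )
    )
  )

names(toy_summary)
# "gender" "A1_mean" "A1_sd" "A2_mean" "A2_sd"
```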

23.3 Complex recoding plus across()

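Sometimes our scales include reverse-scored items, where a high response indicates a low level of the trait. On the 1 to 6 response scale used here, reverse scoring maps 6 to 1, 5 to 2, and so on. We can do this with recode() or with arithmetic: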

dat_bfi |>
  mutate(
    A1r = recode(
      A1, 
      "6" = 1, "5" = 2, "4" = 3, "3" = 4, "2" = 5, "1" = 6
    )
  ) |>
  select(A1, A1r) |> 
  head()

# or

dat_bfi |>
  mutate(A1r = max(A1, na.rm = TRUE) - A1 + min(A1, na.rm = TRUE)) |>
  select(A1, A1r) |> 
  head()
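The two approaches agree because, on a 1 to 6 scale, the reversed score is simply 7 minus the response. The sketch below checks this on a made-up toy vector. It uses the fixed constant 7 rather than the observed minimum and maximum, on the assumption that the scale endpoints are known. The max()/min() version gives the same answer only when both endpoints actually occur in the data.

```r
library(dplyr)

item <- c(1, 2, 6, NA, 4)  # toy responses on a 1-6 scale

rev_recode <- recode(
  item,
  "6" = 1, "5" = 2, "4" = 3, "3" = 4, "2" = 5, "1" = 6
)

# (scale maximum + scale minimum) - response
rev_arith <- 7 - item
```

Both vectors come out as 6, 5, 1, NA, 3, with the missing value left missing.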

However, with slightly more complex code we can reverse score all of these items in one fell swoop!

We start by either listing the columns that need reverse coding by hand or pulling them from a data dictionary:

reversed <- c("A1", "C4", "C5", "E1", "E2", "O2", "O5")

# or

dict <- psychTools::bfi.dictionary |>
  as_tibble(rownames = "item")

reversed <- dict |>
  filter(Keying == -1) |>
  pull(item)
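The dictionary lookup can be sketched on a made-up toy dictionary (toy_dict is hypothetical and much smaller than the real one). Rows whose Keying is NA, such as demographic variables, are dropped by filter() along with the positively keyed items:

```r
library(dplyr)

toy_dict <- tibble(
  item = c("A1", "A2", "C4", "gender"),
  Keying = c(-1, 1, -1, NA)
)

toy_reversed <- toy_dict |>
  filter(Keying == -1) |>  # keep only negatively keyed items
  pull(item)               # extract the item names as a character vector

toy_reversed
# "A1" "C4"
```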

Putting it all together:

dat_bfi |>
  mutate(across(
    all_of(reversed),
    \(x) recode(x, "6" = 1, "5" = 2, "4" = 3, "3" = 4, "2" = 5, "1" = 6),
    .names = "{.col}r"
  )) |>
  head()

The .names argument tells across() how to name the new columns. If you omit .names, the columns will be modified in place. In .names, the {.col} bit means “the column name”, and any text around that (here the letter r) is added to the name.
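A minimal sketch of the difference, using a made-up two-column data frame and the arithmetic form of reverse scoring for brevity:

```r
library(dplyr)

toy <- tibble(A1 = c(1, 6), A2 = c(2, 5))

# With .names: the originals are kept and new columns A1r and A2r are added.
toy_added <- toy |>
  mutate(across(A1:A2, \(x) 7 - x, .names = "{.col}r"))

# Without .names: A1 and A2 themselves are overwritten.
toy_inplace <- toy |>
  mutate(across(A1:A2, \(x) 7 - x))

names(toy_added)
# "A1" "A2" "A1r" "A2r"
names(toy_inplace)
# "A1" "A2"
```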

23.4 rowwise()

rowwise() is a special group_by(). It tells dplyr to treat each row of a data frame as its own group.

rowwise() is useful for computing summary scores across items for each person. For example, to compute scale scores for each person in the dat_bfi data (here, the mean of each person’s item responses, stored in the *_total columns):

dat_bfi |>
  rowwise() |> 
  mutate(
    .id = .id,
    A_total = mean(c_across(A1:A5), na.rm = TRUE),
    C_total = mean(c_across(C1:C5), na.rm = TRUE),
    E_total = mean(c_across(E1:E5), na.rm = TRUE),
    N_total = mean(c_across(N1:N5), na.rm = TRUE),
    O_total = mean(c_across(O1:O5), na.rm = TRUE),
    .before = everything()
  ) |>
  head()

The c_across() function combines c() and across() into one. Like c(), it creates a vector, à la c(1, 3, 5, 7), but it lets you pick columns with the same selection helpers as select().
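As a small sketch on a made-up toy data frame, each row’s A1 to A3 values are gathered into one vector by c_across() and then averaged. The ungroup() at the end is an addition here: it removes the row-wise grouping so that later steps operate on whole columns again.

```r
library(dplyr)

toy <- tibble(
  id = c("p1", "p2"),
  A1 = c(1, 4),
  A2 = c(3, NA),
  A3 = c(5, 6)
)

scored <- toy |>
  rowwise() |>
  mutate(A_mean = mean(c_across(A1:A3), na.rm = TRUE)) |>
  ungroup()  # drop the row-wise grouping once the scores are computed

scored$A_mean
# 3 5
```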

The .before argument says where to put the new columns that mutate() creates.

everything() means “all the columns I haven’t named yet”, so .before = everything() means put the new columns at the beginning of the data frame.
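A quick sketch of the effect on column order, using a made-up two-column data frame:

```r
library(dplyr)

toy <- tibble(a = 1:2, b = 3:4)

# New column placed ahead of all existing columns.
toy_first <- toy |>
  mutate(total = a + b, .before = everything())

# Default behaviour: new column appended at the end.
toy_last <- toy |>
  mutate(total = a + b)

names(toy_first)
# "total" "a" "b"
names(toy_last)
# "a" "b" "total"
```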