Part 23 Advanced dplyr functions
23.1 recode()
recode() is useful for recoding categorical variables.
Unlike most of the other functions in dplyr, recode() is backwards in its syntax, with the old value on the left and the new value on the right:
recode(.x, old = new)
Let's take a look at recoding different variables using the psychTools::bfi dataset:
In the dataset, our gender variable has values 1 and 2.
This is a little vague, since nothing in the data itself tells us which gender 1 or 2 stands for.
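library(tidyverse)  # provides dplyr, tidyr, and tibble

# Keep the row names as an explicit .id column
dat_bfi <- psychTools::bfi |>
  rownames_to_column(var = ".id")

dat_bfi |>
  mutate(
    gender = recode(gender, "1" = "man", "2" = "woman")
  ) |>
  select(.id, gender, education) |>
  head()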
Note that numeric values on the left side of = must be wrapped in quotes or backticks; that's not necessary for character values that are valid R names.
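To make the quoting rule concrete, here is a minimal sketch on two small made-up vectors (gender_num and gender_chr are hypothetical and are not taken from psychTools::bfi). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy vectors for illustration only
gender_num <- c(1, 2, 2, NA)
gender_chr <- c("m", "f", "f", NA)

# Numeric old values: the names on the left of = must be quoted (or backticked)
recoded_num <- recode(gender_num, "1" = "man", "2" = "woman")

# Character old values that are valid R names can be written bare
recoded_chr <- recode(gender_chr, m = "man", f = "woman")

recoded_num
recoded_chr
```

Both calls should give the same character vector, with the missing value left as NA.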
We can also specify a .default value within our recode().
For example, say we want to have just “HS or less” versus “more than HS”:
dat_bfi |>
  mutate(
    education = recode(education, "1" = "HS", "2" = "HS", .default = "More than HS")
  ) |>
  select(.id, gender, education) |>
  head()
Another neat feature of the recode() function is the .missing argument.
If we would rather convert NA values to something more explicit, we can specify the replacement in .missing.
dat_bfi |>
  mutate(
    education = recode(
      education,
      "1" = "HS",
      "2" = "HS",
      .default = "More than HS",
      .missing = "(Unknown)"
    )
  ) |>
  select(.id, gender, education) |>
  head()
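Or we can use tidyr::replace_na(). Recent versions of tidyr require the replacement to have the same type as the column, so we convert the integer education column to character first:
dat_bfi |>
  mutate(
    # education is stored as an integer, so convert before inserting a string
    education = replace_na(as.character(education), replace = "(Unknown)")
  ) |>
  select(.id, gender, education) |>
  head()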
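The sketch below shows, on a small made-up vector, how .default and .missing divide up the values, and how replace_na() behaves once the data are already character. The vector education_toy and the result names are hypothetical; dplyr and tidyr are assumed to be installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical education codes, including one missing value
education_toy <- c(1L, 2L, 3L, 5L, NA)

# Listed values get their label, other non-missing values get .default,
# and NA gets .missing
recoded <- recode(
  education_toy,
  "1" = "HS",
  "2" = "HS",
  .default = "More than HS",
  .missing = "(Unknown)"
)

# replace_na() only fills in NA and leaves the other values untouched
filled <- replace_na(as.character(education_toy), "(Unknown)")

recoded
filled
```

The design difference is that recode() relabels every value in one call, while replace_na() touches only the missing ones.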
23.2 across()
The across() function allows us to apply transformations across multiple columns.
Say we wanted to look at the mean of each agreeableness item (A1 to A5) between gender groups:
dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      # Passing extra arguments such as na.rm through across() is
      # deprecated in dplyr 1.1.0 and later
      mean, na.rm = TRUE
    )
  )
If we want to put the function name mean together with all of its arguments, we can write it as an anonymous function:
dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      \(x) mean(x, na.rm = TRUE)
    )
  )
What if we wanted to include the standard deviation as well? We can pass a list of functions into across():
dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        sd = \(x) sd(x, na.rm = TRUE)
      )
    )
  )
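When a named list of functions is passed to across(), the output columns are named from the column and the function name. Here is a minimal sketch on a made-up tibble (toy and toy_summary are hypothetical names), assuming dplyr is installed and R 4.1 or later for the \(x) syntax.

```r
library(dplyr)

# Hypothetical data: two items, two groups, one missing response
toy <- tibble(
  gender = c("man", "man", "woman", "woman"),
  A1 = c(1, 3, 2, NA),
  A2 = c(4, 6, 5, 5)
)

toy_summary <- toy |>
  group_by(gender) |>
  summarize(
    across(
      A1:A2,
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        sd = \(x) sd(x, na.rm = TRUE)
      )
    )
  )

# One column per item and function, e.g. A1_mean and A1_sd
names(toy_summary)
```

The names of the list (mean, sd) become the suffixes, so a descriptive list name is worth the extra typing.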
23.3 Complex recoding plus across()
Sometimes our scales include items that are reverse scored. We can reverse a single item either by recoding each value or with a bit of arithmetic:
dat_bfi |>
  mutate(
    A1r = recode(
      A1, "6" = 1, "5" = 2, "4" = 3, "3" = 4, "2" = 5, "1" = 6
    )
  ) |>
  select(A1, A1r) |>
  head()

# or, assuming the observed minimum and maximum are the scale endpoints (1 and 6)
dat_bfi |>
  mutate(A1r = max(A1, na.rm = TRUE) - A1 + min(A1, na.rm = TRUE)) |>
  select(A1, A1r) |>
  head()
However, we can write some more complex code that reverse-codes every such item in one fell swoop!
We start by either specifying the columns that need reverse coding or getting them from a data dictionary:
reversed <- c("A1", "C4", "C5", "E1", "E2", "O2", "O5")

# or
dict <- psychTools::bfi.dictionary |>
  as_tibble(rownames = "item")

reversed <- dict |>
  filter(Keying == -1) |>
  pull(item)
Putting it all together:
dat_bfi |>
  mutate(across(
    all_of(reversed),
    \(x) recode(x, "6" = 1, "5" = 2, "4" = 3, "3" = 4, "2" = 5, "1" = 6),
    .names = "{.col}r"
  )) |>
  head()
The .names argument tells across() how to name the new columns.
If you omit .names, the columns will be modified in place.
In .names, the {.col} bit means “the column name”, and any text around that (here the letter r) is added to the name.
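As a small illustration of how .names builds the new column names, the sketch below reverse-scores two items in a made-up tibble. It uses arithmetic (7 minus the score, which assumes a 1 to 6 response scale) rather than recode(); toy, reversed_toy, and toy_reversed are hypothetical names, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical responses on an assumed 1-6 scale
toy <- tibble(
  A1 = c(1, 6, 3),
  C4 = c(2, 5, NA),
  E3 = c(4, 4, 4)
)

# Hypothetical list of reverse-keyed items
reversed_toy <- c("A1", "C4")

toy_reversed <- toy |>
  mutate(across(
    all_of(reversed_toy),
    \(x) 7 - x,           # 1 <-> 6, 2 <-> 5, 3 <-> 4; NA stays NA
    .names = "{.col}r"    # A1 -> A1r, C4 -> C4r; originals are kept
  ))

names(toy_reversed)
```

Because .names is supplied, the original A1 and C4 columns remain next to their reversed copies; dropping .names would overwrite them instead.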
23.4 rowwise()
rowwise() is a special group_by().
It tells R to treat each row of a data frame as its own group.
rowwise() is useful for computing summary scores across items for each person.
For example, to compute scale scores (here, the mean of each scale's items) for each person in the dat_bfi data:
dat_bfi |>
  rowwise() |>
  mutate(
    .id = .id,
    A_total = mean(c_across(A1:A5), na.rm = TRUE),
    C_total = mean(c_across(C1:C5), na.rm = TRUE),
    E_total = mean(c_across(E1:E5), na.rm = TRUE),
    N_total = mean(c_across(N1:N5), na.rm = TRUE),
    O_total = mean(c_across(O1:O5), na.rm = TRUE),
    .before = everything()
  ) |>
  head()
The c_across() function combines c() and across() into one.
Like c(), it creates a vector such as c(1, 3, 5, 7), but you can use the same options for selecting column names as in select().
The .before argument says where to put the new columns you mutate().
everything() selects all the existing columns, so .before = everything() means put the new columns at the beginning of the data frame.
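To see rowwise(), c_across(), and .before working together at a glance, here is a minimal sketch on a made-up two-person tibble (toy and toy_scored are hypothetical names). It assumes dplyr is installed, and it adds an ungroup() at the end so that later steps are no longer computed row by row.

```r
library(dplyr)

# Hypothetical responses for two people on three items
toy <- tibble(
  .id = c("p1", "p2"),
  A1 = c(1, 4),
  A2 = c(3, NA),
  A3 = c(5, 6)
)

toy_scored <- toy |>
  rowwise() |>
  mutate(
    # c_across() gathers A1, A2, and A3 of the current row into one vector
    A_total = mean(c_across(A1:A3), na.rm = TRUE),
    .before = everything()
  ) |>
  ungroup()

toy_scored
```

For p2 the missing A2 is dropped by na.rm = TRUE, so the score is the mean of the two remaining items.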