Part 27 Tidy data

A data set is tidy if:

  • Each row is an observation appropriate for the analysis;
  • Each column is a variable;
  • Each cell is a value.

This means that each value belongs to exactly one variable and one observation.

Why bother? Because computing with untidy data can be a nightmare, while the same computations become simple once the data are tidy.

Whether or not a data set is “tidy” depends on the type of analysis you are doing or plot you are making. It depends on how you define your “observation” and “variables” for the current analysis.

As an example, consider this data set derived from datasets::HairEyeColor, which contains the number of people having each combination of hair and eye color.

library(dplyr)

# Sum the counts over Sex, leaving one row per hair-eye color combination
haireye <- as_tibble(HairEyeColor) |>
  count(Hair, Eye, wt = n) |>
  rename(hair = Hair, eye = Eye)

If one observation is identified by a hair-eye color combination, then the tidy dataset is:

haireye |> 
  print()

If one observation is identified by a single person, then the tidy dataset has one pair of values per person, and one row for each person. We can use the handy tidyr::uncount() function, the opposite of dplyr::count():

haireye |> 
  tidyr::uncount(n) |> 
  print()
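
As a quick check that uncount() really undoes count(), the following minimal sketch expands the data to one row per person and then counts the people again. The names people and roundtrip are introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

haireye <- as_tibble(HairEyeColor) |>
  count(Hair, Eye, wt = n) |>
  rename(hair = Hair, eye = Eye)

# One row per person: each hair-eye row is repeated n times.
people <- haireye |>
  uncount(n)

# Counting the people again recovers one row per hair-eye combination.
roundtrip <- people |>
  count(hair, eye)

nrow(people)                   # 592, the total number of people
all(roundtrip$n == haireye$n)  # TRUE
```

Both tibbles hold the same information; which one is tidy depends on whether the observation is a person or a hair-eye color combination.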

27.1 Untidy Examples

The following are examples of untidy data. They’re untidy for either of the cases considered above, but for discussion, let’s take a hair-eye color combination to be one observational unit.

Note that untidy does not always mean “bad”, just inconvenient for the analysis you want to do.

Untidy Example 1: The following table is untidy because there are multiple observations per row. It’s too wide.

Imagine calculating the total number of people with each hair color. You can’t just group_by() and summarize() here!

This sort of table is common when presenting results. It’s easy for humans to read, but hard for computers to work with. Untidy data is usually that way because it was structured for human, not machine, reading.
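
To make the difficulty concrete, here is a sketch of one wide layout that fits the description of Example 1, assuming each eye color gets its own column (this layout, and the names wide_eye, tidy_totals and wide_totals, are assumptions for illustration). It compares the hair color totals computed from the tidy and the wide versions.

```r
library(dplyr)
library(tidyr)

haireye <- as_tibble(HairEyeColor) |>
  count(Hair, Eye, wt = n) |>
  rename(hair = Hair, eye = Eye)

# Assumed wide layout: one row per hair color, one column per eye color.
wide_eye <- haireye |>
  pivot_wider(names_from = eye, values_from = n)

# Tidy data: the usual group_by() and summarize() pattern works.
tidy_totals <- haireye |>
  group_by(hair) |>
  summarize(n = sum(n))

# Wide data: the eye color columns have to be added up by name.
wide_totals <- wide_eye |>
  mutate(n = Blue + Brown + Green + Hazel) |>
  select(hair, n)
```

The wide version gives the same totals, but the code has to name every eye color and breaks as soon as a new one appears in the data.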

Untidy Example 2: The following table is untidy for the same reason as Example 1—multiple observations are contained per row. It’s too wide.
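
A layout matching this description might put each hair color in its own column instead. The sketch below builds such a table under that assumption and then gathers the columns back into tidy form with tidyr::pivot_longer(). The names wide_hair and tidy_again are hypothetical.

```r
library(dplyr)
library(tidyr)

haireye <- as_tibble(HairEyeColor) |>
  count(Hair, Eye, wt = n) |>
  rename(hair = Hair, eye = Eye)

# Assumed wide layout: one row per eye color, one column per hair color.
wide_hair <- haireye |>
  pivot_wider(names_from = hair, values_from = n)

# Gather the hair color columns back into one hair column and one n column.
tidy_again <- wide_hair |>
  pivot_longer(-eye, names_to = "hair", values_to = "n") |>
  select(hair, eye, n) |>
  arrange(hair, eye)
```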

Untidy Example 3: This is untidy because each observational unit is spread across multiple rows. It’s too long.

In fact, we needed to add an identifier for each observation; otherwise we would have lost track of which rows belong to which observation!

Does red hair ever occur with blue eyes? You can’t just filter(hair == "Red", eye == "Blue") here, because hair and eye are no longer separate columns!
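
One way such a table could arise is sketched below, assuming the hair and eye values are stacked into a single column. The column names obs, body_part and color are hypothetical. The sketch then answers the red hair and blue eyes question in both formats; note that the values in HairEyeColor are capitalized.

```r
library(dplyr)
library(tidyr)

haireye <- as_tibble(HairEyeColor) |>
  count(Hair, Eye, wt = n) |>
  rename(hair = Hair, eye = Eye)

# Assumed long layout: two rows per observation, tied together by obs.
long <- haireye |>
  mutate(obs = row_number()) |>
  pivot_longer(c(hair, eye), names_to = "body_part", values_to = "color")

# Tidy data: a single filter() answers the question.
red_blue_tidy <- haireye |>
  filter(hair == "Red", eye == "Blue")

# Long data: both conditions must be checked within each observation.
red_blue_long <- long |>
  group_by(obs) |>
  filter(
    any(body_part == "hair" & color == "Red"),
    any(body_part == "eye" & color == "Blue")
  )
```

Without the obs identifier, the grouped filter would have nothing to group by, which is why the identifier is needed in this format.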

Untidy Example 4: Just when you thought a data set couldn’t get any longer! Now, each variable has its own row: hair color, eye color, and n.

This is the sort of format that is common when pulling data from the web or other “Big Data” sources.
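
A sketch of how this layout could be produced from the tidy data follows, assuming hypothetical column names obs, variable and value. Because a single column can hold only one type, the counts have to be converted to character before they can share a column with the colors.

```r
library(dplyr)
library(tidyr)

haireye <- as_tibble(HairEyeColor) |>
  count(Hair, Eye, wt = n) |>
  rename(hair = Hair, eye = Eye)

# Assumed layout: three rows per observation, one for each variable.
longest <- haireye |>
  mutate(obs = row_number(), n = as.character(n)) |>
  pivot_longer(c(hair, eye, n), names_to = "variable", values_to = "value")

nrow(longest)  # 48: 16 observations times 3 variables
```

The type conversion is one more cost of this format: the counts are now text and must be converted back before any arithmetic.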