Part 40 Reading data from disk

The same CSV file that we just saved to disk can be read back into R by giving read_csv() the path where it is stored:

dat <- read_csv(here::here("participation", "data", "gap_asia_2007.csv"))
dat

Notice that the imported tibble prints the same as the original. read_csv() guessed the column types correctly here. This won’t always be true, so it’s worth checking! In particular, be on the lookout for columns that you expected to be numeric or dates but that were imported as col_character().
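One way to do that check is to inspect the column specification after import. The sketch below uses a small hypothetical file written to a temporary location (it is not part of the course data), in which a stray "n/a" entry causes a numeric column to be guessed as character:

```r
library(readr)

# Hypothetical toy file, written to a temporary location for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score,year", "a,1.5,1990", "b,n/a,2007"), tmp)

toy <- read_csv(tmp)

# spec() reports the column types that read_csv() ended up using
spec(toy)

# "n/a" is not a recognised missing value by default,
# so the whole score column is read as character
class(toy$score)
class(toy$year)
```

If a column you expected to be numeric shows up as col_character() in the spec, that usually means at least one cell contains text that needs handling on import.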

The read_csv() function has many additional options including the ability to specify column types (e.g., is “1990” a year or a number?), skip columns, skip rows, rename columns on import, trim whitespace, and more.
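As a minimal sketch of two of these options, the example below skips rows and renames columns on import. The file contents and the new column names are made up for illustration; only the skip and col_names arguments come from read_csv() itself:

```r
library(readr)

# Hypothetical export with a note line above the real header
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Exported by a hypothetical survey tool",
    "Country Name,Yr,Pop",
    "A,2007,100",
    "B,2007,200"
  ),
  tmp
)

toy <- read_csv(
  tmp,
  skip = 2,                                # skip the note line and the original header
  col_names = c("country", "year", "pop")  # supply our own column names instead
)
toy
```

Because the original header row is skipped along with the note, the names given in col_names replace it entirely.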

To control the column types, use the cols() function:

dat <- read_csv(
  here::here("participation", "data", "gap_asia_2007.csv"),
  col_types = cols(
    country = col_factor(),
    continent = col_factor(),
    year = col_date(format = "%Y"),
    .default = col_double() # all other columns as numeric (double)
  )
)
dat

By default, read_csv() treats every column as col_guess(), but it’s better to be explicit.

Another important option to set is the na argument, which specifies which values to treat as NA on import. By default, read_csv() treats blank cells (i.e., "") and cells containing "NA" as missing. You might need to change this (e.g., if missing values are entered as -999). Note that readxl::read_excel() uses na = "" by default, so cells containing "NA" are not treated as missing there!

dat <- read_csv(
  here::here("participation", "data", "gap_asia_2007.csv"),
  col_types = cols(
    country = col_factor(),
    continent = col_factor(),
    year = col_date(format = "%Y"),
    .default = col_double() # all other columns as numeric (double)
  ),
  na = c("", "NA", "-999", "No response") # na takes strings, so quote numeric codes
)
dat
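To see the effect of na in isolation, here is a small sketch on a hypothetical toy file (not the gapminder data) that contains a -999 code, a "No response" entry, and a blank cell:

```r
library(readr)

# Hypothetical toy file with three different ways of recording a missing value
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "a,12", "b,-999", "c,No response", "d,"), tmp)

toy <- read_csv(
  tmp,
  col_types = cols(id = col_character(), score = col_double()),
  na = c("", "NA", "-999", "No response")
)

# Only the first score survives; the other three are imported as NA
toy$score
```

Without the na argument, "No response" could not be parsed as a double and -999 would be kept as a real number, which would quietly distort any summary of the column.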