Part 7 R Scripts

Usually, we want to save the analyses we run so we can re-run them later or refer back to them. We do that with scripts.

Let’s make an R script. Click File -> New File -> R Script.

A new part of the RStudio window appears. The Script pane. This is where you can write a series of lines of code than you can then execute (send to the Console to run) later.

7.1 Assign an Object

Pick a number, assign it to an object called favorite_number, and print it.

favorite_number <- 42
favorite_number
#> [1] 42

Run these lines of code by placing your cursor on the line and either clicking the Run button or typing Ctrl/Cmd + Shift + Enter/Return.

7.2 Types of Objects and Operations

Now let’s explore some basic types of objects and operations in R.

7.2.1 Vectors

Vectors store multiple entries of one data type, like numbers or characters. You’ll discover that they show up just about everywhere in R.

Let’s collect some data and store this in a vector called times.

How many hours did you sleep last night?

Drop your answer in the chat.

Here’s starter code:

times <- c()

The c() function is how we make a vector in R. The “c” stands for “concatenate”.

Operations happen component-wise. Change those times to minutes. How can we “save” the results?

All parts of a vector have the same type. There are many types of variables in R. The most common types are:

  1. numeric (numbers) 1. double (numbers with decimal values; 2, 3.4, 1000) 1. integer (1L, 2L, 100L)
  2. character (words or strings; "a", "foo", "lastname")
  3. logical (TRUE or FALSE)
  4. factor (categorical variable; factor(c("control", "experiment")))

7.2.2 Functions

R comes with many many functions. Functions take one or more inputs and return one or more outputs. You can think of functions as prewritten R programs.

Functions all take the general form:

functionName(arg1 = val1, arg2 = val2, and so on)

To call a function, type its name, then parentheses (). Inside the (), type the arguments and values to use as input.

What’s the average sleep time? Let’s compute that using the mean() function.

mean(times)
#> [1] 6.95

To learn how a function works, we can look at its help file. Open the help file for mean() by typing ?mean or help(mean) in the console.

The help file will includes the following:

  1. A brief description of the function
  2. A list of the arguments and how to call the function
  3. A detailed description of the arguments
  4. (Optionally) Other usage details
  5. Coded examples

We can see that mean() has 4 arguments:

  1. x: A vector to compute the mean of
  2. trim: the fraction (0 to 0.5) of observations to be trim from each end
  3. na.rm: TRUE or FALSE–should missing values be removed before computing?
  4. ...: Other arguments. More on that later.

Default values for arugments are given in the Usage section by =. If an argument has no default (like x in mean()), it usually means it’s required.

Let’s compute the trimmed mean of times, dropping 10% of values from each end.

You can either enter arguments in order:

mean(times, .1)
#> [1] 7

Or by name:

mean(times, trim = .1)
#> [1] 7

It’s good practice to name all arguments after the first (or maybe second if its clear).

Try out some other functions, such as sd(), range(), and length().

Much of R is about becoming familiar with R’s “vocabulary”. A nice list can be found in Advanced R - Vocabulary.

7.2.3 Comparisons

How many people slept less than 6 hours? Let’s answer that using comparisons.

We can compare the values of times to another value using <.

times < 6
#>  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
#> [13]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

Comparisons return a vector of logical values.

We can do other logical comparisons:

times > 6
#>  [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
#> [13] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

times == 5
#>  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

times <= 7
#>  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
#> [13]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

times != 2
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE

We can combine multiple comparisons using & (AND), | (OR), and ! (NOT).

(times < 4) | (times > 9)
#>  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Try out these functions that work with logical values.

any(times < 6)
#> [1] TRUE

all(times < 6)
#> [1] FALSE

which(times < 6)
#> [1]  5  8 10 11 13 17 18

7.2.4 Subsetting

Use [] to subset values from a vector. You can subset with an integer (by position) or with logicals.

times[4]
#> [1] 9
times[c(2, 5)]
#> [1] 10  5

times[-6]
#>  [1]  8 10  9  9  5  9  5  9  5  5  9  4  9  6  6  4  4  6  8
times[-c(2, 3)]
#>  [1] 8 9 5 9 9 5 9 5 5 9 4 9 6 6 4 4 6 8

times[4:8]
#> [1] 9 5 9 9 5

times[times >= 7]
#>  [1]  8 10  9  9  9  9  9  9  9  8

Subset times:

  1. Extract the third entry.
  2. Extract everything except the third entry.
  3. Extract the second and fourth entry. The fourth and second entry.
  4. Extract the second through fifth entry.
  5. Extract all entries that are less than 4 hours Why does this work? Logical subsetting!

7.2.5 Modifying a vector

You can change the vector by combining [] with <-.

Let’s say that we learned that the second time was incorrect and we wanted to replace it with missing data. In R, missing data is noted by NA.

Replace the second entry in times with NA.

Now, let’s “cap” all entries that are bigger than 8 at 8 hours.

If this is more than one value, why don’t we need to match the number of values? Recycling! Be careful of recycling!

Let’s compute the mean of these new times:

mean(times)
#> [1] 6.95

What happened? How do we compute the mean of the non-missing values?

7.2.6 Data frames

We usually work with more than one variable at a time. When we do that, we will work with data frames. A data frame is a list of vectors, all of the same length.

R has some data frames “built in”. For example, some car data is attached to the variable name mtcars.

Print mtcars to screen. Notice the tabular format.

Your turn: Finish the exercises of this section:

  1. Use some of these built-in R functions to explore mtcars

    • head()

    • tail()

    • str()

    • nrow()

    • ncol()

    • summary()

    • row.names()

    • names()

  2. What’s the first column name in the mtcars dataset?

  3. Which column number is named "wt"?

With data frames,each column is its own vector. You can extract a vector by name using $. For example, we can extract the cyl column with mtcars$cyl.

You can also extract columns using [[]]. For example, try mtcars[["cyl"]] or mtcars[[2]].

  1. Extract the vector of mpg values. What’s the mean mpg of all cars in the dataset?

7.2.7 R packages

Often, the suite of functions that “come with” R are not enough to do an analysis.

Sometimes, the suite of functions that “come with” R are not very convenient.

In these cases, R packages come to the rescue. These are “add ons”, each coming with their own suite of functions and objects, usually designed to do one type of task. CRAN stores many R packages that. It’s easy to install packages from CRAN using the install.packages() function.

Run the following lines of code to install the tibble and gapminder packages. (But don’t include install.packages() lines in your scripts—it’s not very nice to others!)

install.packages("tibble")
install.packages("palmerpenguins")
  • tibble: a data frame with some useful “bells and whistles”
  • palmerpenguins: a package with the penguins dataset (as a tibble!)

After you install a package, you need to load it to use it. Use the library() function to load a package.

(Note: Do not use the similar function require() to load packages. This has some different, undesirable, behavior for normal usage.)

Run the following lines of code to load the packages. (Put these in your scripts, and near the top.)

library(tibble)
library(palmerpenguins)

You can explore the objects in a package in the Environment pane.

Try the following two approaches to access information about the tibble package. Run the lines one-at-a-time. Vignettes are your friend, but do not always exist.

?tibble
browseVignettes(package = "tibble")

Print out the penguins object to screen. It’s a tibble—how does it differ from a data frame in terms of how it’s printed?