Part 7 R Scripts
Usually, we want to save the analyses we run so we can re-run them later or refer back to them. We do that with scripts.
Let’s make an R script. Click File -> New File -> R Script.
A new part of the RStudio window appears: the Script pane. This is where you can write a series of lines of code that you can then execute (send to the Console to run) later.
7.1 Assign an Object
Pick a number, assign it to an object called favorite_number, and print it.
favorite_number <- 42
favorite_number
#> [1] 42
Run these lines of code by placing your cursor on a line and either clicking the Run button or typing Ctrl/Cmd + Enter/Return. (Ctrl/Cmd + Shift + Enter/Return runs the whole script.)
7.2 Types of Objects and Operations
Now let’s explore some basic types of objects and operations in R.
7.2.1 Vectors
Vectors store multiple entries of one data type, like numbers or characters. You’ll discover that they show up just about everywhere in R.
Let’s collect some data and store this in a vector called times.
How many hours did you sleep last night?
Drop your answer in the chat.
Here’s starter code:
times <- c()
The c() function is how we make a vector in R.
The “c” stands for “concatenate”.
Operations happen component-wise. Change those times to minutes. How can we “save” the results?
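Here is a minimal sketch of component-wise arithmetic. The values in times are an assumption, reconstructed from the outputs printed later in this section (your own class data will differ), and times_minutes is a hypothetical name.

```r
# Assumed data: 20 sleep times in hours, reconstructed from the outputs
# printed elsewhere in this section.
times <- c(8, 10, 9, 9, 5, 9, 9, 5, 9, 5, 5, 9, 4, 9, 6, 6, 4, 4, 6, 8)

# Multiplying by a single number multiplies every entry.
times * 60

# The line above only prints the result. To "save" it, assign it to an object.
times_minutes <- times * 60
times_minutes
```

Without the assignment, the converted values are printed once and then lost.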
All parts of a vector have the same type. There are many types of variables in R. The most common types are:
- numeric (numbers)
  1. double (numbers with decimal values; 2, 3.4, 1000)
  2. integer (1L, 2L, 100L)
- character (words or strings; "a", "foo", "lastname")
- logical (TRUE or FALSE)
- factor (categorical variable; factor(c("control", "experiment")))
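To check these types yourself, class() and typeof() report what a vector holds. This is a small sketch; the object name mixed is hypothetical.

```r
# class() and typeof() report the type of a vector.
class(c(2, 3.4, 1000))                      # "numeric"
typeof(c(2, 3.4, 1000))                     # "double"
typeof(c(1L, 2L, 100L))                     # "integer"
class(c("a", "foo", "lastname"))            # "character"
class(c(TRUE, FALSE))                       # "logical"
class(factor(c("control", "experiment")))   # "factor"

# Because all parts of a vector share one type, mixing types
# converts every entry to a single common type.
mixed <- c(1, "a", TRUE)
class(mixed)                                # "character"
```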
7.2.2 Functions
R comes with many many functions. Functions take one or more inputs and return one or more outputs. You can think of functions as prewritten R programs.
Functions all take the general form:
functionName(arg1 = val1, arg2 = val2, and so on)
To call a function, type its name, then parentheses ().
Inside the (), type the arguments and values to use as input.
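As a quick illustration of that general form, here is a call to the base R function round(), which is not otherwise used in this section. The object name rounded is hypothetical.

```r
# round() takes the number to round (x) and how many decimal places to keep (digits).
rounded <- round(x = 3.14159, digits = 2)
rounded
#> [1] 3.14
```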
What’s the average sleep time?
Let’s compute that using the mean() function.
mean(times)
#> [1] 6.95
To learn how a function works, we can look at its help file.
Open the help file for mean() by typing ?mean or help(mean) in the console.
The help file includes the following:
- A brief description of the function
- A list of the arguments and how to call the function
- A detailed description of the arguments
- (Optionally) Other usage details
- Coded examples
We can see that mean() has 4 arguments:
- x: A vector to compute the mean of
- trim: the fraction (0 to 0.5) of observations to be trimmed from each end
- na.rm: TRUE or FALSE; should missing values be removed before computing?
- ...: Other arguments. More on that later.
Default values for arguments are given in the Usage section by =.
If an argument has no default (like x in mean()), it usually means it’s required.
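You can also inspect the arguments and their defaults from the console rather than the help file. This sketch uses mean.default(), the method that mean() dispatches to for ordinary vectors. The object name defaults is hypothetical.

```r
# Print the argument list, including default values.
args(mean.default)

# formals() returns the arguments as a list, so defaults can be looked up by name.
defaults <- formals(mean.default)
names(defaults)     # "x" "trim" "na.rm" "..."
defaults$trim       # 0: by default nothing is trimmed
defaults$na.rm      # FALSE: by default missing values are kept
```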
Let’s compute the trimmed mean of times, dropping 10% of values from each end.
You can either enter arguments in order:
mean(times, .1)
#> [1] 7
Or by name:
mean(times, trim = .1)
#> [1] 7
It’s good practice to name all arguments after the first (or maybe the second, if it’s clear).
Try out some other functions, such as sd(), range(), and length().
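A sketch of what trying these might look like. The values in times are an assumption, reconstructed from the outputs printed in this section.

```r
# Assumed data, reconstructed from the outputs shown in this section.
times <- c(8, 10, 9, 9, 5, 9, 9, 5, 9, 5, 5, 9, 4, 9, 6, 6, 4, 4, 6, 8)

length(times)   # number of entries: 20
range(times)    # smallest and largest value: 4 and 10
sd(times)       # standard deviation of the sleep times
```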
Much of R is about becoming familiar with R’s “vocabulary”. A nice list can be found in Advanced R - Vocabulary.
7.2.3 Comparisons
How many people slept less than 6 hours? Let’s answer that using comparisons.
We can compare the values of times to another value using <.
times < 6
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
#> [13] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
Comparisons return a vector of logical values.
We can do other logical comparisons:
> 6
times #> [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
#> [13] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
== 5
times #> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
<= 7
times #> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
#> [13] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
!= 2
times #> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE
We can combine multiple comparisons using & (AND), | (OR), and ! (NOT).
(times < 4) | (times > 9)
#> [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Try out these functions that work with logical values.
any(times < 6)
#> [1] TRUE
all(times < 6)
#> [1] FALSE
which(times < 6)
#> [1] 5 8 10 11 13 17 18
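To answer the opening question of how many people slept less than 6 hours, one option is to count the TRUE values with sum(). The sketch below also shows & and !. The values in times are an assumption reconstructed from the outputs above, and n_short and mid_range are hypothetical names.

```r
# Assumed data, reconstructed from the outputs shown in this section.
times <- c(8, 10, 9, 9, 5, 9, 9, 5, 9, 5, 5, 9, 4, 9, 6, 6, 4, 4, 6, 8)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values.
n_short <- sum(times < 6)
n_short                                # 7 people slept less than 6 hours

# & is TRUE only where both comparisons are TRUE.
mid_range <- (times >= 5) & (times <= 8)
sum(mid_range)                         # 9 people slept between 5 and 8 hours

# ! flips every TRUE to FALSE and every FALSE to TRUE.
identical(!(times < 6), times >= 6)    # TRUE here, since there are no missing values
```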
7.2.4 Subsetting
Use [] to subset values from a vector.
You can subset with an integer (by position) or with logicals.
times[4]
#> [1] 9
times[c(2, 5)]
#> [1] 10 5
times[-6]
#> [1] 8 10 9 9 5 9 5 9 5 5 9 4 9 6 6 4 4 6 8
times[-c(2, 3)]
#> [1] 8 9 5 9 9 5 9 5 5 9 4 9 6 6 4 4 6 8
times[4:8]
#> [1] 9 5 9 9 5
times[times >= 7]
#> [1] 8 10 9 9 9 9 9 9 9 8
Subset times:
- Extract the third entry.
- Extract everything except the third entry.
- Extract the second and fourth entries. Then the fourth and second entries.
- Extract the second through fifth entry.
- Extract all entries that are less than 4 hours.
Why does this work? Logical subsetting!
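One possible way to work through these, offered as a sketch rather than the only answer. The values in times are an assumption, reconstructed from the outputs printed in this section.

```r
# Assumed data, reconstructed from the outputs shown in this section.
times <- c(8, 10, 9, 9, 5, 9, 9, 5, 9, 5, 5, 9, 4, 9, 6, 6, 4, 4, 6, 8)

times[3]           # the third entry
times[-3]          # everything except the third entry
times[c(2, 4)]     # the second and fourth entries
times[c(4, 2)]     # the fourth and second entries: index order sets output order
times[2:5]         # the second through fifth entries

# times < 4 is a logical vector; [] keeps only the positions that are TRUE.
# With this assumed data nobody slept less than 4 hours, so the result is empty.
times[times < 4]
```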
7.2.5 Modifying a vector
You can change the vector by combining [] with <-.
Let’s say that we learned that the second time was incorrect and we wanted to replace it with missing data.
In R, missing data is noted by NA.
Replace the second entry in times with NA.
Now, let’s “cap” all entries that are bigger than 8 at 8 hours.
If this is more than one value, why don’t we need to match the number of values? Recycling! Be careful of recycling!
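A sketch of both modifications. The values in times are an assumption, reconstructed from the outputs printed in this section.

```r
# Assumed data, reconstructed from the outputs shown in this section.
times <- c(8, 10, 9, 9, 5, 9, 9, 5, 9, 5, 5, 9, 4, 9, 6, 6, 4, 4, 6, 8)

# Replace the second entry with missing data.
times[2] <- NA

# Cap every entry above 8 at 8. Several positions are selected, but only one
# replacement value is supplied: R recycles the single 8 into all of them.
times[times > 8] <- 8
times
```

Recycling is convenient with a single value, but it can silently hide mistakes when the replacement has more than one value and its length does not match the number of selected positions.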
Let’s compute the mean of these new times:
mean(times)
#> [1] NA
What happened? How do we compute the mean of the non-missing values?
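The na.rm argument of mean(), described in the help file, handles this. A minimal sketch, again assuming the reconstructed times values with the second entry set to NA:

```r
# Assumed data, reconstructed from the outputs shown in this section.
times <- c(8, 10, 9, 9, 5, 9, 9, 5, 9, 5, 5, 9, 4, 9, 6, 6, 4, 4, 6, 8)
times[2] <- NA

# A single missing value makes the mean unknown, so R returns NA.
mean(times)

# na.rm = TRUE drops missing values before computing the mean.
mean(times, na.rm = TRUE)
```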
7.2.6 Data frames
We usually work with more than one variable at a time. When we do that, we will work with data frames. A data frame is a list of vectors, all of the same length.
R has some data frames “built in”.
For example, some car data is attached to the variable name mtcars.
Print mtcars to screen.
Notice the tabular format.
Your turn: Finish the exercises of this section:
- Use some of these built-in R functions to explore mtcars: head(), tail(), str(), nrow(), ncol(), summary(), row.names(), names()
- What’s the first column name in the mtcars dataset?
- Which column number is named "wt"?
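One way to approach these, as a sketch rather than the only answer. It uses only the built-in mtcars data and the functions listed above, plus which() from the Comparisons section.

```r
head(mtcars, 3)                  # first three rows
nrow(mtcars)                     # number of rows: 32
ncol(mtcars)                     # number of columns: 11

# names() returns the column names as a character vector,
# so it can be subset and compared like any other vector.
names(mtcars)[1]                 # "mpg"
which(names(mtcars) == "wt")     # 6
```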
With data frames, each column is its own vector.
You can extract a vector by name using $.
For example, we can extract the cyl column with mtcars$cyl.
You can also extract columns using [[]].
For example, try mtcars[["cyl"]] or mtcars[[2]].
- Extract the vector of mpg values. What’s the mean mpg of all cars in the dataset?
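A short sketch showing that the three extraction styles return the same vector, followed by one way to compute the mean mpg. The object names are hypothetical.

```r
cyl_by_dollar   <- mtcars$cyl
cyl_by_name     <- mtcars[["cyl"]]
cyl_by_position <- mtcars[[2]]      # cyl is the second column

identical(cyl_by_dollar, cyl_by_name)       # TRUE
identical(cyl_by_dollar, cyl_by_position)   # TRUE

# An extracted column is an ordinary vector, so vector functions apply.
mean_mpg <- mean(mtcars$mpg)
mean_mpg
```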
7.2.7 R packages
Often, the suite of functions that “comes with” R is not enough to do an analysis.
Sometimes, that suite of functions is not very convenient.
In these cases, R packages come to the rescue.
These are “add ons”, each coming with its own suite of functions and objects, usually designed to do one type of task.
CRAN stores many R packages.
It’s easy to install packages from CRAN using the install.packages() function.
Run the following lines of code to install the tibble and palmerpenguins packages.
(But don’t include install.packages() lines in your scripts; it’s not very nice to others!)
install.packages("tibble")
install.packages("palmerpenguins")
- tibble: a data frame with some useful “bells and whistles”
- palmerpenguins: a package with the penguins dataset (as a tibble!)
After you install a package, you need to load it to use it.
Use the library() function to load a package.
(Note: Do not use the similar function require() to load packages.
This has some different, undesirable behavior for normal usage.)
Run the following lines of code to load the packages. (Put these in your scripts, and near the top.)
library(tibble)
library(palmerpenguins)
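To see one difference between library() and require(), the sketch below tries to load a package that does not exist. The package name "notARealPackage123" is made up and assumed not to be installed. The object names are hypothetical.

```r
# Made-up package name, assumed not to be installed.
pkg <- "notARealPackage123"

# require() only warns and returns FALSE, so a script keeps running
# and fails later, far from the real cause.
require_result <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)
require_result

# library() stops immediately with an error, which is easier to diagnose.
# tryCatch() is used here only so the example can continue past the error.
library_result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
library_result
```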
You can explore the objects in a package in the Environment pane.
Try the following two approaches to access information about the tibble package.
Run the lines one-at-a-time.
Vignettes are your friend, but do not always exist.
?tibble
browseVignettes(package = "tibble")
Print out the penguins object to screen.
It’s a tibble. How does it differ from a data frame in terms of how it’s printed?