Chapter 1 Basics of R
Before we can dive into the actual work with data you should be familiar with the basics of the R programming language. We will not spend too much time here as you have already seen many of the concepts in our Python sequence and they will only look slightly different in R.
1.1 The Workspace
1.1.1 Working directory
To find out which folder we are working in, you can use the getwd function.
getwd()
## [1] "/Users/runner/work/pp4rs-rstats/pp4rs-rstats"
If you want to set a specific working directory, you can do this using the setwd function. We will not follow this approach, as there is a better one by now.
#setwd("/Users/jlanger/Dropbox/uzh_programming/r_programming")
Instead of setting the working directory explicitly, we create a new R project. To do so, we click on File > New Project and create a new project folder in a directory of our choice.
1.1.2 R projects
This new folder now includes an .Rproj file. If we click on this file, a new RStudio session is started, our environment/workspace is cleaned, and our working directory is automatically set to the project folder.
Notice also that our command history and all open editor windows for the project are restored once we open it.
1.1.3 Removing and adding objects from workspace
Let’s create two variables.
x = 1
y = 2
To see which objects are available in our workspace, we can use the ls function.
ls()
## [1] "x" "y"
If you want to remove an object, use the rm function.
rm(y)
ls()
## [1] "x"
If you want to remove all objects to start with a clean slate, type the following.
rm(list = ls())
ls()
## character(0)
1.1.4 Installing and loading packages
Additional functionalities are available in R packages. To install these you have to use the install.packages function. The package name has to be passed as a string.
# you do need the repos argument!
# install.packages("purrr", repos = "http://cran.us.r-project.org")
To then load a package, we use the library function. Here, we will load a package with the curious name purrr that we’ll use later.
library(purrr)
1.2 Vectors
Vectors provide the basic organization of data in R. There are basically two types of vectors:
- atomic vectors – logical, numeric, character – which are homogeneous,
- lists, which can be heterogeneous and can contain other lists.
Let’s start with the atomic vectors.
1.2.1 Atomic vector types
1.2.1.1 Logical
Logical vectors (or booleans) can have three different values in R:
TRUE
## [1] TRUE
FALSE
## [1] FALSE
NA
## [1] NA
The first two you already know from Python. What about the third one? It is R’s way of saying that something is not available. For example, assume you have missing data in your dataframe and you conduct a logical comparison such as whether a value is bigger than 3. If a value is missing, then the resulting comparison for that row is neither TRUE nor FALSE but NA.
By the way, to see the type of a vector, you can use the typeof function.
typeof(TRUE)
## [1] "logical"
1.2.1.2 Numeric
Next in line are the numeric vectors. These come in two flavors: integers and doubles.
typeof(2.0)
## [1] "double"
typeof(2)
## [1] "double"
As you can see, R stores every number as a double by default. A double takes up more memory than an integer (8 bytes instead of 4 per value), so if you are sure you only need integers you can append an ‘L’ to the number to create an integer value.
typeof(2L)
## [1] "integer"
Finally, note that while integers only have one special value, NA, doubles also have the special values NaN (not a number), Inf and -Inf. You can test for missing or infinite values by using the following functions:
is.finite()
is.infinite()
is.na()
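To see how these three functions differ, here is a small sketch. The vector `values` is made up for illustration and mixes an ordinary number with the special values a double can take.

```r
# A double vector mixing an ordinary number with special values (illustrative data)
values <- c(1, NA, NaN, Inf, -Inf)

finite_flags   <- is.finite(values)    # TRUE only for ordinary numbers
infinite_flags <- is.infinite(values)  # TRUE only for Inf and -Inf
missing_flags  <- is.na(values)        # TRUE for NA and also for NaN

print(finite_flags)
print(infinite_flags)
print(missing_flags)
```

Note that is.finite is not simply the opposite of is.infinite: a missing value is neither finite nor infinite.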
1.2.1.3 Character
Apart from the logical and numeric vectors there exist character vectors which allow you to store strings. These are created by using single or double quotes.
hal = "I'm sorry, Dave. I'm afraid I can't do that."
hal
## [1] "I'm sorry, Dave. I'm afraid I can't do that."
hal2 = 'I\'m sorry, Dave. I\'m afraid I can\'t do that.'
hal2
## [1] "I'm sorry, Dave. I'm afraid I can't do that."
1.2.2 Vector coercion
Just like Python, R features implicit as well as explicit coercion. You can do explicit coercion by using the following functions:
as.logical()
as.integer()
as.double()
as.character()
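A short sketch of how these functions behave may help. The input values are arbitrary, and the name `not_a_number` is only used here for illustration.

```r
# Explicit coercion between atomic types
as.integer("42")        # character -> integer
as.double(3L)           # integer -> double
as.logical(c(0, 1, 2))  # numeric -> logical: 0 is FALSE, everything else is TRUE
as.character(TRUE)      # logical -> character

# Values that cannot be converted become NA (R also issues a warning,
# which is suppressed here to keep the output clean)
not_a_number <- suppressWarnings(as.integer("abc"))
is.na(not_a_number)
```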
I will now give you some examples of implicit coercion:
If you pass a logical vector to a function that expects numeric vectors, it converts FALSE to 0 and TRUE to 1.
TRUE + FALSE
## [1] 1
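This conversion is handy for counting. The following sketch uses made-up exam scores and an arbitrary pass mark of 50 to count how many elements meet a condition and what share of the total they represent.

```r
# Illustrative data: exam scores with a pass mark of 50
scores <- c(35, 72, 50, 91, 48)
passed <- scores >= 50  # logical vector

sum(passed)   # number of TRUE values
mean(passed)  # share of TRUE values
```

Here sum() treats each TRUE as 1, so it counts the passes, and mean() gives their proportion.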
You can also go implicitly from numerical to logical vectors. People sometimes use numeric vectors for logical conditions:
x = 0
y = 2

if (x) {
  print('Hello')
}
if (y) {
  print('World')
}
## [1] "World"
As you can see here, 0 gets converted into FALSE while every other value gets converted to TRUE.
To check whether a vector is of a specific type, use one of the following functions from the purrr package:
is_logical()
is_integer()
is_double()
is_numeric()
is_character()
is_atomic()
is_list()
is_vector()
1.2.3 Vectors with multiple values
So far we have only spoken of vectors with length 1. However, vectors can store several elements of the same type. You can create such vectors using the c() function. Let’s check out strings first.
my_strings = c('This', 'is', 'a', 'vector.')
my_strings
## [1] "This" "is" "a" "vector."
length(my_strings)
## [1] 4
# strings do not get automatically combined if you use the print function
print(my_strings)
## [1] "This" "is" "a" "vector."
# you need to combine the strings first with a function such as paste()
print(paste(my_strings, collapse = ' '))
## [1] "This is a vector."
The numeric vectors come next.
my_numbers_1 = c(1, 2, 3, 4)
my_numbers_1
## [1] 1 2 3 4
# a way of creating a numeric sequence
my_numbers_2 = 1:4
my_numbers_2
## [1] 1 2 3 4
# another way with seq()
my_numbers_3 = seq(1, 10, 2)
my_numbers_3
## [1] 1 3 5 7 9
If you create a vector with values of different types, R will automatically fall back to the most comprehensive type.
c(TRUE, 1)
## [1] 1 1
c(TRUE, 1, 'Hello')
## [1] "TRUE" "1" "Hello"
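To make the order of this fallback visible, the following sketch asks typeof for the type of a few mixed vectors. The values themselves are arbitrary.

```r
# Mixing types: R picks the most general type that can hold all values
typeof(c(TRUE, 2L))     # logical + integer  -> "integer"
typeof(c(2L, 3.5))      # integer + double   -> "double"
typeof(c(3.5, "text"))  # double + character -> "character"
```

So the hierarchy runs from logical over integer and double to character.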
To check the length of a vector, use the length function. Note that when you apply it to a character vector it still gives you the number of elements, not the number of characters.
length(c('Hallo'))
## [1] 1
length(c(1, 2, 3))
## [1] 3
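If you do want the number of characters, base R offers the nchar function. A minimal sketch with a made-up vector `greetings`:

```r
greetings <- c('Hallo', 'Hi')

length(greetings)  # number of elements in the vector
nchar(greetings)   # number of characters in each element
```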
1.2.4 Operating with vectors and recycling
Notice that you can do the usual arithmetic operations with vectors. Addition, multiplication etc. with scalars works as you would probably expect.
c(1, 2, 3) * 2
## [1] 2 4 6
c(1, 2, 3) + 2
## [1] 3 4 5
If you operate with two vectors of the same length, the operator is applied element-wise.
c(1, 2, 3) + c(1, 2, 3)
## [1] 2 4 6
c(1, 2, 3) * c(1, 2, 3)
## [1] 1 4 9
What happens though if you multiply two vectors of different length?
c(1, 2, 3, 4) * c(1, 2)
## [1] 1 4 3 8
Since the second vector is shorter than the first one, it gets recycled, i.e. it gets copied until it has the same length as the first one.
c(1, 2, 3, 4) * c(1, 2, 1, 2)
## [1] 1 4 3 8
Note that R does not warn you about this as long as the length of the longer vector is a multiple of the length of the shorter one. Otherwise, R still recycles but issues a warning.
c(1, 2, 3, 4) * c(1, 2, 3)
## Warning in c(1, 2, 3, 4) * c(1, 2, 3): longer object length is not a multiple of
## shorter object length
## [1] 1 4 9 4
1.2.5 Names
If you want to name the elements of a vector, you can do so using base R or purrr.
# name elements during creation, note that you do not need quotes
my_named_vector = c(a = 1, b = 2, c = 3, d = 4)
my_named_vector
## a b c d
## 1 2 3 4
# use purrr to set the names after creation
my_named_vector = set_names(1:4, c('a', 'b', 'c', 'd'))
my_named_vector
## a b c d
## 1 2 3 4
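Names can also be set after creation with base R alone. The sketch below uses the names function for this; the variable `plain_vector` is only introduced here for illustration.

```r
# set names after creation with base R
my_named_vector <- 1:4
names(my_named_vector) <- c('a', 'b', 'c', 'd')
names(my_named_vector)

# remove the names again
plain_vector <- unname(my_named_vector)
names(plain_vector)
```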
1.2.6 Subsetting
There are basically three ways of subsetting a vector with the subsetting function [].
- You can pass a numeric vector containing the indices of the elements you are interested in.
a = c('A', 'B', 'C', 'D')
a[c(1, 4)]
## [1] "A" "D"
# negative indices for dropping elements
a[c(-1)]
## [1] "B" "C" "D"
# duplicate elements by using the same index again
a[c(1, 1, 1)]
## [1] "A" "A" "A"
- You can pass a boolean vector to select the elements which meet a certain criterion.
a = 1:10
a < 5
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
a[a < 5]
## [1] 1 2 3 4
- You can pass the names of elements as a character vector if you work with a named vector.
a = c(a = 1, b = 2, c = 3, d = 4)
a[c('b', 'd')]
## b d
## 2 4
1.2.7 Lists
Lists differ from atomic vectors in that they can be heterogeneous, i.e. store vectors or elements of different types, and are recursive in that they can contain other lists.
1.2.7.1 Defining lists
To define a list, use the list function. To analyze its structure, you can apply the str function to it.
# list with 4 scalar elements
my_list = list(1, 2, 3, 4)
str(my_list)
## List of 4
## $ : num 1
## $ : num 2
## $ : num 3
## $ : num 4
# list with 1 element, a numeric vector of length 4
my_list = list(c(1, 2, 3, 4))
str(my_list)
## List of 1
## $ : num [1:4] 1 2 3 4
# list with a boolean, a number, and a string element
my_list = list(1, TRUE, 'Hello, hello')
str(my_list)
## List of 3
## $ : num 1
## $ : logi TRUE
## $ : chr "Hello, hello"
# list of two lists
my_list = list(list(1, 2), list('a', 'b'))
str(my_list)
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ :List of 2
## ..$ : chr "a"
## ..$ : chr "b"
1.2.7.2 Subsetting lists
Before we learn about subsetting, let’s create a fancy list which features sublists, names and heterogeneity.
fancy_list = list(a = 1:12, b = 'Pancakes are lovely, dear!', c = TRUE, d = list(-99, 1))
There are two subsetting functions which you can apply to lists, [] and [[]]:
- To get a sublist: If you use [] with lists, you get a sublist. For example, if we enter ‘4’ for the last element in our list, we get a list in return which contains the list named ‘d’.
str(fancy_list[4])
## List of 1
## $ d:List of 2
## ..$ : num -99
## ..$ : num 1
- To get the element itself: If you use [[]], you get the element itself. For example, if we enter ‘4’ for the last element in our list, we get the list itself.
str(fancy_list[[4]])
## List of 2
## $ : num -99
## $ : num 1
If elements are named, you can also get an element by using a shorthand involving the dollar sign.
str(fancy_list$d)
## List of 2
## $ : num -99
## $ : num 1
You can do more complicated subsetting operations by combining [] and [[]]. Be very careful though.
# returns the last element in a list
str(fancy_list[4])
## List of 1
## $ d:List of 2
## ..$ : num -99
## ..$ : num 1
# returns the last element as itself
str(fancy_list[[4]])
## List of 2
## $ : num -99
## $ : num 1
# returns the first element of the last element as a list
str(fancy_list[[4]][1])
## List of 1
## $ : num -99
# returns the first element of the last element as itself
str(fancy_list[[4]][[1]])
## num -99
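The same chaining works with names instead of positions. The sketch below recreates the fancy list so that it runs on its own; the result names (`first_three`, `second_of_d`, `sub_list`) are only introduced for illustration.

```r
fancy_list <- list(a = 1:12, b = 'Pancakes are lovely, dear!', c = TRUE, d = list(-99, 1))

# element "a" itself, then its first three values
first_three <- fancy_list[["a"]][1:3]

# sublist "d" via the dollar shorthand, then its second element itself
second_of_d <- fancy_list$d[[2]]

# single brackets with names return a list that keeps only "b" and "c"
sub_list <- fancy_list[c("b", "c")]

str(first_three)
str(second_of_d)
str(sub_list)
```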
1.3 Checking cases with conditionals
Well, we know this stuff from Python already, but all of it looks slightly different in R, so let’s dive right into it.
1.3.1 if - else
An if-else structure is useful if you want to execute different code blocks depending on whether a certain statement is evaluated as TRUE or FALSE:
recession = TRUE
if (recession) {
  print("Booh!")
} else {
  print("Yay!")
}
## [1] "Booh!"
In our case, the statement is very simple: it is just the value of our variable recession, which is TRUE. In this case, the code block is executed and R prints out ‘Booh!’ to the display. If the statement had been evaluated as FALSE, the code within the else-block would have been executed instead. Notice the difference to the Python code we discussed:
- Statements have to be surrounded by round brackets.
- We have to use curly brackets to delimit our code blocks. Python uses a colon plus indentation for this.
1.3.2 if - else if - else
If we want to check more than one case, we can use the if - else if - else structure.
color = 'violet'
if (color == 'red') {
  print('It is a tomato!')
} else if (color == 'yellow') {
  print('It is a yellow pepper!')
} else if (color == 'violet') {
  print('It is an onion!')
} else {
  print('No idea what this is!')
}
## [1] "It is an onion!"
R checks each of the provided statements and executes the block of the first statement that is evaluated as true.
1.3.3 Checking multiple cases with switch
If you want to pass options or have a lot of conditions to check, you can use the switch function.
x = 1
y = 2
operation = 'plus'

switch(operation,
plus = x + y,
minus = x - y,
times = x * y,
divide = x / y,
stop('You specified an unknown operation!')
)
## [1] 3
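To show what happens with the unnamed last argument, here is a sketch that wraps the same switch call in a function. The function name `calculate` is hypothetical and only serves this illustration.

```r
# Hypothetical helper that wraps the switch call
calculate <- function(x, y, operation) {
  switch(operation,
    plus   = x + y,
    minus  = x - y,
    times  = x * y,
    divide = x / y,
    stop('You specified an unknown operation!')  # unnamed fallback
  )
}

calculate(1, 2, 'times')

# An unknown operation reaches the fallback and raises an error,
# which is caught here so that the script keeps running
result <- tryCatch(
  calculate(1, 2, 'power'),
  error = function(e) conditionMessage(e)
)
result
```

The unnamed argument acts as the default branch: it is evaluated only if none of the named options matches.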
1.3.4 Comparison and logical operators
We have already used the == operator which checks for equality between two values. You can also use the identical function, which makes sure that only one truth-value is returned. If more than one value is returned, say from a comparison of vectors with ==, R versions before 4.2.0 check only the first truth-value and issue a warning, as in the output below; since R 4.2.0 such a condition is an error.
A = c(1, 2, 3, 4)
B = c(1, 3, 4, 5)

if (A == B) {
  print('They are equal!')
} else {
  print('They are not equal!')
}
## Warning in if (A == B) {: the condition has length > 1 and only the first
## element will be used
## [1] "They are equal!"
if (identical(A, B)) {
  print('They are equal!')
} else {
  print('They are not equal!')
}
## [1] "They are not equal!"
A disadvantage of identical is that you have to be very specific regarding types:
identical(0L, 0)
## [1] FALSE
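If you only care about values and not about types, one option is to reduce an element-wise comparison to a single truth-value with all or any. The following is a minimal sketch of that idea, not the only way to do it.

```r
A <- c(1, 2, 3, 4)
B <- c(1, 3, 4, 5)

all(A == B)  # TRUE only if every pair of elements is equal
any(A == B)  # TRUE if at least one pair of elements is equal

# == compares values after coercion, identical() also compares types
0L == 0
identical(0L, 0)
identical(0L, as.integer(0))
```

Both all and any return exactly one logical value, so they are safe to use inside an if statement.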
You also have the following other operators for logical comparisons:
- != : not equal,
- < : smaller than,
- <= : smaller than or equal,
- > : bigger than,
- >= : bigger than or equal,
- ! : not,
- && : logical ‘and’,
- || : logical ‘or’,
- is.logical etc.
Finally, note that the use of doubles in logical comparisons can be dangerous.
1 - 1/3 - 1/3 - 1/3 == 0
## [1] FALSE
R exhibits this behavior because most fractions cannot be represented exactly as floating point numbers, so small approximation errors accumulate.
1 - 1/3 - 1/3 - 1/3
## [1] 1.110223e-16
You can use the near function from the dplyr package to account for these cases.
dplyr::near(1 - 1/3 - 1/3 - 1/3, 0)
## [1] TRUE
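If you do not want to load dplyr, base R can do a similar tolerance-based comparison. The sketch below shows two options: the base function all.equal, and a hand-written helper `is_close`, which is a hypothetical name with an arbitrarily chosen default tolerance.

```r
difference <- 1 - 1/3 - 1/3 - 1/3

# all.equal() returns TRUE or a description of the difference,
# so wrap it in isTRUE() when you need a single logical value
isTRUE(all.equal(difference, 0))

# Hand-written tolerance check; the default tolerance is an arbitrary choice
is_close <- function(x, y, tol = 1e-8) {
  abs(x - y) < tol
}
is_close(difference, 0)
```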
1.4 Functions
You can of course also write functions in R. We will start by learning how to define them.
1.4.1 Function definitions
To see how to define a function, let’s just write one.
calc_percent_missing = function(x){
  mean(is.na(x))
}
calc_percent_missing(c(1, 2, 6, 3, 7, NA, 9, NA, NA, 1))
## [1] 0.3
You can see that we need three things to define a function:
- a function name, in this case it’s calc_percent_missing,
- function arguments, in this case it’s just one, the vector x,
- the function body which is enclosed by curly brackets.
Note that the last statement that is evaluated in the function body is automatically taken as the return value of the function.
1.4.2 Function arguments and default values
Function arguments in R can usually be broadly divided into two categories:
- data: either a dataframe or a vector,
- details: parameters which govern the computation.
For the latter you often want to define default values. You can do this just as in Python and we will look at our old friend, the Cobb-Douglas utility function.
cobb_douglas = function(x, a = 0.5, b = 0.5) {
  u = x[1]**a * x[2]**b
  u
}
x = c(1, 2)
print(cobb_douglas(x))
## [1] 1.414214
print(cobb_douglas(x, b = 0.4, a = 0.6))
## [1] 1.319508
Note that I not only overwrote the default values but also changed the order of the arguments. Similarly to Python, you can change the order of arguments if you call the arguments by their keywords.
1.4.3 Arbitrary number of arguments
Sometimes you might want to write a function which takes an arbitrary number of arguments. You can do this with the dot-dot-dot argument. I demonstrate its usefulness with a nice little function by Hadley Wickham. Note that str_c is a function to combine strings into a single one.
commas <- function(...) {
  stringr::str_c(..., collapse = ", ")
}
commas(letters[1:10])
## [1] "a, b, c, d, e, f, g, h, i, j"
1.4.4 Function returns
I already told you that a function automatically returns the value of the last statement evaluated. You can however also be explicit about it by using the return function.
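An explicit return is most useful when you want to leave a function early. The following sketch uses a hypothetical helper `share_missing` that combines an early return with the implicit return of the last statement.

```r
# Hypothetical helper: share of missing values, with an early exit
share_missing <- function(x) {
  if (length(x) == 0) {
    return(NA_real_)  # leave the function early for empty input
  }
  mean(is.na(x))      # last evaluated statement is returned implicitly
}

share_missing(c(1, NA, 3, NA))
share_missing(c())
```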
1.4.5 Pipes
Piping is a practice that we will often use in this course to quickly chain a lot of functions together. I will tell you more about it later. For now, we only have to know under which conditions functions are pipeable.
A function is pipeable if you pass an object to it and the function returns a modified version of the object. An example of a pipeable function would be one to which you pass a numeric vector and it returns it in a sorted form.
A function is also pipeable if we pass an object to a function and the function returns it unmodified but creates what R programmers such as Hadley Wickham call a side-effect in the meantime. Such side-effects could be the drawing of a plot or the display of basic information about the object that is passed to the function. I will again illustrate this with a useful function by Wickham. We could program our function calc_percent_missing from above in such a way.
calc_percent_missing = function(x) {
  n = mean(is.na(x))
  cat("Percent missing: ", n*100, '!', sep='')
  invisible(x)
}
test_vector = c(1, 2, 3, NA, 6, 9, NA, NA, NA, 10)
calc_percent_missing(test_vector)
## Percent missing: 40!
As you can see, this function displays the percentage of missing values on the screen. It still returns the unchanged vector x, although it does not print it out.
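To see why the invisible return value matters, here is a sketch that puts such a function in the middle of a chain. It assumes R 4.1 or later, since it uses the native pipe |> (the course may well use a different pipe operator later on), and it adds a line break to the printed message for readability.

```r
# Side-effect function as described above: prints, then returns x invisibly
calc_percent_missing <- function(x) {
  n <- mean(is.na(x))
  cat("Percent missing: ", n * 100, '!\n', sep = '')
  invisible(x)
}

test_vector <- c(1, 2, 3, NA, 6, 9, NA, NA, NA, 10)

# Native pipe |> (assumes R >= 4.1): the vector flows through unchanged,
# so a further step can follow the side effect
mean_of_observed <- test_vector |>
  calc_percent_missing() |>
  mean(na.rm = TRUE)

mean_of_observed
```

Because the function hands back its input, the call to mean receives the original vector and not the printed message.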
1.5 Iterations
Now that we have heard about functions and conditionals, it is time to think about iterations again. We will first look at the kind of loops we also know from Python and then briefly look at the map function family.
Before we actually start, let’s briefly create a dataframe or – more specifically – a tibble.
# letters denote variable names
my_df = tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
You will notice that you now have a new dataframe in your workspace, my_df. We will work with it in the following.
1.5.1 for loops
1.5.1.1 Looping over numeric indices
Let’s say you want to calculate the median for each of the columns in our dataframe. You can of course do this by hand.
median(my_df$a)
## [1] 0.07052297
median(my_df$b)
## [1] 0.3100978
median(my_df$c)
## [1] 0.4634452
median(my_df$d)
## [1] 0.3145689
This looks like a lot of repetition though, right? Maybe we can loop over the different columns.
output = vector("double", ncol(my_df))
for (i in seq_along(my_df)) {
  output[i] = median(my_df[[i]])
}
output
## [1] 0.07052297 0.31009782 0.46344516 0.31456887
The first line initializes an output vector. Make sure to always pre-allocate space for your output. In this case, we have pre-specified the length of the double vector to be equal to the number of columns in our dataframe.
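The second line contains the looping statement. The seq_along function acts as a safe alternative to 1:length(x) here. Why is it safer, you ask? For a vector of length zero, seq_along returns an empty sequence, whereas 1:length(x) counts down from 1 to 0.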
test_vector = vector("double", 0)
seq_along(test_vector)
## integer(0)
1:length(test_vector)
## [1] 1 0
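The consequence for a loop can be made visible by counting how often the loop body runs. The following sketch uses two counters whose names are chosen only for illustration.

```r
empty_vector <- vector("double", 0)

# Count how often each loop body runs for an empty vector
runs_with_length <- 0
for (i in 1:length(empty_vector)) {
  runs_with_length <- runs_with_length + 1
}

runs_with_seq_along <- 0
for (i in seq_along(empty_vector)) {
  runs_with_seq_along <- runs_with_seq_along + 1
}

runs_with_length     # 2, because 1:0 counts down from 1 to 0
runs_with_seq_along  # 0, the loop is skipped entirely
```

With 1:length() the loop runs twice on indices that do not exist, which is rarely what you want.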
The body of the loop calculates the median for each of the dataframe columns. Note also the use of the subsetting operators here. Why did I use [[]] for subsetting the dataframe?
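One way to explore this question is to compare what the two operators return. The sketch below uses a small base R data frame, `small_df`, with made-up values as a stand-in for my_df.

```r
# A small base R data frame standing in for my_df (illustrative values)
small_df <- data.frame(a = c(3, 1, 2), b = c(6, 5, 4))

class(small_df[1])    # single brackets keep the data frame structure
class(small_df[[1]])  # double brackets extract the column as a vector

# median() works on the extracted vector
median(small_df[[1]])
```

Since a dataframe is a list of columns, the rules for subsetting lists apply: [] gives you a smaller dataframe, [[]] gives you the column itself, and the column itself is what median needs.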
1.5.1.2 Looping over elements
You can also write loops to iterate over elements. This is particularly useful in those cases where you do not want to store output.
for (x in my_df) {
print(median(x))
}
## [1] 0.07052297
## [1] 0.3100978
## [1] 0.4634452
## [1] 0.3145689
In this case, I did not store the median values anywhere but just displayed them on the screen. Note that it would have been difficult to use this kind of loop for storing the elements in an initialized output vector as there is no natural way to index an output vector.
1.5.1.3 Looping over names
Finally, note that you can also loop over the names of a vector.
for (name in names(my_df)) {
cat("Median for variable ", name, " is: ", median(my_df[[name]]), '\n', sep='')
}
## Median for variable a is: 0.07052297
## Median for variable b is: 0.3100978
## Median for variable c is: 0.4634452
## Median for variable d is: 0.3145689
1.5.1.4 Repeat until with while
Naturally, R also has a while structure which is particularly useful if you do not know in advance how often a particular piece of code should be run. We had a longer discussion of this concept in the Python class, so I will only provide one brief example here. The following loop prints out the square of a number that is provided by a user. If ‘q’ is entered, R leaves the loop.
#while (TRUE) {
# n <- readline(prompt="Enter an integer (quit with q): ")
# if (n == 'q') {
# break
# } else {
# cat(as.numeric(n), ' squared is ', as.numeric(n)**2, '!', sep = '')
# }
#}
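Since the loop above needs user input, here is a second sketch of a while loop that runs without interaction. It doubles a value until it exceeds a threshold; the starting value and the threshold are arbitrary choices.

```r
# Double a value until it exceeds a threshold (both numbers are arbitrary)
value <- 1
steps <- 0
while (value <= 100) {
  value <- value * 2
  steps <- steps + 1
}

value  # first power of two above 100
steps  # number of doublings that were needed
```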
1.5.2 Split, apply, combine with map functions
There is one cool feature which I briefly want to tell you about. Very often we want to loop over a vector, do something to each element of the vector and then save the results. You could also summarize this as split - apply - combine. The purrr package provides the functionalities for such operations in a handy manner. Let’s start with a basic example.
1.5.2.1 Basic example
To see how it works let’s get back to a simple dataframe. To remind you (and me), I create a new one here. Note that the b column contains a missing value.
my_df = tibble::tibble(
  a = rnorm(10),
  b = c(rnorm(9), NA),
  c = rnorm(10),
  d = rnorm(10)
)
Let’s assume that we want to have the median of every column again. We already created a loop to do just that. There is a simpler way though. We can also use a map function to do this in a very compact manner.
map_dbl(my_df, median)
## a b c d
## 0.69336239 NA 0.08664286 -0.01160225
We can compute the mean and standard deviation in a similar way.
map_dbl(my_df, mean)
## a b c d
## 0.4504680 NA -0.1818221 0.2939560
map_dbl(my_df, sd)
## a b c d
## 1.0995427 NA 0.7371080 0.9824286
Pretty convenient, right?
1.5.2.2 Syntax
While there are several map functions they all have the same argument structure:
- You have to pass the vector / list whose elements you want to operate on.
- You have to pass the function which you want to apply to each element.
You can also pass additional arguments to the map functions. Say we want to compute the mean for our b column as well.
map_dbl(my_df, mean, na.rm = TRUE)
## a b c d
## 0.4504680 -0.4295373 -0.1818221 0.2939560
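The following sketch shows the same pattern on a small made-up list, `measurements`, so that the results are reproducible. It also shows two equivalent ways of writing the call: an anonymous function, and the base R function vapply, which asks for the expected output type explicitly.

```r
library(purrr)

# Illustrative list with one missing value
measurements <- list(a = c(1, 2, 3), b = c(4, NA, 6))

# Extra arguments after the function are passed on to it
means <- map_dbl(measurements, mean, na.rm = TRUE)

# The same computation written with an anonymous function
means_anonymous <- map_dbl(measurements, function(x) mean(x, na.rm = TRUE))

# Base R counterpart with an explicit output type
means_base <- vapply(measurements, mean, numeric(1), na.rm = TRUE)

means
```

All three calls return the same named double vector.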
Finally, I already told you that there are several map functions, but so far we have only used the map_dbl function. Here is the whole family:
- map(): makes a list,
- map_lgl(): makes a logical vector,
- map_int(): makes an integer vector,
- map_dbl(): makes a double vector,
- map_chr(): makes a character vector.
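The function choice depends on the output you expect. If we apply the mean function to each element (i.e. column) of our dataframe, we usually expect to obtain a double vector that contains the means. But if we want a vector of strings, we can also do that. Note that purrr 1.0.0 and later deprecate this automatic conversion from double to character and ask for an explicit as.character call inside the function instead.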
map_chr(my_df, mean, na.rm = TRUE)
## a b c d
## "0.450468" "-0.429537" "-0.181822" "0.293956"
1.5.2.3 map2 and pmap
There are some extensions to the basic map functionality.
For example, what do we have to do if we want to iterate over two lists? We can use the map2 function. Say we want to create a new dataframe but this time not all variables are supposed to have the same mean and standard deviation. Let’s create two lists with our desired means and standard deviations.
mu = list(0, 0, 0, 0)
sd = list(1, 5, 10, 20)

my_fancy_df = map2(mu, sd, rnorm, n = 20)
my_fancy_df = set_names(my_fancy_df, c('a', 'b', 'c', 'd'))
my_fancy_df = tibble::as_tibble(my_fancy_df)

# look at the head of the dataframe
head(my_fancy_df)
## # A tibble: 6 × 4
## a b c d
## <dbl> <dbl> <dbl> <dbl>
## 1 0.128 -0.996 -4.12 -14.1
## 2 0.653 -8.21 0.552 30.4
## 3 1.54 -0.00726 -0.649 -21.9
## 4 0.668 -3.30 2.98 7.72
## 5 0.422 -4.42 -2.06 12.7
## 6 -2.01 -0.967 -2.04 36.5
OK. So what should we do if we want to iterate over more than two lists? Say we want to create vectors with different lengths, means and standard deviations. We could use pmap.
mu = list(1, 2, 3, 4)
sd = list(1, 10, 15, 20)
n = list(1, 2, 3, 4)

# enter arguments in the right order
args = list(n, mu, sd)

str(pmap(args, rnorm))
## List of 4
## $ : num 0.431
## $ : num [1:2] 13.82 3.46
## $ : num [1:3] -3.05 3.05 10.21
## $ : num [1:4] -28.46 -4.99 -5.07 -11.3
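Getting the order of the arguments right is easy to forget. As a sketch of an alternative, you can name the elements of the argument list after the parameters of rnorm (n, mean and sd); pmap then matches them by name, so the order no longer matters. The variable names `named_args` and `draws` are chosen only for this illustration.

```r
library(purrr)

# Names match the arguments of rnorm(n, mean, sd), so order no longer matters
named_args <- list(
  sd   = list(1, 10, 15, 20),
  mean = list(1, 2, 3, 4),
  n    = list(1, 2, 3, 4)
)

draws <- pmap(named_args, rnorm)

# Each element has the length requested by the corresponding n
lengths(draws)
```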
And this is how our brief tour of R programming ends.
Sources
The exposition here is heavily inspired by the notes for a new book on R data science by Garrett Grolemund and Hadley Wickham. You can find detailed outlines here: http://r4ds.had.co.nz.