Chapter 5 Reading and Writing Data

So far we have only used datasets from R packages or created toy tibbles. In this section, we will learn how to read in data from a variety of sources.

5.1 Reading `csv` files with `readr`

We will start with the readr packages which is useful to a) read in csv files and to b) correctly parsing data columns. Before we start, let’s load some packages.

library(readr)
library(tibble)
library(dplyr)

5.1.1 Reading in *-delimited data

The readr package provides several functions to read in delimited data:

read_csv(): comma delimited
read_csv2(): semicolon delimited
read_tsv(): tab delimited
read_delim(): any delimiter

To see how they work, let’s create some data and read them in. (I know, I know still no real data. Be patient!)

my_csv = "a, b, c, d 
          1, 2, 3, 4
          5, 6, 7, 8"

read_csv(my_csv)

## Rows: 2 Columns: 4

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): a, b, c, d

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 4
##       a     b     c     d
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     2     3     4
## 2     5     6     7     8

As you can see, the function correctly interpreted the first line of our string as variable names and the remaining elements as comma-separated integers. The other functions work in a similar way. There is one exception though: read_delim. It allows for more general specifications.

my_csv = "a_ b_ c_ d
          1_ 2_ Maria_ female
          5_ 6_ Teresa_ female"

read_delim(my_csv, delim = "_", trim_ws = TRUE)

## Rows: 2 Columns: 4

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "_"
## chr (2): c, d
## dbl (2): a, b

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 4
##       a     b c      d     
##   <dbl> <dbl> <chr>  <chr> 
## 1     1     2 Maria  female
## 2     5     6 Teresa female

In this case, we specified an underscore as the delimiter and told the function to trim leading and trailing whitespace. Sometimes the csv file includes lines which we want readr to ignore. To do this we use the skip argument.

my_csv = "Sometimes you can read some rubbish here
          We don't want to import this 
          name, age
          Julian, 29"
read_csv(my_csv, skip = 2)

## Rows: 1 Columns: 2

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (1): age

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1 × 2
##   name     age
##   <chr>  <dbl>
## 1 Julian    29

At other times, the csv file does not provide variable names, we can provide them by passing a character vector as the col_names argument.

my_csv = "Julian, 29\nTeresa, 25"

read_csv(my_csv, col_names = c("Name", "Age"))

## Rows: 2 Columns: 2

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Name
## dbl (1): Age

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 2
##   Name     Age
##   <chr>  <dbl>
## 1 Julian    29
## 2 Teresa    25

There are more options and you can explore them by looking them up in the help file. For now, I will only show you one more useful option: You can use the na argument to specify the characters used in the csv file to indicate missing values.

my_csv = "Julian, 29
          Teresa, 25
          Jonas, . 
          ., 64"
read_csv(my_csv, col_names = c("Name", "Age"), na = ".")

## Rows: 4 Columns: 2

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Name
## dbl (1): Age

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 4 × 2
##   Name     Age
##   <chr>  <dbl>
## 1 Julian    29
## 2 Teresa    25
## 3 Jonas     NA
## 4 <NA>      64

5.2 Parsing data correctly with the `parse` functions

Sometimes columns are not correctly interpreted when they are read in. For these vectors, we can parse them differently using the parse functions. Each of these functions takes a vector and returns a vector. For example, assume in the following that I want to parse the age column not as an integer vector but as a character vector. I can use the parse_character function for this.

my_csv = "Julian, 29
          Teresa, 25
          Jonas, . 
          ., 64"
my_tibble = read_csv(my_csv, col_names = c("name", "age"), na = ".")

## Rows: 4 Columns: 2

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (1): age

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(my_tibble)

## # A tibble: 4 × 2
##   name     age
##   <chr>  <dbl>
## 1 Julian    29
## 2 Teresa    25
## 3 Jonas     NA
## 4 <NA>      64

str(parse_character(my_tibble$age))

## Error in parse_vector(x, col_character(), na = na, locale = locale, trim_ws = trim_ws): is.character(x) is not TRUE

There are a bunch of functions, each for a different kind of data type:

parse_logical
parse_number (parse_double, parse_integer)
parse_character
parse_datetime (parse_date, parse_datetime)

Let’s look at some of them in the following.

5.2.1 Parsing numbers

The parse_number function is pretty amazing. It can recover number vectors from almost anything! Just look at the following example.

my_column = c("100$", "20%", "Something with a 100")
parse_number(my_column)

## [1] 100  20 100

You use the locale function to control for country-specific delimiters for the decimal point and grouping.

# comma instead of decimal point
my_column = c("1,23", "1,23", "1,245")
parse_number(my_column, locale = locale(decimal_mark = ","))

## [1] 1.230 1.230 1.245

# ' to group numbers
my_column = c("123'456'789")
parse_number(my_column, locale = locale(grouping_mark = "'"))

## [1] 123456789

5.2.2 Parsing characters

You would think parsing characters is pretty easy. There can be difficulties though because not everybody uses the same character encoding. To learn more about this topic, take a look at this website: http://www.w3.org/International/articles/definitions-characters/. We only need to know that different encodings exist and they can lead to problems. R usually assumes to ‘UTF-8’ encoding (and you should use it too!). See what happens if we read in characters with Latin-1 encoding:

x1 = "El Ni\xf1o was particularly bad this year"
parse_character(x1)

## [1] "El Ni\xf1o was particularly bad this year"

Well, that does not look nice. But luckily enough, we can use the locale function to tell readr that the string is encoded with Latin-1.

parse_character(x1, locale = locale(encoding = "Latin1"))

## [1] "El Niño was particularly bad this year"

Now, this time the parsing is correct! You can also use readr to try to guess the encoding with the guess_encoding function. Look up its help file if you want to know more.

5.2.3 Parsing dates

# specify time zone maybe
parse_datetime("2016-09-08T0708")

## [1] "2016-09-08 07:08:00 UTC"

parse_datetime("20160908T0708")

## [1] "2016-09-08 07:08:00 UTC"

parse_date("2016-09-08")

## [1] "2016-09-08"

parse_date("2016/09/08")

## [1] "2016-09-08"

library(hms)
parse_time("01:10 am")

## 01:10:00

parse_date("27/05/1987", "%d/%m/%Y")

## [1] "1987-05-27"

5.3 Parsing and reading at the same time

Each parse function has a corresponding col function. This allows you to use the parse function to find out how to correctly parse a column and then specify the correct parsing right at the beginning of the data processing using the corresponding col function. I usually read in data in three steps.

First, I read in all columns as character vectors. This allows me to browse the data and determine the correct parsing. To read in every column as a character vector, you can use the .default argument in the col_types function.

challenge1 = read_csv(readr_example("challenge.csv"),  
                      # there's an example dataset in the readr package called challenge.csv
                     col_types = cols(
                       .default = col_character()
                     ))

I can try out different parsers using the parse functions. (Note that you can use the parse functions from the readr package together with other packages such as readxl). In this case, browsing and parsing will lead you to conclude that the correct parsers are parse_double and parse_data, respectively.
Finally, we specify the correct parsers directly at the beginning of the data processing stage using the col functions that correspond to the parse functions.

challenge2 = read_csv(readr_example("challenge.csv"),
                      col_types = cols(
                        x = col_double(),
                        y = col_date()
                      ))
head(challenge2)

## # A tibble: 6 × 2
##       x y         
##   <dbl> <date>    
## 1   404 NA        
## 2  4172 NA        
## 3  3004 NA        
## 4   787 NA        
## 5    37 NA        
## 6  2332 NA

5.4 Writing `csv` files

You can use the readr package to write csv files as well. In this case, we want to save our cleaned up dataframe as a csv file in a dataframes sub-folder. For this, we first check whether the sub-folder already exists. If it does not, we create it.

if (!file.exists("dataframes")) {
  dir.create("dataframes")
}

Then, we write the csv file using the write_csv function.

write_csv(challenge2, "./dataframes/challenge2.csv")

You can check now in your working directory whether this worked. Note that a csv file does not store the information about the correct parsing of the data columns.

read_csv("./dataframes/challenge2.csv")

## Rows: 2000 Columns: 2

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): x
## date (1): y

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2,000 × 2
##        x y         
##    <dbl> <date>    
##  1   404 NA        
##  2  4172 NA        
##  3  3004 NA        
##  4   787 NA        
##  5    37 NA        
##  6  2332 NA        
##  7  2489 NA        
##  8  1449 NA        
##  9  3665 NA        
## 10  3863 NA        
## # … with 1,990 more rows

We have to specify the correct parsing again! If you only work with R and the dataframe is not too big, you can store the dataframe as an RDS file instead.

5.5 Write and read `RDS` files with `readr`

There is not much to say here apart from the fact that the RDS file ‘remembers’ the correct parsing.

write_rds(challenge2, "./dataframes/challenge2.rds")
challenge_rds = read_rds("./dataframes/challenge2.rds")
head(challenge_rds)

## # A tibble: 6 × 2
##       x y         
##   <dbl> <date>    
## 1   404 NA        
## 2  4172 NA        
## 3  3004 NA        
## 4   787 NA        
## 5    37 NA        
## 6  2332 NA

5.6 Reading in Excel sheets and Stata data with `readxl` and `haven`

Many datasets are stored in Excel sheets. You can read them in using the readxl package. A similar package, haven, exists for Stata files. I will not use them here, but can look at the helpfiles for the commands read_excel and read_dta.

library(readxl)
help(read_excel)

library(haven)
help(read_dta)

If you want to try them out, you can use the data files from the ‘Baby-Wooldridge’ here: http://www.cengage.com/aise/economics/wooldridge_3e_datasets/.

Sources

The exposition here is inspired by the notes for a new book on R data science by Garrett Grolemund and Hadley Wickham. You can find detailed outlines here: http://r4ds.had.co.nz.