Chapter 5 Reading and Writing Data
So far we have only used datasets from R packages or created toy tibbles. In this section, we will learn how to read in data from a variety of sources.
5.1 Reading csv files with readr
We will start with the readr package, which is useful to a) read in csv files and b) correctly parse data columns. Before we start, let’s load some packages.
library(readr)
library(tibble)
library(dplyr)
5.1.1 Reading in *-delimited data
The readr package provides several functions to read in delimited data:

- read_csv(): comma delimited
- read_csv2(): semicolon delimited
- read_tsv(): tab delimited
- read_delim(): any delimiter
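To see how the first three variants differ, here is a quick sketch using inline strings (the same trick used throughout this section):

```r
library(readr)

# read_csv2() expects ";" as the field delimiter (European convention)
read_csv2("a;b\n1;2", show_col_types = FALSE)

# read_tsv() expects tab-separated fields
read_tsv("a\tb\n1\t2", show_col_types = FALSE)
```

Both calls return a 1 × 2 tibble with columns a and b.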
To see how they work, let’s create some data and read them in. (I know, I know still no real data. Be patient!)
my_csv = "a, b, c, d
1, 2, 3, 4
5, 6, 7, 8"
read_csv(my_csv)
## Rows: 2 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): a, b, c, d
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 4
## a b c d
## <dbl> <dbl> <dbl> <dbl>
## 1 1 2 3 4
## 2 5 6 7 8
As you can see, the function correctly interpreted the first line of our string as variable names and the remaining elements as comma-separated numbers. The other functions work in a similar way. There is one exception though: read_delim(). It allows for more general specifications.
my_csv = "a_ b_ c_ d
1_ 2_ Maria_ female
5_ 6_ Teresa_ female"
read_delim(my_csv, delim = "_", trim_ws = TRUE)
## Rows: 2 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "_"
## chr (2): c, d
## dbl (2): a, b
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 4
## a b c d
## <dbl> <dbl> <chr> <chr>
## 1 1 2 Maria female
## 2 5 6 Teresa female
In this case, we specified an underscore as the delimiter and told the function to trim leading and trailing whitespace. Sometimes the csv file includes lines which we want readr to ignore. To do this, we use the skip argument.
my_csv = "Sometimes you can read some rubbish here
We don't want to import this
name, age
Julian, 29"
read_csv(my_csv, skip = 2)
## Rows: 1 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (1): age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 2
## name age
## <chr> <dbl>
## 1 Julian 29
At other times, the csv file does not provide variable names. We can supply them by passing a character vector as the col_names argument.
my_csv = "Julian, 29\nTeresa, 25"
read_csv(my_csv, col_names = c("Name", "Age"))
## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Name
## dbl (1): Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 2
## Name Age
## <chr> <dbl>
## 1 Julian 29
## 2 Teresa 25
There are more options, and you can explore them by looking them up in the help file. For now, I will only show you one more useful one: you can use the na argument to specify the characters used in the csv file to indicate missing values.
my_csv = "Julian, 29
Teresa, 25
Jonas, .
., 64"
read_csv(my_csv, col_names = c("Name", "Age"), na = ".")
## Rows: 4 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Name
## dbl (1): Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 4 × 2
## Name Age
## <chr> <dbl>
## 1 Julian 29
## 2 Teresa 25
## 3 Jonas NA
## 4 <NA> 64
5.2 Parsing data correctly with the parse functions
Sometimes columns are not correctly interpreted when they are read in. We can then parse them differently using the parse functions. Each of these functions takes a vector and returns a vector. For example, assume in the following that I want to parse the age column not as a numeric vector but as a character vector. Let’s try the parse_character function for this.
my_csv = "Julian, 29
Teresa, 25
Jonas, .
., 64"
my_tibble = read_csv(my_csv, col_names = c("name", "age"), na = ".")
## Rows: 4 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (1): age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(my_tibble)
## # A tibble: 4 × 2
## name age
## <chr> <dbl>
## 1 Julian 29
## 2 Teresa 25
## 3 Jonas NA
## 4 <NA> 64
str(parse_character(my_tibble$age))
## Error in parse_vector(x, col_character(), na = na, locale = locale, trim_ws = trim_ws): is.character(x) is not TRUE
This fails because the parse functions only accept character vectors as input; age has already been parsed as a double, so it would first have to be converted back with as.character().
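Since the parse functions expect character input, one workaround (a sketch, recreating the age values so the chunk is self-contained) is to convert the column first:

```r
library(readr)

# the age column as it was read in: a double vector with an NA
age = c(29, 25, NA, 64)

# parse_* functions only accept character vectors, so convert first
age_chr = parse_character(as.character(age))
str(age_chr)
##  chr [1:4] "29" "25" NA "64"
```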
There are a bunch of functions, each for a different kind of data type:

- parse_logical()
- parse_number() (parse_double(), parse_integer())
- parse_character()
- parse_datetime() (parse_date(), parse_time())
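The logical and integer parsers are not shown in the subsections below, so here is a quick sketch of them:

```r
library(readr)

# "NA" is recognized as missing by default
parse_logical(c("TRUE", "FALSE", "NA"))
## [1]  TRUE FALSE    NA

# parse_integer() is stricter than parse_number(): the strings must be
# plain integers
parse_integer(c("1", "2", "3"))
## [1] 1 2 3
```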
Let’s look at some of them in the following.
5.2.1 Parsing numbers
The parse_number function is pretty amazing. It can recover number vectors from almost anything! Just look at the following example.
my_column = c("100$", "20%", "Something with a 100")
parse_number(my_column)
## [1] 100 20 100
You can use the locale function to control for country-specific conventions, such as the decimal mark and the grouping mark.
# comma instead of decimal point
my_column = c("1,23", "1,23", "1,245")
parse_number(my_column, locale = locale(decimal_mark = ","))
## [1] 1.230 1.230 1.245
# ' to group numbers
my_column = c("123'456'789")
parse_number(my_column, locale = locale(grouping_mark = "'"))
## [1] 123456789
5.2.2 Parsing characters
You would think parsing characters is pretty easy. There can be difficulties though, because not everybody uses the same character encoding. To learn more about this topic, take a look at this website: http://www.w3.org/International/articles/definitions-characters/. For now, we only need to know that different encodings exist and that they can lead to problems. R usually assumes UTF-8 encoding (and you should use it too!). See what happens if we read in characters with Latin-1 encoding:
x1 = "El Ni\xf1o was particularly bad this year"
parse_character(x1)
## [1] "El Ni\xf1o was particularly bad this year"
Well, that does not look nice. But luckily enough, we can use the locale function to tell readr that the string is encoded with Latin-1.
parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
Now, this time the parsing is correct! You can also use readr to try to guess the encoding with the guess_encoding function. Look up its help file if you want to know more.
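As a quick sketch, guess_encoding() takes a raw vector (or a file path) and returns candidate encodings with confidence scores:

```r
library(readr)

x1 = "El Ni\xf1o was particularly bad this year"

# returns a tibble of candidate encodings, most likely first
guess_encoding(charToRaw(x1))
```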
5.2.3 Parsing dates
# specify time zone maybe
parse_datetime("2016-09-08T0708")
## [1] "2016-09-08 07:08:00 UTC"
parse_datetime("20160908T0708")
## [1] "2016-09-08 07:08:00 UTC"
parse_date("2016-09-08")
## [1] "2016-09-08"
parse_date("2016/09/08")
## [1] "2016-09-08"
library(hms)
parse_time("01:10 am")
## 01:10:00
parse_date("27/05/1987", "%d/%m/%Y")
## [1] "1987-05-27"
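Regarding the time-zone comment above: by default, parse_datetime() interprets times as UTC. A sketch of changing that via the locale function:

```r
library(readr)

# default: UTC
parse_datetime("2016-09-08T0708")
## [1] "2016-09-08 07:08:00 UTC"

# locale(tz = ...) attaches a different time zone instead
parse_datetime("2016-09-08T0708", locale = locale(tz = "Europe/Berlin"))
```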
5.3 Parsing and reading at the same time
Each parse function has a corresponding col function. This allows you to use the parse functions to find out how to correctly parse a column and then specify the correct parsing right at the beginning of the data processing using the corresponding col functions. I usually read in data in three steps.
- First, I read in all columns as character vectors. This allows me to browse the data and determine the correct parsing. To read in every column as a character vector, you can use the .default argument of the cols() function, passed via col_types.
# there's an example dataset in the readr package called challenge.csv
challenge1 = read_csv(readr_example("challenge.csv"),
                      col_types = cols(
                        .default = col_character()
                      ))
- Second, I can try out different parsers using the parse functions. (Note that you can use the parse functions from the readr package together with other packages such as readxl.) In this case, browsing and parsing will lead you to conclude that the correct parsers are parse_double and parse_date, respectively.
- Finally, we specify the correct parsers directly at the beginning of the data processing stage using the col functions that correspond to the parse functions.
challenge2 = read_csv(readr_example("challenge.csv"),
                      col_types = cols(
                        x = col_double(),
                        y = col_date()
                      ))
head(challenge2)
## # A tibble: 6 × 2
## x y
## <dbl> <date>
## 1 404 NA
## 2 4172 NA
## 3 3004 NA
## 4 787 NA
## 5 37 NA
## 6 2332 NA
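As the messages above hint, spec() retrieves the column specification readr actually used; a sketch (re-creating challenge2 so the chunk is self-contained) of how you can inspect it and copy it into col_types:

```r
library(readr)

challenge2 = read_csv(readr_example("challenge.csv"),
                      col_types = cols(
                        x = col_double(),
                        y = col_date()
                      ))

# prints a cols(...) call you can paste straight into col_types
spec(challenge2)
```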
5.4 Writing csv files
You can use the readr package to write csv files as well. In this case, we want to save our cleaned-up dataframe as a csv file in a dataframes sub-folder. For this, we first check whether the sub-folder already exists. If it does not, we create it.
if (!file.exists("dataframes")) {
dir.create("dataframes")
}
Then, we write the csv file using the write_csv function.
write_csv(challenge2, "./dataframes/challenge2.csv")
You can now check in your working directory whether this worked. Note that a csv file does not store the information about the correct parsing of the data columns.
read_csv("./dataframes/challenge2.csv")
## Rows: 2000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): x
## date (1): y
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2,000 × 2
## x y
## <dbl> <date>
## 1 404 NA
## 2 4172 NA
## 3 3004 NA
## 4 787 NA
## 5 37 NA
## 6 2332 NA
## 7 2489 NA
## 8 1449 NA
## 9 3665 NA
## 10 3863 NA
## # … with 1,990 more rows
We have to specify the correct parsing again! If you only work with R and the dataframe is not too big, you can store the dataframe as an RDS file instead.
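A sketch of re-supplying the parsing on re-read (using a temporary file as a stand-in for the ./dataframes path, so the chunk is self-contained):

```r
library(readr)

challenge2 = read_csv(readr_example("challenge.csv"),
                      col_types = cols(x = col_double(), y = col_date()))

path = tempfile(fileext = ".csv")  # stand-in for "./dataframes/challenge2.csv"
write_csv(challenge2, path)

# supplying col_types again restores the original parsing
reread = read_csv(path, col_types = cols(x = col_double(), y = col_date()))
```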
5.5 Write and read RDS files with readr
There is not much to say here apart from the fact that the RDS file ‘remembers’ the correct parsing.
write_rds(challenge2, "./dataframes/challenge2.rds")
challenge_rds = read_rds("./dataframes/challenge2.rds")
head(challenge_rds)
## # A tibble: 6 × 2
## x y
## <dbl> <date>
## 1 404 NA
## 2 4172 NA
## 3 3004 NA
## 4 787 NA
## 5 37 NA
## 6 2332 NA
5.6 Reading in Excel sheets and Stata data with readxl and haven
Many datasets are stored in Excel sheets. You can read them in using the readxl package. A similar package, haven, exists for Stata files. I will not use them here, but you can look at the help files for the commands read_excel and read_dta.
library(readxl)
help(read_excel)
library(haven)
help(read_dta)
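If you want a quick first try without downloading anything, readxl ships a few example workbooks; a sketch:

```r
library(readxl)

# readxl_example() returns the path to a bundled example workbook
path = readxl_example("datasets.xlsx")
head(read_excel(path))
```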
If you want to try them out, you can use the data files from the ‘Baby-Wooldridge’ here: http://www.cengage.com/aise/economics/wooldridge_3e_datasets/.
Sources
The exposition here is inspired by the notes for a new book on R data science by Garrett Grolemund and Hadley Wickham. You can find detailed outlines here: http://r4ds.had.co.nz.