Chapter 12 Data Wrangling with dplyr
The goal of this exercise is to give you some experience using functions from dplyr
with a little less guidance.
We want to build up some experience with is dplyr
.
Load it:
library(dplyr)
We are going to use a classic dataset from Graduate Econometrics texts that has gasoline consumption and some other variables by country for the 60’s and 70’s.
The data is inside the package plm
and called Gasoline
.
Load it as follows:
# needs the plm libray ... install.packages("plm")
library(plm)
##
## Attaching package: 'plm'
## The following object is masked from 'package:lfe':
##
## sargan
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
data(Gasoline, package = "plm")
Use the help to get a sense of what variable names mean in the data.
Now work through the following exercises:
- Create a dataset
gasoline
which is a tibble - Create a subset of that only has data from the 1960s. Do this two ways, once where you don’t use piping,
%>%
, and once where you do. - Create a subset that contains data from the years ranging from 1969 to 1973.
- Create a subset that contains data for the years 1969, 1973 and 1977.
- Create a dataset that contains only the columns country, year, lrpmg.
- Create a dataset that does not contain the columns country, year, lrpmg.
- Rename the column year to be called date.
- Select all columns that start with “l”.
- Select the columns country, year, and all columns that contain the letters “car”.
- What does the function
pull()
do? Try it on the columnlrpmg
. - Create a grouped data set that groups the data by country.
- Ungroup the dataset from 10.
- Find the mean of lgaspcar by country. Call that variable
avg_lgaspcar
. - Return a dataset that computes the mean of lgaspcar for france.
- Compute the mean, standard deviation, min and max of lgaspcar by country.
- Which country has the highest average gasoline consumption.
- Return a dataset that returns the countries with the highest and lowest average consumption.
- Add a variable
count
to the dataset that has the number of times each country appears in the data - Is it balanced? - Create a meaningless dataset called
spam
that is the exponential of the sum of lgaspcar and lincomep. Also check out what happens if you replacemutate()
withtransmute()
. - Create the lead and lag of lgaspcar for each row of data. Call the new columns
lead_lgaspcar
andlag_lgaspcar
. - The following countries belong the to EU:
<- c("austria", "belgium", "bulgaria", "croatia", "republic of cyprus",
eu_countries "czech republic", "denmark", "estonia", "finland", "france", "germany",
"greece", "hungary", "ireland", "italy", "latvia", "lithuania", "luxembourg",
"malta", "netherla", "poland", "portugal", "romania", "slovakia", "slovenia",
"spain", "sweden", "u.k.")
Create a variable in_eu
in the gasoline data which takes the value TRUE
if a country is in the EU. (Note that the case of the string will matter!)
- Here’s a different way to classify countries:
Mediterranean: france, italy, turkey, greece, spain
Central Europe: germany, austria, switzerl, belgium, netherla
Anglosphere: canada, u.s.a. , u.k., ireland
Nordic: denmark, norway, sweden
asia: japan
Create a new variable
region
that uses these definitions. (Hint:case_when()
) will likely be your friend here.
- Notice that in the country names switzerl and netherla are a little funky. Use the functions
mutate
andrecode
to replace the name with the full country name. - Compute the variable
quintile
that computes which quintile of lgaspcar each country is. Do this only for 1960. Repeat this to create a variabledecile
with the appropriate definiiton. - Create a variable
high_consumption
that takes the valueTRUE
if lgaspcar is higher than the yearly average for a given country.