Concatenating and Merging Data¶
Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer
In the previous lesson we learned how to import a single data set from a file and how to write a pandas DataFrame to a file once we are finished working on it. Another common task is to bring multiple datasets together, either by:
concatenating/appending them, i.e. adding new observations of the same variables;
merging data, i.e. adding new variables for the same index.
That’s the focus of this lesson!
import glob
import pandas as pd
import os
Finding Files that look alike¶
In our main data directory, data/bls-employment-state,
we have many different data files, but groups of them share common information. For example:
All files ending in *03 contain a given state's unemployment rate by month; similarly for files ending in *04, *05 and *06.
All files containing ST01 contain data about the labor market in Alabama.
We can piece these files together into one pandas DataFrame that contains all employment statistics that we want to work with in what follows.
The glob library contains a function, also called glob, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character.
We can use this to get the names of CSV files that match a certain pattern. For example we can find all files about unemployment rates as follows:
pathRoot = r'data/bls-employment-state/' # use your path
# advisable to use os.path.join as this makes concatenation OS independent
unemploymentFiles = glob.glob(os.path.join(pathRoot, "*03.csv"))
unemploymentFiles
['data/bls-employment-state/LAUST330000000000003.csv',
'data/bls-employment-state/LAUST220000000000003.csv',
'data/bls-employment-state/LAUST350000000000003.csv',
'data/bls-employment-state/LAUST180000000000003.csv',
'data/bls-employment-state/LAUST450000000000003.csv',
'data/bls-employment-state/LAUST250000000000003.csv',
'data/bls-employment-state/LAUST490000000000003.csv',
'data/bls-employment-state/LAUST040000000000003.csv',
'data/bls-employment-state/LAUST240000000000003.csv',
'data/bls-employment-state/LAUST170000000000003.csv',
'data/bls-employment-state/LAUST510000000000003.csv',
'data/bls-employment-state/LAUST150000000000003.csv',
'data/bls-employment-state/LAUST340000000000003.csv',
'data/bls-employment-state/LAUST200000000000003.csv',
'data/bls-employment-state/LAUST400000000000003.csv',
'data/bls-employment-state/LAUST260000000000003.csv',
'data/bls-employment-state/LAUST270000000000003.csv',
'data/bls-employment-state/LAUST320000000000003.csv',
'data/bls-employment-state/LAUST230000000000003.csv',
'data/bls-employment-state/LAUST290000000000003.csv',
'data/bls-employment-state/LAUST360000000000003.csv',
'data/bls-employment-state/LAUST060000000000003.csv',
'data/bls-employment-state/LAUST410000000000003.csv',
'data/bls-employment-state/LAUST210000000000003.csv',
'data/bls-employment-state/LAUST560000000000003.csv',
'data/bls-employment-state/LAUST190000000000003.csv',
'data/bls-employment-state/LAUST050000000000003.csv',
'data/bls-employment-state/LAUST530000000000003.csv',
'data/bls-employment-state/LAUST110000000000003.csv',
'data/bls-employment-state/LAUST500000000000003.csv',
'data/bls-employment-state/LAUST380000000000003.csv',
'data/bls-employment-state/LAUST480000000000003.csv',
'data/bls-employment-state/LAUST010000000000003.csv',
'data/bls-employment-state/LAUST420000000000003.csv',
'data/bls-employment-state/LAUST550000000000003.csv',
'data/bls-employment-state/LAUST300000000000003.csv',
'data/bls-employment-state/LAUST540000000000003.csv',
'data/bls-employment-state/LAUST390000000000003.csv',
'data/bls-employment-state/LAUST460000000000003.csv',
'data/bls-employment-state/LAUST160000000000003.csv',
'data/bls-employment-state/LAUST090000000000003.csv',
'data/bls-employment-state/LAUST280000000000003.csv',
'data/bls-employment-state/LAUST310000000000003.csv',
'data/bls-employment-state/LAUST120000000000003.csv',
'data/bls-employment-state/LAUST080000000000003.csv',
'data/bls-employment-state/LAUST370000000000003.csv',
'data/bls-employment-state/LAUST440000000000003.csv',
'data/bls-employment-state/LAUST470000000000003.csv',
'data/bls-employment-state/LAUST100000000000003.csv',
'data/bls-employment-state/LAUST020000000000003.csv',
'data/bls-employment-state/LAUST720000000000003.csv',
'data/bls-employment-state/LAUST130000000000003.csv']
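As a quick sketch of how these wildcard patterns behave, we can test them directly with fnmatch, the pattern-matching module that glob uses under the hood. The filenames below are taken from the listing above; unlike glob, fnmatch only tests the pattern and never touches the filesystem:

```python
from fnmatch import fnmatch

# '*' matches any run of characters (including none); '?' matches exactly one
print(fnmatch('LAUST010000000000003.csv', '*03.csv'))   # True
print(fnmatch('LAUST010000000000004.csv', '*03.csv'))   # False
print(fnmatch('LAUST010000000000003.csv', 'LAUST01*'))  # True: an Alabama file
```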
Let’s verify that we have all 52 series (the 50 states plus the District of Columbia and Puerto Rico):
assert len(unemploymentFiles) == 52
The assert
statement produced no output, which means the condition is True; we can continue, knowing we have every state's data.
Now we need to write a command that imports the data. We do this in two steps: first we write a read_csv()
call that correctly imports an individual data set; then we loop over all the files using that command.
Using what we learned in 02-importingData, let's set up an import command:
test_import = pd.read_csv('data/bls-employment-state/LAUST010000000000003.csv',
                          usecols=['state', 'year', 'period', 'value'],
                          index_col=['state', 'period', 'year'])
test_import.head()
| state | period | year | value |
|---|---|---|---|
| Alabama | M12 | 2016 | 5.9 |
| | M11 | 2016 | 5.7 |
| | M10 | 2016 | 6.2 |
| | M09 | 2016 | 6.2 |
| | M08 | 2016 | 6.2 |
Nice, that works well! Now we need to nest that call inside a loop over all the files:
unemp_data = (pd.read_csv(iFile,
                          usecols=['state', 'year', 'period', 'value'],
                          index_col=['state', 'period', 'year'])
              for iFile in unemploymentFiles)
type(unemp_data)
generator
Hmmm, that’s a type we haven’t seen before in Julian’s classes… so it probably needs a brief explanation:
A generator is an object on which you can call next(): each call returns the next value, until it raises a StopIteration exception to signal that all values have been generated. Such an object is also called an iterator.
For us, this gives an object that will loop over all unemployment files, reading each one in with the read_csv()
call we built. Note that nothing has actually been executed yet.
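A tiny self-contained sketch of that behaviour:

```python
# a generator expression: nothing is computed until we ask for values
squares = (n * n for n in range(3))

print(next(squares))  # 0
print(next(squares))  # 1
print(next(squares))  # 4
try:
    next(squares)
except StopIteration:
    print("all values generated")
```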
Now we want to get all those unemployment data sets appended to each other. We do that using the pd.concat()
function applied to the generator function:
unemployment = pd.concat(unemp_data)
Let's check out what gets returned:
unemployment.head()
| state | period | year | value |
|---|---|---|---|
| New Hampshire | M12 | 2016 | 2.5 |
| | M11 | 2016 | 2.6 |
| | M10 | 2016 | 2.5 |
| | M09 | 2016 | 2.6 |
| | M08 | 2016 | 2.8 |
unemployment.tail()
| state | period | year | value |
|---|---|---|---|
| Georgia | M05 | 2000 | 3.7 |
| | M04 | 2000 | 3.3 |
| | M03 | 2000 | 3.8 |
| | M02 | 2000 | 3.7 |
| | M01 | 2000 | 3.8 |
If this worked correctly, we should have 52 x 17 x 12 observations:
assert unemployment.shape[0] == 52 * 17 * 12
Cool! Note this only gave the exact number because each state has the same number of observations in its original data set.
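To see pd.concat's row-wise stacking in isolation, here is a toy example with made-up numbers rather than the BLS files:

```python
import pandas as pd

# two small frames with identical columns, standing in for two state files
a = pd.DataFrame({'state': ['Alabama', 'Alabama'], 'value': [5.9, 5.7]})
b = pd.DataFrame({'state': ['Georgia', 'Georgia'], 'value': [3.7, 3.3]})

# concat stacks the rows of b below those of a
stacked = pd.concat([a, b], ignore_index=True)
print(stacked.shape)  # (4, 2)
```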
Challenge:¶
Use the template above to append all states' data for the following series, each as a separate DataFrame:
all files ending in *04 as qty_unemployed
all files ending in *05 as qty_employed
all files ending in *06 as labor_force
Hint: write a function that generates the list and returns the appended data
Solution¶
## list the files
qty_unemploymentFiles = glob.glob(os.path.join(pathRoot, "*04.csv"))
qty_employmentFiles = glob.glob(os.path.join(pathRoot, "*05.csv"))
labor_forceFiles = glob.glob(os.path.join(pathRoot, "*06.csv"))
def appendData(fileList):
    data_generator = (pd.read_csv(iFile,
                                  usecols=['state', 'year', 'period', 'value'],
                                  index_col=['state', 'period', 'year'])
                      for iFile in fileList)
    return pd.concat(data_generator)
qty_unemployment = appendData(qty_unemploymentFiles)
qty_employment = appendData(qty_employmentFiles)
labor_force = appendData(labor_forceFiles)
assert qty_unemployment.shape[0] == 52 * 17 * 12
assert qty_employment.shape[0] == 52 * 17 * 12
assert labor_force.shape[0] == 52 * 17 * 12
Merging Data¶
Now we have four data sets, each of which tells us something about the labor market in a state-month pair. Ideally we want these combined into one big data set; this is where merging comes in.
Pandas provides the merge()
function, with which you can merge dataSetA to dataSetB using
mergedData = dataSetA.merge(dataSetB,
left_on='Key', right_on='Key', how='method')
We want to keep only the rows where each data set finds its 'partner' in the other data set. This is an inner join, so we set how='inner'.
Since we merge on the index, we don't need to specify the left_on and right_on
inputs; instead we set left_index=True and right_index=True.
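A sketch of what an inner join on the index does, again with toy numbers rather than the real series:

```python
import pandas as pd

# 'Vermont' appears only in the left frame, so an inner join drops it
left = pd.DataFrame({'unemployment_rate': [2.5, 3.7]},
                    index=pd.Index(['New Hampshire', 'Vermont'], name='state'))
right = pd.DataFrame({'qty_unemployed': [18907]},
                     index=pd.Index(['New Hampshire'], name='state'))

inner = left.merge(right, left_index=True, right_index=True, how='inner')
print(inner.index.tolist())  # ['New Hampshire']: the only key in both frames
```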
Before we proceed, though: each data set's main variable has the same name, value
, so let's rename each one to something meaningful.
unemployment.rename(columns={'value': 'unemployment_rate'},
inplace=True)
qty_employment.rename(columns={'value': 'qty_employed'},
inplace=True)
qty_unemployment.rename(columns={'value': 'qty_unemployed'},
inplace=True)
labor_force.rename(columns={'value': 'labour_force'},
inplace=True)
merged_data = unemployment.merge(qty_unemployment,
left_index = True,
right_index = True,
how='inner')
merged_data.head()
| state | period | year | unemployment_rate | qty_unemployed |
|---|---|---|---|---|
| New Hampshire | M12 | 2016 | 2.5 | 18907 |
| | M11 | 2016 | 2.6 | 19573 |
| | M10 | 2016 | 2.5 | 18430 |
| | M09 | 2016 | 2.6 | 19415 |
| | M08 | 2016 | 2.8 | 20933 |
Challenge¶
Merge the remaining labour force data together so that we end up with one data set,
merged_data
Save
merged_data
to a file out_data/state_labour_statistics.csv
Solution¶
merged_data = merged_data.merge(qty_employment,
left_index = True,
right_index = True,
how='inner')
merged_data = merged_data.merge(labor_force,
left_index = True,
right_index = True,
how='inner')
assert merged_data.shape[0] == 52 * 17 * 12
merged_data.to_csv('out_data/state_labour_statistics.csv')