Concatenating and Merging Data

  • Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

In the previous lesson we learned how to import a single data set from a file and how to write a pandas DataFrame to a file once we are finished working on it. Another common task is to bring multiple datasets together, either by:

  1. concatenating/appending them, i.e. adding new observations of the same variables

  2. merging data: adding new data about different variables for the same index.

That’s the focus of this lesson!
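To fix ideas, here is a minimal sketch of the two operations on made-up toy DataFrames (all names and numbers are invented for the illustration):

import pandas as pd

# toy data: the same variable (an unemployment rate) for two pairs of states
a = pd.DataFrame({'state': ['Alabama', 'Alaska'], 'value': [5.9, 6.7]})
b = pd.DataFrame({'state': ['Arizona', 'Arkansas'], 'value': [5.0, 4.0]})

# concatenating: same variables, new observations stacked underneath
pd.concat([a, b])

# merging: a new variable for the same identifier
c = pd.DataFrame({'state': ['Alabama', 'Alaska'], 'labor_force': [100, 200]})
a.merge(c, on='state', how='inner')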

import glob
import pandas as pd
import os

Finding Files that Look Alike

In our main data directory data/bls-employment-state we have a lot of different data files, but groups of them share common information, for example:

  • All files that end in *03 contain information on a given state’s unemployment rate by month; similarly for files ending in *04, *05, and *06.

  • All files that contain ST01 in their name contain data about the labor market in Alabama.

We can piece these files together into one pandas DataFrame that contains all employment statistics that we want to work with in what follows.

The glob library contains a function, also called glob, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character.
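For instance, a quick hypothetical illustration of the two wildcards (the file names here are made up for the example):

glob.glob('*.csv')          # every CSV file in the current directory
glob.glob('data_200?.csv')  # data_2000.csv, data_2001.csv, ... (any single character in place of ?)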

We can use this to get the names of CSV files that match a certain pattern. For example, we can find all the files about unemployment rates as follows:

pathRoot = r'data/bls-employment-state/'      # use your path
# advisable to use os.path.join as this makes concatenation OS independent
unemploymentFiles = glob.glob(os.path.join(pathRoot, "*03.csv")) 
unemploymentFiles
['data/bls-employment-state/LAUST330000000000003.csv',
 'data/bls-employment-state/LAUST220000000000003.csv',
 'data/bls-employment-state/LAUST350000000000003.csv',
 'data/bls-employment-state/LAUST180000000000003.csv',
 'data/bls-employment-state/LAUST450000000000003.csv',
 'data/bls-employment-state/LAUST250000000000003.csv',
 'data/bls-employment-state/LAUST490000000000003.csv',
 'data/bls-employment-state/LAUST040000000000003.csv',
 'data/bls-employment-state/LAUST240000000000003.csv',
 'data/bls-employment-state/LAUST170000000000003.csv',
 'data/bls-employment-state/LAUST510000000000003.csv',
 'data/bls-employment-state/LAUST150000000000003.csv',
 'data/bls-employment-state/LAUST340000000000003.csv',
 'data/bls-employment-state/LAUST200000000000003.csv',
 'data/bls-employment-state/LAUST400000000000003.csv',
 'data/bls-employment-state/LAUST260000000000003.csv',
 'data/bls-employment-state/LAUST270000000000003.csv',
 'data/bls-employment-state/LAUST320000000000003.csv',
 'data/bls-employment-state/LAUST230000000000003.csv',
 'data/bls-employment-state/LAUST290000000000003.csv',
 'data/bls-employment-state/LAUST360000000000003.csv',
 'data/bls-employment-state/LAUST060000000000003.csv',
 'data/bls-employment-state/LAUST410000000000003.csv',
 'data/bls-employment-state/LAUST210000000000003.csv',
 'data/bls-employment-state/LAUST560000000000003.csv',
 'data/bls-employment-state/LAUST190000000000003.csv',
 'data/bls-employment-state/LAUST050000000000003.csv',
 'data/bls-employment-state/LAUST530000000000003.csv',
 'data/bls-employment-state/LAUST110000000000003.csv',
 'data/bls-employment-state/LAUST500000000000003.csv',
 'data/bls-employment-state/LAUST380000000000003.csv',
 'data/bls-employment-state/LAUST480000000000003.csv',
 'data/bls-employment-state/LAUST010000000000003.csv',
 'data/bls-employment-state/LAUST420000000000003.csv',
 'data/bls-employment-state/LAUST550000000000003.csv',
 'data/bls-employment-state/LAUST300000000000003.csv',
 'data/bls-employment-state/LAUST540000000000003.csv',
 'data/bls-employment-state/LAUST390000000000003.csv',
 'data/bls-employment-state/LAUST460000000000003.csv',
 'data/bls-employment-state/LAUST160000000000003.csv',
 'data/bls-employment-state/LAUST090000000000003.csv',
 'data/bls-employment-state/LAUST280000000000003.csv',
 'data/bls-employment-state/LAUST310000000000003.csv',
 'data/bls-employment-state/LAUST120000000000003.csv',
 'data/bls-employment-state/LAUST080000000000003.csv',
 'data/bls-employment-state/LAUST370000000000003.csv',
 'data/bls-employment-state/LAUST440000000000003.csv',
 'data/bls-employment-state/LAUST470000000000003.csv',
 'data/bls-employment-state/LAUST100000000000003.csv',
 'data/bls-employment-state/LAUST020000000000003.csv',
 'data/bls-employment-state/LAUST720000000000003.csv',
 'data/bls-employment-state/LAUST130000000000003.csv']

Let’s verify that this gives us all 52 ‘states’ (the 50 states plus the District of Columbia and Puerto Rico):

assert len(unemploymentFiles) == 52

The assert statement yielded no output to the console; this means the condition is True, and we can continue knowing we are dealing with all states’ data.
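Had the condition been False, Python would instead have stopped with an AssertionError. For example, with a deliberately wrong count:

assert len(unemploymentFiles) == 53, 'expected one file per state'
# AssertionError: expected one file per state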

Now we need to write a command that imports the data. We do this in two steps: first we write a read_csv() call that correctly imports an individual data set; then we loop over all data sets using that command.

Using what we learned in 02-importingData, let’s set up an import command:

test_import = pd.read_csv('data/bls-employment-state/LAUST010000000000003.csv',
                          usecols=['state', 'year', 'period', 'value'],
                          index_col=['state', 'period', 'year'])
test_import.head()
                      value
state   period year
Alabama M12    2016     5.9
        M11    2016     5.7
        M10    2016     6.2
        M09    2016     6.2
        M08    2016     6.2

Nice, that works well! Now we need to nest that inside a loop over all files:

unemp_data = (pd.read_csv(iFile,
                          usecols=['state', 'year', 'period', 'value'],
                          index_col=['state', 'period', 'year'])
              for iFile in unemploymentFiles)
type(unemp_data)
generator

Hmmm, that’s a type we haven’t seen before in Julian’s classes… so it probably needs a brief explanation:

A generator expression, like the one above, produces an object on which you can call next(): each call returns the next value, until the generator raises a StopIteration exception, signaling that all values have been produced. Such an object is called an iterator.
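A minimal sketch of this behaviour, using a toy generator expression:

squares = (x ** 2 for x in range(3))
next(squares)    # 0
next(squares)    # 1
next(squares)    # 4
next(squares)    # raises StopIteration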

For us, this builds a generator that will loop over all unemployment files, reading each one in according to the read_csv() call we built. Note that it hasn’t actually executed anything yet: the files are only read when something consumes the generator.

Now we want to append all those unemployment data sets to each other. We do that by applying the pd.concat() function to the generator:

unemployment = pd.concat(unemp_data)

Let’s check out what gets returned:

unemployment.head()
                           value
state         period year
New Hampshire M12    2016    2.5
              M11    2016    2.6
              M10    2016    2.5
              M09    2016    2.6
              M08    2016    2.8

unemployment.tail()
                      value
state   period year
Georgia M05    2000     3.7
        M04    2000     3.3
        M03    2000     3.8
        M02    2000     3.7
        M01    2000     3.8

If this worked correctly, we should have 52 x 17 x 12 observations: 52 states, 17 years of data (2000 through 2016, matching the head and tail above), and 12 months per year:

assert unemployment.shape[0] == 52 * 17 * 12

Cool! Note that the count only comes out exactly because each state has the same number of observations in its original data set.
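If we wanted to check that balance directly, one possible sketch is to count the observations per state using the MultiIndex we built above:

obs_per_state = unemployment.groupby(level='state').size()
assert (obs_per_state == 17 * 12).all()   # every state should have 204 monthly observations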

Challenge

Use the template above to append each state’s data for the following series, each as a separate DataFrame:

  • all files ending in *04 as qty_unemployed

  • all files ending in *05 as qty_employed

  • all files ending in *06 as labor_force

Hint: write a function that takes a list of files and returns the appended data.

Solution

## list the files
qty_unemploymentFiles = glob.glob(os.path.join(pathRoot, "*04.csv"))
qty_employmentFiles   = glob.glob(os.path.join(pathRoot, "*05.csv"))
labor_forceFiles      = glob.glob(os.path.join(pathRoot, "*06.csv"))

## read each file lazily and append them into one DataFrame
def appendData(fileList):
    data_generator = (pd.read_csv(iFile,
                                  usecols=['state', 'year', 'period', 'value'],
                                  index_col=['state', 'period', 'year'])
                      for iFile in fileList)
    return pd.concat(data_generator)
qty_unemployment = appendData(qty_unemploymentFiles)
qty_employment   = appendData(qty_employmentFiles)
labor_force      = appendData(labor_forceFiles)
assert qty_unemployment.shape[0] == 52 * 17 * 12
assert qty_employment.shape[0]   == 52 * 17 * 12
assert labor_force.shape[0]      == 52 * 17 * 12

Merging Data

Now we have four data sets that each tell us something about the labor market in a state-month pair. Ideally we want these combined into one big data set - this is where merging comes in.

Pandas comes with a merge() method, with which you can merge dataSetB into dataSetA using

mergedData = dataSetA.merge(dataSetB, 
                            left_on='Key', right_on='Key', how='method')

We want to keep only the rows where each data set successfully finds its ‘partner’ in the other data set; this is an inner join, so we set how='inner'. Since we merge on the index, we don’t have to specify the *_on inputs, but instead set left_index=True and right_index=True.
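As a small sketch of what an inner join on the index does (toy DataFrames, made up for the illustration):

left  = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['b', 'c'])

# an inner join keeps only the common index label 'b';
# an outer join would keep 'a', 'b' and 'c', filling gaps with NaN
left.merge(right, left_index=True, right_index=True, how='inner')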

Before we proceed though, each of our data sets’ main variables has the same name, value; let’s rename these to something meaningful:

unemployment.rename(columns={'value': 'unemployment_rate'},
                    inplace=True)
qty_employment.rename(columns={'value': 'qty_employed'},
                    inplace=True)
qty_unemployment.rename(columns={'value': 'qty_unemployed'},
                    inplace=True)
labor_force.rename(columns={'value': 'labour_force'},
                    inplace=True)
merged_data = unemployment.merge(qty_unemployment, 
                                 left_index  = True,
                                 right_index = True,
                                 how='inner')
merged_data.head()
                           unemployment_rate  qty_unemployed
state         period year
New Hampshire M12    2016                2.5           18907
              M11    2016                2.6           19573
              M10    2016                2.5           18430
              M09    2016                2.6           19415
              M08    2016                2.8           20933

Challenge

  • Merge the remaining employment and labor force data together so that we end up with one data set merged_data

  • Save merged_data to a file out_data/state_labour_statistics.csv

Solution

merged_data = merged_data.merge(qty_employment,
                                left_index  = True,
                                right_index = True,
                                how='inner')
merged_data = merged_data.merge(labor_force,
                                left_index  = True,
                                right_index = True,
                                how='inner')
assert merged_data.shape[0] == 52 * 17 * 12

# make sure the output directory exists before writing the file
os.makedirs('out_data', exist_ok=True)
merged_data.to_csv('out_data/state_labour_statistics.csv')