Importing (and Exporting) Data

  • Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter]: @lachlandeer

In our early adventures we have focussed on attributes of pandas objects, and created small data sets on the fly. This is rarely how empirical work happens - usually we have some data stored on our computer / in the cloud that we want to load in, work with and then potentially save the results.

Pandas has functionality to read and write from/to many different formats. A complete list is available here: http://pandas.pydata.org/pandas-docs/version/0.20/io.html. For economists, we are typically interested in working with:

  • Csv files: read_csv(), write_csv()

  • Stata Files: read_stata(), write_stata()

  • Excel files: read_excel(), write_excel()

and possibly working with sql or sas data too.

Hopefully this course has emphasized flat files, that are not encrypted to a particular language are a preferred go-to - so we focus on the csv file functions. Exceptions to this rule are OK if your data is ‘big’ when you might have to look at sql, or HDF5 or Google Big Query structures, but we are getting way ahead of ourselves.

Most functions work similar enough that focussing on csv based read/write operations should be almost without loss of generality.

import pandas as pd
import glob

Importing data with read_csv

Our github repository comes with a lot of data in it that we will want to use throughout the session. Let’s look at one subdirectory that stores a bunch of data from the BLS:

print(glob.glob("data/bls-employment-state/*.csv")[:10])
['data/bls-employment-state/LAUST720000000000006.csv', 'data/bls-employment-state/LAUST250000000000006.csv', 'data/bls-employment-state/LAUST380000000000004.csv', 'data/bls-employment-state/LAUST510000000000006.csv', 'data/bls-employment-state/LAUST330000000000003.csv', 'data/bls-employment-state/LAUST560000000000005.csv', 'data/bls-employment-state/LAUST300000000000006.csv', 'data/bls-employment-state/LAUST310000000000005.csv', 'data/bls-employment-state/LAUST220000000000003.csv', 'data/bls-employment-state/LAUST340000000000004.csv']

and what does a particular file look like?

!head data/bls-employment-state/LAUST010000000000005.csv
state,year,period,value,footnotes

Alabama,2016,M12,2050965,Reflects revised population controls and model reestimation.

Alabama,2016,M11,2060807,Reflects revised population controls and model reestimation.

Alabama,2016,M10,2057151,Reflects revised population controls and model reestimation.

Alabama,2016,M09,2042696,Reflects revised population controls and model reestimation.

Alabama,2016,M08,2033706,Reflects revised population controls and model reestimation.

Alabama,2016,M07,2040731,Reflects revised population controls and model reestimation.

Alabama,2016,M06,2040095,Reflects revised population controls and model reestimation.

Alabama,2016,M05,2042665,Reflects revised population controls and model reestimation.

Alabama,2016,M04,2040921,Reflects revised population controls and model reestimation.

Let’s read in that file to our session using read_csv

The Simplest Import

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv')
df
state year period value footnotes
0 Alabama 2016 M12 2050965 Reflects revised population controls and model...
1 Alabama 2016 M11 2060807 Reflects revised population controls and model...
2 Alabama 2016 M10 2057151 Reflects revised population controls and model...
3 Alabama 2016 M09 2042696 Reflects revised population controls and model...
4 Alabama 2016 M08 2033706 Reflects revised population controls and model...
... ... ... ... ... ...
199 Alabama 2000 M05 2037787 NaN
200 Alabama 2000 M04 2045171 NaN
201 Alabama 2000 M03 2038654 NaN
202 Alabama 2000 M02 2027226 NaN
203 Alabama 2000 M01 2024262 NaN

204 rows × 5 columns

it worked. What is the type of object we just imported?

type(df)
pandas.core.frame.DataFrame

For the future, we may not want to look at such a big excerpt of data - useful functions are then:

df.head()
state year period value footnotes
0 Alabama 2016 M12 2050965 Reflects revised population controls and model...
1 Alabama 2016 M11 2060807 Reflects revised population controls and model...
2 Alabama 2016 M10 2057151 Reflects revised population controls and model...
3 Alabama 2016 M09 2042696 Reflects revised population controls and model...
4 Alabama 2016 M08 2033706 Reflects revised population controls and model...
df.tail()
state year period value footnotes
199 Alabama 2000 M05 2037787 NaN
200 Alabama 2000 M04 2045171 NaN
201 Alabama 2000 M03 2038654 NaN
202 Alabama 2000 M02 2027226 NaN
203 Alabama 2000 M01 2024262 NaN

Selectively importing columns

Suppose we didn’t want to import the footnotes column. Pandas allows us to specify the columns we want imported:

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols=['state', 'year', 'period', 'value'])
df.head()
state year period value
0 Alabama 2016 M12 2050965
1 Alabama 2016 M11 2060807
2 Alabama 2016 M10 2057151
3 Alabama 2016 M09 2042696
4 Alabama 2016 M08 2033706

Selectively Importing Rows

The default behaviour is to import any row from the file that contains text. This is not always what we want.

For example:

  • sometimes a file contains a header or footer full of unneccesary text

  • the whole data set may not fit into the computer’s memory

  • we only need certain data for our analysis

we can use some of read_csv’s functionaliy to help out here

we can read in the first 20 rows:

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols=['state', 'year', 'period', 'value'],
                nrows=20)
df
state year period value
0 Alabama 2016 M12 2050965
1 Alabama 2016 M11 2060807
2 Alabama 2016 M10 2057151
3 Alabama 2016 M09 2042696
4 Alabama 2016 M08 2033706
5 Alabama 2016 M07 2040731
6 Alabama 2016 M06 2040095
7 Alabama 2016 M05 2042665
8 Alabama 2016 M04 2040921
9 Alabama 2016 M03 2031490
10 Alabama 2016 M02 2017039
11 Alabama 2016 M01 2007036
12 Alabama 2015 M12 2030765
13 Alabama 2015 M11 2035229
14 Alabama 2015 M10 2032455
15 Alabama 2015 M09 2017049
16 Alabama 2015 M08 2015129
17 Alabama 2015 M07 2022263
18 Alabama 2015 M06 2024858
19 Alabama 2015 M05 2029625

we can skip the earliest year:

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols=['state', 'year', 'period', 'value'],
                skipfooter=12)
df.tail()
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/util/_decorators.py:311: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
  return func(*args, **kwargs)
state year period value
187 Alabama 2001 M05 2013679
188 Alabama 2001 M04 2018617
189 Alabama 2001 M03 2021612
190 Alabama 2001 M02 2009904
191 Alabama 2001 M01 2008876

we can skip the most recent year:

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                names=['state', 'year', 'period', 'value'],
                index_col=False,
                skiprows=13)
df.head()
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/util/_decorators.py:311: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
  return func(*args, **kwargs)
state year period value
0 Alabama 2015 M12 2030765
1 Alabama 2015 M11 2035229
2 Alabama 2015 M10 2032455
3 Alabama 2015 M09 2017049
4 Alabama 2015 M08 2015129

Challenge

  1. Read in the file data/bls-employment-state/LAUST010000000000006.csv, as df2

  2. Read in the file data/bls-employment-state/LAUST010000000000003.csv, as df3, but only the years around the financial crisis, 2007-2010.

Setting the Index

Notice how in df, pandas created a new index for us? We may want to set an index ourselves.

Let’s do this, first by importing only the 2016 data for Alabama, and using the month as our index:

df_2016 = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols =['period', 'value'],
                index_col = ['period'],
                nrows=12)
df_2016
value
period
M12 2050965
M11 2060807
M10 2057151
M09 2042696
M08 2033706
M07 2040731
M06 2040095
M05 2042665
M04 2040921
M03 2031490
M02 2017039
M01 2007036

we can also have a ‘multi-index’ structure, where mutliple rows define the index. For example we can import the years 2015 and 2016 data as

df_recent = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols =['year', 'period', 'value'],
                index_col = ['period', 'year'],
                nrows=24)
df_recent
value
period year
M12 2016 2050965
M11 2016 2060807
M10 2016 2057151
M09 2016 2042696
M08 2016 2033706
M07 2016 2040731
M06 2016 2040095
M05 2016 2042665
M04 2016 2040921
M03 2016 2031490
M02 2016 2017039
M01 2016 2007036
M12 2015 2030765
M11 2015 2035229
M10 2015 2032455
M09 2015 2017049
M08 2015 2015129
M07 2015 2022263
M06 2015 2024858
M05 2015 2029625
M04 2015 2025920
M03 2015 2013339
M02 2015 2003076
M01 2015 1998388

Challenge:

Read in the file data/bls-employment-state/LAUST010000000000003.csv, as df3, but only the years around the financial crisis, 2007-2010. Set the index to consist of the state-month-year.

We have now covered the basics for importing data. If you want to know (much!) more, consult the documentation at http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_csv.html#pandas.read_csv. Warning: it’s hard to get the hang of reading the pandas documentation, but over time you become accustomed to it.

Exporting Data

Now let’s turn to exporting data. We created the DataFrame df_2016

df_2016
value
period
M12 2050965
M11 2060807
M10 2057151
M09 2042696
M08 2033706
M07 2040731
M06 2040095
M05 2042665
M04 2040921
M03 2031490
M02 2017039
M01 2007036

Suppose we want to export that data and save it to our project. The function to_csv() will do this for us:

df_2016.to_csv('out_data/alabama_2016.csv')

Let’s verify that pandas did as expected:

!head out_data/alabama_2016.csv
period,value
M12,2050965
M11,2060807
M10,2057151
M09,2042696
M08,2033706
M07,2040731
M06,2040095
M05,2042665
M04,2040921

There are many options for the to_csv() function - many of which I don’t find that useful. One I sometimes use is to change the column separator. Let’s change to ‘&’ separators when we write df_recent to file:

df_recent.to_csv('out_data/alabama_recent.csv', sep='&')
!head out_data/alabama_recent.csv
period&year&value
M12&2016&2050965
M11&2016&2060807
M10&2016&2057151
M09&2016&2042696
M08&2016&2033706
M07&2016&2040731
M06&2016&2040095
M05&2016&2042665
M04&2016&2040921

Challenge

  1. Take your financial crisis data, and write it to file. Use tab separators (‘t’).

  2. Reload your financial crisis data under the name ‘df_financialCrisis’ using read_csv(). What did you have to change to make the data import successful?