Importing (and Exporting) Data¶
Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter]: @lachlandeer
In our early adventures we focussed on the attributes of pandas objects and created small data sets on the fly. This is rarely how empirical work happens - usually we have data stored on our computer or in the cloud that we want to load in, work with, and then potentially save results from.
Pandas has functionality to read and write from/to many different formats. A complete list is available here: http://pandas.pydata.org/pandas-docs/version/0.20/io.html. For economists, we are typically interested in working with:
Csv files: read_csv(), to_csv()
Stata files: read_stata(), to_stata()
Excel files: read_excel(), to_excel()
and possibly working with sql or sas data too.
Hopefully this course has emphasized that flat files, which are not tied to a particular language, are the preferred go-to - so we focus on the csv file functions. Exceptions to this rule are OK if your data is ‘big’, when you might have to look at sql, HDF5, or Google BigQuery structures, but we are getting way ahead of ourselves.
Most functions work similarly enough that focussing on csv-based read/write operations should be almost without loss of generality.
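The read/write pairs look essentially the same across formats. A purely illustrative sketch, assuming pandas is imported as pd (done in the next cell) and using hypothetical file names (the Excel pair also needs an Excel engine such as openpyxl installed):

# the same read_*/to_* pattern holds for each format (hypothetical file names)
df_csv = pd.read_csv('my_data.csv')
df_csv.to_csv('my_data_out.csv')

# Stata
df_stata = pd.read_stata('my_data.dta')
df_stata.to_stata('my_data_out.dta')

# Excel
df_excel = pd.read_excel('my_data.xlsx')
df_excel.to_excel('my_data_out.xlsx')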
import pandas as pd
import glob
Importing data with read_csv¶
Our github repository comes with a lot of data in it that we will want to use throughout the session. Let’s look at one subdirectory that stores a bunch of data from the BLS:
print(glob.glob("data/bls-employment-state/*.csv")[:10])
['data/bls-employment-state/LAUST720000000000006.csv', 'data/bls-employment-state/LAUST250000000000006.csv', 'data/bls-employment-state/LAUST380000000000004.csv', 'data/bls-employment-state/LAUST510000000000006.csv', 'data/bls-employment-state/LAUST330000000000003.csv', 'data/bls-employment-state/LAUST560000000000005.csv', 'data/bls-employment-state/LAUST300000000000006.csv', 'data/bls-employment-state/LAUST310000000000005.csv', 'data/bls-employment-state/LAUST220000000000003.csv', 'data/bls-employment-state/LAUST340000000000004.csv']
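If we ever wanted all of those state files in a single DataFrame, the usual pattern is to loop over the globbed paths and stack the pieces with pd.concat. A minimal sketch (the variable names are our own):

all_files = glob.glob("data/bls-employment-state/*.csv")
pieces = [pd.read_csv(filename) for filename in all_files]  # one DataFrame per state file
combined = pd.concat(pieces, ignore_index=True)             # stack them into a single DataFrame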
What does a particular file look like?
!head data/bls-employment-state/LAUST010000000000005.csv
state,year,period,value,footnotes
Alabama,2016,M12,2050965,Reflects revised population controls and model reestimation.
Alabama,2016,M11,2060807,Reflects revised population controls and model reestimation.
Alabama,2016,M10,2057151,Reflects revised population controls and model reestimation.
Alabama,2016,M09,2042696,Reflects revised population controls and model reestimation.
Alabama,2016,M08,2033706,Reflects revised population controls and model reestimation.
Alabama,2016,M07,2040731,Reflects revised population controls and model reestimation.
Alabama,2016,M06,2040095,Reflects revised population controls and model reestimation.
Alabama,2016,M05,2042665,Reflects revised population controls and model reestimation.
Alabama,2016,M04,2040921,Reflects revised population controls and model reestimation.
Let’s read in that file to our session using read_csv:
The Simplest Import¶
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv')
df
state | year | period | value | footnotes | |
---|---|---|---|---|---|
0 | Alabama | 2016 | M12 | 2050965 | Reflects revised population controls and model... |
1 | Alabama | 2016 | M11 | 2060807 | Reflects revised population controls and model... |
2 | Alabama | 2016 | M10 | 2057151 | Reflects revised population controls and model... |
3 | Alabama | 2016 | M09 | 2042696 | Reflects revised population controls and model... |
4 | Alabama | 2016 | M08 | 2033706 | Reflects revised population controls and model... |
... | ... | ... | ... | ... | ... |
199 | Alabama | 2000 | M05 | 2037787 | NaN |
200 | Alabama | 2000 | M04 | 2045171 | NaN |
201 | Alabama | 2000 | M03 | 2038654 | NaN |
202 | Alabama | 2000 | M02 | 2027226 | NaN |
203 | Alabama | 2000 | M01 | 2024262 | NaN |
204 rows × 5 columns
It worked. What is the type of object we just imported?
type(df)
pandas.core.frame.DataFrame
In the future, we may not want to look at such a large excerpt of the data - useful functions here are:
df.head()
state | year | period | value | footnotes | |
---|---|---|---|---|---|
0 | Alabama | 2016 | M12 | 2050965 | Reflects revised population controls and model... |
1 | Alabama | 2016 | M11 | 2060807 | Reflects revised population controls and model... |
2 | Alabama | 2016 | M10 | 2057151 | Reflects revised population controls and model... |
3 | Alabama | 2016 | M09 | 2042696 | Reflects revised population controls and model... |
4 | Alabama | 2016 | M08 | 2033706 | Reflects revised population controls and model... |
df.tail()
state | year | period | value | footnotes | |
---|---|---|---|---|---|
199 | Alabama | 2000 | M05 | 2037787 | NaN |
200 | Alabama | 2000 | M04 | 2045171 | NaN |
201 | Alabama | 2000 | M03 | 2038654 | NaN |
202 | Alabama | 2000 | M02 | 2027226 | NaN |
203 | Alabama | 2000 | M01 | 2024262 | NaN |
Selectively importing columns¶
Suppose we didn’t want to import the footnotes column. Pandas allows us to specify the columns we want imported:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols=['state', 'year', 'period', 'value'])
df.head()
state | year | period | value | |
---|---|---|---|---|
0 | Alabama | 2016 | M12 | 2050965 |
1 | Alabama | 2016 | M11 | 2060807 |
2 | Alabama | 2016 | M10 | 2057151 |
3 | Alabama | 2016 | M09 | 2042696 |
4 | Alabama | 2016 | M08 | 2033706 |
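While we are controlling which columns come in, read_csv also accepts a dtype argument to control how each column is parsed. A minimal sketch (the type choices here are our own, not something the file requires):

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                 usecols=['state', 'year', 'period', 'value'],
                 dtype={'state': str, 'year': int, 'value': float})  # e.g. parse value as a float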
Selectively Importing Rows¶
The default behaviour is to import every row of the file. This is not always what we want.
For example:
sometimes a file contains a header or footer full of unnecessary text
the whole data set may not fit into the computer’s memory
we only need certain data for our analysis
We can use some of read_csv’s functionality to help out here.
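For the case where the whole file does not fit into memory, read_csv can hand the file back in pieces via the chunksize argument. A minimal sketch (the chunk size is arbitrary):

chunks = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                     chunksize=50)
for chunk in chunks:
    # each chunk is an ordinary DataFrame with (up to) 50 rows
    print(chunk.shape)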
We can read in just the first 20 rows:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols=['state', 'year', 'period', 'value'],
nrows=20)
df
state | year | period | value | |
---|---|---|---|---|
0 | Alabama | 2016 | M12 | 2050965 |
1 | Alabama | 2016 | M11 | 2060807 |
2 | Alabama | 2016 | M10 | 2057151 |
3 | Alabama | 2016 | M09 | 2042696 |
4 | Alabama | 2016 | M08 | 2033706 |
5 | Alabama | 2016 | M07 | 2040731 |
6 | Alabama | 2016 | M06 | 2040095 |
7 | Alabama | 2016 | M05 | 2042665 |
8 | Alabama | 2016 | M04 | 2040921 |
9 | Alabama | 2016 | M03 | 2031490 |
10 | Alabama | 2016 | M02 | 2017039 |
11 | Alabama | 2016 | M01 | 2007036 |
12 | Alabama | 2015 | M12 | 2030765 |
13 | Alabama | 2015 | M11 | 2035229 |
14 | Alabama | 2015 | M10 | 2032455 |
15 | Alabama | 2015 | M09 | 2017049 |
16 | Alabama | 2015 | M08 | 2015129 |
17 | Alabama | 2015 | M07 | 2022263 |
18 | Alabama | 2015 | M06 | 2024858 |
19 | Alabama | 2015 | M05 | 2029625 |
We can skip the earliest year by dropping the last 12 rows with skipfooter:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols=['state', 'year', 'period', 'value'],
skipfooter=12)
df.tail()
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/util/_decorators.py:311: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
return func(*args, **kwargs)
state | year | period | value | |
---|---|---|---|---|
187 | Alabama | 2001 | M05 | 2013679 |
188 | Alabama | 2001 | M04 | 2018617 |
189 | Alabama | 2001 | M03 | 2021612 |
190 | Alabama | 2001 | M02 | 2009904 |
191 | Alabama | 2001 | M01 | 2008876 |
We can skip the most recent year - skiprows=13 drops the header row plus the 12 months of 2016, so we supply the column names ourselves:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
names=['state', 'year', 'period', 'value'],
index_col=False,
skiprows=13)
df.head()
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/util/_decorators.py:311: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
return func(*args, **kwargs)
state | year | period | value | |
---|---|---|---|---|
0 | Alabama | 2015 | M12 | 2030765 |
1 | Alabama | 2015 | M11 | 2035229 |
2 | Alabama | 2015 | M10 | 2032455 |
3 | Alabama | 2015 | M09 | 2017049 |
4 | Alabama | 2015 | M08 | 2015129 |
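The warning appears because we supplied only four names for a five-column file. An alternative sketch is to give skiprows a range of row numbers, so the header row (row 0) is kept and no names are needed:

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                 usecols=['state', 'year', 'period', 'value'],
                 skiprows=range(1, 13))  # skip the 12 rows of 2016 data, keep the header
df.head()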
Challenge¶
Read in the file data/bls-employment-state/LAUST010000000000006.csv as df2.
Read in the file data/bls-employment-state/LAUST010000000000003.csv as df3, but only the years around the financial crisis, 2007-2010.
Setting the Index¶
Notice how in df, pandas created a new index for us? We may want to set an index ourselves.
Let’s do this, first by importing only the 2016 data for Alabama, and using the month as our index:
df_2016 = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols =['period', 'value'],
index_col = ['period'],
nrows=12)
df_2016
value | |
---|---|
period | |
M12 | 2050965 |
M11 | 2060807 |
M10 | 2057151 |
M09 | 2042696 |
M08 | 2033706 |
M07 | 2040731 |
M06 | 2040095 |
M05 | 2042665 |
M04 | 2040921 |
M03 | 2031490 |
M02 | 2017039 |
M01 | 2007036 |
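With period set as the index, rows can be looked up directly by month label, for example:

df_2016.loc['M06']  # the June 2016 row (value 2040095)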
We can also have a ‘multi-index’ structure, where multiple columns define the index. For example, we can import the 2015 and 2016 data as:
df_recent = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols =['year', 'period', 'value'],
index_col = ['period', 'year'],
nrows=24)
df_recent
value | ||
---|---|---|
period | year | |
M12 | 2016 | 2050965 |
M11 | 2016 | 2060807 |
M10 | 2016 | 2057151 |
M09 | 2016 | 2042696 |
M08 | 2016 | 2033706 |
M07 | 2016 | 2040731 |
M06 | 2016 | 2040095 |
M05 | 2016 | 2042665 |
M04 | 2016 | 2040921 |
M03 | 2016 | 2031490 |
M02 | 2016 | 2017039 |
M01 | 2016 | 2007036 |
M12 | 2015 | 2030765 |
M11 | 2015 | 2035229 |
M10 | 2015 | 2032455 |
M09 | 2015 | 2017049 |
M08 | 2015 | 2015129 |
M07 | 2015 | 2022263 |
M06 | 2015 | 2024858 |
M05 | 2015 | 2029625 |
M04 | 2015 | 2025920 |
M03 | 2015 | 2013339 |
M02 | 2015 | 2003076 |
M01 | 2015 | 1998388 |
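Selecting from a multi-index works by passing a tuple of labels in index order (period, then year). A minimal sketch:

df_recent.loc[('M12', 2016)]  # the December 2016 row (value 2050965)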
Challenge¶
Read in the file data/bls-employment-state/LAUST010000000000003.csv, as df3, but only the years around the financial crisis, 2007-2010. Set the index to consist of the state-month-year.
We have now covered the basics of importing data. If you want to know (much!) more, consult the documentation at http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_csv.html#pandas.read_csv. Warning: it can be hard to get the hang of reading the pandas documentation, but over time you will become accustomed to it.
Exporting Data¶
Now let’s turn to exporting data. Recall the DataFrame df_2016 we created earlier:
df_2016
value | |
---|---|
period | |
M12 | 2050965 |
M11 | 2060807 |
M10 | 2057151 |
M09 | 2042696 |
M08 | 2033706 |
M07 | 2040731 |
M06 | 2040095 |
M05 | 2042665 |
M04 | 2040921 |
M03 | 2031490 |
M02 | 2017039 |
M01 | 2007036 |
Suppose we want to export that data and save it to our project. The function to_csv() will do this for us:
df_2016.to_csv('out_data/alabama_2016.csv')
Let’s verify that pandas did as expected:
!head out_data/alabama_2016.csv
period,value
M12,2050965
M11,2060807
M10,2057151
M09,2042696
M08,2033706
M07,2040731
M06,2040095
M05,2042665
M04,2040921
There are many options for the to_csv() function - many of which I don’t find that useful. One I sometimes use is changing the column separator. Let’s change to ‘&’ separators when we write df_recent to file:
df_recent.to_csv('out_data/alabama_recent.csv', sep='&')
!head out_data/alabama_recent.csv
period&year&value
M12&2016&2050965
M11&2016&2060807
M10&2016&2057151
M09&2016&2042696
M08&2016&2033706
M07&2016&2040731
M06&2016&2040095
M05&2016&2042665
M04&2016&2040921
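Another to_csv() option that comes up often is index=False, which stops pandas from writing the index out as a column - handy when the index is just the default row numbers, as in df above. A minimal sketch (the output file name is our own):

df.to_csv('out_data/alabama_noindex.csv', index=False)  # write without the integer index column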
Challenge¶
Take your financial crisis data, and write it to file. Use tab separators (‘\t’).
Reload your financial crisis data under the name ‘df_financialCrisis’ using read_csv(). What did you have to change to make the data import successful?