Importing (and Exporting) Data¶
Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter]: @lachlandeer
In our early adventures we focussed on the attributes of pandas objects and created small data sets on the fly. This is rarely how empirical work happens - usually we have data stored on our computer or in the cloud that we want to load in, work with, and then potentially save results from.
Pandas has functionality to read and write from/to many different formats. A complete list is available here: http://pandas.pydata.org/pandas-docs/version/0.20/io.html. For economists, we are typically interested in working with:
Csv files: read_csv(), to_csv()
Stata files: read_stata(), to_stata()
Excel files: read_excel(), to_excel()
and possibly working with sql or sas data too.
Hopefully this course has emphasized that flat files, which are not tied to a particular language, are the preferred go-to - so we focus on the csv file functions. Exceptions to this rule are OK if your data is ‘big’, when you might have to look at sql, HDF5, or Google BigQuery structures, but we are getting way ahead of ourselves.
Most functions work similarly enough that focussing on csv-based read/write operations should be almost without loss of generality.
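The read/write pairs look essentially the same across formats. A purely illustrative sketch, assuming pandas is imported as pd (done in the next cell) and using hypothetical file names (the Excel pair also needs an Excel engine such as openpyxl installed):

# the same read_*/to_* pattern holds for each format (hypothetical file names)
df_csv = pd.read_csv('my_data.csv')
df_csv.to_csv('my_data_out.csv')

# Stata
df_stata = pd.read_stata('my_data.dta')
df_stata.to_stata('my_data_out.dta')

# Excel
df_excel = pd.read_excel('my_data.xlsx')
df_excel.to_excel('my_data_out.xlsx')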
import pandas as pd
import glob
Importing data with read_csv¶
Our github repository comes with a lot of data in it that we will want to use throughout the session. Let’s look at one subdirectory that stores a bunch of data from the BLS:
print(glob.glob("data/bls-employment-state/*.csv")[:10])
['data/bls-employment-state/LAUST720000000000006.csv', 'data/bls-employment-state/LAUST250000000000006.csv', 'data/bls-employment-state/LAUST380000000000004.csv', 'data/bls-employment-state/LAUST510000000000006.csv', 'data/bls-employment-state/LAUST330000000000003.csv', 'data/bls-employment-state/LAUST560000000000005.csv', 'data/bls-employment-state/LAUST300000000000006.csv', 'data/bls-employment-state/LAUST310000000000005.csv', 'data/bls-employment-state/LAUST220000000000003.csv', 'data/bls-employment-state/LAUST340000000000004.csv']
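If we ever wanted all of those state files in a single DataFrame, the usual pattern is to loop over the globbed paths and stack the pieces with pd.concat. A minimal sketch (the variable names are our own):

all_files = glob.glob("data/bls-employment-state/*.csv")
pieces = [pd.read_csv(filename) for filename in all_files]  # one DataFrame per state file
combined = pd.concat(pieces, ignore_index=True)             # stack them into a single DataFrame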
What does a particular file look like?
!head data/bls-employment-state/LAUST010000000000005.csv
state,year,period,value,footnotes
Alabama,2016,M12,2050965,Reflects revised population controls and model reestimation.
Alabama,2016,M11,2060807,Reflects revised population controls and model reestimation.
Alabama,2016,M10,2057151,Reflects revised population controls and model reestimation.
Alabama,2016,M09,2042696,Reflects revised population controls and model reestimation.
Alabama,2016,M08,2033706,Reflects revised population controls and model reestimation.
Alabama,2016,M07,2040731,Reflects revised population controls and model reestimation.
Alabama,2016,M06,2040095,Reflects revised population controls and model reestimation.
Alabama,2016,M05,2042665,Reflects revised population controls and model reestimation.
Alabama,2016,M04,2040921,Reflects revised population controls and model reestimation.
Let’s read in that file to our session using read_csv:
The Simplest Import¶
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv')
df
state | year | period | value | footnotes | |
---|---|---|---|---|---|
0 | Alabama | 2016 | M12 | 2050965 | Reflects revised population controls and model... |
1 | Alabama | 2016 | M11 | 2060807 | Reflects revised population controls and model... |
2 | Alabama | 2016 | M10 | 2057151 | Reflects revised population controls and model... |
3 | Alabama | 2016 | M09 | 2042696 | Reflects revised population controls and model... |
4 | Alabama | 2016 | M08 | 2033706 | Reflects revised population controls and model... |
... | ... | ... | ... | ... | ... |
199 | Alabama | 2000 | M05 | 2037787 | NaN |
200 | Alabama | 2000 | M04 | 2045171 | NaN |
201 | Alabama | 2000 | M03 | 2038654 | NaN |
202 | Alabama | 2000 | M02 | 2027226 | NaN |
203 | Alabama | 2000 | M01 | 2024262 | NaN |
204 rows × 5 columns
It worked. What is the type of object we just imported?
type(df)
pandas.core.frame.DataFrame
In the future, we may not want to look at such a large excerpt of the data - useful functions here are:
df.head()
state | year | period | value | footnotes | |
---|---|---|---|---|---|
0 | Alabama | 2016 | M12 | 2050965 | Reflects revised population controls and model... |
1 | Alabama | 2016 | M11 | 2060807 | Reflects revised population controls and model... |
2 | Alabama | 2016 | M10 | 2057151 | Reflects revised population controls and model... |
3 | Alabama | 2016 | M09 | 2042696 | Reflects revised population controls and model... |
4 | Alabama | 2016 | M08 | 2033706 | Reflects revised population controls and model... |
df.tail()
state | year | period | value | footnotes | |
---|---|---|---|---|---|
199 | Alabama | 2000 | M05 | 2037787 | NaN |
200 | Alabama | 2000 | M04 | 2045171 | NaN |
201 | Alabama | 2000 | M03 | 2038654 | NaN |
202 | Alabama | 2000 | M02 | 2027226 | NaN |
203 | Alabama | 2000 | M01 | 2024262 | NaN |
Selectively importing columns¶
Suppose we didn’t want to import the footnotes column. Pandas allows us to specify the columns we want imported:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols=['state', 'year', 'period', 'value'])
df.head()
state | year | period | value | |
---|---|---|---|---|
0 | Alabama | 2016 | M12 | 2050965 |
1 | Alabama | 2016 | M11 | 2060807 |
2 | Alabama | 2016 | M10 | 2057151 |
3 | Alabama | 2016 | M09 | 2042696 |
4 | Alabama | 2016 | M08 | 2033706 |
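While we are controlling which columns come in, read_csv also accepts a dtype argument to control how each column is parsed. A minimal sketch (the type choices here are our own, not something the file requires):

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                 usecols=['state', 'year', 'period', 'value'],
                 dtype={'state': str, 'year': int, 'value': float})  # e.g. parse value as a float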
Selectively Importing Rows¶
The default behaviour is to import every row of the file. This is not always what we want.
For example:
sometimes a file contains a header or footer full of unnecessary text
the whole data set may not fit into the computer’s memory
we only need certain data for our analysis
We can use some of read_csv’s functionality to help out here.
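For the case where the whole file does not fit into memory, read_csv can hand the file back in pieces via the chunksize argument. A minimal sketch (the chunk size is arbitrary):

chunks = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                     chunksize=50)
for chunk in chunks:
    # each chunk is an ordinary DataFrame with (up to) 50 rows
    print(chunk.shape)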
We can read in just the first 20 rows:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols=['state', 'year', 'period', 'value'],
nrows=20)
df
state | year | period | value | |
---|---|---|---|---|
0 | Alabama | 2016 | M12 | 2050965 |
1 | Alabama | 2016 | M11 | 2060807 |
2 | Alabama | 2016 | M10 | 2057151 |
3 | Alabama | 2016 | M09 | 2042696 |
4 | Alabama | 2016 | M08 | 2033706 |
5 | Alabama | 2016 | M07 | 2040731 |
6 | Alabama | 2016 | M06 | 2040095 |
7 | Alabama | 2016 | M05 | 2042665 |
8 | Alabama | 2016 | M04 | 2040921 |
9 | Alabama | 2016 | M03 | 2031490 |
10 | Alabama | 2016 | M02 | 2017039 |
11 | Alabama | 2016 | M01 | 2007036 |
12 | Alabama | 2015 | M12 | 2030765 |
13 | Alabama | 2015 | M11 | 2035229 |
14 | Alabama | 2015 | M10 | 2032455 |
15 | Alabama | 2015 | M09 | 2017049 |
16 | Alabama | 2015 | M08 | 2015129 |
17 | Alabama | 2015 | M07 | 2022263 |
18 | Alabama | 2015 | M06 | 2024858 |
19 | Alabama | 2015 | M05 | 2029625 |
We can skip the earliest year by dropping the last 12 rows with skipfooter:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols=['state', 'year', 'period', 'value'],
skipfooter=12)
df.tail()
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/util/_decorators.py:311: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.
return func(*args, **kwargs)
state | year | period | value | |
---|---|---|---|---|
187 | Alabama | 2001 | M05 | 2013679 |
188 | Alabama | 2001 | M04 | 2018617 |
189 | Alabama | 2001 | M03 | 2021612 |
190 | Alabama | 2001 | M02 | 2009904 |
191 | Alabama | 2001 | M01 | 2008876 |
We can skip the most recent year - skiprows=13 drops the header row plus the 12 months of 2016, so we supply the column names ourselves:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
names=['state', 'year', 'period', 'value'],
index_col=False,
skiprows=13)
df.head()
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/util/_decorators.py:311: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
return func(*args, **kwargs)
state | year | period | value | |
---|---|---|---|---|
0 | Alabama | 2015 | M12 | 2030765 |
1 | Alabama | 2015 | M11 | 2035229 |
2 | Alabama | 2015 | M10 | 2032455 |
3 | Alabama | 2015 | M09 | 2017049 |
4 | Alabama | 2015 | M08 | 2015129 |
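The warning appears because we supplied only four names for a five-column file. An alternative sketch is to give skiprows a range of row numbers, so the header row (row 0) is kept and no names are needed:

df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                 usecols=['state', 'year', 'period', 'value'],
                 skiprows=range(1, 13))  # skip the 12 rows of 2016 data, keep the header
df.head()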
Challenge¶
Read in the file data/bls-employment-state/LAUST010000000000006.csv as df2.
Read in the file data/bls-employment-state/LAUST010000000000003.csv as df3, but only the years around the financial crisis, 2007-2010.
Setting the Index¶
Notice how in df, pandas created a new index for us? We may want to set an index ourselves.
Let’s do this, first by importing only the 2016 data for Alabama, and using the month as our index:
df_2016 = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols =['period', 'value'],
index_col = ['period'],
nrows=12)
df_2016
value | |
---|---|
period | |
M12 | 2050965 |
M11 | 2060807 |
M10 | 2057151 |
M09 | 2042696 |
M08 | 2033706 |
M07 | 2040731 |
M06 | 2040095 |
M05 | 2042665 |
M04 | 2040921 |
M03 | 2031490 |
M02 | 2017039 |
M01 | 2007036 |
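With period set as the index, rows can be looked up directly by month label, for example:

df_2016.loc['M06']  # the June 2016 row (value 2040095)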
We can also have a ‘multi-index’ structure, where multiple columns define the index. For example, we can import the 2015 and 2016 data as:
df_recent = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
usecols =['year', 'period', 'value'],
index_col = ['period', 'year'],
nrows=24)
df_recent
value | ||
---|---|---|
period | year | |
M12 | 2016 | 2050965 |
M11 | 2016 | 2060807 |
M10 | 2016 | 2057151 |
M09 | 2016 | 2042696 |
M08 | 2016 | 2033706 |
M07 | 2016 | 2040731 |
M06 | 2016 | 2040095 |
M05 | 2016 | 2042665 |
M04 | 2016 | 2040921 |
M03 | 2016 | 2031490 |
M02 | 2016 | 2017039 |
M01 | 2016 | 2007036 |
M12 | 2015 | 2030765 |
M11 | 2015 | 2035229 |
M10 | 2015 | 2032455 |
M09 | 2015 | 2017049 |
M08 | 2015 | 2015129 |
M07 | 2015 | 2022263 |
M06 | 2015 | 2024858 |
M05 | 2015 | 2029625 |
M04 | 2015 | 2025920 |
M03 | 2015 | 2013339 |
M02 | 2015 | 2003076 |
M01 | 2015 | 1998388 |
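Selecting from a multi-index works by passing a tuple of labels in index order (period, then year). A minimal sketch:

df_recent.loc[('M12', 2016)]  # the December 2016 row (value 2050965)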
Challenge¶
Read in the file data/bls-employment-state/LAUST010000000000003.csv, as df3, but only the years around the financial crisis, 2007-2010. Set the index to consist of the state-month-year.
We have now covered the basics of importing data. If you want to know (much!) more, consult the documentation at http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_csv.html#pandas.read_csv. Warning: it can be hard to get the hang of reading the pandas documentation, but over time you will become accustomed to it.
Exporting Data¶
Now let’s turn to exporting data. Recall the DataFrame df_2016 we created earlier:
df_2016
value | |
---|---|
period | |
M12 | 2050965 |
M11 | 2060807 |
M10 | 2057151 |
M09 | 2042696 |
M08 | 2033706 |
M07 | 2040731 |
M06 | 2040095 |
M05 | 2042665 |
M04 | 2040921 |
M03 | 2031490 |
M02 | 2017039 |
M01 | 2007036 |
Suppose we want to export that data and save it to our project. The function to_csv() will do this for us:
df_2016.to_csv('out_data/alabama_2016.csv')
Let’s verify that pandas did as expected:
!head out_data/alabama_2016.csv
period,value
M12,2050965
M11,2060807
M10,2057151
M09,2042696
M08,2033706
M07,2040731
M06,2040095
M05,2042665
M04,2040921
There are many options for the to_csv() function - many of which I don’t find that useful. One I sometimes use is changing the column separator. Let’s change to ‘&’ separators when we write df_recent to file:
df_recent.to_csv('out_data/alabama_recent.csv', sep='&')
!head out_data/alabama_recent.csv
period&year&value
M12&2016&2050965
M11&2016&2060807
M10&2016&2057151
M09&2016&2042696
M08&2016&2033706
M07&2016&2040731
M06&2016&2040095
M05&2016&2042665
M04&2016&2040921
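Another to_csv() option that comes up often is index=False, which stops pandas from writing the index out as a column - handy when the index is just the default row numbers, as in df above. A minimal sketch (the output file name is our own):

df.to_csv('out_data/alabama_noindex.csv', index=False)  # write without the integer index column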
Challenge¶
Take your financial crisis data, and write it to file. Use tab separators (‘\t’).
Reload your financial crisis data under the name ‘df_financialCrisis’ using read_csv(). What did you have to change to make the data import successful?