Data Manipulation and Exploratory Analysis with Pandas

  • Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

Motivation

In yesterday’s sessions on NumPy and SciPy we learned how to use Python for scientific computing when the object of interest is a matrix. This is how someone who comes from (an outdated) MATLAB training will think of pretty much everything, and it in fact proves helpful for a lot of the analysis we do as economists.

However, if we are working with data sets, as we would in Stata or R, it would be nice to work with a similar object in Python. The package pandas gives us that option: it brings with it objects called Series, which store an individual column of data, and DataFrames, which store multiple columns. These objects build on NumPy’s array structure and work well for the ‘data wrangling’ tasks that empirical work typically entails.
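As a minimal sketch of the two objects just described, the cell below builds a Series and a DataFrame by hand. The column names and numbers are made up purely for illustration; they do not come from any data set used in this course.

```python
import numpy as np
import pandas as pd

# A Series stores a single column of data (hypothetical values)
gdp_growth = pd.Series([1.2, 0.8, 2.1], name="gdp_growth")

# A DataFrame stores several columns; each column is itself a Series
df = pd.DataFrame({
    "country": ["AUS", "CHE", "USA"],
    "gdp_growth": [1.2, 0.8, 2.1],
})

# Selecting one column returns a Series
print(type(df["gdp_growth"]))

# The values of that column can be retrieved as a NumPy array
print(type(df["gdp_growth"].to_numpy()))
```

Selecting a column with square brackets is the closest analogue to referring to a variable by name in Stata or R.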

Pandas also brings with it many important features for working with data. For example, it deals well with missing data and provides pivot tables and aggregation functions.
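To make these features a little more concrete, here is a small sketch on a toy DataFrame. The data are invented for illustration, and the calls shown (isna, groupby, pivot_table) are only one of several ways to do each task.

```python
import numpy as np
import pandas as pd

# Toy data with one missing sales figure (hypothetical values)
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10.0, np.nan, 7.0, 9.0],
})

# Missing data: count the missing entries in a column
n_missing = df["sales"].isna().sum()

# Aggregation: mean sales by region; missing values are skipped by default
mean_by_region = df.groupby("region")["sales"].mean()

# Pivot table: regions as rows, years as columns, mean sales in the cells
table = df.pivot_table(index="region", columns="year", values="sales", aggfunc="mean")

print(n_missing)
print(mean_by_region)
print(table)
```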

Let’s begin our adventures with pandas…

Importing Pandas

import pandas

or, more typically in the Python world,

import pandas as pd
pd.__version__
'1.3.3'

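Pandas Documentation inside Jupyter notebooks

To display pandas’ built-in documentation, append a question mark to a name. Here functionName is a placeholder rather than a real pandas function, so the lookup below reports that the object is not found: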

pd.functionName?
Object `pd.functionName` not found.
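The question-mark syntax only works in IPython and Jupyter. As a rough sketch of an equivalent that works in any Python session, the docstring can be read directly or printed with the built-in help(). The choice of pd.read_csv here is just an example of a real pandas function.

```python
import pandas as pd

# The docstring behind the documentation that `pd.read_csv?` displays
doc = pd.read_csv.__doc__

# Print only the first few hundred characters to keep the output short
print(doc[:300])

# In a plain Python session, help() prints the full documentation:
# help(pd.read_csv)
```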

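We also get tab completion on the contents of the pandas namespace: type pd. and press the Tab key. The <TAB> below stands for that key press and is not literal code, so running the cell as written raises a SyntaxError: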

pd.<TAB>
  File "/tmp/ipykernel_2759/2747507604.py", line 1
    pd.<TAB>
       ^
SyntaxError: invalid syntax
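Tab completion is interactive, so it cannot be reproduced in a cell that is simply run. A non-interactive approximation, sketched below, is to list the names in the pandas namespace with the built-in dir(). Filtering on the read_ prefix is only an illustrative choice.

```python
import pandas as pd

# Public names available under pd (roughly what tab completion offers)
public_names = [name for name in dir(pd) if not name.startswith("_")]

# Narrow the list down, e.g. to the functions that read data from files
reader_names = [name for name in public_names if name.startswith("read_")]

print(reader_names)
```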