In [2]:
import pandas as pd
import numpy as np

## Pandas Series Object

A Pandas `Series` is a one-dimensional array of indexed data.

In [3]:
unemployment_rates = pd.Series([4.2,4.4,6.3,4.75])
unemployment_rates

0    4.20
1    4.40
2    6.30
3    4.75
dtype: float64

We see that the output is a sequence of indices and a sequence of values. Both are available as attributes:

In [4]:
unemployment_rates.values

array([ 4.2 ,  4.4 ,  6.3 ,  4.75])

In [5]:
unemployment_rates.index

RangeIndex(start=0, stop=4, step=1)

As with `NumPy` arrays, we can access data by indexing and slicing with the familiar square-bracket notation:

In [6]:
unemployment_rates[1]

4.4000000000000004

In [7]:
unemployment_rates[2:4]

2    6.30
3    4.75
dtype: float64

### Setting the Index

By default, pandas sets the index for us. However, we can also set it ourselves:

In [8]:
unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=['California', 'New York', 'Alabama', 'Washington'])
unemployment_rates

California    4.20
New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64

We can then index by label, as expected:

In [9]:
unemployment_rates['Alabama']

6.2999999999999998

or, perhaps unexpectedly, slice by label. Note that a label-based slice includes its endpoint:

In [10]:
unemployment_rates['New York':'Washington']

New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64
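The difference between the two kinds of slice is easier to see when the accessor is written out explicitly. The following is a minimal sketch using the `.loc` (label-based) and `.iloc` (position-based) accessors; the variable names `rates`, `by_label`, and `by_position` are chosen here for illustration only.

```python
import pandas as pd

rates = pd.Series([4.2, 4.4, 6.3, 4.75],
                  index=['California', 'New York', 'Alabama', 'Washington'])

# Label-based slice: both endpoints are included.
by_label = rates.loc['New York':'Washington']

# Position-based slice: the stop position is excluded, as in NumPy.
by_position = rates.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```

Writing `.loc` or `.iloc` makes the intent explicit, which helps when it is not obvious whether a slice refers to labels or positions.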

We are not limited to consecutive integers or strings as index labels:

In [11]:
unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=[1983, 1985, 1990, 2016])
unemployment_rates

1983    4.20
1985    4.40
1990    6.30
2016    4.75
dtype: float64

In [12]:
unemployment_rates[2016]

4.75
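With integer labels, a lookup such as `unemployment_rates[2016]` refers to the label 2016, not to a position. As a small sketch (the name `rates_by_year` is illustrative), the explicit accessors remove the ambiguity:

```python
import pandas as pd

rates_by_year = pd.Series([4.2, 4.4, 6.3, 4.75],
                          index=[1983, 1985, 1990, 2016])

by_label = rates_by_year.loc[2016]    # look up the label 2016
by_position = rates_by_year.iloc[3]   # look up the fourth entry

# A label-based slice over years; both endpoints are included.
late_eighties = rates_by_year.loc[1985:1990]

print(by_label, by_position)
print(list(late_eighties.index))
```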

## The Pandas Data Frame

A `DataFrame` is a two-dimensional structure that holds multiple columns. It can be thought of as a sequence of aligned arrays, or aligned pandas `Series`, where "aligned" means that they share the same index.

In [14]:
unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=['California', 'New York', 'Alabama', 'Washington'])
unemployment_rates

California    4.20
New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64

In [15]:
participation_rates = pd.Series([70.5,68.7,62.3,64.0],
                              index=['California', 'New York', 'Alabama', 'Washington'])
participation_rates

California    70.5
New York      68.7
Alabama       62.3
Washington    64.0
dtype: float64

In [19]:
state_employment = pd.DataFrame({'unemployment_rate': unemployment_rates,
                                    'participation_rates': participation_rates})
state_employment

            participation_rates  unemployment_rate
California                 70.5               4.20
New York                   68.7               4.40
Alabama                    62.3               6.30
Washington                 64.0               4.75


The `DataFrame` has `index` and `columns` attributes:

In [20]:
state_employment.index

Index(['California', 'New York', 'Alabama', 'Washington'], dtype='object')

In [21]:
state_employment.columns

Index(['participation_rates', 'unemployment_rate'], dtype='object')

Notice that a `DataFrame` copes when some index labels do not match:

In [23]:
participation_rates = pd.Series([70.5,68.7,62.3,64.0],
                              index=['California', 'New York', 'Alabama', 'Nebraska'])
state_employment = pd.DataFrame({'unemployment_rate': unemployment_rates,
                                    'participation_rates': participation_rates})
state_employment

            participation_rates  unemployment_rate
Alabama                    62.3               6.30
California                 70.5               4.20
Nebraska                   64.0                NaN
New York                   68.7               4.40
Washington                  NaN               4.75


It does so by filling the gaps with `NaN` (`Not a Number`), the marker for missing values. We will return to missing data later on.
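As a brief preview, here is a sketch of how the missing entries could be located and dropped. It rebuilds the two mismatched series under the illustrative names `unemployment` and `participation`, and uses the `isna` and `dropna` methods.

```python
import pandas as pd

unemployment = pd.Series([4.2, 4.4, 6.3, 4.75],
                         index=['California', 'New York', 'Alabama', 'Washington'])
participation = pd.Series([70.5, 68.7, 62.3, 64.0],
                          index=['California', 'New York', 'Alabama', 'Nebraska'])

frame = pd.DataFrame({'unemployment_rate': unemployment,
                      'participation_rates': participation})

# Boolean mask marking the entries that alignment left empty.
missing = frame.isna()

# Keep only the states present in both series.
complete = frame.dropna()

print(missing.sum())          # number of missing values per column
print(sorted(complete.index))
```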

### Making DataFrames from other objects

Pandas DataFrames don't have to come from collecting a bunch of `Series` together. You can assemble them from many different objects.

The most useful for us is probably converting NumPy arrays into DataFrames and vice versa:

In [24]:
df = pd.DataFrame(np.random.rand(3, 2),
                 columns=['col1', 'col2'],
                 index=['row1', 'row2', 'row3'])
df



          col1      col2
row1  0.190420  0.940049
row2  0.487376  0.636101
row3  0.482621  0.748920


In [26]:
array = np.array(df)
array

array([[ 0.19041979,  0.94004879],
       [ 0.48737555,  0.63610075],
       [ 0.48262066,  0.74891969]])

In [28]:
type(array)

numpy.ndarray
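The round trip can also be written with the `DataFrame.to_numpy` method, available in recent pandas versions. The sketch below uses fixed values instead of random ones so that the result is reproducible; the names `values` and `back` are illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# NumPy array -> DataFrame: row and column labels are attached.
df = pd.DataFrame(values,
                  columns=['col1', 'col2'],
                  index=['row1', 'row2', 'row3'])

# DataFrame -> NumPy array: the labels are dropped, only the values remain.
back = df.to_numpy()

print(type(back))
print(np.array_equal(back, values))
```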

## A little more on the Pandas Index Object

The Pandas `Index` object has an interesting structure of its own and is worth understanding. It can be thought of either as an immutable array or as an ordered set, which has some interesting consequences.

In [29]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

It can be indexed and sliced:

In [30]:
ind[1]

3

In [31]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

and it has attributes familiar from NumPy arrays:

In [32]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


but, unlike a NumPy array, an `Index` is immutable:

In [33]:
ind[1] = 70

TypeError: Index does not support mutable operations

Immutability is a desirable property when an index is shared across multiple DataFrames, since the index cannot be changed by accident through one of them.
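In a script, the failed assignment can be caught as a `TypeError`, and a changed index is obtained by building a new object. This is a minimal sketch; the names `assignment_failed` and `new_ind` are illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 70              # in-place assignment is not supported
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print('assignment rejected:', err)

# Build a new Index with the desired values; the original is untouched.
new_ind = pd.Index([70 if value == 3 else value for value in ind])

print(list(ind))
print(list(new_ind))
```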

Indexes are also ordered sets:

In [34]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [35]:
indA.intersection(indB)  # intersection (written as indA & indB in older pandas)

Int64Index([3, 5, 7], dtype='int64')

In [36]:
indA.union(indB)  # union (written as indA | indB in older pandas)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [37]:
indA.symmetric_difference(indB)  # symmetric difference (indA ^ indB in older pandas)

Int64Index([1, 2, 9, 11], dtype='int64')

These are important concepts when thinking about joins across multiple data sets.
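To make the connection to joins concrete, the sketch below joins two small frames that carry the same indices as `indA` and `indB`. The frames `left` and `right` and their column values are made up for illustration. An inner join keeps the intersection of the two indices, and an outer join keeps their union.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3, 4, 5]}, index=[1, 3, 5, 7, 9])
right = pd.DataFrame({'b': [10, 20, 30, 40, 50]}, index=[2, 3, 5, 7, 11])

# Inner join: rows whose labels appear in both indices.
inner = left.join(right, how='inner')

# Outer join: rows whose labels appear in either index; gaps become NaN.
outer = left.join(right, how='outer')

print(sorted(inner.index))
print(sorted(outer.index))
```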