import pandas as pd
import numpy as np

Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data.

unemployment_rates = pd.Series([4.2,4.4,6.3,4.75])
unemployment_rates
0    4.20
1    4.40
2    6.30
3    4.75
dtype: float64

We see that the output is a sequence of indices and a sequence of values, which can be accessed separately:

unemployment_rates.values
array([4.2 , 4.4 , 6.3 , 4.75])
unemployment_rates.index
RangeIndex(start=0, stop=4, step=1)

As with NumPy, we can access data by indexing and slicing with the familiar square-bracket notation:

unemployment_rates[1]
4.4
unemployment_rates[2:4]
2    6.30
3    4.75
dtype: float64

Setting the Index

By default, the index is set for us by pandas. However, we can set it ourselves:

unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=['California', 'New York', 'Alabama', 'Washington'])
unemployment_rates
California    4.20
New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64

and we can index as expected:

unemployment_rates['Alabama']
6.3

or, perhaps unexpectedly, slice by label, where the endpoint is included:

unemployment_rates['New York':'Washington']
New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64
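
Pandas also provides the explicit indexers .loc and .iloc to avoid ambiguity. As a minimal sketch, reusing the unemployment_rates Series above: .loc slices by label and includes the endpoint, while .iloc slices by integer position and, like NumPy, excludes it.

unemployment_rates.loc['New York':'Washington']  # label-based slice, endpoint included
unemployment_rates.iloc[1:3]                     # position-based slice, endpoint excluded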

We are not limited to indexing by consecutive integers or strings; any values can serve as the index:

unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=[1983, 1985, 1990, 2016])
unemployment_rates
1983    4.20
1985    4.40
1990    6.30
2016    4.75
dtype: float64
unemployment_rates[2016]
4.75
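
With integer labels it is easy to confuse position and label: the square brackets above look up the label 2016, not the 2016th position. A small sketch with the explicit indexers, using the same Series:

unemployment_rates.iloc[0]    # positional access: the first element, 4.2
unemployment_rates.loc[1983]  # label access: the entry labelled 1983, also 4.2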

The Pandas Data Frame

A DataFrame is a two-dimensional array with multiple columns. It can be thought of as a sequence of aligned arrays, or of aligned pandas Series, where “aligned” means they share the same index.

unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=['California', 'New York', 'Alabama', 'Washington'])
unemployment_rates
California    4.20
New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64
participation_rates = pd.Series([70.5,68.7,62.3,64.0],
                              index=['California', 'New York', 'Alabama', 'Washington'])
participation_rates
California    70.5
New York      68.7
Alabama       62.3
Washington    64.0
dtype: float64
state_employment = pd.DataFrame({'unemployment_rate': unemployment_rates,
                                    'participation_rates': participation_rates})
state_employment
unemployment_rate participation_rates
California 4.20 70.5
New York 4.40 68.7
Alabama 6.30 62.3
Washington 4.75 64.0

The DataFrame has index and columns attributes:

state_employment.index
Index(['California', 'New York', 'Alabama', 'Washington'], dtype='object')
state_employment.columns
Index(['unemployment_rate', 'participation_rates'], dtype='object')
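
Since a DataFrame behaves like a dictionary of aligned Series, indexing with a column name returns that column as a Series. A small illustration with the state_employment frame above:

state_employment['unemployment_rate']        # returns the unemployment_rate column as a Series
type(state_employment['unemployment_rate'])  # pandas.core.series.Series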

Notice that DataFrames cope gracefully when some indices do not match:

participation_rates = pd.Series([70.5,68.7,62.3,64.0],
                              index=['California', 'New York', 'Alabama', 'Nebraska'])
state_employment = pd.DataFrame({'unemployment_rate': unemployment_rates,
                                    'participation_rates': participation_rates})
state_employment
unemployment_rate participation_rates
Alabama 6.30 62.3
California 4.20 70.5
Nebraska NaN 64.0
New York 4.40 68.7
Washington 4.75 NaN

by using NaN, i.e. Not a Number, to mark missing values. We will return to missing data later on.
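
As a brief preview, pandas has methods for detecting and handling these missing entries; a minimal sketch using the state_employment frame above:

state_employment.isnull()     # boolean mask marking the NaN entries
state_employment.dropna()     # drop any row containing a NaN
state_employment.fillna(0.0)  # replace NaNs with a chosen value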

Making DataFrames from other objects

Pandas DataFrames don’t have to come from collecting a bunch of Series together. You can assemble them from many different objects.

The most useful for us is probably converting NumPy arrays into DataFrames, and vice versa:

df = pd.DataFrame(np.random.rand(3, 2),
                 columns=['col1', 'col2'],
                 index=['row1', 'row2', 'row3'])
df
col1 col2
row1 0.138909 0.420695
row2 0.609878 0.194575
row3 0.034022 0.883869
array = np.array(df)
array
array([[0.13890912, 0.42069488],
       [0.60987771, 0.19457507],
       [0.03402185, 0.88386909]])
type(array)
numpy.ndarray
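
Other constructions work as well, for example from a dict of lists or a list of dicts. A small sketch with made-up column names:

pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})  # dict of lists: keys become column names
pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])    # list of dicts: each dict becomes a row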

A little more on the Pandas Index Object

The Pandas Index object has interesting structure in itself and is worth understanding: it can be thought of as an immutable array, or as an ordered set. This leads to interesting consequences.

ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')

It can be indexed and sliced:

ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')

and it has attributes familiar from NumPy arrays:

print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64

but, unlike NumPy arrays, Index objects are immutable:

ind[1] = 70
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_2778/770004720.py in <module>
----> 1 ind[1] = 70

/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   4583     @final
   4584     def __setitem__(self, key, value):
-> 4585         raise TypeError("Index does not support mutable operations")
   4586 
   4587     def __getitem__(self, key):

TypeError: Index does not support mutable operations

Being immutable is a desirable property when indices are shared across multiple DataFrames: no operation on one DataFrame can silently alter the index of another.
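
If a different index is genuinely needed, the whole Index object is replaced rather than modified in place. A minimal sketch reusing the df defined earlier, with hypothetical new labels:

df.index = pd.Index(['a', 'b', 'c'])  # assign a brand-new Index; the old one is untouched
df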

Indexes are also ordered sets:

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB  # intersection
Int64Index([3, 5, 7], dtype='int64')
indA | indB  # union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA ^ indB  # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')

These are important concepts when thinking about joins across multiple data sets.
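
Note that newer pandas releases deprecate the set-operation behaviour of &, | and ^ on Index objects; the explicit methods are the safer choice. A minimal sketch of the equivalent calls:

indA.intersection(indB)          # same result as indA & indB above
indA.union(indB)                 # same result as indA | indB
indA.symmetric_difference(indB)  # same result as indA ^ indB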