import pandas as pd
import numpy as np
Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data.
unemployment_rates = pd.Series([4.2,4.4,6.3,4.75])
unemployment_rates
0 4.20
1 4.40
2 6.30
3 4.75
dtype: float64
We see that the output pairs a sequence of indices with a sequence of values. Each can be accessed separately:
unemployment_rates.values
array([4.2 , 4.4 , 6.3 , 4.75])
unemployment_rates.index
RangeIndex(start=0, stop=4, step=1)
As with NumPy, we can access data by indexing and slicing with the familiar square-bracket notation:
unemployment_rates[1]
4.4
unemployment_rates[2:4]
2 6.30
3 4.75
dtype: float64
Setting the Index
By default, the index is set for us by pandas. However, we can set it ourselves:
unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
index=['California', 'New York', 'Alabama', 'Washington'])
unemployment_rates
California 4.20
New York 4.40
Alabama 6.30
Washington 4.75
dtype: float64
We can then index by label, as expected:
unemployment_rates['Alabama']
6.3
or, perhaps unexpectedly, slice by label. Unlike a positional slice, a label-based slice includes its end point:
unemployment_rates['New York':'Washington']
New York 4.40
Alabama 6.30
Washington 4.75
dtype: float64
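The difference between label-based and position-based slicing can be made explicit with the `.loc` and `.iloc` accessors. The following is a minimal sketch; the variable names `rates`, `by_label`, and `by_position` are illustrative and not part of the original notebook.

```python
import pandas as pd

rates = pd.Series([4.2, 4.4, 6.3, 4.75],
                  index=['California', 'New York', 'Alabama', 'Washington'])

# Label-based slice: both end points are included.
by_label = rates.loc['New York':'Washington']

# Position-based slice: the stop position is excluded, as in NumPy.
by_position = rates.iloc[1:3]

print(by_label.index.tolist())     # ['New York', 'Alabama', 'Washington']
print(by_position.index.tolist())  # ['New York', 'Alabama']
```

Writing `.loc` or `.iloc` explicitly tends to make the intent clearer to the reader than a bare square bracket.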
The index is not limited to consecutive integers or strings. Any set of labels will do, such as years:
unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
index=[1983, 1985, 1990, 2016])
unemployment_rates
1983 4.20
1985 4.40
1990 6.30
2016 4.75
dtype: float64
unemployment_rates[2016]
4.75
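With an integer index, it is easy to confuse a label with a position. The sketch below, which uses the illustrative names `rates`, `by_label`, and `by_position`, shows one way to keep the two apart: `.loc` always looks up labels and `.iloc` always looks up positions.

```python
import pandas as pd

rates = pd.Series([4.2, 4.4, 6.3, 4.75], index=[1983, 1985, 1990, 2016])

by_label = rates.loc[2016]    # the entry labelled 2016
by_position = rates.iloc[3]   # the fourth entry, whatever its label

print(by_label, by_position)  # 4.75 4.75
print(2016 in rates.index)    # True
print(3 in rates.index)       # False: 3 is a position here, not a label
```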
The Pandas DataFrame
A DataFrame is a two-dimensional array that allows multiple columns. It can be thought of as a sequence of aligned arrays, or aligned pandas Series. "Aligned" here means that they share the same index.
unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
index=['California', 'New York', 'Alabama', 'Washington'])
unemployment_rates
California 4.20
New York 4.40
Alabama 6.30
Washington 4.75
dtype: float64
participation_rates = pd.Series([70.5,68.7,62.3,64.0],
index=['California', 'New York', 'Alabama', 'Washington'])
participation_rates
California 70.5
New York 68.7
Alabama 62.3
Washington 64.0
dtype: float64
state_employment = pd.DataFrame({'unemployment_rate': unemployment_rates,
'participation_rates': participation_rates})
state_employment
| | unemployment_rate | participation_rates |
|---|---|---|
| California | 4.20 | 70.5 |
| New York | 4.40 | 68.7 |
| Alabama | 6.30 | 62.3 |
| Washington | 4.75 | 64.0 |
The DataFrame has an index attribute for its row labels and a columns attribute for its column labels:
state_employment.index
Index(['California', 'New York', 'Alabama', 'Washington'], dtype='object')
state_employment.columns
Index(['unemployment_rate', 'participation_rates'], dtype='object')
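One consequence of the "aligned Series" view is that selecting a single column gives back a Series that carries the DataFrame's index. A minimal sketch, rebuilding the table above from plain lists (the name `column` is illustrative):

```python
import pandas as pd

labels = ['California', 'New York', 'Alabama', 'Washington']
state_employment = pd.DataFrame(
    {'unemployment_rate': [4.2, 4.4, 6.3, 4.75],
     'participation_rates': [70.5, 68.7, 62.3, 64.0]},
    index=labels)

# Selecting one column returns a Series with the same row labels.
column = state_employment['unemployment_rate']

print(type(column).__name__)                        # Series
print(column.index.equals(state_employment.index))  # True
print(state_employment.shape)                       # (4, 2)
```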
Notice that a DataFrame copes when some index labels do not match:
participation_rates = pd.Series([70.5,68.7,62.3,64.0],
index=['California', 'New York', 'Alabama', 'Nebraska'])
state_employment = pd.DataFrame({'unemployment_rate': unemployment_rates,
'participation_rates': participation_rates})
state_employment
| | unemployment_rate | participation_rates |
|---|---|---|
| Alabama | 6.30 | 62.3 |
| California | 4.20 | 70.5 |
| Nebraska | NaN | 64.0 |
| New York | 4.40 | 68.7 |
| Washington | 4.75 | NaN |
The resulting index is the union of the two Series' indexes, and any entry without a matching label is filled with NaN (Not a Number) to mark it as missing. We will return to missing data later on.
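As a brief preview, the missing entries can be located with `isna()` and incomplete rows removed with `dropna()`. This is only a sketch of one possible approach; the names `df`, `missing_per_column`, and `complete_rows` are illustrative.

```python
import pandas as pd

unemployment = pd.Series(
    [4.2, 4.4, 6.3, 4.75],
    index=['California', 'New York', 'Alabama', 'Washington'])
participation = pd.Series(
    [70.5, 68.7, 62.3, 64.0],
    index=['California', 'New York', 'Alabama', 'Nebraska'])

df = pd.DataFrame({'unemployment_rate': unemployment,
                   'participation_rates': participation})

# isna() gives a boolean table; summing it counts the NaNs per column.
missing_per_column = df.isna().sum()

# dropna() keeps only the rows with no missing values.
complete_rows = df.dropna()

print(int(missing_per_column['unemployment_rate']))    # 1
print(int(missing_per_column['participation_rates']))  # 1
print(sorted(complete_rows.index))  # ['Alabama', 'California', 'New York']
```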
Making DataFrames from other objects
Pandas DataFrames don’t have to come from collecting a bunch of Series
together. You can assemble them from many different objects.
The most useful for us is probably converting NumPy arrays into DataFrames and vice versa:
df = pd.DataFrame(np.random.rand(3, 2),
columns=['col1', 'col2'],
index=['row1', 'row2', 'row3'])
df
| | col1 | col2 |
|---|---|---|
| row1 | 0.138909 | 0.420695 |
| row2 | 0.609878 | 0.194575 |
| row3 | 0.034022 | 0.883869 |
array = np.array(df)
array
array([[0.13890912, 0.42069488],
[0.60987771, 0.19457507],
[0.03402185, 0.88386909]])
type(array)
numpy.ndarray
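An alternative to `np.array(df)` is the DataFrame's own `to_numpy()` method. The sketch below uses fixed values instead of random ones so that the round trip can be checked; `values` and `rebuilt` are illustrative names.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)
df = pd.DataFrame(values,
                  columns=['col1', 'col2'],
                  index=['row1', 'row2', 'row3'])

# Converting to NumPy drops the row and column labels.
array = df.to_numpy()
print(type(array).__name__)           # ndarray
print(np.array_equal(array, values))  # True

# The labels have to be supplied again to rebuild the DataFrame.
rebuilt = pd.DataFrame(array, index=df.index, columns=df.columns)
print(rebuilt.equals(df))             # True
```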
A little more on the Pandas Index Object
The Pandas Index object has an interesting structure of its own and is worth understanding. It can be thought of as an immutable array or as an ordered set, which leads to some interesting consequences.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
It can be indexed and sliced like an array:
ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')
and it has attributes familiar from NumPy arrays:
print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64
Unlike NumPy arrays, however, Index objects are immutable:
ind[1] = 70
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_2778/770004720.py in <module>
----> 1 ind[1] = 70
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
4583 @final
4584 def __setitem__(self, key, value):
-> 4585 raise TypeError("Index does not support mutable operations")
4586
4587 def __getitem__(self, key):
TypeError: Index does not support mutable operations
Immutability is a desirable property when an index is shared across multiple DataFrames, because none of them can change it by accident.
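Since an Index cannot be modified in place, a changed index has to be built as a new object. A minimal sketch, assuming we want to swap the second label for 70; `new_ind` is an illustrative name, and `delete` and `insert` both return new Index objects.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# In-place assignment is rejected.
try:
    ind[1] = 70
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new Index instead; the original is left untouched.
new_ind = ind.delete(1).insert(1, 70)

print(assignment_failed)  # True
print(new_ind.tolist())   # [2, 70, 5, 7, 11]
print(ind.tolist())       # [2, 3, 5, 7, 11]
```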
Indexes are also ordered sets:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
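# Use the named set methods: from pandas 2.0 on, the &, | and ^
# operators act element-wise on an Index and are no longer set operations.
indA.intersection(indB) # intersection
Int64Index([3, 5, 7], dtype='int64')
indA.union(indB) # union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA.symmetric_difference(indB) # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')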
These are important concepts when thinking about joins across multiple data sets.
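To illustrate the connection, the sketch below joins two small DataFrames on their indexes. The frames `left` and `right` and their column names are made up for this example. An inner join keeps the intersection of the two indexes, and an outer join keeps their union.

```python
import pandas as pd

left = pd.DataFrame({'a': [10, 20, 30]}, index=[1, 3, 5])
right = pd.DataFrame({'b': [0.1, 0.2, 0.3]}, index=[3, 5, 7])

inner = left.join(right, how='inner')  # labels present in both frames
outer = left.join(right, how='outer')  # labels present in either frame

print(inner.index.tolist())  # [3, 5]
print(outer.index.tolist())  # [1, 3, 5, 7]

# The same label sets, computed directly on the Index objects.
print(left.index.intersection(right.index).tolist())  # [3, 5]
print(left.index.union(right.index).tolist())         # [1, 3, 5, 7]
```

In the outer join, rows whose label appears in only one frame get NaN in the other frame's columns.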