import pandas as pd
import numpy as np

Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data.

unemployment_rates = pd.Series([4.2,4.4,6.3,4.75])
unemployment_rates
0    4.20
1    4.40
2    6.30
3    4.75
dtype: float64

We see that the output is a sequence of indices and a sequence of values, which can be accessed separately:

unemployment_rates.values
array([4.2 , 4.4 , 6.3 , 4.75])
unemployment_rates.index
RangeIndex(start=0, stop=4, step=1)

As with NumPy, we can access data by indexing and slicing with the familiar square-bracket notation:

unemployment_rates[1]
4.4
unemployment_rates[2:4]
2    6.30
3    4.75
dtype: float64

Setting the Index

By default, the index is set for us by pandas. However, we can set it ourselves:

unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=['California', 'New York', 'Alabama', 'Washington'])
unemployment_rates
California    4.20
New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64

and we can index as expected:

unemployment_rates['Alabama']
6.3

or, perhaps unexpectedly, slice by label, where the endpoint is included:

unemployment_rates['New York':'Washington']
New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64
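
Pandas also provides the explicit indexers .loc and .iloc to avoid ambiguity. As a minimal sketch, reusing the unemployment_rates Series above: .loc slices by label and includes the endpoint, while .iloc slices by integer position and, like NumPy, excludes it.

unemployment_rates.loc['New York':'Washington']  # label-based slice, endpoint included
unemployment_rates.iloc[1:3]                     # position-based slice, endpoint excluded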

We are not limited to indexing by consecutive integers or strings; any values can serve as the index:

unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=[1983, 1985, 1990, 2016])
unemployment_rates
1983    4.20
1985    4.40
1990    6.30
2016    4.75
dtype: float64
unemployment_rates[2016]
4.75
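
With integer labels it is easy to confuse position and label: the square brackets above look up the label 2016, not the 2016th position. A small sketch with the explicit indexers, using the same Series:

unemployment_rates.iloc[0]    # positional access: the first element, 4.2
unemployment_rates.loc[1983]  # label access: the entry labelled 1983, also 4.2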

The Pandas Data Frame

A DataFrame is a two-dimensional array with multiple columns. It can be thought of as a sequence of aligned arrays, or of aligned pandas Series, where “aligned” means they share the same index.

unemployment_rates = pd.Series([4.2,4.4,6.3,4.75],
                              index=['California', 'New York', 'Alabama', 'Washington'])
unemployment_rates
California    4.20
New York      4.40
Alabama       6.30
Washington    4.75
dtype: float64
participation_rates = pd.Series([70.5,68.7,62.3,64.0],
                              index=['California', 'New York', 'Alabama', 'Washington'])
participation_rates
California    70.5
New York      68.7
Alabama       62.3
Washington    64.0
dtype: float64
state_employment = pd.DataFrame({'unemployment_rate': unemployment_rates,
                                    'participation_rates': participation_rates})
state_employment
unemployment_rate participation_rates
California 4.20 70.5
New York 4.40 68.7
Alabama 6.30 62.3
Washington 4.75 64.0

The DataFrame has index and columns attributes:

state_employment.index
Index(['California', 'New York', 'Alabama', 'Washington'], dtype='object')
state_employment.columns
Index(['unemployment_rate', 'participation_rates'], dtype='object')
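
Since a DataFrame behaves like a dictionary of aligned Series, indexing with a column name returns that column as a Series. A small illustration with the state_employment frame above:

state_employment['unemployment_rate']        # returns the unemployment_rate column as a Series
type(state_employment['unemployment_rate'])  # pandas.core.series.Series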

Notice that DataFrames cope gracefully when some indices do not match:

participation_rates = pd.Series([70.5,68.7,62.3,64.0],
                              index=['California', 'New York', 'Alabama', 'Nebraska'])
state_employment = pd.DataFrame({'unemployment_rate': unemployment_rates,
                                    'participation_rates': participation_rates})
state_employment
unemployment_rate participation_rates
Alabama 6.30 62.3
California 4.20 70.5
Nebraska NaN 64.0
New York 4.40 68.7
Washington 4.75 NaN

by using NaN, i.e. Not a Number, to mark missing values. We will return to missing data later on.
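
As a brief preview, pandas has methods for detecting and handling these missing entries; a minimal sketch using the state_employment frame above:

state_employment.isnull()     # boolean mask marking the NaN entries
state_employment.dropna()     # drop any row containing a NaN
state_employment.fillna(0.0)  # replace NaNs with a chosen value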

Making DataFrames from other objects

Pandas DataFrames don’t have to come from collecting a bunch of Series together. You can assemble them from many different objects.

The most useful for us is probably converting NumPy arrays into DataFrames, and vice versa:

df = pd.DataFrame(np.random.rand(3, 2),
                 columns=['col1', 'col2'],
                 index=['row1', 'row2', 'row3'])
df
col1 col2
row1 0.138909 0.420695
row2 0.609878 0.194575
row3 0.034022 0.883869
array = np.array(df)
array
array([[0.13890912, 0.42069488],
       [0.60987771, 0.19457507],
       [0.03402185, 0.88386909]])
type(array)
numpy.ndarray
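
Other constructions work as well, for example from a dict of lists or a list of dicts. A small sketch with made-up column names:

pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})  # dict of lists: keys become column names
pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])    # list of dicts: each dict becomes a row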

A little more on the Pandas Index Object

The Pandas Index object has interesting structure in itself and is worth understanding: it can be thought of as an immutable array, or as an ordered set. This leads to interesting consequences.

ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')

It can be indexed and sliced:

ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')

and it has attributes familiar from NumPy arrays:

print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64

but, unlike NumPy arrays, Index objects are immutable:

ind[1] = 70
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_2778/770004720.py in <module>
----> 1 ind[1] = 70

/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   4583     @final
   4584     def __setitem__(self, key, value):
-> 4585         raise TypeError("Index does not support mutable operations")
   4586 
   4587     def __getitem__(self, key):

TypeError: Index does not support mutable operations

Being immutable is a desirable property when indices are shared across multiple DataFrames: no operation on one DataFrame can silently alter the index of another.
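
If a different index is genuinely needed, the whole Index object is replaced rather than modified in place. A minimal sketch reusing the df defined earlier, with hypothetical new labels:

df.index = pd.Index(['a', 'b', 'c'])  # assign a brand-new Index; the old one is untouched
df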

Indexes are also ordered sets:

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB  # intersection
Int64Index([3, 5, 7], dtype='int64')
indA | indB  # union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA ^ indB  # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')

These are important concepts when thinking about joins across multiple data sets.
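
Note that newer pandas releases deprecate the set-operation behaviour of &, | and ^ on Index objects; the explicit methods are the safer choice. A minimal sketch of the equivalent calls:

indA.intersection(indB)          # same result as indA & indB above
indA.union(indB)                 # same result as indA | indB
indA.symmetric_difference(indB)  # same result as indA ^ indB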