Operations on Data¶
Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer
When working with NumPy, an essential part of our toolkit was the ability to perform element-wise operations such as adding and multiplying, or operations like exponentiating or taking the log. We also want this functionality for DataFrames, and pandas provides it - borrowing functionality from NumPy.
In this notebook we will look at how pandas allows us to perform operations like those mentioned above - using universal functions (ufuncs).
import pandas as pd
import numpy as np
UFuncs: Preserving Indices¶
Pandas is designed to work with NumPy. This means any NumPy ufunc applied to a pandas object will return another pandas object of the same type.
To see this, let’s set up some example pandas objects:
rng = np.random.RandomState(1234567890)
series = pd.Series(rng.randint(0, 10, 10))
series
0 2
1 6
2 3
3 1
4 9
5 9
6 4
7 0
8 9
9 6
dtype: int64
type(series)
pandas.core.series.Series
df = pd.DataFrame(rng.randint(0, 10, (10, 4)),
columns=['A', 'B', 'C', 'D'])
df
A | B | C | D | |
---|---|---|---|---|
0 | 4 | 0 | 7 | 3 |
1 | 5 | 1 | 6 | 8 |
2 | 3 | 5 | 1 | 7 |
3 | 1 | 8 | 1 | 1 |
4 | 2 | 5 | 8 | 7 |
5 | 8 | 9 | 8 | 2 |
6 | 1 | 1 | 8 | 4 |
7 | 5 | 7 | 2 | 5 |
8 | 6 | 3 | 4 | 3 |
9 | 9 | 4 | 8 | 5 |
type(df)
pandas.core.frame.DataFrame
Let's see what happens when we apply a NumPy ufunc to a pandas object. (Note the warning below: the Series contains a zero, and np.log returns -inf for that entry.)
series2 = np.log(series)
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/core/arraylike.py:364: RuntimeWarning: divide by zero encountered in log
result = getattr(ufunc, method)(*inputs, **kwargs)
type(series2)
pandas.core.series.Series
df2 = np.exp(df)
type(df2)
pandas.core.frame.DataFrame
type(df['B'])
pandas.core.series.Series
type(np.log10(df['B']))
/opt/hostedtoolcache/Python/3.8.11/x64/lib/python3.8/site-packages/pandas/core/arraylike.py:364: RuntimeWarning: divide by zero encountered in log10
result = getattr(ufunc, method)(*inputs, **kwargs)
pandas.core.series.Series
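Index preservation is easiest to see with a non-default index. A minimal sketch (the string labels here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A Series with string labels rather than the default 0..n-1 index
s = pd.Series([1.0, 10.0, 100.0], index=['a', 'b', 'c'])

# The ufunc transforms the values but leaves the index untouched
result = np.log10(s)
```

The result is still a Series indexed by 'a', 'b', 'c'.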
We can also apply a NumPy ufunc to an individual column of a DataFrame:
df3 = df2.copy()
df3['B'] = np.log(df3['B'])
df3
A | B | C | D | |
---|---|---|---|---|
0 | 54.598150 | 0.0 | 1096.633158 | 20.085537 |
1 | 148.413159 | 1.0 | 403.428793 | 2980.957987 |
2 | 20.085537 | 5.0 | 2.718282 | 1096.633158 |
3 | 2.718282 | 8.0 | 2.718282 | 2.718282 |
4 | 7.389056 | 5.0 | 2980.957987 | 1096.633158 |
5 | 2980.957987 | 9.0 | 2980.957987 | 7.389056 |
6 | 2.718282 | 1.0 | 2980.957987 | 54.598150 |
7 | 148.413159 | 7.0 | 7.389056 | 148.413159 |
8 | 403.428793 | 3.0 | 54.598150 | 20.085537 |
9 | 8103.083928 | 4.0 | 2980.957987 | 148.413159 |
In addition to NumPy functions, pandas provides its own operators using operator overloading. These are summarized here:

Python operator | Pandas method(s) |
---|---|
+ | add() |
- | sub(), subtract() |
* | mul(), multiply() |
/ | truediv(), div(), divide() |
// | floordiv() |
% | mod() |
** | pow() |
We can use these operators to combine a pandas object with a scalar:
df3['A'] * 10
0 545.981500
1 1484.131591
2 200.855369
3 27.182818
4 73.890561
5 29809.579870
6 27.182818
7 1484.131591
8 4034.287935
9 81030.839276
Name: A, dtype: float64
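The overloaded operators are shorthand for the pandas methods above; for instance * corresponds to mul(). A quick sketch of the equivalence (with a made-up frame):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# The operator form and the method form produce identical results
via_operator = df['A'] * 10
via_method = df['A'].mul(10)
```

The method form becomes useful when we need the extra keyword arguments (fill_value, axis) discussed later in this notebook.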
and we can also create new columns using these operations (either via NumPy functions or the pandas operators):
df3['E'] = df3['D'] / df3['C']
df3
A | B | C | D | E | |
---|---|---|---|---|---|
0 | 54.598150 | 0.0 | 1096.633158 | 20.085537 | 0.018316 |
1 | 148.413159 | 1.0 | 403.428793 | 2980.957987 | 7.389056 |
2 | 20.085537 | 5.0 | 2.718282 | 1096.633158 | 403.428793 |
3 | 2.718282 | 8.0 | 2.718282 | 2.718282 | 1.000000 |
4 | 7.389056 | 5.0 | 2980.957987 | 1096.633158 | 0.367879 |
5 | 2980.957987 | 9.0 | 2980.957987 | 7.389056 | 0.002479 |
6 | 2.718282 | 1.0 | 2980.957987 | 54.598150 | 0.018316 |
7 | 148.413159 | 7.0 | 7.389056 | 148.413159 | 20.085537 |
8 | 403.428793 | 3.0 | 54.598150 | 20.085537 | 0.367879 |
9 | 8103.083928 | 4.0 | 2980.957987 | 148.413159 | 0.049787 |
Challenge¶
Load in our labour market statistics data and perform the following operations:

1. Verify the unemployment_rate variable is correctly computed. (Hint: round your computation to 1 decimal place using .round(decimals=1), and test equality using df[''].equals(df['']).)
2. Does the decomposition labour_force = qty_employed + qty_unemployed hold true?
3. Assume that each worker was employed for 38 hours per week. Create a new column that estimates the number of hours worked in each state-year-month.
4. Assume that 60 percent of workers work 45 hours per week, whilst 40 percent work 20 hours per week; estimate the number of labour hours in each state-year-month.
5. Calculate the difference in your estimates from 3 and 4.
Solutions¶
data = pd.read_csv('out_data/state_labour_statistics.csv')
data.head()
state | period | year | unemployment_rate | qty_unemployed | qty_employed | labour_force | |
---|---|---|---|---|---|---|---|
0 | New Hampshire | M12 | 2016 | 2.5 | 18907 | 728556 | 747463 |
1 | New Hampshire | M11 | 2016 | 2.6 | 19573 | 729720 | 749293 |
2 | New Hampshire | M10 | 2016 | 2.5 | 18430 | 729792 | 748222 |
3 | New Hampshire | M09 | 2016 | 2.6 | 19415 | 727548 | 746963 |
4 | New Hampshire | M08 | 2016 | 2.8 | 20933 | 737236 | 758169 |
data['ue_rate2'] = (data['qty_unemployed'] / data['labour_force'] * 100).round(1)
data.head()
state | period | year | unemployment_rate | qty_unemployed | qty_employed | labour_force | ue_rate2 | |
---|---|---|---|---|---|---|---|---|
0 | New Hampshire | M12 | 2016 | 2.5 | 18907 | 728556 | 747463 | 2.5 |
1 | New Hampshire | M11 | 2016 | 2.6 | 19573 | 729720 | 749293 | 2.6 |
2 | New Hampshire | M10 | 2016 | 2.5 | 18430 | 729792 | 748222 | 2.5 |
3 | New Hampshire | M09 | 2016 | 2.6 | 19415 | 727548 | 746963 | 2.6 |
4 | New Hampshire | M08 | 2016 | 2.8 | 20933 | 737236 | 758169 | 2.8 |
data['ue_rate2'].equals(data['unemployment_rate'])
True
data['labour_force'].equals(data['qty_employed'] + data['qty_unemployed'])
True
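The remaining parts of the challenge (hours worked) can be sketched as below. The column name follows the labour data above, but we build a small stand-in frame so the snippet runs on its own; the 38-hour figure and the 60/40 split come from the challenge text.

```python
import pandas as pd

# Stand-in for the labour statistics data (same column name as above)
data = pd.DataFrame({'qty_employed': [728556, 729720, 729792]})

# 3. Every worker employed for 38 hours per week
data['hours_uniform'] = data['qty_employed'] * 38

# 4. 60% of workers at 45 hours, 40% at 20 hours per week
data['hours_split'] = data['qty_employed'] * (0.6 * 45 + 0.4 * 20)

# 5. Difference between the two estimates
data['hours_diff'] = data['hours_uniform'] - data['hours_split']
```

Since 0.6 * 45 + 0.4 * 20 = 35, the uniform estimate exceeds the split estimate by 3 hours per employed worker.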
Index Alignment¶
When pandas performs binary operations between different Series, it aligns their indices before performing the operation. This is useful if we have incomplete data in one or both of the objects we are trying to combine:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
By default, pandas returns NaN for any index label present in only one of the objects - but this may not always be the behaviour we want. The fill_value option allows us to specify a value for the missing entries to take:
A.add(B, fill_value = 0)
0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64
although this may not be desirable.
A similar idea holds for DataFrames:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('AB'))
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('BAC'))
A+B
A | B | C | |
---|---|---|---|
0 | 8.0 | 18.0 | NaN |
1 | 21.0 | 18.0 | NaN |
2 | NaN | NaN | NaN |
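The fill_value option works for DataFrames as well. A sketch with small fixed frames (named X and Y here to avoid clashing with the random frames above):

```python
import pandas as pd

X = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})           # rows 0-1, columns A, B
Y = pd.DataFrame({'A': [10, 20, 30], 'C': [1, 1, 1]})  # rows 0-2, columns A, C

# Missing entries are treated as 0 before adding, so a NaN only remains
# where *both* frames lack an entry (here: row 2, column B)
result = X.add(Y, fill_value=0)
```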
Finally, you can use pandas' arithmetic methods to combine information from a DataFrame and a Series:
B
B | A | C | |
---|---|---|---|
0 | 9 | 7 | 0 |
1 | 3 | 2 | 4 |
2 | 7 | 1 | 1 |
B.subtract(B['B'], axis=0)
B | A | C | |
---|---|---|---|
0 | 0 | -2 | -9 |
1 | 0 | -1 | 1 |
2 | 0 | -6 | -6 |
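Note that axis=0 was needed above to align the Series against the row index. By default these methods align a Series against the columns (axis='columns'), which is handy for operations like subtracting one row from every row. A sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'B': [9, 3, 7], 'A': [7, 2, 1], 'C': [0, 4, 1]})

# With the default axis, the Series index is matched against the columns:
# subtracting the first row turns it into a row of zeros and expresses
# every other row as a difference from it
row0 = df.iloc[0]
result = df - row0
```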