Aggregate Functions and Group Statistics

  • Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

So far our work with Pandas has been purely about simple transformations of each row of data, whether selecting rows or creating new columns out of old ones. This is rarely enough: we typically want some degree of summarization of the data, either over an entire column or over parts (groups) of that column.

This is what we tackle here:

import pandas as pd
import numpy as np
data = pd.read_csv('out_data/state_labour_statistics.csv')

Simple Aggregation

Like NumPy, Pandas has some of the standard data aggregations:

rng = np.random.RandomState(1234567890)
series = pd.Series(rng.rand(5))
series
0    0.618795
1    0.591624
2    0.888684
3    0.891655
4    0.457567
dtype: float64
series.sum()
3.4483242705353976
series.mean(), series.median(), series.max()
(0.6896648541070796, 0.61879477158568, 0.8916548001156082)

For a DataFrame, these aggregations return results for each column:

df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df
A B
0 0.778188 0.400999
1 0.267064 0.705406
2 0.996106 0.405186
3 0.540095 0.949991
4 0.537522 0.030754
df.mean()
A    0.623795
B    0.498467
dtype: float64

If you want the aggregation across a row instead, you can specify that:

df.mean(axis='columns')
0    0.589594
1    0.486235
2    0.700646
3    0.745043
4    0.284138
dtype: float64

Often we want to see all of the common aggregates at once. Pandas provides the .describe() method to do this for us:

data.describe()
year unemployment_rate qty_unemployed qty_employed labour_force
count 10608.00000 10608.000000 1.060800e+04 1.060800e+04 1.060800e+04
mean 2008.00000 5.943194 1.849809e+05 2.751157e+06 2.936138e+06
std 4.89921 2.261388 2.433025e+05 3.021023e+06 3.242657e+06
min 2000.00000 1.700000 6.697000e+03 2.486420e+05 2.619910e+05
25% 2004.00000 4.400000 4.088675e+04 7.359608e+05 7.836765e+05
50% 2008.00000 5.500000 1.150930e+05 1.837300e+06 1.961558e+06
75% 2012.00000 7.000000 2.212182e+05 3.226836e+06 3.451084e+06
max 2016.00000 17.300000 2.316525e+06 1.819380e+07 1.920923e+07

Some useful aggregation functions in pandas are in the table below:

count()                Total number of items
first(), last()        First and last item
mean(), median()       Mean and median
min(), max()           Minimum and maximum
std(), var()           Standard deviation and variance
mad()                  Mean absolute deviation
prod()                 Product of all items
sum()                  Sum of all items
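
To see a few of these in action, here is a minimal sketch using the columns of the state_labour_statistics.csv data loaded above:

# a single aggregation on one column
data['unemployment_rate'].min(), data['unemployment_rate'].max()

# .agg() accepts a list of function names and computes them all at once
data['unemployment_rate'].agg(['count', 'mean', 'std'])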

Whilst these aggregations can reveal broad patterns in the data, often we want to go deeper and compute these summaries at the level of a group (like a state or a year). Pandas has the groupby function that allows us to efficiently look at aggregate statistics on subsets of the data.

GroupBy: The Split, Apply, Combine Paradigm

The canonical example of the groupby paradigm is best represented in a figure:

[insert figure]

In words, GroupBy does the following:

  • The split step breaks up a DataFrame into subsets depending on the value of a specified key

  • The apply step computes some aggregate function within each individual group

  • The combine step merges the results of the computations within groups back into a single dataset

We could certainly perform all of these operations ourselves, manually, but the tedium of the task, combined with a high probability of error, makes us immediately want a built-in function that abstracts away the individual steps and asks the user to think of the operation as a whole rather than as its parts.
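
To see why, here is a minimal sketch of what the manual version would look like, assuming the data frame loaded above with its state and unemployment_rate columns:

# split: one subset per state; apply: mean within each subset;
# combine: collect the per-group results into a single Series
results = {}
for state in data['state'].unique():
    subset = data[data['state'] == state]                  # split
    results[state] = subset['unemployment_rate'].mean()    # apply
manual_means = pd.Series(results).sort_index()             # combine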

As a start, let's look at the average unemployment rate by state. First, we specify the name of the column we want to split on:

data.groupby('state')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f4cf53d58e0>

Think of this object as an under-the-hood view of a DataFrame. Note that it does not compute anything until an aggregation is applied to it, i.e. it uses 'lazy evaluation'.
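
If you want to peek inside before aggregating, a GroupBy object can be iterated over; each element is a (key, sub-DataFrame) pair. A quick sketch:

grouped = data.groupby('state')
grouped.ngroups    # the number of distinct states

# iterating yields (group key, sub-DataFrame) pairs
for state, group in grouped:
    print(state, group.shape)
    break    # just show the first group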

Now let's get those mean unemployment rates:

data.groupby('state').unemployment_rate.mean()
state
Alabama                  6.514706
Alaska                   7.005392
Arizona                  6.381373
Arkansas                 5.997059
California               7.443627
Colorado                 5.453922
Connecticut              5.821078
Delaware                 5.214216
District of Columbia     7.306373
Florida                  6.158824
Georgia                  6.446078
Hawaii                   4.456373
Idaho                    5.525000
Illinois                 6.936275
Indiana                  6.184804
Iowa                     4.373039
Kansas                   5.084314
Kentucky                 6.639216
Louisiana                6.197549
Maine                    5.470098
Maryland                 5.121078
Massachusetts            5.521078
Michigan                 7.662255
Minnesota                4.886765
Mississippi              7.285784
Missouri                 6.069118
Montana                  5.023039
Nebraska                 3.629412
Nevada                   7.220098
New Hampshire            4.205882
New Jersey               6.248039
New Mexico               5.972059
New York                 6.182353
North Carolina           6.741176
North Dakota             3.279902
Ohio                     6.455392
Oklahoma                 4.784804
Oregon                   7.242647
Pennsylvania             5.946569
Puerto Rico             12.653922
Rhode Island             7.000490
South Carolina           7.130882
South Dakota             3.579412
Tennessee                6.369118
Texas                    5.775000
Utah                     4.716667
Vermont                  4.247549
Virginia                 4.543137
Washington               6.706863
West Virginia            6.168137
Wisconsin                5.668627
Wyoming                  4.399510
Name: unemployment_rate, dtype: float64
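
If we want to know which states sit at the extremes, we can sort the resulting Series. A minimal sketch:

state_means = data.groupby('state').unemployment_rate.mean()
state_means.sort_values().head()    # lowest average rates (North Dakota, ...)
state_means.sort_values().tail()    # highest average rates (..., Puerto Rico)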

Notice that if we want the aggregation function applied to all of the columns, we can skip specifying a column:

data.groupby('state').mean()
year unemployment_rate qty_unemployed qty_employed labour_force
state
Alabama 2008.0 6.514706 1.407627e+05 2.016128e+06 2.156891e+06
Alaska 2008.0 7.005392 2.449233e+04 3.251195e+05 3.496119e+05
Arizona 2008.0 6.381373 1.901428e+05 2.759190e+06 2.949333e+06
Arkansas 2008.0 5.997059 7.955364e+04 1.244863e+06 1.324416e+06
California 2008.0 7.443627 1.347057e+06 1.665503e+07 1.800209e+07
Colorado 2008.0 5.453922 1.452646e+05 2.503614e+06 2.648878e+06
Connecticut 2008.0 5.821078 1.082514e+05 1.736436e+06 1.844688e+06
Delaware 2008.0 5.214216 2.292001e+04 4.152155e+05 4.381355e+05
District of Columbia 2008.0 7.306373 2.490122e+04 3.137833e+05 3.386846e+05
Florida 2008.0 6.158824 5.573326e+05 8.397621e+06 8.954954e+06
Georgia 2008.0 6.446078 3.016779e+05 4.339163e+06 4.640840e+06
Hawaii 2008.0 4.456373 2.844513e+04 6.090308e+05 6.374759e+05
Idaho 2008.0 5.525000 4.104917e+04 7.013501e+05 7.423993e+05
Illinois 2008.0 6.936275 4.523743e+05 6.059067e+06 6.511441e+06
Indiana 2008.0 6.184804 1.977921e+05 3.002629e+06 3.200421e+06
Iowa 2008.0 4.373039 7.248086e+04 1.583003e+06 1.655483e+06
Kansas 2008.0 5.084314 7.486110e+04 1.394324e+06 1.469186e+06
Kentucky 2008.0 6.639216 1.339159e+05 1.874098e+06 2.008013e+06
Louisiana 2008.0 6.197549 1.283975e+05 1.938786e+06 2.067184e+06
Maine 2008.0 5.470098 3.788920e+04 6.543850e+05 6.922742e+05
Maryland 2008.0 5.121078 1.548394e+05 2.847295e+06 3.002135e+06
Massachusetts 2008.0 5.521078 1.911474e+05 3.266176e+06 3.457324e+06
Michigan 2008.0 7.662255 3.749323e+05 4.543822e+06 4.918755e+06
Minnesota 2008.0 4.886765 1.426212e+05 2.771413e+06 2.914034e+06
Mississippi 2008.0 7.285784 9.449842e+04 1.201615e+06 1.296114e+06
Missouri 2008.0 6.069118 1.842034e+05 2.849336e+06 3.033539e+06
Montana 2008.0 5.023039 2.489118e+04 4.706648e+05 4.955559e+05
Nebraska 2008.0 3.629412 3.576543e+04 9.494372e+05 9.852027e+05
Nevada 2008.0 7.220098 9.513693e+04 1.193879e+06 1.289016e+06
New Hampshire 2008.0 4.205882 3.072605e+04 6.977644e+05 7.284904e+05
New Jersey 2008.0 6.248039 2.798071e+05 4.176598e+06 4.456405e+06
New Mexico 2008.0 5.972059 5.465965e+04 8.586955e+05 9.133552e+05
New York 2008.0 6.182353 5.869909e+05 8.888198e+06 9.475189e+06
North Carolina 2008.0 6.741176 3.046283e+05 4.188285e+06 4.492913e+06
North Dakota 2008.0 3.279902 1.220031e+04 3.621385e+05 3.743388e+05
Ohio 2008.0 6.455392 3.763495e+05 5.449497e+06 5.825846e+06
Oklahoma 2008.0 4.784804 8.368540e+04 1.661057e+06 1.744743e+06
Oregon 2008.0 7.242647 1.389804e+05 1.776049e+06 1.915029e+06
Pennsylvania 2008.0 5.946569 3.771446e+05 5.952701e+06 6.329845e+06
Puerto Rico 2008.0 12.653922 1.603699e+05 1.112528e+06 1.272898e+06
Rhode Island 2008.0 7.000490 3.921885e+04 5.201720e+05 5.593909e+05
South Carolina 2008.0 7.130882 1.517508e+05 1.967512e+06 2.119263e+06
South Dakota 2008.0 3.579412 1.562588e+04 4.203784e+05 4.360043e+05
Tennessee 2008.0 6.369118 1.926002e+05 2.816181e+06 3.008781e+06
Texas 2008.0 5.775000 6.833513e+05 1.111923e+07 1.180258e+07
Utah 2008.0 4.716667 6.224774e+04 1.261578e+06 1.323826e+06
Vermont 2008.0 4.247549 1.492144e+04 3.346937e+05 3.496151e+05
Virginia 2008.0 4.543137 1.844006e+05 3.830106e+06 4.014506e+06
Washington 2008.0 6.706863 2.257521e+05 3.132887e+06 3.358639e+06
West Virginia 2008.0 6.168137 4.937335e+04 7.508340e+05 8.002073e+05
Wisconsin 2008.0 5.668627 1.737884e+05 2.889477e+06 3.063265e+06
Wyoming 2008.0 4.399510 1.283748e+04 2.771341e+05 2.899716e+05

and if we want multiple, but not all, columns, we pass a list of column names:

data.groupby('state')[['unemployment_rate', 'qty_unemployed']].mean()
unemployment_rate qty_unemployed
state
Alabama 6.514706 1.407627e+05
Alaska 7.005392 2.449233e+04
Arizona 6.381373 1.901428e+05
Arkansas 5.997059 7.955364e+04
California 7.443627 1.347057e+06
Colorado 5.453922 1.452646e+05
Connecticut 5.821078 1.082514e+05
Delaware 5.214216 2.292001e+04
District of Columbia 7.306373 2.490122e+04
Florida 6.158824 5.573326e+05
Georgia 6.446078 3.016779e+05
Hawaii 4.456373 2.844513e+04
Idaho 5.525000 4.104917e+04
Illinois 6.936275 4.523743e+05
Indiana 6.184804 1.977921e+05
Iowa 4.373039 7.248086e+04
Kansas 5.084314 7.486110e+04
Kentucky 6.639216 1.339159e+05
Louisiana 6.197549 1.283975e+05
Maine 5.470098 3.788920e+04
Maryland 5.121078 1.548394e+05
Massachusetts 5.521078 1.911474e+05
Michigan 7.662255 3.749323e+05
Minnesota 4.886765 1.426212e+05
Mississippi 7.285784 9.449842e+04
Missouri 6.069118 1.842034e+05
Montana 5.023039 2.489118e+04
Nebraska 3.629412 3.576543e+04
Nevada 7.220098 9.513693e+04
New Hampshire 4.205882 3.072605e+04
New Jersey 6.248039 2.798071e+05
New Mexico 5.972059 5.465965e+04
New York 6.182353 5.869909e+05
North Carolina 6.741176 3.046283e+05
North Dakota 3.279902 1.220031e+04
Ohio 6.455392 3.763495e+05
Oklahoma 4.784804 8.368540e+04
Oregon 7.242647 1.389804e+05
Pennsylvania 5.946569 3.771446e+05
Puerto Rico 12.653922 1.603699e+05
Rhode Island 7.000490 3.921885e+04
South Carolina 7.130882 1.517508e+05
South Dakota 3.579412 1.562588e+04
Tennessee 6.369118 1.926002e+05
Texas 5.775000 6.833513e+05
Utah 4.716667 6.224774e+04
Vermont 4.247549 1.492144e+04
Virginia 4.543137 1.844006e+05
Washington 6.706863 2.257521e+05
West Virginia 6.168137 4.937335e+04
Wisconsin 5.668627 1.737884e+05
Wyoming 4.399510 1.283748e+04

Very Cool!

Challenge

Try and compute statistics where the grouping is described by multiple variables:

  1. Compute the average unemployment rate by state-year

  2. Compute the average unemployment rate by state-year-month

  3. Compute the average unemployment rate by state-year for the Carolinas (Hint: use the .filter() function after the computation)

  4. Compute the average unemployment rate by state-year and use unstack to view the data in a way that is different from the default output

Partial Solution

data.groupby(['state', 'year'])['unemployment_rate'].mean().filter(like='Carolina')
state           year
North Carolina  2000     3.666667
                2001     5.516667
                2002     6.658333
                2003     6.375000
                2004     5.491667
                2005     5.233333
                2006     4.750000
                2007     4.725000
                2008     6.100000
                2009    10.575000
                2010    10.850000
                2011    10.275000
                2012     9.291667
                2013     7.983333
                2014     6.300000
                2015     5.750000
                2016     5.058333
South Carolina  2000     3.758333
                2001     5.216667
                2002     5.808333
                2003     6.908333
                2004     6.825000
                2005     6.725000
                2006     6.433333
                2007     5.666667
                2008     6.791667
                2009    11.258333
                2010    11.158333
                2011    10.558333
                2012     9.216667
                2013     7.641667
                2014     6.441667
                2015     5.975000
                2016     4.841667
Name: unemployment_rate, dtype: float64
data.groupby(['state', 'year'])['unemployment_rate'].mean().unstack(level=0)
state Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida ... South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming
year
2000 4.583333 6.375000 3.950000 4.258333 4.941667 2.758333 2.391667 3.733333 5.608333 3.683333 ... 2.458333 3.866667 4.291667 3.341667 2.808333 2.300000 5.166667 5.483333 3.525000 3.908333
2001 5.125000 6.433333 4.791667 4.958333 5.433333 3.800000 3.133333 3.500000 6.333333 4.641667 ... 3.091667 4.550000 4.941667 4.416667 3.333333 3.233333 6.300000 5.008333 4.525000 3.825000
2002 5.900000 7.291667 6.083333 5.483333 6.666667 5.533333 4.333333 4.025000 6.408333 5.633333 ... 3.183333 5.183333 6.341667 5.800000 3.991667 4.216667 7.408333 5.916667 5.375000 4.041667
2003 6.016667 7.825000 5.741667 5.933333 6.800000 5.958333 5.391667 4.266667 6.833333 5.208333 ... 3.533333 5.633333 6.683333 5.633333 4.266667 4.108333 7.408333 6.050000 5.700000 4.275000
2004 5.683333 7.458333 5.033333 5.708333 6.208333 5.483333 4.966667 3.991667 7.783333 4.633333 ... 3.725000 5.333333 5.941667 4.991667 3.675000 3.825000 6.266667 5.325000 5.025000 3.808333
2005 4.491667 6.883333 4.683333 5.250000 5.400000 5.025000 4.883333 4.141667 6.375000 3.691667 ... 3.816667 5.550000 5.408333 4.091667 3.475000 3.566667 5.550000 5.108333 4.741667 3.583333
2006 4.033333 6.616667 4.208333 5.158333 4.900000 4.308333 4.333333 3.575000 5.791667 3.225000 ... 3.108333 5.200000 4.891667 2.975000 3.683333 3.100000 5.041667 4.941667 4.741667 3.175000
2007 3.983333 6.333333 3.858333 5.308333 5.375000 3.733333 4.475000 3.441667 5.525000 4.008333 ... 2.841667 4.675000 4.291667 2.566667 4.008333 3.033333 4.708333 4.583333 4.900000 2.816667
2008 5.658333 6.691667 6.125000 5.466667 7.275000 4.825000 5.683333 4.941667 6.458333 6.283333 ... 3.050000 6.575000 4.808333 3.591667 4.683333 3.941667 5.383333 4.358333 4.908333 3.050000
2009 11.033333 7.741667 9.933333 7.850000 11.141667 7.283333 7.933333 8.266667 9.275000 10.408333 ... 4.925000 10.483333 7.558333 7.325000 6.591667 6.700000 9.141667 7.725000 8.575000 6.325000
2010 10.541667 7.883333 10.366667 8.191667 12.225000 8.725000 9.108333 8.416667 9.416667 11.041667 ... 4.983333 9.675000 8.150000 7.833333 6.091667 7.150000 10.000000 8.641667 8.666667 6.458333
2011 9.641667 7.591667 9.500000 8.308333 11.716667 8.358333 8.808333 7.500000 10.158333 9.991667 ... 4.741667 8.991667 7.758333 6.733333 5.475000 6.591667 9.291667 8.058333 7.758333 5.808333
2012 7.941667 7.141667 8.350000 7.575000 10.375000 7.875000 8.341667 7.200000 9.050000 8.458333 ... 4.275000 7.841667 6.725000 5.408333 4.950000 6.033333 8.133333 7.508333 7.041667 5.316667
2013 7.225000 6.925000 7.741667 7.341667 8.933333 6.825000 7.750000 6.716667 8.500000 7.233333 ... 3.783333 7.791667 6.225000 4.600000 4.416667 5.716667 7.033333 6.766667 6.741667 4.708333
2014 6.791667 6.841667 6.775000 6.091667 7.500000 5.000000 6.608333 5.708333 7.775000 6.266667 ... 3.433333 6.516667 5.100000 3.816667 3.950000 5.241667 6.125000 6.633333 5.416667 4.158333
2015 6.108333 6.441667 6.033333 5.075000 6.216667 3.908333 5.733333 4.833333 6.900000 5.416667 ... 3.075000 5.600000 4.441667 3.633333 3.566667 4.466667 5.625000 6.750000 4.575000 4.250000
2016 5.991667 6.616667 5.308333 3.991667 5.433333 3.316667 5.083333 4.383333 6.016667 4.875000 ... 2.825000 4.808333 4.616667 3.425000 3.241667 4.008333 5.433333 6.000000 4.150000 5.283333

17 rows × 52 columns
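
A side note on unstack: by default it pivots the innermost index level (year, here) into the columns; passing level=0, as above, pivots state instead. The default view, with states as rows and years as columns, would be:

data.groupby(['state', 'year'])['unemployment_rate'].mean().unstack()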

More useful GroupBy functionality

Often we want more than one aggregate statistic; for example, we may want the min, mean, and max by state. We can use the .aggregate() function to get these:

data.groupby('state')['unemployment_rate'].aggregate([np.min, np.mean, np.max])
amin mean amax
state
Alabama 3.3 6.514706 12.3
Alaska 5.2 7.005392 9.1
Arizona 3.4 6.381373 11.1
Arkansas 3.4 5.997059 9.5
California 4.4 7.443627 12.7
Colorado 2.4 5.453922 9.2
Connecticut 1.7 5.821078 9.8
Delaware 2.9 5.214216 9.4
District of Columbia 4.7 7.306373 10.9
Florida 2.8 6.158824 11.5
Georgia 3.0 6.446078 10.9
Hawaii 1.9 4.456373 7.9
Idaho 2.2 5.525000 10.5
Illinois 3.6 6.936275 12.2
Indiana 2.2 6.184804 11.8
Iowa 2.3 4.373039 7.3
Kansas 3.2 5.084314 7.9
Kentucky 3.7 6.639216 11.9
Louisiana 3.6 6.197549 10.9
Maine 2.6 5.470098 9.5
Maryland 3.1 5.121078 8.3
Massachusetts 2.1 5.521078 9.6
Michigan 2.9 7.662255 15.4
Minnesota 2.7 4.886765 8.6
Mississippi 4.6 7.285784 11.7
Missouri 3.0 6.069118 10.5
Montana 2.7 5.023039 8.6
Nebraska 2.6 3.629412 5.4
Nevada 3.7 7.220098 13.9
New Hampshire 2.3 4.205882 7.0
New Jersey 3.3 6.248039 10.3
New Mexico 3.4 5.972059 9.0
New York 3.9 6.182353 9.6
North Carolina 3.0 6.741176 12.0
North Dakota 2.0 3.279902 5.1
Ohio 3.4 6.455392 12.0
Oklahoma 2.6 4.784804 7.6
Oregon 4.2 7.242647 12.6
Pennsylvania 3.6 5.946569 9.5
Puerto Rico 8.9 12.653922 17.3
Rhode Island 3.6 7.000490 12.2
South Carolina 3.2 7.130882 12.4
South Dakota 2.0 3.579412 5.9
Tennessee 3.3 6.369118 11.3
Texas 3.5 5.775000 8.6
Utah 2.1 4.716667 8.4
Vermont 2.3 4.247549 7.6
Virginia 1.8 4.543137 7.9
Washington 4.2 6.706863 11.3
West Virginia 3.6 6.168137 10.0
Wisconsin 2.8 5.668627 10.4
Wyoming 2.4 4.399510 8.1
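
The column labels amin and amax above come from NumPy (np.min and np.max are internally named amin and amax). Passing the function names as strings gives the same numbers with tidier labels:

data.groupby('state')['unemployment_rate'].aggregate(['min', 'mean', 'max'])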

We can dress it up a little too if we like. Older versions of pandas accepted a dict here that renamed the outputs, but that 'nested renamer' was removed and now raises a SpecificationError. The supported spelling is named aggregation, where each keyword argument becomes an output column name:

data.groupby('state')['unemployment_rate'] \
        .aggregate(**{'Min. Unemploy.' : 'min',
                      'Mean Unemploy.' : 'mean',
                      'Max. Unemploy.' : 'max'})

This returns the same numbers as above, with friendlier column labels.

In the challenge above, we used filter to select states based on a string, but there are more uses of filter than this simple one. For example, GroupBy.filter can keep or drop entire groups based on the group's properties:

def filter_func(x):
    # keep only those states whose unemployment rate never fell to 5 or below
    return x['unemployment_rate'].min() > 5

problem_states = data.groupby('state').filter(filter_func)
problem_states.state.unique()
array(['Alaska', 'Puerto Rico'], dtype=object)
problem_states.groupby(['state', 'year']).unemployment_rate.mean().unstack(level=0)
state Alaska Puerto Rico
year
2000 6.375000 10.125000
2001 6.433333 11.350000
2002 7.291667 12.250000
2003 7.825000 11.966667
2004 7.458333 10.575000
2005 6.883333 11.325000
2006 6.616667 10.500000
2007 6.333333 11.183333
2008 6.691667 11.766667
2009 7.741667 15.333333
2010 7.883333 16.358333
2011 7.591667 15.941667
2012 7.141667 14.466667
2013 6.925000 14.291667
2014 6.841667 13.900000
2015 6.441667 11.991667
2016 6.616667 11.791667
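
For short predicates like this, the same filter is often written inline with a lambda:

data.groupby('state').filter(lambda x: x['unemployment_rate'].min() > 5)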

Other useful functions

GroupBy also works alongside the following functions:

  • transform()

  • apply()

For lack of time we leave a full treatment for you to explore, but a short sketch of each follows.
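
Here is a minimal, hedged sketch using the same data frame: transform returns a result aligned with the original rows, which is handy for within-group comparisons, while apply runs an arbitrary function on each group's sub-DataFrame.

by_state = data.groupby('state')

# transform: broadcast the group mean back onto every row, e.g. to
# measure each observation's deviation from its state's average
data['rate_vs_state_mean'] = (
    data['unemployment_rate']
    - by_state['unemployment_rate'].transform('mean')
)

# apply: run an arbitrary function over each group's sub-DataFrame,
# here an aggregate unemployment rate per state (in percent)
by_state.apply(lambda g: 100 * g['qty_unemployed'].sum() / g['labour_force'].sum())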