Aggregate Functions and Group Statistics

  • Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

So far our work with Pandas has been purely about simple transformations of each row of data, whether selecting rows or creating new columns out of old ones. This is rarely enough: we typically want some degree of summarization of the data, either over an entire column or over parts (groups) of that column.

This is what we tackle here:

import pandas as pd
import numpy as np
data = pd.read_csv('out_data/state_labour_statistics.csv')

Simple Aggregation

Like NumPy, Pandas has some of the standard data aggregations:

rng = np.random.RandomState(1234567890)
series = pd.Series(rng.rand(5))
series
0    0.618795
1    0.591624
2    0.888684
3    0.891655
4    0.457567
dtype: float64
series.sum()
3.4483242705353976
series.mean(), series.median(), series.max()
(0.6896648541070796, 0.61879477158568, 0.8916548001156082)

For a DataFrame, these aggregations return results for each column:

df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df
A B
0 0.778188 0.400999
1 0.267064 0.705406
2 0.996106 0.405186
3 0.540095 0.949991
4 0.537522 0.030754
df.mean()
A    0.623795
B    0.498467
dtype: float64

If you want the aggregation across a row instead, you can specify that:

df.mean(axis='columns')
0    0.589594
1    0.486235
2    0.700646
3    0.745043
4    0.284138
dtype: float64

Often we want to see all of the common aggregates at once. Pandas provides the .describe() method to do this for us:

data.describe()
year unemployment_rate qty_unemployed qty_employed labour_force
count 10608.00000 10608.000000 1.060800e+04 1.060800e+04 1.060800e+04
mean 2008.00000 5.943194 1.849809e+05 2.751157e+06 2.936138e+06
std 4.89921 2.261388 2.433025e+05 3.021023e+06 3.242657e+06
min 2000.00000 1.700000 6.697000e+03 2.486420e+05 2.619910e+05
25% 2004.00000 4.400000 4.088675e+04 7.359608e+05 7.836765e+05
50% 2008.00000 5.500000 1.150930e+05 1.837300e+06 1.961558e+06
75% 2012.00000 7.000000 2.212182e+05 3.226836e+06 3.451084e+06
max 2016.00000 17.300000 2.316525e+06 1.819380e+07 1.920923e+07

Some useful aggregation functions in pandas are in the table below:

count()                Total number of items
first(), last()        First and last item
mean(), median()       Mean and median
min(), max()           Minimum and maximum
std(), var()           Standard deviation and variance
mad()                  Mean absolute deviation
prod()                 Product of all items
sum()                  Sum of all items
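
To see a few of these in action, here is a minimal sketch using the columns of the state_labour_statistics.csv data loaded above:

# a single aggregation on one column
data['unemployment_rate'].min(), data['unemployment_rate'].max()

# .agg() accepts a list of function names and computes them all at once
data['unemployment_rate'].agg(['count', 'mean', 'std'])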

Whilst these aggregations can reveal broad patterns in the data, often we want to go deeper and compute these summaries at the level of a group (like a state or a year). Pandas has the groupby function that allows us to efficiently look at aggregate statistics on subsets of the data.

GroupBy: The Split, Apply, Combine Paradigm

The canonical example of the groupby paradigm is best represented in a figure:

[insert figure]

In words, GroupBy does the following:

  • The split step breaks up a DataFrame into subsets depending on the value of a specified key

  • The apply step computes some aggregate function within each individual group

  • The combine step merges the results of the computations within groups back into a single dataset

We could certainly perform all of these operations ourselves, manually, but the tedium of the task, combined with a high probability of error, makes us immediately want a built-in function that abstracts away the individual steps and asks the user to think of the operation as a whole rather than as its parts.
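
To see why, here is a minimal sketch of what the manual version would look like, assuming the data frame loaded above with its state and unemployment_rate columns:

# split: one subset per state; apply: mean within each subset;
# combine: collect the per-group results into a single Series
results = {}
for state in data['state'].unique():
    subset = data[data['state'] == state]                  # split
    results[state] = subset['unemployment_rate'].mean()    # apply
manual_means = pd.Series(results).sort_index()             # combine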

As a start, let's look at the average unemployment rate by state. First, we specify the name of the column we want to split on:

data.groupby('state')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f4cf53d58e0>

Think of this object as an under-the-hood view of a DataFrame. Note that it does not compute anything until an aggregation is applied to it, i.e. it uses 'lazy evaluation'.
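
If you want to peek inside before aggregating, a GroupBy object can be iterated over; each element is a (key, sub-DataFrame) pair. A quick sketch:

grouped = data.groupby('state')
grouped.ngroups    # the number of distinct states

# iterating yields (group key, sub-DataFrame) pairs
for state, group in grouped:
    print(state, group.shape)
    break    # just show the first group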

Now let's get those mean unemployment rates:

data.groupby('state').unemployment_rate.mean()
state
Alabama                  6.514706
Alaska                   7.005392
Arizona                  6.381373
Arkansas                 5.997059
California               7.443627
Colorado                 5.453922
Connecticut              5.821078
Delaware                 5.214216
District of Columbia     7.306373
Florida                  6.158824
Georgia                  6.446078
Hawaii                   4.456373
Idaho                    5.525000
Illinois                 6.936275
Indiana                  6.184804
Iowa                     4.373039
Kansas                   5.084314
Kentucky                 6.639216
Louisiana                6.197549
Maine                    5.470098
Maryland                 5.121078
Massachusetts            5.521078
Michigan                 7.662255
Minnesota                4.886765
Mississippi              7.285784
Missouri                 6.069118
Montana                  5.023039
Nebraska                 3.629412
Nevada                   7.220098
New Hampshire            4.205882
New Jersey               6.248039
New Mexico               5.972059
New York                 6.182353
North Carolina           6.741176
North Dakota             3.279902
Ohio                     6.455392
Oklahoma                 4.784804
Oregon                   7.242647
Pennsylvania             5.946569
Puerto Rico             12.653922
Rhode Island             7.000490
South Carolina           7.130882
South Dakota             3.579412
Tennessee                6.369118
Texas                    5.775000
Utah                     4.716667
Vermont                  4.247549
Virginia                 4.543137
Washington               6.706863
West Virginia            6.168137
Wisconsin                5.668627
Wyoming                  4.399510
Name: unemployment_rate, dtype: float64
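
If we want to know which states sit at the extremes, we can sort the resulting Series. A minimal sketch:

state_means = data.groupby('state').unemployment_rate.mean()
state_means.sort_values().head()    # lowest average rates (North Dakota, ...)
state_means.sort_values().tail()    # highest average rates (..., Puerto Rico)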

Notice that if we want the aggregation function applied to all of the columns, we can skip specifying a column:

data.groupby('state').mean()
year unemployment_rate qty_unemployed qty_employed labour_force
state
Alabama 2008.0 6.514706 1.407627e+05 2.016128e+06 2.156891e+06
Alaska 2008.0 7.005392 2.449233e+04 3.251195e+05 3.496119e+05
Arizona 2008.0 6.381373 1.901428e+05 2.759190e+06 2.949333e+06
Arkansas 2008.0 5.997059 7.955364e+04 1.244863e+06 1.324416e+06
California 2008.0 7.443627 1.347057e+06 1.665503e+07 1.800209e+07
Colorado 2008.0 5.453922 1.452646e+05 2.503614e+06 2.648878e+06
Connecticut 2008.0 5.821078 1.082514e+05 1.736436e+06 1.844688e+06
Delaware 2008.0 5.214216 2.292001e+04 4.152155e+05 4.381355e+05
District of Columbia 2008.0 7.306373 2.490122e+04 3.137833e+05 3.386846e+05
Florida 2008.0 6.158824 5.573326e+05 8.397621e+06 8.954954e+06
Georgia 2008.0 6.446078 3.016779e+05 4.339163e+06 4.640840e+06
Hawaii 2008.0 4.456373 2.844513e+04 6.090308e+05 6.374759e+05
Idaho 2008.0 5.525000 4.104917e+04 7.013501e+05 7.423993e+05
Illinois 2008.0 6.936275 4.523743e+05 6.059067e+06 6.511441e+06
Indiana 2008.0 6.184804 1.977921e+05 3.002629e+06 3.200421e+06
Iowa 2008.0 4.373039 7.248086e+04 1.583003e+06 1.655483e+06
Kansas 2008.0 5.084314 7.486110e+04 1.394324e+06 1.469186e+06
Kentucky 2008.0 6.639216 1.339159e+05 1.874098e+06 2.008013e+06
Louisiana 2008.0 6.197549 1.283975e+05 1.938786e+06 2.067184e+06
Maine 2008.0 5.470098 3.788920e+04 6.543850e+05 6.922742e+05
Maryland 2008.0 5.121078 1.548394e+05 2.847295e+06 3.002135e+06
Massachusetts 2008.0 5.521078 1.911474e+05 3.266176e+06 3.457324e+06
Michigan 2008.0 7.662255 3.749323e+05 4.543822e+06 4.918755e+06
Minnesota 2008.0 4.886765 1.426212e+05 2.771413e+06 2.914034e+06
Mississippi 2008.0 7.285784 9.449842e+04 1.201615e+06 1.296114e+06
Missouri 2008.0 6.069118 1.842034e+05 2.849336e+06 3.033539e+06
Montana 2008.0 5.023039 2.489118e+04 4.706648e+05 4.955559e+05
Nebraska 2008.0 3.629412 3.576543e+04 9.494372e+05 9.852027e+05
Nevada 2008.0 7.220098 9.513693e+04 1.193879e+06 1.289016e+06
New Hampshire 2008.0 4.205882 3.072605e+04 6.977644e+05 7.284904e+05
New Jersey 2008.0 6.248039 2.798071e+05 4.176598e+06 4.456405e+06
New Mexico 2008.0 5.972059 5.465965e+04 8.586955e+05 9.133552e+05
New York 2008.0 6.182353 5.869909e+05 8.888198e+06 9.475189e+06
North Carolina 2008.0 6.741176 3.046283e+05 4.188285e+06 4.492913e+06
North Dakota 2008.0 3.279902 1.220031e+04 3.621385e+05 3.743388e+05
Ohio 2008.0 6.455392 3.763495e+05 5.449497e+06 5.825846e+06
Oklahoma 2008.0 4.784804 8.368540e+04 1.661057e+06 1.744743e+06
Oregon 2008.0 7.242647 1.389804e+05 1.776049e+06 1.915029e+06
Pennsylvania 2008.0 5.946569 3.771446e+05 5.952701e+06 6.329845e+06
Puerto Rico 2008.0 12.653922 1.603699e+05 1.112528e+06 1.272898e+06
Rhode Island 2008.0 7.000490 3.921885e+04 5.201720e+05 5.593909e+05
South Carolina 2008.0 7.130882 1.517508e+05 1.967512e+06 2.119263e+06
South Dakota 2008.0 3.579412 1.562588e+04 4.203784e+05 4.360043e+05
Tennessee 2008.0 6.369118 1.926002e+05 2.816181e+06 3.008781e+06
Texas 2008.0 5.775000 6.833513e+05 1.111923e+07 1.180258e+07
Utah 2008.0 4.716667 6.224774e+04 1.261578e+06 1.323826e+06
Vermont 2008.0 4.247549 1.492144e+04 3.346937e+05 3.496151e+05
Virginia 2008.0 4.543137 1.844006e+05 3.830106e+06 4.014506e+06
Washington 2008.0 6.706863 2.257521e+05 3.132887e+06 3.358639e+06
West Virginia 2008.0 6.168137 4.937335e+04 7.508340e+05 8.002073e+05
Wisconsin 2008.0 5.668627 1.737884e+05 2.889477e+06 3.063265e+06
Wyoming 2008.0 4.399510 1.283748e+04 2.771341e+05 2.899716e+05

and if we want multiple, but not all, columns, we pass a list of column names:

data.groupby('state')[['unemployment_rate', 'qty_unemployed']].mean()
unemployment_rate qty_unemployed
state
Alabama 6.514706 1.407627e+05
Alaska 7.005392 2.449233e+04
Arizona 6.381373 1.901428e+05
Arkansas 5.997059 7.955364e+04
California 7.443627 1.347057e+06
Colorado 5.453922 1.452646e+05
Connecticut 5.821078 1.082514e+05
Delaware 5.214216 2.292001e+04
District of Columbia 7.306373 2.490122e+04
Florida 6.158824 5.573326e+05
Georgia 6.446078 3.016779e+05
Hawaii 4.456373 2.844513e+04
Idaho 5.525000 4.104917e+04
Illinois 6.936275 4.523743e+05
Indiana 6.184804 1.977921e+05
Iowa 4.373039 7.248086e+04
Kansas 5.084314 7.486110e+04
Kentucky 6.639216 1.339159e+05
Louisiana 6.197549 1.283975e+05
Maine 5.470098 3.788920e+04
Maryland 5.121078 1.548394e+05
Massachusetts 5.521078 1.911474e+05
Michigan 7.662255 3.749323e+05
Minnesota 4.886765 1.426212e+05
Mississippi 7.285784 9.449842e+04
Missouri 6.069118 1.842034e+05
Montana 5.023039 2.489118e+04
Nebraska 3.629412 3.576543e+04
Nevada 7.220098 9.513693e+04
New Hampshire 4.205882 3.072605e+04
New Jersey 6.248039 2.798071e+05
New Mexico 5.972059 5.465965e+04
New York 6.182353 5.869909e+05
North Carolina 6.741176 3.046283e+05
North Dakota 3.279902 1.220031e+04
Ohio 6.455392 3.763495e+05
Oklahoma 4.784804 8.368540e+04
Oregon 7.242647 1.389804e+05
Pennsylvania 5.946569 3.771446e+05
Puerto Rico 12.653922 1.603699e+05
Rhode Island 7.000490 3.921885e+04
South Carolina 7.130882 1.517508e+05
South Dakota 3.579412 1.562588e+04
Tennessee 6.369118 1.926002e+05
Texas 5.775000 6.833513e+05
Utah 4.716667 6.224774e+04
Vermont 4.247549 1.492144e+04
Virginia 4.543137 1.844006e+05
Washington 6.706863 2.257521e+05
West Virginia 6.168137 4.937335e+04
Wisconsin 5.668627 1.737884e+05
Wyoming 4.399510 1.283748e+04

Very Cool!

Challenge

Try and compute statistics where the grouping is described by multiple variables:

  1. Compute the average unemployment rate by state-year

  2. Compute the average unemployment rate by state-year-month

  3. Compute the average unemployment rate by state-year for the Carolinas (Hint: use the .filter() function after the computation)

  4. Compute the average unemployment rate by state-year and use unstack to view the data in a way that is different from the default output

Partial Solution

data.groupby(['state', 'year'])['unemployment_rate'].mean().filter(like='Carolina')
state           year
North Carolina  2000     3.666667
                2001     5.516667
                2002     6.658333
                2003     6.375000
                2004     5.491667
                2005     5.233333
                2006     4.750000
                2007     4.725000
                2008     6.100000
                2009    10.575000
                2010    10.850000
                2011    10.275000
                2012     9.291667
                2013     7.983333
                2014     6.300000
                2015     5.750000
                2016     5.058333
South Carolina  2000     3.758333
                2001     5.216667
                2002     5.808333
                2003     6.908333
                2004     6.825000
                2005     6.725000
                2006     6.433333
                2007     5.666667
                2008     6.791667
                2009    11.258333
                2010    11.158333
                2011    10.558333
                2012     9.216667
                2013     7.641667
                2014     6.441667
                2015     5.975000
                2016     4.841667
Name: unemployment_rate, dtype: float64
data.groupby(['state', 'year'])['unemployment_rate'].mean().unstack(level=0)
state Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida ... South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming
year
2000 4.583333 6.375000 3.950000 4.258333 4.941667 2.758333 2.391667 3.733333 5.608333 3.683333 ... 2.458333 3.866667 4.291667 3.341667 2.808333 2.300000 5.166667 5.483333 3.525000 3.908333
2001 5.125000 6.433333 4.791667 4.958333 5.433333 3.800000 3.133333 3.500000 6.333333 4.641667 ... 3.091667 4.550000 4.941667 4.416667 3.333333 3.233333 6.300000 5.008333 4.525000 3.825000
2002 5.900000 7.291667 6.083333 5.483333 6.666667 5.533333 4.333333 4.025000 6.408333 5.633333 ... 3.183333 5.183333 6.341667 5.800000 3.991667 4.216667 7.408333 5.916667 5.375000 4.041667
2003 6.016667 7.825000 5.741667 5.933333 6.800000 5.958333 5.391667 4.266667 6.833333 5.208333 ... 3.533333 5.633333 6.683333 5.633333 4.266667 4.108333 7.408333 6.050000 5.700000 4.275000
2004 5.683333 7.458333 5.033333 5.708333 6.208333 5.483333 4.966667 3.991667 7.783333 4.633333 ... 3.725000 5.333333 5.941667 4.991667 3.675000 3.825000 6.266667 5.325000 5.025000 3.808333
2005 4.491667 6.883333 4.683333 5.250000 5.400000 5.025000 4.883333 4.141667 6.375000 3.691667 ... 3.816667 5.550000 5.408333 4.091667 3.475000 3.566667 5.550000 5.108333 4.741667 3.583333
2006 4.033333 6.616667 4.208333 5.158333 4.900000 4.308333 4.333333 3.575000 5.791667 3.225000 ... 3.108333 5.200000 4.891667 2.975000 3.683333 3.100000 5.041667 4.941667 4.741667 3.175000
2007 3.983333 6.333333 3.858333 5.308333 5.375000 3.733333 4.475000 3.441667 5.525000 4.008333 ... 2.841667 4.675000 4.291667 2.566667 4.008333 3.033333 4.708333 4.583333 4.900000 2.816667
2008 5.658333 6.691667 6.125000 5.466667 7.275000 4.825000 5.683333 4.941667 6.458333 6.283333 ... 3.050000 6.575000 4.808333 3.591667 4.683333 3.941667 5.383333 4.358333 4.908333 3.050000
2009 11.033333 7.741667 9.933333 7.850000 11.141667 7.283333 7.933333 8.266667 9.275000 10.408333 ... 4.925000 10.483333 7.558333 7.325000 6.591667 6.700000 9.141667 7.725000 8.575000 6.325000
2010 10.541667 7.883333 10.366667 8.191667 12.225000 8.725000 9.108333 8.416667 9.416667 11.041667 ... 4.983333 9.675000 8.150000 7.833333 6.091667 7.150000 10.000000 8.641667 8.666667 6.458333
2011 9.641667 7.591667 9.500000 8.308333 11.716667 8.358333 8.808333 7.500000 10.158333 9.991667 ... 4.741667 8.991667 7.758333 6.733333 5.475000 6.591667 9.291667 8.058333 7.758333 5.808333
2012 7.941667 7.141667 8.350000 7.575000 10.375000 7.875000 8.341667 7.200000 9.050000 8.458333 ... 4.275000 7.841667 6.725000 5.408333 4.950000 6.033333 8.133333 7.508333 7.041667 5.316667
2013 7.225000 6.925000 7.741667 7.341667 8.933333 6.825000 7.750000 6.716667 8.500000 7.233333 ... 3.783333 7.791667 6.225000 4.600000 4.416667 5.716667 7.033333 6.766667 6.741667 4.708333
2014 6.791667 6.841667 6.775000 6.091667 7.500000 5.000000 6.608333 5.708333 7.775000 6.266667 ... 3.433333 6.516667 5.100000 3.816667 3.950000 5.241667 6.125000 6.633333 5.416667 4.158333
2015 6.108333 6.441667 6.033333 5.075000 6.216667 3.908333 5.733333 4.833333 6.900000 5.416667 ... 3.075000 5.600000 4.441667 3.633333 3.566667 4.466667 5.625000 6.750000 4.575000 4.250000
2016 5.991667 6.616667 5.308333 3.991667 5.433333 3.316667 5.083333 4.383333 6.016667 4.875000 ... 2.825000 4.808333 4.616667 3.425000 3.241667 4.008333 5.433333 6.000000 4.150000 5.283333

17 rows × 52 columns
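
A side note on unstack: by default it pivots the innermost index level (year, here) into the columns; passing level=0, as above, pivots state instead. The default view, with states as rows and years as columns, would be:

data.groupby(['state', 'year'])['unemployment_rate'].mean().unstack()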

More useful GroupBy functionality

Often we want more than one aggregate statistic; for example, we may want the min, mean, and max by state. We can use the .aggregate() function to get these:

data.groupby('state')['unemployment_rate'].aggregate([np.min, np.mean, np.max])
amin mean amax
state
Alabama 3.3 6.514706 12.3
Alaska 5.2 7.005392 9.1
Arizona 3.4 6.381373 11.1
Arkansas 3.4 5.997059 9.5
California 4.4 7.443627 12.7
Colorado 2.4 5.453922 9.2
Connecticut 1.7 5.821078 9.8
Delaware 2.9 5.214216 9.4
District of Columbia 4.7 7.306373 10.9
Florida 2.8 6.158824 11.5
Georgia 3.0 6.446078 10.9
Hawaii 1.9 4.456373 7.9
Idaho 2.2 5.525000 10.5
Illinois 3.6 6.936275 12.2
Indiana 2.2 6.184804 11.8
Iowa 2.3 4.373039 7.3
Kansas 3.2 5.084314 7.9
Kentucky 3.7 6.639216 11.9
Louisiana 3.6 6.197549 10.9
Maine 2.6 5.470098 9.5
Maryland 3.1 5.121078 8.3
Massachusetts 2.1 5.521078 9.6
Michigan 2.9 7.662255 15.4
Minnesota 2.7 4.886765 8.6
Mississippi 4.6 7.285784 11.7
Missouri 3.0 6.069118 10.5
Montana 2.7 5.023039 8.6
Nebraska 2.6 3.629412 5.4
Nevada 3.7 7.220098 13.9
New Hampshire 2.3 4.205882 7.0
New Jersey 3.3 6.248039 10.3
New Mexico 3.4 5.972059 9.0
New York 3.9 6.182353 9.6
North Carolina 3.0 6.741176 12.0
North Dakota 2.0 3.279902 5.1
Ohio 3.4 6.455392 12.0
Oklahoma 2.6 4.784804 7.6
Oregon 4.2 7.242647 12.6
Pennsylvania 3.6 5.946569 9.5
Puerto Rico 8.9 12.653922 17.3
Rhode Island 3.6 7.000490 12.2
South Carolina 3.2 7.130882 12.4
South Dakota 2.0 3.579412 5.9
Tennessee 3.3 6.369118 11.3
Texas 3.5 5.775000 8.6
Utah 2.1 4.716667 8.4
Vermont 2.3 4.247549 7.6
Virginia 1.8 4.543137 7.9
Washington 4.2 6.706863 11.3
West Virginia 3.6 6.168137 10.0
Wisconsin 2.8 5.668627 10.4
Wyoming 2.4 4.399510 8.1
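
The column labels amin and amax above come from NumPy (np.min and np.max are internally named amin and amax). Passing the function names as strings gives the same numbers with tidier labels:

data.groupby('state')['unemployment_rate'].aggregate(['min', 'mean', 'max'])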

We can dress it up a little too if we like. Older versions of pandas accepted a dict here that renamed the outputs, but that 'nested renamer' was removed and now raises a SpecificationError. The supported spelling is named aggregation, where each keyword argument becomes an output column name:

data.groupby('state')['unemployment_rate'] \
        .aggregate(**{'Min. Unemploy.' : 'min',
                      'Mean Unemploy.' : 'mean',
                      'Max. Unemploy.' : 'max'})

This returns the same numbers as above, with friendlier column labels.

In the challenge above, we used filter to select states based on a string, but there are more uses of filter than this simple one. For example, GroupBy.filter can keep or drop entire groups based on the group's properties:

def filter_func(x):
    # keep only those states whose unemployment rate never fell to 5 or below
    return x['unemployment_rate'].min() > 5

problem_states = data.groupby('state').filter(filter_func)
problem_states.state.unique()
array(['Alaska', 'Puerto Rico'], dtype=object)
problem_states.groupby(['state', 'year']).unemployment_rate.mean().unstack(level=0)
state Alaska Puerto Rico
year
2000 6.375000 10.125000
2001 6.433333 11.350000
2002 7.291667 12.250000
2003 7.825000 11.966667
2004 7.458333 10.575000
2005 6.883333 11.325000
2006 6.616667 10.500000
2007 6.333333 11.183333
2008 6.691667 11.766667
2009 7.741667 15.333333
2010 7.883333 16.358333
2011 7.591667 15.941667
2012 7.141667 14.466667
2013 6.925000 14.291667
2014 6.841667 13.900000
2015 6.441667 11.991667
2016 6.616667 11.791667
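
For short predicates like this, the same filter is often written inline with a lambda:

data.groupby('state').filter(lambda x: x['unemployment_rate'].min() > 5)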

Other useful functions

GroupBy also works alongside the following functions:

  • transform()

  • apply()

For lack of time we leave a full treatment for you to explore, but a short sketch of each follows.
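
Here is a minimal, hedged sketch using the same data frame: transform returns a result aligned with the original rows, which is handy for within-group comparisons, while apply runs an arbitrary function on each group's sub-DataFrame.

by_state = data.groupby('state')

# transform: broadcast the group mean back onto every row, e.g. to
# measure each observation's deviation from its state's average
data['rate_vs_state_mean'] = (
    data['unemployment_rate']
    - by_state['unemployment_rate'].transform('mean')
)

# apply: run an arbitrary function over each group's sub-DataFrame,
# here an aggregate unemployment rate per state (in percent)
by_state.apply(lambda g: 100 * g['qty_unemployed'].sum() / g['labour_force'].sum())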