Aggregate Functions and Group Statistics
Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer
So far our work with Pandas has been purely about simple transformations of each row of data, whether selecting rows or creating new columns out of old ones. This is rarely enough: we typically want some degree of summarization of the data, either over an entire column or over parts (groups) of the column in question.
This is what we will go after here:
import pandas as pd
import numpy as np
data = pd.read_csv('out_data/state_labour_statistics.csv')
Simple Aggregation
Like NumPy, Pandas has some of the standard data aggregations:
rng = np.random.RandomState(1234567890)
series = pd.Series(rng.rand(5))
series
0 0.618795
1 0.591624
2 0.888684
3 0.891655
4 0.457567
dtype: float64
series.sum()
3.4483242705353976
series.mean(), series.median(), series.max()
(0.6896648541070796, 0.61879477158568, 0.8916548001156082)
For a DataFrame, these aggregations return a result for each column:
df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
df
| | A | B |
|---|---|---|
0 | 0.778188 | 0.400999 |
1 | 0.267064 | 0.705406 |
2 | 0.996106 | 0.405186 |
3 | 0.540095 | 0.949991 |
4 | 0.537522 | 0.030754 |
df.mean()
A 0.623795
B 0.498467
dtype: float64
If you want the aggregation computed across each row instead, you can specify that via the axis:
df.mean(axis='columns')
0 0.589594
1 0.486235
2 0.700646
3 0.745043
4 0.284138
dtype: float64
Often we want to see all of the common aggregates at once. Pandas provides the .describe() method to do this for us:
data.describe()
| | year | unemployment_rate | qty_unemployed | qty_employed | labour_force |
|---|---|---|---|---|---|
count | 10608.00000 | 10608.000000 | 1.060800e+04 | 1.060800e+04 | 1.060800e+04 |
mean | 2008.00000 | 5.943194 | 1.849809e+05 | 2.751157e+06 | 2.936138e+06 |
std | 4.89921 | 2.261388 | 2.433025e+05 | 3.021023e+06 | 3.242657e+06 |
min | 2000.00000 | 1.700000 | 6.697000e+03 | 2.486420e+05 | 2.619910e+05 |
25% | 2004.00000 | 4.400000 | 4.088675e+04 | 7.359608e+05 | 7.836765e+05 |
50% | 2008.00000 | 5.500000 | 1.150930e+05 | 1.837300e+06 | 1.961558e+06 |
75% | 2012.00000 | 7.000000 | 2.212182e+05 | 3.226836e+06 | 3.451084e+06 |
max | 2016.00000 | 17.300000 | 2.316525e+06 | 1.819380e+07 | 1.920923e+07 |
Some useful aggregation functions in pandas are in the table below:
[insert table]
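In the meantime, here is a quick sketch of several of the common aggregation methods on a toy frame (the names and values here are made up purely for illustration):

```python
import pandas as pd

# a toy frame to exercise the common aggregation methods
toy = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                    'B': [10.0, 20.0, 30.0, 40.0]})

print(toy.count())   # number of non-null entries per column
print(toy.sum())     # column sums
print(toy.prod())    # column products
print(toy.std())     # sample standard deviation per column
print(toy.median())  # column medians
```

Each call returns a Series indexed by column name, just like df.mean() above.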
Whilst these aggregations can reveal some patterns in the data, often we want to go deeper and compute these summaries at the level of a group (like a state or a year). Pandas has the groupby
function that allows us to efficiently look at aggregate statistics on subsets of the data.
GroupBy: The Split, Apply, Combine Paradigm
The canonical example of the groupby paradigm is best represented in a figure:
[insert figure]
In words, GroupBy does the following:
- The split step breaks up a DataFrame into subsets depending on the value of a specified key
- The apply step computes some aggregate function within each individual group
- The combine step merges the results of the within-group computations back into a single dataset
We could certainly perform all of these operations ourselves, manually, but the tedium of the task, combined with a high probability of error, makes us immediately want a built-in function that abstracts away the individual steps and asks the user to think of the operation as a whole.
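To make the three steps concrete, here is a minimal sketch on a made-up two-column frame that performs split, apply, and combine by hand, alongside the one-line groupby equivalent:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, 4]})

# split: one sub-frame per distinct key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# apply: an aggregate (here, the sum) within each group
sums = {k: piece['value'].sum() for k, piece in pieces.items()}

# combine: collect the per-group results into a single object
combined = pd.Series(sums)

# groupby performs all three steps in one call
print(combined)
print(df.groupby('key')['value'].sum())
```

Both print the same per-key sums; groupby simply hides the bookkeeping.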
As a start - let’s look at the average unemployment rate by state. As a first step we must specify the name of the column we are looking to split on:
data.groupby('state')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f4cf53d58e0>
Think of this object as an under-the-hood view of a DataFrame. Note that it does not compute anything until an aggregation is applied to it, i.e. it uses 'lazy evaluation'.
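One way to see the split without triggering any computation is to iterate over the GroupBy object; each element is a (key, sub-frame) pair. (A toy frame stands in for the labour data here.)

```python
import pandas as pd

toy = pd.DataFrame({'state': ['WI', 'WI', 'MN'],
                    'unemployment_rate': [5.0, 6.0, 4.0]})

grouped = toy.groupby('state')  # nothing is computed yet

for key, group in grouped:
    print(key, group.shape)  # the sub-frame for each state

# the aggregation only happens when we ask for it
print(grouped['unemployment_rate'].mean())
```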
Now let's get those mean unemployment rates:
data.groupby('state').unemployment_rate.mean()
state
Alabama 6.514706
Alaska 7.005392
Arizona 6.381373
Arkansas 5.997059
California 7.443627
Colorado 5.453922
Connecticut 5.821078
Delaware 5.214216
District of Columbia 7.306373
Florida 6.158824
Georgia 6.446078
Hawaii 4.456373
Idaho 5.525000
Illinois 6.936275
Indiana 6.184804
Iowa 4.373039
Kansas 5.084314
Kentucky 6.639216
Louisiana 6.197549
Maine 5.470098
Maryland 5.121078
Massachusetts 5.521078
Michigan 7.662255
Minnesota 4.886765
Mississippi 7.285784
Missouri 6.069118
Montana 5.023039
Nebraska 3.629412
Nevada 7.220098
New Hampshire 4.205882
New Jersey 6.248039
New Mexico 5.972059
New York 6.182353
North Carolina 6.741176
North Dakota 3.279902
Ohio 6.455392
Oklahoma 4.784804
Oregon 7.242647
Pennsylvania 5.946569
Puerto Rico 12.653922
Rhode Island 7.000490
South Carolina 7.130882
South Dakota 3.579412
Tennessee 6.369118
Texas 5.775000
Utah 4.716667
Vermont 4.247549
Virginia 4.543137
Washington 6.706863
West Virginia 6.168137
Wisconsin 5.668627
Wyoming 4.399510
Name: unemployment_rate, dtype: float64
Notice that if we want the aggregation function applied to all of the columns, we can skip specifying a column:
data.groupby('state').mean()
| state | year | unemployment_rate | qty_unemployed | qty_employed | labour_force |
|---|---|---|---|---|---|
Alabama | 2008.0 | 6.514706 | 1.407627e+05 | 2.016128e+06 | 2.156891e+06 |
Alaska | 2008.0 | 7.005392 | 2.449233e+04 | 3.251195e+05 | 3.496119e+05 |
Arizona | 2008.0 | 6.381373 | 1.901428e+05 | 2.759190e+06 | 2.949333e+06 |
Arkansas | 2008.0 | 5.997059 | 7.955364e+04 | 1.244863e+06 | 1.324416e+06 |
California | 2008.0 | 7.443627 | 1.347057e+06 | 1.665503e+07 | 1.800209e+07 |
Colorado | 2008.0 | 5.453922 | 1.452646e+05 | 2.503614e+06 | 2.648878e+06 |
Connecticut | 2008.0 | 5.821078 | 1.082514e+05 | 1.736436e+06 | 1.844688e+06 |
Delaware | 2008.0 | 5.214216 | 2.292001e+04 | 4.152155e+05 | 4.381355e+05 |
District of Columbia | 2008.0 | 7.306373 | 2.490122e+04 | 3.137833e+05 | 3.386846e+05 |
Florida | 2008.0 | 6.158824 | 5.573326e+05 | 8.397621e+06 | 8.954954e+06 |
Georgia | 2008.0 | 6.446078 | 3.016779e+05 | 4.339163e+06 | 4.640840e+06 |
Hawaii | 2008.0 | 4.456373 | 2.844513e+04 | 6.090308e+05 | 6.374759e+05 |
Idaho | 2008.0 | 5.525000 | 4.104917e+04 | 7.013501e+05 | 7.423993e+05 |
Illinois | 2008.0 | 6.936275 | 4.523743e+05 | 6.059067e+06 | 6.511441e+06 |
Indiana | 2008.0 | 6.184804 | 1.977921e+05 | 3.002629e+06 | 3.200421e+06 |
Iowa | 2008.0 | 4.373039 | 7.248086e+04 | 1.583003e+06 | 1.655483e+06 |
Kansas | 2008.0 | 5.084314 | 7.486110e+04 | 1.394324e+06 | 1.469186e+06 |
Kentucky | 2008.0 | 6.639216 | 1.339159e+05 | 1.874098e+06 | 2.008013e+06 |
Louisiana | 2008.0 | 6.197549 | 1.283975e+05 | 1.938786e+06 | 2.067184e+06 |
Maine | 2008.0 | 5.470098 | 3.788920e+04 | 6.543850e+05 | 6.922742e+05 |
Maryland | 2008.0 | 5.121078 | 1.548394e+05 | 2.847295e+06 | 3.002135e+06 |
Massachusetts | 2008.0 | 5.521078 | 1.911474e+05 | 3.266176e+06 | 3.457324e+06 |
Michigan | 2008.0 | 7.662255 | 3.749323e+05 | 4.543822e+06 | 4.918755e+06 |
Minnesota | 2008.0 | 4.886765 | 1.426212e+05 | 2.771413e+06 | 2.914034e+06 |
Mississippi | 2008.0 | 7.285784 | 9.449842e+04 | 1.201615e+06 | 1.296114e+06 |
Missouri | 2008.0 | 6.069118 | 1.842034e+05 | 2.849336e+06 | 3.033539e+06 |
Montana | 2008.0 | 5.023039 | 2.489118e+04 | 4.706648e+05 | 4.955559e+05 |
Nebraska | 2008.0 | 3.629412 | 3.576543e+04 | 9.494372e+05 | 9.852027e+05 |
Nevada | 2008.0 | 7.220098 | 9.513693e+04 | 1.193879e+06 | 1.289016e+06 |
New Hampshire | 2008.0 | 4.205882 | 3.072605e+04 | 6.977644e+05 | 7.284904e+05 |
New Jersey | 2008.0 | 6.248039 | 2.798071e+05 | 4.176598e+06 | 4.456405e+06 |
New Mexico | 2008.0 | 5.972059 | 5.465965e+04 | 8.586955e+05 | 9.133552e+05 |
New York | 2008.0 | 6.182353 | 5.869909e+05 | 8.888198e+06 | 9.475189e+06 |
North Carolina | 2008.0 | 6.741176 | 3.046283e+05 | 4.188285e+06 | 4.492913e+06 |
North Dakota | 2008.0 | 3.279902 | 1.220031e+04 | 3.621385e+05 | 3.743388e+05 |
Ohio | 2008.0 | 6.455392 | 3.763495e+05 | 5.449497e+06 | 5.825846e+06 |
Oklahoma | 2008.0 | 4.784804 | 8.368540e+04 | 1.661057e+06 | 1.744743e+06 |
Oregon | 2008.0 | 7.242647 | 1.389804e+05 | 1.776049e+06 | 1.915029e+06 |
Pennsylvania | 2008.0 | 5.946569 | 3.771446e+05 | 5.952701e+06 | 6.329845e+06 |
Puerto Rico | 2008.0 | 12.653922 | 1.603699e+05 | 1.112528e+06 | 1.272898e+06 |
Rhode Island | 2008.0 | 7.000490 | 3.921885e+04 | 5.201720e+05 | 5.593909e+05 |
South Carolina | 2008.0 | 7.130882 | 1.517508e+05 | 1.967512e+06 | 2.119263e+06 |
South Dakota | 2008.0 | 3.579412 | 1.562588e+04 | 4.203784e+05 | 4.360043e+05 |
Tennessee | 2008.0 | 6.369118 | 1.926002e+05 | 2.816181e+06 | 3.008781e+06 |
Texas | 2008.0 | 5.775000 | 6.833513e+05 | 1.111923e+07 | 1.180258e+07 |
Utah | 2008.0 | 4.716667 | 6.224774e+04 | 1.261578e+06 | 1.323826e+06 |
Vermont | 2008.0 | 4.247549 | 1.492144e+04 | 3.346937e+05 | 3.496151e+05 |
Virginia | 2008.0 | 4.543137 | 1.844006e+05 | 3.830106e+06 | 4.014506e+06 |
Washington | 2008.0 | 6.706863 | 2.257521e+05 | 3.132887e+06 | 3.358639e+06 |
West Virginia | 2008.0 | 6.168137 | 4.937335e+04 | 7.508340e+05 | 8.002073e+05 |
Wisconsin | 2008.0 | 5.668627 | 1.737884e+05 | 2.889477e+06 | 3.063265e+06 |
Wyoming | 2008.0 | 4.399510 | 1.283748e+04 | 2.771341e+05 | 2.899716e+05 |
and if we want multiple, but not all, columns, we pass a list of column names (indexing the GroupBy with a bare sequence of keys is deprecated):
data.groupby('state')[['unemployment_rate', 'qty_unemployed']].mean()
| state | unemployment_rate | qty_unemployed |
|---|---|---|
Alabama | 6.514706 | 1.407627e+05 |
Alaska | 7.005392 | 2.449233e+04 |
Arizona | 6.381373 | 1.901428e+05 |
Arkansas | 5.997059 | 7.955364e+04 |
California | 7.443627 | 1.347057e+06 |
Colorado | 5.453922 | 1.452646e+05 |
Connecticut | 5.821078 | 1.082514e+05 |
Delaware | 5.214216 | 2.292001e+04 |
District of Columbia | 7.306373 | 2.490122e+04 |
Florida | 6.158824 | 5.573326e+05 |
Georgia | 6.446078 | 3.016779e+05 |
Hawaii | 4.456373 | 2.844513e+04 |
Idaho | 5.525000 | 4.104917e+04 |
Illinois | 6.936275 | 4.523743e+05 |
Indiana | 6.184804 | 1.977921e+05 |
Iowa | 4.373039 | 7.248086e+04 |
Kansas | 5.084314 | 7.486110e+04 |
Kentucky | 6.639216 | 1.339159e+05 |
Louisiana | 6.197549 | 1.283975e+05 |
Maine | 5.470098 | 3.788920e+04 |
Maryland | 5.121078 | 1.548394e+05 |
Massachusetts | 5.521078 | 1.911474e+05 |
Michigan | 7.662255 | 3.749323e+05 |
Minnesota | 4.886765 | 1.426212e+05 |
Mississippi | 7.285784 | 9.449842e+04 |
Missouri | 6.069118 | 1.842034e+05 |
Montana | 5.023039 | 2.489118e+04 |
Nebraska | 3.629412 | 3.576543e+04 |
Nevada | 7.220098 | 9.513693e+04 |
New Hampshire | 4.205882 | 3.072605e+04 |
New Jersey | 6.248039 | 2.798071e+05 |
New Mexico | 5.972059 | 5.465965e+04 |
New York | 6.182353 | 5.869909e+05 |
North Carolina | 6.741176 | 3.046283e+05 |
North Dakota | 3.279902 | 1.220031e+04 |
Ohio | 6.455392 | 3.763495e+05 |
Oklahoma | 4.784804 | 8.368540e+04 |
Oregon | 7.242647 | 1.389804e+05 |
Pennsylvania | 5.946569 | 3.771446e+05 |
Puerto Rico | 12.653922 | 1.603699e+05 |
Rhode Island | 7.000490 | 3.921885e+04 |
South Carolina | 7.130882 | 1.517508e+05 |
South Dakota | 3.579412 | 1.562588e+04 |
Tennessee | 6.369118 | 1.926002e+05 |
Texas | 5.775000 | 6.833513e+05 |
Utah | 4.716667 | 6.224774e+04 |
Vermont | 4.247549 | 1.492144e+04 |
Virginia | 4.543137 | 1.844006e+05 |
Washington | 6.706863 | 2.257521e+05 |
West Virginia | 6.168137 | 4.937335e+04 |
Wisconsin | 5.668627 | 1.737884e+05 |
Wyoming | 4.399510 | 1.283748e+04 |
Very Cool!
Challenge
Try to compute statistics where the grouping is described by multiple variables:

- Compute the average unemployment rate by state-year
- Compute the average unemployment rate by state-year-month
- Compute the average unemployment rate by state-year for the Carolinas (Hint: use the .filter() function after the computation)
- Compute the average unemployment rate by state-year and use unstack to view the data in a way that differs from the default output
Partial Solution
data.groupby(['state', 'year'])['unemployment_rate'].mean().filter(like='Carolina')
state year
North Carolina 2000 3.666667
2001 5.516667
2002 6.658333
2003 6.375000
2004 5.491667
2005 5.233333
2006 4.750000
2007 4.725000
2008 6.100000
2009 10.575000
2010 10.850000
2011 10.275000
2012 9.291667
2013 7.983333
2014 6.300000
2015 5.750000
2016 5.058333
South Carolina 2000 3.758333
2001 5.216667
2002 5.808333
2003 6.908333
2004 6.825000
2005 6.725000
2006 6.433333
2007 5.666667
2008 6.791667
2009 11.258333
2010 11.158333
2011 10.558333
2012 9.216667
2013 7.641667
2014 6.441667
2015 5.975000
2016 4.841667
Name: unemployment_rate, dtype: float64
data.groupby(['state', 'year'])['unemployment_rate'].mean().unstack(level=0)
| year | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | District of Columbia | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2000 | 4.583333 | 6.375000 | 3.950000 | 4.258333 | 4.941667 | 2.758333 | 2.391667 | 3.733333 | 5.608333 | 3.683333 | ... | 2.458333 | 3.866667 | 4.291667 | 3.341667 | 2.808333 | 2.300000 | 5.166667 | 5.483333 | 3.525000 | 3.908333 |
2001 | 5.125000 | 6.433333 | 4.791667 | 4.958333 | 5.433333 | 3.800000 | 3.133333 | 3.500000 | 6.333333 | 4.641667 | ... | 3.091667 | 4.550000 | 4.941667 | 4.416667 | 3.333333 | 3.233333 | 6.300000 | 5.008333 | 4.525000 | 3.825000 |
2002 | 5.900000 | 7.291667 | 6.083333 | 5.483333 | 6.666667 | 5.533333 | 4.333333 | 4.025000 | 6.408333 | 5.633333 | ... | 3.183333 | 5.183333 | 6.341667 | 5.800000 | 3.991667 | 4.216667 | 7.408333 | 5.916667 | 5.375000 | 4.041667 |
2003 | 6.016667 | 7.825000 | 5.741667 | 5.933333 | 6.800000 | 5.958333 | 5.391667 | 4.266667 | 6.833333 | 5.208333 | ... | 3.533333 | 5.633333 | 6.683333 | 5.633333 | 4.266667 | 4.108333 | 7.408333 | 6.050000 | 5.700000 | 4.275000 |
2004 | 5.683333 | 7.458333 | 5.033333 | 5.708333 | 6.208333 | 5.483333 | 4.966667 | 3.991667 | 7.783333 | 4.633333 | ... | 3.725000 | 5.333333 | 5.941667 | 4.991667 | 3.675000 | 3.825000 | 6.266667 | 5.325000 | 5.025000 | 3.808333 |
2005 | 4.491667 | 6.883333 | 4.683333 | 5.250000 | 5.400000 | 5.025000 | 4.883333 | 4.141667 | 6.375000 | 3.691667 | ... | 3.816667 | 5.550000 | 5.408333 | 4.091667 | 3.475000 | 3.566667 | 5.550000 | 5.108333 | 4.741667 | 3.583333 |
2006 | 4.033333 | 6.616667 | 4.208333 | 5.158333 | 4.900000 | 4.308333 | 4.333333 | 3.575000 | 5.791667 | 3.225000 | ... | 3.108333 | 5.200000 | 4.891667 | 2.975000 | 3.683333 | 3.100000 | 5.041667 | 4.941667 | 4.741667 | 3.175000 |
2007 | 3.983333 | 6.333333 | 3.858333 | 5.308333 | 5.375000 | 3.733333 | 4.475000 | 3.441667 | 5.525000 | 4.008333 | ... | 2.841667 | 4.675000 | 4.291667 | 2.566667 | 4.008333 | 3.033333 | 4.708333 | 4.583333 | 4.900000 | 2.816667 |
2008 | 5.658333 | 6.691667 | 6.125000 | 5.466667 | 7.275000 | 4.825000 | 5.683333 | 4.941667 | 6.458333 | 6.283333 | ... | 3.050000 | 6.575000 | 4.808333 | 3.591667 | 4.683333 | 3.941667 | 5.383333 | 4.358333 | 4.908333 | 3.050000 |
2009 | 11.033333 | 7.741667 | 9.933333 | 7.850000 | 11.141667 | 7.283333 | 7.933333 | 8.266667 | 9.275000 | 10.408333 | ... | 4.925000 | 10.483333 | 7.558333 | 7.325000 | 6.591667 | 6.700000 | 9.141667 | 7.725000 | 8.575000 | 6.325000 |
2010 | 10.541667 | 7.883333 | 10.366667 | 8.191667 | 12.225000 | 8.725000 | 9.108333 | 8.416667 | 9.416667 | 11.041667 | ... | 4.983333 | 9.675000 | 8.150000 | 7.833333 | 6.091667 | 7.150000 | 10.000000 | 8.641667 | 8.666667 | 6.458333 |
2011 | 9.641667 | 7.591667 | 9.500000 | 8.308333 | 11.716667 | 8.358333 | 8.808333 | 7.500000 | 10.158333 | 9.991667 | ... | 4.741667 | 8.991667 | 7.758333 | 6.733333 | 5.475000 | 6.591667 | 9.291667 | 8.058333 | 7.758333 | 5.808333 |
2012 | 7.941667 | 7.141667 | 8.350000 | 7.575000 | 10.375000 | 7.875000 | 8.341667 | 7.200000 | 9.050000 | 8.458333 | ... | 4.275000 | 7.841667 | 6.725000 | 5.408333 | 4.950000 | 6.033333 | 8.133333 | 7.508333 | 7.041667 | 5.316667 |
2013 | 7.225000 | 6.925000 | 7.741667 | 7.341667 | 8.933333 | 6.825000 | 7.750000 | 6.716667 | 8.500000 | 7.233333 | ... | 3.783333 | 7.791667 | 6.225000 | 4.600000 | 4.416667 | 5.716667 | 7.033333 | 6.766667 | 6.741667 | 4.708333 |
2014 | 6.791667 | 6.841667 | 6.775000 | 6.091667 | 7.500000 | 5.000000 | 6.608333 | 5.708333 | 7.775000 | 6.266667 | ... | 3.433333 | 6.516667 | 5.100000 | 3.816667 | 3.950000 | 5.241667 | 6.125000 | 6.633333 | 5.416667 | 4.158333 |
2015 | 6.108333 | 6.441667 | 6.033333 | 5.075000 | 6.216667 | 3.908333 | 5.733333 | 4.833333 | 6.900000 | 5.416667 | ... | 3.075000 | 5.600000 | 4.441667 | 3.633333 | 3.566667 | 4.466667 | 5.625000 | 6.750000 | 4.575000 | 4.250000 |
2016 | 5.991667 | 6.616667 | 5.308333 | 3.991667 | 5.433333 | 3.316667 | 5.083333 | 4.383333 | 6.016667 | 4.875000 | ... | 2.825000 | 4.808333 | 4.616667 | 3.425000 | 3.241667 | 4.008333 | 5.433333 | 6.000000 | 4.150000 | 5.283333 |
17 rows × 52 columns
More useful GroupBy functionality
Often we want more than one aggregate statistic; for example, we may want the min, mean, and max by state. We can use the .aggregate() function to get these:
data.groupby('state')['unemployment_rate'].aggregate([np.min, np.mean, np.max])
| state | amin | mean | amax |
|---|---|---|---|
Alabama | 3.3 | 6.514706 | 12.3 |
Alaska | 5.2 | 7.005392 | 9.1 |
Arizona | 3.4 | 6.381373 | 11.1 |
Arkansas | 3.4 | 5.997059 | 9.5 |
California | 4.4 | 7.443627 | 12.7 |
Colorado | 2.4 | 5.453922 | 9.2 |
Connecticut | 1.7 | 5.821078 | 9.8 |
Delaware | 2.9 | 5.214216 | 9.4 |
District of Columbia | 4.7 | 7.306373 | 10.9 |
Florida | 2.8 | 6.158824 | 11.5 |
Georgia | 3.0 | 6.446078 | 10.9 |
Hawaii | 1.9 | 4.456373 | 7.9 |
Idaho | 2.2 | 5.525000 | 10.5 |
Illinois | 3.6 | 6.936275 | 12.2 |
Indiana | 2.2 | 6.184804 | 11.8 |
Iowa | 2.3 | 4.373039 | 7.3 |
Kansas | 3.2 | 5.084314 | 7.9 |
Kentucky | 3.7 | 6.639216 | 11.9 |
Louisiana | 3.6 | 6.197549 | 10.9 |
Maine | 2.6 | 5.470098 | 9.5 |
Maryland | 3.1 | 5.121078 | 8.3 |
Massachusetts | 2.1 | 5.521078 | 9.6 |
Michigan | 2.9 | 7.662255 | 15.4 |
Minnesota | 2.7 | 4.886765 | 8.6 |
Mississippi | 4.6 | 7.285784 | 11.7 |
Missouri | 3.0 | 6.069118 | 10.5 |
Montana | 2.7 | 5.023039 | 8.6 |
Nebraska | 2.6 | 3.629412 | 5.4 |
Nevada | 3.7 | 7.220098 | 13.9 |
New Hampshire | 2.3 | 4.205882 | 7.0 |
New Jersey | 3.3 | 6.248039 | 10.3 |
New Mexico | 3.4 | 5.972059 | 9.0 |
New York | 3.9 | 6.182353 | 9.6 |
North Carolina | 3.0 | 6.741176 | 12.0 |
North Dakota | 2.0 | 3.279902 | 5.1 |
Ohio | 3.4 | 6.455392 | 12.0 |
Oklahoma | 2.6 | 4.784804 | 7.6 |
Oregon | 4.2 | 7.242647 | 12.6 |
Pennsylvania | 3.6 | 5.946569 | 9.5 |
Puerto Rico | 8.9 | 12.653922 | 17.3 |
Rhode Island | 3.6 | 7.000490 | 12.2 |
South Carolina | 3.2 | 7.130882 | 12.4 |
South Dakota | 2.0 | 3.579412 | 5.9 |
Tennessee | 3.3 | 6.369118 | 11.3 |
Texas | 3.5 | 5.775000 | 8.6 |
Utah | 2.1 | 4.716667 | 8.4 |
Vermont | 2.3 | 4.247549 | 7.6 |
Virginia | 1.8 | 4.543137 | 7.9 |
Washington | 4.2 | 6.706863 | 11.3 |
West Virginia | 3.6 | 6.168137 | 10.0 |
Wisconsin | 2.8 | 5.668627 | 10.4 |
Wyoming | 2.4 | 4.399510 | 8.1 |
We can dress it up a little too if we like. Note that passing a dict of {new name: function} to .aggregate() on a single column (the old 'nested renamer') raises a SpecificationError in modern pandas; instead we rename via named aggregation, passing each output name as a keyword:
data.groupby('state')['unemployment_rate'] \
    .aggregate(**{'Min. Unemploy.' : 'min',
                  'Mean Unemploy.' : 'mean',
                  'Max. Unemploy.' : 'max'})
In the challenge above, we used filter to select states based on a string, but filter has more uses than this simple one. For example, we can filter based on group properties:
def filter_func(x):
    # keep a state only if its unemployment rate never fell to 5 percent or below
    return x['unemployment_rate'].min() > 5
problem_states = data.groupby('state').filter(filter_func)
problem_states.state.unique()
array(['Alaska', 'Puerto Rico'], dtype=object)
problem_states.groupby(['state', 'year']).unemployment_rate.mean().unstack(level=0)
| year | Alaska | Puerto Rico |
|---|---|---|
2000 | 6.375000 | 10.125000 |
2001 | 6.433333 | 11.350000 |
2002 | 7.291667 | 12.250000 |
2003 | 7.825000 | 11.966667 |
2004 | 7.458333 | 10.575000 |
2005 | 6.883333 | 11.325000 |
2006 | 6.616667 | 10.500000 |
2007 | 6.333333 | 11.183333 |
2008 | 6.691667 | 11.766667 |
2009 | 7.741667 | 15.333333 |
2010 | 7.883333 | 16.358333 |
2011 | 7.591667 | 15.941667 |
2012 | 7.141667 | 14.466667 |
2013 | 6.925000 | 14.291667 |
2014 | 6.841667 | 13.900000 |
2015 | 6.441667 | 11.991667 |
2016 | 6.616667 | 11.791667 |
Other useful functions
GroupBy also works alongside the following functions:
transform()
apply()
For lack of time we leave these for you to explore.
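As a quick taste before you do, here is a minimal sketch on a made-up frame (not the labour data): transform returns an object the same shape as its input, while apply runs an arbitrary function over each group's sub-frame:

```python
import pandas as pd

toy = pd.DataFrame({'state': ['WI', 'WI', 'MN', 'MN'],
                    'rate': [5.0, 7.0, 3.0, 5.0]})

# transform: demean the rate within each state; one output row per input row
toy['rate_demeaned'] = toy.groupby('state')['rate'] \
                          .transform(lambda x: x - x.mean())

# apply: compute the within-state spread (max minus min); one output row per group
spread = toy.groupby('state')['rate'].apply(lambda x: x.max() - x.min())

print(toy)
print(spread)
```

The rule of thumb: reach for transform when you want a group-level quantity broadcast back onto the original rows, and for apply when the per-group computation does not fit a built-in aggregate.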