{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Aggregate Functions and Group Statistics\n", "* Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer\n", "\n", "So far our work with Pandas as been purely about simple transformations of each row or data; whether selecting them or creating new columns out of old ones. This is rarely enough - we typically want some degree of summarization of the data; either of an entire column; or on parts of (groups) of the column in question.\n", "\n", "This is what we will go after here:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = pd.read_csv('out_data/state_labour_statistics.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Aggregation\n", "\n", "Like NumPy, Pandas has some of the standard data aggregations:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.618795\n", "1 0.591624\n", "2 0.888684\n", "3 0.891655\n", "4 0.457567\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.RandomState(1234567890)\n", "series = pd.Series(rng.rand(5))\n", "series\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "3.4483242705353976" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series.sum()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(0.6896648541070796, 0.61879477158568, 0.89165480011560816)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series.mean(), series.median(), series.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a data-frame; these aggregations return results for each column:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
00.7781880.400999
10.2670640.705406
20.9961060.405186
30.5400950.949991
40.5375220.030754
\n", "
" ], "text/plain": [ " A B\n", "0 0.778188 0.400999\n", "1 0.267064 0.705406\n", "2 0.996106 0.405186\n", "3 0.540095 0.949991\n", "4 0.537522 0.030754" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'A': rng.rand(5),\n", " 'B': rng.rand(5)})\n", "df" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "A 0.623795\n", "B 0.498467\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "if you want the aggregation across a row, you can specify thats what you want:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0.589594\n", "1 0.486235\n", "2 0.700646\n", "3 0.745043\n", "4 0.284138\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean(axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often times, we want to see all of the common aggregates at once. Pandas provides a function `.describe()` to do this for us:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearunemployment_rateqty_unemployedqty_employedlabour_force
count10608.0000010608.0000001.060800e+041.060800e+041.060800e+04
mean2008.000005.9431941.849809e+052.751157e+062.936138e+06
std4.899212.2613882.433025e+053.021023e+063.242657e+06
min2000.000001.7000006.697000e+032.486420e+052.619910e+05
25%2004.000004.4000004.088675e+047.359608e+057.836765e+05
50%2008.000005.5000001.150930e+051.837300e+061.961558e+06
75%2012.000007.0000002.212182e+053.226836e+063.451084e+06
max2016.0000017.3000002.316525e+061.819380e+071.920923e+07
\n", "
" ], "text/plain": [ " year unemployment_rate qty_unemployed qty_employed \\\n", "count 10608.00000 10608.000000 1.060800e+04 1.060800e+04 \n", "mean 2008.00000 5.943194 1.849809e+05 2.751157e+06 \n", "std 4.89921 2.261388 2.433025e+05 3.021023e+06 \n", "min 2000.00000 1.700000 6.697000e+03 2.486420e+05 \n", "25% 2004.00000 4.400000 4.088675e+04 7.359608e+05 \n", "50% 2008.00000 5.500000 1.150930e+05 1.837300e+06 \n", "75% 2012.00000 7.000000 2.212182e+05 3.226836e+06 \n", "max 2016.00000 17.300000 2.316525e+06 1.819380e+07 \n", "\n", " labour_force \n", "count 1.060800e+04 \n", "mean 2.936138e+06 \n", "std 3.242657e+06 \n", "min 2.619910e+05 \n", "25% 7.836765e+05 \n", "50% 1.961558e+06 \n", "75% 3.451084e+06 \n", "max 1.920923e+07 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some useful aggregation functions in pandas are in the table below:\n", "\n", "[insert table]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whilst these aggregations can provide some notions of patterns in the data - often we want to go deeper - and compute these summaries at the level of a group (like a state of a year). Pandas has the `groupby` function that will allow us to efficiently looks at aggregate statistics on subsets of the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GroupBy: The Split, Apply, Combine Paradigm\n", "The canonical example of the groupby paradigm is best represented in a figure:\n", "\n", "[insert figure]\n", "\n", "In words, GroupBy does the following:\n", "* The *split* step breaks up a DataFrame into subsets depending on the value of a specified key\n", "* the *apply* step computes some aggregate functions within an individual group\n", "* The *combine* step merges the results of the computations within groups back to a single dataset\n", "\n", "We could certainly perform all these operations by ourselves, manually - but the tedium of the task combined with a high probability of error makes us immediately want to use a built in function that abstracts away from individual steps and asks the user to rather think of the operation as a whole rather than the individual steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a start - let's look at the average unemployment rate by state. As a first step we must specify the name of the column we are looking to split on:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('state')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Think of this object as an under the hood view of a `DataFrame`. Note that it does not compute anything until an aggregation is applied to it; i.e. it uses 'lazy evaluation'\n", "\n", "Now lets get those mena unemployment rates:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "state\n", "Alabama 6.514706\n", "Alaska 7.005392\n", "Arizona 6.381373\n", "Arkansas 5.997059\n", "California 7.443627\n", "Colorado 5.453922\n", "Connecticut 5.821078\n", "Delaware 5.214216\n", "District of Columbia 7.306373\n", "Florida 6.158824\n", "Georgia 6.446078\n", "Hawaii 4.456373\n", "Idaho 5.525000\n", "Illinois 6.936275\n", "Indiana 6.184804\n", "Iowa 4.373039\n", "Kansas 5.084314\n", "Kentucky 6.639216\n", "Louisiana 6.197549\n", "Maine 5.470098\n", "Maryland 5.121078\n", "Massachusetts 5.521078\n", "Michigan 7.662255\n", "Minnesota 4.886765\n", "Mississippi 7.285784\n", "Missouri 6.069118\n", "Montana 5.023039\n", "Nebraska 3.629412\n", "Nevada 7.220098\n", "New Hampshire 4.205882\n", "New Jersey 6.248039\n", "New Mexico 5.972059\n", "New York 6.182353\n", "North Carolina 6.741176\n", "North Dakota 3.279902\n", "Ohio 6.455392\n", "Oklahoma 4.784804\n", "Oregon 7.242647\n", "Pennsylvania 5.946569\n", "Puerto Rico 12.653922\n", "Rhode Island 7.000490\n", "South Carolina 7.130882\n", "South Dakota 3.579412\n", "Tennessee 6.369118\n", "Texas 5.775000\n", "Utah 4.716667\n", "Vermont 4.247549\n", "Virginia 4.543137\n", "Washington 6.706863\n", "West Virginia 6.168137\n", "Wisconsin 5.668627\n", "Wyoming 4.399510\n", "Name: unemployment_rate, dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('state').unemployment_rate.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice if we wanted the aggregation function to be applied to all of the columns, we can skip the part where we specify the column:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearunemployment_rateqty_unemployedqty_employedlabour_force
state
Alabama2008.06.5147061.407627e+052.016128e+062.156891e+06
Alaska2008.07.0053922.449233e+043.251195e+053.496119e+05
Arizona2008.06.3813731.901428e+052.759190e+062.949333e+06
Arkansas2008.05.9970597.955364e+041.244863e+061.324416e+06
California2008.07.4436271.347057e+061.665503e+071.800209e+07
Colorado2008.05.4539221.452646e+052.503614e+062.648878e+06
Connecticut2008.05.8210781.082514e+051.736436e+061.844688e+06
Delaware2008.05.2142162.292001e+044.152155e+054.381355e+05
District of Columbia2008.07.3063732.490122e+043.137833e+053.386846e+05
Florida2008.06.1588245.573326e+058.397621e+068.954954e+06
Georgia2008.06.4460783.016779e+054.339163e+064.640840e+06
Hawaii2008.04.4563732.844513e+046.090308e+056.374759e+05
Idaho2008.05.5250004.104917e+047.013501e+057.423993e+05
Illinois2008.06.9362754.523743e+056.059067e+066.511441e+06
Indiana2008.06.1848041.977921e+053.002629e+063.200421e+06
Iowa2008.04.3730397.248086e+041.583003e+061.655483e+06
Kansas2008.05.0843147.486110e+041.394324e+061.469186e+06
Kentucky2008.06.6392161.339159e+051.874098e+062.008013e+06
Louisiana2008.06.1975491.283975e+051.938786e+062.067184e+06
Maine2008.05.4700983.788920e+046.543850e+056.922742e+05
Maryland2008.05.1210781.548394e+052.847295e+063.002135e+06
Massachusetts2008.05.5210781.911474e+053.266176e+063.457324e+06
Michigan2008.07.6622553.749323e+054.543822e+064.918755e+06
Minnesota2008.04.8867651.426212e+052.771413e+062.914034e+06
Mississippi2008.07.2857849.449842e+041.201615e+061.296114e+06
Missouri2008.06.0691181.842034e+052.849336e+063.033539e+06
Montana2008.05.0230392.489118e+044.706648e+054.955559e+05
Nebraska2008.03.6294123.576543e+049.494372e+059.852027e+05
Nevada2008.07.2200989.513693e+041.193879e+061.289016e+06
New Hampshire2008.04.2058823.072605e+046.977644e+057.284904e+05
New Jersey2008.06.2480392.798071e+054.176598e+064.456405e+06
New Mexico2008.05.9720595.465965e+048.586955e+059.133552e+05
New York2008.06.1823535.869909e+058.888198e+069.475189e+06
North Carolina2008.06.7411763.046283e+054.188285e+064.492913e+06
North Dakota2008.03.2799021.220031e+043.621385e+053.743388e+05
Ohio2008.06.4553923.763495e+055.449497e+065.825846e+06
Oklahoma2008.04.7848048.368540e+041.661057e+061.744743e+06
Oregon2008.07.2426471.389804e+051.776049e+061.915029e+06
Pennsylvania2008.05.9465693.771446e+055.952701e+066.329845e+06
Puerto Rico2008.012.6539221.603699e+051.112528e+061.272898e+06
Rhode Island2008.07.0004903.921885e+045.201720e+055.593909e+05
South Carolina2008.07.1308821.517508e+051.967512e+062.119263e+06
South Dakota2008.03.5794121.562588e+044.203784e+054.360043e+05
Tennessee2008.06.3691181.926002e+052.816181e+063.008781e+06
Texas2008.05.7750006.833513e+051.111923e+071.180258e+07
Utah2008.04.7166676.224774e+041.261578e+061.323826e+06
Vermont2008.04.2475491.492144e+043.346937e+053.496151e+05
Virginia2008.04.5431371.844006e+053.830106e+064.014506e+06
Washington2008.06.7068632.257521e+053.132887e+063.358639e+06
West Virginia2008.06.1681374.937335e+047.508340e+058.002073e+05
Wisconsin2008.05.6686271.737884e+052.889477e+063.063265e+06
Wyoming2008.04.3995101.283748e+042.771341e+052.899716e+05
\n", "
" ], "text/plain": [ " year unemployment_rate qty_unemployed qty_employed \\\n", "state \n", "Alabama 2008.0 6.514706 1.407627e+05 2.016128e+06 \n", "Alaska 2008.0 7.005392 2.449233e+04 3.251195e+05 \n", "Arizona 2008.0 6.381373 1.901428e+05 2.759190e+06 \n", "Arkansas 2008.0 5.997059 7.955364e+04 1.244863e+06 \n", "California 2008.0 7.443627 1.347057e+06 1.665503e+07 \n", "Colorado 2008.0 5.453922 1.452646e+05 2.503614e+06 \n", "Connecticut 2008.0 5.821078 1.082514e+05 1.736436e+06 \n", "Delaware 2008.0 5.214216 2.292001e+04 4.152155e+05 \n", "District of Columbia 2008.0 7.306373 2.490122e+04 3.137833e+05 \n", "Florida 2008.0 6.158824 5.573326e+05 8.397621e+06 \n", "Georgia 2008.0 6.446078 3.016779e+05 4.339163e+06 \n", "Hawaii 2008.0 4.456373 2.844513e+04 6.090308e+05 \n", "Idaho 2008.0 5.525000 4.104917e+04 7.013501e+05 \n", "Illinois 2008.0 6.936275 4.523743e+05 6.059067e+06 \n", "Indiana 2008.0 6.184804 1.977921e+05 3.002629e+06 \n", "Iowa 2008.0 4.373039 7.248086e+04 1.583003e+06 \n", "Kansas 2008.0 5.084314 7.486110e+04 1.394324e+06 \n", "Kentucky 2008.0 6.639216 1.339159e+05 1.874098e+06 \n", "Louisiana 2008.0 6.197549 1.283975e+05 1.938786e+06 \n", "Maine 2008.0 5.470098 3.788920e+04 6.543850e+05 \n", "Maryland 2008.0 5.121078 1.548394e+05 2.847295e+06 \n", "Massachusetts 2008.0 5.521078 1.911474e+05 3.266176e+06 \n", "Michigan 2008.0 7.662255 3.749323e+05 4.543822e+06 \n", "Minnesota 2008.0 4.886765 1.426212e+05 2.771413e+06 \n", "Mississippi 2008.0 7.285784 9.449842e+04 1.201615e+06 \n", "Missouri 2008.0 6.069118 1.842034e+05 2.849336e+06 \n", "Montana 2008.0 5.023039 2.489118e+04 4.706648e+05 \n", "Nebraska 2008.0 3.629412 3.576543e+04 9.494372e+05 \n", "Nevada 2008.0 7.220098 9.513693e+04 1.193879e+06 \n", "New Hampshire 2008.0 4.205882 3.072605e+04 6.977644e+05 \n", "New Jersey 2008.0 6.248039 2.798071e+05 4.176598e+06 \n", "New Mexico 2008.0 5.972059 5.465965e+04 8.586955e+05 \n", "New York 2008.0 6.182353 5.869909e+05 8.888198e+06 \n", "North Carolina 2008.0 6.741176 3.046283e+05 4.188285e+06 \n", "North Dakota 2008.0 3.279902 1.220031e+04 3.621385e+05 \n", "Ohio 2008.0 6.455392 3.763495e+05 5.449497e+06 \n", "Oklahoma 2008.0 4.784804 8.368540e+04 1.661057e+06 \n", "Oregon 2008.0 7.242647 1.389804e+05 1.776049e+06 \n", "Pennsylvania 2008.0 5.946569 3.771446e+05 5.952701e+06 \n", "Puerto Rico 2008.0 12.653922 1.603699e+05 1.112528e+06 \n", "Rhode Island 2008.0 7.000490 3.921885e+04 5.201720e+05 \n", "South Carolina 2008.0 7.130882 1.517508e+05 1.967512e+06 \n", "South Dakota 2008.0 3.579412 1.562588e+04 4.203784e+05 \n", "Tennessee 2008.0 6.369118 1.926002e+05 2.816181e+06 \n", "Texas 2008.0 5.775000 6.833513e+05 1.111923e+07 \n", "Utah 2008.0 4.716667 6.224774e+04 1.261578e+06 \n", "Vermont 2008.0 4.247549 1.492144e+04 3.346937e+05 \n", "Virginia 2008.0 4.543137 1.844006e+05 3.830106e+06 \n", "Washington 2008.0 6.706863 2.257521e+05 3.132887e+06 \n", "West Virginia 2008.0 6.168137 4.937335e+04 7.508340e+05 \n", "Wisconsin 2008.0 5.668627 1.737884e+05 2.889477e+06 \n", "Wyoming 2008.0 4.399510 1.283748e+04 2.771341e+05 \n", "\n", " labour_force \n", "state \n", "Alabama 2.156891e+06 \n", "Alaska 3.496119e+05 \n", "Arizona 2.949333e+06 \n", "Arkansas 1.324416e+06 \n", "California 1.800209e+07 \n", "Colorado 2.648878e+06 \n", "Connecticut 1.844688e+06 \n", "Delaware 4.381355e+05 \n", "District of Columbia 3.386846e+05 \n", "Florida 8.954954e+06 \n", "Georgia 4.640840e+06 \n", "Hawaii 6.374759e+05 \n", "Idaho 7.423993e+05 \n", "Illinois 6.511441e+06 \n", "Indiana 3.200421e+06 \n", "Iowa 1.655483e+06 \n", "Kansas 1.469186e+06 \n", "Kentucky 2.008013e+06 \n", "Louisiana 2.067184e+06 \n", "Maine 6.922742e+05 \n", "Maryland 3.002135e+06 \n", "Massachusetts 3.457324e+06 \n", "Michigan 4.918755e+06 \n", "Minnesota 2.914034e+06 \n", "Mississippi 1.296114e+06 \n", "Missouri 3.033539e+06 \n", "Montana 4.955559e+05 \n", "Nebraska 9.852027e+05 \n", "Nevada 1.289016e+06 \n", "New Hampshire 7.284904e+05 \n", "New Jersey 4.456405e+06 \n", "New Mexico 9.133552e+05 \n", "New York 9.475189e+06 \n", "North Carolina 4.492913e+06 \n", "North Dakota 3.743388e+05 \n", "Ohio 5.825846e+06 \n", "Oklahoma 1.744743e+06 \n", "Oregon 1.915029e+06 \n", "Pennsylvania 6.329845e+06 \n", "Puerto Rico 1.272898e+06 \n", "Rhode Island 5.593909e+05 \n", "South Carolina 2.119263e+06 \n", "South Dakota 4.360043e+05 \n", "Tennessee 3.008781e+06 \n", "Texas 1.180258e+07 \n", "Utah 1.323826e+06 \n", "Vermont 3.496151e+05 \n", "Virginia 4.014506e+06 \n", "Washington 3.358639e+06 \n", "West Virginia 8.002073e+05 \n", "Wisconsin 3.063265e+06 \n", "Wyoming 2.899716e+05 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('state').mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and if we want multiple, but not all columns:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
unemployment_rateqty_unemployed
state
Alabama6.5147061.407627e+05
Alaska7.0053922.449233e+04
Arizona6.3813731.901428e+05
Arkansas5.9970597.955364e+04
California7.4436271.347057e+06
Colorado5.4539221.452646e+05
Connecticut5.8210781.082514e+05
Delaware5.2142162.292001e+04
District of Columbia7.3063732.490122e+04
Florida6.1588245.573326e+05
Georgia6.4460783.016779e+05
Hawaii4.4563732.844513e+04
Idaho5.5250004.104917e+04
Illinois6.9362754.523743e+05
Indiana6.1848041.977921e+05
Iowa4.3730397.248086e+04
Kansas5.0843147.486110e+04
Kentucky6.6392161.339159e+05
Louisiana6.1975491.283975e+05
Maine5.4700983.788920e+04
Maryland5.1210781.548394e+05
Massachusetts5.5210781.911474e+05
Michigan7.6622553.749323e+05
Minnesota4.8867651.426212e+05
Mississippi7.2857849.449842e+04
Missouri6.0691181.842034e+05
Montana5.0230392.489118e+04
Nebraska3.6294123.576543e+04
Nevada7.2200989.513693e+04
New Hampshire4.2058823.072605e+04
New Jersey6.2480392.798071e+05
New Mexico5.9720595.465965e+04
New York6.1823535.869909e+05
North Carolina6.7411763.046283e+05
North Dakota3.2799021.220031e+04
Ohio6.4553923.763495e+05
Oklahoma4.7848048.368540e+04
Oregon7.2426471.389804e+05
Pennsylvania5.9465693.771446e+05
Puerto Rico12.6539221.603699e+05
Rhode Island7.0004903.921885e+04
South Carolina7.1308821.517508e+05
South Dakota3.5794121.562588e+04
Tennessee6.3691181.926002e+05
Texas5.7750006.833513e+05
Utah4.7166676.224774e+04
Vermont4.2475491.492144e+04
Virginia4.5431371.844006e+05
Washington6.7068632.257521e+05
West Virginia6.1681374.937335e+04
Wisconsin5.6686271.737884e+05
Wyoming4.3995101.283748e+04
\n", "
" ], "text/plain": [ " unemployment_rate qty_unemployed\n", "state \n", "Alabama 6.514706 1.407627e+05\n", "Alaska 7.005392 2.449233e+04\n", "Arizona 6.381373 1.901428e+05\n", "Arkansas 5.997059 7.955364e+04\n", "California 7.443627 1.347057e+06\n", "Colorado 5.453922 1.452646e+05\n", "Connecticut 5.821078 1.082514e+05\n", "Delaware 5.214216 2.292001e+04\n", "District of Columbia 7.306373 2.490122e+04\n", "Florida 6.158824 5.573326e+05\n", "Georgia 6.446078 3.016779e+05\n", "Hawaii 4.456373 2.844513e+04\n", "Idaho 5.525000 4.104917e+04\n", "Illinois 6.936275 4.523743e+05\n", "Indiana 6.184804 1.977921e+05\n", "Iowa 4.373039 7.248086e+04\n", "Kansas 5.084314 7.486110e+04\n", "Kentucky 6.639216 1.339159e+05\n", "Louisiana 6.197549 1.283975e+05\n", "Maine 5.470098 3.788920e+04\n", "Maryland 5.121078 1.548394e+05\n", "Massachusetts 5.521078 1.911474e+05\n", "Michigan 7.662255 3.749323e+05\n", "Minnesota 4.886765 1.426212e+05\n", "Mississippi 7.285784 9.449842e+04\n", "Missouri 6.069118 1.842034e+05\n", "Montana 5.023039 2.489118e+04\n", "Nebraska 3.629412 3.576543e+04\n", "Nevada 7.220098 9.513693e+04\n", "New Hampshire 4.205882 3.072605e+04\n", "New Jersey 6.248039 2.798071e+05\n", "New Mexico 5.972059 5.465965e+04\n", "New York 6.182353 5.869909e+05\n", "North Carolina 6.741176 3.046283e+05\n", "North Dakota 3.279902 1.220031e+04\n", "Ohio 6.455392 3.763495e+05\n", "Oklahoma 4.784804 8.368540e+04\n", "Oregon 7.242647 1.389804e+05\n", "Pennsylvania 5.946569 3.771446e+05\n", "Puerto Rico 12.653922 1.603699e+05\n", "Rhode Island 7.000490 3.921885e+04\n", "South Carolina 7.130882 1.517508e+05\n", "South Dakota 3.579412 1.562588e+04\n", "Tennessee 6.369118 1.926002e+05\n", "Texas 5.775000 6.833513e+05\n", "Utah 4.716667 6.224774e+04\n", "Vermont 4.247549 1.492144e+04\n", "Virginia 4.543137 1.844006e+05\n", "Washington 6.706863 2.257521e+05\n", "West Virginia 6.168137 4.937335e+04\n", "Wisconsin 5.668627 1.737884e+05\n", "Wyoming 4.399510 1.283748e+04" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('state')['unemployment_rate', 'qty_unemployed'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Very Cool!\n", "\n", "## Challenge\n", "\n", "Try and compute statistics where the grouping is described by multiple variables:\n", "1. Compute the average unemployment rate by state-year\n", "2. Compute the average unemployment rate by state-year-month\n", "3. Compute the average unemployment rate by state-year for the Carolinas (Hint: use the .filter() function after the computation)\n", "4. Compute the average unemployment rate by state-year and use `unstack` to view the data in a way that is different from the default output\n", "\n", "#### Partial Solution" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "state year\n", "North Carolina 2000 3.666667\n", " 2001 5.516667\n", " 2002 6.658333\n", " 2003 6.375000\n", " 2004 5.491667\n", " 2005 5.233333\n", " 2006 4.750000\n", " 2007 4.725000\n", " 2008 6.100000\n", " 2009 10.575000\n", " 2010 10.850000\n", " 2011 10.275000\n", " 2012 9.291667\n", " 2013 7.983333\n", " 2014 6.300000\n", " 2015 5.750000\n", " 2016 5.058333\n", "South Carolina 2000 3.758333\n", " 2001 5.216667\n", " 2002 5.808333\n", " 2003 6.908333\n", " 2004 6.825000\n", " 2005 6.725000\n", " 2006 6.433333\n", " 2007 5.666667\n", " 2008 6.791667\n", " 2009 11.258333\n", " 2010 11.158333\n", " 2011 10.558333\n", " 2012 9.216667\n", " 2013 7.641667\n", " 2014 6.441667\n", " 2015 5.975000\n", " 2016 4.841667\n", "Name: unemployment_rate, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(['state', 'year'])['unemployment_rate'].mean().filter(like='Carolina')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateAlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict of ColumbiaFlorida...South DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyoming
year
20004.5833336.3750003.9500004.2583334.9416672.7583332.3916673.7333335.6083333.683333...2.4583333.8666674.2916673.3416672.8083332.3000005.1666675.4833333.5250003.908333
20015.1250006.4333334.7916674.9583335.4333333.8000003.1333333.5000006.3333334.641667...3.0916674.5500004.9416674.4166673.3333333.2333336.3000005.0083334.5250003.825000
20025.9000007.2916676.0833335.4833336.6666675.5333334.3333334.0250006.4083335.633333...3.1833335.1833336.3416675.8000003.9916674.2166677.4083335.9166675.3750004.041667
20036.0166677.8250005.7416675.9333336.8000005.9583335.3916674.2666676.8333335.208333...3.5333335.6333336.6833335.6333334.2666674.1083337.4083336.0500005.7000004.275000
20045.6833337.4583335.0333335.7083336.2083335.4833334.9666673.9916677.7833334.633333...3.7250005.3333335.9416674.9916673.6750003.8250006.2666675.3250005.0250003.808333
20054.4916676.8833334.6833335.2500005.4000005.0250004.8833334.1416676.3750003.691667...3.8166675.5500005.4083334.0916673.4750003.5666675.5500005.1083334.7416673.583333
20064.0333336.6166674.2083335.1583334.9000004.3083334.3333333.5750005.7916673.225000...3.1083335.2000004.8916672.9750003.6833333.1000005.0416674.9416674.7416673.175000
20073.9833336.3333333.8583335.3083335.3750003.7333334.4750003.4416675.5250004.008333...2.8416674.6750004.2916672.5666674.0083333.0333334.7083334.5833334.9000002.816667
20085.6583336.6916676.1250005.4666677.2750004.8250005.6833334.9416676.4583336.283333...3.0500006.5750004.8083333.5916674.6833333.9416675.3833334.3583334.9083333.050000
200911.0333337.7416679.9333337.85000011.1416677.2833337.9333338.2666679.27500010.408333...4.92500010.4833337.5583337.3250006.5916676.7000009.1416677.7250008.5750006.325000
201010.5416677.88333310.3666678.19166712.2250008.7250009.1083338.4166679.41666711.041667...4.9833339.6750008.1500007.8333336.0916677.15000010.0000008.6416678.6666676.458333
20119.6416677.5916679.5000008.30833311.7166678.3583338.8083337.50000010.1583339.991667...4.7416678.9916677.7583336.7333335.4750006.5916679.2916678.0583337.7583335.808333
20127.9416677.1416678.3500007.57500010.3750007.8750008.3416677.2000009.0500008.458333...4.2750007.8416676.7250005.4083334.9500006.0333338.1333337.5083337.0416675.316667
20137.2250006.9250007.7416677.3416678.9333336.8250007.7500006.7166678.5000007.233333...3.7833337.7916676.2250004.6000004.4166675.7166677.0333336.7666676.7416674.708333
20146.7916676.8416676.7750006.0916677.5000005.0000006.6083335.7083337.7750006.266667...3.4333336.5166675.1000003.8166673.9500005.2416676.1250006.6333335.4166674.158333
20156.1083336.4416676.0333335.0750006.2166673.9083335.7333334.8333336.9000005.416667...3.0750005.6000004.4416673.6333333.5666674.4666675.6250006.7500004.5750004.250000
20165.9916676.6166675.3083333.9916675.4333333.3166675.0833334.3833336.0166674.875000...2.8250004.8083334.6166673.4250003.2416674.0083335.4333336.0000004.1500005.283333
\n", "

17 rows × 52 columns

\n", "
" ], "text/plain": [ "state Alabama Alaska Arizona Arkansas California Colorado \\\n", "year \n", "2000 4.583333 6.375000 3.950000 4.258333 4.941667 2.758333 \n", "2001 5.125000 6.433333 4.791667 4.958333 5.433333 3.800000 \n", "2002 5.900000 7.291667 6.083333 5.483333 6.666667 5.533333 \n", "2003 6.016667 7.825000 5.741667 5.933333 6.800000 5.958333 \n", "2004 5.683333 7.458333 5.033333 5.708333 6.208333 5.483333 \n", "2005 4.491667 6.883333 4.683333 5.250000 5.400000 5.025000 \n", "2006 4.033333 6.616667 4.208333 5.158333 4.900000 4.308333 \n", "2007 3.983333 6.333333 3.858333 5.308333 5.375000 3.733333 \n", "2008 5.658333 6.691667 6.125000 5.466667 7.275000 4.825000 \n", "2009 11.033333 7.741667 9.933333 7.850000 11.141667 7.283333 \n", "2010 10.541667 7.883333 10.366667 8.191667 12.225000 8.725000 \n", "2011 9.641667 7.591667 9.500000 8.308333 11.716667 8.358333 \n", "2012 7.941667 7.141667 8.350000 7.575000 10.375000 7.875000 \n", "2013 7.225000 6.925000 7.741667 7.341667 8.933333 6.825000 \n", "2014 6.791667 6.841667 6.775000 6.091667 7.500000 5.000000 \n", "2015 6.108333 6.441667 6.033333 5.075000 6.216667 3.908333 \n", "2016 5.991667 6.616667 5.308333 3.991667 5.433333 3.316667 \n", "\n", "state Connecticut Delaware District of Columbia Florida ... \\\n", "year ... \n", "2000 2.391667 3.733333 5.608333 3.683333 ... \n", "2001 3.133333 3.500000 6.333333 4.641667 ... \n", "2002 4.333333 4.025000 6.408333 5.633333 ... \n", "2003 5.391667 4.266667 6.833333 5.208333 ... \n", "2004 4.966667 3.991667 7.783333 4.633333 ... \n", "2005 4.883333 4.141667 6.375000 3.691667 ... \n", "2006 4.333333 3.575000 5.791667 3.225000 ... \n", "2007 4.475000 3.441667 5.525000 4.008333 ... \n", "2008 5.683333 4.941667 6.458333 6.283333 ... \n", "2009 7.933333 8.266667 9.275000 10.408333 ... \n", "2010 9.108333 8.416667 9.416667 11.041667 ... \n", "2011 8.808333 7.500000 10.158333 9.991667 ... \n", "2012 8.341667 7.200000 9.050000 8.458333 ... \n", "2013 7.750000 6.716667 8.500000 7.233333 ... \n", "2014 6.608333 5.708333 7.775000 6.266667 ... \n", "2015 5.733333 4.833333 6.900000 5.416667 ... \n", "2016 5.083333 4.383333 6.016667 4.875000 ... \n", "\n", "state South Dakota Tennessee Texas Utah Vermont Virginia \\\n", "year \n", "2000 2.458333 3.866667 4.291667 3.341667 2.808333 2.300000 \n", "2001 3.091667 4.550000 4.941667 4.416667 3.333333 3.233333 \n", "2002 3.183333 5.183333 6.341667 5.800000 3.991667 4.216667 \n", "2003 3.533333 5.633333 6.683333 5.633333 4.266667 4.108333 \n", "2004 3.725000 5.333333 5.941667 4.991667 3.675000 3.825000 \n", "2005 3.816667 5.550000 5.408333 4.091667 3.475000 3.566667 \n", "2006 3.108333 5.200000 4.891667 2.975000 3.683333 3.100000 \n", "2007 2.841667 4.675000 4.291667 2.566667 4.008333 3.033333 \n", "2008 3.050000 6.575000 4.808333 3.591667 4.683333 3.941667 \n", "2009 4.925000 10.483333 7.558333 7.325000 6.591667 6.700000 \n", "2010 4.983333 9.675000 8.150000 7.833333 6.091667 7.150000 \n", "2011 4.741667 8.991667 7.758333 6.733333 5.475000 6.591667 \n", "2012 4.275000 7.841667 6.725000 5.408333 4.950000 6.033333 \n", "2013 3.783333 7.791667 6.225000 4.600000 4.416667 5.716667 \n", "2014 3.433333 6.516667 5.100000 3.816667 3.950000 5.241667 \n", "2015 3.075000 5.600000 4.441667 3.633333 3.566667 4.466667 \n", "2016 2.825000 4.808333 4.616667 3.425000 3.241667 4.008333 \n", "\n", "state Washington West Virginia Wisconsin Wyoming \n", "year \n", "2000 5.166667 5.483333 3.525000 3.908333 \n", "2001 6.300000 5.008333 4.525000 3.825000 \n", "2002 7.408333 5.916667 5.375000 4.041667 \n", "2003 7.408333 6.050000 5.700000 4.275000 \n", "2004 6.266667 5.325000 5.025000 3.808333 \n", "2005 5.550000 5.108333 4.741667 3.583333 \n", "2006 5.041667 4.941667 4.741667 3.175000 \n", "2007 4.708333 4.583333 4.900000 2.816667 \n", "2008 5.383333 4.358333 4.908333 3.050000 \n", "2009 9.141667 7.725000 8.575000 6.325000 \n", "2010 10.000000 8.641667 8.666667 6.458333 \n", "2011 9.291667 8.058333 7.758333 5.808333 \n", "2012 8.133333 7.508333 7.041667 5.316667 \n", "2013 7.033333 6.766667 6.741667 4.708333 \n", "2014 6.125000 6.633333 5.416667 4.158333 \n", "2015 5.625000 6.750000 4.575000 4.250000 \n", "2016 5.433333 6.000000 4.150000 5.283333 \n", "\n", "[17 rows x 52 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(['state', 'year'])['unemployment_rate'].mean().unstack(level=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More useful GroupBy functionality\n", "\n", "Often we want more than one aggregate statistic, for example we may want the min, mean and max by State. We can use the `.aggregate()` function to get these:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
aminmeanamax
state
Alabama3.36.51470612.3
Alaska5.27.0053929.1
Arizona3.46.38137311.1
Arkansas3.45.9970599.5
California4.47.44362712.7
Colorado2.45.4539229.2
Connecticut1.75.8210789.8
Delaware2.95.2142169.4
District of Columbia4.77.30637310.9
Florida2.86.15882411.5
Georgia3.06.44607810.9
Hawaii1.94.4563737.9
Idaho2.25.52500010.5
Illinois3.66.93627512.2
Indiana2.26.18480411.8
Iowa2.34.3730397.3
Kansas3.25.0843147.9
Kentucky3.76.63921611.9
Louisiana3.66.19754910.9
Maine2.65.4700989.5
Maryland3.15.1210788.3
Massachusetts2.15.5210789.6
Michigan2.97.66225515.4
Minnesota2.74.8867658.6
Mississippi4.67.28578411.7
Missouri3.06.06911810.5
Montana2.75.0230398.6
Nebraska2.63.6294125.4
Nevada3.77.22009813.9
New Hampshire2.34.2058827.0
New Jersey3.36.24803910.3
New Mexico3.45.9720599.0
New York3.96.1823539.6
North Carolina3.06.74117612.0
North Dakota2.03.2799025.1
Ohio3.46.45539212.0
Oklahoma2.64.7848047.6
Oregon4.27.24264712.6
Pennsylvania3.65.9465699.5
Puerto Rico8.912.65392217.3
Rhode Island3.67.00049012.2
South Carolina3.27.13088212.4
South Dakota2.03.5794125.9
Tennessee3.36.36911811.3
Texas3.55.7750008.6
Utah2.14.7166678.4
Vermont2.34.2475497.6
Virginia1.84.5431377.9
Washington4.26.70686311.3
West Virginia3.66.16813710.0
Wisconsin2.85.66862710.4
Wyoming2.44.3995108.1
\n", "
" ], "text/plain": [ " amin mean amax\n", "state \n", "Alabama 3.3 6.514706 12.3\n", "Alaska 5.2 7.005392 9.1\n", "Arizona 3.4 6.381373 11.1\n", "Arkansas 3.4 5.997059 9.5\n", "California 4.4 7.443627 12.7\n", "Colorado 2.4 5.453922 9.2\n", "Connecticut 1.7 5.821078 9.8\n", "Delaware 2.9 5.214216 9.4\n", "District of Columbia 4.7 7.306373 10.9\n", "Florida 2.8 6.158824 11.5\n", "Georgia 3.0 6.446078 10.9\n", "Hawaii 1.9 4.456373 7.9\n", "Idaho 2.2 5.525000 10.5\n", "Illinois 3.6 6.936275 12.2\n", "Indiana 2.2 6.184804 11.8\n", "Iowa 2.3 4.373039 7.3\n", "Kansas 3.2 5.084314 7.9\n", "Kentucky 3.7 6.639216 11.9\n", "Louisiana 3.6 6.197549 10.9\n", "Maine 2.6 5.470098 9.5\n", "Maryland 3.1 5.121078 8.3\n", "Massachusetts 2.1 5.521078 9.6\n", "Michigan 2.9 7.662255 15.4\n", "Minnesota 2.7 4.886765 8.6\n", "Mississippi 4.6 7.285784 11.7\n", "Missouri 3.0 6.069118 10.5\n", "Montana 2.7 5.023039 8.6\n", "Nebraska 2.6 3.629412 5.4\n", "Nevada 3.7 7.220098 13.9\n", "New Hampshire 2.3 4.205882 7.0\n", "New Jersey 3.3 6.248039 10.3\n", "New Mexico 3.4 5.972059 9.0\n", "New York 3.9 6.182353 9.6\n", "North Carolina 3.0 6.741176 12.0\n", "North Dakota 2.0 3.279902 5.1\n", "Ohio 3.4 6.455392 12.0\n", "Oklahoma 2.6 4.784804 7.6\n", "Oregon 4.2 7.242647 12.6\n", "Pennsylvania 3.6 5.946569 9.5\n", "Puerto Rico 8.9 12.653922 17.3\n", "Rhode Island 3.6 7.000490 12.2\n", "South Carolina 3.2 7.130882 12.4\n", "South Dakota 2.0 3.579412 5.9\n", "Tennessee 3.3 6.369118 11.3\n", "Texas 3.5 5.775000 8.6\n", "Utah 2.1 4.716667 8.4\n", "Vermont 2.3 4.247549 7.6\n", "Virginia 1.8 4.543137 7.9\n", "Washington 4.2 6.706863 11.3\n", "West Virginia 3.6 6.168137 10.0\n", "Wisconsin 2.8 5.668627 10.4\n", "Wyoming 2.4 4.399510 8.1" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('state')['unemployment_rate'].aggregate([np.min, np.mean, np.max])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can dress it up a little too if we like:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Mean Unemploy.Min. Unemploy.Max. Unemploy.
state
Alabama6.5147063.312.3
Alaska7.0053925.29.1
Arizona6.3813733.411.1
Arkansas5.9970593.49.5
California7.4436274.412.7
Colorado5.4539222.49.2
Connecticut5.8210781.79.8
Delaware5.2142162.99.4
District of Columbia7.3063734.710.9
Florida6.1588242.811.5
Georgia6.4460783.010.9
Hawaii4.4563731.97.9
Idaho5.5250002.210.5
Illinois6.9362753.612.2
Indiana6.1848042.211.8
Iowa4.3730392.37.3
Kansas5.0843143.27.9
Kentucky6.6392163.711.9
Louisiana6.1975493.610.9
Maine5.4700982.69.5
Maryland5.1210783.18.3
Massachusetts5.5210782.19.6
Michigan7.6622552.915.4
Minnesota4.8867652.78.6
Mississippi7.2857844.611.7
Missouri6.0691183.010.5
Montana5.0230392.78.6
Nebraska3.6294122.65.4
Nevada7.2200983.713.9
New Hampshire4.2058822.37.0
New Jersey6.2480393.310.3
New Mexico5.9720593.49.0
New York6.1823533.99.6
North Carolina6.7411763.012.0
North Dakota3.2799022.05.1
Ohio6.4553923.412.0
Oklahoma4.7848042.67.6
Oregon7.2426474.212.6
Pennsylvania5.9465693.69.5
Puerto Rico12.6539228.917.3
Rhode Island7.0004903.612.2
South Carolina7.1308823.212.4
South Dakota3.5794122.05.9
Tennessee6.3691183.311.3
Texas5.7750003.58.6
Utah4.7166672.18.4
Vermont4.2475492.37.6
Virginia4.5431371.87.9
Washington6.7068634.211.3
West Virginia6.1681373.610.0
Wisconsin5.6686272.810.4
Wyoming4.3995102.48.1
\n", "
" ], "text/plain": [ " Mean Unemploy. Min. Unemploy. Max. Unemploy.\n", "state \n", "Alabama 6.514706 3.3 12.3\n", "Alaska 7.005392 5.2 9.1\n", "Arizona 6.381373 3.4 11.1\n", "Arkansas 5.997059 3.4 9.5\n", "California 7.443627 4.4 12.7\n", "Colorado 5.453922 2.4 9.2\n", "Connecticut 5.821078 1.7 9.8\n", "Delaware 5.214216 2.9 9.4\n", "District of Columbia 7.306373 4.7 10.9\n", "Florida 6.158824 2.8 11.5\n", "Georgia 6.446078 3.0 10.9\n", "Hawaii 4.456373 1.9 7.9\n", "Idaho 5.525000 2.2 10.5\n", "Illinois 6.936275 3.6 12.2\n", "Indiana 6.184804 2.2 11.8\n", "Iowa 4.373039 2.3 7.3\n", "Kansas 5.084314 3.2 7.9\n", "Kentucky 6.639216 3.7 11.9\n", "Louisiana 6.197549 3.6 10.9\n", "Maine 5.470098 2.6 9.5\n", "Maryland 5.121078 3.1 8.3\n", "Massachusetts 5.521078 2.1 9.6\n", "Michigan 7.662255 2.9 15.4\n", "Minnesota 4.886765 2.7 8.6\n", "Mississippi 7.285784 4.6 11.7\n", "Missouri 6.069118 3.0 10.5\n", "Montana 5.023039 2.7 8.6\n", "Nebraska 3.629412 2.6 5.4\n", "Nevada 7.220098 3.7 13.9\n", "New Hampshire 4.205882 2.3 7.0\n", "New Jersey 6.248039 3.3 10.3\n", "New Mexico 5.972059 3.4 9.0\n", "New York 6.182353 3.9 9.6\n", "North Carolina 6.741176 3.0 12.0\n", "North Dakota 3.279902 2.0 5.1\n", "Ohio 6.455392 3.4 12.0\n", "Oklahoma 4.784804 2.6 7.6\n", "Oregon 7.242647 4.2 12.6\n", "Pennsylvania 5.946569 3.6 9.5\n", "Puerto Rico 12.653922 8.9 17.3\n", "Rhode Island 7.000490 3.6 12.2\n", "South Carolina 7.130882 3.2 12.4\n", "South Dakota 3.579412 2.0 5.9\n", "Tennessee 6.369118 3.3 11.3\n", "Texas 5.775000 3.5 8.6\n", "Utah 4.716667 2.1 8.4\n", "Vermont 4.247549 2.3 7.6\n", "Virginia 4.543137 1.8 7.9\n", "Washington 6.706863 4.2 11.3\n", "West Virginia 6.168137 3.6 10.0\n", "Wisconsin 5.668627 2.8 10.4\n", "Wyoming 4.399510 2.4 8.1" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('state')['unemployment_rate'] \\\n", " .aggregate({'Min. Unemploy.' : np.min,\n", " 'Mean Unemploy.' : np.mean,\n", " 'Max. Unemploy.' : np.max})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the challenge above, we used filter to select out states based on a string, but there are more uses of filter than this simple one. For example, we can filter based on the group properties:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def filter_func(x):\n", " return x['unemployment_rate'].min() > 5\n", "\n", "problem_states = data.groupby('state').filter(filter_func)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['Alaska', 'Puerto Rico'], dtype=object)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "problem_states.state.unique()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateAlaskaPuerto Rico
year
20006.37500010.125000
20016.43333311.350000
20027.29166712.250000
20037.82500011.966667
20047.45833310.575000
20056.88333311.325000
20066.61666710.500000
20076.33333311.183333
20086.69166711.766667
20097.74166715.333333
20107.88333316.358333
20117.59166715.941667
20127.14166714.466667
20136.92500014.291667
20146.84166713.900000
20156.44166711.991667
20166.61666711.791667
\n", "
" ], "text/plain": [ "state Alaska Puerto Rico\n", "year \n", "2000 6.375000 10.125000\n", "2001 6.433333 11.350000\n", "2002 7.291667 12.250000\n", "2003 7.825000 11.966667\n", "2004 7.458333 10.575000\n", "2005 6.883333 11.325000\n", "2006 6.616667 10.500000\n", "2007 6.333333 11.183333\n", "2008 6.691667 11.766667\n", "2009 7.741667 15.333333\n", "2010 7.883333 16.358333\n", "2011 7.591667 15.941667\n", "2012 7.141667 14.466667\n", "2013 6.925000 14.291667\n", "2014 6.841667 13.900000\n", "2015 6.441667 11.991667\n", "2016 6.616667 11.791667" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "problem_states.groupby(['state', 'year']).unemployment_rate.mean().unstack(level=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other useful functions:\n", "\n", "GroupBy also works alongside the following functions:\n", "* `transform()`\n", "* `apply()`\n", "\n", "For lack of time we leave these for you to explore." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }