Hierachical (Multi-) Indexing

  • Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

In earlier notebooks we were working with our labor market data and importing it to have a multi-index. In this notebook, we explore the notions of multi-indexing in more detail.

import pandas as pd

Multiply Indexed Data

We focus on multiply indexed DataFrames, and ignore pandas Series because we will most often come across DataFrames. Most of our discussion carries over.

Let’s import our labour market data in the ‘simplest way’

data = pd.read_csv('out_data/state_labour_statistics.csv')
data.head()
state period year unemployment_rate qty_unemployed qty_employed labour_force
0 New Hampshire M12 2016 2.5 18907 728556 747463
1 New Hampshire M11 2016 2.6 19573 729720 749293
2 New Hampshire M10 2016 2.5 18430 729792 748222
3 New Hampshire M09 2016 2.6 19415 727548 746963
4 New Hampshire M08 2016 2.8 20933 737236 758169
data.index
RangeIndex(start=0, stop=10608, step=1)

Again note that pandas has created an index for us, which is simply a row identifier. We argued earlier that a better way of indexing might be state-year-month.

To make the shift to our preferred index, we need to reset the index, using the set_index function.

data.set_index(['state', 'year', 'period'], inplace=True)
data.head()
unemployment_rate qty_unemployed qty_employed labour_force
state year period
New Hampshire 2016 M12 2.5 18907 728556 747463
M11 2.6 19573 729720 749293
M10 2.5 18430 729792 748222
M09 2.6 19415 727548 746963
M08 2.8 20933 737236 758169

If we want to go back to the original index, we can use reset_index and ask that the multi-index we set is returned to columns of the data

data.reset_index(drop=False, inplace=True)
data.head()
state year period unemployment_rate qty_unemployed qty_employed labour_force
0 New Hampshire 2016 M12 2.5 18907 728556 747463
1 New Hampshire 2016 M11 2.6 19573 729720 749293
2 New Hampshire 2016 M10 2.5 18430 729792 748222
3 New Hampshire 2016 M09 2.6 19415 727548 746963
4 New Hampshire 2016 M08 2.8 20933 737236 758169

But let’s for now stick with our multi-index data- and see how to use it

data.set_index(['state', 'year', 'period'], inplace=True)

Selecting Data with a Multi-Index

One advantage of a multi-index is that we can subset data quite simply:

data.loc['Alabama']
unemployment_rate qty_unemployed qty_employed labour_force
year period
2016 M12 5.9 129295 2050965 2180260
M11 5.7 123834 2060807 2184641
M10 6.2 136336 2057151 2193487
M09 6.2 134032 2042696 2176728
M08 6.2 134754 2033706 2168460
... ... ... ... ... ...
2000 M05 4.1 86308 2037787 2124095
M04 3.8 80215 2045171 2125386
M03 4.5 95449 2038654 2134103
M02 5.1 109464 2027226 2136690
M01 5.1 108551 2024262 2132813

204 rows × 4 columns

data.loc['Alabama', 2010]
/tmp/ipykernel_2901/4288551881.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  data.loc['Alabama', 2010]
unemployment_rate qty_unemployed qty_employed labour_force
period
M12 9.8 217272 1996538 2213810
M11 10.0 222536 2001481 2224017
M10 9.9 219224 2001483 2220707
M09 9.8 216750 1990219 2206969
M08 10.1 224435 1988261 2212696
M07 10.2 225826 1990949 2216775
M06 10.5 234972 1994590 2229562
M05 9.9 219339 1994090 2213429
M04 10.4 229680 1976552 2206232
M03 11.5 245567 1892006 2137573
M02 12.1 258825 1876758 2135583
M01 12.3 263375 1871780 2135155
data.loc['Alabama', 2010, 'M10']
unemployment_rate          9.9
qty_unemployed        219224.0
qty_employed         2001483.0
labour_force         2220707.0
Name: (Alabama, 2010, M10), dtype: float64
# we hope this may work...
data.loc['Alabama', :, 'M10']
unemployment_rate qty_unemployed qty_employed labour_force
state year period
Alabama 2016 M10 6.2 136336 2057151 2193487
2015 M10 5.9 126424 2032455 2158879
2014 M10 6.0 130881 2035345 2166226
2013 M10 7.0 152300 2012001 2164301
2012 M10 7.5 165164 2025114 2190278
2011 M10 8.9 194917 2005689 2200606
2010 M10 9.9 219224 2001483 2220707
2009 M10 11.7 252694 1906032 2158726
2008 M10 6.3 137392 2047657 2185049
2007 M10 3.8 82043 2099135 2181178
2006 M10 3.6 79073 2103518 2182591
2005 M10 4.2 91482 2078674 2170156
2004 M10 5.5 119278 2035705 2154983
2003 M10 6.1 130045 2009302 2139347
2002 M10 5.6 118872 2006102 2124974
2001 M10 5.6 118727 2006251 2124978
2000 M10 4.5 96899 2042462 2139361
# we need to sort the index first
data = data.sort_index()
data.head()
unemployment_rate qty_unemployed qty_employed labour_force
state year period
Alabama 2000 M01 5.1 108551 2024262 2132813
M02 5.1 109464 2027226 2136690
M03 4.5 95449 2038654 2134103
M04 3.8 80215 2045171 2125386
M05 4.1 86308 2037787 2124095
data.loc['Alabama', 2010:2016, 'M10':'M12']
unemployment_rate qty_unemployed qty_employed labour_force
state year period
Alabama 2010 M10 9.9 219224 2001483 2220707
M11 10.0 222536 2001481 2224017
M12 9.8 217272 1996538 2213810
2011 M10 8.9 194917 2005689 2200606
M11 8.0 175369 2008604 2183973
M12 7.9 172004 2004151 2176155
2012 M10 7.5 165164 2025114 2190278
M11 6.9 149468 2023318 2172786
M12 7.2 156518 2016210 2172728
2013 M10 7.0 152300 2012001 2164301
M11 6.7 144052 2020314 2164366
M12 6.7 143618 2013519 2157137
2014 M10 6.0 130881 2035345 2166226
M11 5.8 124177 2033698 2157875
M12 5.7 121281 2023427 2144708
2015 M10 5.9 126424 2032455 2158879
M11 5.7 122954 2035229 2158183
M12 5.9 127389 2030765 2158154
2016 M10 6.2 136336 2057151 2193487
M11 5.7 123834 2060807 2184641
M12 5.9 129295 2050965 2180260

Challenge

  1. Extract all the data for the Carolinas (Help: you need to do a partial string match on the index data.index.get_level_values(XX).str.contains(YY) )

  2. Extract all the data for the Carolinas in 2007

  3. Extract all the data for the Carolinas between 2007-2010

  4. Extract all the data in the summer months for the Carolina between 2007 and 2010

Partial Solution:

data.loc[data.index.get_level_values(0).str.contains("Carolina"), 
             2007:2010, 'M06':'M09']
unemployment_rate qty_unemployed qty_employed labour_force
state year period
North Carolina 2007 M06 4.9 220499 4323332 4543831
M07 5.1 231083 4336375 4567458
M08 4.7 212508 4299171 4511679
M09 4.5 201890 4321920 4523810
2008 M06 5.9 271173 4313781 4584954
M07 6.3 289018 4324163 4613181
M08 6.3 288553 4282651 4571204
M09 5.9 268607 4292928 4561535
2009 M06 10.8 498908 4102835 4601743
M07 10.8 498541 4121528 4620069
M08 10.6 485802 4083402 4569204
M09 10.4 472231 4076322 4548553
2010 M06 10.9 503871 4137291 4641162
M07 10.8 503704 4150739 4654443
M08 10.5 486253 4137732 4623985
M09 10.0 459133 4152584 4611717
South Carolina 2007 M06 5.9 126476 2033179 2159655
M07 5.8 125358 2031889 2157247
M08 5.7 121989 2015875 2137864
M09 5.6 119816 2008234 2128050
2008 M06 6.6 143842 2028477 2172319
M07 6.9 149774 2022393 2172167
M08 7.5 162877 2003086 2165963
M09 7.5 160318 1990871 2151189
2009 M06 11.9 260682 1932513 2193195
M07 11.6 253471 1932518 2185989
M08 11.5 247965 1914687 2162652
M09 11.5 245694 1895141 2140835
2010 M06 11.2 244208 1940423 2184631
M07 11.0 240106 1944223 2184329
M08 11.2 244749 1934786 2179535
M09 10.7 231578 1923986 2155564

Index (Un-)Stacking

One potentially cool use for multi-indexing is using the indexes across two dimensions. This is unstacking, and often allows for simple ways to view patterns in the data.

For example, we could translate the year-index across to a column axis, so that we could can easily see patterns in labor force statistics in a given month, but over various years:

data.unstack(level=1)
unemployment_rate ... labour_force
year 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
state period
Alabama M01 5.1 5.2 6.5 6.2 6.4 5.4 4.4 4.3 4.8 9.7 ... 2169298 2170233 2164104 2135155 2210128 2151638 2166067 2151800 2134230 2141628
M02 5.1 5.0 6.3 6.0 6.1 5.4 4.5 4.2 4.9 10.5 ... 2170069 2167554 2171529 2135583 2204837 2165383 2170603 2165104 2136284 2152446
M03 4.5 4.7 6.0 5.6 5.9 4.6 3.9 3.8 4.7 10.4 ... 2176156 2170499 2166423 2137573 2206650 2171589 2164327 2172369 2139776 2160907
M04 3.8 4.2 5.3 5.3 5.2 3.9 3.5 3.3 4.0 9.9 ... 2156839 2162674 2162016 2206232 2204162 2163833 2158996 2153646 2144272 2154477
M05 4.1 4.0 5.2 5.6 5.4 4.0 3.6 3.3 4.9 10.6 ... 2158625 2174485 2164803 2213429 2205918 2178571 2171580 2164055 2162423 2159095
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Wyoming M08 3.4 3.3 3.4 3.7 3.3 3.2 2.8 2.5 2.6 6.1 ... 290133 296658 302187 305471 309004 309559 308650 308852 307846 304511
M09 3.4 3.3 3.5 3.6 3.4 3.2 2.8 2.4 2.5 6.0 ... 287821 294561 299774 303557 307206 306748 306052 306169 304353 301548
M10 3.4 3.4 3.6 3.6 3.4 3.3 2.8 2.4 2.7 6.3 ... 288442 295880 300205 304878 307938 307263 306560 306735 304311 300979
M11 4.0 3.8 4.2 4.1 3.7 3.5 3.0 2.6 3.2 6.7 ... 288931 295690 300575 305384 308172 306948 306599 306418 302892 300414
M12 3.9 3.9 4.5 4.2 3.7 3.5 2.9 2.5 3.7 7.5 ... 287174 294792 300582 304124 306890 305084 305285 303753 300859 299114

624 rows × 68 columns

In it’s current form this is a bit ugly because there is so much data. But we can use our indexing function loc together with a column selection to potentially view one of our series:

data.unstack(level=1).loc['California']['unemployment_rate']
year 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
period
M01 5.4 5.3 7.0 7.3 7.0 6.2 5.3 5.5 6.4 10.2 12.7 12.6 11.4 10.2 8.4 7.2 5.9
M02 5.4 5.1 6.8 7.1 6.7 6.1 5.3 5.4 6.3 10.6 12.6 12.2 11.3 9.7 8.4 6.9 5.8
M03 5.4 5.3 6.9 7.1 7.0 5.9 5.2 5.2 6.6 10.9 12.6 12.1 11.0 9.3 8.3 6.7 5.8
M04 4.8 4.9 6.6 6.7 6.3 5.3 4.9 5.0 6.1 10.4 12.0 11.5 10.3 8.7 7.3 6.2 5.4
M05 4.8 4.6 6.2 6.4 5.9 5.0 4.5 4.8 6.4 10.6 11.7 11.3 10.2 8.6 7.2 6.1 4.9
M06 5.0 5.2 6.7 7.2 6.4 5.4 4.9 5.3 7.0 11.3 12.0 12.0 10.6 9.2 7.4 6.3 5.6
M07 5.3 5.5 6.9 7.1 6.4 5.5 5.2 5.7 7.6 11.7 12.6 12.2 10.9 9.3 7.9 6.5 5.8
M08 5.0 5.7 6.7 6.8 5.9 5.1 4.9 5.4 7.8 11.5 12.3 11.9 10.4 8.9 7.6 6.1 5.6
M09 4.6 5.5 6.3 6.5 5.6 5.1 4.6 5.3 7.7 11.4 11.9 11.5 9.6 8.5 7.0 5.6 5.2
M10 4.5 5.8 6.4 6.5 5.7 5.0 4.5 5.4 8.0 11.6 11.9 11.3 9.7 8.5 6.9 5.7 5.2
M11 4.7 6.2 6.8 6.7 5.9 5.3 4.8 5.6 8.4 11.7 12.3 11.0 9.5 8.3 7.0 5.7 5.0
M12 4.4 6.1 6.7 6.2 5.7 4.9 4.7 5.9 9.0 11.8 12.1 11.0 9.6 8.0 6.6 5.6 5.0

the opposite of unstacking is stacking, and it puts our data back into the multi-index format we began with:

data.unstack(level=1).stack(level=1)
unemployment_rate qty_unemployed qty_employed labour_force
state period year
Alabama M01 2000 5.1 108551 2024262 2132813
2001 5.2 110035 2008876 2118911
2002 6.5 134750 1953814 2088564
2003 6.2 129966 1974374 2104340
2004 6.4 134673 1985886 2120559
... ... ... ... ... ... ...
Wyoming M12 2012 5.2 15953 289131 305084
2013 4.6 14116 291169 305285
2014 3.8 11613 292140 303753
2015 4.8 14459 286400 300859
2016 5.0 14866 284248 299114

10608 rows × 4 columns

In this way, the unstack-stack functionality is a useful way to view data and potentially see some patterns; even if the data is has high dimensionality in the index

Challenge

  • Use the unstack method to view data about the employment rate among the labor force over years and months in California