Hierachical (Multi-) Indexing¶
Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer
In earlier notebooks we were working with our labor market data and importing it to have a multi-index. In this notebook, we explore the notions of multi-indexing in more detail.
import pandas as pd
Multiply Indexed Data¶
We focus on multiply indexed DataFrames
, and ignore pandas Series
because we will most often come across DataFrames
. Most of our discussion carries over.
Let’s import our labour market data in the ‘simplest way’
data = pd.read_csv('out_data/state_labour_statistics.csv')
data.head()
state | period | year | unemployment_rate | qty_unemployed | qty_employed | labour_force | |
---|---|---|---|---|---|---|---|
0 | New Hampshire | M12 | 2016 | 2.5 | 18907 | 728556 | 747463 |
1 | New Hampshire | M11 | 2016 | 2.6 | 19573 | 729720 | 749293 |
2 | New Hampshire | M10 | 2016 | 2.5 | 18430 | 729792 | 748222 |
3 | New Hampshire | M09 | 2016 | 2.6 | 19415 | 727548 | 746963 |
4 | New Hampshire | M08 | 2016 | 2.8 | 20933 | 737236 | 758169 |
data.index
RangeIndex(start=0, stop=10608, step=1)
Again note that pandas has created an index for us, which is simply a row identifier. We argued earlier that a better way of indexing might be state-year-month.
To make the shift to our preferred index, we need to reset the index, using the set_index
function.
data.set_index(['state', 'year', 'period'], inplace=True)
data.head()
unemployment_rate | qty_unemployed | qty_employed | labour_force | |||
---|---|---|---|---|---|---|
state | year | period | ||||
New Hampshire | 2016 | M12 | 2.5 | 18907 | 728556 | 747463 |
M11 | 2.6 | 19573 | 729720 | 749293 | ||
M10 | 2.5 | 18430 | 729792 | 748222 | ||
M09 | 2.6 | 19415 | 727548 | 746963 | ||
M08 | 2.8 | 20933 | 737236 | 758169 |
If we want to go back to the original index, we can use reset_index
and ask that the multi-index we set is returned to columns of the data
data.reset_index(drop=False, inplace=True)
data.head()
state | year | period | unemployment_rate | qty_unemployed | qty_employed | labour_force | |
---|---|---|---|---|---|---|---|
0 | New Hampshire | 2016 | M12 | 2.5 | 18907 | 728556 | 747463 |
1 | New Hampshire | 2016 | M11 | 2.6 | 19573 | 729720 | 749293 |
2 | New Hampshire | 2016 | M10 | 2.5 | 18430 | 729792 | 748222 |
3 | New Hampshire | 2016 | M09 | 2.6 | 19415 | 727548 | 746963 |
4 | New Hampshire | 2016 | M08 | 2.8 | 20933 | 737236 | 758169 |
But let’s for now stick with our multi-index data- and see how to use it
data.set_index(['state', 'year', 'period'], inplace=True)
Selecting Data with a Multi-Index¶
One advantage of a multi-index is that we can subset data quite simply:
data.loc['Alabama']
unemployment_rate | qty_unemployed | qty_employed | labour_force | ||
---|---|---|---|---|---|
year | period | ||||
2016 | M12 | 5.9 | 129295 | 2050965 | 2180260 |
M11 | 5.7 | 123834 | 2060807 | 2184641 | |
M10 | 6.2 | 136336 | 2057151 | 2193487 | |
M09 | 6.2 | 134032 | 2042696 | 2176728 | |
M08 | 6.2 | 134754 | 2033706 | 2168460 | |
... | ... | ... | ... | ... | ... |
2000 | M05 | 4.1 | 86308 | 2037787 | 2124095 |
M04 | 3.8 | 80215 | 2045171 | 2125386 | |
M03 | 4.5 | 95449 | 2038654 | 2134103 | |
M02 | 5.1 | 109464 | 2027226 | 2136690 | |
M01 | 5.1 | 108551 | 2024262 | 2132813 |
204 rows × 4 columns
data.loc['Alabama', 2010]
/tmp/ipykernel_2901/4288551881.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
data.loc['Alabama', 2010]
unemployment_rate | qty_unemployed | qty_employed | labour_force | |
---|---|---|---|---|
period | ||||
M12 | 9.8 | 217272 | 1996538 | 2213810 |
M11 | 10.0 | 222536 | 2001481 | 2224017 |
M10 | 9.9 | 219224 | 2001483 | 2220707 |
M09 | 9.8 | 216750 | 1990219 | 2206969 |
M08 | 10.1 | 224435 | 1988261 | 2212696 |
M07 | 10.2 | 225826 | 1990949 | 2216775 |
M06 | 10.5 | 234972 | 1994590 | 2229562 |
M05 | 9.9 | 219339 | 1994090 | 2213429 |
M04 | 10.4 | 229680 | 1976552 | 2206232 |
M03 | 11.5 | 245567 | 1892006 | 2137573 |
M02 | 12.1 | 258825 | 1876758 | 2135583 |
M01 | 12.3 | 263375 | 1871780 | 2135155 |
data.loc['Alabama', 2010, 'M10']
unemployment_rate 9.9
qty_unemployed 219224.0
qty_employed 2001483.0
labour_force 2220707.0
Name: (Alabama, 2010, M10), dtype: float64
# we hope this may work...
data.loc['Alabama', :, 'M10']
unemployment_rate | qty_unemployed | qty_employed | labour_force | |||
---|---|---|---|---|---|---|
state | year | period | ||||
Alabama | 2016 | M10 | 6.2 | 136336 | 2057151 | 2193487 |
2015 | M10 | 5.9 | 126424 | 2032455 | 2158879 | |
2014 | M10 | 6.0 | 130881 | 2035345 | 2166226 | |
2013 | M10 | 7.0 | 152300 | 2012001 | 2164301 | |
2012 | M10 | 7.5 | 165164 | 2025114 | 2190278 | |
2011 | M10 | 8.9 | 194917 | 2005689 | 2200606 | |
2010 | M10 | 9.9 | 219224 | 2001483 | 2220707 | |
2009 | M10 | 11.7 | 252694 | 1906032 | 2158726 | |
2008 | M10 | 6.3 | 137392 | 2047657 | 2185049 | |
2007 | M10 | 3.8 | 82043 | 2099135 | 2181178 | |
2006 | M10 | 3.6 | 79073 | 2103518 | 2182591 | |
2005 | M10 | 4.2 | 91482 | 2078674 | 2170156 | |
2004 | M10 | 5.5 | 119278 | 2035705 | 2154983 | |
2003 | M10 | 6.1 | 130045 | 2009302 | 2139347 | |
2002 | M10 | 5.6 | 118872 | 2006102 | 2124974 | |
2001 | M10 | 5.6 | 118727 | 2006251 | 2124978 | |
2000 | M10 | 4.5 | 96899 | 2042462 | 2139361 |
# we need to sort the index first
data = data.sort_index()
data.head()
unemployment_rate | qty_unemployed | qty_employed | labour_force | |||
---|---|---|---|---|---|---|
state | year | period | ||||
Alabama | 2000 | M01 | 5.1 | 108551 | 2024262 | 2132813 |
M02 | 5.1 | 109464 | 2027226 | 2136690 | ||
M03 | 4.5 | 95449 | 2038654 | 2134103 | ||
M04 | 3.8 | 80215 | 2045171 | 2125386 | ||
M05 | 4.1 | 86308 | 2037787 | 2124095 |
data.loc['Alabama', 2010:2016, 'M10':'M12']
unemployment_rate | qty_unemployed | qty_employed | labour_force | |||
---|---|---|---|---|---|---|
state | year | period | ||||
Alabama | 2010 | M10 | 9.9 | 219224 | 2001483 | 2220707 |
M11 | 10.0 | 222536 | 2001481 | 2224017 | ||
M12 | 9.8 | 217272 | 1996538 | 2213810 | ||
2011 | M10 | 8.9 | 194917 | 2005689 | 2200606 | |
M11 | 8.0 | 175369 | 2008604 | 2183973 | ||
M12 | 7.9 | 172004 | 2004151 | 2176155 | ||
2012 | M10 | 7.5 | 165164 | 2025114 | 2190278 | |
M11 | 6.9 | 149468 | 2023318 | 2172786 | ||
M12 | 7.2 | 156518 | 2016210 | 2172728 | ||
2013 | M10 | 7.0 | 152300 | 2012001 | 2164301 | |
M11 | 6.7 | 144052 | 2020314 | 2164366 | ||
M12 | 6.7 | 143618 | 2013519 | 2157137 | ||
2014 | M10 | 6.0 | 130881 | 2035345 | 2166226 | |
M11 | 5.8 | 124177 | 2033698 | 2157875 | ||
M12 | 5.7 | 121281 | 2023427 | 2144708 | ||
2015 | M10 | 5.9 | 126424 | 2032455 | 2158879 | |
M11 | 5.7 | 122954 | 2035229 | 2158183 | ||
M12 | 5.9 | 127389 | 2030765 | 2158154 | ||
2016 | M10 | 6.2 | 136336 | 2057151 | 2193487 | |
M11 | 5.7 | 123834 | 2060807 | 2184641 | ||
M12 | 5.9 | 129295 | 2050965 | 2180260 |
Challenge¶
Extract all the data for the Carolinas (Help: you need to do a partial string match on the index
data.index.get_level_values(XX).str.contains(YY)
)Extract all the data for the Carolinas in 2007
Extract all the data for the Carolinas between 2007-2010
Extract all the data in the summer months for the Carolina between 2007 and 2010
Partial Solution:¶
data.loc[data.index.get_level_values(0).str.contains("Carolina"),
2007:2010, 'M06':'M09']
unemployment_rate | qty_unemployed | qty_employed | labour_force | |||
---|---|---|---|---|---|---|
state | year | period | ||||
North Carolina | 2007 | M06 | 4.9 | 220499 | 4323332 | 4543831 |
M07 | 5.1 | 231083 | 4336375 | 4567458 | ||
M08 | 4.7 | 212508 | 4299171 | 4511679 | ||
M09 | 4.5 | 201890 | 4321920 | 4523810 | ||
2008 | M06 | 5.9 | 271173 | 4313781 | 4584954 | |
M07 | 6.3 | 289018 | 4324163 | 4613181 | ||
M08 | 6.3 | 288553 | 4282651 | 4571204 | ||
M09 | 5.9 | 268607 | 4292928 | 4561535 | ||
2009 | M06 | 10.8 | 498908 | 4102835 | 4601743 | |
M07 | 10.8 | 498541 | 4121528 | 4620069 | ||
M08 | 10.6 | 485802 | 4083402 | 4569204 | ||
M09 | 10.4 | 472231 | 4076322 | 4548553 | ||
2010 | M06 | 10.9 | 503871 | 4137291 | 4641162 | |
M07 | 10.8 | 503704 | 4150739 | 4654443 | ||
M08 | 10.5 | 486253 | 4137732 | 4623985 | ||
M09 | 10.0 | 459133 | 4152584 | 4611717 | ||
South Carolina | 2007 | M06 | 5.9 | 126476 | 2033179 | 2159655 |
M07 | 5.8 | 125358 | 2031889 | 2157247 | ||
M08 | 5.7 | 121989 | 2015875 | 2137864 | ||
M09 | 5.6 | 119816 | 2008234 | 2128050 | ||
2008 | M06 | 6.6 | 143842 | 2028477 | 2172319 | |
M07 | 6.9 | 149774 | 2022393 | 2172167 | ||
M08 | 7.5 | 162877 | 2003086 | 2165963 | ||
M09 | 7.5 | 160318 | 1990871 | 2151189 | ||
2009 | M06 | 11.9 | 260682 | 1932513 | 2193195 | |
M07 | 11.6 | 253471 | 1932518 | 2185989 | ||
M08 | 11.5 | 247965 | 1914687 | 2162652 | ||
M09 | 11.5 | 245694 | 1895141 | 2140835 | ||
2010 | M06 | 11.2 | 244208 | 1940423 | 2184631 | |
M07 | 11.0 | 240106 | 1944223 | 2184329 | ||
M08 | 11.2 | 244749 | 1934786 | 2179535 | ||
M09 | 10.7 | 231578 | 1923986 | 2155564 |
Index (Un-)Stacking¶
One potentially cool use for multi-indexing is using the indexes across two dimensions. This is unstacking, and often allows for simple ways to view patterns in the data.
For example, we could translate the year-index across to a column axis, so that we could can easily see patterns in labor force statistics in a given month, but over various years:
data.unstack(level=1)
unemployment_rate | ... | labour_force | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | ... | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | |
state | period | |||||||||||||||||||||
Alabama | M01 | 5.1 | 5.2 | 6.5 | 6.2 | 6.4 | 5.4 | 4.4 | 4.3 | 4.8 | 9.7 | ... | 2169298 | 2170233 | 2164104 | 2135155 | 2210128 | 2151638 | 2166067 | 2151800 | 2134230 | 2141628 |
M02 | 5.1 | 5.0 | 6.3 | 6.0 | 6.1 | 5.4 | 4.5 | 4.2 | 4.9 | 10.5 | ... | 2170069 | 2167554 | 2171529 | 2135583 | 2204837 | 2165383 | 2170603 | 2165104 | 2136284 | 2152446 | |
M03 | 4.5 | 4.7 | 6.0 | 5.6 | 5.9 | 4.6 | 3.9 | 3.8 | 4.7 | 10.4 | ... | 2176156 | 2170499 | 2166423 | 2137573 | 2206650 | 2171589 | 2164327 | 2172369 | 2139776 | 2160907 | |
M04 | 3.8 | 4.2 | 5.3 | 5.3 | 5.2 | 3.9 | 3.5 | 3.3 | 4.0 | 9.9 | ... | 2156839 | 2162674 | 2162016 | 2206232 | 2204162 | 2163833 | 2158996 | 2153646 | 2144272 | 2154477 | |
M05 | 4.1 | 4.0 | 5.2 | 5.6 | 5.4 | 4.0 | 3.6 | 3.3 | 4.9 | 10.6 | ... | 2158625 | 2174485 | 2164803 | 2213429 | 2205918 | 2178571 | 2171580 | 2164055 | 2162423 | 2159095 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Wyoming | M08 | 3.4 | 3.3 | 3.4 | 3.7 | 3.3 | 3.2 | 2.8 | 2.5 | 2.6 | 6.1 | ... | 290133 | 296658 | 302187 | 305471 | 309004 | 309559 | 308650 | 308852 | 307846 | 304511 |
M09 | 3.4 | 3.3 | 3.5 | 3.6 | 3.4 | 3.2 | 2.8 | 2.4 | 2.5 | 6.0 | ... | 287821 | 294561 | 299774 | 303557 | 307206 | 306748 | 306052 | 306169 | 304353 | 301548 | |
M10 | 3.4 | 3.4 | 3.6 | 3.6 | 3.4 | 3.3 | 2.8 | 2.4 | 2.7 | 6.3 | ... | 288442 | 295880 | 300205 | 304878 | 307938 | 307263 | 306560 | 306735 | 304311 | 300979 | |
M11 | 4.0 | 3.8 | 4.2 | 4.1 | 3.7 | 3.5 | 3.0 | 2.6 | 3.2 | 6.7 | ... | 288931 | 295690 | 300575 | 305384 | 308172 | 306948 | 306599 | 306418 | 302892 | 300414 | |
M12 | 3.9 | 3.9 | 4.5 | 4.2 | 3.7 | 3.5 | 2.9 | 2.5 | 3.7 | 7.5 | ... | 287174 | 294792 | 300582 | 304124 | 306890 | 305084 | 305285 | 303753 | 300859 | 299114 |
624 rows × 68 columns
In it’s current form this is a bit ugly because there is so much data. But we can use our indexing function loc
together with a column selection to potentially view one of our series:
data.unstack(level=1).loc['California']['unemployment_rate']
year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
period | |||||||||||||||||
M01 | 5.4 | 5.3 | 7.0 | 7.3 | 7.0 | 6.2 | 5.3 | 5.5 | 6.4 | 10.2 | 12.7 | 12.6 | 11.4 | 10.2 | 8.4 | 7.2 | 5.9 |
M02 | 5.4 | 5.1 | 6.8 | 7.1 | 6.7 | 6.1 | 5.3 | 5.4 | 6.3 | 10.6 | 12.6 | 12.2 | 11.3 | 9.7 | 8.4 | 6.9 | 5.8 |
M03 | 5.4 | 5.3 | 6.9 | 7.1 | 7.0 | 5.9 | 5.2 | 5.2 | 6.6 | 10.9 | 12.6 | 12.1 | 11.0 | 9.3 | 8.3 | 6.7 | 5.8 |
M04 | 4.8 | 4.9 | 6.6 | 6.7 | 6.3 | 5.3 | 4.9 | 5.0 | 6.1 | 10.4 | 12.0 | 11.5 | 10.3 | 8.7 | 7.3 | 6.2 | 5.4 |
M05 | 4.8 | 4.6 | 6.2 | 6.4 | 5.9 | 5.0 | 4.5 | 4.8 | 6.4 | 10.6 | 11.7 | 11.3 | 10.2 | 8.6 | 7.2 | 6.1 | 4.9 |
M06 | 5.0 | 5.2 | 6.7 | 7.2 | 6.4 | 5.4 | 4.9 | 5.3 | 7.0 | 11.3 | 12.0 | 12.0 | 10.6 | 9.2 | 7.4 | 6.3 | 5.6 |
M07 | 5.3 | 5.5 | 6.9 | 7.1 | 6.4 | 5.5 | 5.2 | 5.7 | 7.6 | 11.7 | 12.6 | 12.2 | 10.9 | 9.3 | 7.9 | 6.5 | 5.8 |
M08 | 5.0 | 5.7 | 6.7 | 6.8 | 5.9 | 5.1 | 4.9 | 5.4 | 7.8 | 11.5 | 12.3 | 11.9 | 10.4 | 8.9 | 7.6 | 6.1 | 5.6 |
M09 | 4.6 | 5.5 | 6.3 | 6.5 | 5.6 | 5.1 | 4.6 | 5.3 | 7.7 | 11.4 | 11.9 | 11.5 | 9.6 | 8.5 | 7.0 | 5.6 | 5.2 |
M10 | 4.5 | 5.8 | 6.4 | 6.5 | 5.7 | 5.0 | 4.5 | 5.4 | 8.0 | 11.6 | 11.9 | 11.3 | 9.7 | 8.5 | 6.9 | 5.7 | 5.2 |
M11 | 4.7 | 6.2 | 6.8 | 6.7 | 5.9 | 5.3 | 4.8 | 5.6 | 8.4 | 11.7 | 12.3 | 11.0 | 9.5 | 8.3 | 7.0 | 5.7 | 5.0 |
M12 | 4.4 | 6.1 | 6.7 | 6.2 | 5.7 | 4.9 | 4.7 | 5.9 | 9.0 | 11.8 | 12.1 | 11.0 | 9.6 | 8.0 | 6.6 | 5.6 | 5.0 |
the opposite of unstacking is stacking
, and it puts our data back into the multi-index format we began with:
data.unstack(level=1).stack(level=1)
unemployment_rate | qty_unemployed | qty_employed | labour_force | |||
---|---|---|---|---|---|---|
state | period | year | ||||
Alabama | M01 | 2000 | 5.1 | 108551 | 2024262 | 2132813 |
2001 | 5.2 | 110035 | 2008876 | 2118911 | ||
2002 | 6.5 | 134750 | 1953814 | 2088564 | ||
2003 | 6.2 | 129966 | 1974374 | 2104340 | ||
2004 | 6.4 | 134673 | 1985886 | 2120559 | ||
... | ... | ... | ... | ... | ... | ... |
Wyoming | M12 | 2012 | 5.2 | 15953 | 289131 | 305084 |
2013 | 4.6 | 14116 | 291169 | 305285 | ||
2014 | 3.8 | 11613 | 292140 | 303753 | ||
2015 | 4.8 | 14459 | 286400 | 300859 | ||
2016 | 5.0 | 14866 | 284248 | 299114 |
10608 rows × 4 columns
In this way, the unstack
-stack
functionality is a useful way to view data and potentially see some patterns; even if the data is has high dimensionality in the index
Challenge¶
Use the unstack method to view data about the employment rate among the labor force over years and months in California