Hierachical (Multi-) Indexing¶

Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

In earlier notebooks we were working with our labor market data and importing it to have a multi-index. In this notebook, we explore the notions of multi-indexing in more detail.

import pandas as pd

Multiply Indexed Data¶

We focus on multiply indexed DataFrames, and ignore pandas Series because we will most often come across DataFrames. Most of our discussion carries over.

Let’s import our labour market data in the ‘simplest way’

data = pd.read_csv('out_data/state_labour_statistics.csv')
data.head()

	state	period	year	unemployment_rate	qty_unemployed	qty_employed	labour_force
0	New Hampshire	M12	2016	2.5	18907	728556	747463
1	New Hampshire	M11	2016	2.6	19573	729720	749293
2	New Hampshire	M10	2016	2.5	18430	729792	748222
3	New Hampshire	M09	2016	2.6	19415	727548	746963
4	New Hampshire	M08	2016	2.8	20933	737236	758169

data.index

RangeIndex(start=0, stop=10608, step=1)

Again note that pandas has created an index for us, which is simply a row identifier. We argued earlier that a better way of indexing might be state-year-month.

To make the shift to our preferred index, we need to reset the index, using the set_index function.

data.set_index(['state', 'year', 'period'], inplace=True)
data.head()

			unemployment_rate	qty_unemployed	qty_employed	labour_force
state	year	period
New Hampshire	2016	M12	2.5	18907	728556	747463
		M11	2.6	19573	729720	749293
		M10	2.5	18430	729792	748222
		M09	2.6	19415	727548	746963
		M08	2.8	20933	737236	758169

If we want to go back to the original index, we can use reset_index and ask that the multi-index we set is returned to columns of the data

data.reset_index(drop=False, inplace=True)

data.head()

	state	year	period	unemployment_rate	qty_unemployed	qty_employed	labour_force
0	New Hampshire	2016	M12	2.5	18907	728556	747463
1	New Hampshire	2016	M11	2.6	19573	729720	749293
2	New Hampshire	2016	M10	2.5	18430	729792	748222
3	New Hampshire	2016	M09	2.6	19415	727548	746963
4	New Hampshire	2016	M08	2.8	20933	737236	758169

But let’s for now stick with our multi-index data- and see how to use it

data.set_index(['state', 'year', 'period'], inplace=True)

Selecting Data with a Multi-Index¶

One advantage of a multi-index is that we can subset data quite simply:

data.loc['Alabama']

		unemployment_rate	qty_unemployed	qty_employed	labour_force
year	period
2016	M12	5.9	129295	2050965	2180260
	M11	5.7	123834	2060807	2184641
	M10	6.2	136336	2057151	2193487
	M09	6.2	134032	2042696	2176728
	M08	6.2	134754	2033706	2168460
...	...	...	...	...	...
2000	M05	4.1	86308	2037787	2124095
	M04	3.8	80215	2045171	2125386
	M03	4.5	95449	2038654	2134103
	M02	5.1	109464	2027226	2136690
	M01	5.1	108551	2024262	2132813

204 rows × 4 columns

data.loc['Alabama', 2010]

/tmp/ipykernel_2901/4288551881.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  data.loc['Alabama', 2010]

	unemployment_rate	qty_unemployed	qty_employed	labour_force
period
M12	9.8	217272	1996538	2213810
M11	10.0	222536	2001481	2224017
M10	9.9	219224	2001483	2220707
M09	9.8	216750	1990219	2206969
M08	10.1	224435	1988261	2212696
M07	10.2	225826	1990949	2216775
M06	10.5	234972	1994590	2229562
M05	9.9	219339	1994090	2213429
M04	10.4	229680	1976552	2206232
M03	11.5	245567	1892006	2137573
M02	12.1	258825	1876758	2135583
M01	12.3	263375	1871780	2135155

data.loc['Alabama', 2010, 'M10']

unemployment_rate          9.9
qty_unemployed        219224.0
qty_employed         2001483.0
labour_force         2220707.0
Name: (Alabama, 2010, M10), dtype: float64

# we hope this may work...
data.loc['Alabama', :, 'M10']

			unemployment_rate	qty_unemployed	qty_employed	labour_force
state	year	period
Alabama	2016	M10	6.2	136336	2057151	2193487
	2015	M10	5.9	126424	2032455	2158879
	2014	M10	6.0	130881	2035345	2166226
	2013	M10	7.0	152300	2012001	2164301
	2012	M10	7.5	165164	2025114	2190278
	2011	M10	8.9	194917	2005689	2200606
	2010	M10	9.9	219224	2001483	2220707
	2009	M10	11.7	252694	1906032	2158726
	2008	M10	6.3	137392	2047657	2185049
	2007	M10	3.8	82043	2099135	2181178
	2006	M10	3.6	79073	2103518	2182591
	2005	M10	4.2	91482	2078674	2170156
	2004	M10	5.5	119278	2035705	2154983
	2003	M10	6.1	130045	2009302	2139347
	2002	M10	5.6	118872	2006102	2124974
	2001	M10	5.6	118727	2006251	2124978
	2000	M10	4.5	96899	2042462	2139361

# we need to sort the index first
data = data.sort_index()
data.head()

			unemployment_rate	qty_unemployed	qty_employed	labour_force
state	year	period
Alabama	2000	M01	5.1	108551	2024262	2132813
		M02	5.1	109464	2027226	2136690
		M03	4.5	95449	2038654	2134103
		M04	3.8	80215	2045171	2125386
		M05	4.1	86308	2037787	2124095

data.loc['Alabama', 2010:2016, 'M10':'M12']

			unemployment_rate	qty_unemployed	qty_employed	labour_force
state	year	period
Alabama	2010	M10	9.9	219224	2001483	2220707
		M11	10.0	222536	2001481	2224017
		M12	9.8	217272	1996538	2213810
	2011	M10	8.9	194917	2005689	2200606
		M11	8.0	175369	2008604	2183973
		M12	7.9	172004	2004151	2176155
	2012	M10	7.5	165164	2025114	2190278
		M11	6.9	149468	2023318	2172786
		M12	7.2	156518	2016210	2172728
	2013	M10	7.0	152300	2012001	2164301
		M11	6.7	144052	2020314	2164366
		M12	6.7	143618	2013519	2157137
	2014	M10	6.0	130881	2035345	2166226
		M11	5.8	124177	2033698	2157875
		M12	5.7	121281	2023427	2144708
	2015	M10	5.9	126424	2032455	2158879
		M11	5.7	122954	2035229	2158183
		M12	5.9	127389	2030765	2158154
	2016	M10	6.2	136336	2057151	2193487
		M11	5.7	123834	2060807	2184641
		M12	5.9	129295	2050965	2180260

Challenge¶

Extract all the data for the Carolinas (Help: you need to do a partial string match on the index data.index.get_level_values(XX).str.contains(YY) )
Extract all the data for the Carolinas in 2007
Extract all the data for the Carolinas between 2007-2010
Extract all the data in the summer months for the Carolina between 2007 and 2010

Partial Solution:¶

data.loc[data.index.get_level_values(0).str.contains("Carolina"), 
             2007:2010, 'M06':'M09']

			unemployment_rate	qty_unemployed	qty_employed	labour_force
state	year	period
North Carolina	2007	M06	4.9	220499	4323332	4543831
		M07	5.1	231083	4336375	4567458
		M08	4.7	212508	4299171	4511679
		M09	4.5	201890	4321920	4523810
	2008	M06	5.9	271173	4313781	4584954
		M07	6.3	289018	4324163	4613181
		M08	6.3	288553	4282651	4571204
		M09	5.9	268607	4292928	4561535
	2009	M06	10.8	498908	4102835	4601743
		M07	10.8	498541	4121528	4620069
		M08	10.6	485802	4083402	4569204
		M09	10.4	472231	4076322	4548553
	2010	M06	10.9	503871	4137291	4641162
		M07	10.8	503704	4150739	4654443
		M08	10.5	486253	4137732	4623985
		M09	10.0	459133	4152584	4611717
South Carolina	2007	M06	5.9	126476	2033179	2159655
		M07	5.8	125358	2031889	2157247
		M08	5.7	121989	2015875	2137864
		M09	5.6	119816	2008234	2128050
	2008	M06	6.6	143842	2028477	2172319
		M07	6.9	149774	2022393	2172167
		M08	7.5	162877	2003086	2165963
		M09	7.5	160318	1990871	2151189
	2009	M06	11.9	260682	1932513	2193195
		M07	11.6	253471	1932518	2185989
		M08	11.5	247965	1914687	2162652
		M09	11.5	245694	1895141	2140835
	2010	M06	11.2	244208	1940423	2184631
		M07	11.0	240106	1944223	2184329
		M08	11.2	244749	1934786	2179535
		M09	10.7	231578	1923986	2155564

Index (Un-)Stacking¶

One potentially cool use for multi-indexing is using the indexes across two dimensions. This is unstacking, and often allows for simple ways to view patterns in the data.

For example, we could translate the year-index across to a column axis, so that we could can easily see patterns in labor force statistics in a given month, but over various years:

data.unstack(level=1)

		unemployment_rate										...	labour_force
	year	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	...	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016
state	period
Alabama	M01	5.1	5.2	6.5	6.2	6.4	5.4	4.4	4.3	4.8	9.7	...	2169298	2170233	2164104	2135155	2210128	2151638	2166067	2151800	2134230	2141628
	M02	5.1	5.0	6.3	6.0	6.1	5.4	4.5	4.2	4.9	10.5	...	2170069	2167554	2171529	2135583	2204837	2165383	2170603	2165104	2136284	2152446
	M03	4.5	4.7	6.0	5.6	5.9	4.6	3.9	3.8	4.7	10.4	...	2176156	2170499	2166423	2137573	2206650	2171589	2164327	2172369	2139776	2160907
	M04	3.8	4.2	5.3	5.3	5.2	3.9	3.5	3.3	4.0	9.9	...	2156839	2162674	2162016	2206232	2204162	2163833	2158996	2153646	2144272	2154477
	M05	4.1	4.0	5.2	5.6	5.4	4.0	3.6	3.3	4.9	10.6	...	2158625	2174485	2164803	2213429	2205918	2178571	2171580	2164055	2162423	2159095
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Wyoming	M08	3.4	3.3	3.4	3.7	3.3	3.2	2.8	2.5	2.6	6.1	...	290133	296658	302187	305471	309004	309559	308650	308852	307846	304511
	M09	3.4	3.3	3.5	3.6	3.4	3.2	2.8	2.4	2.5	6.0	...	287821	294561	299774	303557	307206	306748	306052	306169	304353	301548
	M10	3.4	3.4	3.6	3.6	3.4	3.3	2.8	2.4	2.7	6.3	...	288442	295880	300205	304878	307938	307263	306560	306735	304311	300979
	M11	4.0	3.8	4.2	4.1	3.7	3.5	3.0	2.6	3.2	6.7	...	288931	295690	300575	305384	308172	306948	306599	306418	302892	300414
	M12	3.9	3.9	4.5	4.2	3.7	3.5	2.9	2.5	3.7	7.5	...	287174	294792	300582	304124	306890	305084	305285	303753	300859	299114

624 rows × 68 columns

In it’s current form this is a bit ugly because there is so much data. But we can use our indexing function loc together with a column selection to potentially view one of our series:

data.unstack(level=1).loc['California']['unemployment_rate']

year	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016
period
M01	5.4	5.3	7.0	7.3	7.0	6.2	5.3	5.5	6.4	10.2	12.7	12.6	11.4	10.2	8.4	7.2	5.9
M02	5.4	5.1	6.8	7.1	6.7	6.1	5.3	5.4	6.3	10.6	12.6	12.2	11.3	9.7	8.4	6.9	5.8
M03	5.4	5.3	6.9	7.1	7.0	5.9	5.2	5.2	6.6	10.9	12.6	12.1	11.0	9.3	8.3	6.7	5.8
M04	4.8	4.9	6.6	6.7	6.3	5.3	4.9	5.0	6.1	10.4	12.0	11.5	10.3	8.7	7.3	6.2	5.4
M05	4.8	4.6	6.2	6.4	5.9	5.0	4.5	4.8	6.4	10.6	11.7	11.3	10.2	8.6	7.2	6.1	4.9
M06	5.0	5.2	6.7	7.2	6.4	5.4	4.9	5.3	7.0	11.3	12.0	12.0	10.6	9.2	7.4	6.3	5.6
M07	5.3	5.5	6.9	7.1	6.4	5.5	5.2	5.7	7.6	11.7	12.6	12.2	10.9	9.3	7.9	6.5	5.8
M08	5.0	5.7	6.7	6.8	5.9	5.1	4.9	5.4	7.8	11.5	12.3	11.9	10.4	8.9	7.6	6.1	5.6
M09	4.6	5.5	6.3	6.5	5.6	5.1	4.6	5.3	7.7	11.4	11.9	11.5	9.6	8.5	7.0	5.6	5.2
M10	4.5	5.8	6.4	6.5	5.7	5.0	4.5	5.4	8.0	11.6	11.9	11.3	9.7	8.5	6.9	5.7	5.2
M11	4.7	6.2	6.8	6.7	5.9	5.3	4.8	5.6	8.4	11.7	12.3	11.0	9.5	8.3	7.0	5.7	5.0
M12	4.4	6.1	6.7	6.2	5.7	4.9	4.7	5.9	9.0	11.8	12.1	11.0	9.6	8.0	6.6	5.6	5.0

the opposite of unstacking is stacking, and it puts our data back into the multi-index format we began with:

data.unstack(level=1).stack(level=1)

			unemployment_rate	qty_unemployed	qty_employed	labour_force
state	period	year
Alabama	M01	2000	5.1	108551	2024262	2132813
		2001	5.2	110035	2008876	2118911
		2002	6.5	134750	1953814	2088564
		2003	6.2	129966	1974374	2104340
		2004	6.4	134673	1985886	2120559
...	...	...	...	...	...	...
Wyoming	M12	2012	5.2	15953	289131	305084
		2013	4.6	14116	291169	305285
		2014	3.8	11613	292140	303753
		2015	4.8	14459	286400	300859
		2016	5.0	14866	284248	299114

10608 rows × 4 columns

In this way, the unstack-stack functionality is a useful way to view data and potentially see some patterns; even if the data is has high dimensionality in the index

Challenge¶

Use the unstack method to view data about the employment rate among the labor force over years and months in California

Python for Economics and Business Research

Hierachical (Multi-) Indexing¶

Multiply Indexed Data¶

Selecting Data with a Multi-Index¶

Challenge¶

Partial Solution:¶

Index (Un-)Stacking¶

Challenge¶