{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Operations on Data\n", "* Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer\n", "\n", "When working with Numpy, and essential part of our toolkit was the ability to perform element wise operations such as adding and multiply or operations like exponentiating or taking the log. We also want this functionality for DataFrames and pandas provides it - borrowing functionality from NumPy.\n", "\n", "In this notebook we will look at how Pandas allows us to perform operations like those mentioned above - using universal functions.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## UFuncs: Preserving Indices\n", "\n", "Pandas is designed to work with Numpy. This means any numpy ufunc applied to a pandas object will return another pandas object of the same type.\n", "\n", "To see this, let's set up some example pandas objects:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 6\n", "2 3\n", "3 1\n", "4 9\n", "5 9\n", "6 4\n", "7 0\n", "8 9\n", "9 6\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.RandomState(1234567890)\n", "series = pd.Series(rng.randint(0, 10, 10))\n", "series" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(series)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
04073
15168
23517
31811
42587
58982
61184
75725
86343
99485
\n", "
" ], "text/plain": [ " A B C D\n", "0 4 0 7 3\n", "1 5 1 6 8\n", "2 3 5 1 7\n", "3 1 8 1 1\n", "4 2 5 8 7\n", "5 8 9 8 2\n", "6 1 1 8 4\n", "7 5 7 2 5\n", "8 6 3 4 3\n", "9 9 4 8 5" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(rng.randint(0, 10, (10, 4)),\n", " columns=['A', 'B', 'C', 'D'])\n", "df" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we apply a NumPy ufunc to an object, let's see what happens" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "series2 = np.log(series)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(series2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df2 = np.exp(df)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df2)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df['B'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(np.log10(df['B']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also apply a numpy ufunc to an individual row of a DataFrame" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
054.5981500.01096.63315820.085537
1148.4131591.0403.4287932980.957987
220.0855375.02.7182821096.633158
32.7182828.02.7182822.718282
47.3890565.02980.9579871096.633158
52980.9579879.02980.9579877.389056
62.7182821.02980.95798754.598150
7148.4131597.07.389056148.413159
8403.4287933.054.59815020.085537
98103.0839284.02980.957987148.413159
\n", "
" ], "text/plain": [ " A B C D\n", "0 54.598150 0.0 1096.633158 20.085537\n", "1 148.413159 1.0 403.428793 2980.957987\n", "2 20.085537 5.0 2.718282 1096.633158\n", "3 2.718282 8.0 2.718282 2.718282\n", "4 7.389056 5.0 2980.957987 1096.633158\n", "5 2980.957987 9.0 2980.957987 7.389056\n", "6 2.718282 1.0 2980.957987 54.598150\n", "7 148.413159 7.0 7.389056 148.413159\n", "8 403.428793 3.0 54.598150 20.085537\n", "9 8103.083928 4.0 2980.957987 148.413159" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = df2.copy()\n", "df3['B'] = np.log(df3['B'])\n", "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to NumPy functions, pandas provides it's own operators using operator overloading. These are summarized here:\n", "\n", "[insert table]\n", "\n", "We can use these operators to combine a pandas object with a scalar:\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 545.981500\n", "1 1484.131591\n", "2 200.855369\n", "3 27.182818\n", "4 73.890561\n", "5 29809.579870\n", "6 27.182818\n", "7 1484.131591\n", "8 4034.287935\n", "9 81030.839276\n", "Name: A, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3['A'] * 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and can also create new columns using ufuncs (either via NumPy or the pandas operators):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDE
054.5981500.01096.63315820.0855370.018316
1148.4131591.0403.4287932980.9579877.389056
220.0855375.02.7182821096.633158403.428793
32.7182828.02.7182822.7182821.000000
47.3890565.02980.9579871096.6331580.367879
52980.9579879.02980.9579877.3890560.002479
62.7182821.02980.95798754.5981500.018316
7148.4131597.07.389056148.41315920.085537
8403.4287933.054.59815020.0855370.367879
98103.0839284.02980.957987148.4131590.049787
\n", "
" ], "text/plain": [ " A B C D E\n", "0 54.598150 0.0 1096.633158 20.085537 0.018316\n", "1 148.413159 1.0 403.428793 2980.957987 7.389056\n", "2 20.085537 5.0 2.718282 1096.633158 403.428793\n", "3 2.718282 8.0 2.718282 2.718282 1.000000\n", "4 7.389056 5.0 2980.957987 1096.633158 0.367879\n", "5 2980.957987 9.0 2980.957987 7.389056 0.002479\n", "6 2.718282 1.0 2980.957987 54.598150 0.018316\n", "7 148.413159 7.0 7.389056 148.413159 20.085537\n", "8 403.428793 3.0 54.598150 20.085537 0.367879\n", "9 8103.083928 4.0 2980.957987 148.413159 0.049787" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3['E'] = df3['D'] / df3['C']\n", "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge\n", "\n", "Load in our labor market statistics data and perform the following operations:\n", "\n", "1. Verify the `unemployment_rate` variable is correctly computed. (Hint, round your computation to 1 decimal place using .round(decimels=1), test equality using df[''].equals(df['']))\n", "2. Does the Decomposition `labour_force = qty_employed + qty_employed` hold true?\n", "3. Assume that each worker was employed for 38 hours per week. Create a new column that estimates the number of hours worked in each state-year-month\n", "4. Assume that 60 percent of workers work 45 hours per week, whilst 40 percent work 20 hours per week, estimate the number of labour hours in each state-year-month\n", "5. Calculate the difference in your estimates from 3 and 4.\n", "\n", "#### Solutions" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateperiodyearunemployment_rateqty_unemployedqty_employedlabour_force
0AlabamaM0120005.110855120242622132813
1AlabamaM0120015.211003520088762118911
2AlabamaM0120026.513475019538142088564
3AlabamaM0120036.212996619743742104340
4AlabamaM0120046.413467319858862120559
\n", "
" ], "text/plain": [ " state period year unemployment_rate qty_unemployed qty_employed \\\n", "0 Alabama M01 2000 5.1 108551 2024262 \n", "1 Alabama M01 2001 5.2 110035 2008876 \n", "2 Alabama M01 2002 6.5 134750 1953814 \n", "3 Alabama M01 2003 6.2 129966 1974374 \n", "4 Alabama M01 2004 6.4 134673 1985886 \n", "\n", " labour_force \n", "0 2132813 \n", "1 2118911 \n", "2 2088564 \n", "3 2104340 \n", "4 2120559 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('out_data/state_labour_statistics.csv')\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
stateperiodyearunemployment_rateqty_unemployedqty_employedlabour_forceue_rate2
0AlabamaM0120005.1108551202426221328135.1
1AlabamaM0120015.2110035200887621189115.2
2AlabamaM0120026.5134750195381420885646.5
3AlabamaM0120036.2129966197437421043406.2
4AlabamaM0120046.4134673198588621205596.4
\n", "
" ], "text/plain": [ " state period year unemployment_rate qty_unemployed qty_employed \\\n", "0 Alabama M01 2000 5.1 108551 2024262 \n", "1 Alabama M01 2001 5.2 110035 2008876 \n", "2 Alabama M01 2002 6.5 134750 1953814 \n", "3 Alabama M01 2003 6.2 129966 1974374 \n", "4 Alabama M01 2004 6.4 134673 1985886 \n", "\n", " labour_force ue_rate2 \n", "0 2132813 5.1 \n", "1 2118911 5.2 \n", "2 2088564 6.5 \n", "3 2104340 6.2 \n", "4 2120559 6.4 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['ue_rate2'] = (data['qty_unemployed'] / data['labour_force'] * 100).round(1)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['ue_rate2'].equals(data['unemployment_rate'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['labour_force'].equals(data['qty_employed'] + data['qty_unemployed'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Index Alignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When pandas is performing binary operations on between different series, it will align indices when performing the operation. This is useful if we have incomplete data in one of both data frames when we are trying to combine them:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "A = pd.Series([2, 4, 6], index=[0, 1, 2])\n", "B = pd.Series([1, 3, 5], index=[1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 5.0\n", "2 9.0\n", "3 NaN\n", "dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A + B" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "by default, pandas returns NaNs - but this may not always be the behaviour we want. The option `fill_value` allows us to specify a value for the missing value to take:\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2.0\n", "1 5.0\n", "2 9.0\n", "3 5.0\n", "dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A.add(B, fill_value = 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "although this may not be desirable.\n", "\n", "A similar idea holds for DataFrames:\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
08.018.0NaN
121.018.0NaN
2NaNNaNNaN
\n", "
" ], "text/plain": [ " A B C\n", "0 8.0 18.0 NaN\n", "1 21.0 18.0 NaN\n", "2 NaN NaN NaN" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = pd.DataFrame(rng.randint(0, 20, (2, 2)),\n", " columns=list('AB'))\n", "B = pd.DataFrame(rng.randint(0, 10, (3, 3)),\n", " columns=list('BAC'))\n", "A+B" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally you can use ufuncs to combine info from a DataFrame and a Series:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BAC
0970
1324
2711
\n", "
" ], "text/plain": [ " B A C\n", "0 9 7 0\n", "1 3 2 4\n", "2 7 1 1" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BAC
00-2-9
10-11
20-6-6
\n", "
" ], "text/plain": [ " B A C\n", "0 0 -2 -9\n", "1 0 -1 1\n", "2 0 -6 -6" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "B.subtract(B['B'], axis=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }