Missing Data¶

Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

Real world data is rarely clean and homogenous (although are working labor market example is!). One of the main features we find in real world data is missing values - so we have to know how to deal with them

Pandas handles missing values through its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types. This is potentially something important to note - whenever you have missing data in a column, it will be treated as a floating-point data type.

import numpy as np
import pandas as pd

`None`: Pythonic Missing Data¶

The Python default for missing information is None. Notice that None has a special type in Python:

type(None)

NoneType

This means that None can only be stores in an array that is of ‘object’ type:

test_array = np.array([1,2,3,4])
test_array.dtype

dtype('int64')

#test_array[3] = None

test_array2 = np.array([1,2,None,4])
test_array2

array([1, 2, None, 4], dtype=object)

As a consequence - when there are Nones in an array and we perform an aggregation - we get an error:

#test_array2.sum()

NaN: missing numerical data¶

The other missing data representation is NaN, and it performs differently than None - it is a speical floating point value:

type(np.nan)

float

test_array3 = np.array([1,2,np.nan,4])
test_array3.dtype

dtype('float64')

Notice that np.nan ‘infects’ every operation it is combined with - any arithemetic with an np.nan yields a nan as a result:

test_array3.sum()

nan

10 + np.nan

nan

np.log(np.nan)

nan

Numpy does provide functionality to get around these nans by ignoring them:

np.nansum(test_array3)

7.0

np.nanmin(test_array3), np.nanmean(test_array3)

(1.0, 2.3333333333333335)

`NaN` and `None` in Pandas¶

Pandas is built to handle both - almost interchangably:

pd.Series([None, 42, np.nan])

   NaN
  42.0
   NaN
dtype: float64

and because np.nan is a floating point, pandas type-casts:

pd.Series([1, 42])

0     1
1    42
dtype: int64

pd.Series([np.nan, 42])

0     NaN
1    42.0
dtype: float64

Operating on Null Values¶

Pandas provides methods for working with null values in its data structures:

isnull(): generates boolean mask indicating missing values
notnull(): opposite of isnull()
dropna(): returns filtered version of data
fillna(): returns a copy of the data with missing valued filled or imputed

Let’s see them in action:

data = pd.Series([1, np.nan, 42, None])

data.isnull()

  False
   True
  False
   True
dtype: bool

data.notnull()

   True
  False
   True
  False
dtype: bool

data[(data.notnull())]

0     1.0
2    42.0
dtype: float64

Dropping na¶

data.dropna()

0     1.0
2    42.0
dtype: float64

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

	0	1	2
0	1.0	NaN	2
1	2.0	3.0	5
2	NaN	4.0	6

df.dropna()

	0	1	2
1	2.0	3.0	5

df.dropna(axis='columns')

	2
0	2
1	5
2	6

df[3] = np.nan
df

	0	1	2	3
0	1.0	NaN	2	NaN
1	2.0	3.0	5	NaN
2	NaN	4.0	6	NaN

df.dropna(axis='columns', how='all')

	0	1	2
0	1.0	NaN	2
1	2.0	3.0	5
2	NaN	4.0	6

df.dropna(axis='columns', how='any')

	2
0	2
1	5
2	6

df.dropna(axis='rows', thresh=3) # 3 non na values

	0	1	2	3
1	2.0	3.0	5	NaN

df.loc[1,1]=np.nan
df

	0	1	2	3
0	1.0	NaN	2	NaN
1	2.0	NaN	5	NaN
2	NaN	4.0	6	NaN

df.dropna(axis='columns', thresh=2) # 3 non na values

	0	2
0	1.0	2
1	2.0	5
2	NaN	6

Filling na¶

data

   1.0
   NaN
  42.0
   NaN
dtype: float64

data.fillna(99)

   1.0
  99.0
  42.0
  99.0
dtype: float64

data.fillna(method='ffill')

   1.0
   1.0
  42.0
  42.0
dtype: float64

data.fillna(method='bfill')

   1.0
  42.0
  42.0
   NaN
dtype: float64

df.fillna(method='ffill', axis=1)

	0	1	2	3
0	1.0	1.0	2.0	2.0
1	2.0	2.0	5.0	5.0
2	NaN	4.0	6.0	6.0

Python for Economics and Business Research