Missing Data

  • Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer

Real world data is rarely clean and homogenous (although are working labor market example is!). One of the main features we find in real world data is missing values - so we have to know how to deal with them

Pandas handles missing values through its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types. This is potentially something important to note - whenever you have missing data in a column, it will be treated as a floating-point data type.

import numpy as np
import pandas as pd

None: Pythonic Missing Data

The Python default for missing information is None. Notice that None has a special type in Python:

type(None)
NoneType

This means that None can only be stores in an array that is of ‘object’ type:

test_array = np.array([1,2,3,4])
test_array.dtype
dtype('int64')
#test_array[3] = None
test_array2 = np.array([1,2,None,4])
test_array2
array([1, 2, None, 4], dtype=object)

As a consequence - when there are Nones in an array and we perform an aggregation - we get an error:

#test_array2.sum()

NaN: missing numerical data

The other missing data representation is NaN, and it performs differently than None - it is a speical floating point value:

type(np.nan)
float
test_array3 = np.array([1,2,np.nan,4])
test_array3.dtype
dtype('float64')

Notice that np.nan ‘infects’ every operation it is combined with - any arithemetic with an np.nan yields a nan as a result:

test_array3.sum()
nan
10 + np.nan
nan
np.log(np.nan)
nan

Numpy does provide functionality to get around these nans by ignoring them:

np.nansum(test_array3)
7.0
np.nanmin(test_array3), np.nanmean(test_array3)
(1.0, 2.3333333333333335)

NaN and None in Pandas

Pandas is built to handle both - almost interchangably:

pd.Series([None, 42, np.nan])
0     NaN
1    42.0
2     NaN
dtype: float64

and because np.nan is a floating point, pandas type-casts:

pd.Series([1, 42])
0     1
1    42
dtype: int64
pd.Series([np.nan, 42])
0     NaN
1    42.0
dtype: float64

Operating on Null Values

Pandas provides methods for working with null values in its data structures:

  • isnull(): generates boolean mask indicating missing values

  • notnull(): opposite of isnull()

  • dropna(): returns filtered version of data

  • fillna(): returns a copy of the data with missing valued filled or imputed

Let’s see them in action:

data = pd.Series([1, np.nan, 42, None])
data.isnull()
0    False
1     True
2    False
3     True
dtype: bool
data.notnull()
0     True
1    False
2     True
3    False
dtype: bool
data[(data.notnull())]
0     1.0
2    42.0
dtype: float64

Dropping na

data.dropna()
0     1.0
2    42.0
dtype: float64
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
df.dropna()
0 1 2
1 2.0 3.0 5
df.dropna(axis='columns')
2
0 2
1 5
2 6
df[3] = np.nan
df
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
df.dropna(axis='columns', how='all')
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
df.dropna(axis='columns', how='any')
2
0 2
1 5
2 6
df.dropna(axis='rows', thresh=3) # 3 non na values
0 1 2 3
1 2.0 3.0 5 NaN
df.loc[1,1]=np.nan
df
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 NaN 5 NaN
2 NaN 4.0 6 NaN
df.dropna(axis='columns', thresh=2) # 3 non na values
0 2
0 1.0 2
1 2.0 5
2 NaN 6

Filling na

data
0     1.0
1     NaN
2    42.0
3     NaN
dtype: float64
data.fillna(99)
0     1.0
1    99.0
2    42.0
3    99.0
dtype: float64
data.fillna(method='ffill')
0     1.0
1     1.0
2    42.0
3    42.0
dtype: float64
data.fillna(method='bfill')
0     1.0
1    42.0
2    42.0
3     NaN
dtype: float64
df.fillna(method='ffill', axis=1)
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 2.0 5.0 5.0
2 NaN 4.0 6.0 6.0