Missing Data¶
Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter] @lachlandeer
Real world data is rarely clean and homogeneous (although our working labor market example is!). One of the main features of real world data is missing values, so we have to know how to deal with them.
Pandas handles missing values through its reliance on the NumPy package, which has no built-in notion of NA values for non-floating-point data types. This is important to note: whenever a numeric column contains missing data, pandas will store that column as a floating-point data type.
import numpy as np
import pandas as pd
None: Pythonic Missing Data¶
The Python default for missing information is `None`. Notice that `None` has a special type in Python:
type(None)
NoneType
This means that `None` can only be stored in an array that is of 'object' type:
test_array = np.array([1,2,3,4])
test_array.dtype
dtype('int64')
#test_array[3] = None  # uncommenting this raises a TypeError: an int64 array cannot hold None
test_array2 = np.array([1,2,None,4])
test_array2
array([1, 2, None, 4], dtype=object)
As a consequence, when there are `None`s in an array and we perform an aggregation, we get an error:
#test_array2.sum()  # uncommenting raises TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
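To see the failure without crashing the notebook, we can wrap the aggregation in a try/except block (a small sketch, not part of the original notebook):

```python
import numpy as np

test_array2 = np.array([1, 2, None, 4])  # dtype=object

try:
    test_array2.sum()
except TypeError as e:
    # summing an object array hits the None and fails with a TypeError
    print("aggregation failed:", e)
```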
NaN: missing numerical data¶
The other missing data representation is NaN, and it behaves differently from `None`: it is a special floating-point value:
type(np.nan)
float
test_array3 = np.array([1,2,np.nan,4])
test_array3.dtype
dtype('float64')
Notice that `np.nan` 'infects' every operation it is combined with: any arithmetic with an `np.nan` yields a NaN as a result:
test_array3.sum()
nan
10 + np.nan
nan
np.log(np.nan)
nan
NumPy does provide special aggregation functions that get around these NaNs by ignoring them:
np.nansum(test_array3)
7.0
np.nanmin(test_array3), np.nanmean(test_array3)
(1.0, 2.3333333333333335)
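These nan-aware functions also accept an `axis` argument, so the same trick works row- or column-wise on 2-D arrays (a small sketch, not from the original notebook):

```python
import numpy as np

matrix = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, np.nan]])

# plain aggregations propagate the NaN in each row...
print(matrix.sum(axis=1))         # [nan nan]

# ...while the nan-aware versions skip it
print(np.nansum(matrix, axis=1))  # [4. 9.]
print(np.nanmean(matrix, axis=0)) # [2.5 5.  3. ]
```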
NaN and None in Pandas¶
Pandas is built to handle both, almost interchangeably:
pd.Series([None, 42, np.nan])
0 NaN
1 42.0
2 NaN
dtype: float64
and because `np.nan` is a floating-point value, pandas type-casts:
pd.Series([1, 42])
0 1
1 42
dtype: int64
pd.Series([np.nan, 42])
0 NaN
1 42.0
dtype: float64
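As an aside (not covered in the original notebook): if you want an integer column that can hold missing values without this upcast, pandas 0.24+ offers the nullable `Int64` extension dtype, which stores `pd.NA` instead of converting to float. A minimal sketch:

```python
import pandas as pd

# the capital-I "Int64" extension dtype keeps integers intact
# alongside missing values, rather than upcasting to float64
s = pd.Series([1, None, 42], dtype="Int64")
print(s.dtype)  # Int64
print(s)
```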
Operating on Null Values¶
Pandas provides methods for working with null values in its data structures:
  * `isnull()`: generates a boolean mask indicating missing values
  * `notnull()`: opposite of `isnull()`
  * `dropna()`: returns a filtered version of the data
  * `fillna()`: returns a copy of the data with missing values filled or imputed
Let’s see them in action:
data = pd.Series([1, np.nan, 42, None])
data.isnull()
0 False
1 True
2 False
3 True
dtype: bool
data.notnull()
0 True
1 False
2 True
3 False
dtype: bool
data[data.notnull()]
0 1.0
2 42.0
dtype: float64
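A very common use of `isnull()` in practice is counting missing values per column of a DataFrame, by summing the boolean mask (a small sketch, with a hypothetical DataFrame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [np.nan, np.nan, 6]})

# isnull() returns a boolean DataFrame; summing counts the Trues per column
print(df.isnull().sum())
# a    1
# b    2
# dtype: int64
```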
Dropping na¶
data.dropna()
0 1.0
2 42.0
dtype: float64
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
|   | 0   | 1   | 2 |
|---|-----|-----|---|
| 0 | 1.0 | NaN | 2 |
| 1 | 2.0 | 3.0 | 5 |
| 2 | NaN | 4.0 | 6 |
df.dropna()
|   | 0   | 1   | 2 |
|---|-----|-----|---|
| 1 | 2.0 | 3.0 | 5 |
df.dropna(axis='columns')
|   | 2 |
|---|---|
| 0 | 2 |
| 1 | 5 |
| 2 | 6 |
df[3] = np.nan
df
|   | 0   | 1   | 2 | 3   |
|---|-----|-----|---|-----|
| 0 | 1.0 | NaN | 2 | NaN |
| 1 | 2.0 | 3.0 | 5 | NaN |
| 2 | NaN | 4.0 | 6 | NaN |
df.dropna(axis='columns', how='all')
|   | 0   | 1   | 2 |
|---|-----|-----|---|
| 0 | 1.0 | NaN | 2 |
| 1 | 2.0 | 3.0 | 5 |
| 2 | NaN | 4.0 | 6 |
df.dropna(axis='columns', how='any')
|   | 2 |
|---|---|
| 0 | 2 |
| 1 | 5 |
| 2 | 6 |
df.dropna(axis='rows', thresh=3) # keep rows with at least 3 non-NA values
|   | 0   | 1   | 2 | 3   |
|---|-----|-----|---|-----|
| 1 | 2.0 | 3.0 | 5 | NaN |
df.loc[1, 1] = np.nan
df
|   | 0   | 1   | 2 | 3   |
|---|-----|-----|---|-----|
| 0 | 1.0 | NaN | 2 | NaN |
| 1 | 2.0 | NaN | 5 | NaN |
| 2 | NaN | 4.0 | 6 | NaN |
df.dropna(axis='columns', thresh=2) # keep columns with at least 2 non-NA values
|   | 0   | 2 |
|---|-----|---|
| 0 | 1.0 | 2 |
| 1 | 2.0 | 5 |
| 2 | NaN | 6 |
Filling na¶
data
0 1.0
1 NaN
2 42.0
3 NaN
dtype: float64
data.fillna(99)
0 1.0
1 99.0
2 42.0
3 99.0
dtype: float64
data.fillna(method='ffill')
0 1.0
1 1.0
2 42.0
3 42.0
dtype: float64
data.fillna(method='bfill')
0 1.0
1 42.0
2 42.0
3 NaN
dtype: float64
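Note the trailing NaN in the back-fill above: there is no later value to propagate backwards. `fillna` also accepts a `limit` argument that caps how many missing entries get filled (a sketch, not from the original notebook):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 42, None])

# with a scalar fill value, limit caps the total number of
# NaN entries filled along the axis; the rest stay missing
print(data.fillna(99, limit=1))
```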
df.fillna(method='ffill', axis=1)
|   | 0   | 1   | 2   | 3   |
|---|-----|-----|-----|-----|
| 0 | 1.0 | 1.0 | 2.0 | 2.0 |
| 1 | 2.0 | 2.0 | 5.0 | 5.0 |
| 2 | NaN | 4.0 | 6.0 | 6.0 |
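Beyond constants and forward/backward fills, a common imputation is to fill each column with its own mean. `fillna` accepts a Series keyed by column label, so passing `df.mean()` does exactly this (a sketch, using a fresh DataFrame defined here so every column has at least one non-missing value):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 2.0],
                   [2.0, 3.0, 5.0],
                   [np.nan, 4.0, 6.0]])

# df.mean() is a Series of column means; fillna matches it by column
# label and imputes each column's NaNs with that column's mean
filled = df.fillna(df.mean())
print(filled)
```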