Boolean Logic and Boolean Masks¶
Often we want to examine and manipulate values within an array. We have previously seen how to do this using indexing and slicing, but this relies on us knowing the indices of the values that we want to extract.
Boolean masks are much more flexible. They use Boolean logic to compute True/False for each element of an array, and we can then work with the elements for which the result is True (or False). This means we can extract, modify, count, or otherwise manipulate values in an array based on some criterion.
Let’s begin by importing NumPy:
import numpy as np
Comparison Operators as ufuncs¶
When we first considered ufuncs we focussed on arithmetic operators. NumPy also implements comparison operators as element-wise ufuncs. The result of a comparison operator is always an array of Boolean type.
Six standard operations are:
x = np.array([1, 2, 3, 4, 5])
x < 3
array([ True, True, False, False, False])
x > 3
array([False, False, False, True, True])
x <= 3
array([ True, True, True, False, False])
np.greater_equal(x, 3)
array([False, False, True, True, True])
x == 3
array([False, False, True, False, False])
x != 3
array([ True, True, False, True, True])
y = np.array([1,4,3,2,5])
x == y
array([ True, False, True, False, True])
Comparison operators are implemented as ufuncs in NumPy; for example, when you write x < 3, internally NumPy uses np.less(x, 3).
A summary of the comparison operators and their equivalent ufunc is shown here:
| Operator | Equivalent ufunc | Operator | Equivalent ufunc |
|----------|------------------|----------|------------------|
| == | np.equal | != | np.not_equal |
| < | np.less | <= | np.less_equal |
| > | np.greater | >= | np.greater_equal |
Source: Jake VanderPlas (2016), Python Data Science Handbook Essential Tools for Working with Data, O’Reilly Media
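As a quick check of the table above, the following sketch compares each operator with its ufunc counterpart on the small array x used earlier. The names pairs, results and all_match are only illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Pair each operator expression with the ufunc listed next to it in the table.
pairs = {
    "==": (x == 3, np.equal(x, 3)),
    "!=": (x != 3, np.not_equal(x, 3)),
    "<": (x < 3, np.less(x, 3)),
    "<=": (x <= 3, np.less_equal(x, 3)),
    ">": (x > 3, np.greater(x, 3)),
    ">=": (x >= 3, np.greater_equal(x, 3)),
}

# For every operator, check that both spellings give identical Boolean arrays.
results = {op: bool(np.array_equal(a, b)) for op, (a, b) in pairs.items()}
all_match = all(results.values())
print(all_match)
```

Either spelling can be used; the operator form is usually easier to read.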
Just like how arithmetic ufuncs worked on multidimensional arrays, so do comparison operators:
# works in multi dimension arrays too
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
x < 6
array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]])
Working with Boolean Arrays¶
We can use Boolean arrays to do a number of useful operations:
print(x)
[[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
Counting entries¶
To count the entries that satisfy a condition, we can count the non-zero (that is, True) elements of the Boolean array:
np.count_nonzero(x<6)
8
The same count can be obtained with np.sum(), because False is interpreted as 0 and True is interpreted as 1:
np.sum(x<6)
8
The benefit of np.sum() and other NumPy aggregation functions is that the aggregation can be done along rows or columns:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)
array([4, 2, 2])
# how many values less than 6 in each col?
np.sum(x < 6, axis=0)
array([2, 2, 2, 2])
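A related sketch, using the same x values as above: np.count_nonzero() also accepts an axis argument, and taking np.mean() of a Boolean array gives the share of True entries directly. The variable names are illustrative only.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
mask = x < 6

# Count of True entries in each row (same result as np.sum(mask, axis=1)).
per_row = np.count_nonzero(mask, axis=1)

# The mean of a Boolean array is the fraction of entries that are True.
share_total = np.mean(mask)
share_per_col = np.mean(mask, axis=0)

print(per_row)
print(share_total)
print(share_per_col)
```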
NumPy comes with built in functionality to check whether any or all values meet some condition:
# are there any values greater than 8?
np.any(x > 8)
True
# are there any values less than zero?
np.any(x < 0)
False
# are all values less than 10?
np.all(x < 10)
True
# are all values equal to 6?
np.all(x == 6)
False
np.any and np.all can also operate along a particular axis:
# are all values in each row less than 8?
np.all(x < 8, axis=1)
array([ True, False, True])
# are there any values greater than 8 in each column?
np.any(x > 8, axis=0)
array([False, True, False, False])
Python has built-in sum(), any(), and all() functions. These have a different syntax than the NumPy versions, and can fail or produce unintended results when used on multidimensional arrays.
Be sure that you are using np.sum(), np.any(), and np.all() for these examples!
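To make the warning concrete, here is a small sketch on the same x values. The built-in sum() iterates over the rows of a two-dimensional array and adds them together, while the built-in any() tries to take the truth value of a whole row and raises an error.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# Built-in sum() adds the rows together, giving one count per column.
builtin_total = sum(x < 6)

# np.sum() without an axis adds up every element.
numpy_total = np.sum(x < 6)

# Built-in any() calls bool() on each row, which is ambiguous for arrays.
try:
    any(x > 8)
    builtin_any_failed = False
except ValueError:
    builtin_any_failed = True

print(builtin_total, numpy_total, builtin_any_failed)
```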
Boolean Operators and Data¶
Let’s use some of the Boolean operators we saw above to analyze some data. The file ../data/LAUST010000000000003.csv contains monthly unemployment data for the state of Alabama from 2000 to 2016.
#!head ../data/LAUST010000000000003.csv
We can import it using the I/O functionality we have acquired along the way:
alabama = np.genfromtxt('../data/LAUST010000000000003.csv',
delimiter=',', skip_header=1,
usecols=(3))
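If the data file is not at hand, the same call can be tried on an in-memory stand-in. This is only a sketch: the column names and values below are made up for illustration and are not the contents of the real file; the only assumption carried over is that the rate sits in the fourth column (index 3).

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file; header and values are invented.
csv_text = """series_id,year,period,value
EXAMPLE,2000,M01,4.7
EXAMPLE,2000,M02,4.6
EXAMPLE,2000,M03,4.5
"""

# Same arguments as above: comma-delimited, skip the header row,
# and keep only the fourth column.
rates = np.genfromtxt(io.StringIO(csv_text),
                      delimiter=',', skip_header=1,
                      usecols=(3))
print(rates)
```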
Let’s first use an aggregate function to find the median unemployment rate:
median_ue = np.median(alabama)
print('median unemployment rate is:', median_ue, 'percent')
median unemployment rate is: 6.0 percent
We can use Boolean masks to find the number and share of months in which unemployment is at or above the median (the share is reported as a fraction between 0 and 1):
print("Number of months in the data: ", np.size(alabama))
print("Number months above median unemployment: ", np.sum(alabama >= median_ue))
print("Percentage months above median unemployment: ", np.sum(alabama >= median_ue) / np.size(alabama))
Number of months in the data: 204
Number months above median unemployment: 107
Percentage months above median unemployment: 0.524509803922
And if we define:
bad times as months when unemployment is 10 percent or higher
good times as months when unemployment is 4 percent or lower
we can find the share of months in bad and good times, respectively:
print("Percentage months in bad times: ", np.sum(alabama >= 10) / np.size(alabama))
print("Percentage months in good times: ", np.sum(alabama <= 4) / np.size(alabama))
Percentage months in bad times: 0.112745098039
Percentage months in good times: 0.0980392156863
Bitwise logical operators¶
What if we want to combine multiple conditions? We can do this using Python’s bitwise logic operators, &, |, ^, and ~. Just as NumPy overloads the standard arithmetic operators +, -, *, and /, it overloads these operators with ufuncs that work element-wise on (usually Boolean) arrays. Because &, | and ^ bind more tightly than the comparison operators, each comparison must be wrapped in parentheses:
print("Percentage months in good or bad times: ",
np.sum( (alabama >= 10) | (alabama <=4) ) / np.size(alabama))
print("Percentage months simultaneously good and bad times: ",
np.sum( (alabama >= 10) & (alabama <=4) ) / np.size(alabama))
print("Percentage months in usual times: ",
np.sum( ~((alabama >= 10) | (alabama <=4) )) / np.size(alabama))
Percentage months in good or bad times: 0.210784313725
Percentage months simultaneously good and bad times: 0.0
Percentage months in usual times: 0.789215686275
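The same combinations can be sketched on a small made-up array (the values in rates are hypothetical, not taken from the Alabama data). It shows that negating an "or" is equivalent to combining the opposite conditions with "and", and that | matches np.logical_or for Boolean arrays.

```python
import numpy as np

# Hypothetical unemployment rates, for illustration only.
rates = np.array([3.5, 4.0, 6.2, 10.0, 11.3, 7.8])

good = rates <= 4
bad = rates >= 10

# "Usual" months: neither good nor bad.
usual = ~(good | bad)

# Equivalent formulation with the opposite conditions joined by &.
usual_alt = (rates > 4) & (rates < 10)

# ^ is exclusive or: True where exactly one of the two masks is True.
either_not_both = good ^ bad

# For Boolean arrays, | gives the same result as np.logical_or.
same_as_ufunc = np.array_equal(good | bad, np.logical_or(good, bad))

print(usual)
print(np.array_equal(usual, usual_alt), same_as_ufunc)
```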
Boolean Arrays as Masks¶
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to our x array from before, suppose we want an array of all values in the array that are less than, say, 5:
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
x < 5
array([[False, True, True, True],
[False, False, True, False],
[ True, True, False, False]], dtype=bool)
x[x<5]
array([0, 3, 3, 3, 2, 4])
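Masks can also be used on the left-hand side of an assignment to modify the selected values. The sketch below works on a copy (the name y is illustrative) so that x itself stays unchanged.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# Work on a copy so the original array is left untouched.
y = x.copy()

# Set every value below 5 to zero; the shape of y is preserved.
y[y < 5] = 0

print(y)
```

Note that selecting with x[x < 5] returns a flat one-dimensional array, whereas assigning through a mask keeps the original two-dimensional shape.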
Or, using the unemployment data, we can ask what the average unemployment rate is in good, bad and usual times:
print("Avg unemployment in good times: ",
np.average(alabama[alabama <= 4]) )
print("Avg unemployment in bad times: ",
np.average(alabama[alabama >= 10]) )
print("Avg unemployment in usual times: ",
np.average(alabama[~((alabama >= 10) | (alabama <=4) )]) )
Avg unemployment in good times: 3.78
Avg unemployment in bad times: 10.9739130435
Avg unemployment in usual times: 6.21739130435
Aside: Using the Keywords and/or Versus the Operators &/|¶
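A common point of confusion is the difference between the keywords and/or and the operators &/|.
The difference:
The keywords and/or gauge the truth or falsehood of an entire object.
The operators &/| refer to the bits within each object.
Applied to Boolean arrays, &/| therefore act on each element separately, which is almost always what we want.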
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B
array([ True, True, True, False, True, True], dtype=bool)
A or B
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-5d8e4f2e21c0> in <module>()
----> 1 A or B
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
x = np.arange(10)
(x > 4) & (x < 8)
array([False, False, False, False, False, True, True, True, False, False], dtype=bool)
(x > 4) and (x < 8)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-35-3d24f1ffd63d> in <module>()
----> 1 (x > 4) and (x < 8)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
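The distinction can be sketched with plain integers first, and then with the arrays A and B from above. The error from A or B is caught here so that the example runs to the end; the variable names are illustrative.

```python
import numpy as np

# Keywords look at the truth of the whole object: 42 is truthy, 0 is falsy.
keyword_result = bool(42 and 0)

# Bitwise operators combine the individual bits: 42 is 0b101010, 59 is 0b111011.
bitwise_and = 42 & 59
bitwise_or = 42 | 59

A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)

# The keyword form needs a single truth value for A, which is ambiguous.
try:
    A or B
    raised = False
except ValueError:
    raised = True

# The operator form works element by element.
elementwise = A | B

# If a single answer for the whole array is wanted, reduce explicitly first.
whole_array = bool(A.any() or B.any())

print(keyword_result, bitwise_and, bitwise_or, raised, whole_array)
```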
Other Useful Boolean Mask operators¶
NumPy has many other useful built-in functions to extract data. Two that we use often are np.isin() and np.where().
np.isin(element, test_elements) creates a Boolean array with the same shape as element that is True wherever the element also appears in test_elements:
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
oddnumber = np.array([1,3,5,7,9])
is_odd = np.isin(x, oddnumber)
is_odd
array([[ True, False, True, True],
[ True, True, True, True],
[False, False, True, False]], dtype=bool)
is_even = np.isin(x, oddnumber, invert=True)
is_even
array([[False, True, False, False],
[False, False, False, False],
[ True, True, False, True]], dtype=bool)
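Called with a single argument, np.where() returns the indices at which a Boolean array, such as the output of np.isin(), is True. The result is a tuple with one index array per dimension (here rows, then columns):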
np.where(is_even)
(array([0, 2, 2, 2]), array([1, 0, 1, 3]))
np.where(is_odd)
(array([0, 0, 0, 1, 1, 1, 1, 2]), array([0, 2, 3, 0, 1, 2, 3, 2]))
This allows us to extract all the odd values:
x[np.where(is_odd)]
array([5, 3, 3, 7, 9, 3, 5, 7])
This could be refined to the unique elements:
np.unique(x[np.where(is_odd)])
array([3, 5, 7, 9])
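Two further points may be worth sketching here. First, indexing with the Boolean mask directly selects the same values as going through np.where(). Second, np.where() also has a three-argument form, np.where(condition, a, b), which picks from a where the condition is True and from b elsewhere. The mask below is built with the modulo operator rather than np.isin(), purely for brevity.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
is_odd = x % 2 == 1

# Indexing with the mask itself selects the same values as np.where().
same_values = np.array_equal(x[is_odd], x[np.where(is_odd)])

# Three-argument form: keep odd values, replace even values with -1.
replaced = np.where(is_odd, x, -1)

print(same_values)
print(replaced)
```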
Challenge:¶
Load in the ZRH weather data for the maximum, minimum and mean temperatures and dates. [We’ve again put the code to import the dates below.]
Use Boolean operators to find:
The number of days where the max temperature is 30 or above.
The percentage of days where the average temperature is 17 or below.
The highest minimum temperature, and the lowest maximum temperature, on days where the average temperature is 17 or below.
All dates where the max temperature was at its hottest (HARD! Hint: use np.isin(array, condition) to create a Boolean vector called index that is True on the hottest days). Use array indexing on dates to extract the days where index is True.
Solution¶
weather = np.genfromtxt('../data/zrh_weather.txt', delimiter='&',
skip_header=1, usecols=(3,4,5))
from datetime import datetime
str2date = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d')
dates = np.genfromtxt('../data/zrh_weather.txt', delimiter='&',
skip_header=1, usecols=(1),
dtype='object').astype(str)
print("N days max 30 or above: ",
np.sum( weather[:,0] >= 30 ))
print("Share of days avg 17 or below: ",
np.sum( weather[:,1] <= 17 ) / np.shape(weather)[0])
N days max 30 or above: 9
Share of days avg 17 or below: 0.274509803922
cold_days = weather[:,1] <=17
# highest min
print("Highest minimum on a cold day:",
np.max(weather[cold_days, 2]))
print("Lowest max on a cold day: ",
np.min(weather[cold_days, 0]))
Highest minimum on a cold day: 13.0
Lowest max on a cold day: 13.0
index = np.isin(weather[:,0], np.max(weather[:,0]))
dates[index]
array(['2017-07-06', '2017-07-08'],
dtype='<U10')
# or, using where:
dates[np.where(index)]
array(['2017-07-06', '2017-07-08'],
dtype='<U10')
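As an alternative sketch for the last task, a plain equality comparison against the maximum gives the same kind of mask as np.isin(). The temperatures below are made up for illustration (they are not the ZRH data), and the names max_temp, hottest and first_only are hypothetical.

```python
import numpy as np

# Hypothetical dates and maximum temperatures, for illustration only.
dates = np.array(['2017-07-05', '2017-07-06', '2017-07-07', '2017-07-08'])
max_temp = np.array([29.0, 33.0, 31.5, 33.0])

# True on every day that ties for the highest maximum temperature.
hottest = max_temp == max_temp.max()
hottest_dates = dates[hottest]

# np.argmax() returns only the first position of the maximum,
# so it would miss any later days that tie with it.
first_only = dates[np.argmax(max_temp)]

print(hottest_dates)
print(first_only)
```

The mask-based approach is preferable whenever ties are possible, as they are here.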