Boolean Logic and Boolean Masks

Often we want to examine and manipulate values within an array. We have previously seen how to do this using indexing and slicing, but this relies on us knowing the index that we want to extract.

Boolean masks are much more flexible. They use Boolean logic to evaluate a condition on each element of an array, producing an array of True/False values that we can then use to select elements. This means we can extract, modify, count, or otherwise manipulate values in an array based on some criterion.

Let’s begin by importing NumPy:

import numpy as np

Comparison Operators as ufuncs

When we first considered ufuncs we focussed on arithmetic operators. NumPy also implements comparison operators as element-wise ufuncs. The result of a comparison operator is always an array with a Boolean data type.

Six standard operations are:

x = np.array([1, 2, 3, 4, 5])

x < 3

array([ True,  True, False, False, False])

x > 3

array([False, False, False,  True,  True])

x <= 3

array([ True,  True,  True, False, False])

np.greater_equal(x, 3)

array([False, False,  True,  True,  True])

x == 3

array([False, False,  True, False, False])

x != 3

array([ True,  True, False,  True,  True])

y = np.array([1,4,3,2,5])

x == y

array([ True, False,  True, False,  True])

Comparison operators are implemented as ufuncs in NumPy; for example, when you write x < 3, internally NumPy uses np.less(x, 3).

A summary of the comparison operators and their equivalent ufunc is shown here:

| Operator | Equivalent ufunc | Operator | Equivalent ufunc |
|----------|------------------|----------|------------------|
| ==       | np.equal         | !=       | np.not_equal     |
| <        | np.less          | <=       | np.less_equal    |
| >        | np.greater       | >=       | np.greater_equal |

Source: Jake VanderPlas (2016), Python Data Science Handbook Essential Tools for Working with Data, O’Reilly Media
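As a quick check of the operator/ufunc pairs listed above, the following sketch compares the two forms for each of the six comparisons. The dictionary name pairs and the use of np.array_equal to confirm the match are choices made for this illustration only.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# each entry holds the result of the operator form and of the equivalent ufunc call
pairs = {
    "==": (x == 3, np.equal(x, 3)),
    "!=": (x != 3, np.not_equal(x, 3)),
    "<":  (x < 3,  np.less(x, 3)),
    "<=": (x <= 3, np.less_equal(x, 3)),
    ">":  (x > 3,  np.greater(x, 3)),
    ">=": (x >= 3, np.greater_equal(x, 3)),
}

for op, (via_operator, via_ufunc) in pairs.items():
    # both forms should give the same Boolean array
    print(op, np.array_equal(via_operator, via_ufunc), via_operator.dtype)
```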

Just like how arithmetic ufuncs worked on multidimensional arrays, so do comparison operators:

# works in multi dimension arrays too
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x

array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

x < 6

array([[ True,  True,  True,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]])

Working with Boolean Arrays

Boolean arrays support a number of useful operations. We continue with the array x from above:

print(x)

[[5 0 3 3]
 [7 9 3 5]
 [2 4 7 6]]

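Counting entries

To count the number of True entries in a Boolean array, we can use np.count_nonzero(). Here it counts the values in x that are less than 6: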

np.count_nonzero(x<6)

The same count can be obtained with np.sum(), because False is interpreted as 0 and True is interpreted as 1:

np.sum(x<6)
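The two approaches can be compared side by side. The sketch below writes out the 3x4 array from the example above explicitly, so that it does not depend on the random number generator; the variable names mask, n_nonzero and n_sum are introduced here for illustration.

```python
import numpy as np

# the 3x4 array from the example above, written out explicitly
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

mask = x < 6

n_nonzero = np.count_nonzero(mask)  # counts the True entries
n_sum = np.sum(mask)                # True counts as 1, False as 0

print(n_nonzero, n_sum)  # 8 8
```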

The benefit of np.sum() and other NumPy aggregation functions is that the summation can be done along rows or columns:

# how many values less than 6 in each row?
np.sum(x < 6, axis=1)

array([4, 2, 2])

# how many values less than 6 in each col?
np.sum(x < 6, axis=0)

array([2, 2, 2, 2])

NumPy comes with built in functionality to check whether any or all values meet some condition:

# are there any values greater than 8?
np.any(x > 8)

True

# are there any values less than zero?
np.any(x < 0)

False

# are all values less than 10?
np.all(x < 10)

True

# are all values equal to 6?
np.all(x == 6)

False

np.any and np.all can also operate along a particular axis:

# are all values in each row less than 8?
np.all(x < 8, axis=1)

array([ True, False,  True])

# are there any values greater than 8 in each column?
np.any(x > 8, axis=0)

array([False,  True, False, False])

Python has built-in sum(), any(), and all() functions. These have a different syntax than the NumPy versions and, in particular, will fail or produce unintended results when used on multidimensional arrays.

Be sure that you are using np.sum(), np.any(), and np.all() for these examples!
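To make the warning concrete, here is a small sketch of what the built-in functions do with a two-dimensional Boolean array. The flag builtin_any_failed is a name introduced only for this illustration.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
mask = x < 6

# NumPy's sum aggregates over the whole array by default
print(np.sum(mask))   # 8

# Python's built-in sum iterates over the first axis and adds the rows together,
# so it returns one count per column rather than a single total
print(sum(mask))      # [2 2 2 2]

# Python's built-in any() takes the truth value of each row in turn,
# which is ambiguous for a row with several elements
builtin_any_failed = False
try:
    any(mask)
except ValueError as err:
    builtin_any_failed = True
    print("built-in any() failed:", err)
```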

Boolean Operators and Data

Let’s use some of the Boolean operators we saw above to analyze some data. The file ../data/LAUST010000000000003.csv contains monthly unemployment data for the state of Alabama from 2000 to 2016.

#!head ../data/LAUST010000000000003.csv

We can import it using the I/O functionality we have acquired along the way:

alabama = np.genfromtxt('../data/LAUST010000000000003.csv', 
                            delimiter=',', skip_header=1, 
                            usecols=(3))

Let’s first use an aggregate function to find the median unemployment rate:

median_ue = np.median(alabama)
print('median unemployment rate is:', median_ue, 'percent')

median unemployment rate is: 6.0 percent

We can use Boolean masks to find the number and the fraction of months in which unemployment is at or above the median:

print("Number of months in the data:                 ", np.size(alabama))
print("Number months above median unemployment:      ", np.sum(alabama >= median_ue))
print("Percentage months above median unemployment:  ", np.sum(alabama >= median_ue) /  np.size(alabama))

Number of months in the data:                  204
Number months above median unemployment:       107
Percentage months above median unemployment:   0.524509803922

If we define:

bad times as months when unemployment is 10 percent or higher
good times as months when unemployment is 4 percent or lower

we can find the fraction of months in bad and good times, respectively:

print("Percentage months in bad times:            ", np.sum(alabama >= 10) /  np.size(alabama))
print("Percentage months in good times:           ", np.sum(alabama <= 4) /  np.size(alabama))

Percentage months in bad times:             0.112745098039
Percentage months in good times:            0.0980392156863
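Because True counts as 1 and False as 0, the mean of a Boolean array gives the same fraction in a single step. The sketch below uses a short, made-up array called rates (hypothetical values, not the Alabama data) and multiplies by 100 to report an actual percentage.

```python
import numpy as np

# hypothetical monthly unemployment rates, for illustration only
rates = np.array([3.5, 4.0, 6.2, 10.0, 11.3, 5.0, 7.1, 3.9])

# the mean of a Boolean array is the fraction of True entries
share_bad = np.mean(rates >= 10)
share_good = np.mean(rates <= 4)

# multiply by 100 to report a percentage rather than a fraction
print("Percent of months in bad times: ", 100 * share_bad)
print("Percent of months in good times:", 100 * share_good)
```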

Bitwise logical operators

What if we want to combine multiple conditions? We can do this using Python’s bitwise logic operators, &, |, ^, and ~. Just as NumPy overloads the standard arithmetic operators +, -, *, /, it overloads these operators as ufuncs that work element-wise on (usually Boolean) arrays:

print("Percentage months in good or bad times:            ", 
      np.sum( (alabama >= 10) | (alabama <=4) ) /  np.size(alabama))
print("Percentage months simultaneously good and bad times:            ", 
      np.sum( (alabama >= 10) & (alabama <=4) ) /  np.size(alabama))
print("Percentage months in usual times:            ", 
      np.sum( ~((alabama >= 10) | (alabama <=4) )) /  np.size(alabama))

Percentage months in good or bad times:             0.210784313725
Percentage months simultaneously good and bad times:             0.0
Percentage months in usual times:             0.789215686275
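The parentheses around each comparison in the expressions above matter, because & and | bind more tightly than the comparison operators. The following sketch illustrates this on a small integer array; the names v, inside and unparenthesized_failed are introduced here for illustration.

```python
import numpy as np

v = np.arange(10)  # hypothetical example array

# with parentheses: two Boolean arrays combined element-wise
inside = (v > 4) & (v < 8)
print(inside.sum())  # 3

# the operator & is shorthand for the ufunc np.bitwise_and
print(np.array_equal(inside, np.bitwise_and(v > 4, v < 8)))  # True

# without parentheses, & is evaluated first, so the expression is read as the
# chained comparison v > (4 & v) < 8, which raises a ValueError
try:
    v > 4 & v < 8
    unparenthesized_failed = False
except ValueError:
    unparenthesized_failed = True
print(unparenthesized_failed)  # True
```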

Boolean Arrays as Masks

A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.

Returning to our x array from before, suppose we want an array of all values in the array that are less than, say, 5:

array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

x < 5

array([[False,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True, False, False]], dtype=bool)

x[x<5]

array([0, 3, 3, 3, 2, 4])
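A mask can also be used on the left-hand side of an assignment to modify the selected elements. The sketch below works on a copy (named clipped here purely for illustration) so that x itself is left unchanged.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

clipped = x.copy()          # work on a copy so x itself is unchanged
clipped[clipped < 5] = 0    # assign to every element where the mask is True

print(clipped)
# [[5 0 0 0]
#  [7 9 0 5]
#  [0 0 7 6]]
```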

Using the unemployment data, we can also compute the average unemployment rate in good, bad and usual times:

print("Avg unemployment in good times:            ", 
          np.average(alabama[alabama <= 4]) )
print("Avg unemployment in bad times:            ", 
          np.average(alabama[alabama >= 10]) )
print("Avg unemployment in usual times:            ", 
          np.average(alabama[~((alabama >= 10) | (alabama <=4) )]) )

Avg unemployment in good times:             3.78
Avg unemployment in bad times:             10.9739130435
Avg unemployment in usual times:             6.21739130435

Aside: Using the Keywords and/or Versus the Operators &/|

A common point of confusion is the difference between the keywords and and or on the one hand, and the operators & and | on the other.

The difference is this:

and and or gauge the truth or falsehood of an entire object
& and | operate on the bits within each object (for Boolean arrays, this means element by element)

A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B

array([ True,  True,  True, False,  True,  True], dtype=bool)

A or B

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-5d8e4f2e21c0> in <module>()
----> 1 A or B

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

# use a new name so that the 3x4 array x from earlier remains available below
z = np.arange(10)
(z > 4) & (z < 8)

array([False, False, False, False, False,  True,  True,  True, False, False], dtype=bool)

(z > 4) and (z < 8)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-3d24f1ffd63d> in <module>()
----> 1 (z > 4) and (z < 8)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
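The same distinction can be seen with plain Python integers, which may make the behaviour on arrays easier to remember. The numbers 42 and 59 below are arbitrary values chosen for illustration.

```python
# and / or treat each integer as a single truth value
print(bool(42 and 0))   # False
print(bool(42 or 0))    # True

# & / | operate on the individual bits of the integers
print(bin(42))          # 0b101010
print(bin(59))          # 0b111011
print(bin(42 & 59))     # 0b101010
print(bin(42 | 59))     # 0b111011
```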

Other Useful Boolean Mask operators

NumPy has many other useful built-in functions for extracting data. Two that we use often are np.isin() and np.where().

np.isin() tests each element of an array for membership in a second set of values, and returns a Boolean array of the same shape that is True wherever the element is found:

array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

oddnumber = np.array([1,3,5,7,9])

is_odd = np.isin(x, oddnumber)
is_odd

array([[ True, False,  True,  True],
       [ True,  True,  True,  True],
       [False, False,  True, False]], dtype=bool)

is_even = np.isin(x, oddnumber, invert=True)
is_even

array([[False,  True, False, False],
       [False, False, False, False],
       [ True,  True, False,  True]], dtype=bool)
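One way to think about np.isin() is as a shorthand for combining several equality tests with |. The sketch below checks this for a small set of values; the names in_set and by_hand, and the choice of the values 3 and 7, are made up for this illustration.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# membership test against a small set of values
in_set = np.isin(x, [3, 7])

# the same mask built by hand from equality comparisons
by_hand = (x == 3) | (x == 7)

print(np.array_equal(in_set, by_hand))  # True
print(in_set.shape == x.shape)          # True
```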

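Given a Boolean array as its only argument, np.where() returns the indices at which that array is True, as a tuple with one index array per dimension (here row indices, then column indices):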

np.where(is_even)

(array([0, 2, 2, 2]), array([1, 0, 1, 3]))

np.where(is_odd)

(array([0, 0, 0, 1, 1, 1, 1, 2]), array([0, 2, 3, 0, 1, 2, 3, 2]))

This allows us to extract all the odd values:

x[np.where(is_odd)]

array([5, 3, 3, 7, 9, 3, 5, 7])

This could be refined to the unique elements:

np.unique(x[np.where(is_odd)])

array([3, 5, 7, 9])
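Two related points are worth a short sketch: indexing with the Boolean mask directly selects the same values as going through np.where(), and np.where() also accepts three arguments, in which case it chooses element-wise between two alternatives. The mask is built here with the modulo operator rather than np.isin(), and the name labels is introduced for illustration only.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
is_odd = x % 2 == 1   # alternative way to build the mask, not the one used in the text

# indexing with the mask directly gives the same values as going through np.where
print(np.array_equal(x[is_odd], x[np.where(is_odd)]))  # True

# with three arguments, np.where picks element-wise from two alternatives
labels = np.where(is_odd, "odd", "even")
print(labels[0])  # ['odd' 'even' 'odd' 'odd']
```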

Challenge

Load the ZRH weather data for the maximum, minimum and mean temperatures, together with the dates (the code to import the dates is included again below).

Use Boolean operators to find:

The number of days where the max temperature is 30 or above.
The percentage of days where the average temperature is 17 or below.
The highest minimum temperature and the lowest maximum temperature on days where the average temperature is 17 or below.
All dates on which the max temperature was at its hottest (hard!). Hint: use np.isin(array, condition) to create a Boolean vector called index that is True on the hottest days, then use Boolean indexing on dates to extract the days where index is True.

Solution

weather = np.genfromtxt('../data/zrh_weather.txt', delimiter='&', 
                        skip_header=1, usecols=(3,4,5)) 

from datetime import datetime

str2date = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d')


dates = np.genfromtxt('../data/zrh_weather.txt', delimiter='&',
                      skip_header=1, usecols=(1),
                     dtype='object').astype(str)

print("N days max 30 or above:            ",
      np.sum( weather[:,0] >= 30  ))
print("Share of days avg 17 or below:     ",
      np.sum( weather[:,1] <= 17  ) / np.shape(weather)[0])

N days max 30 or above:             9
Share of days avg 17 or below:      0.274509803922

cold_days = weather[:,1] <=17

# highest min
print("Highest minimum on a cold day:", 
            np.max(weather[cold_days, 2]))

print("Lowest max on a cold day:     ",
            np.min(weather[cold_days, 0]))

Highest minimum on a cold day: 13.0
Lowest max on a cold day:      13.0

index = np.isin(weather[:,0], np.max(weather[:,0]))

dates[index]

array(['2017-07-06', '2017-07-08'],
      dtype='<U10')

# or, using where:
dates[np.where(index)]

array(['2017-07-06', '2017-07-08'],
      dtype='<U10')
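As an alternative to np.isin(), the hottest days can also be found with a plain equality comparison against the maximum. The sketch below uses short, made-up arrays named tmax and days (hypothetical values, not the ZRH data) so that it runs without the data file.

```python
import numpy as np

# hypothetical daily maximum temperatures and matching dates, for illustration only
tmax = np.array([28.5, 33.1, 30.0, 33.1, 25.4])
days = np.array(['2017-07-05', '2017-07-06', '2017-07-07',
                 '2017-07-08', '2017-07-09'])

# True on every day whose maximum equals the overall maximum
hottest = tmax == tmax.max()

print(days[hottest])  # ['2017-07-06' '2017-07-08']
```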

Python for Economics and Business Research