Aggregate Functions
When faced with a large amount of data, a useful first step is to compute summary statistics.
NumPy has built-in aggregation functions for working on arrays, which we discuss here.
import numpy as np
np.random.seed(1234567890)
Summing Values
Python has its own built-in sum function:
array = np.random.random(100)
sum(array)
52.87629161311716
NumPy has its own corresponding function:
np.sum(array)
52.87629161311717
As discussed earlier, NumPy's functions are compiled, so they should be much quicker on large arrays:
big_array = np.random.random(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
79.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
378 µs ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
We recommend using the NumPy sum function, which is faster and also works on multidimensional arrays.
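To illustrate the point about multidimensional arrays, here is a minimal sketch using a small made-up array (the name grid and its values are assumptions chosen for illustration). The built-in sum only iterates over the first axis, whereas np.sum aggregates every element by default.

```python
import numpy as np

# A small 2x3 array with known values (chosen only for illustration).
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# The built-in sum iterates over the first axis, adding the rows together,
# so it returns an array of column totals rather than a single number.
builtin_total = sum(grid)

# np.sum aggregates over every element by default and returns a scalar.
numpy_total = np.sum(grid)

print(builtin_total)  # [5 7 9]
print(numpy_total)    # 21
```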
Maximum and Minimum
Python also has built-in max and min functions:
min(big_array)
3.5160150346769115e-09
max(big_array)
0.9999981855267694
Again, there are NumPy equivalents. The plain Python functions suffer from the same performance problem as the built-in sum function, so we recommend these:
np.min(big_array)
3.5160150346769115e-09
np.max(big_array)
0.9999981855267694
There is also a shorter syntax that may be useful: calling the aggregate as a method of the array itself (this holds for other aggregate functions as well):
print(big_array.min(), big_array.max(), big_array.sum())
3.5160150346769115e-09 0.9999981855267694 500261.0610574294
Aggregates on Multidimensional Arrays
Aggregate functions work on multidimensional arrays too. By default, they aggregate over the entire array:
matrix = np.random.normal(0, 1, (5, 4))
print(matrix)
[[ 0.47998397 -0.60570686 0.60617379 -1.09073067]
[ 1.8597819 1.04717174 1.51560204 0.33884624]
[ 0.88728258 0.08329178 0.46331039 -0.22682462]
[ 0.08583326 -0.81966389 -2.60297839 0.75971677]
[-0.62601854 -0.71412651 -1.75315878 -0.15160382]]
matrix.sum()
-0.46381762991309927
Often this is not what we want. To aggregate along a particular axis, we have to specify it with the axis keyword. The axis keyword names the dimension of the array that will be collapsed, rather than the dimension that will be returned. Thus axis=0 computes column-wise, collapsing the rows:
matrix.min(axis=0)
array([-0.62601854, -0.81966389, -2.60297839, -1.09073067])
Similarly, axis=1 computes row-wise, collapsing the columns:
matrix.max(axis=1)
array([ 0.60617379, 1.8597819 , 0.88728258, 0.75971677, -0.15160382])
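The effect of the axis keyword is easier to check on an array whose sums can be verified by hand. The following is a small sketch, assuming a made-up 2x3 array named small; the shape of each result shows which dimension was collapsed.

```python
import numpy as np

# A 2x3 array: 2 rows, 3 columns (values chosen only for illustration).
small = np.array([[1, 2, 3],
                  [4, 5, 6]])

# axis=0 collapses the rows, leaving one value per column.
column_sums = small.sum(axis=0)

# axis=1 collapses the columns, leaving one value per row.
row_sums = small.sum(axis=1)

print(column_sums, column_sums.shape)  # [5 7 9] (3,)
print(row_sums, row_sums.shape)        # [ 6 15] (2,)
```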
Aggregate Functions on Missing Data
NumPy's standard aggregate functions do not skip missing data, which NumPy represents as NaN: if any element is NaN, the result is typically NaN as well. For this reason there are NaN-safe routines that ignore missing values.
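As a minimal sketch of the difference, consider a made-up array with one missing value (the name readings and its values are assumptions for illustration). In this example the standard aggregates return nan as their result, while the NaN-safe versions skip the missing entry.

```python
import numpy as np

# One missing value, marked with np.nan (values chosen only for illustration).
readings = np.array([1.0, 2.0, np.nan, 4.0])

# The standard aggregates carry the NaN through to the result.
plain_sum = np.sum(readings)
plain_mean = np.mean(readings)
print(plain_sum, plain_mean)  # nan nan

# The NaN-safe versions ignore the missing value.
safe_sum = np.nansum(readings)    # 1.0 + 2.0 + 4.0
safe_mean = np.nanmean(readings)  # mean of the three valid values
print(safe_sum)   # 7.0
print(safe_mean)  # approximately 2.33
```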
The following table provides a list of useful aggregate functions and their NaN-safe equivalents:
Function Name | NaN-safe Version | Description |
---|---|---|
np.sum | np.nansum | Compute sum of elements |
np.prod | np.nanprod | Compute product of elements |
np.mean | np.nanmean | Compute mean of elements |
np.std | np.nanstd | Compute standard deviation |
np.var | np.nanvar | Compute variance |
np.min | np.nanmin | Find minimum value |
np.max | np.nanmax | Find maximum value |
np.argmin | np.nanargmin | Find index of minimum value |
np.argmax | np.nanargmax | Find index of maximum value |
np.median | np.nanmedian | Compute median of elements |
np.percentile | np.nanpercentile | Compute rank-based statistics of elements |
np.any | N/A | Evaluate whether any elements are true |
np.all | N/A | Evaluate whether all elements are true |
Source: Jake VanderPlas (2016), Python Data Science Handbook: Essential Tools for Working with Data, O’Reilly Media.
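A few of the less familiar aggregates from the table can be tried on a short array. This is only a sketch; the name scores and its values are assumptions chosen so that the results are easy to verify by hand.

```python
import numpy as np

# Five made-up values; sorted they read 1.0, 1.5, 3.0, 4.0, 5.0.
scores = np.array([3.0, 1.0, 4.0, 1.5, 5.0])

print(np.argmin(scores))          # 1 (position of the smallest value, 1.0)
print(np.argmax(scores))          # 4 (position of the largest value, 5.0)
print(np.median(scores))          # 3.0
print(np.percentile(scores, 50))  # 3.0 (the 50th percentile is the median)
print(np.any(scores > 4))         # True (5.0 is greater than 4)
print(np.all(scores > 1))         # False (1.0 is not greater than 1)
```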