Aggregate Functions

When faced with a large amount of data, a useful first step is to compute summary statistics.

NumPy has built-in aggregation functions for working on arrays, which we will discuss here.

import numpy as np
np.random.seed(1234567890)

Summing Values

Python has its own built-in sum function:

array = np.random.random(100)
sum(array)

52.87629161311716

and NumPy has its own corresponding one:

np.sum(array)

52.87629161311717
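The two results above differ only in the final digit. Floating-point addition is not associative, and the two functions need not add the elements in the same order, so tiny rounding differences like this are expected. As a minimal sketch (the seed and the name `values` are assumptions chosen for illustration), such results are better compared with a tolerance than with `==`:

```python
import math

import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the one used above
values = rng.random(100)

python_total = sum(values)     # built-in sum, adds element by element
numpy_total = np.sum(values)   # NumPy sum, may add in a different order

# Compare with a relative tolerance instead of exact equality
print(math.isclose(python_total, numpy_total, rel_tol=1e-12))
```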

As discussed earlier, NumPy's functions are compiled, so they should be much quicker on large arrays:

big_array = np.random.random(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

79.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

378 µs ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
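`%timeit` is an IPython magic, so it is only available in IPython or a Jupyter notebook. In a plain Python script, a rough equivalent can be sketched with the standard library `timeit` module. The array size, the repeat count and the names `big_values` and `repeats` below are assumptions picked to keep the run short, and the timings will vary from machine to machine:

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
big_values = rng.random(200_000)
repeats = 5

# Total time, in seconds, for `repeats` calls of each function
python_time = timeit.timeit(lambda: sum(big_values), number=repeats)
numpy_time = timeit.timeit(lambda: np.sum(big_values), number=repeats)

print(f"built-in sum: {python_time / repeats * 1e3:.3f} ms per call")
print(f"np.sum:       {numpy_time / repeats * 1e3:.3f} ms per call")
```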

We recommend using the NumPy sum function; it also works on multidimensional arrays.
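To illustrate the difference on a multidimensional array, here is a small sketch with a hypothetical 2 x 3 array named `grid`. The built-in sum iterates over the first axis only, so it adds the rows together and returns an array, whereas np.sum adds up every element by default:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Built-in sum: 0 + grid[0] + grid[1], i.e. the rows added together
column_totals = sum(grid)
print(column_totals)   # [5 7 9]

# NumPy sum: every element of the array
grand_total = np.sum(grid)
print(grand_total)     # 21
```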

Maximum and Minimum

Python also has built-in max and min functions:

min(big_array)

3.5160150346769115e-09

max(big_array)

0.9999981855267694

And again there are NumPy equivalents. The plain Python ones suffer from the same speed problem as the built-in sum function, so we recommend these:

np.min(big_array)

3.5160150346769115e-09

np.max(big_array)

0.9999981855267694

There is also a shorter syntax that may be useful, calling the aggregate as a method of the array (this holds for other aggregate functions as well):

print(big_array.min(), big_array.max(), big_array.sum())

3.5160150346769115e-09 0.9999981855267694 500261.0610574294
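As a brief sketch of the method syntax with a few other aggregates, using a small hand-picked array (the name `samples` and its values are assumptions chosen so the results are easy to check by hand):

```python
import numpy as np

samples = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(samples.mean())    # 5.0  (arithmetic mean)
print(samples.std())     # 2.0  (population standard deviation)
print(samples.prod())    # 201600.0  (product of all elements)
print(samples.argmin())  # 0  (index of the smallest value)
print(samples.argmax())  # 7  (index of the largest value)
```

Each of these also exists in function form, for example np.mean(samples) or np.std(samples).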

Aggregates on Multidimensional Arrays

Aggregate functions work on multidimensional arrays too. By default they aggregate over every element of the array:

matrix = np.random.normal(0, 1, (5, 4))
print(matrix)

[[ 0.47998397 -0.60570686  0.60617379 -1.09073067]
 [ 1.8597819   1.04717174  1.51560204  0.33884624]
 [ 0.88728258  0.08329178  0.46331039 -0.22682462]
 [ 0.08583326 -0.81966389 -2.60297839  0.75971677]
 [-0.62601854 -0.71412651 -1.75315878 -0.15160382]]

matrix.sum()

-0.46381762991309927

Often this is not what we want. To compute an aggregate along a single axis, we have to specify that axis. The axis keyword names the dimension of the array that will be collapsed, not the dimension that will be returned. Thus:

axis = 0 computes column-wise, collapsing the rows

matrix.min(axis=0)

array([-0.62601854, -0.81966389, -2.60297839, -1.09073067])

axis = 1 computes row-wise, collapsing the columns

matrix.max(axis=1)

array([ 0.60617379,  1.8597819 ,  0.88728258,  0.75971677, -0.15160382])