Aggregate Functions

When faced with a large amount of data, a useful first step is to compute summary statistics.

NumPy has built-in aggregation functions for working on arrays, which we will discuss here.

import numpy as np
np.random.seed(1234567890)

Summing Values

Python has its own built-in sum function:

array = np.random.random(100)
sum(array)
52.87629161311716

and NumPy has its own corresponding one:

np.sum(array)
52.87629161311717

As discussed earlier, NumPy's functions are compiled, so they should be much quicker on large arrays:

big_array = np.random.random(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
79.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
378 µs ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
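The %timeit magic above only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library timeit module can be used instead. The array size, the repeat counts, and the names values, python_time, and numpy_time are illustrative assumptions, and the actual timings will differ from machine to machine.

```python
import timeit

import numpy as np

np.random.seed(0)
# Hypothetical array, smaller than big_array so the comparison runs quickly.
values = np.random.random(100_000)

# Best of 3 repeats, each repeat making 5 calls.
python_time = min(timeit.repeat(lambda: sum(values), number=5, repeat=3))
numpy_time = min(timeit.repeat(lambda: np.sum(values), number=5, repeat=3))

print(f"built-in sum: {python_time:.4f} s for 5 calls")
print(f"np.sum:       {numpy_time:.4f} s for 5 calls")

# Both approaches agree up to floating-point rounding.
results_match = bool(np.isclose(sum(values), np.sum(values)))
print(results_match)
```

Taking the minimum over several repeats is a common way to reduce noise from other processes running on the machine.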

We recommend you use the NumPy sum function, which is also aware of multidimensional arrays.
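A small sketch of why this matters for multidimensional input: the built-in sum simply iterates over the first axis, while np.sum adds every element by default. The array grid is a made-up example.

```python
import numpy as np

# Hypothetical 2 x 3 array used only for illustration.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# The built-in sum iterates over the first axis and adds the rows element-wise.
builtin_result = sum(grid)
print(builtin_result)   # [5 7 9]

# np.sum adds every element of the array by default.
numpy_result = np.sum(grid)
print(numpy_result)     # 21
```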

Maximum and Minimum

Python also has built-in max and min functions:

min(big_array)
3.5160150346769115e-09
max(big_array)
0.9999981855267694

Again, there are NumPy equivalents. The plain Python versions suffer from the same speed problem as the built-in sum function, so we recommend these:

np.min(big_array)
3.5160150346769115e-09
np.max(big_array)
0.9999981855267694

There is a shorter syntax that may be useful too, calling the aggregate as a method of the array (this holds for other aggregate functions as well):

print(big_array.min(), big_array.max(), big_array.sum())
3.5160150346769115e-09 0.9999981855267694 500261.0610574294
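As a quick check that the method form and the function form give the same answer, here is a minimal sketch on a small made-up array (sample is an illustrative name, not part of the original example):

```python
import numpy as np

# Hypothetical data, chosen so the results are easy to verify by hand.
sample = np.array([2.0, 4.0, 6.0, 8.0])

# Method form and function form compute the same aggregate.
print(sample.sum(), np.sum(sample))    # 20.0 20.0
print(sample.mean(), np.mean(sample))  # 5.0 5.0
print(sample.max(), np.max(sample))    # 8.0 8.0
```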

Aggregates on Multidimensional Arrays

Aggregate functions work on multidimensional arrays too:

matrix = np.random.normal(0, 1, (5, 4))
print(matrix)
[[ 0.47998397 -0.60570686  0.60617379 -1.09073067]
 [ 1.8597819   1.04717174  1.51560204  0.33884624]
 [ 0.88728258  0.08329178  0.46331039 -0.22682462]
 [ 0.08583326 -0.81966389 -2.60297839  0.75971677]
 [-0.62601854 -0.71412651 -1.75315878 -0.15160382]]
matrix.sum()
-0.46381762991309927

By default the aggregate is taken over the whole array, which is often not what we want. To compute an aggregate along one axis, we have to specify that axis. The axis keyword specifies the dimension of the array that will be collapsed, rather than the dimension that will be returned. Thus

  • axis = 0 computes column-wise, collapsing the rows

matrix.min(axis=0)
array([-0.62601854, -0.81966389, -2.60297839, -1.09073067])

  • axis = 1 computes row-wise, collapsing the columns

matrix.max(axis=1)
array([ 0.60617379,  1.8597819 ,  0.88728258,  0.75971677, -0.15160382])
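The collapsing behaviour of the axis keyword is easier to verify on a small array with known values. The following sketch uses a made-up 2 x 3 array (table is an illustrative name) and checks the shape of each result:

```python
import numpy as np

# Hypothetical 2 x 3 array: [[0, 1, 2], [3, 4, 5]]
table = np.arange(6).reshape(2, 3)

# axis=0 collapses the 2 rows, leaving one value per column.
column_sums = table.sum(axis=0)
print(column_sums, column_sums.shape)   # [3 5 7] (3,)

# axis=1 collapses the 3 columns, leaving one value per row.
row_sums = table.sum(axis=1)
print(row_sums, row_sums.shape)         # [ 3 12] (2,)
```

In both cases the axis that was passed is the one that disappears from the shape of the result.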

Aggregate Functions on Missing Data

Standard NumPy aggregate functions such as np.sum, np.mean, np.min, and np.max do not skip missing data, which NumPy represents as NaN: a single NaN in the input makes the result NaN. For this reason NumPy also provides NaN-safe routines that ignore missing values.
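A minimal sketch of the difference between a standard aggregate and its NaN-safe counterpart, using a made-up array with one missing value (readings is an illustrative name). In current NumPy releases np.sum returns nan here rather than raising an exception:

```python
import numpy as np

# Hypothetical measurements with one missing value.
readings = np.array([1.0, np.nan, 3.0])

# The standard aggregate propagates the NaN.
plain_total = np.sum(readings)
print(plain_total)    # nan

# The NaN-safe versions ignore the missing value.
safe_total = np.nansum(readings)
safe_mean = np.nanmean(readings)
print(safe_total)     # 4.0
print(safe_mean)      # 2.0
```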

The following table provides a list of useful aggregate functions and their NaN-safe equivalents:

Function Name    NaN-safe Version    Description
-------------    ----------------    -----------
np.sum           np.nansum           Compute sum of elements
np.prod          np.nanprod          Compute product of elements
np.mean          np.nanmean          Compute mean of elements
np.std           np.nanstd           Compute standard deviation
np.var           np.nanvar           Compute variance
np.min           np.nanmin           Find minimum value
np.max           np.nanmax           Find maximum value
np.argmin        np.nanargmin        Find index of minimum value
np.argmax        np.nanargmax        Find index of maximum value
np.median        np.nanmedian        Compute median of elements
np.percentile    np.nanpercentile    Compute rank-based statistics of elements
np.any           N/A                 Evaluate whether any elements are true
np.all           N/A                 Evaluate whether all elements are true
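To give a feel for a few of the less familiar entries, here is a short sketch on a made-up array (scores and the other variable names are illustrative). Since np.any and np.all have no NaN-safe versions, the example combines them with np.isnan, which is one possible way to handle missing values with those two functions.

```python
import numpy as np

# Hypothetical data with one missing value.
scores = np.array([3.0, np.nan, 7.0, 5.0])

# Index of the largest non-missing value.
best_index = np.nanargmax(scores)
print(best_index)      # 2

# Median and 50th percentile of the non-missing values agree.
median = np.nanmedian(scores)
p50 = np.nanpercentile(scores, 50)
print(median, p50)     # 5.0 5.0

# np.any / np.all operate on boolean conditions.
has_missing = np.any(np.isnan(scores))
all_positive = np.all(scores[~np.isnan(scores)] > 0)
print(has_missing, all_positive)   # True True
```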

Source: Jake VanderPlas (2016), Python Data Science Handbook: Essential Tools for Working with Data, O’Reilly Media.