Saving and Loading Data

Once we are constructing arrays and computing with them, at some point we will want to save our results. We may also want to import an existing array, or data from a file, to work on.

This notebook looks at the NumPy functions for doing this:

import numpy as np

Saving text files with savetxt

The simplest way to save an array is to write it out to a plain text file. NumPy’s savetxt function allows us to do this easily:

x = np.array([[1, 2, 3], 
              [4, 5, 6],
              [7, 8, 9]], np.int32)
np.savetxt("test.txt", x)

We can verify that the array was saved by using a shell command within our Python / Jupyter session:

!ls *.txt
test.txt

The savetxt function gives us a lot of flexibility. For example, we can use fmt to choose how each number is formatted (such as the number of decimal places or zero padding), and delimiter to choose how the individual elements are separated in the text file:

np.savetxt("test2.txt", x, fmt="%2.3f", delimiter=",")
np.savetxt("test3.txt", x, fmt="%04d", delimiter=" :-) ")
!ls *.txt
test.txt  test2.txt  test3.txt
!head *.txt
==> test.txt <==
1.000000000000000000e+00 2.000000000000000000e+00 3.000000000000000000e+00
4.000000000000000000e+00 5.000000000000000000e+00 6.000000000000000000e+00
7.000000000000000000e+00 8.000000000000000000e+00 9.000000000000000000e+00

==> test2.txt <==
1.000,2.000,3.000
4.000,5.000,6.000
7.000,8.000,9.000

==> test3.txt <==
0001 :-) 0002 :-) 0003
0004 :-) 0005 :-) 0006
0007 :-) 0008 :-) 0009
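
Note that test.txt above stores the integers in floating-point notation, because that is the default format. As a minimal sketch, an integer format such as fmt="%d" keeps the file compact. The temporary directory and the file name ints.txt are assumptions made only for this illustration:

```python
import os
import tempfile

import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], np.int32)

# Write into a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ints.txt")  # hypothetical file name
    # "%d" writes each element as a plain integer instead of 1.000...e+00
    np.savetxt(path, x, fmt="%d", delimiter=",")
    with open(path) as fh:
        text = fh.read()

print(text)
```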

We can also tell NumPy how we want new lines to be stored, and we can add comments at the beginning and end of the file that will not be read back in when we load the data.

# or to go over the top
np.savetxt('test4.txt', x, fmt='%2.3f', delimiter=',', 
               newline='\n', header='this is a header', 
               footer='and a footer', comments='## ')
!head test4.txt
## this is a header
1.000,2.000,3.000
4.000,5.000,6.000
7.000,8.000,9.000
## and a footer
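
To check that the header and footer really are skipped on loading, here is a small round-trip sketch. It relies on loadtxt treating anything after a # as a comment by default. The temporary directory and the file name with_header.txt are assumptions for illustration only:

```python
import os
import tempfile

import numpy as np

x = np.arange(1, 10).reshape(3, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "with_header.txt")  # hypothetical file name
    np.savetxt(path, x, fmt="%d", delimiter=",",
               header="this is a header", footer="and a footer",
               comments="## ")

    # Look at the raw lines that were written ...
    with open(path) as fh:
        lines = fh.read().splitlines()

    # ... and load the data back; lines starting with '#' are ignored.
    loaded = np.loadtxt(path, delimiter=",")

print(lines[0])
print(lines[-1])
print(loaded.shape)
```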

Loading text files with loadtxt

Now that we have seen how to write an array to a file, it is unsurprising that there is a loadtxt function that allows us to read an array in from a plain text file too:

y = np.loadtxt("test.txt")
print(y)
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

Like savetxt, it lets us specify the characters used to delimit the individual elements:

y = np.loadtxt("test2.txt", delimiter=",")
print(y)
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
y = np.loadtxt("test3.txt", delimiter=" :-) ")
print(y)
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

And we can also tell NumPy the data type that we want the individual elements to be once they are read in:

y = np.loadtxt("test4.txt", delimiter=",", dtype='complex')
y
array([[1.+0.j, 2.+0.j, 3.+0.j],
       [4.+0.j, 5.+0.j, 6.+0.j],
       [7.+0.j, 8.+0.j, 9.+0.j]])
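
The same dtype argument can be used to get integers back, since loadtxt returns floats by default. The sketch below reads from an in-memory string (via io.StringIO) rather than a file, which is an assumption made to keep the example self-contained. It also assumes the text contains plain integer literals such as 1 rather than 1.000:

```python
import io

import numpy as np

text = "1,2,3\n4,5,6\n"  # hypothetical file contents

# Default: every element is parsed as a 64-bit float.
as_float = np.loadtxt(io.StringIO(text), delimiter=",")

# With dtype=int the elements are parsed as integers instead.
as_int = np.loadtxt(io.StringIO(text), delimiter=",", dtype=int)

print(as_float.dtype, as_int.dtype)
```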

We can also read in parts of an array by selecting the columns to read in, and whether we want the results to be read into one large array or unpacked into multiple:

y,z = np.loadtxt('test4.txt', delimiter=',', usecols=(0, 2), unpack=True)
print(y,z)
[1. 4. 7.] [3. 6. 9.]

Loading text files with genfromtxt

NumPy also provides genfromtxt, which is often preferred over loadtxt because it can additionally cope with missing values. For simple files the functionality looks the same:

np.genfromtxt('test4.txt', delimiter=",")
array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])
np.genfromtxt('test4.txt', delimiter=',', 
                  skip_header=2, skip_footer=1, 
                  usecols=(0, -1))
array([4., 6.])
np.genfromtxt('test4.txt', delimiter=',', 
                  skip_header=2, skip_footer=1, 
                  usecols=(0, -1), 
                  names="A, C", dtype=['int', 'float'] )
array((-1, 6.), dtype=[('A', '<i8'), ('C', '<f8')])
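
Where genfromtxt differs from loadtxt is in its handling of missing entries. As a minimal sketch, using an in-memory string with one empty field (the data here is made up for illustration), a missing float becomes nan by default, or a value of our choice via filling_values:

```python
import io

import numpy as np

# Hypothetical data: the middle entry of the second row is missing.
text = "1,2,3\n4,,6\n7,8,9\n"

# By default the missing float is filled with nan.
with_nan = np.genfromtxt(io.StringIO(text), delimiter=",")

# filling_values replaces missing entries with a chosen value instead.
filled = np.genfromtxt(io.StringIO(text), delimiter=",", filling_values=0)

print(with_nan)
print(filled)
```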

Using NumPy’s native format

We have so far focussed on saving an array to a plain text file, and most of the time this is the recommended way to go. Sometimes, however, we may want to save and load arrays in NumPy’s own binary format, .npy.

The function to do this is straightforward:

np.save('x_mat.npy', x)

Note that in this case we cannot see into the array using standard tools because the array is written in NumPy’s binary format:

!head *.npy
�NUMPYv{'descr': '<i4', 'fortran_order': False, 'shape': (3, 3), }                                                          
	

If we are given an array saved as .npy, we can readily load it back into our session with:

np.load('x_mat.npy')
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]], dtype=int32)
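
Unlike the text round trip, the binary format preserves the dtype and shape exactly. The sketch below checks this, and also shows np.savez, which stores several named arrays in a single .npz archive. The temporary directory, the second array y and the file name arrays.npz are assumptions for illustration:

```python
import os
import tempfile

import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], np.int32)
y = np.linspace(0.0, 1.0, 5)  # hypothetical second array

with tempfile.TemporaryDirectory() as tmp:
    # Single array: .npy keeps dtype and shape.
    npy_path = os.path.join(tmp, "x_mat.npy")
    np.save(npy_path, x)
    x_back = np.load(npy_path)

    # Several arrays: .npz stores them under the keyword names we choose.
    npz_path = os.path.join(tmp, "arrays.npz")
    np.savez(npz_path, x=x, y=y)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        y_back = archive["y"]

print(x_back.dtype, x_back.shape)
print(names)
```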

Challenge

Let’s use our combined tools of aggregate functions and loading and saving text data to analyse some weather data from ZRH airport. The data are contained in the file data/zrh_weather.txt in this repository.

  1. Load the data for the maximum, mean and minimum temperature into a NumPy array called weather. Also load in the dates using the code excerpt at the bottom of these questions (replacing the XXX placeholders with the necessary information).

  2. Find the hottest temperature at Zurich airport over the duration of the data

  3. On what date did the hottest temperature occur? (Did it happen only once?)

  4. On what date did the minimum temperature occur?

  5. On what date was the largest difference between the maximum and the minimum temperature?

  6. Save the maximum, mean and minimum temperature to the file weather_changes.txt for the week around the date of the largest temperature difference. In the file you save write a header that says “The seven days centered around the largest temperature change”, ensuring that line begins with a triple #.

from datetime import datetime

str2date = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d')


dates = np.genfromtxt('XXX', delimiter='XXX',
                      skip_header=XXX, usecols=(XXX),
                     dtype='object').astype(str)
!head -1 data/zrh_weather.txt
"Date"&"CEST"&"Max_TemperatureC"&"Mean_TemperatureC"&"Min_TemperatureC"&"Dew_PointC"&"MeanDew_PointC"&"Min_DewpointC"&"Max_Humidity"&"Mean_Humidity"&"Min_Humidity"&"Max_Sea_Level_PressurehPa"&"Mean_Sea_Level_PressurehPa"&"Min_Sea_Level_PressurehPa"&"Max_VisibilityKm"&"Mean_VisibilityKm"&"Min_VisibilitykM"&"Max_Wind_SpeedKm_h"&"Mean_Wind_SpeedKm_h"&"Max_Gust_SpeedKm_h"&"Precipitationmm"&"CloudCover"&"Events"&"WindDirDegrees"
weather = np.genfromtxt('data/zrh_weather.txt', delimiter='&', 
                        skip_header=1, usecols=(3,4,5)) 
from datetime import datetime

str2date = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d')


dates = np.genfromtxt('data/zrh_weather.txt', delimiter='&',
                      skip_header=1, usecols=(1),
                     dtype='object').astype(str)
# max temperature
np.max(weather[:,0], axis=0)
34.0
# (first) date of max temperature:
dates[np.argmax(weather[:,0], axis=0)]
'2017-07-06'
np.unique(weather[:,0], return_counts=True)
(array([13., 14., 18., 19., 20., 21., 22., 24., 25., 26., 27., 28., 29.,
        30., 31., 32., 34.]),
 array([1, 1, 2, 1, 1, 6, 5, 7, 1, 3, 5, 6, 3, 1, 3, 3, 2]))
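
The counts above show that the maximum of 34 occurs twice, while argmax only reports the first occurrence. One way to list every matching date is a boolean mask. The sketch below uses small made-up arrays (dates and max_temp here are hypothetical stand-ins, not the ZRH data):

```python
import numpy as np

# Hypothetical stand-in data, not the values from data/zrh_weather.txt.
dates = np.array(["2017-07-01", "2017-07-02", "2017-07-03", "2017-07-04"])
max_temp = np.array([30.0, 34.0, 28.0, 34.0])

hottest = max_temp.max()

# argmax returns only the first index at which the maximum occurs ...
first_hot_day = dates[np.argmax(max_temp)]

# ... whereas a boolean mask selects every day that reaches the maximum.
hot_days = dates[max_temp == hottest]

print(first_hot_day)
print(hot_days)
```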
# date of the lowest daily maximum (column 0); use weather[:,2] for the lowest daily minimum
dates[np.argmin(weather[:,0], axis=0)]
'2017-08-10'
# largest change
np.max(weather[:,0] - weather[:,2], axis=0)
20.0
dates[np.argmax(weather[:,0] - weather[:,2], axis=0)]
'2017-07-05'
# what were the max, mean and min temperatures on that day
weather[np.argmax(weather[:,0]- weather[:,2]),]
array([31., 21., 11.])
reference_index = np.argmax(weather[:,0]- weather[:,2])
reference_index
4
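# keep all three columns (max, mean, min) for the seven days around the reference day
weather_changes = weather[reference_index-3:reference_index+4, :]
# save the selected rows (not x) to the requested file, with a triple-# header
np.savetxt('weather_changes.txt', weather_changes, fmt='%2.2f', delimiter=',', 
               newline='\n', 
               header='The seven days centered around the largest temperature change', 
               comments='### ')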
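
One caveat with slicing a window around an index: if the reference day were within three rows of the start of the data, the start of the slice would become negative and wrap around, giving an empty or unexpected result. A minimal sketch of a clamped window follows. The helper centered_window and the synthetic array are hypothetical and only illustrate the idea:

```python
import numpy as np


def centered_window(n_rows, center, half_width=3):
    """Return a slice of up to 2*half_width + 1 rows around center,
    clipped so that it never runs past either end of the data."""
    start = max(center - half_width, 0)
    stop = min(center + half_width + 1, n_rows)
    return slice(start, stop)


# Hypothetical stand-in for the weather array: 10 days, 3 columns.
weather = np.arange(30.0).reshape(10, 3)

# Far enough from the edges: a full seven-day window.
window = weather[centered_window(len(weather), 4)]

# Close to the start: the window is shortened instead of wrapping around.
near_start = weather[centered_window(len(weather), 1)]

# For comparison, the unclamped slice wraps around and comes back empty.
unclamped = weather[1 - 3:1 + 4]

print(window.shape, near_start.shape, unclamped.shape)
```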