Iteration Example

qp has a built-in method that creates a generator for iterating through a qp file in chunks. In this notebook we will test it out by reading Ensembles from a file and then writing them back out a chunk at a time.

import numpy as np
import os
import qp

Reading Ensembles from file

Let’s read in our file and see what the Ensemble looks like.

# the path to the file
data_file = "../assets/test.hdf5"
ens = qp.read(data_file)
print(ens)
Ensemble(the_class=mixmod,shape=(100, 3))

We have an Ensemble of 100 Gaussian mixture model distributions, with 3 Gaussian components each. That’s a lot to handle at once. Instead of reading in the whole file, we can use the iterator method to create a generator that yields one subset of the distributions at a time. We still want to know how many distributions are in the file, though, so we know what chunk size to pick. To do that we can use the qp.data_length function:

qp.data_length(data_file)
100

Since we have 100 distributions, let’s pick a chunk size of 10:

itr = qp.iterator(data_file, chunk_size=10)
type(itr)
generator

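Now that we have our generator, we can iterate through the file one chunk of 10 distributions at a time and get whatever we need from each chunk. On every step the generator yields the start index, the end index, and an Ensemble holding just that chunk. Let’s check that the PDFs of each chunk match the PDFs of the same slice of the full Ensemble. We’ll evaluate the PDF at test_vals for each of the chunks.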

test_vals = np.linspace(0., 1., 11)
for start, end, ens_i in itr:
    print(f"Chunk indices are: ({start}:{end})")
    if np.allclose(ens[start:end].pdf(test_vals), ens_i.pdf(test_vals)):
        print("The PDF values match")
    else:
        print("The PDF values for the iterated chunk do not match the values for the chunk from the whole Ensemble")
Chunk indices are: (0:10)
The PDF values match
Chunk indices are: (10:20)
The PDF values match
Chunk indices are: (20:30)
The PDF values match
Chunk indices are: (30:40)
The PDF values match
Chunk indices are: (40:50)
The PDF values match
Chunk indices are: (50:60)
The PDF values match
Chunk indices are: (60:70)
The PDF values match
Chunk indices are: (70:80)
The PDF values match
Chunk indices are: (80:90)
The PDF values match
Chunk indices are: (90:100)
The PDF values match
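
The same loop pattern works for accumulating a summary across chunks, so that only one chunk has to be in memory at a time. The sketch below is a minimal illustration that does not use qp at all: fake_iterator is a hypothetical stand-in that mimics the (start, end, chunk) tuples shown above, and the list of numbers stands in for some per-distribution quantity.

```python
def fake_iterator(values, chunk_size):
    """Hypothetical stand-in for a chunked reader: yields (start, end, chunk)."""
    for start in range(0, len(values), chunk_size):
        end = min(start + chunk_size, len(values))
        yield start, end, values[start:end]

# Pretend each number is a quantity computed for one distribution.
values = [float(i) for i in range(100)]

total = 0.0
count = 0
for start, end, chunk in fake_iterator(values, chunk_size=10):
    total += sum(chunk)   # only this chunk is needed at this point
    count += end - start  # number of items covered so far

running_mean = total / count
print(running_mean)
```

With a real qp generator, the body of the loop would work on the yielded Ensemble instead of a list slice.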

You can also create the generator directly in the for statement, as shown below. This time we use a chunk size of 11 to demonstrate how the iteration behaves when the number of distributions is not evenly divided by the given chunk size:

for start, end, ens_chunk in qp.iterator(data_file, chunk_size=11):
    print(f"Indices are: ({start}, {end})")
    print(ens_chunk)
Indices are: (0, 11)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (11, 22)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (22, 33)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (33, 44)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (44, 55)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (55, 66)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (66, 77)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (77, 88)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (88, 99)
Ensemble(the_class=mixmod,shape=(11, 3))
Indices are: (99, 100)
Ensemble(the_class=mixmod,shape=(1, 3))

If the number of distributions is not evenly divisible by the chunk size, then the last chunk will contain only the remaining distributions.
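
To predict how many chunks to expect and how large the last one will be, the arithmetic can be sketched with divmod. The helper chunk_summary is a hypothetical name used only for illustration, and it assumes at least one distribution in the file; it is not part of qp.

```python
def chunk_summary(n_dist, chunk_size):
    """Return (number of chunks, size of the last chunk) for n_dist > 0."""
    n_full, remainder = divmod(n_dist, chunk_size)
    n_chunks = n_full + (1 if remainder else 0)
    last_size = remainder if remainder else chunk_size
    return n_chunks, last_size

print(chunk_summary(100, 11))  # (10, 1)
print(chunk_summary(100, 10))  # (10, 10)
```

For 100 distributions and a chunk size of 11 this gives 10 chunks with a final chunk of 1, which agrees with the output above.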

Writing Ensembles to file

Now that we know how to read in an Ensemble iteratively, let’s take a look at how to write one out a chunk at a time to an HDF5 file. First, let’s set up a file path to write our Ensemble to, and a chunk_size, or number of distributions to write at a time.

import tempfile
import os

td = tempfile.TemporaryDirectory()

new_file_path = os.path.join(td.name, "test-write.hdf5") # file to write to
chunk_size = 5 # number of distributions to write at a time

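Now we initialize our new HDF5 file. This creates an HDF5 file with the groups and datasets we need to store this Ensemble, but empty. The second argument is the number of distributions the file should have room for, which is 50 here.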

groups, fout = ens.initializeHdf5Write(new_file_path, 50)

Next, we can iterate through the distributions in our Ensemble, one chunk at a time, and write their data to the HDF5 file. Let’s only write half of the distributions, so we’ll set our iteration to go from 0 to 50.

for i in range(0, 50, chunk_size):
    ens[i:i+chunk_size].writeHdf5Chunk(groups, i, i+chunk_size)

Now that all of our distribution data is written, we can add in our metadata and close the file. This is done with the following command:

ens.finalizeHdf5Write(fout)

We can check that this successfully wrote out 50 of our distributions by getting the number of distributions in our new file:

qp.data_length(new_file_path)
50

Great, we’ve successfully iterated through part of an Ensemble and written it to file in chunks. Let’s delete the file, since we don’t need duplicates.

td.cleanup()

Iteration in parallel

Writing can also be done in parallel, by passing an MPI Communicator to the comm argument of initializeHdf5Write(). This sets up the HDF5 file for parallel writing, so each process will be able to write chunks of data to the file.
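
When several processes share the work, each one needs its own range of distributions to write. The sketch below shows one possible way to split the range into contiguous blocks by rank. It is plain Python with no MPI calls, rank_range is a hypothetical helper, and the splitting scheme is an assumption made for illustration rather than a description of how qp or MPI assigns work.

```python
def rank_range(n_dist, parallel_size, rank):
    """Hypothetical helper: contiguous [start, end) block for one process."""
    base, extra = divmod(n_dist, parallel_size)
    # The first `extra` ranks each take one additional distribution.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end

# Split 50 distributions across 4 processes.
ranges = [rank_range(50, 4, rank) for rank in range(4)]
for rank, (start, end) in enumerate(ranges):
    print(f"rank {rank}: ({start}:{end})")
```

In a real MPI run the rank and the number of processes would come from the communicator, and each process would loop over its own block in steps of chunk_size, as in the serial write loop above.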