Exploring a qp file
This notebook takes you through what the data structure of an Ensemble looks like, and what a qp HDF5 file contains.
import qp
import h5py
import tables_io
import numpy as np
from scipy import stats
What’s in a qp file?
First, let’s read in an Ensemble from an HDF5 file using qp and take a look at the metadata.
ens_i = qp.read("../assets/interp-ensemble.hdf5")
ens_i
Ensemble(the_class=interp,shape=(3, 50))
ens_i.metadata
{'pdf_name': array([b'interp'], dtype='|S6'),
'pdf_version': array([0]),
'xvals': array([-1. , -0.87755102, -0.75510204, -0.63265306, -0.51020408,
-0.3877551 , -0.26530612, -0.14285714, -0.02040816, 0.10204082,
0.2244898 , 0.34693878, 0.46938776, 0.59183673, 0.71428571,
0.83673469, 0.95918367, 1.08163265, 1.20408163, 1.32653061,
1.44897959, 1.57142857, 1.69387755, 1.81632653, 1.93877551,
2.06122449, 2.18367347, 2.30612245, 2.42857143, 2.55102041,
2.67346939, 2.79591837, 2.91836735, 3.04081633, 3.16326531,
3.28571429, 3.40816327, 3.53061224, 3.65306122, 3.7755102 ,
3.89795918, 4.02040816, 4.14285714, 4.26530612, 4.3877551 ,
4.51020408, 4.63265306, 4.75510204, 4.87755102, 5. ])}
Using h5py
Now that we know for sure this file contains an Ensemble, let’s use h5py to read in the data in the file and get a look at how it’s formatted.
fileobj = h5py.File("../assets/interp-ensemble.hdf5", "r")
list(fileobj.keys())
['ancil', 'data', 'meta']
This HDF5 file has 3 keys, meta for metadata, data for objdata, and ancil for ancillary data. Each of these is a group object. Let’s take a closer look at them to see what’s stored in them:
print(f"meta: {list(fileobj['meta'].keys())}")
print(f"data : {list(fileobj['data'].keys())}")
print(f"ancil: {list(fileobj['ancil'].keys())}")
meta: ['pdf_name', 'pdf_version', 'xvals']
data : ['yvals']
ancil: ['ids']
Each of these groups contains at least one dataset. If you look back at the Ensemble metadata dictionary we printed out earlier, you can see that the meta group has the same metadata keys as the Ensemble metadata dictionary. Each of these keys is its own dataset. If we print out the datasets for all the metadata keys, we can see the same information that we saw earlier:
# print out the contents of the metadata datasets
print(f"pdf_name: {fileobj['meta']['pdf_name'][:]}")
print(f"pdf_version: {fileobj['meta']['pdf_version'][:]}")
print(f"xvals: {fileobj['meta']['xvals'][:]}")
pdf_name: [b'interp']
pdf_version: [0]
xvals: [[-1. -0.87755102 -0.75510204 -0.63265306 -0.51020408 -0.3877551
-0.26530612 -0.14285714 -0.02040816 0.10204082 0.2244898 0.34693878
0.46938776 0.59183673 0.71428571 0.83673469 0.95918367 1.08163265
1.20408163 1.32653061 1.44897959 1.57142857 1.69387755 1.81632653
1.93877551 2.06122449 2.18367347 2.30612245 2.42857143 2.55102041
2.67346939 2.79591837 2.91836735 3.04081633 3.16326531 3.28571429
3.40816327 3.53061224 3.65306122 3.7755102 3.89795918 4.02040816
4.14285714 4.26530612 4.3877551 4.51020408 4.63265306 4.75510204
4.87755102 5. ]]
Using tables_io
Now let’s take a look at the file using tables_io to read in the data in the file and get a look at how it’s formatted. tables_io also uses h5py to read in the file, but it only takes one function call to read in the whole file to a dictionary of dictionaries, or a TableDict-like object. For more information about tables_io, check out its documentation.
file_tab_i = tables_io.read("../assets/interp-ensemble.hdf5")
file_tab_i.keys()
odict_keys(['ancil', 'data', 'meta'])
It’s an ordered dictionary with three keys: meta for metadata, data for objdata, and ancil for ancillary data. Let’s take a look at each of these dictionaries to see how they’re formatted:
file_tab_i["meta"]
OrderedDict([('pdf_name', array([b'interp'], dtype='|S6')),
('pdf_version', array([0])),
('xvals',
array([[-1. , -0.87755102, -0.75510204, -0.63265306, -0.51020408,
-0.3877551 , -0.26530612, -0.14285714, -0.02040816, 0.10204082,
0.2244898 , 0.34693878, 0.46938776, 0.59183673, 0.71428571,
0.83673469, 0.95918367, 1.08163265, 1.20408163, 1.32653061,
1.44897959, 1.57142857, 1.69387755, 1.81632653, 1.93877551,
2.06122449, 2.18367347, 2.30612245, 2.42857143, 2.55102041,
2.67346939, 2.79591837, 2.91836735, 3.04081633, 3.16326531,
3.28571429, 3.40816327, 3.53061224, 3.65306122, 3.7755102 ,
3.89795918, 4.02040816, 4.14285714, 4.26530612, 4.3877551 ,
4.51020408, 4.63265306, 4.75510204, 4.87755102, 5. ]]))])
file_tab_i["data"]
OrderedDict([('yvals',
array([[5.62574037e-11, 8.27495969e-10, 1.03658971e-08, 1.10586652e-07,
1.00473916e-06, 7.77425458e-06, 5.12293674e-05, 2.87497474e-04,
1.37405425e-03, 5.59279029e-03, 1.93868833e-02, 5.72324401e-02,
1.43890240e-01, 3.08088307e-01, 5.61789868e-01, 8.72423581e-01,
1.15381371e+00, 1.29956738e+00, 1.24657012e+00, 1.01833210e+00,
7.08462651e-01, 4.19758300e-01, 2.11805109e-01, 9.10182281e-02,
3.33100380e-02, 1.03818963e-02, 2.75570708e-03, 6.22937139e-04,
1.19925131e-04, 1.96621492e-05, 2.74540600e-06, 3.26465262e-07,
3.30614716e-08, 2.85142657e-09, 2.09438737e-10, 1.31010660e-11,
6.97928711e-13, 3.16643299e-14, 1.22344470e-15, 4.02580927e-17,
1.12817599e-18, 2.69249754e-20, 5.47253547e-22, 9.47276278e-24,
1.39643120e-25, 1.75314257e-27, 1.87443217e-29, 1.70677799e-31,
1.32354631e-33, 8.74089811e-36],
[4.38787527e-07, 2.09991515e-06, 9.19647172e-06, 3.68563866e-05,
1.35168748e-04, 4.53640531e-04, 1.39321913e-03, 3.91560641e-03,
1.00704912e-02, 2.37014163e-02, 5.10469694e-02, 1.00609190e-01,
1.81458518e-01, 2.99494665e-01, 4.52348175e-01, 6.25213886e-01,
7.90781336e-01, 9.15284751e-01, 9.69455915e-01, 9.39662562e-01,
8.33465840e-01, 6.76512316e-01, 5.02499497e-01, 3.41560450e-01,
2.12457244e-01, 1.20933753e-01, 6.29934715e-02, 3.00272510e-02,
1.30980796e-02, 5.22843486e-03, 1.90988755e-03, 6.38433864e-04,
1.95297217e-04, 5.46698918e-05, 1.40046544e-05, 3.28298275e-06,
7.04266167e-07, 1.38253798e-07, 2.48364397e-08, 4.08294594e-09,
6.14228468e-10, 8.45586954e-11, 1.06526738e-11, 1.22809229e-12,
1.29561331e-13, 1.25081137e-14, 1.10504575e-15, 8.93389249e-17,
6.60956998e-18, 4.47484204e-19],
[2.87668051e-05, 8.15692010e-05, 2.18658974e-04, 5.54134158e-04,
1.32760555e-03, 3.00697428e-03, 6.43868072e-03, 1.30337860e-02,
2.49431215e-02, 4.51271116e-02, 7.71846238e-02, 1.24804593e-01,
1.90781765e-01, 2.75708196e-01, 3.76676920e-01, 4.86513466e-01,
5.94055798e-01, 6.85750500e-01, 7.48361689e-01, 7.72082093e-01,
7.53046741e-01, 6.94363536e-01, 6.05282892e-01, 4.98811448e-01,
3.88616239e-01, 2.86227921e-01, 1.99301042e-01, 1.31193905e-01,
8.16439997e-02, 4.80331871e-02, 2.67156075e-02, 1.40473766e-02,
6.98283720e-03, 3.28152055e-03, 1.45789034e-03, 6.12323756e-04,
2.43132983e-04, 9.12668725e-05, 3.23883615e-05, 1.08660402e-05,
3.44635616e-06, 1.03336923e-06, 2.92925659e-07, 7.84993073e-08,
1.98875257e-08, 4.76323711e-09, 1.07852488e-09, 2.30868487e-10,
4.67203046e-11, 8.93826431e-12]]))])
file_tab_i["ancil"]
OrderedDict([('ids', array([1., 2., 3.]))])
We can get a similar data structure from the Ensemble itself by using the method build_tables, which is what is called to create a dictionary of the three main data tables before writing an Ensemble to file.
tables_i = ens_i.build_tables()
tables_i.keys()
dict_keys(['meta', 'data', 'ancil'])
We can compare the metadata tables generated by the build_tables method to the ones read in from file by tables_io:
print("From build_tables method:")
print(tables_i["meta"])
print("From file:")
print(file_tab_i["meta"])
From build_tables method:
{'pdf_name': array([b'interp'], dtype='|S6'), 'pdf_version': array([0]), 'xvals': array([[-1. , -0.87755102, -0.75510204, -0.63265306, -0.51020408,
-0.3877551 , -0.26530612, -0.14285714, -0.02040816, 0.10204082,
0.2244898 , 0.34693878, 0.46938776, 0.59183673, 0.71428571,
0.83673469, 0.95918367, 1.08163265, 1.20408163, 1.32653061,
1.44897959, 1.57142857, 1.69387755, 1.81632653, 1.93877551,
2.06122449, 2.18367347, 2.30612245, 2.42857143, 2.55102041,
2.67346939, 2.79591837, 2.91836735, 3.04081633, 3.16326531,
3.28571429, 3.40816327, 3.53061224, 3.65306122, 3.7755102 ,
3.89795918, 4.02040816, 4.14285714, 4.26530612, 4.3877551 ,
4.51020408, 4.63265306, 4.75510204, 4.87755102, 5. ]])}
From file:
OrderedDict([('pdf_name', array([b'interp'], dtype='|S6')), ('pdf_version', array([0])), ('xvals', array([[-1. , -0.87755102, -0.75510204, -0.63265306, -0.51020408,
-0.3877551 , -0.26530612, -0.14285714, -0.02040816, 0.10204082,
0.2244898 , 0.34693878, 0.46938776, 0.59183673, 0.71428571,
0.83673469, 0.95918367, 1.08163265, 1.20408163, 1.32653061,
1.44897959, 1.57142857, 1.69387755, 1.81632653, 1.93877551,
2.06122449, 2.18367347, 2.30612245, 2.42857143, 2.55102041,
2.67346939, 2.79591837, 2.91836735, 3.04081633, 3.16326531,
3.28571429, 3.40816327, 3.53061224, 3.65306122, 3.7755102 ,
3.89795918, 4.02040816, 4.14285714, 4.26530612, 4.3877551 ,
4.51020408, 4.63265306, 4.75510204, 4.87755102, 5. ]]))])
They are essentially identical. One thing to note is that build_tables can also encode any strings in the ancil table, if you provide it with the appropriate arguments. This is useful if you will be writing to HDF5 files.
Creating a qp file from scratch
Now let’s try to create an Ensemble file from scratch, and see if we can read it in as an Ensemble. First, we need a metadata table (dictionary) with the appropriate keys. Let’s make it an interpolation parameterized Ensemble as well, so it will need: “pdf_name”, “pdf_version”, and “xvals”. Note that for this to be a Table-like object, there are a few things we have to make sure to do:
All values must be iterable, so they all must be arrays
They must all have the same length, or first dimension. Since “xvals” will inherently have more values than any other value in the metadata here, we make it a 2D array, so the first dimension is 1.
xvals = np.array([np.linspace(0,5,10, )])
new_meta = {"pdf_name": np.array(["interp".encode()]),"pdf_version": np.array([0]),"xvals": xvals}
new_meta
{'pdf_name': array([b'interp'], dtype='|S6'),
'pdf_version': array([0]),
'xvals': array([[0. , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
2.77777778, 3.33333333, 3.88888889, 4.44444444, 5. ]])}
So far, that looks like it matches the format of our Ensemble tables above. Now let’s make a data table with the “yvals” for 3 distributions.
yvals = np.array([[0,1,2,3,4,4,3,2,1,0],[0,0.1,0.2,0.3,0.4,0.4,0.3,0.2,0.1,0],[0,1,2,3,4,5,4,2,1,0]])
new_data = {"yvals": yvals}
new_data
{'yvals': array([[0. , 1. , 2. , 3. , 4. , 4. , 3. , 2. , 1. , 0. ],
[0. , 0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1, 0. ],
[0. , 1. , 2. , 3. , 4. , 5. , 4. , 2. , 1. , 0. ]])}
ancil = np.linspace(0,2,3)
new_ancil = {"ids": ancil}
new_ancil
{'ids': array([0., 1., 2.])}
Using tables_io
Now we can use tables_io to write out the HDF5 file:
data_tables = {"meta":new_meta, "data": new_data, "ancil": new_ancil}
tables_io.write(data_tables, "../assets/new-interp-ensemble.hdf5")
'../assets/new-interp-ensemble.hdf5'
Now the tables wrote to file, but is it a qp file? Let’s check:
qp.is_qp_file("../assets/new-interp-ensemble.hdf5")
True
Yay! We’ve successfully created a qp file. Let’s try reading it in as an Ensemble to make sure.
new_ens = qp.read("../assets/new-interp-ensemble.hdf5")
new_ens
Ensemble(the_class=interp,shape=(3, 10))
Using h5py
We can also use h5py to write out the HDF5 manually, like so:
# create the file object
f = h5py.File("../assets/h5py-interp-ensemble.hdf5", "w")
# create the necessary groups
meta_g = f.create_group("meta")
data_g = f.create_group("data")
ancil_g = f.create_group("ancil")
# populate the groups with the datasets
# metadata
f["meta"]["pdf_name"] = new_meta["pdf_name"]
f["meta"]["pdf_version"] = new_meta["pdf_version"]
f["meta"]["xvals"] = new_meta["xvals"]
# data
f["data"]["yvals"] = new_data["yvals"]
# ancil
f["ancil"]["ids"] = new_ancil["ids"]
# make sure the file object has the right groups before closing
f.keys()
<KeysViewHDF5 ['ancil', 'data', 'meta']>
f.close()
Now let’s test that this is also a qp approved file:
qp.is_qp_file("../assets/h5py-interp-ensemble.hdf5")
True