Chapter 3. Working with Datasets
Datasets are the central feature of HDF5. You can think of them
as NumPy arrays that live on disk. Every dataset in HDF5 has a
name, a type, and a shape, and supports random access. Unlike
np.save and friends, there’s no need to read and write the
entire array as a block; you can use the standard NumPy slicing
syntax to read and write just the parts you want.
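A quick sketch of what that buys you (the file and dataset names
here are made up, and create_dataset is covered later in this
chapter): you can touch a small region of a large on-disk array
without ever loading the whole thing into memory.
>>> import h5py
>>> with h5py.File("big.hdf5", "w") as f:
...     dset = f.create_dataset("data", (10_000_000,), dtype="f8")
...     dset[0:100] = 42.0      # write only the first 100 elements
...     tail = dset[-10:]       # read only the last 10 elements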
Dataset Basics
First, let’s create a file so we have somewhere to store our datasets:
>>> import h5py
>>> import numpy as np
>>> f = h5py.File("testfile.hdf5", "a")   # read/write, creating the file if needed
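For reference, a sketch of the mode flags h5py.File accepts,
shown one per line rather than as a single session; in h5py 3
and later the default mode is read-only “r”, which is why we
passed “a” above (“x” may also be spelled “w-”):
>>> h5py.File("testfile.hdf5", "r")    # read-only; file must exist
>>> h5py.File("testfile.hdf5", "r+")   # read/write; file must exist
>>> h5py.File("testfile.hdf5", "w")    # create, truncating any existing file
>>> h5py.File("testfile.hdf5", "x")    # create, failing if the file exists
>>> h5py.File("testfile.hdf5", "a")    # read/write, creating if needed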
Every dataset in an HDF5 file has a name. Let’s see what happens if we just assign a new NumPy array to a name in the file:
>>> arr = np.ones((5, 2))
>>> f["my dataset"] = arr
>>> dset = f["my dataset"]
>>> dset
<HDF5 dataset "my dataset": shape (5, 2), type "<f8">
We put in a NumPy array but got back something else: an instance
of the class h5py.Dataset. This is a “proxy” object that lets
you read and write to the underlying HDF5 dataset on disk.
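Slicing into this proxy is how data moves in and out. As a
minimal continuation of the session above: reads return ordinary
NumPy arrays, and assignments write straight through to the file.
>>> out = dset[...]          # read the entire dataset into memory
>>> type(out)
<class 'numpy.ndarray'>
>>> dset[0, :] = 42.0        # write part of the dataset on disk
>>> dset[0, :]
array([42., 42.])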
Type and Shape
Let’s explore the Dataset object. If you’re using IPython, type
dset. and hit Tab to see the object’s attributes; otherwise, do
dir(dset). There are a lot, but a few stand out:
>>> dset.dtype
dtype('float64')
Each dataset has a fixed type that is defined when it’s created
and can never be changed. HDF5 has a vast, expressive type
mechanism that can easily handle the built-in NumPy types, with
a few exceptions. For this reason, h5py always expresses the
type of a dataset using standard NumPy dtype objects.
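Because the type is fixed for the life of the dataset, you pick
it when the dataset is created. As a small sketch (the name
"small_ints" is made up), create_dataset accepts a NumPy dtype
directly:
>>> dset_int = f.create_dataset("small_ints", (5, 2), dtype=np.int8)
>>> dset_int.dtype
dtype('int8')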
There’s ...