Intro to HDF5

Hierarchical Data Format Version 5 (HDF5) is a model for managing and storing data. It includes both the storage model (i.e., the file format, commonly saved as .h5 or .hdf5) and the libraries that provide programming interfaces to this model (e.g., h5py).

HDF5 Data Model

The HDF5 file is a portable, self-describing file format that supports large, complex, and heterogeneous data.

There are three main components to the HDF5 data structure:

  1. Groups
  2. Datasets
  3. Attributes
Groups
  • analogous to a file system’s directory
    (i.e., it’s like a folder)
  • may contain zero or more objects
  • may have zero or more attributes
  • every object (other than the root group) must be a member of at least one group
Datasets
  • a multi-dimensional array of data elements
  • a data element is a set of bits that describes a number, a character, or an array of heterogeneous data elements (i.e., it can be just about anything)
  • has data type (description of the data)
  • has data space (layout of the data)
  • may have zero or more attributes
  • may be compressed
Attributes
  • document an object (i.e., it’s metadata)
  • have a name and data (e.g., like a key-value pair)
  • the attribute data is like a dataset, except:
    • it should be small
    • it only lives with its associated object
    • it cannot be partially read (it is always read or written in full)
    • it may not have other attributes
Overview of the HDF5 Data Model.
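
The three components above can be sketched with h5py. This is only an illustration: the file name, the group name "measurements", the dataset name "temperature", and the attribute values are made up for this example, and the file lives in memory so nothing is written to disk.

```python
import numpy as np
import h5py

# In-memory HDF5 file; "model_demo.h5" is a placeholder name (assumption).
with h5py.File("model_demo.h5", "w", driver="core", backing_store=False) as f:
    # Group: like a folder; it may carry its own attributes.
    grp = f.create_group("measurements")
    grp.attrs["description"] = "example group"

    # Dataset: a multi-dimensional array with a data type and a shape,
    # optionally compressed.
    data = np.arange(12, dtype="float32").reshape(3, 4)
    dset = grp.create_dataset("temperature", data=data, compression="gzip")

    # Attribute: a small piece of named metadata attached to an object.
    dset.attrs["units"] = "degC"

    # Capture the properties while the file is still open.
    dset_dtype = dset.dtype
    dset_shape = dset.shape
    dset_compression = dset.compression
    units = dset.attrs["units"]

    # Datasets support partial reads: only the first row is read here.
    first_row = dset[0, :]

print(dset_dtype, dset_shape, dset_compression, units)
```

Note that the dataset is sliced to read one row, while the attribute is always read as a whole.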

Designing HDF5 Files

It all begins with the root (/).

Root group (every HDF5 file has one).

From there, you may define some groups.

Root group with two sub-group members.

And, in one of the groups, you put a dataset.

Root group with two groups; one with a dataset member.

Then, you decide to add a dataset to your root group.

Root group with two group members and a dataset, where one group member also has a dataset.

The world is your oyster.

Images taken from the HDF User’s Guide
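
The same design steps can be sketched with h5py. The names "group_a", "group_b", "dataset_1", and "dataset_2" are hypothetical, chosen only to mirror the figures, and the file is kept in memory.

```python
import numpy as np
import h5py

# In-memory HDF5 file; "design_demo.h5" is a placeholder name (assumption).
with h5py.File("design_demo.h5", "w", driver="core", backing_store=False) as f:
    # It all begins with the root group.
    root_name = f.name

    # Define two groups under the root.
    group_a = f.create_group("group_a")
    group_b = f.create_group("group_b")

    # Put a dataset in one of the groups.
    group_a.create_dataset("dataset_1", data=np.zeros((2, 3)))

    # Add a dataset directly to the root group.
    f.create_dataset("dataset_2", data=np.ones(5))

    # Walk the hierarchy and collect every object's path.
    paths = []
    f.visit(paths.append)

    # Objects can also be addressed by path, like files in folders.
    has_nested = "group_a/dataset_1" in f

print(root_name)
for p in sorted(paths):
    print(p)
```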

Activity

Exploring HDF

Let’s take a look at an example scenario.

Find where all the plant life is located on Earth.

What data do we need?

→ Global vegetation index

Where do we get it?

  • NASA has two satellites in orbit: Aqua and Terra


Both satellites host the Moderate Resolution Imaging Spectroradiometer, or MODIS for short.

By combining spectral bands from MODIS, we can “see” vegetation coverage over the entire globe as 16-day to monthly averages (it takes a while for these satellites to image the whole Earth).
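
As a rough illustration of what "combining spectral bands" means, here is a minimal sketch of one widely used index, the Normalized Difference Vegetation Index: NDVI = (NIR - Red) / (NIR + Red). The function name and the reflectance values are made up for this example; this is not the MODIS processing chain.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index from two reflectance bands."""
    nir = np.asarray(nir, dtype="float64")
    red = np.asarray(red, dtype="float64")
    total = nir + red
    # Start with NaN everywhere, then fill in pixels where the sum is non-zero.
    out = np.full(total.shape, np.nan)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out

# Hypothetical reflectances: a vegetated pixel, a bare pixel, a no-data pixel.
nir_band = [0.5, 0.3, 0.0]
red_band = [0.1, 0.3, 0.0]
result = ndvi(nir_band, red_band)
print(result)
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so higher values suggest denser plant cover.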

How to Find This Data

  • MODIS data is maintained by LP DAAC (Land Processes Distributed Active Archive Center)
  • You get access to all* DAACs through EarthData

*Some data are only available to select users/researchers.

How do we access it?

→ Create an EarthData account.
  1. Click “Register”
  2. Complete the registration form
    • Affiliation: Education
    • User Type: Public User
    • Organization: “William & Mary”
    • Study Area: any; this example is for “Land Processes”

Challenge

In the LP DAAC, find the latest monthly 0.05-degree global EVI vegetation index from the Terra satellite in HDF file format.

What is the name of that file?

Working with HDF Files

Let’s try it out.

Python API

We are going to use h5py to read and write the HDF5 file format.
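
Before the demo, here is a minimal write-then-read sketch with h5py. It is not the content of the course scripts; the file name "example.h5", the dataset name "evi", and the values are placeholders invented for this example.

```python
import os
import tempfile

import numpy as np
import h5py

# Write to a temporary folder so the example leaves the working directory clean.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.h5")  # placeholder file name

# Write: open in "w" mode, create a dataset, and attach an attribute.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("evi", data=np.linspace(0.0, 1.0, 6).reshape(2, 3))
    dset.attrs["long_name"] = "hypothetical vegetation index values"

# Read: open in "r" mode, list the root members, and load the array.
with h5py.File(path, "r") as f:
    names = list(f.keys())
    values = f["evi"][:]  # [:] reads the whole dataset into a NumPy array
    long_name = f["evi"].attrs["long_name"]

print(names, values.shape, long_name)
```

Using the file as a context manager (the `with` statement) closes it automatically, even if an error occurs partway through.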

Find hdf_read.py and hdf_write.py in the scripts folder of our spatial-data-discovery.github.io repository.



Please follow along with the demo.