Geospatial data, including many climate and weather datasets, are often released by government and nonprofit organizations in compressed file formats such as the Network Common Data Form (NetCDF) or GRIdded Binary (GRIB). Many of these formats were designed in a pre-cloud world for users to download whole files for analysis on a local machine. As the complexity and size of geospatial datasets continue to grow, it is more time- and cost-efficient to leave the files in one place, query the data virtually, and download only the subset that is needed.
Unlike legacy file formats, the cloud-native Zarr format is designed for virtual and efficient access to compressed chunks of data saved in a central location such as Amazon Simple Storage Service (Amazon S3) from Amazon Web Services.
In this walkthrough, learn how to convert NetCDF datasets to Zarr using an Amazon SageMaker notebook and an Amazon Web Services Fargate cluster, and query the resulting Zarr store, reducing the time required for time series queries from minutes to seconds.
Legacy geospatial data formats
Geospatial raster data files associate data points with cells in a latitude-longitude grid. At a minimum, this type of data is two-dimensional but grows to N-dimensional if, for example, multiple data variables are associated with each cell (such as temperature and windspeed) or variables are recorded over time or with an additional dimension such as altitude.
Over the years, a number of file formats have been devised to store this type of data. The GRIB format, created in 1985, stores data as a collection of compressed two-dimensional records, with each record corresponding to a single time step. However, GRIB files do not contain metadata that supports accessing records directly; retrieving a time series for a single variable requires sequential access to every single record.
NetCDF was introduced in 1990 and improved on some of the shortcomings of GRIB. NetCDF files contain metadata that supports more efficient data indexing and retrieval, including opening files remotely and virtually browsing their structure without loading everything into memory. NetCDF data can be stored as chunks within a file, enabling parallel reads and writes on different chunks. As of 2002, NetCDF4 uses the Hierarchical Data Format version 5 (HDF5) as a backend, improving the speed of parallel I/O on chunks. However, the query performance of NetCDF4 datasets can still be limited by the fact that multiple chunks are contained within a single file.
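As an illustration of this kind of remote access, the short sketch below opens a NetCDF4 file directly from object storage and prints its structure without downloading the full file. This is not code from the post; the bucket and file name are hypothetical placeholders, and the snippet assumes the fsspec/s3fs, h5netcdf, and xarray libraries are installed.

import fsspec
import xarray as xr

# Open a (hypothetical) NetCDF4 file on Amazon S3 without downloading it.
fs = fsspec.filesystem("s3", anon=True)
with fs.open("s3://example-bucket/era5/t2m_2020.nc") as f:
    ds = xr.open_dataset(f, engine="h5netcdf", chunks={})
    print(ds)  # dimensions, coordinates, and variables (metadata only)
    print(f"{ds.nbytes / 1e9:.1f} GB if fully loaded")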
Cloud-native geospatial data with Zarr
The Zarr file format organizes N-dimensional data as a hierarchy of groups (datasets) and arrays (variables within a dataset). Each array’s data is stored as a set of compressed, binary chunks, and each chunk is written to its own file.
Using Zarr with Amazon S3 enables chunk access with millisecond latency. Chunk reads and writes can run in parallel and be scaled as needed, either via multiple threads or processes from a single machine or from distributed compute resources such as containers running in Amazon Elastic Container Service (Amazon ECS). Storing Zarr data in Amazon S3 and applying distributed compute not only avoids data download time and costs but is often the only option for analyzing geospatial data that is too large to fit into the RAM or on the disk of a single machine.
Finally, the Zarr format includes accessible metadata that allows users to remotely open and browse datasets and download only the subset of data they need locally, such as the output of an analysis. The figure below shows an example of metadata files and chunks included in a Zarr group containing one array.
Figure 1. Zarr group containing one Zarr array.
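As a sketch of what this structure looks like in practice, the snippet below opens a Zarr group on Amazon S3 with the zarr and s3fs libraries and prints its hierarchy and chunking; only the small metadata files are read, not the data chunks. The store path and array name are hypothetical placeholders.

import s3fs
import zarr

# Point a key-value store interface at a (hypothetical) Zarr group on Amazon S3.
fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="s3://example-bucket/example.zarr", s3=fs, check=False)

group = zarr.open_consolidated(store)  # reads only consolidated metadata
print(group.tree())                    # hierarchy of groups and arrays
print(group["t2m"].chunks)             # chunk shape of one array ("t2m" is a placeholder name)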
Zarr is tightly integrated with other Python libraries in the open-source Pangeo technology stack for analyzing scientific data, such as Dask for coordinating distributed, parallel compute jobs and xarray as an interface for working with labeled, N-dimensional array data.
Zarr performance benefits
The Registry of Open Data on Amazon Web Services contains both NetCDF and Zarr versions of many widely used geospatial datasets, including the ECMWF Reanalysis v5 (ERA5) dataset. ERA5 contains hourly air pressure, windspeed, and air temperature data at multiple altitudes on a 31 km latitude-longitude grid, since 2008. We compared the average time to query 1, 5, 9, and 13 years of historical data for these three variables, at a given latitude and longitude, from both the NetCDF and Zarr versions of ERA5.
| Length of time series | Avg. NetCDF query time (seconds) | Avg. Zarr query time (seconds) |
| --- | --- | --- |
| 1 year | 23.8 | 1.3 |
| 5 years | 82.7 | 2.4 |
| 9 years | 135.8 | 3.5 |
| 13 years | 193.1 | 4.7 |
Figure 2. Comparison of the average time to query three ERA5 variables using a Dask cluster with 10 workers (8 vCPU, 16 GB memory).
Note that these tests are not a true one-to-one comparison of file formats, since the size of the chunks used by each format differs, and results can also vary depending on network conditions. However, they are in line with benchmark studies that show a significant performance advantage for Zarr compared to NetCDF.
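For reference, the queries timed above take the following general form: select a few variables at a single grid point over a time range, and load only the chunks that intersect that selection. The snippet below is a sketch, not the exact benchmark code; the store path and variable names are assumptions.

import fsspec
import xarray as xr

# Open a (hypothetical) ERA5 Zarr store; only metadata is read at this point.
store = fsspec.get_mapper("s3://example-bucket/era5.zarr", anon=True)
ds = xr.open_zarr(store, consolidated=True)

# One year of three variables at a single latitude/longitude.
point = (
    ds[["air_temperature_at_2_metres",
        "air_pressure_at_mean_sea_level",
        "eastward_wind_at_10_metres"]]
    .sel(lat=40.0, lon=255.0, method="nearest")
    .sel(time=slice("2020-01-01", "2020-12-31"))
)
result = point.compute()  # downloads only the chunks covering this selection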
While the Registry of Open Data on Amazon Web Services contains Zarr versions of many popular geospatial datasets, the Zarr chunking strategy of those datasets may not be optimized for an organization’s data access patterns, or organizations may have substantial internal datasets in NetCDF format. In these cases, they may wish to convert their NetCDF files to Zarr.
Moving to Zarr
There are several options for converting NetCDF files to Zarr: a Pangeo-forge recipe, the Python rechunker library, or xarray.
The Pangeo-forge project provides premade recipes that may enable you to convert some GRIB or NetCDF datasets to Zarr without diving into the low-level details of the specific formats.
For more control of the conversion process, you can use the rechunker and xarray libraries. Rechunker enables efficient manipulation of chunked arrays saved in persistent storage. Because rechunker uses an intermediate file store and is designed for parallel execution, it is useful for converting datasets larger than working memory from one file format to another (such as from NetCDF to Zarr), while simultaneously changing the underlying chunk sizes. This makes it ideal for converting large NetCDF datasets to an optimized Zarr format. Xarray is useful for converting smaller datasets to Zarr or appending data from NetCDF files to an existing Zarr store, such as additional data points in a time series (which rechunker currently does not support).
Finally, while technically not an option for converting from NetCDF to Zarr, the Kerchunk Python library creates index files that allow a NetCDF dataset to be accessed as if it were a Zarr store, yielding significant performance improvements relative to NetCDF alone without creating a new Zarr dataset.
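As a rough sketch of that approach (the file path below is a hypothetical placeholder), Kerchunk scans an HDF5-backed NetCDF file once, records the byte ranges of its chunks, and the resulting references can then be read through the Zarr engine:

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/era5/t2m_2020.nc"  # hypothetical NetCDF4 file

# Scan the file once and build a dictionary of byte-range references.
with fsspec.open(url, anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Read the original NetCDF file as if it were a Zarr store.
mapper = fsspec.get_mapper("reference://", fo=refs,
                           remote_protocol="s3", remote_options={"anon": True})
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})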
The rest of this blog post provides step-by-step instructions for using rechunker and xarray to convert a NetCDF dataset to Zarr.
Overview of solution
This solution is deployed within an Amazon Virtual Private Cloud (Amazon VPC) with an internet gateway attached. Within the private subnet, it deploys both a Dask cluster running on Amazon Web Services Fargate and the Amazon SageMaker notebook instance that will be used to submit jobs to the Dask cluster. In the public subnet, a NAT Gateway provides internet access for connections from the Dask cluster or the notebook instance (for example, when installing libraries). Finally, a Network Load Balancer (NLB) enables access to the Dask dashboard for monitoring the progress of Dask jobs. Resources within the VPC access Amazon S3 via an Amazon S3 VPC endpoint.
Figure 3. Architecture diagram of the solution.
Solution walkthrough
The following is a high-level overview of the steps required to perform this walkthrough, which we present in more detail:
- Clone the GitHub repository.
- Deploy the infrastructure using the Amazon Web Services Cloud Development Kit (Amazon Web Services CDK).
- (Optional) Enable the Dask dashboard.
- Set up the Jupyter notebook on the SageMaker notebook instance.
- Use the Jupyter notebook to convert NetCDF files to Zarr.
The first four steps are mostly automated and should take roughly 45 minutes. Once that setup is complete, it should take 30 minutes to run the cells in the Jupyter notebook.
Prerequisites
Before starting this walkthrough, you should have the following prerequisites:
- An Amazon Web Services account
- Amazon Web Services Identity and Access Management (IAM) permissions to follow all of the steps in this guide
- Python3 and the Amazon Web Services CDK installed on the computer you will deploy the solution from
Deploying the solution
Step 1: Clone the GitHub repository
In a terminal, run the following code to clone the GitHub repository containing the solution.
git clone https://github.com/aws-samples/convert-netcdf-to-zarr.git
Change to the root directory of the repository.
cd convert-netcdf-to-zarr
Step 2: Deploy the infrastructure
From the root directory, run the following code to create and activate a virtual environment and install the required libraries. (If you are running from a Windows machine, you will need to activate the environment differently.)
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Then, run the code below to deploy the infrastructure.
cdk deploy
This step first builds the Docker image that will be used by the Dask cluster. When the Docker image is finished, the CDK creates a ConvertToZarr CloudFormation stack with the infrastructure described above. Enter “y” when asked if you wish to deploy. This process should take 20–30 minutes.
Step 3: (Optional) Enable the Dask dashboard
The Dask dashboard provides real-time visualizations and metrics for Dask distributed computing jobs. The dashboard is not required for the Jupyter notebook to work, but is helpful for visualizing jobs and troubleshooting. The Dask dashboard runs on a Dask-Scheduler task port within the private subnet, so we use an NLB to forward requests from the internet to the dashboard.
First, copy the private IP address of the Dask-Scheduler:
1. Sign in to the Amazon ECS console.
2. Choose the Dask-Cluster.
3. Select the Tasks tab.
4. Choose the task with the Dask-Scheduler task definition.
5. Copy the Private IP address of the task in the Configuration tab of the Configuration section.
Now, add the IP address as the target of the NLB.
1. Sign in to the Amazon EC2 console.
2. In the navigation pane, under Load Balancing, choose Target Groups.
3. Choose the target group that starts with Conver-dasks.
4. Select Register targets.
5. In Step 1, make sure that the Convert-To-Zarr VPC is selected. For Step 2, paste the private IP address of the Dask-Scheduler. The port stays as 8787. Select Include as pending below.
6. Scroll down to the bottom of the page. Select Register pending targets. In a few minutes, the target’s health status should change from Pending to Healthy, indicating that you can access the Dask dashboard from a web browser.
Finally, copy the DNS name of the NLB.
1. Sign in to the Amazon EC2 console.
2. In the navigation pane, under Load Balancing, choose Load Balancers.
3. Find the load balancer that begins with Conve-daskd and copy the DNS name.
4. Paste the DNS name into a web browser to load the (empty) Dask dashboard.
Step 4: Set up the Jupyter notebook
We need to clone the GitHub repository on the SageMaker instance and create the kernel for the Jupyter notebook.
1. Sign in to the SageMaker console.
2. In the navigation pane, under Notebook, choose Notebook instances.
3. Find the Convert-To-Zarr-Notebook and select Open Jupyter.
4. On the Jupyter home page, from the New menu at the top right, select Terminal.
5. In the terminal, run the following commands. First, change to the SageMaker directory:
cd SageMaker
Next, clone the repository:
git clone https://github.com/aws-samples/convert-netcdf-to-zarr.git
Change to the notebooks directory within the repository:
cd convert-netcdf-to-zarr/notebooks
6. Finally, create the Jupyter kernel as a new conda environment.
conda env create --name zarr_py310_nb -f environment.yml
This should take 10–15 minutes.
Step 5: Convert NetCDF dataset to Zarr
On the SageMaker instance, in the convert-netcdf-to-zarr/notebooks folder, open Convert-NetCDF-to-Zarr.ipynb. This Jupyter notebook contains all the steps required to convert hourly data from the NASA MERRA-2 dataset (available from the Registry of Open Data on Amazon Web Services) from NetCDF to Zarr, using both rechunker and xarray. Let’s look at some key lines of code.
The code below registers the notebook as a client of the Dask cluster.
DASK_SCHEDULER_URL = "Dask-Scheduler.local-dask:8786"
client = Client(DASK_SCHEDULER_URL)
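As an optional sanity check (not part of the original notebook), you can confirm that the client sees the Fargate workers before submitting any work:

# Wait until at least one Fargate worker has registered with the scheduler.
client.wait_for_workers(n_workers=1, timeout=120)
print(client)  # shows the scheduler address and current worker, core, and memory totals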
The notebook then builds a list of two months of daily MERRA-2 NetCDF files (nc_files_map) and opens this dataset with xarray, using the Dask cluster.
ds_nc = xr.open_mfdataset(nc_files_map, engine='h5netcdf', chunks={}, parallel=True)
The chunks parameter tells xarray to store the data in memory as Dask arrays, and parallel=True tells Dask to open the NetCDF files in parallel on the cluster.
The notebook focuses on converting data for one variable in the dataset, T2M (air temperature at 2 meters), from NetCDF to Zarr. The xarray output for T2M shows the NetCDF chunk sizes are (24, 361, 576).
Figure 4. Xarray output for T2M.
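If you want to confirm those chunk sizes programmatically rather than reading them off the xarray output, one way (an assumption, not code from the notebook) is to inspect the variable's encoding and its Dask array:

# On-disk NetCDF/HDF5 chunk sizes recorded by the h5netcdf backend.
print(ds_nc["T2M"].encoding.get("chunksizes"))  # e.g. (24, 361, 576)

# In-memory Dask chunk shape used by open_mfdataset.
print(ds_nc["T2M"].data.chunksize)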
Querying the NetCDF files for two months of T2M data for a given latitude and longitude takes 1–2 minutes. If you have enabled the Dask dashboard, you can watch the progress of the query on the Dask cluster.
Figure 5. Dask dashboard.
Next, before starting the Zarr conversion, the notebook creates a dictionary with new chunk sizes for the Zarr store (new_chunksizes_dict).
# returns ["time", "lat", "lon"]
dims = ds_nc.dims.keys()
new_chunksizes = (1080, 90, 90)
new_chunksizes_dict = dict(zip(dims, new_chunksizes))
Relative to NetCDF chunk sizes, the new Zarr chunk sizes (1080, 90, 90) are larger in the time dimension to decrease the number of chunks read during a time series query and smaller in the other two dimensions to keep the size and number of chunks created at appropriate levels.
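As a rough illustration of that trade-off (assuming 4-byte float32 values, which is typical for MERRA-2 variables), each (1080, 90, 90) chunk covers 45 days of hourly data over a 90-by-90-cell tile and comes to a few tens of megabytes, a comfortable size for parallel reads from Amazon S3:

import numpy as np

# Size of one Zarr chunk, assuming float32 values (4 bytes each).
chunk_bytes = 1080 * 90 * 90 * np.dtype("float32").itemsize
print(f"{chunk_bytes / 1e6:.0f} MB per chunk")  # ~35 MB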
The code block below shows the call to the rechunk function. The key parameters passed to the function are the NetCDF dataset to convert (ds_nc), the target chunk sizes (new_chunksizes_dict) for each variable to convert (var_name, which in this case equals T2M), and the S3 URI of the Zarr store that will be created (zarr_store).
ds_zarr = rechunk(
    ds_nc,
    target_chunks={
        var_name: new_chunksizes_dict,
        'time': None, 'lat': None, 'lon': None
    },
    max_mem='15GB',
    target_store=zarr_store,
    temp_store=zarr_temp
).execute()
The Jupyter notebook contains a more detailed discussion of both chunking strategy and parameters passed to rechunk. The conversion takes approximately five minutes.
Finally, the notebook code shows how to use xarray to append one week’s worth of additional data, one day at a time, to the Zarr store that was just created. This should take roughly 10 minutes. The file structure of the resulting Zarr store on Amazon S3 is shown below.
Figure 6. Zarr store file structure on Amazon S3.
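A minimal sketch of that append pattern (not the exact notebook code; next_day_files stands in for the list of NetCDF files for the next day) looks like the following:

# Open the next day of MERRA-2 NetCDF files lazily on the Dask cluster.
ds_day = xr.open_mfdataset(next_day_files, engine='h5netcdf',
                           chunks={}, parallel=True)

# Append the new T2M data to the existing Zarr store along the time dimension.
ds_day[["T2M"]].to_zarr(zarr_store, mode="a", append_dim="time")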
The result? Running the same query to plot two months of T2M Zarr data for the specified latitude and longitude takes less than a second, compared to roughly 1–2 minutes from a NetCDF dataset.
Cleaning up
When you’re finished, you can avoid unwanted charges and remove the Amazon Web Services resources created with this project by running the following code from the root directory of the repository:
cdk destroy
This will remove all resources created by the project except for the S3 bucket. To delete the bucket, sign in to the Amazon S3 console. Select the bucket and choose Empty. Once the bucket is empty, select the bucket again and choose Delete.
Conclusion
As geospatial datasets continue to grow in size, the performance benefits of using cloud-native data formats such as Zarr with object storage such as Amazon S3 increase as well. While converting terabytes or more of legacy data to Zarr can be a significant undertaking, this post provides step-by-step instructions to jumpstart the process and unlock the significant performance improvements available from using Zarr.
Additional resources
For more examples of processing array-based geospatial data on Amazon Web Services, see:
- Machine learning on distributed Dask using Amazon SageMaker and Amazon Web Services Fargate
- Analyze terabyte-scale geospatial datasets with Dask and Jupyter on Amazon Web Services
Finally, Geospatial ML with Amazon SageMaker (now in preview) allows customers to efficiently ingest, transform, and train a model on geospatial data, including satellite imagery and location data. It includes built-in map-based visualizations of model predictions, all available within an Amazon SageMaker Studio notebook.