Amazon Web Services ParallelCluster 3.3.0 now supports On-Demand Capacity Reservations

On-Demand Capacity Reservations (ODCRs) enable you to reserve compute capacity for your Amazon EC2 instances in a specific Availability Zone for any duration. When you use them with Amazon Web Services ParallelCluster, they help ensure that your HPC workloads have enough resources to complete successfully and on time.

It has long been possible for Amazon Web Services ParallelCluster to make use of ODCRs via manual setup. However, with Amazon Web Services ParallelCluster 3.3.0, you can now add and modify ODCRs for your HPC cluster directly within your Amazon Web Services ParallelCluster configuration.

This post explains what ODCRs are, how this new feature works, and how to configure your HPC cluster to use them.

What are On-Demand Capacity Reservations?

On-Demand Capacity Reservations (ODCRs) let you reserve EC2 compute instance capacity. You can create capacity reservations whenever you want, without entering into an extended commitment with Amazon Web Services. If you have a Savings Plan, you may combine it with your ODCR to reserve instance capacity while taking advantage of cost savings. The capacity becomes available and billing starts as soon as the Capacity Reservation is provisioned in your account. When you no longer need it, cancel the Capacity Reservation to release the capacity and stop incurring charges. You can manually modify or cancel capacity reservations at any time. You can also schedule them to end automatically at a future time.

There are two kinds of On-Demand Capacity Reservations: open ODCRs and targeted ODCRs.

Under an open ODCR, you do not have to provide a reservation identifier when an Amazon EC2 instance is launched. Instead, instances launched after the start of the reservation that match the reservation by instance type, platform, and Availability Zone are automatically allocated to the ODCR.

Using targeted ODCRs requires you to provide either the ODCR identifier or Resource Group ARN at instance launch time. When the reservation expires, no more instance launches can take place. This can be a good way to meter utilization while you also ensure sufficient capacity.

What’s New?

Amazon Web Services ParallelCluster has historically supported open ODCRs so long as the instance types it launched matched the reservation attributes. Since ParallelCluster 3.1.1, you have been able to use targeted ODCRs by editing a file on the cluster head node to add certain EC2 API parameters.

Amazon Web Services ParallelCluster 3.3.0 improves this experience. Now, you configure ODCRs directly in the Amazon Web Services ParallelCluster configuration file. You specify a combination of capacity reservation identifier, capacity reservation resource group ARN, and cluster placement group for each compute resource in your Slurm queues (Figure 1). Importantly, you can add capacity reservations to your queues, or remove them, dynamically and without disrupting cluster operations.

If you’ve used Cluster Placement Groups with ParallelCluster before, you may recognize the presence of a Networking stanza in the Compute Resource configuration. This is because in ParallelCluster 3.3.0, each Compute Resource can have its own Cluster Placement Group. This feature is described in greater detail in the ParallelCluster documentation, as it has applications beyond just its relevance to ODCRs.

Figure 1: ParallelCluster configurations now support capacity reservations in combination with networking placement groups.

How to use ODCRs with Amazon Web Services ParallelCluster 3.3.0

To use targeted ODCRs with ParallelCluster, you will need to create a new cluster. Install ParallelCluster 3.3.0, or upgrade to it, by following this online guide. Create a new cluster configuration file (or update an existing one) as described below, then use that configuration to create a new cluster.

Configuring ODCR

You can configure an ODCR for any compute resource in any Slurm queue. The configuration details will vary depending on whether you are using a single instance type per Slurm queue Compute Resource or multiple instance types.

In the case of single instance types, you have two options. You can create a targeted ODCR for the desired instance type and add its reservation ID directly to a compute resource as CapacityReservationId. Or, you can create a capacity reservation group containing your reservation and add it to a compute resource as the CapacityReservationResourceGroupArn.

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: q1
      ComputeResources:
        - Name: cr1
          InstanceType: c6a.48xlarge
          MinCount: 1
          MaxCount: 8
          CapacityReservationTarget:
            CapacityReservationId: cr-01234567890abdef0
            # OR, instead of CapacityReservationId (set only one of the two):
            # CapacityReservationResourceGroupArn: arn:aws:resource-groups:us-east-1:123456791537:group/MyCRGroup

Note that you must use the Amazon Web Services Command Line Interface (Amazon Web Services CLI) when creating capacity reservation groups for use with Amazon Web Services ParallelCluster, rather than the Amazon Web Services Management Console. This is because the console only supports creation of Tag- and Stack-based resource groups, and these are not supported by ParallelCluster.

In the case of multiple instance types per Compute Resource, you can only use a CapacityReservationResourceGroupArn. Create a capacity reservation group containing a capacity reservation for each InstanceType in your list of Instances. Then, specify the group’s ARN in your cluster configuration like this:

…
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: q2
      ComputeResources:
        - Name: cr1
          Instances: 
            - InstanceType: c6a.24xlarge
            - InstanceType: r6a.24xlarge
            - InstanceType: m6a.24xlarge
          MinCount: 1
          MaxCount: 8
          CapacityReservationTarget:
            CapacityReservationResourceGroupArn: arn:aws:resource-groups:us-east-1:123456791537:group/MyCRGroup
...

Using Cluster Placement Groups

A cluster placement group (CPG) is a logical grouping of instances within a single Availability Zone. Cluster placement groups offer low network latency and high network throughput, which helps increase the performance of tightly coupled HPC workloads. You can use them with ODCRs by creating a Cluster Placement Group ODCR (CPG ODCR).

To create a Cluster Placement Group ODCR

1. Create a Cluster Placement Group (if you do not have one already).
2. Create a targeted ODCR, specifying the Cluster Placement Group name when you do so.
3. Create a capacity reservation resource group to hold the ODCR. Add either your reservation identifier or the group ARN to the cluster configuration. Then, add the networking placement group name as shown in this example:

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: q1
      ComputeResources:
        - Name: cr1
          InstanceType: c6a.48xlarge
          MinCount: 1
          MaxCount: 8
          CapacityReservationTarget:
            CapacityReservationId: cr-01234567890abdef0
          Networking:
            PlacementGroup:
              Name: my-placement-group

Updating Your Cluster

You can dynamically update or modify capacity reservation configurations on your cluster once it is running Amazon Web Services ParallelCluster 3.3.0. By default, you will need to stop your compute fleet, update the cluster, then restart the fleet to make changes. To avoid stopping the fleet, you can instead change the Slurm QueueUpdateStrategy to either DRAIN or TERMINATE, as we discussed in our earlier blog post about flexible instance types.
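As a rough illustration, the fragment below sets QueueUpdateStrategy under the SlurmSettings section of the configuration. It is a minimal sketch, not a complete cluster configuration: the queue name, compute resource name, and reservation ID are the placeholder values used in the earlier examples, and DRAIN is chosen only for illustration.

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    # DRAIN: nodes in queues whose settings changed stop accepting new jobs,
    #        and running jobs are allowed to finish.
    # TERMINATE: jobs in queues whose settings changed are terminated.
    # If the setting is omitted, the compute fleet must be stopped to update.
    QueueUpdateStrategy: DRAIN
  SlurmQueues:
    - Name: q1
      ComputeResources:
        - Name: cr1
          InstanceType: c6a.48xlarge
          MinCount: 1
          MaxCount: 8
          CapacityReservationTarget:
            # Placeholder ID; replace with your own reservation.
            CapacityReservationId: cr-01234567890abdef0
```

With a strategy like this in place, you would edit the CapacityReservationTarget values in the file and apply the change with pcluster update-cluster, passing your own cluster name and configuration file.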

Troubleshooting ODCRs

Once you have created an ODCR, it’s straightforward to get it working with Amazon Web Services ParallelCluster 3.3.0. However, your request to create a capacity reservation can, itself, fail. It’s worth looking at three reasons this can happen.

Your request can fail when there isn’t currently enough Amazon EC2 capacity for the requested instance type in the desired Availability Zone. You can address this by waiting until later, switching Availability Zones, or changing instance types.
It can fail when the requested number of instances exceeds your service quota. You can resolve this by requesting an increase in your service quota for On-Demand instances. Ensure that the quota is high enough to accommodate your capacity reservation and any other concurrent instances you need to run.
Your request can fail due to its Cluster Placement Group. The capacity reservation and the placement group have to be in the same Availability Zone. Furthermore, you can only create capacity reservations for instance types that support Cluster Placement Groups. Most instance types are supported, but you can learn about the exceptions in the Cluster Placement Groups documentation.

Summary

In Amazon Web Services ParallelCluster 3.3.0, it’s easier than ever to use On-Demand Capacity Reservations (ODCRs) to reserve exactly the amount of Amazon EC2 instance capacity you need to complete your HPC workloads. Open capacity reservations are designed to work by default, while targeted reservations and reservations with placement groups use a new configuration mechanism. You’ll need to update your Amazon Web Services ParallelCluster installation and update your clusters to take advantage of this new capability.

We’d love to know what you think after trying out ODCRs with Amazon Web Services ParallelCluster, and how we can improve this new feature. Reach out to us on Twitter at @TechHPC with your feedback and ideas.