Introducing GPU health checks in Amazon Web Services ParallelCluster 3.6
GPU failures are relatively rare, but when they do occur they can have severe consequences for HPC and deep learning tasks. For example, they can disrupt long-running simulations and distributed training jobs. Amazon EC2 verifies GPU health before it launches an instance, but a GPU can still degrade or fail after the instance has launched and your job is underway.
With Amazon Web Services ParallelCluster 3.6, you can configure NVIDIA GPU health checks that run at the start of your Slurm jobs. If the health check fails, the job is re-queued on another instance. Meanwhile, the instance is marked as unavailable for new work and is de-provisioned once any other jobs running on it have completed. This helps increase the reliability of GPU-based workloads (NVIDIA-based ones, at least), and helps prevent unwanted spend resulting from unsuccessful jobs.
Using GPU Health Checks
To get started with GPU health checks, you’ll need ParallelCluster 3.6.0 or higher. You can follow the instructions in the ParallelCluster documentation to install this version or to upgrade an existing cluster.
By default, GPU health checks are off on new clusters. You can enable or disable them at the queue level as well as on individual compute resources. To do this, add a HealthChecks stanza to your Slurm queues and/or individual compute resources. If you set a value for HealthChecks:Gpu:Enabled at the compute resource level, it overrides the setting from the queue level.
Scheduling:
  SlurmQueues:
    - Name: <string>
      HealthChecks:
        Gpu:
          Enabled: <boolean>
      ComputeResources:
        - Name: <string>
          HealthChecks:
            Gpu:
              Enabled: <boolean>
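To make that concrete, here is a minimal sketch of a queue that enables health checks for every compute resource except a CPU-only one. The queue name, compute resource names, instance types, and counts are placeholders rather than values from this release; only the HealthChecks stanzas follow the schema above. (On an instance without a GPU the diagnostic is skipped anyway, as described below, so the override simply makes that explicit.)

Scheduling:
  SlurmQueues:
    - Name: gpu-queue
      HealthChecks:
        Gpu:
          Enabled: true                # queue-level default for all compute resources
      ComputeResources:
        - Name: p4d-24xlarge           # GPU instances: inherit the queue-level setting
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 4
        - Name: c6i-32xlarge           # CPU-only instances: override the queue-level setting
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 8
          HealthChecks:
            Gpu:
              Enabled: false

With this configuration, jobs that land on the p4d-24xlarge compute resource run the GPU diagnostic before they start, while the compute-resource-level setting takes precedence over the queue-level one for c6i-32xlarge.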
When GPU health checks are enabled on a queue or compute resource, a GPU diagnostic test runs on the allocated instance at the start of each Slurm job, before the job's own work begins. The test relies on NVIDIA DCGM and takes a few minutes to complete.
If the GPU diagnostic test succeeds, ParallelCluster logs a success message and the job continues to run.
If it fails, ParallelCluster logs an error message and begins the mitigation process. First, the job is rescheduled onto another instance. Then, the failing instance is drained so no more work is allocated to it. Once all jobs running on it have completed, it’s terminated. If there is no GPU on an instance, this gets logged and the diagnostic test is skipped.
You can find logs for GPU health checks on the individual compute instances at /var/log/parallelcluster/slurm_health_check.log. However, once an instance is decommissioned by Amazon Web Services ParallelCluster, you no longer have access to it to read this file.
Health check logs are also stored persistently in your cluster’s Amazon CloudWatch log group. To find them:
- Log group names are a combination of cluster name and a date stamp, so if you have more than one cluster with the same name, choose the one whose date stamp matches your intended cluster’s creation date.
- Under Log streams, you can find health check logs named after the instance they ran on. For example, the log stream ip-172-31-2-17.i-0dc17624a7862b835.slurm_health_check contains logs from an instance whose private IP was 172.31.2.17 and whose instance identifier was i-0dc17624a7862b835.
You can use the instance identifier in correspondence with Amazon support should you wish to report a GPU failure.
Details to be aware of
GPU health checks are a straightforward feature. Activate them, and ParallelCluster will transparently monitor for and attempt to mitigate failed GPUs. However, there are a few details to keep in mind as you start using them.
Cost
Validating GPU health when a job starts minimizes the amount of time a GPU instance runs with a degraded GPU. However, you will still incur usage charges for at least the 2-3 minutes it takes to run the health check, and for as long as it takes other jobs on the instance to complete. Also, logs from GPU health checks are sent to an instance-specific stream in your cluster’s Amazon CloudWatch log group. You may incur charges for this additional log data.
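If the extra log data is a concern, note that the cluster configuration's Monitoring section lets you control how long cluster logs are retained in CloudWatch. The snippet below is only a sketch; the 14-day retention period is an arbitrary example value, not a recommendation:

Monitoring:
  Logs:
    CloudWatch:
      Enabled: true           # send cluster logs, including health check logs, to CloudWatch
      RetentionInDays: 14     # adjust retention to balance cost against auditability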
Custom Prologs and Epilogs
If you explicitly set custom prologs or epilogs for your Slurm jobs, they may conflict with GPU health checks. This is explained in detail in the ParallelCluster documentation.
Custom AMIs
You can use GPU health checks with any ParallelCluster AMI from version 3.6.0 on, as well as derivative custom AMIs. GPU health checks rely on the presence of NVIDIA DCGM. If it can’t be found, a log message will be generated and the job will continue to run. Therefore, when you are testing a new custom AMI, we recommend you inspect some health check logs on a GPU-based instance to ensure that health checks can run as expected.
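If you do build a custom AMI, attaching it to a queue alongside health checks looks roughly like the sketch below. The queue name and AMI ID are placeholders; the point is only that the Image and HealthChecks settings sit side by side at the queue level:

Scheduling:
  SlurmQueues:
    - Name: gpu-queue
      Image:
        CustomAmi: ami-0123456789abcdef0   # placeholder for your custom, DCGM-equipped AMI
      HealthChecks:
        Gpu:
          Enabled: true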
Conclusion
With Amazon Web Services ParallelCluster 3.6.0, you can configure your cluster to detect and recover from GPU failures on any NVIDIA-based GPU instances. This can help minimize unwanted costs and lost time with GPU-intensive workloads. To use this new feature, update your ParallelCluster installation and then add HealthChecks settings to your cluster configuration.
Try out GPU health checks and let us know how we can improve them, or any other aspect of Amazon Web Services ParallelCluster. If they make your life quantifiably better, you can tell us about that, too. You can reach us on Twitter.