Posted On: Apr 1, 2024

Amazon CloudWatch Container Insights with Enhanced Observability for EKS now auto-discovers critical health and performance metrics from your NVIDIA GPUs and presents them in automatic dashboards, enabling faster problem isolation and troubleshooting for your AI/ML workloads. Container Insights with Enhanced Observability provides out-of-the-box trends and patterns on your infrastructure health and removes the overhead of setting up dashboards and alarms manually, saving you time and effort.

Using Container Insights with Enhanced Observability, you can now easily see whether the GPUs and memory on your accelerated instances are healthy and ensure that your training jobs remain performant. You can pinpoint errors and quickly drill down to identify the root cause, minimizing disruptions to your training jobs. Container Insights with Enhanced Observability for EKS delivers accelerated compute observability in curated visualizations, so you can monitor how efficiently your distributed training models consume resources and optimize your allocations accordingly.

Getting started with NVIDIA GPU observability is easy. You can onboard NVIDIA GPU observability on Container Insights with Enhanced Observability by installing the CloudWatch Observability add-on into your clusters, either through the EKS console or via programmatic access. Once configured, you can navigate to the Container Insights console and view your NVIDIA GPU metrics out of the box.
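
As an illustration of the programmatic route, the following is a minimal sketch using the AWS SDK for Python (boto3) and the EKS CreateAddon API. It rests on several assumptions not stated in this announcement: that the add-on identifier is `amazon-cloudwatch-observability`, that your nodes already have the IAM permissions the CloudWatch agent needs, and that the helper names `build_addon_request` and `install_addon` are hypothetical names chosen for this example.

```python
from typing import Dict

# Assumed identifier of the CloudWatch Observability EKS add-on;
# verify it against the Container Insights user guide before use.
ADDON_NAME = "amazon-cloudwatch-observability"


def build_addon_request(cluster_name: str) -> Dict[str, str]:
    """Build the keyword arguments for the EKS CreateAddon call."""
    if not cluster_name:
        raise ValueError("cluster_name must be non-empty")
    return {"clusterName": cluster_name, "addonName": ADDON_NAME}


def install_addon(cluster_name: str, region: str) -> str:
    """Request installation of the add-on and return its reported status."""
    # Imported lazily so the request builder can be used without the SDK.
    import boto3

    eks = boto3.client("eks", region_name=region)
    response = eks.create_addon(**build_addon_request(cluster_name))
    return response["addon"]["status"]
```

With credentials configured, a call such as `install_addon("my-gpu-cluster", "cn-north-1")` would request the add-on for a hypothetical cluster named `my-gpu-cluster` in the Beijing Region (`cn-northwest-1` for Ningxia). Keeping the request builder separate from the API call makes the parameters easy to inspect before anything is changed in the cluster.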

NVIDIA GPU metrics are now available in Container Insights with Enhanced Observability for EKS in the Amazon Web Services China (Beijing) Region, operated by Sinnet, and the Amazon Web Services China (Ningxia) Region, operated by NWCD. NVIDIA GPU metrics follow observation-based pricing; see the Container Insights pricing page for details. For further information, see the Container Insights user guide.