
New Elastic Fabric Adapter (EFA) metrics for improved observability of Amazon Web Services China networking

Posted on: Sep 24, 2025

Today, we introduced five new Elastic Fabric Adapter (EFA) metrics to enhance network observability for AI/ML and High Performance Computing (HPC) workloads. These metrics help you diagnose performance issues by tracking retransmitted packets and bytes, connection timeout events, impaired remote connection events, and remote receiver unresponsive events.

With these new metrics, you can monitor for network congestion or instance configuration issues and take timely recovery actions to maintain application performance. The metrics are implemented as counters, one set per EFA device, that accumulate from instance launch or from the most recent driver reset. The counters are exposed through the sys filesystem (sysfs), so you can read them from the instance command line. For enhanced monitoring and alerting, you can export the counters with Prometheus scripts and use third-party tools such as Grafana to build dashboards and set alarms.

The new metrics are available on Nitro v4 (and later) instances and require EFA installer version 1.44.0 or higher. For a full list of metrics and to learn how to use them, please visit the Monitor an EFA user guide. For a comprehensive list of instances built on different Nitro system versions, please refer to the Amazon Nitro Systems documentation.
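As an illustration of reading the counters and preparing them for Prometheus, here is a minimal Python sketch. It rests on assumptions that this announcement does not confirm: that the counters appear as one plain-text file per metric under the standard Linux RDMA layout `/sys/class/infiniband/<device>/ports/<port>/hw_counters/`, and that each file holds a single integer. The function names, the `efa_` metric prefix, and the label names are hypothetical. Check the Monitor an EFA user guide for the exact path and counter names.

```python
from pathlib import Path

# Assumption: EFA devices are listed under the standard RDMA sysfs root.
SYSFS_ROOT = Path("/sys/class/infiniband")


def read_efa_counters(root=SYSFS_ROOT):
    """Return {(device, port): {counter_name: value}} for each hw_counters directory."""
    results = {}
    # Assumed layout: <root>/<device>/ports/<port>/hw_counters/<counter_file>
    for counter_dir in sorted(Path(root).glob("*/ports/*/hw_counters")):
        device = counter_dir.parts[-4]
        port = counter_dir.parts[-2]
        counters = {}
        for entry in sorted(counter_dir.iterdir()):
            try:
                counters[entry.name] = int(entry.read_text().strip())
            except (OSError, ValueError):
                # Skip entries that are unreadable or not a plain integer.
                continue
        results[(device, port)] = counters
    return results


def to_prometheus_text(results, prefix="efa_"):
    """Render the counters as Prometheus text exposition lines."""
    lines = []
    for (device, port), counters in sorted(results.items()):
        for name, value in sorted(counters.items()):
            labels = f'device="{device}",port="{port}"'
            lines.append(f"{prefix}{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Prints nothing on a machine that has no matching sysfs directories.
    print(to_prometheus_text(read_efa_counters()), end="")
```

Because the counters reset when the driver resets, a dashboard would normally chart their rate of change rather than the raw values. One option for collection is to run a script like this on a schedule and write its output to a file that a Prometheus exporter picks up.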

These metrics are available in the Amazon Web Services China (Beijing) Region, operated by Sinnet, and the Amazon Web Services China (Ningxia) Region, operated by NWCD. To learn more about EFA, please visit the EFA documentation.