We use machine learning technology to do auto-translation. Click "English" on top navigation bar to check Chinese version.
Improved resiliency with backpressure and admission control for Amazon OpenSearch Service
We are now excited to introduce Search Backpressure and CPU-based admission control for OpenSearch Service, which further enhances the resiliency of clusters. These improvements are available for all OpenSearch versions 1.3 or higher.
Search Backpressure
Backpressure prevents a system from being overwhelmed with work. It does so by controlling the traffic rate or by shedding excessive load in order to prevent crashes and data loss, improve performance, and avoid total failure of the system.
Search Backpressure is a mechanism to identify and cancel in-flight resource-intensive search requests when a node is under duress. It’s effective against search workloads with anomalously high resource usage (such as complex queries, slow queries, many hits, or heavy aggregations), which could otherwise cause node crashes and impact the cluster’s health.
Search Backpressure is built on top of the task resource tracking framework, which provides an easy-to-use API to monitor each task’s resource usage. Search Backpressure uses a background thread that periodically measures the node’s resource usage and assigns a cancellation score to each in-flight search task based on factors like CPU time, heap allocations, and elapsed time. A higher cancellation score corresponds to a more resource-intensive search request. Search requests are cancelled in descending order of their cancellation score to recover nodes quickly, but the number of cancellations is rate-limited to avoid wasteful work.
The following diagram illustrates the Search Backpressure workflow.
Search requests return an HTTP 429 “Too Many Requests” status code upon cancellation. OpenSearch returns partial results if only some shards fail and partial results are allowed. See the following code:
Monitoring Search Backpressure
You can monitor the detailed Search Backpressure state using the node stats API:
You can also view the cluster-wide summary of cancellations using
- SearchTaskCancelled – The number of coordinator node cancellations
- SearchShardTaskCancelled – The number of data node cancellations
The following screenshot shows an example of tracking these metrics on the CloudWatch console.
CPU-based admission control
Admission control is a gatekeeping mechanism that proactively limits the number of requests to a node based on its current capacity, both for organic increases and spikes in traffic.
In addition to the JVM memory pressure and request size thresholds, it now also monitors each node’s rolling average CPU usage to reject incoming _search
and _bulk
requests. It prevents nodes from being overwhelmed with too many requests leading to hot spots, performance problems, request timeouts, and other cascading failures. Excessive requests return an HTTP 429 “Too Many Requests” status code upon rejection.
Handling HTTP 429 errors
You’ll receive HTTP 429 errors if you send excessive traffic to a node. It indicates either insufficient cluster resources, resource-intensive search requests, or an unintended spike in the workload.
Search Backpressure provides the reason for rejection, which can help fine-tune resource-intensive search requests. For traffic spikes, we recommend client-side retries with exponential backoff and jitter.
You can also follow these troubleshooting guides to debug excessive rejections:
-
How do I resolve search or write rejections in OpenSearch Service? -
How do I troubleshoot search latency spikes in my Amazon OpenSearch Service cluster?
Conclusion
Search Backpressure is a reactive mechanism to shed excessive load, while admission control is a proactive mechanism to limit the number of requests to a node beyond its capacity. Both work in tandem to improve the overall resiliency of an OpenSearch cluster.
Search Backpressure is available in
About the authors
Ketan Verma is a Senior SDE working on Amazon OpenSearch Service. He is passionate about building large-scale distributed systems, improving performance, and simplifying complex ideas with simple abstractions. Outside work, he likes to read and improve his home barista skills.
Suresh N S is a Senior SDE working on Amazon OpenSearch Service. He is passionate towards solving problems in large scale distributed systems.
Pritkumar Ladani is an SDE-2 working on Amazon OpenSearch Service. He likes to contribute to open source software development, and is passionate about distributed systems. He is an amateur badminton player and enjoys trekking.
Bukhtawar Khan is a Principal Engineer working on Amazon OpenSearch Service. He is interested in building distributed and autonomous systems. He is a maintainer and an active contributor to OpenSearch.
The mentioned AWS GenAI Services service names relating to generative AI are only available or previewed in the Global Regions. Amazon Web Services China promotes AWS GenAI Services relating to generative AI solely for China-to-global business purposes and/or advanced technology introduction.