We use machine learning technology to do auto-translation. Click "English" on top navigation bar to check Chinese version.
Improved resiliency with cluster manager task throttling for Amazon OpenSearch Service
We have introduced a throttling mechanism for cluster manager nodes to provide protection against a large number of pending tasks. It acts during task submission to the leader node. This feature is available for Amazon OpenSearch engine version 1.3 and above in Amazon OpenSearch Service.
Cluster manager task throttling
Cluster manager task throttling is a mechanism to protect the cluster manager against submission of too many resource-intensive cluster state update tasks from other nodes. For tasks like put-mapping
, data nodes have an existing throttling mechanism for cluster-state tasks that helps avoid overload of the cluster manager. For example, if the cluster manager can handle 10 K requests and the domain has 10 data nodes, then each data node gets a throttle at 1,000 put-mapping requests. If the domain grows to 100 data nodes, then each data node must throttle at 100 requests. To avoid having to modify these throttle limit whenever the cluster changes the number of data nodes and to support more task types, we ‘ve introduced throttling at cluster manager node for self-protection.
The cluster state update tasks are of different types ( create-index
, put-mapping
, and more) and this throttling mechanism rejects a task based on its type. For any incoming task, the cluster manager evaluates the total number of tasks of the same type in the pending task queue. If this number exceeds the threshold for a task type, then the cluster manager rejects the incoming task.
Amazon OpenSearch Service configures different throttling thresholds for different task types and throttling acts independently on each task type. Rejecting a particular task doesn’t affect other tasks of a different type. For example, if the cluster manager rejects a put-mapping
task, it can still accept a concurrent create-index
task.
All of the tasks generated by data plane APIs( _mapping/
, _setting/
and more) have been onboarded for throttling and are listed
When the cluster manager rejects a task, the data node performs retries with exponential back off to resubmit the task to the cluster manager until it is successfully submitted. If retries are unsuccessful within the timeout period, then Amazon OpenSearch returns a cluster timeout error.
Sample of error
Handling time out errors
The throttling exception is acted upon by data nodes; they perform the retries on throttled task. If API times out during throttling period, you’ll get process_cluster_event_timeout_exception
, which is a 503 error. This is the same error that was thrown earlier as well when tasks are timing out in the cluster manager node’s queue. You can retry the API calls with timeout errors.
This
Monitoring throttling
You can monitor the detailed throttling stats using the _nodes/stats
API.
You can use the cluster_manager_throttling section
in the _nodes/stats
response to track, which tasks are getting throttled and how many tasks has been throttled.
Sample response
Conclusion
In this post, we showed you how a throttling mechanism for task submission to the cluster manager node makes it more resilient to a high number of pending tasks in Amazon OpenSearch Service, where we have fine-tuned the thresholds per cluster.
Cluster manager throttling is available in Amazon
About the Authors
Dhwanil Patel is a Software Developer Engineer working on Amazon OpenSearch Service. He likes to contribute to open-source software development, and is passionate about distributed systems.
Shweta Thareja is a Principal Engineer working on Amazon OpenSearch Service. She is interested in building distributed and autonomous systems. She is a maintainer and an active contributor to OpenSearch.
Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the Amazon Web Services Cloud. Prior to joining Amazon Web Services, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.
The mentioned AWS GenAI Services service names relating to generative AI are only available or previewed in the Global Regions. Amazon Web Services China promotes AWS GenAI Services relating to generative AI solely for China-to-global business purposes and/or advanced technology introduction.