We frequently upgrade our Amazon ElastiCache fleet, with patches and upgrades being applied to instances seamlessly. However, from time to time we need to relaunch your ElastiCache nodes to apply mandatory OS updates to the underlying host. These replacements are required to apply upgrades that strengthen security, reliability, and operational performance.
You also have the option to manage these replacements yourself at any time prior to the scheduled maintenance window. When you manage a replacement yourself, your instance will receive the OS update when you relaunch the node and your scheduled maintenance window will be cancelled.
Q: How long does a node replacement take?
A replacement typically completes within a few minutes. The replacement may take longer with certain instance configurations and traffic patterns. For example, a Redis primary node may be low on free memory while also experiencing high write traffic. When an empty replica syncs from this primary, the primary may run out of memory trying to handle the incoming writes while also syncing the replica. In that case, the primary disconnects the replica and restarts the sync process. It may take multiple attempts for the replica to sync successfully. It is also possible that the replica never syncs if the incoming write traffic remains high.
Memcached nodes do not need to sync during replacement and are always replaced quickly, irrespective of node size.
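The failure mode described here can be illustrated with a deliberately simplified model. This is a sketch only: the function name, the parameters, and the assumption that writes arriving during a sync must fit in the primary's free memory are illustrative, and do not describe how ElastiCache or Redis actually account for memory.

```python
def sync_can_complete(dataset_gb, sync_rate_gb_per_min,
                      write_rate_gb_per_min, free_memory_gb):
    """Toy model (hypothetical): can the primary buffer all writes
    that arrive while a full sync to an empty replica is running?"""
    # Time needed to ship the whole dataset to the replica.
    sync_minutes = dataset_gb / sync_rate_gb_per_min
    # Writes accumulated on the primary during that time.
    buffered_gb = write_rate_gb_per_min * sync_minutes
    # If the buffered writes exceed free memory, the sync is
    # abandoned and restarted in this simplified model.
    return buffered_gb <= free_memory_gb


# Low write traffic: the sync fits in free memory.
print(sync_can_complete(10, 2, 0.1, 1))
# High write traffic: the same node would keep restarting the sync.
print(sync_can_complete(10, 2, 0.5, 1))
```

In this model, lowering write traffic or increasing free memory are the only ways to make a failing sync succeed, which matches the recommendations given elsewhere in this FAQ.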
Q: How does a node replacement impact my application?
For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. For single node Redis clusters, ElastiCache dynamically spins up a replica, replicates the data, and then fails over to it. For replication groups consisting of multiple nodes, ElastiCache replaces the existing replicas and syncs data from the primary to the new replicas. If Multi-AZ or Cluster Mode is enabled, replacing the primary triggers a failover to a read replica. If Multi-AZ is disabled, ElastiCache replaces the primary and then syncs the data from a read replica. The primary will be unavailable during this time.
For Memcached nodes, the replacement process brings up an empty new node and terminates the current node. The new node will be unavailable for a short period during the switch. Once switched, your application may see performance degradation while the empty new node is populated with cache data.
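The cases above can be summarized as a small decision sketch for what happens when a primary (or only) node is replaced. The class, field names, and return strings are hypothetical and only restate the behavior described in this answer; they are not an ElastiCache API.

```python
from dataclasses import dataclass


@dataclass
class ClusterConfig:
    """Hypothetical description of a cluster, for illustration only."""
    engine: str              # "redis" or "memcached"
    node_count: int = 1
    multi_az: bool = False
    cluster_mode: bool = False


def primary_replacement_behavior(cfg):
    """Return a short description of the replacement path."""
    if cfg.engine == "memcached":
        # Memcached: an empty node replaces the current one.
        return "empty new node; cache must be repopulated"
    if cfg.node_count == 1:
        # Single-node Redis: a replica is created on the fly.
        return "temporary replica created, data replicated, failover"
    if cfg.multi_az or cfg.cluster_mode:
        return "failover to a read replica"
    # Multi-node Redis without Multi-AZ or Cluster Mode.
    return "primary replaced, then synced from a read replica"


print(primary_replacement_behavior(ClusterConfig("redis", 3, multi_az=True)))
```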
Q: What best practices should I follow to ensure a smooth replacement experience and minimize data loss?
For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. We try to replace just enough nodes from the same cluster at a time to keep the cluster stable. You can provision primary and read replicas in different availability zones. In this case, when a node is replaced, the data will be synced from a peer node in a different availability zone. For single node Redis clusters, we recommend that sufficient memory is available to Redis, as described here. For Redis replication groups with multiple nodes, we also recommend scheduling the replacement during a period with low incoming write traffic.
For Memcached nodes, schedule your maintenance window during a period with low incoming write traffic, test your application for failover, and use the ElastiCache-provided "smarter" client. You cannot avoid data loss, because Memcached holds data purely in memory.
Q: How do I manage node replacements on my own?
We recommend that you allow ElastiCache to manage your node replacements for you during your scheduled maintenance window. You can specify your preferred time for replacements via the weekly maintenance window when you create an ElastiCache cluster. To change your maintenance window to a more convenient time later, use the ModifyCacheCluster API or click Modify in the ElastiCache Management Console.
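As a minimal sketch of changing the window through the ModifyCacheCluster API, the following uses the boto3 `modify_cache_cluster` call with its `PreferredMaintenanceWindow` parameter. The cluster ID, the example window, and the helper names are assumptions for illustration; the window string follows the `ddd:hh24:mi-ddd:hh24:mi` (UTC) format, and the local check below validates only the shape of the string, not its minimum duration.

```python
import re

DAY = r"(mon|tue|wed|thu|fri|sat|sun)"
TIME = r"([01]\d|2[0-3]):[0-5]\d"
WINDOW_RE = re.compile(rf"^{DAY}:{TIME}-{DAY}:{TIME}$")


def build_window_request(cluster_id, window):
    """Build the keyword arguments for modify_cache_cluster."""
    if not WINDOW_RE.match(window):
        raise ValueError(f"unexpected maintenance window format: {window!r}")
    return {
        "CacheClusterId": cluster_id,
        "PreferredMaintenanceWindow": window,
    }


def change_maintenance_window(cluster_id, window):
    """Send the request; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above runs without it

    client = boto3.client("elasticache")
    return client.modify_cache_cluster(**build_window_request(cluster_id, window))


# "my-cache-cluster" is a placeholder cluster ID.
print(build_window_request("my-cache-cluster", "sun:05:00-sun:09:00"))
```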
If you choose to manage the replacement yourself, you can take various actions depending on your use case and cluster configuration:
- Change the Maintenance Window.
- Re-launch your Redis instance using Backup & Restore process.
- If your Redis cluster configuration is Cluster Mode Disabled:
  - Replace a read-replica (Cluster-Mode Disabled) – A procedure to manually replace a read-replica in a Redis replication group.
  - Replace the primary node (Cluster-Mode Disabled) – A procedure to manually replace the primary node in a Redis replication group.
  - Replace a standalone node (Cluster-Mode Disabled) – Two different procedures to replace a standalone Redis node.
- If your Redis cluster configuration is Cluster Mode Enabled:
  - Replace a node in cluster with one or more shards – You can either use backup and restore or scale-out followed by a scale-in to replace the nodes.
For detailed instructions on all of these options, see the Actions You Can Take When a Node is Scheduled for Replacement page.
For Memcached, you can just delete and re-create the clusters. Post replacement, your instance should no longer have a scheduled event associated with it.
Q: Why are you doing these node replacements?
These replacements are needed to apply mandatory software updates to your underlying host. The updates help strengthen our security, reliability, and operational performance.
Q: Do these replacements affect my nodes in Multiple Availability Zones at the same time?
We may replace multiple nodes from the same cluster, depending on the cluster configuration, while maintaining cluster stability. For Redis sharded clusters, we try not to replace multiple nodes in the same shard at a time. In addition, we try not to replace a majority of the primary nodes in the cluster across all the shards.
For non-sharded clusters, we will attempt to stagger node replacements over the maintenance window as much as possible to continue maintaining cluster stability.
Q: Can the nodes in different clusters from different regions be replaced at the same time?
Yes, it is possible that these nodes will be replaced at the same time, if your maintenance window for these clusters is configured to be the same.
Q: What are the performance benefits of Amazon ElastiCache for Redis?
ElastiCache for Redis provides enhanced I/O threads that deliver significant improvements to throughput and latency at scale through multiplexing, presentation layer offloading, and more. Enhanced I/O threads improve performance by leveraging more cores for processing I/O and dynamically adjusting to the workload. ElastiCache for Redis improves the throughput of TLS-enabled clusters by offloading encryption to the same enhanced I/O threads. ElastiCache for Redis version 7.0 introduced enhanced I/O multiplexing, which combines many client requests into a single channel and improves the efficiency of the main Redis thread. In ElastiCache for Redis version 7.1 and above, we extended the enhanced I/O threads functionality to also handle the presentation layer logic (see blog). Enhanced I/O threads not only read client input but also parse it into a Redis binary command format, which is then forwarded to the main thread for execution, providing performance gains. With ElastiCache for Redis version 7.1, you can achieve up to 100% more throughput and 50% lower P99 latency, compared to ElastiCache for Redis version 7.0. On r7g.4xlarge or larger, you can achieve over 1 million requests per second (RPS) per node.
Q: How do I monitor Redis CPU utilization?
Amazon ElastiCache provides two metrics to measure CPU utilization for Amazon ElastiCache for Redis workloads – EngineCPUUtilization and CPUUtilization. The CPUUtilization metric measures the CPU utilization for the instance (node), and the EngineCPUUtilization metric measures the utilization at the Redis process level. You need the EngineCPUUtilization metric in addition to the CPUUtilization metric as the main Redis process is single threaded and uses just one CPU of the multiple CPU cores available on an instance. Therefore, the CPUUtilization metric does not provide precise visibility into the CPU utilization rates at the Redis process level.
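A short worked example shows why the instance-level metric can hide a saturated Redis process. It is a simplification that assumes the main Redis thread is the only significant CPU consumer on the node (it ignores I/O threads and other processes), and the function name is illustrative.

```python
def host_cpu_share(engine_core_busy_pct, vcpu_count):
    """Approximate instance-level CPU share of one single-threaded
    process that keeps a single core busy at the given percentage."""
    return engine_core_busy_pct / vcpu_count


# A main Redis thread that fully saturates one core of a 4-vCPU node
# would account for only about a quarter of the instance's CPU.
print(host_cpu_share(100, 4))  # 25.0
```

Under these assumptions, a node can look lightly loaded at the instance level while the Redis process itself has no headroom left.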
We recommend that you use the CPUUtilization and EngineCPUUtilization metrics together to get a detailed understanding of CPU utilization for your Redis clusters. Both metrics are available in all Amazon Web Services regions, and you can access them using CloudWatch or via the Amazon Web Services Management Console.
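As a sketch of reading both metrics from CloudWatch, the following builds one `get_metric_statistics` request per metric in the `AWS/ElastiCache` namespace and sends them with boto3. The cluster ID, the one-hour lookback, the 60-second period, and the helper names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

CPU_METRICS = ("CPUUtilization", "EngineCPUUtilization")


def build_cpu_metric_requests(cluster_id, end_time, minutes=60, period=60):
    """Build get_metric_statistics arguments for both CPU metrics."""
    start_time = end_time - timedelta(minutes=minutes)
    return [
        {
            "Namespace": "AWS/ElastiCache",
            "MetricName": name,
            "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
            "StartTime": start_time,
            "EndTime": end_time,
            "Period": period,          # seconds per datapoint
            "Statistics": ["Average"],
        }
        for name in CPU_METRICS
    ]


def fetch_cpu_metrics(cluster_id):
    """Query CloudWatch; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without it

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    return {
        req["MetricName"]: cloudwatch.get_metric_statistics(**req)["Datapoints"]
        for req in build_cpu_metric_requests(cluster_id, now)
    }
```

Calling `fetch_cpu_metrics("my-cache-cluster")` (a placeholder ID) would return the datapoints for each metric keyed by metric name, so the two series can be compared side by side.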
Additionally, we’d recommend that you visit this page to learn about useful metrics for performance monitoring.