We frequently upgrade our Amazon ElastiCache fleet, applying patches and upgrades to instances seamlessly. However, from time to time we need to relaunch your ElastiCache nodes to apply mandatory OS updates to the underlying host. These replacements are required to apply upgrades that strengthen security, reliability, and operational performance.
You can also manage these replacements yourself at any time before the scheduled maintenance window. When you manage a replacement yourself, your instance receives the OS update when you relaunch the node, and the scheduled replacement is cancelled.
Q: How long does a node replacement take?
A replacement typically completes within a few minutes, but it may take longer with certain instance configurations and traffic patterns. For example, a Redis primary node may have little free memory while also sustaining high write traffic. When an empty replica syncs from this primary, the primary may run out of memory trying to serve the incoming writes while also syncing the replica. In that case, the primary disconnects the replica and restarts the sync process. It may take multiple attempts for the replica to sync successfully, and the replica may never sync if the incoming write traffic remains high.
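As a rough illustration of the memory pressure described above, you can sanity-check whether a primary has enough headroom to absorb a full sync. This is an illustrative estimate under stated assumptions, not an official ElastiCache formula, and all names and numbers below are hypothetical:

```python
def can_absorb_full_sync(maxmemory_bytes, used_memory_bytes,
                         write_rate_bytes_per_sec, sync_duration_sec):
    """Rough headroom check for a Redis primary before a full sync.

    During a full sync the primary buffers incoming writes for the
    replica while continuing to serve traffic, so it needs free memory
    for the writes that accumulate over the sync. Illustrative only,
    not an official ElastiCache formula.
    """
    buffered_writes = write_rate_bytes_per_sec * sync_duration_sec
    free_memory = maxmemory_bytes - used_memory_bytes
    return free_memory >= buffered_writes

# Example: 13 GiB node, 10 GiB used, 5 MiB/s of writes, ~10-minute sync
GiB, MiB = 1024**3, 1024**2
print(can_absorb_full_sync(13 * GiB, 10 * GiB, 5 * MiB, 600))  # True
print(can_absorb_full_sync(13 * GiB, 12 * GiB, 5 * MiB, 600))  # False
```

If the check fails, lowering the write rate during the window or moving to a larger node type gives the sync room to complete.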
Memcached nodes do not need to sync during replacement and are replaced quickly regardless of node size.
Q: How does a node replacement impact my application?
For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. For single node Redis clusters, ElastiCache dynamically spins up a replica, replicates the data, and then fails over to it. For replication groups consisting of multiple nodes, ElastiCache replaces the existing replicas and syncs data from the primary to the new replicas. If Multi-AZ or Cluster Mode is enabled, replacing the primary triggers a failover to a read replica. If Multi-AZ is disabled, ElastiCache replaces the primary and then syncs the data from a read replica. The primary will be unavailable during this time.
For Memcached nodes, the replacement process brings up an empty new node and terminates the current node. The new node will be unavailable for a short period during the switch. Once switched, your application may see performance degradation while the empty new node is populated with cache data.
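Applications can ride out the brief unavailability during a switch with simple retry logic. The sketch below is a generic retry-with-backoff wrapper; the function names, parameters, and the stand-in cache call are illustrative, not part of any ElastiCache client:

```python
import time

def with_retries(operation, attempts=5, base_delay=0.1):
    """Retry a cache operation with exponential backoff.

    Useful for riding out the short window in which a node is
    unreachable while ElastiCache switches to its replacement.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Example with a flaky stand-in for a real cache call:
calls = {"n": 0}
def flaky_get():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("node unavailable during replacement")
    return "cached-value"

print(with_retries(flaky_get))  # cached-value, after two transient failures
```

Production Redis and Memcached client libraries typically offer equivalent reconnect and retry settings, which are preferable to hand-rolled wrappers.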
Q: What best practices should I follow for a smooth replacement experience and minimize data loss?
For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. We try to replace only as many nodes from the same cluster at a time as the cluster can tolerate while remaining stable. You can provision the primary and read replicas in different Availability Zones; in that case, when a node is replaced, its data is synced from a peer node in a different Availability Zone. For single-node Redis clusters, we recommend that sufficient memory is available to Redis, as described here. For Redis replication groups with multiple nodes, we also recommend scheduling the replacement during a period of low incoming write traffic.
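One common way to keep memory available for background work such as replication syncs and backups is the ElastiCache Redis `reserved-memory-percent` parameter. The helper below simply shows the arithmetic; 25 is a commonly recommended value, but verify the default for your engine version before relying on it:

```python
def usable_maxmemory(node_memory_bytes, reserved_memory_percent=25):
    """Memory left for data when a slice is set aside for background
    operations such as replication syncs and backups.

    reserved-memory-percent is a real ElastiCache Redis parameter;
    25 is a commonly recommended value, but check your parameter
    group and engine version for the actual default.
    """
    return int(node_memory_bytes * (100 - reserved_memory_percent) / 100)

GiB = 1024**3
print(usable_maxmemory(13 * GiB) / GiB)  # 9.75
```

Sizing your dataset against this reduced figure, rather than the full node memory, leaves the headroom a full sync needs.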
For Memcached nodes, schedule your maintenance window during a period of low incoming write traffic, test your application for failover, and use the ElastiCache-provided "smarter" client. Data loss cannot be avoided, because Memcached stores data purely in memory.
Q: How do I manage node replacements on my own?
We recommend that you allow ElastiCache to manage node replacements for you during your scheduled maintenance window. You can specify your preferred time for replacements via the weekly maintenance window when you create an ElastiCache cluster. To change your maintenance window to a more convenient time later, use the ModifyCacheCluster API or click Modify in the ElastiCache Management Console.
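Maintenance windows use the format `ddd:hh24:mi-ddd:hh24:mi` in UTC (for example, `sun:23:00-mon:01:30`). A quick client-side format check before calling the ModifyCacheCluster API can catch typos early; the validator below is a sketch, and the cluster ID in the commented boto3 call is illustrative:

```python
import re

# AWS maintenance windows: ddd:hh24:mi-ddd:hh24:mi, in UTC.
WINDOW_RE = re.compile(
    r"^(mon|tue|wed|thu|fri|sat|sun):([01]\d|2[0-3]):[0-5]\d"
    r"-(mon|tue|wed|thu|fri|sat|sun):([01]\d|2[0-3]):[0-5]\d$"
)

def is_valid_window(window):
    """Check the shape of a preferred-maintenance-window string."""
    return bool(WINDOW_RE.match(window))

print(is_valid_window("sun:23:00-mon:01:30"))  # True
print(is_valid_window("sunday:23:00-01:30"))   # False

# With boto3 (requires AWS credentials; cluster id is illustrative):
# boto3.client("elasticache").modify_cache_cluster(
#     CacheClusterId="my-cluster",
#     PreferredMaintenanceWindow="sun:23:00-mon:01:30",
# )
```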
If you choose to manage the replacement yourself, you can take various actions depending on your use case and cluster configuration:
- Change the maintenance window.
- Re-launch your Redis instance using the Backup & Restore process.
- If your Redis cluster configuration is Cluster Mode Disabled:
  - Replace a read-replica (Cluster-Mode Disabled) – a procedure to manually replace a read replica in a Redis replication group.
  - Replace the primary node (Cluster-Mode Disabled) – a procedure to manually replace the primary node in a Redis replication group.
  - Replace a standalone node (Cluster-Mode Disabled) – two different procedures to replace a standalone Redis node.
- If your Redis cluster configuration is Cluster Mode Enabled:
  - Replace a node in a cluster with one or more shards – you can use either backup and restore, or a scale-out followed by a scale-in, to replace the nodes.
For instructions on all of these options, see the Actions You Can Take When a Node is Scheduled for Replacement page.
For Memcached, you can simply delete and re-create the clusters. After the replacement, your instance should no longer have a scheduled event associated with it.
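To confirm whether a node still has a pending replacement, you can scan events of the shape returned by the DescribeEvents API. The response shape below (`Events` → `SourceIdentifier`/`Message`) matches boto3's `elasticache` `describe_events`, but the message strings are illustrative, not verbatim AWS text:

```python
def nodes_scheduled_for_replacement(events):
    """Pick out nodes with a pending replacement from a
    DescribeEvents-style response.

    The dict shape matches boto3 elasticache describe_events output;
    the sample messages are illustrative, not verbatim AWS text.
    """
    return [
        e["SourceIdentifier"]
        for e in events.get("Events", [])
        if "replacement" in e.get("Message", "").lower()
    ]

# In practice: events = boto3.client("elasticache").describe_events()
sample = {
    "Events": [
        {"SourceIdentifier": "my-cluster-001",
         "SourceType": "cache-node",
         "Message": "Node scheduled for replacement"},
        {"SourceIdentifier": "my-cluster-002",
         "SourceType": "cache-node",
         "Message": "Cache node created"},
    ]
}
print(nodes_scheduled_for_replacement(sample))  # ['my-cluster-001']
```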
Q: Why are you doing these node replacements?
These replacements are needed to apply mandatory software updates to your underlying host. The updates help strengthen our security, reliability, and operational performance.
Q: Do these replacements affect my nodes in Multiple Availability Zones at the same time?
We may replace multiple nodes from the same cluster, depending on the cluster configuration, while maintaining cluster stability. For Redis sharded clusters, we try not to replace multiple nodes in the same shard at the same time. In addition, we try not to replace a majority of the primary nodes across all the shards in the cluster.
For non-sharded clusters, we attempt to stagger node replacements over the maintenance window as much as possible to maintain cluster stability.
Q: Can the nodes in different clusters from different regions be replaced at the same time?
Yes. Nodes in different clusters, including clusters in different regions, may be replaced at the same time if the maintenance windows for those clusters are configured to be the same.