Amazon ElastiCache FAQs

We frequently upgrade our Amazon ElastiCache fleet, with patches and upgrades being applied to instances seamlessly. However, from time to time we need to relaunch your ElastiCache nodes to apply mandatory OS updates to the underlying host. These replacements are required to apply upgrades that strengthen security, reliability, and operational performance.

You also have the option to manage these replacements yourself at any time prior to the scheduled maintenance window. When you manage a replacement yourself, your instance will receive the OS update when you relaunch the node and your scheduled maintenance window will be cancelled.

Q: How long does a node replacement take?

A replacement typically completes within a few minutes. The replacement may take longer with certain instance configurations and traffic patterns. For example, a Redis primary node may not have enough free memory and may be experiencing high write traffic. When an empty replica syncs from this primary, the primary node may run out of memory trying to serve the incoming writes while also syncing the replica. In that case, the primary disconnects the replica and restarts the sync process. It may take multiple attempts for the replica to sync successfully, and the replica may never sync if the incoming write traffic remains high.

Memcached nodes do not need to sync data during replacement and are replaced quickly regardless of node size.

Q: How does a node replacement impact my application?

For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. For single node Redis clusters, ElastiCache dynamically spins up a replica, replicates the data, and then fails over to it. For replication groups consisting of multiple nodes, ElastiCache replaces the existing replicas and syncs data from the primary to the new replicas. If Multi-AZ or Cluster Mode is enabled, replacing the primary triggers a failover to a read replica. If Multi-AZ is disabled, ElastiCache replaces the primary and then syncs the data from a read replica. The primary will be unavailable during this time.
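If you want to rehearse how your application handles a primary failover before a scheduled replacement, a minimal boto3 sketch using the TestFailover API might look like the following. The replication group ID and node group (shard) ID are placeholder assumptions, and TestFailover requires automatic failover to be enabled on the group.

```python
import boto3

# Placeholders: replace with your own replication group and node group (shard) ID.
REPLICATION_GROUP_ID = "my-redis-group"
NODE_GROUP_ID = "0001"

elasticache = boto3.client("elasticache", region_name="us-east-1")

# TestFailover promotes a replica in the given node group to primary, letting you
# observe how your application behaves during a failover similar to the one
# triggered when a primary node is replaced.
response = elasticache.test_failover(
    ReplicationGroupId=REPLICATION_GROUP_ID,
    NodeGroupId=NODE_GROUP_ID,
)
print(response["ReplicationGroup"]["Status"])
```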

For Memcached nodes, the replacement process brings up an empty new node and terminates the current node. The new node will be unavailable for a short period during the switch. Once switched, your application may see performance degradation while the empty new node is populated with cache data.

Q: What best practices should I follow for a smooth replacement experience and minimize data loss?

For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. We try to replace just enough nodes from the same cluster at a time to keep the cluster stable. You can provision primary and read replicas in different availability zones. In this case, when a node is replaced, the data will be synced from a peer node in a different availability zone. For single node Redis clusters, we recommend that sufficient memory is available to Redis, as described here. For Redis replication groups with multiple nodes, we also recommend scheduling the replacement during a period with low incoming write traffic.
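As a hedged illustration of the cross-AZ recommendation above, the following boto3 sketch creates a Redis replication group whose primary and replicas are placed in different Availability Zones with Multi-AZ enabled. The group name, node type, and zone names are placeholder assumptions, not prescribed values.

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

# Placeholder names and sizes; adjust to your own environment.
elasticache.create_replication_group(
    ReplicationGroupId="my-redis-group",
    ReplicationGroupDescription="Primary plus replicas spread across AZs",
    Engine="redis",
    CacheNodeType="cache.r6g.large",
    NumCacheClusters=3,                # one primary + two read replicas
    PreferredCacheClusterAZs=[         # place each node in a different AZ
        "us-east-1a",
        "us-east-1b",
        "us-east-1c",
    ],
    AutomaticFailoverEnabled=True,
    MultiAZEnabled=True,
)
```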

For Memcached nodes, schedule your maintenance window during a period with low incoming write traffic, test your application for failover, and use the ElastiCache provided "smarter" client. You cannot avoid data loss, as Memcached stores data purely in memory.
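The "smarter" client relies on the cluster's Memcached configuration endpoint (Auto Discovery) to track node membership as nodes are replaced. The sketch below is a minimal, hedged illustration of querying that endpoint directly; the endpoint name is a placeholder, and in practice you would normally use an ElastiCache cluster client that performs this discovery for you.

```python
import socket

# Placeholder configuration endpoint; replace with your cluster's own endpoint.
CONFIG_ENDPOINT = ("mycluster.xxxxxx.cfg.use1.cache.amazonaws.com", 11211)

def get_cluster_nodes() -> str:
    """Query the Memcached configuration endpoint for the current node list
    (the Auto Discovery mechanism used by the ElastiCache cluster clients)."""
    with socket.create_connection(CONFIG_ENDPOINT, timeout=5) as sock:
        sock.sendall(b"config get cluster\r\n")
        response = b""
        while b"END\r\n" not in response:
            chunk = sock.recv(4096)
            if not chunk:
                break
            response += chunk
    # The reply contains a config version followed by host|ip|port entries,
    # which a cluster-aware client uses to follow node replacements.
    return response.decode()

if __name__ == "__main__":
    print(get_cluster_nodes())
```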

Q: How do I manage node replacements on my own?

We recommend that you allow ElastiCache to manage your node replacements for you during your scheduled maintenance window. You can specify your preferred time for replacements via the weekly maintenance window when you create an ElastiCache cluster. To change your maintenance window to a more convenient time later, you can use the ModifyCacheCluster API or choose Modify in the ElastiCache Management Console.
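For example, a minimal boto3 sketch of changing the weekly maintenance window with the ModifyCacheCluster API might look like the following; the cluster identifier and window are placeholders.

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

# Placeholder cluster ID; the window format is ddd:hh24:mi-ddd:hh24:mi (UTC).
elasticache.modify_cache_cluster(
    CacheClusterId="my-cache-cluster",
    PreferredMaintenanceWindow="sun:05:00-sun:07:00",
)
```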

If you choose to manage the replacement yourself, you can take various actions depending on your use case and cluster configuration. For instructions on these options, see the Actions You Can Take When a Node is Scheduled for Replacement page.

For Memcached, you can simply delete and re-create the cluster, as sketched below. After the replacement, your cluster should no longer have a scheduled event associated with it.
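A rough boto3 sketch of the delete-and-recreate option for Memcached, assuming placeholder names and sizes:

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")
CLUSTER_ID = "my-memcached-cluster"  # placeholder

# Delete the existing cluster (all cached data is lost: Memcached is in-memory only).
elasticache.delete_cache_cluster(CacheClusterId=CLUSTER_ID)
elasticache.get_waiter("cache_cluster_deleted").wait(CacheClusterId=CLUSTER_ID)

# Re-create the cluster with the same configuration; the new nodes start empty
# and pick up the pending host updates when they launch.
elasticache.create_cache_cluster(
    CacheClusterId=CLUSTER_ID,
    Engine="memcached",
    CacheNodeType="cache.r6g.large",
    NumCacheNodes=2,
)
```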

Q: Why are you doing these node replacements?

These replacements are needed to apply mandatory software updates to your underlying host. The updates help strengthen our security, reliability, and operational performance.

Q: Do these replacements affect my nodes in Multiple Availability Zones at the same time?

We may replace multiple nodes from the same cluster depending on the cluster configuration while maintaining cluster stability. For Redis sharded clusters, we try not to replace multiple nodes in the same shard at a time. In addition, we try not to replace a majority of the primary nodes in the cluster across all the shards.

For non-sharded clusters, we will attempt to stagger node replacements over the maintenance window as much as possible to continue maintaining cluster stability.

Q: Can the nodes in different clusters from different regions be replaced at the same time?

Yes, it is possible that these nodes will be replaced at the same time if the maintenance windows for these clusters are configured to be the same.

Q: How does ElastiCache performance compare to open source Redis?

Amazon ElastiCache for Redis adds dynamic network processing to enhance I/O handling in Redis versions 5.0.3 and above. By utilizing the extra CPU power available in nodes with four or more vCPUs, ElastiCache transparently delivers up to an 83% increase in throughput and up to a 47% reduction in latency per node.

With ElastiCache for Redis 7 and above, we introduced Enhanced IO Multiplexing, which delivers additional improvements to throughput and latency at scale. Enhanced IO Multiplexing is ideal for throughput-bound workloads with multiple client connections, and its benefits scale with the level of workload concurrency. As an example, when using an r6g.xlarge node and running 5200 concurrent clients, you can achieve up to 72% increased throughput (read and write operations per second) and up to 71% decreased P99 latency, compared with ElastiCache for Redis 6. For these types of workloads, a node's network IO processing can become a limiting factor in the ability to scale. With Enhanced IO Multiplexing, each dedicated network IO thread pipelines commands from multiple clients into the Redis engine, taking advantage of Redis' ability to efficiently process commands in batches.

For more information see the documentation.

Q: How do I monitor Redis CPU utilization?

Amazon ElastiCache provides two metrics to measure CPU utilization for Amazon ElastiCache for Redis workloads – EngineCPUUtilization and CPUUtilization. The CPUUtilization metric measures the CPU utilization for the instance (node), and the EngineCPUUtilization metric measures the utilization at the Redis process level. You need the EngineCPUUtilization metric in addition to the CPUUtilization metric because the main Redis process is single-threaded and uses just one of the multiple CPU cores available on an instance. Therefore, the CPUUtilization metric does not provide precise visibility into the CPU utilization rates at the Redis process level.

We recommend that you use both the CPUUtilization and EngineCPUUtilization metrics together to get a detailed understanding of CPU utilization for your Redis clusters. Both metrics are available in all Amazon Web Services regions, and you can access these metrics using Amazon CloudWatch or via the Amazon Web Services Management Console.
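As one hedged illustration of pulling both metrics with boto3 and CloudWatch, assuming placeholder cluster and node identifiers:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder identifiers; CacheClusterId and CacheNodeId are the standard
# dimensions for per-node ElastiCache metrics.
dimensions = [
    {"Name": "CacheClusterId", "Value": "my-redis-group-001"},
    {"Name": "CacheNodeId", "Value": "0001"},
]

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

for metric in ("CPUUtilization", "EngineCPUUtilization"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=300,                        # 5-minute datapoints
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], point["Average"], point["Maximum"])
```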

Additionally, we’d recommend that you visit this page to learn about useful metrics for performance monitoring.