Life360’s journey to a multi-cluster Amazon EKS architecture to improve resiliency
This post was coauthored by Jesse Gonzalez, Sr. Staff Site Reliability Engineer, and Naveen Puvvula, Sr. Eng Manager, Reliability Engineering, at Life360
Introduction
Life360 is on a mission to simplify safety so families can live fully. In 2022 alone, Life360 called emergency dispatchers over 34,000 times and received close to 2.2 million in-app SOS alerts from members. Life360 Crash Detection, covering 223 billion miles driven, has helped save members’ lives across the world, and 26 billion Safe arrival notifications have added to the safety net for families. Any interruption to our service directly impacts a family member’s ability to dispatch an ambulance in an emergency. The 24/7 availability of our Application Programming Interface (API) is essential, requiring increasingly stable and scalable infrastructure.
Solution architecture
In our current architecture, we follow Amazon Web Services High Availability best practices by spreading our workloads across multiple availability zones (AZs) and autoscaling the compute capacity with Amazon EC2 Auto Scaling.
At a high level, we run our workloads on a combination of Amazon EC2 and Amazon EKS.
Our goal is to build a statically stable architecture that continues serving requests even during an AZ-wide failure.
With our workloads in Amazon EC2, we had better control over our traffic routing and could keep transactions from crossing an AZ boundary by using Consul and preferential weighting in HAProxy. Today, in our Amazon EC2 workloads, an optimal transaction stays within a single AZ once it enters our environment. In our multi-AZ Amazon EKS clusters, where we don’t use HAProxy or Consul, this guarantee breaks down: a Kubernetes Service can route a request to a pod in any AZ, so transactions routinely cross AZ boundaries.
We chose to adopt a cell-based, multi-cluster Amazon EKS architecture to improve resiliency. By restricting each cluster to a single AZ, we also avoid inter-AZ data transfer costs. In this post, we share our multi-cluster Amazon EKS architecture and the operational improvements we made to address Amazon EKS scaling and workload management, giving us a statically stable infrastructure that is resilient to AZ-wide failures.
Life360’s future state: Multi-cluster Amazon EKS architecture
To ensure a statically stable design, we chose to implement bulkhead architectures (also known as cell-based architectures) to restrict the effect of a failure to a single AZ. Let’s discuss bulkhead architectures in more detail.
Bulkhead architecture
The bulkhead pattern gets its name from a naval engineering practice in which a ship’s hull is divided into isolated internal chambers, so that if an object breaches the hull, water can’t spread through the entire ship and sink it. See Figure 1 for a visual illustration.
Figure 1: Bulkhead pattern
Like the bulkheads on a ship, this pattern ensures that a failure is contained to a small subset of requests or clients so that the number of impaired requests is limited, and most can continue without error. Bulkheads for data are often called partitions, while bulkheads for services are known as cells.
In a cell-based architecture, each cell is a complete, independent instance of the service and has a fixed maximum size. As load increases, workloads grow by adding more cells. Any failure is contained to the single cell it occurs in, so that the number of impaired requests is limited as other cells continue without error.
Life360’s cell-based architecture
We applied the bulkhead design principle when evaluating how we would design the next iteration of our Amazon EKS environment. Our design moved us away from a multi-AZ Amazon EKS cluster solution to a single-AZ Amazon EKS cluster solution, with multiple Amazon EKS clusters per Region, each within a single AZ. This effectively turns each Amazon EKS cluster into a cell in the bulkhead pattern. We call these individual single-AZ Amazon EKS clusters cells, and the aggregation of cells in each Region a supercell.
Figure 2: Transition from Amazon EKS multi-AZ to supercell architecture
Figure 3: Availability Zone bounded cell-based multi-cluster Amazon EKS
As you can see in Figure 3, the full stack of services is available in each cell, and once a request is forwarded to a cell it stays in that cell. This reduces our inter-AZ data transfer costs and provides AZ failure resiliency, since it’s now relatively simple for us to disable traffic at the Border for a cell that is having an issue. As we roll this out, we’ll initially have both cell-based Amazon EKS clusters and our legacy multi-AZ Amazon EKS cluster, as shown in Figure 2.
Figure 4: Supercell Architecture
In the future, with our supercell architecture, we can have multiple supercells within a single Region, as shown in Figure 4. This allows us to route traffic away from a cell that is down for maintenance, or to slowly roll out a new feature cell by cell. We’ll continue this model as we move into multi-Region deployments.
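One common way to shift or drain traffic per cell is weighted DNS at the edge. The CloudFormation snippet below is a minimal, hypothetical sketch of that idea; it is not our actual edge configuration, and the zone and record names are placeholders. Each cell gets its own weighted record, and draining a cell is as simple as setting its weight to 0.

```yaml
# Hypothetical sketch: one weighted Route 53 record per cell.
# Lowering a cell's Weight to 0 drains traffic from that cell (for example,
# during an AZ issue, maintenance, or a slow feature rollout).
Resources:
  ApiCellUse1Az2:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.              # placeholder hosted zone
      Name: api.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: env-use1-az2               # one record set per cell
      Weight: 100                               # set to 0 to disable this cell
      ResourceRecords:
        - ingress.env-use1-az2.example.com.     # placeholder cell ingress endpoint
```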
For Kubernetes cluster-level autoscaling, we were using the Cluster Autoscaler; however, we ran into limitations when using multiple Amazon EC2 instance types and sizes. As we migrated to this cell-based architecture, we identified opportunities for operational improvements to address Amazon EKS scaling using Karpenter and workload management using ArgoCD.
Improved Amazon EKS scaling with Karpenter
For our Amazon EKS workloads, we moved from the Cluster Autoscaler to Karpenter for node provisioning and autoscaling.
Karpenter can efficiently address the full range of instance types and capacities available through Amazon Web Services. With Karpenter, you define Provisioners, made available to Kubernetes as a Custom Resource Definition (CRD). These resources are mutable, which gives our operations teams a better experience when they need to modify instance types, sizes, and more. Karpenter also improves our node utilization by bin packing pending pods onto the most efficient Amazon EC2 instance types, resulting in significant cost optimization.
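As an illustration, here is a minimal sketch of what a cell’s Provisioner might look like; the names and values are hypothetical rather than our production configuration. Note how the zone requirement pins every node for the cell to a single AZ, in line with the cell-based design.

```yaml
# Hypothetical Karpenter Provisioner (karpenter.sh/v1alpha5 API) for a single-AZ cell.
# All values are illustrative only.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: topology.kubernetes.io/zone          # keep every node in the cell's AZ
      operator: In
      values: ["us-east-1b"]
    - key: karpenter.k8s.aws/instance-category  # let Karpenter pick across families and sizes
      operator: In
      values: ["c", "m", "r"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  limits:
    resources:
      cpu: "1000"                               # cap the cell's total provisioned vCPU
  consolidation:
    enabled: true                               # bin pack pods and remove underutilized nodes
  providerRef:
    name: default                               # references an AWSNodeTemplate
```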
Workload management in Amazon EKS
During the design phase of the project, we identified the need to improve how we manage applications within Amazon EKS. For some time, we had been using yaml manifests to define Kubernetes resources, and relied on a combination of Terraform and custom Jenkins pipelines to manage the deployment of resources to our Amazon EKS clusters. Terraform was focused on core infrastructure add-ons: observability applications for log and metric aggregation, controllers that manage Amazon Web Services resources such as load balancers and Route53 entries, and other middleware-type workloads shared by a large portion of Life360 services. The Jenkins pipelines with raw yaml were primarily used for Life360-developed services. While both solutions were functional, they didn’t scale with the cell-based design, so we decided to consolidate on a single solution that could manage both our core infrastructure add-ons and our Life360-developed workloads.
We chose ArgoCD to help manage our workloads in Amazon EKS. We wanted our engineers to have an easy-to-use interface to manage their workloads across multiple clusters. This became increasingly important as we moved from a workload running in a single Amazon EKS cluster to a workload running in multiple Amazon EKS clusters, each isolated to a single AZ.
There have been a few iterations of our ArgoCD git repository, but we have settled on a solution that’s working extremely well for us. Our current solution makes heavy use of the ArgoCD ApplicationSet controller, driven by a Helm chart whose values describe our cells.
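The snippet below is a simplified, illustrative reconstruction of that kind of values file; the field names follow the description in the next paragraphs, but the exact layout of our repository may differ.

```yaml
# Illustrative sketch of the ApplicationSet chart values (simplified).
defaults:
  name: custom-app                                      # combined with the cell name to form the Application name
  repoUrl: https://github.com/life360/custom-app.git    # placeholder repository URL
  path: charts/custom-app                               # relative path of the helm chart being deployed
  namespace: custom-app

clusterCells:
  - cluster: env-use1-az2                               # one entry per single-AZ Amazon EKS cluster (cell)
  - cluster: env-use1-az4
  - cluster: env-use1-az6
```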
We defined the defaults object above to be used by the ApplicationSet chart template. Some of these values are passed in as helm global values, which allows us, for example, to use the value global.clusterName in a pod annotation in the custom-app chart template if desired. Next, we have our clusterCells. We iterate over this object in our chart template to create an ArgoCD Application for each cell, in support of our cell-based architecture.
The other fields are used to define the Application configuration that’s generated by the ApplicationSet template. The field name is used in combination with the clusterCell.cluster to generate a unique name for the ArgoCD Application. In this example, ArgoCD would have a total of three Applications generated by the ApplicationSet (i.e., env-use1-az2-custom-app, env-use1-az4-custom-app, and env-use1-az6-custom-app). The fields repoUrl and path are used by the ArgoCD Application to define the source repository and relative path of the helm chart it manages.
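As a rough illustration of how those values come together, the template below sketches an ApplicationSet that iterates over clusterCells with a list generator. It is a simplified stand-in for our actual chart template, and assumes each cell is registered in ArgoCD under its cluster name.

```yaml
# Simplified sketch of the ApplicationSet chart template (templates/applicationset.yaml).
# Helm renders the list of cells; ArgoCD then generates one Application per cell,
# named <cluster>-<name> (for example, env-use1-az2-custom-app).
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: {{ .Values.defaults.name }}
spec:
  generators:
    - list:
        elements:
        {{- range .Values.clusterCells }}
          - cluster: {{ .cluster }}
        {{- end }}
  template:
    metadata:
      name: '{{`{{cluster}}`}}-{{ .Values.defaults.name }}'
    spec:
      project: default
      source:
        repoURL: {{ .Values.defaults.repoUrl }}
        path: {{ .Values.defaults.path }}
        helm:
          parameters:
            - name: global.clusterName        # exposed to the custom-app chart as a global value
              value: '{{`{{cluster}}`}}'
      destination:
        name: '{{`{{cluster}}`}}'             # the cell's cluster as registered in ArgoCD
        namespace: {{ .Values.defaults.namespace }}
```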
Conclusion
In this post, we showed you how we improved our resiliency by implementing a multi-cluster Amazon EKS architecture. Our new approach adopted a bulkhead architecture that used cells to ensure a failure in one cell didn’t affect others. This approach allowed Life360 to reduce inter-AZ data transfer costs, improve EKS scaling and workload management, and create a statically stable infrastructure for AZ-wide failures. With the supercell architecture, Life360 provided uninterrupted service and continued to simplify safety for families globally.