How CoStar uses Karpenter to optimize their Amazon EKS Resources
Introduction
CoStar is well known as a market leader for Commercial Real Estate data, but they also run major home, rental, and apartment websites.
Challenge
CoStar’s biggest challenge has always been to collect data from hundreds of sources, enrich that data with important insights, and deliver it in a meaningful and user-friendly system. CoStar Suite’s Commercial Real Estate, Apartments, and Homes products all have different data sources that update at different times and with different volumes of data. The systems that support this data ingestion and these updates must be fast, accurate, and able to scale up and down to remain affordable. Many of these systems are being migrated from legacy data centers into CoStar’s Amazon Web Services environment, so running them on parallel and interoperable systems was necessary to avoid massive duplication of engineering effort. These needs all pointed to running Kubernetes both on premises and in Amazon Web Services, with container clusters that could scale with increases and decreases in usage. After months of successful testing and production, CoStar decided to optimize their engineering stack even further, while still maintaining as much parallel on-premises Kubernetes management as possible.
In the Kubernetes cluster architecture, the control plane and its components are responsible for managing cluster operations (i.e., scheduling containers, managing application availability, and storing cluster data, among other key tasks), while worker nodes host the pods that run containerized application workloads. Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service that takes care of the availability and scalability of the control plane, leaving teams to provision and manage the worker nodes their workloads run on.
The default method to provision worker nodes is to leverage Amazon EKS-managed node groups, which automate the provisioning and lifecycle management of the underlying Amazon EC2 instances using Amazon EC2 Auto Scaling groups. For dynamic adjustment of Amazon EC2 instances, the Amazon EKS-managed node group functionality can be paired with the Cluster Autoscaler solution. This autoscaling solution watches for pending pods waiting for compute capacity and for underutilized worker nodes. When pods are pending due to insufficient resources, the Cluster Autoscaler increases the desired number of instances in the Amazon EC2 Auto Scaling group, which provisions new worker nodes and allows those pods to be scheduled. Conversely, it terminates nodes that remain underutilized or empty for a configurable period, provided their pods can be rescheduled onto the remaining nodes.
For CoStar’s workloads running on Amazon EKS, the goal was to maximize availability and performance while using resources efficiently. While the Cluster Autoscaler solution provides a degree of dynamic compute provisioning and cost-efficiency, it comes with considerations and limitations that can make it challenging or even restrictive to use. Namely, the Amazon EC2 instance types for a given node group must have similar CPU, memory, and GPU specifications to minimize undesired behavior, because Cluster Autoscaler uses the first instance type specified in the node group policy to simulate scheduling of pods. If the policy includes additional instance types with higher specs, node resources may be wasted after scaling out, since pods are scheduled based only on the size of the first instance type. If the policy includes additional instance types with lower specs, pods may fail to schedule on those nodes due to resource constraints. To diversify instance sizes to accommodate CoStar’s varied pod resource requirements, they needed to create multiple node groups, each containing similarly specified instance types, as the sketch below illustrates. Furthermore, the Cluster Autoscaler only deprovisions underutilized nodes; it doesn’t replace them with cheaper instance types in response to changes in the workloads. Additionally, for CoStar’s stateless workloads, preferring and targeting Spot capacity for deeper discounts over On-Demand was cumbersome to implement with node groups.
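To illustrate that constraint, an eksctl managed node group definition typically pins instance types with near-identical specifications; the cluster name, region, and instance types below are hypothetical, not CoStar's configuration:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster   # hypothetical cluster name
  region: us-east-1
managedNodeGroups:
  - name: general-purpose
    # All three types share 4 vCPU / 16 GiB, so Cluster Autoscaler's
    # scheduling simulation (based on the first type) stays accurate.
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m5n.xlarge"]
    minSize: 2
    maxSize: 10
    desiredCapacity: 2
```

Supporting a genuinely diverse set of pod sizes this way means repeating node groups like this one for every instance-size tier.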
Solution overview
Why Karpenter
CoStar needed a more efficient means of provisioning nodes for their diverse workload demands without the overhead of managing multiple node groups. This was addressed using the open-source Karpenter project, a flexible, high-performance Kubernetes node provisioning and autoscaling solution that launches right-sized compute in response to changing application load.
The following diagram illustrates how Karpenter observes the aggregate resource requests of unscheduled pods, makes decisions to launch new nodes, and terminates them to reduce infrastructure costs:
To achieve cost-effectiveness for CoStar’s stateless workloads and lower environments, the CoStar team configured the Karpenter Provisioner to prefer Spot capacity and only provision On-Demand capacity if no Spot capacity is available. Karpenter uses the price-capacity-optimized allocation strategy for Spot capacity, which balances cost and lowers the probability of interruptions in the near term. For stateful workloads in production clusters, the Karpenter Provisioner defines a selection of compute and storage optimized instance families running On-Demand, most of which is covered by Compute Savings Plans and Reserved Instances to obtain discounts. For further optimization, CoStar enabled the consolidation capability, which allows Karpenter to actively reduce cluster costs by monitoring the utilization of nodes and checking whether existing workloads can run on other nodes or be replaced with cheaper variants. By evaluating multiple factors, such as the number of running pods, configured node expiry times, use of lower priority pods, and existing Pod Disruption Budgets (PDBs), the consolidation actions are performed in a manner to minimize workload disruptions.
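A minimal sketch of such a Spot-preferring Provisioner, using the v1alpha5 API current at the time; the Provisioner name and the referenced AWSNodeTemplate are assumptions rather than CoStar's actual configuration:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: stateless   # hypothetical name
spec:
  requirements:
    # With both capacity types allowed, Karpenter favors Spot (using the
    # price-capacity-optimized allocation strategy) and falls back to
    # On-Demand when no Spot capacity is available.
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  # Consolidation lets Karpenter remove underutilized nodes or replace
  # them with cheaper variants.
  consolidation:
    enabled: true
  providerRef:
    name: default   # assumed AWSNodeTemplate name
```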
Prerequisites
To carry out the example in this post, you’ll need to set up the following:
- Provision a Kubernetes cluster in Amazon Web Services
- Install Karpenter for cluster autoscaling
- Install the eks-node-viewer tool to visualize compute usage within the cluster
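The eks-node-viewer tool is published on GitHub under awslabs; with a Go toolchain available, one way to install it is:

```bash
# Installs the eks-node-viewer CLI into $GOPATH/bin (or $HOME/go/bin).
go install github.com/awslabs/eks-node-viewer/cmd/eks-node-viewer@latest
```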
Walkthrough
In this section, we provide a simple demonstration of the replace mechanism, which is part of Karpenter’s consolidation capability. The Karpenter Provisioner and node template configuration constrain provisioning to a set of compute-optimized instance types with the On-Demand capacity type.
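A representative pair of manifests might look like the following; the instance families, discovery tags, and resource names are illustrative assumptions:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Constrain provisioning to compute-optimized families, On-Demand only.
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: ["c5", "c5a", "c6a", "c6i"]   # assumed set of families
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  # Enable the consolidation capability demonstrated below.
  consolidation:
    enabled: true
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  # Placeholder discovery tags; match them to your cluster's subnets
  # and security groups.
  subnetSelector:
    karpenter.sh/discovery: demo-cluster
  securityGroupSelector:
    karpenter.sh/discovery: demo-cluster
```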
Here is the application deployment manifest we use to demonstrate the consolidation behavior.
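A minimal sketch along these lines works; the name, replica count, and CPU request are assumptions, sized so that the initial 50 replicas require a larger compute-optimized node:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate   # hypothetical name
spec:
  replicas: 50
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          # The pause image consumes almost no real CPU; the request below
          # drives scheduling decisions without generating load.
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 500m
```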
Using the eks-node-viewer tool, we can see the node that Karpenter provisions to schedule these pods, along with its capacity, utilization, and hourly price.
Let’s assume the traffic loads are lower during off-business hours. You may use the kubectl scale command to reduce the deployment to 30 replicas and simulate this drop in demand.
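Assuming the deployment name from the sketch above:

```bash
# Scale the demo workload down to simulate off-hours traffic.
kubectl scale deployment/inflate --replicas=30
```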
With 30 replicas, the overall resource requirements are much lower. Karpenter provisions a replacement node with a cheaper variant (in this case, a c6a.4xlarge) and cordons the original node, as shown in the following screenshot.
After the pods are rescheduled on the replacement node, the previous node is terminated.
As you can see from our example, Karpenter gives CoStar the ability to scale efficiently and optimize for cost, provisioning the cheaper c6a.4xlarge as resource requirements decreased during off-business hours.
Cleaning up
To avoid incurring additional operational costs, remember to destroy all the infrastructure you created for the examples in this post.
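Assuming the illustrative names used above, removing the demo resources looks like the following; deleting the cluster itself depends on how it was provisioned (for example, with eksctl delete cluster):

```bash
# Remove the demo workload and the Karpenter configuration.
kubectl delete deployment inflate
kubectl delete provisioner default
kubectl delete awsnodetemplate default
```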
Conclusion
There are many use cases and options to evaluate when choosing container solutions on Amazon Web Services. For CoStar, Karpenter consolidated the Amazon EC2 Spot capacity they were running in their dev and test environments and moved workloads to the lowest-cost instance types that could still run them effectively. Customers who want to maintain parity during migrations, continue to leverage deep investments in Kubernetes, and adopt a proven path to cost optimization should consider Amazon EKS with Karpenter. For those already using Amazon EKS, the Karpenter getting started guide is a natural next step.