Optimizing your Amazon Web Services Batch architecture for scale with observability dashboards
Amazon Web Services Batch is a fully managed service that enables you to run computational jobs at any scale without managing compute resources. Customers often ask for guidance on optimizing their architectures so that their workloads scale rapidly on the service. Every workload is different, and some optimizations may not yield the same results in every case. For this reason, we built an open-source observability solution that lets you see how Amazon Web Services Batch scales instances and places jobs for your particular workload.
In this blog post, you will learn:
- How Amazon Web Services Batch scales Amazon Elastic Compute Cloud (Amazon EC2) instances to process jobs.
- How Amazon EC2 provisions instances for Amazon Web Services Batch.
- How to leverage our observability solution to peek under the covers of Batch and optimize your architectures.
The learnings in this blog post apply to regular and array jobs on Amazon EC2. Amazon Web Services Batch compute environments that leverage Amazon Web Services Fargate resources, as well as multi-node parallel jobs, are not discussed here.
Amazon Web Services Batch: how instances and jobs scale on demand
To run a job on Amazon Web Services Batch, you submit it to a Batch job queue (JQ), which manages the lifecycle of the job from submission to dispatch to tracking of the return status. Upon submission, the job enters the SUBMITTED state and then moves to the RUNNABLE state, which means it is ready to be scheduled for execution. The Amazon Web Services Batch scheduler service regularly looks at your job queues and evaluates the requirements (vCPUs, GPUs, and memory) of the RUNNABLE jobs. Based on this evaluation, the service identifies whether the associated Batch compute environments (CEs) need to be scaled up in order to process the jobs in the queues.
To initiate this action, Amazon Web Services Batch generates a list of Amazon EC2 instances matching the jobs' compute requirements (number of vCPUs, GPUs, and memory) from the Amazon EC2 instance types you selected when creating the compute environment.
Once instances are created by Amazon EC2, they register with the Amazon ECS cluster backing the compute environment. Jobs that are RUNNABLE and fit the free resources transition to the STARTING state, during which their containers are started. Once the containers are executing, the job transitions to the RUNNING state. It remains there until either successful completion or an error is encountered. When there are no jobs in the RUNNABLE state and instances stay idle, Amazon Web Services Batch detaches the instances from the compute environment and asks Amazon ECS to deregister them before they are terminated.
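The scheduler's evaluation step can be illustrated with a small sketch. Assuming jobs are represented as dicts carrying the resource requirements Batch tracks (vCPUs, GPUs, memory), a hypothetical helper aggregates what the RUNNABLE jobs need, which is the signal that drives compute environment scale-up:

```python
# Sketch of the scheduler's capacity evaluation (hypothetical helper,
# not the actual Amazon Web Services Batch implementation).

def required_capacity(jobs):
    """Aggregate vCPU, GPU, and memory (MiB) needs of RUNNABLE jobs."""
    totals = {"vcpus": 0, "gpus": 0, "memory": 0}
    for job in jobs:
        if job["status"] != "RUNNABLE":
            continue  # only RUNNABLE jobs drive scale-up decisions
        totals["vcpus"] += job["vcpus"]
        totals["gpus"] += job.get("gpus", 0)
        totals["memory"] += job["memory"]
    return totals

jobs = [
    {"status": "RUNNABLE", "vcpus": 2, "memory": 4096},
    {"status": "RUNNABLE", "vcpus": 4, "memory": 8192, "gpus": 1},
    {"status": "RUNNING", "vcpus": 2, "memory": 4096},  # already placed
]
print(required_capacity(jobs))  # {'vcpus': 6, 'gpus': 1, 'memory': 12288}
```

The compute environment then only needs to acquire enough capacity to cover these totals, within its configured limits.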
Job placement onto instances
While the CEs scale up and down, Amazon Web Services Batch continuously asks Amazon ECS to place jobs on your instances by calling the RunTask API.
How instances are provisioned
You have seen how instances are requested by Amazon Web Services Batch; we will now discuss how they are provisioned. Instance provisioning is handled by Amazon EC2: it picks instances from the list generated by Amazon Web Services Batch according to the allocation strategy configured on the compute environment.
With the allocation strategy called BEST_FIT (BF), Amazon EC2 picks the least expensive instance type. BEST_FIT_PROGRESSIVE (BFP) also picks the least expensive instance types first, but selects additional instance types if the previously selected types are not available. In the case of SPOT_CAPACITY_OPTIMIZED, Amazon EC2 Spot picks instance types from the Spot capacity pools with the most available capacity, which lowers the likelihood of interruptions.
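The difference between the three strategies can be shown with a toy sketch. The instance catalog below is made up for illustration (the prices and pool depths are placeholders), and the real Amazon EC2 selection logic considers far more signals:

```python
# Toy illustration of the three allocation strategies
# (simplified; not the actual Amazon EC2 selection logic).

instance_pools = [
    # (type, price per hour in USD, available Spot pool capacity) -- placeholders
    ("c5.large", 0.085, 120),
    ("c4.large", 0.100, 40),
    ("m5.large", 0.096, 300),
]

def best_fit(pools):
    """BEST_FIT: the single lowest-priced instance type."""
    return min(pools, key=lambda p: p[1])[0]

def best_fit_progressive(pools):
    """BEST_FIT_PROGRESSIVE: lowest-priced first, falling back in price order."""
    return [p[0] for p in sorted(pools, key=lambda p: p[1])]

def spot_capacity_optimized(pools):
    """SPOT_CAPACITY_OPTIMIZED: the deepest Spot capacity pool."""
    return max(pools, key=lambda p: p[2])[0]

print(best_fit(instance_pools))                 # c5.large
print(best_fit_progressive(instance_pools))     # ['c5.large', 'm5.large', 'c4.large']
print(spot_capacity_optimized(instance_pools))  # m5.large
```

Note how BF and SCO can disagree: the cheapest type is not necessarily the one with the deepest Spot pool.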
Why optimize the Amazon EC2 instance selection in your Compute Environments?
Understanding how Amazon Web Services Batch selects instances helps you optimize the environment to achieve your objective. The instances you get through Amazon Web Services Batch are picked based on the number of jobs, their shapes in terms of the number of vCPUs and the amount of memory they require, the allocation strategy, and the price. If jobs have low requirements (1-2 vCPUs, <4 GB of memory), it is likely that you will get small instances from Amazon EC2 (e.g., c5.xlarge), since they can be cost effective and belong to the deepest Spot instance pools. In this case, you may provision a large number of small instances rather than a few large ones.
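The packing arithmetic behind this is straightforward. A rough sketch (illustrative only; in practice ECS also reserves resources for the agent and the operating system) shows how many jobs of a given shape fit on an instance type:

```python
# Rough packing arithmetic: how many jobs of a given shape fit on an
# instance (illustrative; real ECS placement reserves some resources
# for the agent and the operating system).

def jobs_per_instance(instance_vcpus, instance_mem_mib, job_vcpus, job_mem_mib):
    """A job fits only if both its vCPU and memory requirements fit."""
    return min(instance_vcpus // job_vcpus, instance_mem_mib // job_mem_mib)

# A 2 vCPU / 3.5 GiB job on a c5.xlarge (4 vCPUs, ~8 GiB):
print(jobs_per_instance(4, 8192, 2, 3584))    # 2
# The same job on a c5.9xlarge (36 vCPUs, ~72 GiB):
print(jobs_per_instance(36, 73728, 2, 3584))  # 18
```

The limiting resource flips depending on the job shape, which is why the job's vCPU-to-memory ratio matters when choosing instance families.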
Another case where observability on your Amazon Web Services Batch environment is important is when your jobs need storage scratch space while running. If you set your EBS volumes to a fixed size, you may see some jobs crash on larger instances due to running out of scratch space. This can be caused by the higher number of jobs packed per instance, which consume your fixed storage more quickly.
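A quick back-of-the-envelope calculation makes the failure mode clear. Assuming a fixed volume size per instance and some space reserved for the system (the reservation value below is a placeholder), the scratch space available to each job shrinks as more jobs are packed onto a larger instance:

```python
# If the EBS volume size is fixed per instance, scratch space per job
# shrinks as more jobs are packed onto larger instances
# (illustrative arithmetic, not an Amazon Web Services API).

def scratch_per_job_gib(volume_size_gib, jobs_on_instance, reserved_gib=10):
    """Scratch available per job after reserving space for the system."""
    return (volume_size_gib - reserved_gib) / jobs_on_instance

print(scratch_per_job_gib(100, 2))   # 45.0 GiB per job on a small instance
print(scratch_per_job_gib(100, 18))  # 5.0 GiB per job on a large instance
```

A job that needs, say, 10 GiB of scratch succeeds on the small instance but fails on the large one, even though both have identical volumes.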
To help you understand and tune your Amazon Web Services Batch architectures, we developed and published a new open-source observability solution.
We will discuss this observability solution in detail and demonstrate how it can be used to understand the behavior of your compute environment using a given workload. The examples provided here can be deployed on your account using Amazon CloudFormation templates.
Amazon Web Services Batch observability solution
This open-source solution is packaged as a serverless application that collects data from your Amazon Web Services Batch environment and presents it in a set of Amazon CloudWatch dashboards.
SAM application architecture
The SAM application collects data points from your Amazon Web Services Batch environment using CloudWatch events. Depending on their nature, these events trigger serverless functions that process them and store the relevant data points.
Using Amazon CloudWatch Events with Amazon Web Services Batch
This solution only uses events generated by the services instead of polling their APIs, which avoids adding API load and potential throttling on your account at scale.
The events captured by the serverless application are the following:
- ECS container instance registration and deregistration: When an instance reaches the RUNNING state, it registers itself with ECS to become part of the resource pool where your jobs will run. When an instance is removed, it deregisters and will no longer be used by ECS to run jobs. Both API calls contain information on the instance type, its resources (vCPUs, memory), the Availability Zone where the instance is located, the ECS cluster, and the container instance ID. Instance registration status is collected in an Amazon DynamoDB table to determine the EC2 instance, Availability Zone, and ECS cluster used by a job through its container instance ID.
- RunTask API calls: These calls are used by Batch to request the placement of jobs on instances. If successful, a job is placed. If an error is returned, the cause can be a lack of available instances or resources (vCPUs, memory) to place a job. Calls are captured in an Amazon DynamoDB table to associate a Batch job with the container instance it runs on.
- Batch job state transitions: An event is triggered when a job transitions between states. When moving to the RUNNING state, the JobID is matched against the ECS instances and RunTask DynamoDB tables to identify where the job is running.
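To make the last event type concrete, here is a minimal sketch of a handler for a Batch "Job State Change" event. The `detail` fields shown (`jobId`, `status`) follow the documented event structure; the DynamoDB lookup itself is elided, and the sample job ID is a made-up placeholder:

```python
# Minimal sketch of a handler for Batch "Job State Change" events.
# The actual solution stores records in DynamoDB; that part is elided here.

def handler(event, context=None):
    detail = event["detail"]
    record = {"jobId": detail["jobId"], "status": detail["status"]}
    if detail["status"] == "RUNNING":
        # At this point the solution matches the JobID against the
        # RunTask and ECS-instances tables to find the container instance.
        record["needs_placement_lookup"] = True
    return record

# Placeholder event with the documented source/detail-type values:
sample_event = {
    "source": "aws.batch",
    "detail-type": "Batch Job State Change",
    "detail": {"jobId": "4c7599ae-0a82-49aa-ba5a-4727fcce14a8",
               "status": "RUNNING"},
}
print(handler(sample_event))
```

Only RUNNING transitions require the extra placement lookup, because that is the moment a job becomes tied to a specific container instance.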
The serverless application must be deployed on your account before going to the next section. If you haven't done so, please follow the documentation in the solution's repository.
Understanding your Batch architecture
Now that you can visualize how Amazon Web Services Batch acquires instances and places jobs, we will use this data to evaluate the effect of instance selection on a test workload.
To demonstrate this, you will deploy a set of Amazon Web Services CloudFormation templates as described in Part I of this series.
Understanding the instance selection of your Compute Environments
When creating a compute environment, you select the Amazon EC2 instance types or families it is allowed to use.
In this example, CE1 is configured to pick instances from the c5, c4, m5, and m4 Amazon EC2 instance families, regardless of instance size, and uses the SPOT_CAPACITY_OPTIMIZED allocation strategy.
The test workload consists of 1,000 jobs submitted to a job queue attached to CE1.
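One convenient way to submit a batch of identical jobs like this is as an array job. The sketch below only builds the request parameters (the queue and job definition names are placeholders); an actual run would pass them to `boto3.client("batch").submit_job(**req)`:

```python
# Building submit_job parameters for a 1,000-child array job
# (names are placeholders; pass the dict to boto3's Batch submit_job).

def array_job_request(job_queue, job_definition, size):
    return {
        "jobName": "observability-test",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": size},  # one child job per index
    }

req = array_job_request("test-queue", "test-jobdef:1", 1000)
print(req["arrayProperties"])  # {'size': 1000}
```

Each child job then moves through the SUBMITTED, RUNNABLE, STARTING, and RUNNING states independently, which is what the dashboards below visualize.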
Run and visualize your Workload on CE1
While your workload is running, look at the vCPU and instance capacity on the Batch-EC2-Capacity dashboard. In the run below, the CE acquired a maximum of 766 instances and provided a peak of 2,044 vCPUs. It also shows that 18 minutes were necessary to run the 1,000 jobs. This runtime can vary depending on the number of instances acquired. The dashboard reflects some missing data near the end of the run, but this has no impact on the processing of the jobs.
The Batch-ECS-RunTaskJobsPlaced dashboard (Figure 4) has two plots on which you can compare the Desired Capacity requested by Batch and the In-service Capacity maintained by ECS. When the workload is optimal, they should mirror each other closely.
To identify which instance types were launched, in which Availability Zone they were launched, and when they joined the ECS cluster, open the Batch-ECS-InstancesRegistration dashboard (Figure 5). In this case, the instances acquired were c5.large, c5.xlarge, c5.2xlarge, and c4.xlarge, as they can all accommodate the jobs and were selected by the SPOT_CAPACITY_OPTIMIZED allocation strategy because they belong to the deepest available instance pools.
The Batch-Jobs-Placement dashboard (Figure 6) shows when jobs are placed onto the instances (moving from the RUNNABLE to the RUNNING state), on which instance type, and in which Availability Zone. In this run, the peak task placement rate was 820 tasks per minute.
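A placement-rate figure like this can be derived from the captured RUNNING-transition timestamps. A small sketch (illustrative; the dashboard computes this from the events stored by the solution) buckets the transitions by minute and takes the maximum:

```python
# Deriving a peak placement rate from RUNNING-transition timestamps
# (illustrative; the dashboard derives this from the captured events).
from collections import Counter

def peak_placements_per_minute(timestamps_s):
    """timestamps_s: epoch seconds at which jobs entered RUNNING."""
    per_minute = Counter(int(t // 60) for t in timestamps_s)
    return max(per_minute.values())

# 3 placements in minute 0, 4 in minute 1, 1 in minute 2:
print(peak_placements_per_minute([0, 10, 20, 61, 62, 63, 64, 125]))  # 4
```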
There are other plots and dashboards available to explore. By taking the time to study the dashboards in this way for your workload, you can gain an understanding of the behavior of Amazon Web Services Batch under these CE1 settings. By modifying the settings, or running a different workload, you can continue to use these dashboards to gain deeper insights into how each workload behaves.
Summary
In this blog post, you learned how to use runtime metrics to understand an Amazon Web Services Batch architecture for a given workload. Better understanding your workloads can yield many benefits as they scale.
Several customers in the Healthcare & Life Sciences, Media & Entertainment, and Financial Services industries have used this monitoring tool to optimize their workloads for scale by reshaping their jobs, refining their instance selection, and tuning their Amazon Web Services Batch architectures.
There are other tools that you can use to track the behavior of your infrastructure as well.
To get started with the Amazon Web Services Batch open-source observability solution, visit the project repository.