Monitor and optimize cost on Amazon Web Services Glue for Apache Spark
One of the most common questions we get from customers is how to effectively monitor and optimize costs on Amazon Web Services Glue for Spark. The diversity of features and pricing options for Amazon Web Services Glue gives you the flexibility to manage the cost of your data workloads effectively while keeping the performance and capacity your business needs. Although the fundamental process of cost optimization for Amazon Web Services Glue workloads remains the same, you can monitor job runs, analyze the costs and usage to find savings, and take action by implementing improvements to the code or configuration.
In this post, we demonstrate a tactical approach to help you manage and reduce cost through monitoring and optimization techniques on top of your Amazon Web Services Glue workloads.
Monitor overall costs on Amazon Web Services Glue for Apache Spark
Amazon Web Services Glue for Apache Spark charges an hourly rate, billed in 1-second increments with a 1-minute minimum, based on the number of data processing units (DPUs). Learn more in Amazon Web Services Glue pricing.
Amazon Web Services Cost Explorer
In Amazon Web Services Cost Explorer, you can see the overall trend of DPU hours. Complete the following steps:
- On the Cost Explorer console, create a new cost and usage report.
- For Service, choose Glue.
- For Usage type, choose the following options:
- Choose <Region>-ETL-DPU-Hour (DPU-Hour) for standard jobs.
- Choose <Region>-ETL-Flex-DPU-Hour (DPU-Hour) for Flex jobs.
- Choose <Region>-GlueInteractiveSession-DPU-Hour (DPU-Hour) for interactive sessions.
- Choose Apply.
Learn more in the Amazon Web Services Cost Explorer documentation.
Monitor individual job run costs
This section describes a way to monitor individual job run costs on Amazon Web Services Glue for Apache Spark. There are two options to achieve this.
Amazon Web Services Glue Studio Monitoring page
On the Monitoring page in Amazon Web Services Glue Studio, you can monitor the DPU hours you spent on a specific job run. The following screenshot shows three job runs that processed the same dataset; the first job run spent 0.66 DPU hours, and the second spent 0.44 DPU hours. The third one with Flex spent only 0.32 DPU hours.
GetJobRun and GetJobRuns APIs
The DPU hour values per job run can be retrieved through Amazon Web Services APIs.
For auto scaling jobs and Flex jobs, the field DPUSeconds is available in the GetJobRun and GetJobRuns API responses.
In this example (the Flex job run), the field DPUSeconds returns 1137.0, which corresponds to 0.32 DPU hours: 1137.0/(60*60)=0.32.
For the other standard jobs without auto scaling, the field DPUSeconds is not available.
For these jobs, you can calculate DPU hours as ExecutionTime*MaxCapacity/(60*60). In this example, 157*10/(60*60)=0.44 gives 0.44 DPU hours. Note that Amazon Web Services Glue versions 2.0 and later have a 1-minute minimum billing.
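As a minimal sketch (the job name and run ID are placeholders), the following boto3 snippet applies both formulas to compute DPU hours for a single job run:

```python
import boto3

glue = boto3.client("glue")

# Placeholder job name and run ID for illustration
response = glue.get_job_run(JobName="my-glue-job", RunId="jr_0123456789abcdef")
job_run = response["JobRun"]

if "DPUSeconds" in job_run:
    # Auto scaling and Flex jobs report DPUSeconds directly
    dpu_hours = job_run["DPUSeconds"] / 3600
else:
    # Standard jobs: ExecutionTime (seconds) multiplied by MaxCapacity (DPUs)
    dpu_hours = job_run["ExecutionTime"] * job_run["MaxCapacity"] / 3600

print(f"DPU hours: {dpu_hours:.2f}")
```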
Amazon Web Services CloudFormation template
Because DPU hours can be retrieved through the GetJobRun and GetJobRuns APIs, you can integrate this with other services like Amazon CloudWatch to monitor the trend of consumed DPU hours over time. To help you configure that quickly, we provide an Amazon Web Services CloudFormation template.
The CloudFormation template generates the following resources:
- Amazon Web Services Identity and Access Management (IAM) role
- Lambda function
- EventBridge rule
To create your resources, complete the following steps:
- Sign in to the Amazon Web Services CloudFormation console.
- Choose Launch Stack:
- Choose Next.
- Choose Next.
- On the next page, choose Next.
- Review the details on the final page and select I acknowledge that Amazon Web Services CloudFormation might create IAM resources.
- Choose Create stack.
Stack creation can take up to 3 minutes.
After you complete the stack creation, when Amazon Web Services Glue jobs finish, the following DPUHours metrics are published under the Glue namespace in CloudWatch:
- Aggregated metrics – Dimension=[JobType, GlueVersion, ExecutionClass]
- Per-job metrics – Dimension=[JobName, JobRunId=ALL]
- Per-job run metrics – Dimension=[JobName, JobRunId]
Aggregated metrics and per-job metrics are shown in the following screenshot.
Each data point represents DPUHours per individual job run, so the valid statistic for these CloudWatch metrics is SUM. With the CloudWatch metrics, you can get a granular view of DPU hours.
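The exact code shipped in the CloudFormation template isn't reproduced here, but the following is a minimal sketch of how such a Lambda function could work: it reacts to the EventBridge Glue Job State Change event, retrieves the run, and publishes a per-job-run DPUHours metric under the Glue namespace.

```python
import boto3

glue = boto3.client("glue")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # EventBridge "Glue Job State Change" events carry the job name and run ID
    detail = event["detail"]
    job_name = detail["jobName"]
    job_run_id = detail["jobRunId"]

    job_run = glue.get_job_run(JobName=job_name, RunId=job_run_id)["JobRun"]
    if "DPUSeconds" in job_run:
        dpu_hours = job_run["DPUSeconds"] / 3600
    else:
        dpu_hours = job_run["ExecutionTime"] * job_run["MaxCapacity"] / 3600

    # Publish a per-job-run data point; SUM over these data points gives total DPU hours
    cloudwatch.put_metric_data(
        Namespace="Glue",
        MetricData=[
            {
                "MetricName": "DPUHours",
                "Dimensions": [
                    {"Name": "JobName", "Value": job_name},
                    {"Name": "JobRunId", "Value": job_run_id},
                ],
                "Value": dpu_hours,
            }
        ],
    )
```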
Options to optimize cost
This section describes key options to optimize costs on Amazon Web Services Glue for Apache Spark:
- Upgrade to the latest version
- Auto scaling
- Flex
- Set the job’s timeout period appropriately
- Interactive sessions
- Smaller worker type for streaming jobs
We dive deep into the individual options.
Upgrade to the latest version
Having Amazon Web Services Glue jobs running on the latest version enables you to take advantage of the latest functionalities and improvements offered by Amazon Web Services Glue and the upgraded versions of the supported engines, such as Apache Spark. For example, Amazon Web Services Glue 4.0 includes the new optimized Apache Spark 3.3.0 runtime and adds support for built-in pandas APIs as well as native support for Apache Hudi, Apache Iceberg, and Delta Lake formats, giving you more options for analyzing and storing your data. It also brings further performance improvements to the runtime.
Auto scaling
One of the most common challenges in reducing cost is identifying the right amount of resources to run jobs. Users tend to overprovision workers in order to avoid resource-related problems, but some of those DPUs go unused, which increases costs unnecessarily. Starting with Amazon Web Services Glue version 3.0, Amazon Web Services Glue auto scaling helps you dynamically scale resources up and down based on the workload, for both batch and streaming jobs. Auto scaling reduces the need to fine-tune the number of workers, helping you avoid over-provisioning resources or paying for idle workers.
To enable auto scaling in Amazon Web Services Glue Studio, go to the Job Details tab of your Amazon Web Services Glue job and select Automatically scale number of workers.
You can learn more in the Amazon Web Services Glue auto scaling documentation.
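Outside Amazon Web Services Glue Studio, auto scaling can also be turned on through the --enable-auto-scaling job parameter (verify the parameter against the current documentation for your Glue version). The following is a minimal sketch using boto3; the job name and worker settings are placeholders:

```python
import boto3

glue = boto3.client("glue")

# With auto scaling, NumberOfWorkers acts as the maximum number of workers
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={"--enable-auto-scaling": "true"},
    WorkerType="G.1X",
    NumberOfWorkers=10,
)
```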
Flex
For non-urgent data integration workloads that don’t require fast job start times or can afford to rerun the jobs in case of a failure, Flex execution lets you run jobs on spare capacity in Amazon Web Services at a lower DPU-hour rate.
To enable Flex in Amazon Web Services Glue Studio, go to the Job Details tab of your job and select Flex execution.
You can learn more in the Amazon Web Services Glue Flex documentation.
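The execution class can also be set when starting a run through the API. A minimal sketch with boto3 (the job name is a placeholder):

```python
import boto3

glue = boto3.client("glue")

# Run this job on spare capacity with the Flex execution class
glue.start_job_run(
    JobName="my-non-urgent-job",
    ExecutionClass="FLEX",
)
```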
Interactive sessions
One common practice among developers who create Amazon Web Services Glue jobs is to run the same job several times, every time a modification is made to the code. However, this may not be cost-effective depending on the number of workers assigned to the job and the number of times it’s run. This approach can also slow down development, because you have to wait until every job run is complete. To address this issue, in 2022 we released Amazon Web Services Glue interactive sessions, which let you develop and test code interactively against a live Spark environment without rerunning a full job for every change.
Set the job’s timeout period appropriately
Due to configuration issues, script coding errors, or data anomalies, Amazon Web Services Glue jobs can sometimes take an exceptionally long time or struggle to process the data, which can cause unexpected charges. Amazon Web Services Glue gives you the ability to set a timeout value on any job. By default, an Amazon Web Services Glue job is configured with a 48-hour timeout, but you can specify any timeout value. We recommend identifying the average runtime of your job and, based on that, setting an appropriate timeout period. This way, you can control cost per job run, prevent unexpected charges, and detect problems related to the job earlier.
To change the timeout value in Amazon Web Services Glue Studio, go to the Job Details tab of your job and enter a value for Job timeout.
Interactive sessions also have the ability to set an idle timeout value on sessions. The default idle timeout value for Spark ETL sessions is 2,880 minutes (48 hours). To change the timeout value, you can use the %idle_timeout magic.
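As a minimal sketch (the job name and values are placeholders), the timeout for an individual job run can also be overridden when starting the job through the API:

```python
import boto3

glue = boto3.client("glue")

# Override the default timeout for this run; the value is in minutes
glue.start_job_run(JobName="my-glue-job", Timeout=120)

# In an interactive sessions notebook, the equivalent control is the
# %idle_timeout magic, for example: %idle_timeout 60
```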
Smaller worker type for streaming jobs
Processing data in real time is a common use case for customers, but sometimes these streams have sporadic and low data volumes. G.1X and G.2X worker types could be too big for these workloads, especially if we consider that streaming jobs may need to run 24/7. To help you reduce costs, in 2022 we released the G.025X worker type, a quarter-DPU worker designed for low-volume streaming jobs.
To select the G.025X worker type in Amazon Web Services Glue Studio, go to the Job Details tab of your job. For Type, choose Spark Streaming, then choose G 0.25X for Worker type.
You can learn more in the Amazon Web Services Glue documentation for streaming jobs.
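As a minimal sketch (the job name, role, and script location are placeholders), a streaming job using the G.025X worker type can also be created through the API:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-low-volume-streaming-job",
    Role="GlueJobRole",  # placeholder IAM role
    Command={
        "Name": "gluestreaming",
        "ScriptLocation": "s3://my-bucket/scripts/streaming_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.025X",
    NumberOfWorkers=2,
)
```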
Performance tuning to optimize cost
Performance tuning plays an important role in reducing cost. The first action for performance tuning is to identify the bottlenecks. Without measuring the performance and identifying bottlenecks, it’s not realistic to optimize cost-effectively.
The following are high-level strategies to optimize costs:
- Scale cluster capacity
- Reduce the amount of data scanned
- Parallelize tasks
- Optimize shuffles
- Overcome data skew
- Accelerate query planning
For this post, we discuss the techniques for reducing the amount of data scanned and parallelizing tasks.
Reduce the amount of data scanned: Enable job bookmarks
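Job bookmarks let Amazon Web Services Glue track data that was already processed in previous runs, so each run scans only new data instead of the full dataset. As a minimal sketch (database and table names are placeholders), bookmarks rely on a transformation_ctx on the read and a job commit at the end of the script:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is what the bookmark state is keyed on
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
    transformation_ctx="read_source",
)

# ... transform and write the data ...

job.commit()  # persists the bookmark state for the next run
```

The job itself must also have bookmarks enabled, either through the Job bookmark setting in Amazon Web Services Glue Studio or by setting the --job-bookmark-option job parameter to job-bookmark-enable.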
Reduce the amount of data scanned: Partition pruning
If your input data is partitioned in advance, you can reduce the amount of data scanned by pruning partitions.
For Amazon Web Services Glue DynamicFrame, set push_down_predicate (and catalogPartitionPredicate), as shown in the following code. Learn more in the Amazon Web Services Glue documentation.
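The following is a minimal sketch; the database, table, and partition column names are placeholders:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read only the partitions matching the predicate instead of scanning the whole table
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_partitioned_table",
    push_down_predicate="year == '2023' and month == '01'",
    additional_options={
        # Prune partitions server-side in the Data Catalog before listing them
        "catalogPartitionPredicate": "year = '2023'"
    },
)
```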
For Spark DataFrame (or Spark SQL), set a where or filter clause to prune partitions:
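A minimal sketch for the DataFrame case (the path and partition columns are placeholders), assuming the data is laid out in Hive-style partition folders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A filter on partition columns is pushed down so only matching folders are read
df = (
    spark.read.format("parquet")
    .load("s3://my-bucket/my_partitioned_table/")
    .filter("year = '2023' AND month = '01'")
)
```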
Parallelize tasks: Parallelize JDBC reads
The number of concurrent reads from the JDBC source is determined by configuration. Note that by default, a single JDBC connection reads all the data from the source through a SELECT query. Both Amazon Web Services Glue DynamicFrame and Spark DataFrame support parallelizing data scans across multiple tasks by splitting the dataset.
For Amazon Web Services Glue DynamicFrame, set hashfield (or hashexpression) and hashpartitions. Learn more in the Amazon Web Services Glue documentation.
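A minimal sketch for the DynamicFrame case (the database, table, and column names are placeholders):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Split the JDBC read across parallel tasks hashed on a column
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_jdbc_table",
    additional_options={
        "hashfield": "customer_id",   # column used to split the reads
        "hashpartitions": "10",       # number of parallel reads
    },
)
```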
For Spark DataFrame, set numPartitions, partitionColumn, lowerBound, and upperBound. Learn more in the Apache Spark documentation for JDBC data sources.
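A minimal sketch for the DataFrame case (the JDBC endpoint, credentials, and bounds are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table with 10 parallel tasks partitioned on a numeric column
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-db-host:3306/mydb")
    .option("dbtable", "my_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("partitionColumn", "id")   # numeric, date, or timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "10")
    .load()
)
```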
Conclusion
In this post, we discussed methodologies for monitoring and optimizing cost on Amazon Web Services Glue for Apache Spark. With these techniques, you can effectively monitor and optimize costs on Amazon Web Services Glue for Spark.
If you have comments or feedback, please leave them in the comments.
Appendix: Amazon CloudWatch charges
When you use Amazon CloudWatch with Amazon Web Services Glue jobs, you are charged standard rates for CloudWatch metrics and CloudWatch Logs. Learn more in Amazon CloudWatch pricing.
- Job metrics: You incur additional charges when you enable job metrics, and CloudWatch custom metrics are created.
- Application logging: You incur additional charges for aggregated application logs in the CloudWatch log groups /aws-glue/jobs/output and /aws-glue/jobs/error.
- Continuous logging: You incur additional charges when you enable continuous logging, and CloudWatch log events are emitted to the CloudWatch log group /aws-glue/jobs/logs-v2.
To optimize CloudWatch charges related to Amazon Web Services Glue jobs, first review the cost breakdown in Amazon Web Services Cost Explorer.
Optimize cost for CloudWatch metrics
To reduce charges for metrics, you can disable job metrics. Note that the CloudFormation template provided in this post creates custom metrics, which also incur additional charges.
Optimize cost for CloudWatch logs
CloudWatch Logs pricing is driven mainly by log ingestion and archive storage.
To reduce charges for log ingestion, you can do the following:
- Reduce unneeded logging such as print(), df.show(), and custom logger calls in your job script
- Configure the standard log filter instead of no filter
- Avoid setting the log level to DEBUG for production jobs
To reduce charges for log archive storage, you can configure a retention period for your log groups. Learn more in the Amazon CloudWatch Logs documentation.
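As a minimal sketch (the log group name matches the default continuous logging group; the retention value is a placeholder), retention can be set with boto3:

```python
import boto3

logs = boto3.client("logs")

# Keep continuous logging output for 30 days instead of retaining it indefinitely
logs.put_retention_policy(
    logGroupName="/aws-glue/jobs/logs-v2",
    retentionInDays=30,
)
```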
About the Authors
Noritaka Sekiyama is a Principal Big Data Architect on the Amazon Web Services Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.