Scale your Amazon Web Services Glue for Apache Spark jobs with new larger worker types G.4X and G.8X
Hundreds of thousands of customers use
Today we are pleased to announce the general availability of Amazon Web Services Glue G.4X (4 DPU) and G.8X (8 DPU) workers, the next series of Amazon Web Services Glue workers for the most demanding data integration workloads. G.4X and G.8X workers offer increased compute, memory, and storage, making it possible for you to vertically scale and run intensive data integration jobs, such as memory-intensive data transforms, skewed aggregations, and entity detection checks involving petabytes of data. Larger worker types benefit not only the Spark executors but also the Spark driver when it needs more capacity, for instance because the job query plan is quite large.
This post demonstrates how Amazon Web Services Glue G.4X and G.8X workers help you scale your Amazon Web Services Glue for Apache Spark jobs.
G.4X and G.8X workers
Amazon Web Services Glue G.4X and G.8X workers give you more compute, memory, and storage to run your most demanding jobs. G.4X workers provide 4 DPU, with 16 vCPU, 64 GB memory, and 256 GB of disk per node. G.8X workers provide 8 DPU, with 32 vCPU, 128 GB memory, and 512 GB of disk per node. You can enable G.4X and G.8X workers with a single parameter change in the API,
The following table shows compute, memory, disk, and Spark configurations per worker type in Amazon Web Services Glue 3.0 or later.
Amazon Web Services Glue Worker Type | DPU per Node | vCPU | Memory (GB) | Disk (GB) | Number of Spark Executors per Node | Number of Cores per Spark Executor |
G.1X | 1 | 4 | 16 | 64 | 1 | 4 |
G.2X | 2 | 8 | 32 | 128 | 1 | 8 |
G.4X (new) | 4 | 16 | 64 | 256 | 1 | 16 |
G.8X (new) | 8 | 32 | 128 | 512 | 1 | 32 |
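Every row in the table follows the same per-DPU ratio: 4 vCPU, 16 GB of memory, and 64 GB of disk per DPU, with one Spark executor per node that uses all of the node's cores. As a minimal sketch, the table can be reproduced from the DPU count alone. The names `WorkerSpec` and `worker_spec` are hypothetical helpers for illustration, not part of any Amazon Web Services Glue API.

```python
from dataclasses import dataclass

# DPU count per worker type, taken from the table above.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}

# Per-DPU resources implied by the table rows (assumption: the ratio is
# constant, which holds for every row shown in the table).
VCPU_PER_DPU = 4
MEMORY_GB_PER_DPU = 16
DISK_GB_PER_DPU = 64


@dataclass(frozen=True)
class WorkerSpec:
    worker_type: str
    dpu: int
    vcpu: int
    memory_gb: int
    disk_gb: int
    executors_per_node: int
    cores_per_executor: int


def worker_spec(worker_type: str) -> WorkerSpec:
    """Derive one table row for the given worker type from its DPU count."""
    dpu = DPU_PER_WORKER[worker_type]
    vcpu = dpu * VCPU_PER_DPU
    return WorkerSpec(
        worker_type=worker_type,
        dpu=dpu,
        vcpu=vcpu,
        memory_gb=dpu * MEMORY_GB_PER_DPU,
        disk_gb=dpu * DISK_GB_PER_DPU,
        executors_per_node=1,     # one executor per node in the table
        cores_per_executor=vcpu,  # the executor uses every core on the node
    )
```

For example, `worker_spec("G.8X")` yields 32 vCPU, 128 GB of memory, and 512 GB of disk, matching the last row of the table.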
To use G.4X and G.8X workers on an Amazon Web Services Glue job, change the setting of the worker type parameter to G.4X or G.8X. In Amazon Web Services Glue Studio, you can choose G 4X or G 8X under Worker type.
In the Amazon Web Services API or Amazon Web Services SDK, you can specify G.4X or G.8X in the --worker-type parameter in a command.
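As a hedged illustration of the SDK route, the sketch below builds a `create_job` request for the SDK for Python (boto3), where the corresponding request field is `WorkerType`. The job name, role ARN, script location, and worker count are placeholder assumptions, and the helper names `build_create_job_request` and `create_job` are hypothetical. The request is built separately from the call so it can be inspected without credentials.

```python
# Worker types listed in the table above.
SUPPORTED_WORKER_TYPES = {"G.1X", "G.2X", "G.4X", "G.8X"}


def build_create_job_request(name, role_arn, script_location,
                             worker_type="G.4X", number_of_workers=10):
    """Build keyword arguments for the Glue create_job call.

    All argument values are caller-supplied placeholders; only the field
    names come from the Glue API.
    """
    if worker_type not in SUPPORTED_WORKER_TYPES:
        raise ValueError(f"unsupported worker type: {worker_type}")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": worker_type,  # the single parameter that changes
        "NumberOfWorkers": number_of_workers,
    }


def create_job(request):
    """Send the request to Glue. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder works without the SDK

    return boto3.client("glue").create_job(**request)
```

Switching an existing job definition from G.2X to G.8X then means changing only the `worker_type` argument (and, if desired, the worker count).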
To use G.4X and G.8X on an Amazon Web Services Glue Studio notebook or interactive sessions, set G.4X or G.8X in the %worker_type magic.
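As a sketch, the first cell of a notebook might set the magics like this. The Glue version and the worker count are illustrative assumptions; only the `%worker_type` value is what this post is about.

```
%glue_version 4.0
%worker_type G.4X
%number_of_workers 10
```

These magics need to run before the session starts, because the worker type is fixed once the session is provisioned.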
Performance characteristics using the TPC-DS benchmark
In this section, we use the TPC-DS benchmark to showcase performance characteristics of the new G.4X and G.8X worker types. We used Amazon Web Services Glue version 4.0 jobs.
G.2X, G.4X, and G.8X results with the same number of workers
Compared to the G.2X worker type, the G.4X worker has 2 times the DPUs and the G.8X worker has 4 times the DPUs. We ran over 100 TPC-DS queries against the 3 TB TPC-DS dataset with the same number of workers but on different worker types. The following table shows the results of the benchmark.
Worker Type | Number of Workers | Number of DPUs | Duration (minutes) | Cost at $0.44/DPU-hour ($) |
G.2X | 30 | 60 | 537.4 | $236.46 |
G.4X | 30 | 120 | 264.6 | $232.85 |
G.8X | 30 | 240 | 122.6 | $215.78 |
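When running jobs on the same number of workers, the new G.4X and G.8X workers achieved roughly linear vertical scalability: with 2 times the DPUs, G.4X finished about 2.0 times faster than G.2X, and with 4 times the DPUs, G.8X finished about 4.4 times faster.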
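The cost column in the benchmark table is DPUs multiplied by the duration in hours and the price per DPU-hour. A minimal sketch of that arithmetic follows, using the $0.44 rate from the table header; the function names `job_cost` and `speedup` are hypothetical.

```python
PRICE_PER_DPU_HOUR = 0.44  # rate used in the benchmark tables


def job_cost(dpus, duration_minutes, price=PRICE_PER_DPU_HOUR):
    """Cost in dollars: DPUs x hours x price per DPU-hour, rounded to cents."""
    return round(dpus * duration_minutes / 60 * price, 2)


def speedup(baseline_minutes, minutes):
    """How many times faster a run is than the baseline run."""
    return baseline_minutes / minutes


# Rows from the 3 TB benchmark: (worker type, DPUs, duration in minutes).
runs = [("G.2X", 60, 537.4), ("G.4X", 120, 264.6), ("G.8X", 240, 122.6)]

for worker_type, dpus, minutes in runs:
    print(worker_type, job_cost(dpus, minutes),
          round(speedup(537.4, minutes), 2))
```

Because the duration drops roughly in proportion to the added DPUs, the cost stays in the same range across worker types, and it is slightly lower for G.8X in this run.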
G.2X, G.4X, and G.8X results with the same number of DPUs
We ran over 100 TPC-DS queries against the 10 TB TPC-DS dataset with the same number of DPUs but on different worker types. The following table shows the results of the experiments.
Worker Type | Number of Workers | Number of DPUs | Duration (minutes) | Cost at $0.44/DPU-hour ($) |
G.2X | 40 | 80 | 1323 | $776.16 |
G.4X | 20 | 80 | 1191 | $698.72 |
G.8X | 10 | 80 | 1190 | $698.13 |
When running jobs on the same number of total DPUs, job performance stayed mostly the same across worker types; the G.4X and G.8X runs finished about 10% sooner than the G.2X run.
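When you keep the DPU budget fixed and move to a larger worker type, the worker count shrinks in proportion. The sketch below shows that sizing arithmetic; `workers_for_budget` is a hypothetical helper, and rejecting budgets that do not divide evenly is a design choice for this example rather than a service rule.

```python
# DPUs per worker for the types compared in this section.
DPU_PER_WORKER = {"G.2X": 2, "G.4X": 4, "G.8X": 8}


def workers_for_budget(total_dpus, worker_type):
    """Number of workers of the given type that add up to total_dpus."""
    per_worker = DPU_PER_WORKER[worker_type]
    if total_dpus % per_worker != 0:
        raise ValueError(
            f"{total_dpus} DPUs is not a multiple of {per_worker} for {worker_type}"
        )
    return total_dpus // per_worker


# The 80-DPU configurations from the table above.
for worker_type in DPU_PER_WORKER:
    print(worker_type, workers_for_budget(80, worker_type))
```

The same arithmetic gives 12 G.2X workers or 3 G.8X workers for a 24-DPU budget.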
Example: Memory-intensive transformations
Data transformations are an essential step to preprocess and structure your data into an optimal form. Some transformations, such as aggregations, joins, and custom logic in user-defined functions (UDFs), consume more memory than others. The new G.4X and G.8X workers enable you to run larger memory-intensive transformations at scale.
The following example reads large JSON files compressed in GZIP from an input location, performs a groupBy, and calculates groups based on a Pandas UDF.
With G.2X workers
When the Amazon Web Services Glue job ran on 12 G.2X workers (24 DPU), it failed with a No space left on device error. On the Spark UI, the Stages tab for the failed stage shows that multiple tasks in the job failed with this error.
The Executors tab shows failed tasks per executor.
Generally, G.2X workers can process memory-intensive workloads well. This time, we used a special Pandas UDF that consumes a significant amount of memory, and the job failed because of the large amount of shuffle writes.
With G.8X workers
When the Amazon Web Services Glue job ran on 3 G.8X workers (24 DPU), it succeeded without any failures, as shown on the Spark UI’s Jobs tab.
The Executors tab also shows that there were no failed tasks.
From this result, we observed that G.8X workers processed the same workload without failures.
Conclusion
In this post, we demonstrated how Amazon Web Services Glue G.4X and G.8X workers can help you vertically scale your Amazon Web Services Glue for Apache Spark jobs. G.4X and G.8X workers are available today in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm). You can start using the new G.4X and G.8X worker types to scale your workload from today. To get started with Amazon Web Services Glue, visit
About the authors
Noritaka Sekiyama is a Principal Big Data Architect on the Amazon Web Services Glue team. He is based in Tokyo, Japan, and is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
Tomohiro Tanaka is a Senior Cloud Support Engineer on the Amazon Web Services Support team. He’s passionate about helping customers build data lakes using ETL workloads. In his free time, he enjoys coffee breaks with his colleagues and making coffee at home.
Chuhan Liu is a Software Development Engineer on the Amazon Web Services Glue team. He is passionate about building scalable distributed systems for big data processing, analytics, and management. In his spare time, he enjoys playing tennis.
Matt Su is a Senior Product Manager on the Amazon Web Services Glue team. He enjoys helping customers uncover insights and make better decisions using their data with Amazon Web Services Analytic services. In his spare time, he enjoys skiing and gardening.