Accelerate data science feature engineering on transactional data lakes using Amazon Athena with Apache Iceberg
Combined with the capabilities of Athena, Apache Iceberg delivers a simplified workflow for data scientists to create new data features without needing to copy or recreate the entire dataset. You can create features using standard SQL on Athena without using any other service for feature engineering. Data scientists can reduce the time spent preparing and copying datasets, and instead focus on data feature engineering, experimentation, and analyzing data at scale.
In this post, we review the benefits of using Athena with the Apache Iceberg open table format and how it simplifies common feature engineering tasks for data scientists. We demonstrate how Athena can convert an existing table to Apache Iceberg format, then add columns, delete columns, and modify the data in the table without recreating or copying the dataset, and how you can use these capabilities to create new features on Apache Iceberg tables.
Solution overview
Data scientists are generally accustomed to working with large datasets. Datasets are usually stored in JSON, CSV, ORC, or Apache Parquet format, often in an Amazon S3 data lake.

Athena has introduced support for the Apache Iceberg open table format, which brings ACID (atomicity, consistency, isolation, durability) transaction capabilities to tables stored in Amazon S3.
The Athena Iceberg UPDATE operation writes Apache Iceberg position delete files and newly updated rows as data files in the same transaction. You can make record corrections via a single UPDATE statement.
With the release of Athena engine version 3, the capabilities for Apache Iceberg tables are enhanced with support for operations such as CREATE TABLE AS SELECT (CTAS) and MERGE.
Prerequisites
Set up an Athena workgroup with Athena engine version 3 to use CTAS and MERGE commands with an Apache Iceberg table. To upgrade your existing Athena engine to version 3 in your Athena workgroup, follow the instructions in Changing Athena engine versions in the Amazon Athena User Guide.
Dataset
For demonstration, we use an Apache Parquet table that contains several million records of randomly distributed fictitious sales data from the last several years, stored in an S3 bucket at s3://sample-iceberg-datasets-xxxxxxxxxxx/sampledb/orders_and_customers/.

The following table shows the layout of the customer_orders table.
| Column Name | Data Type | Description |
| --- | --- | --- |
| orderkey | string | Order number for the order |
| custkey | string | Customer identification number |
| orderstatus | string | Status of the order |
| totalprice | string | Total price of the order |
| orderdate | string | Date of the order |
| orderpriority | string | Priority of the order |
| clerk | string | Name of the clerk who processed the order |
| shippriority | string | Priority on the shipping |
| name | string | Customer name |
| address | string | Customer address |
| nationkey | string | Customer nation key |
| phone | string | Customer phone number |
| acctbal | string | Customer account balance |
| mktsegment | string | Customer market segment |
Perform feature engineering
As a data scientist, we want to perform feature engineering on this customer sales data. We start with the customer_orders table in the sampledb database, created using Athena as shown in the following DDL command. (You can use any of your existing datasets and follow the steps mentioned in this post.) The customer_orders dataset was generated and stored in the S3 bucket location s3://sample-iceberg-datasets-xxxxxxxxxxx/sampledb/orders_and_customers/ in Parquet format. This table is not an Apache Iceberg table.
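A CREATE EXTERNAL TABLE statement along these lines defines such a table. This is a sketch derived from the column layout described earlier; the bucket name is the placeholder used throughout this post.

```sql
-- Sketch: external Parquet table matching the layout described above
CREATE EXTERNAL TABLE sampledb.customer_orders (
  orderkey      string,
  custkey       string,
  orderstatus   string,
  totalprice    string,
  orderdate     string,
  orderpriority string,
  clerk         string,
  shippriority  string,
  name          string,
  address       string,
  nationkey     string,
  phone         string,
  acctbal       string,
  mktsegment    string
)
STORED AS PARQUET
LOCATION 's3://sample-iceberg-datasets-xxxxxxxxxxx/sampledb/orders_and_customers/';
```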
Validate the data in the table by running a query:
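For example, a quick sanity check such as the following returns a sample of rows:

```sql
SELECT * FROM sampledb.customer_orders LIMIT 10;
```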
We want to add new features to this table to get a deeper understanding of customer sales, which can result in faster model training and more valuable insights. To add new features to the dataset, convert the customer_orders Athena table to an Apache Iceberg table on Athena. Issue a CTAS query against the customer_orders table. While doing so, a new feature is added to capture the total purchase amount in the past year (the max year of the dataset) for each customer. In the following CTAS query, a new column named one_year_sales_aggregate of data type double is added with a default value of 0.0, and table_type is set to ICEBERG:
Issue the following query to verify the data in the Apache Iceberg table, with the new column one_year_sales_aggregate values set to 0.0:
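A verification query might look like the following (the Iceberg table name customer_orders_iceberg is an assumption, as above):

```sql
SELECT custkey, totalprice, one_year_sales_aggregate
FROM sampledb.customer_orders_iceberg
LIMIT 10;
```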
We want to populate the values for the new feature one_year_sales_aggregate in the dataset to get the total purchase amount for each customer based on their purchases in the past year (the max year of the dataset). Issue a MERGE query statement to the Apache Iceberg table using Athena to populate values for the one_year_sales_aggregate feature:
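A MERGE statement along these lines would do it. This sketch assumes the Iceberg table is named customer_orders_iceberg and that orderdate is stored as 'YYYY-MM-DD' strings:

```sql
-- Sketch: populate the per-customer total for the max year in the dataset
MERGE INTO sampledb.customer_orders_iceberg t
USING (
  SELECT custkey,
         SUM(CAST(totalprice AS double)) AS sales_aggregate
  FROM sampledb.customer_orders
  -- restrict to the max year; assumes orderdate strings begin with the year
  WHERE substr(orderdate, 1, 4) =
        (SELECT MAX(substr(orderdate, 1, 4)) FROM sampledb.customer_orders)
  GROUP BY custkey
) s
ON t.custkey = s.custkey
WHEN MATCHED THEN
  UPDATE SET one_year_sales_aggregate = s.sales_aggregate;
```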
Issue the following query to validate the updated value for total spend by each customer in the past year:
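For example (table name assumed as before):

```sql
SELECT DISTINCT custkey, one_year_sales_aggregate
FROM sampledb.customer_orders_iceberg
ORDER BY one_year_sales_aggregate DESC
LIMIT 10;
```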
We decide to add another feature to the existing Apache Iceberg table to compute and store the average purchase amount in the past year by each customer. Issue an ALTER query statement to add a new column to the existing table for the feature one_year_sales_average:
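With the assumed table name customer_orders_iceberg, the statement would look like this:

```sql
ALTER TABLE sampledb.customer_orders_iceberg
ADD COLUMNS (one_year_sales_average double);
```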
Before populating the values for this new feature, you can set its default value to 0.0. Using the same Apache Iceberg table on Athena, issue an UPDATE query statement to set the value of the new feature one_year_sales_average to 0.0:
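For example (table name assumed as before):

```sql
UPDATE sampledb.customer_orders_iceberg
SET one_year_sales_average = 0.0;
```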
Issue the following query to verify that the average spend value for each customer in the past year is set to 0.0:
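A check along these lines confirms the default (table name assumed as before):

```sql
SELECT custkey, one_year_sales_average
FROM sampledb.customer_orders_iceberg
LIMIT 10;
```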
Now we want to populate the values for the new feature one_year_sales_average in the dataset to get the average purchase amount for each customer based on their purchases in the past year (the max year of the dataset). Issue a MERGE query statement to the existing Apache Iceberg table on Athena to populate values for the one_year_sales_average feature:
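This MERGE mirrors the earlier one but uses AVG instead of SUM. As before, the table name customer_orders_iceberg and the 'YYYY-MM-DD' orderdate format are assumptions:

```sql
-- Sketch: populate the per-customer average for the max year in the dataset
MERGE INTO sampledb.customer_orders_iceberg t
USING (
  SELECT custkey,
         AVG(CAST(totalprice AS double)) AS sales_average
  FROM sampledb.customer_orders
  WHERE substr(orderdate, 1, 4) =
        (SELECT MAX(substr(orderdate, 1, 4)) FROM sampledb.customer_orders)
  GROUP BY custkey
) s
ON t.custkey = s.custkey
WHEN MATCHED THEN
  UPDATE SET one_year_sales_average = s.sales_average;
```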
Issue the following query to verify the updated values for average spend by each customer:
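For example (table name assumed as before):

```sql
SELECT DISTINCT custkey, one_year_sales_aggregate, one_year_sales_average
FROM sampledb.customer_orders_iceberg
LIMIT 10;
```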
Once additional data features have been added to the dataset, data scientists generally proceed to train ML models and make inferences using Amazon SageMaker or an equivalent toolset.
Conclusion
In this post, we demonstrated how to perform feature engineering using Athena with Apache Iceberg. We also demonstrated using the CTAS query to create an Apache Iceberg table on Athena from an existing dataset in Apache Parquet format, adding new features in an existing Apache Iceberg table on Athena using the ALTER query, and using UPDATE and MERGE query statements to update the feature values of existing columns.
We encourage you to use CTAS queries to create tables quickly and efficiently, and use the MERGE query statement to synchronize tables in one step to simplify data preparations and update tasks when transforming the features using Athena with Apache Iceberg. If you have comments or feedback, please leave them in the comments section.