Access data in Amazon S3 Tables using PyIceberg through the Amazon Glue Iceberg REST endpoint

Authors: Srividya Parthasarathy, Dylan Qu, Kalyan Kumar Neelampudi, Aritra Gupta

Modern data lakes integrate with multiple engines to meet a wide range of analytics needs, from SQL querying to stream processing. A key enabler of this approach is the adoption of Apache Iceberg as the open table format for building transactional data lakes. However, as the Iceberg ecosystem expands, the growing variety of engines and languages has introduced significant integration challenges. Each engine needs its own client to connect to catalogs, which are the components responsible for tracking table metadata and evolution. This fragmented approach forces developers to manage multiple catalog integrations, thereby increasing pipeline complexity. The lack of a standardized solution often leads to data silos, inconsistent features across engines, and slower progress in analytics modernization efforts.

The Amazon Glue Iceberg REST endpoint addresses these challenges by providing a standardized interface for interacting with Iceberg tables. Fully aligned with the Iceberg REST Catalog OpenAPI specification, the Glue Iceberg REST endpoint streamlines interoperability, enabling users to interact with Iceberg tables through a single, standardized set of REST APIs across engines, languages, and platforms. This, in conjunction with the enhanced performance of Amazon S3 Tables, automated table maintenance, and streamlined security features, provides a strong foundation for building and scaling data lakes on Amazon Web Services.

In this post, we demonstrate how to access Iceberg tables stored in S3 Tables using PyIceberg through the Glue Iceberg REST endpoint with Amazon Lake Formation managing storage credential vending. PyIceberg is the official Python client of the Iceberg project, which provides a lightweight solution for discovering and querying Iceberg tables with your preferred Python tools. The same data consumption flow can also be applied to other compute engines that support the Iceberg REST catalog specification, providing a consistent and efficient experience across platforms.

Solution overview

Solution overview diagram

This post walks through an example in which a user runs a local PyIceberg client to perform basic data operations, such as table creation, data ingestion, updates, and time travel. This workflow showcases a common development pattern where developers use local environments to prototype and iterate on data transformation pipelines before deploying them at scale.

We begin by creating a table bucket to store Iceberg tables. Then, we configure the PyIceberg client to interact with the Iceberg table through the Amazon Glue Iceberg REST endpoint. All permissions are centrally managed using Lake Formation.

Prerequisites

To follow along, you need the following setup:

1. You need access to an Amazon Identity and Access Management (IAM) role that is a Lake Formation data lake administrator in the Data Catalog account. For instructions, go to Create a data lake administrator.

2. Verify that you have Python version 3.7 or later installed, and that pip3 version 22.2.2 or higher is installed.

3. Install or update to the latest Amazon Web Services Command Line Interface (Amazon Web Services CLI). For instructions, see Installing or updating the latest version of the Amazon Web Services CLI. Run the Amazon Web Services configure command using the Amazon Web Services CLI to point to your Amazon Web Services account.

Walkthrough

The following steps walk you through this solution.

Step 1: Create a table bucket and enable Glue Data Catalog integration for S3 Tables:

1.1. Log in to the Amazon S3 console using Admin role and choose Table Buckets from the navigation panel, as shown in the following figure.

Table buckets

1.2. Choose Enable integration. When the integration completes successfully, you should see it enabled for your table buckets, as shown in the following figure.

Integration with AWS analytics services

1.3. From the left pane of Amazon S3, choose Table buckets. Choose Create table bucket.

1.4. Under Properties, enter the Table bucket name as pyiceberg-blog-bucket and choose Create table bucket.
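If you prefer to script this step, the table bucket can also be created programmatically. The following is a minimal boto3 sketch, not part of the original walkthrough; it assumes a recent boto3 version that includes the s3tables client and reuses the bucket name and Region from this post. Enabling the analytics services integration from step 1.2 is handled separately.

# Minimal sketch: create the table bucket with boto3 (assumes boto3 with s3tables support)
import boto3

s3tables = boto3.client("s3tables", region_name="us-east-2")

# Create the table bucket used throughout this post
response = s3tables.create_table_bucket(name="pyiceberg-blog-bucket")
print(response["arn"])  # ARN of the new table bucket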

Step 2: Create IAM role for the local PyIceberg environment

You need to create an IAM role for the PyIceberg script to access Glue and Lake Formation APIs. To do this, you create the following policy and role:

2.1. Create a policy named irc-glue-lf-policy by following these steps in the Amazon Web Services Management Console:

a. Sign in to the console.

b. Open the IAM console.

c. In the navigation pane of the console, choose Policies and choose the Create Policy option.

d. In the policy editor, choose JSON and paste the following policy.

i. Replace <region>, <account-id>, <s3_table_bucket_name>, and <database_name> in the following policy as per your Amazon Web Services Region, Amazon Web Services account ID, S3 table bucket name, and database name respectively. We use myblognamespace as the database name in the rest of this post.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "glue:GetCatalog",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<account-id>:catalog",
                "arn:aws:glue:<region>:<account-id>:catalog/s3tablescatalog",
                "arn:aws:glue:<region>:<account-id>:catalog/s3tablescatalog/<s3_table_bucket_name>",
                "arn:aws:glue:<region>:<account-id>:table/s3tablescatalog/<s3_table_bucket_name>/<database_name>/*",
                "arn:aws:glue:<region>:<account-id>:database/s3tablescatalog/<s3_table_bucket_name>/<database_name>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}

2.2. Create a role named pyiceberg-etl-role by following these steps in the IAM console.

a. In the navigation pane of the console, choose Roles and choose the Create role option.

b. Choose Custom trust policy and paste the following policy.

i. Replace <account-id> and <Admin_role> in the following policy with your Amazon Web Services account ID and your data lake administrator role name.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Amazon Web Services": "arn:aws:iam::<accountid>:role/<Admin_role>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}

c. Choose Next and choose the policy you previously created in Step 2.1, named irc-glue-lf-policy.

d. Choose Next and enter pyiceberg-etl-role as the role name.

e. Choose Create role.
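For repeatable setups, you can create the same policy and role with boto3 instead of the console. The following is a sketch only; it assumes you saved the policy JSON from step 2.1 as irc-glue-lf-policy.json and the trust policy as trust-policy.json in your working directory.

# Sketch: create the IAM policy and role programmatically (file names are assumptions)
import boto3

iam = boto3.client("iam")

with open("irc-glue-lf-policy.json") as f:
    policy_doc = f.read()
with open("trust-policy.json") as f:
    trust_doc = f.read()

# Create the permissions policy from step 2.1
policy = iam.create_policy(PolicyName="irc-glue-lf-policy", PolicyDocument=policy_doc)

# Create the role with the custom trust policy, then attach the permissions policy
iam.create_role(RoleName="pyiceberg-etl-role", AssumeRolePolicyDocument=trust_doc)
iam.attach_role_policy(RoleName="pyiceberg-etl-role", PolicyArn=policy["Policy"]["Arn"])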

After you create the role, you need to define access for it using Lake Formation.

Step 3: Define access control using Lake Formation

3.1. Application integration setup:

a. In Lake Formation, enable full table access for external engines to access data. This allows third-party applications to obtain Lake Formation temporary credentials using an IAM role that has full permissions (ALL) on the requested table.

i. Sign in as an admin user and go to Lake Formation.

ii. In the left pane, expand the Administration section.

iii. Choose Application integration settings and choose Allow external engines to access data in Amazon S3 locations with full table access.

iv. Choose Save, as shown in the following figure.

Application integration settings

3.2. Create a database:

a. In the Lake Formation console navigation pane, choose Databases under Data Catalog.

b. Choose Create database, and provide the name myblognamespace.

c. Choose the Data Catalog created in Step 1 (<accountid>:s3tablescatalog/pyiceberg-blog-bucket) from the drop-down.

d. Choose Create database.

e. Refresh the browser if the database does not show up.

After you create the database, you need to set up Lake Formation grants for the role used by the PyIceberg client to create and manage tables under this database. To do this, you provide database-level and table-level permissions to pyiceberg-etl-role.

3.3. Set up database level permissions:

Grant the following permissions to the pyiceberg-etl-role role on the resources described in the following steps.

a. In the Lake Formation console navigation pane, choose Data permissions, then choose Grant.

b. In the Principals section, choose the radio button IAM users and roles, and from the drop-down choose pyiceberg-etl-role.

c. In the LF-Tags or catalog resources section, choose Named Data Catalog resources:

i. Choose <accountid>:s3tablescatalog/pyiceberg-blog-bucket for Catalogs.

ii. Choose myblognamespace for Databases.

d. Choose CREATE TABLE, DESCRIBE for database permissions.

e. Choose Grant.

3.4. Set up table level permissions:

a. In the Lake Formation console navigation pane, choose Data permissions, then choose Grant.

b. In the Principals section, choose the radio button IAM users and roles, and from the drop-down choose pyiceberg-etl-role.

c. In the LF-Tags or catalog resources section, choose Named Data Catalog resources:

i. Choose <accountid>:s3tablescatalog/pyiceberg-blog-bucket for Catalogs.

ii. Choose myblognamespace for Databases.

iii. Choose ALL_TABLES for Tables.

d. Choose SUPER for table permissions.

e. Choose Grant.
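The grants in steps 3.3 and 3.4 can also be made through the Lake Formation API. The following boto3 sketch applies the same database and table permissions; the catalog ID format, account ID placeholder, and Region are assumptions that you should adjust for your environment.

# Sketch: Lake Formation grants for pyiceberg-etl-role (replace <account-id>)
import boto3

lf = boto3.client("lakeformation", region_name="us-east-2")
account_id = "<account-id>"
catalog_id = f"{account_id}:s3tablescatalog/pyiceberg-blog-bucket"
role_arn = f"arn:aws:iam::{account_id}:role/pyiceberg-etl-role"

# Database-level permissions: CREATE_TABLE and DESCRIBE
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={"Database": {"CatalogId": catalog_id, "Name": "myblognamespace"}},
    Permissions=["CREATE_TABLE", "DESCRIBE"],
)

# Table-level permissions: ALL (shown as SUPER in the console) on all tables
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={"Table": {"CatalogId": catalog_id, "DatabaseName": "myblognamespace", "TableWildcard": {}}},
    Permissions=["ALL"],
)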

Now that the permissions are set up, you can set up a local PyIceberg environment to use S3 Tables.

Step 4: Set up local PyIceberg environment

4.1. Install the following Python packages:

pip install "pyiceberg[pandas,pyarrow]"

pip install boto3

pip install tabulate

4.2. Configure the Amazon Web Services CLI to log in as the Admin user on your local machine and assume the pyiceberg-etl-role with the following command:

aws sts assume-role --role-arn "arn:aws:iam::<accountid>:role/pyiceberg-etl-role" --role-session-name pyiceberg-etl-role

4.3. Copy the credentials from the command output and replace the following placeholders to set the environment variables.

export AWS_ACCESS_KEY_ID='<AccessKeyId>'
export AWS_SECRET_ACCESS_KEY='<SecretAccessKey>'
export AWS_SESSION_TOKEN='<SessionToken>'
export AWS_DEFAULT_REGION='<region>'
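If you prefer not to copy credentials by hand, the following Python sketch performs the same assume-role call and sets the environment variables for the current process. The role name and Region match this post; replace <account-id> with your Amazon Web Services account ID.

# Sketch: assume pyiceberg-etl-role and export temporary credentials
import os
import boto3

creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::<account-id>:role/pyiceberg-etl-role",
    RoleSessionName="pyiceberg-etl-role",
)["Credentials"]

# Make the temporary credentials visible to PyIceberg and boto3 in this process
os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
os.environ["AWS_SESSION_TOKEN"] = creds["SessionToken"]
os.environ["AWS_DEFAULT_REGION"] = "us-east-2"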

Next, use the local PyIceberg setup to create a table, load data, and perform basic queries.

Step 5: Run PyIceberg script

In this post, we highlight the key steps incrementally. You can download the entire Python script and run python blogcode_pyiceberg.py --table customer.

First, we import some libraries and initialize constants that we use throughout our script.

#!/usr/bin/env python3
from pyiceberg.catalog import load_catalog
import os
import pyarrow as pa
import pandas as pd
from pyiceberg.expressions import EqualTo
import boto3
import json
import argparse
from botocore.exceptions import ProfileNotFound
from datetime import datetime
from tabulate import tabulate

# Constants
REGION = 'us-east-2'
CATALOG = 's3tablescatalog'
DATABASE = 'myblognamespace'
TABLE_BUCKET = 'pyiceberg-blog-bucket'

We initialize the catalog using the Glue Iceberg REST endpoint.

# Resolve the current account ID, which is part of the warehouse path
account_id = boto3.client('sts').get_caller_identity()['Account']

rest_catalog = load_catalog(
    CATALOG,
    **{
        "type": "rest",
        "warehouse": f"{account_id}:{CATALOG}/{TABLE_BUCKET}",
        "uri": f"https://glue.{REGION}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": REGION
    }
)
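As a quick check that the client can reach the endpoint and that the Lake Formation permissions are in place, you can list the namespaces and tables visible through the catalog:

# Sanity check: the namespace created in Lake Formation should appear here
print(rest_catalog.list_namespaces())      # for example, [('myblognamespace',)]
print(rest_catalog.list_tables(DATABASE))  # empty until the table is created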

We define a table schema.

def create_customer_schema() -> pa.Schema:
    """
    Create and return the PyArrow schema for customer table.
    """
    return pa.schema([
        pa.field('c_salutation', pa.string()),
        pa.field('c_preferred_cust_flag', pa.string()),
        pa.field('c_first_sales_date_sk', pa.int32()),
        pa.field('c_customer_sk', pa.int32()),
        pa.field('c_first_name', pa.string()),
        pa.field('c_email_address', pa.string())
    ])

Then, we create a table with this schema.

my_schema = create_customer_schema()

# Create the table; if it already exists, report the error and continue
try:
    rest_catalog.create_table(
        identifier=f"{database_name}.{table_name}",
        schema=my_schema
    )
    print("Table created successfully")
except Exception as e:
    print(f"Table creation note: {str(e)}")

We load sample customer data into the table.

def get_sample_customer_data() -> dict:
    return {
        "c_salutation": "Ms",
        "c_preferred_cust_flag": "NULL",
        "c_first_sales_date_sk": 2452736,
        "c_customer_sk": 1234,
        "c_first_name": "Mickey",
        "c_email_address": "mickey@email.com"
    }

# Load the table from the catalog, build a single-row Arrow table, and append it
table = rest_catalog.load_table(f"{database_name}.{table_name}")
sample_data = get_sample_customer_data()
df = pa.Table.from_pylist([sample_data], schema=my_schema)
table.append(df)

We query the table.

tabledata = table.scan(
    row_filter=EqualTo("c_first_name", "Mickey"),
    limit=10
).to_pandas()
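The tabulate package installed earlier can be used to print the result in a readable form:

# Print the query result as a formatted table
print(tabulate(tabledata, headers="keys", tablefmt="psql", showindex=False))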


Then, we update the value of the c_preferred_cust_flag column and display the change.

condition = tabledata['c_preferred_cust_flag'] == 'NULL'
tabledata.loc[condition, 'c_preferred_cust_flag'] = 'N'
df2 = pa.Table.from_pandas(tabledata, schema=my_schema)
table.overwrite(df2)
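Re-running the same scan confirms that the overwrite took effect:

# Read the row back; c_preferred_cust_flag should now be 'N'
updated = table.scan(row_filter=EqualTo("c_first_name", "Mickey")).to_pandas()
print(updated[["c_customer_sk", "c_first_name", "c_preferred_cust_flag"]])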


Finally, we display the snapshot history.

print("\n⏰ Performing Time Travel Operations...")
        customer_snapshots = table.snapshots()
        print_snapshot_info(customer_snapshots)
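Because each overwrite creates a new snapshot, you can also read the table as of an earlier snapshot. The following sketch queries the earliest snapshot, from before the c_preferred_cust_flag update (table.scan accepts a snapshot_id argument):

# Time travel sketch: read the table state at the earliest snapshot
if customer_snapshots:
    first_snapshot = customer_snapshots[0]
    old_state = table.scan(snapshot_id=first_snapshot.snapshot_id).to_pandas()
    print(old_state[["c_customer_sk", "c_preferred_cust_flag"]])  # still shows 'NULL'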

Snapshot history

Cleaning up

To clean up your resources, complete the following steps:

  1. Delete the Amazon S3 table.
  2. Delete the namespace.
  3. Delete the S3 table bucket.

Conclusion

In this post, we’ve showcased an example of how you can use PyIceberg to create, load, and query data in S3 Tables using the Amazon Glue Iceberg REST endpoint. Integrating your table buckets with the Amazon Glue Data Catalog (in preview) allows you to query and visualize data using Amazon Web Services Analytics services such as Amazon Athena, Amazon Redshift, and Amazon QuickSight, and open source clients such as PyIceberg.



Srividya Parthasarathy

Srividya Parthasarathy is a Senior Big Data Architect on the Amazon Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platforms. She enjoys building data mesh solutions and sharing them with the community.

Dylan Qu

Dylan Qu is a Specialist Solutions Architect focused on Big Data & Analytics with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on Amazon Web Services.

Kalyan Kumar Neelampudi

Kalyan Kumar Neelampudi (KK) is a Specialist Partner Solutions Architect at Amazon Web Services, focusing on Data Analytics and Generative AI from Austin, Texas. As a technical advisor to Amazon Web Services partners, KK helps architect and implement cutting-edge data analytics and AI/ML solutions, driving innovation in cloud technologies. When he's not architecting cloud solutions, KK maintains an active lifestyle, enjoying competitive sports like badminton and pickleball.

Aritra Gupta

Aritra Gupta is a Senior Technical Product Manager on the Amazon S3 team at Amazon Web Services. He helps customers build and scale multi-region architectures on Amazon S3. Based in Seattle, he likes to play chess and badminton in his spare time.

