Amazon Glue

Amazon Glue FAQs

General

Amazon Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. Amazon Glue provides all of the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes instead of months. Amazon Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the Amazon Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can create and run ETL workflows. Data analysts and data scientists can use Amazon Glue DataBrew to visually enrich, clean, and normalize data without writing code.

Amazon Glue consists of a Data Catalog, which is a central metadata repository; an ETL engine that can automatically generate Scala or Python code; a flexible scheduler that handles dependency resolution, job monitoring, and retries; and Amazon Glue DataBrew for cleaning and normalizing data with a visual interface. Together, these automate much of the undifferentiated heavy lifting involved with discovering, categorizing, cleaning, enriching, and moving data, so you can spend more time analyzing your data.

You should use Amazon Glue to discover properties of the data you own, transform it, and prepare it for analytics. Glue can automatically discover both structured and semi-structured data stored in your data lake on Amazon S3, your data warehouse in Amazon Redshift, and various databases running on Amazon Web Services. It provides a unified view of your data via the Glue Data Catalog that is available for ETL, querying, and reporting using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Glue automatically generates Scala or Python code for your ETL jobs that you can further customize using tools you are already familiar with. You can use Amazon Glue DataBrew to visually clean up and normalize data without writing code. Amazon Glue is serverless, so there are no compute resources to configure and manage.

Lake Formation leverages a shared infrastructure with Amazon Glue, including console controls, ETL code creation and job monitoring, a common data catalog, and a serverless architecture. While Amazon Glue is still focused on these types of functions, Lake Formation encompasses Amazon Glue features and provides additional capabilities designed to help build, secure, and manage a data lake. See the Amazon Lake Formation pages for more details.

Amazon Glue Data Catalog

The Amazon Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given data set, you can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time.

The Amazon Glue Data Catalog is Apache Hive Metastore compatible and is a drop-in replacement for the Apache Hive Metastore for Big Data applications running on Amazon EMR. For more information on setting up your EMR cluster to use the Amazon Glue Data Catalog as an Apache Hive Metastore, click here.

The Amazon Glue Data Catalog also provides out-of-the-box integration with Amazon EMR and Amazon Redshift Spectrum. Once you add your table definitions to the Glue Data Catalog, they are available for ETL and also readily available for querying in Amazon EMR and Amazon Redshift Spectrum, so that you have a common view of your data across these services.

Amazon Glue provides a number of ways to populate metadata into the Amazon Glue Data Catalog. Glue crawlers scan various data stores you own to automatically infer schemas and partition structure and populate the Glue Data Catalog with corresponding table definitions and statistics. You can also schedule crawlers to run periodically so that your metadata is always up to date and in sync with the underlying data. Alternatively, you can add and update table details manually by using the Amazon Glue Console or by calling the API. You can also run Hive DDL statements via a Hive client on an Amazon EMR cluster. Finally, if you already have a persistent Apache Hive Metastore, you can perform a bulk import of that metadata into the Amazon Glue Data Catalog by using our import script.

An Amazon Glue crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions. You can customize Glue crawlers to classify your own file types.
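The idea of walking a prioritized list of classifiers can be sketched in plain Python. This is only a toy model of the "first classifier that recognizes the data wins" behavior described above, not Glue's implementation; the classifier functions, their matching rules, and the `first_match` helper are all hypothetical names introduced for illustration.

```python
import csv
import json
from typing import Callable, List, Optional

# A classifier returns a format name when it recognizes the sample, else None.
Classifier = Callable[[str], Optional[str]]


def classify_json(sample: str) -> Optional[str]:
    """Recognize the sample if the whole string parses as JSON."""
    try:
        json.loads(sample)
        return "json"
    except ValueError:
        return None


def classify_csv(sample: str) -> Optional[str]:
    """Recognize the sample if it has 2+ rows sharing the same column count (>1)."""
    rows = list(csv.reader(sample.splitlines()))
    if len(rows) >= 2 and len(rows[0]) > 1 and all(len(r) == len(rows[0]) for r in rows):
        return "csv"
    return None


def first_match(sample: str, classifiers: List[Classifier]) -> str:
    """Try classifiers in priority order and stop at the first that matches."""
    for classifier in classifiers:
        result = classifier(sample)
        if result is not None:
            return result
    return "unknown"


# Hypothetical priority order: JSON is tried before CSV.
PRIORITIZED = [classify_json, classify_csv]
```

Calling `first_match(sample, PRIORITIZED)` on a small text sample returns the name of the first matching format, or `"unknown"` when no classifier in the list recognizes it. A custom classifier for your own file type would simply be another function placed earlier in the list.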

You simply run an ETL job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the Amazon Glue Data Catalog.

No. The Amazon Glue Data Catalog is Apache Hive Metastore compatible. You can point to the Glue Data Catalog endpoint and use it as an Apache Hive Metastore replacement. For more information on how to configure your cluster to use the Amazon Glue Data Catalog as an Apache Hive Metastore, please read our documentation here.

Amazon Glue Schema Registry

Amazon Glue Schema Registry, a serverless feature of Amazon Glue, enables you to validate and control the evolution of streaming data using schemas registered in Apache Avro and JSON Schema data formats, at no additional charge. Through Apache-licensed serializers and deserializers, the Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and Amazon Lambda. When data streaming applications are integrated with the Schema Registry, you can improve data quality and safeguard against unexpected changes using compatibility checks that govern schema evolution. Additionally, you can create or update Amazon Glue tables and partitions using Apache Avro schemas stored within the registry.

With the Amazon Glue Schema Registry, you can:

  1. Validate schemas. When data streaming applications are integrated with Amazon Glue Schema Registry, schemas used for data production are validated against schemas within a central registry, allowing you to centrally control data quality.
  2. Safeguard schema evolution. You can set rules on how schemas can and cannot evolve using one of eight compatibility modes.
  3. Improve data quality. Serializers validate schemas used by data producers against those stored in the registry, improving data quality when it originates and reducing downstream issues from unexpected schema drift.
  4. Save costs. Serializers convert data into a binary format and can compress it before it is delivered, reducing data transfer and storage costs.
  5. Improve processing efficiency. In many cases, a data stream contains records of different schemas. The Schema Registry enables applications that read from data streams to selectively process each record based on the schema without having to parse its contents, which increases processing efficiency.

The Schema Registry supports Apache Avro and JSON Schema data formats and Java client applications. We plan to continue expanding support for other data formats and non-Java clients. The Schema Registry integrates with applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and Amazon Lambda.

The following compatibility modes are available for you to manage your schema evolution: Backward, Backward All, Forward, Forward All, Full, Full All, None, and Disabled. Visit the Schema Registry user documentation to learn more about compatibility rules.
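To make the difference between these modes more concrete, here is a minimal sketch using a deliberately simplified schema model. It assumes a schema is just a mapping from field name to "has a default value", that a reader can decode a writer's data when every reader field is either present in the writer's schema or has a default, and that the "All" variants compare against every earlier version rather than only the latest one. Real Avro and JSON Schema resolution rules are richer than this, and `can_read` and `is_allowed` are hypothetical helpers, not part of any Schema Registry client.

```python
from enum import Enum
from typing import Dict, List

# Toy schema: field name -> whether the field declares a default value.
Schema = Dict[str, bool]


class Compatibility(Enum):
    NONE = "NONE"
    DISABLED = "DISABLED"
    BACKWARD = "BACKWARD"
    BACKWARD_ALL = "BACKWARD_ALL"
    FORWARD = "FORWARD"
    FORWARD_ALL = "FORWARD_ALL"
    FULL = "FULL"
    FULL_ALL = "FULL_ALL"


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if every reader field exists in the writer schema or has a default."""
    return all(name in writer or has_default for name, has_default in reader.items())


def is_allowed(mode: Compatibility, new: Schema, history: List[Schema]) -> bool:
    """Decide whether `new` may be registered after the versions in `history`."""
    if mode is Compatibility.DISABLED:
        return False  # assumption: versioning is switched off entirely
    if mode is Compatibility.NONE:
        return True   # no compatibility check is applied
    # "*_ALL" modes compare against every earlier version, others only the latest.
    previous = history if mode.name.endswith("_ALL") else history[-1:]
    base = mode.name.replace("_ALL", "")
    checks = {
        "BACKWARD": lambda old: can_read(new, old),   # new readers, old data
        "FORWARD": lambda old: can_read(old, new),    # old readers, new data
        "FULL": lambda old: can_read(new, old) and can_read(old, new),
    }
    return all(checks[base](old) for old in previous)
```

Under this model, adding a field with a default passes a Backward check, while adding a required field does not, because a reader on the new schema would find nothing to fill that field with when decoding older records.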

The Schema Registry storage and control plane is designed for high availability and is backed by the Amazon Glue SLA, and the serializers and deserializers leverage best-practice caching techniques to maximize schema availability within clients.

Amazon Glue Schema Registry storage is an Amazon Web Services service, while the serializers and deserializers are Apache-licensed open-source components.

Yes, your clients communicate with the Schema Registry via API calls that encrypt data in transit using TLS encryption over HTTPS. Schemas stored in the Schema Registry are always encrypted at rest using a service-managed KMS key.

You can use Amazon PrivateLink to connect your data producer’s VPC to Amazon Glue by defining an interface VPC endpoint for Amazon Glue. When you use a VPC interface endpoint, communication between your VPC and Amazon Glue is conducted entirely within the Amazon Web Services network. For more information, please visit the user documentation.

Amazon CloudWatch metrics are available as part of CloudWatch’s free tier. You can access these metrics in the CloudWatch Console. Visit the Amazon Glue Schema Registry user documentation for more information.

Yes, the Schema Registry supports both resource-level permissions and identity-based IAM policies.

Steps to migrate from a third-party schema registry to Amazon Glue Schema Registry are available in the user documentation.

Extract, transform, and load (ETL)

You can use either Scala or Python.

Amazon Glue’s ETL script recommendation system generates Scala or Python code. It leverages Glue’s custom ETL library to simplify access to data sources as well as manage job execution. You can find more details about the library in our documentation. You can write ETL code using Amazon Glue’s custom library or write arbitrary code in Scala or Python, either by editing inline in the Amazon Glue Console script editor or by downloading the auto-generated code and editing it in your own IDE. You can also start with one of the many samples hosted in our GitHub repository and customize that code.

Yes. You can import custom Python libraries and Jar files into your Amazon Glue ETL job. For more details, please check our documentation here.

Yes. You can write your own code using Amazon Glue’s ETL library, or write your own Scala or Python code and upload it to a Glue ETL job. For more details, please check our documentation here.

You can create and connect to development endpoints that offer ways to connect your notebooks and IDEs.

In addition to the ETL library and code generation, Amazon Glue provides a robust set of orchestration features that allow you to manage dependencies between multiple jobs to build end-to-end ETL workflows. Amazon Glue ETL jobs can either be triggered on a schedule or on a job completion event. Multiple jobs can be triggered in parallel or sequentially by triggering them on a job completion event. You can also trigger one or more Glue jobs from an external source such as an Amazon Lambda function.

Amazon Glue manages dependencies between two or more jobs or dependencies on external events using triggers. Triggers can watch one or more jobs as well as invoke one or more jobs. You can either have a scheduled trigger that invokes jobs periodically, an on-demand trigger, or a job completion trigger.
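The three trigger types can be illustrated as request payloads. The sketch below only builds dictionaries shaped like the keyword arguments that, to the best of my understanding, boto3's Glue `create_trigger` call accepts (`Name`, `Type`, `Schedule`, `Predicate`, `Actions`); it does not call any Amazon Web Services API. The helper names, trigger names, and job names are hypothetical.

```python
from typing import Any, Dict, List


def _actions(jobs: List[str]) -> List[Dict[str, str]]:
    """One action per job that the trigger should start."""
    return [{"JobName": job} for job in jobs]


def scheduled_trigger(name: str, cron: str, jobs: List[str]) -> Dict[str, Any]:
    """Trigger that starts jobs periodically on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": _actions(jobs),
    }


def on_demand_trigger(name: str, jobs: List[str]) -> Dict[str, Any]:
    """Trigger that starts jobs only when it is activated explicitly."""
    return {"Name": name, "Type": "ON_DEMAND", "Actions": _actions(jobs)}


def job_completion_trigger(name: str, watched: List[str], jobs: List[str]) -> Dict[str, Any]:
    """Trigger that starts jobs once every watched job has succeeded."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": "AND",
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in watched
            ],
        },
        "Actions": _actions(jobs),
    }


# Hypothetical pipeline: a nightly extract, then two loads started in parallel.
nightly = scheduled_trigger("nightly-extract", "0 2 * * ? *", ["extract-orders"])
fan_out = job_completion_trigger(
    "after-extract", ["extract-orders"], ["load-warehouse", "load-reports"]
)
```

Listing several jobs under `Actions` is how a single completion event fans out to parallel jobs, while chaining one completion trigger after another gives sequential execution.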

Amazon Glue monitors job event metrics and errors, and pushes all notifications to Amazon CloudWatch. With Amazon CloudWatch, you can configure a host of actions that can be triggered based on specific notifications from Amazon Glue. For example, if you get an error or a success notification from Glue, you can trigger an Amazon Lambda function. Glue also provides default retry behavior that will retry all failures three times before sending out an error notification.
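The retry-then-notify behavior can be pictured with a small local simulation. This is a sketch of the pattern only, not of Glue internals: it assumes "retry three times" means one initial attempt followed by up to three retries, and `run_with_retries`, `job`, and `notify` are hypothetical names.

```python
from typing import Callable


def run_with_retries(
    job: Callable[[], None],
    notify: Callable[[str], None],
    max_retries: int = 3,
) -> bool:
    """Run `job`, retrying on any failure, and notify once with the final outcome."""
    last_error = None
    # One initial attempt plus `max_retries` retries (an assumption, see above).
    for _attempt in range(1 + max_retries):
        try:
            job()
        except Exception as exc:  # any failure is retried
            last_error = exc
            continue
        notify("SUCCEEDED")
        return True
    # Only after the retries are exhausted is an error notification sent.
    notify(f"FAILED after {max_retries} retries: {last_error}")
    return False
```

In this model a transient failure that clears within the retry budget produces only a success notification, so downstream actions such as an Amazon Lambda function would not be invoked for errors that resolved on their own.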

Yes. You can run your existing Scala or Python code on Amazon Glue. Simply upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.
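As an illustration of reusing one script across jobs, the sketch below builds two job definitions that point at the same Amazon S3 script location. The dictionaries follow the shape that, as far as I know, boto3's Glue `create_job` call expects (`Name`, `Role`, `Command`, `DefaultArguments`), but nothing here contacts Amazon Web Services. The bucket, script path, role name, and job names are hypothetical, as is the `job_definition` helper.

```python
from typing import Any, Dict, Optional


def job_definition(
    name: str,
    role: str,
    script_location: str,
    extra_py_files: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a job definition whose code lives at `script_location` on Amazon S3."""
    definition: Dict[str, Any] = {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
    }
    if extra_py_files:
        # Optional custom Python libraries, given as Amazon S3 paths.
        definition["DefaultArguments"] = {"--extra-py-files": extra_py_files}
    return definition


# Hypothetical values used only for illustration.
SHARED_SCRIPT = "s3://example-bucket/scripts/clean_orders.py"
ROLE = "ExampleGlueRole"

jobs = [
    job_definition("clean-orders-daily", ROLE, SHARED_SCRIPT),
    job_definition(
        "clean-orders-backfill",
        ROLE,
        SHARED_SCRIPT,
        extra_py_files="s3://example-bucket/libs/helpers.zip",
    ),
]
```

Because both definitions reference the same script location, updating that one object on Amazon S3 changes the code that both jobs run.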

Amazon Glue ETL is batch oriented, and you can schedule your ETL jobs at intervals of no less than 5 minutes. While it can process micro-batches, it does not handle streaming data. If your use case requires you to ETL data while you stream it in, you can perform the first leg of your ETL using Amazon Kinesis Data Firehose, store the data in either Amazon S3 or Amazon Redshift, and then trigger a Glue ETL job to pick up that dataset and continue applying additional transformations to it.

No. While we do believe that using both the Amazon Glue Data Catalog and ETL provides an end-to-end ETL experience, you can use either one of them independently without using the other.

Amazon Glue DataBrew

Amazon Glue DataBrew is a visual data preparation tool that makes it easy for data analysts and data scientists to prepare data with an interactive, point-and-click visual interface without writing code. With Glue DataBrew, you can easily visualize, clean, and normalize terabytes, and even petabytes, of data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS. Amazon Glue DataBrew is generally available in the regions where Amazon Glue is available.

Amazon Glue DataBrew is built for users who need to clean and normalize data for analytics and machine learning. Data analysts and data scientists are the primary users. For data analysts, examples of job functions are business intelligence analysts, operations analysts, market intelligence analysts, legal analysts, financial analysts, economists, quants, or accountants. For data scientists, examples of job functions are materials scientists, bioanalytical scientists, and scientific researchers.

You can choose from over 250 built-in transformations to combine, pivot, and transpose the data without writing code. Amazon Glue DataBrew also automatically recommends transformations such as filtering anomalies, correcting invalid, incorrectly classified, or duplicate data, normalizing data to standard date and time values, or generating aggregates for analyses. For complex transformations, such as converting words to a common base or root word, Glue DataBrew provides transformations that use advanced machine learning techniques such as Natural Language Processing (NLP). You can group multiple transformations together, save them as recipes, and apply the recipes directly to the new incoming data.

For input data, Amazon Glue DataBrew supports commonly used file formats, such as comma-separated values (.csv), JSON and nested JSON, Apache Parquet and nested Apache Parquet, and Excel sheets. For output data, Amazon Glue DataBrew supports comma-separated values (.csv), JSON, Apache Parquet, Apache Avro, Apache ORC and XML.

Yes. Sign up for an Amazon Free Tier account, then visit the Amazon Glue DataBrew Management Console, and get started instantly for free. If you are a first-time user of Glue DataBrew, the first 40 interactive sessions are free. Visit the Amazon Glue Pricing page to learn more.

No. You can use Amazon Glue DataBrew without using either Amazon Glue Data Catalog or Amazon Lake Formation. If you use Glue Data Catalog to store schema and metadata, Glue DataBrew automatically infers schema from the Glue Data Catalog. If your data is centralized and secured in Amazon Lake Formation, DataBrew users can use all data sets available to them from its centralized data catalog.

Yes. You can visually track all the changes made to your data in the Amazon Glue DataBrew Management Console. The visual view makes it easy to trace the changes and relationships made to the datasets, projects and recipes, and all other associated jobs. In addition, Glue DataBrew keeps all account activities as logs in Amazon CloudTrail.

Amazon Web Services Product Integrations

Amazon Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. Amazon Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.

Amazon Database Migration Service (DMS) helps you migrate databases to Amazon Web Services easily and securely. For use cases that require a database migration from on-premises to Amazon Web Services or database replication between on-premises sources and sources on Amazon Web Services, we recommend you use Amazon DMS. Once your data is in Amazon Web Services, you can use Amazon Glue to move and transform data from your data source into another database or data warehouse, such as Amazon Redshift.

Pricing and billing

You will pay a simple monthly fee for storing and accessing the metadata in the Amazon Glue Data Catalog. Additionally, you will pay an hourly rate, billed per second, for ETL job and crawler runs, with a 10-minute minimum for each. If you choose to use a development endpoint to interactively develop your ETL code, you will pay an hourly rate, billed per second, for the time your development endpoint is provisioned, with a 10-minute minimum. For more details, please refer to our pricing page.
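As a rough worked example of per-second billing with a 10-minute minimum, the sketch below computes the billed amount for a single run. It assumes the hourly rate is charged per DPU (Data Processing Unit), and the rate passed in is a made-up placeholder rather than an actual price; consult the pricing page for real figures. The function names are hypothetical.

```python
MINIMUM_BILLED_SECONDS = 10 * 60  # the 10-minute minimum described above


def billed_dpu_hours(dpus: int, runtime_seconds: float) -> float:
    """DPU-hours billed for one run: per-second granularity with a 10-minute floor."""
    billed_seconds = max(runtime_seconds, MINIMUM_BILLED_SECONDS)
    return dpus * billed_seconds / 3600


def run_cost(dpus: int, runtime_seconds: float, rate_per_dpu_hour: float) -> float:
    """Cost of one run given a rate per DPU-hour (placeholder value, not a real price)."""
    return billed_dpu_hours(dpus, runtime_seconds) * rate_per_dpu_hour
```

For instance, a 2-minute run on 10 DPUs is billed as if it ran for 10 minutes, which is 10 × 600 / 3600 ≈ 1.67 DPU-hours, while a 30-minute run on the same capacity is billed for exactly 5 DPU-hours.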

Billing commences as soon as the job is scheduled for execution and continues until the entire job completes. With Amazon Glue, you only pay for the time for which your job runs and not for the environment provisioning or shutdown time.

Security and availability

We provide server side encryption for data at rest and SSL for data in motion.

Please refer to our documentation to learn more about service limits.

A development endpoint is provisioned with 5 DPUs by default. You can configure a development endpoint with a minimum of 2 DPUs and a maximum of 5 DPUs.

You can simply specify the number of DPUs (Data Processing Units) you want to allocate to your ETL job. A Glue ETL job requires a minimum of 2 DPUs. By default, Amazon Glue allocates 10 DPUs to each ETL job.
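The DPU defaults and bounds stated above can be captured in a small validation helper. This is an illustrative sketch only: the constants restate the numbers given in this FAQ, and the function names are hypothetical rather than part of any Glue SDK.

```python
from typing import Optional

# Values taken from the text above.
JOB_DEFAULT_DPUS = 10
JOB_MIN_DPUS = 2
ENDPOINT_DEFAULT_DPUS = 5
ENDPOINT_MIN_DPUS = 2
ENDPOINT_MAX_DPUS = 5


def resolve_job_dpus(requested: Optional[int] = None) -> int:
    """Return the DPUs for an ETL job, applying the default and the minimum."""
    if requested is None:
        return JOB_DEFAULT_DPUS
    if requested < JOB_MIN_DPUS:
        raise ValueError(f"an ETL job needs at least {JOB_MIN_DPUS} DPUs")
    return requested


def resolve_endpoint_dpus(requested: Optional[int] = None) -> int:
    """Return the DPUs for a development endpoint, applying the default and bounds."""
    if requested is None:
        return ENDPOINT_DEFAULT_DPUS
    if not ENDPOINT_MIN_DPUS <= requested <= ENDPOINT_MAX_DPUS:
        raise ValueError(
            f"a development endpoint needs {ENDPOINT_MIN_DPUS} to {ENDPOINT_MAX_DPUS} DPUs"
        )
    return requested
```

Checking the requested capacity locally like this makes an out-of-range value fail early, before a job or endpoint request is ever submitted.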

Amazon Glue provides the status of each job and pushes all notifications to Amazon CloudWatch. You can set up SNS notifications via CloudWatch actions to be informed of job failures or completions.

Service Level Agreement

Our Amazon Glue SLA guarantees a Monthly Uptime Percentage of at least 99.9% for Amazon Glue.

You are eligible for an SLA credit for Amazon Glue under the Amazon Glue SLA if more than one Availability Zone in which you are running a task, within the same region, has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle.