Amazon Lake Formation

Amazon Lake Formation features

Overview

Amazon Lake Formation is a service that makes it easy to set up a highly secure data lake in days. Lake Formation also simplifies ongoing management of data lakes.

Build data lakes quickly

Import data from databases already in Amazon Web Services

Once you specify where your existing databases are and provide your access credentials, Lake Formation reads the data and its metadata (schema) to understand the contents of the data source. It then imports the data into your new data lake and records the metadata in a central catalog. You can import data from MySQL, PostgreSQL, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted on Amazon EC2. Both bulk and incremental data loading are supported.
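The difference between bulk and incremental loading can be illustrated with a bookmark (high-water mark) on a change-tracking column. This is a conceptual sketch only: the page does not describe how Lake Formation tracks progress internally, and the function name, the `updated_at` column, and the sample rows are assumptions made for illustration.

```python
from datetime import datetime


def select_new_rows(rows, bookmark_column, last_bookmark):
    """Return (rows newer than last_bookmark, updated bookmark)."""
    new_rows = [row for row in rows if row[bookmark_column] > last_bookmark]
    new_bookmark = max(
        (row[bookmark_column] for row in new_rows), default=last_bookmark
    )
    return new_rows, new_bookmark


# Hypothetical source rows, for illustration only.
orders = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 5)},
]

# Bulk load: a bookmark older than every row selects the whole table.
bulk_rows, bookmark = select_new_rows(orders, "updated_at", datetime.min)

# Incremental load: only rows changed since the stored bookmark are selected.
orders.append({"order_id": 3, "updated_at": datetime(2024, 1, 9)})
delta_rows, bookmark = select_new_rows(orders, "updated_at", bookmark)
```

Storing the bookmark between runs is what keeps the second load from re-reading rows that were already imported.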

Import data from other external sources

You can use Lake Formation to move data from on-premises databases by connecting with Java Database Connectivity (JDBC). Identify your target sources and provide access credentials in the console, and Lake Formation reads and loads your data into the data lake. To import data from databases other than the ones listed above, you can create custom ETL jobs with Amazon Glue.

Import data from other Amazon Web Services services

Using Lake Formation, you can also pull in semi-structured and unstructured data from other Amazon S3 data sources. You can identify existing Amazon S3 buckets containing data to copy into your data lake. Once you specify the S3 path to register your data sources and authorize access, Lake Formation reads the data and its schema. Lake Formation can collect and organize data sets such as logs from Amazon CloudTrail, Amazon CloudFront, Detailed Billing Reports, and Amazon Elastic Load Balancing. You can also load your data into the data lake with Amazon Kinesis or Amazon DynamoDB using custom jobs.
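Registering an S3 path can also be done programmatically. The sketch below only builds the arguments for the Lake Formation `RegisterResource` API call (`register_resource` in boto3); it does not contact any service. The bucket name and prefix are hypothetical, and the `partition` parameter is an assumption added so the same helper can produce ARNs for the China regions (`aws-cn`).

```python
def build_register_request(bucket, prefix="", partition="aws"):
    """Build keyword arguments for registering an S3 location."""
    clean_prefix = prefix.strip("/")
    path = f"{bucket}/{clean_prefix}" if clean_prefix else bucket
    return {
        "ResourceArn": f"arn:{partition}:s3:::{path}",
        # Let Lake Formation use its service-linked role to read the location.
        "UseServiceLinkedRole": True,
    }


# Hypothetical bucket and prefix, for illustration only.
request = build_register_request("example-data-lake", "raw/cloudtrail/")

# With credentials configured, the request could be sent like this:
# import boto3
# boto3.client("lakeformation").register_resource(**request)
```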

Catalog and label your data

Lake Formation crawls and reads your data sources to extract technical metadata (such as schema definitions) and creates a searchable catalog that describes this information, so users can discover available data sets. You can also add your own custom labels to your data, at the table and column level, to define attributes such as “sensitive information” and “European sales data.” Lake Formation provides a text-based search over this metadata, so your users can quickly find the data they need to analyze.
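To make the idea of text search over table-level and column-level labels concrete, here is a small in-memory sketch. It is not the Lake Formation search implementation; the catalog structure, table names, and the `search_catalog` helper are assumptions made for illustration.

```python
def search_catalog(tables, term):
    """Return names of tables whose name, columns, or labels contain term."""
    needle = term.lower()
    hits = []
    for table in tables:
        searchable = [table["name"], *table.get("labels", [])]
        for column in table.get("columns", []):
            searchable.append(column["name"])
            searchable.extend(column.get("labels", []))
        if any(needle in text.lower() for text in searchable):
            hits.append(table["name"])
    return hits


# Hypothetical catalog entries, for illustration only.
catalog = [
    {
        "name": "eu_sales",
        "labels": ["European sales data"],
        "columns": [
            {"name": "customer_email", "labels": ["sensitive information"]},
            {"name": "amount"},
        ],
    },
    {"name": "web_logs", "columns": [{"name": "request_path"}]},
]

sensitive_tables = search_catalog(catalog, "sensitive")
```

Here a label attached to a single column is enough to surface the whole table in the results.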

Transform data

Lake Formation can perform transformations on your data, such as rewriting various date formats for consistency, to ensure that the data is stored in an analytics-friendly fashion. Lake Formation creates transformation templates and schedules jobs to prepare your data for analysis. Your data is transformed with Amazon Glue and written in columnar formats, such as Parquet and ORC, for better performance. Because analytic queries usually touch only a few columns, a columnar layout lets the query engine read less data than it would by scanning entire rows. You can create custom transformation jobs with Amazon Glue and Apache Spark to suit your specific requirements.
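As an example of the kind of transformation mentioned above, the following sketch rewrites dates that arrive in several formats into ISO 8601. It is plain Python rather than a Glue job, and the list of accepted formats is an assumption; a real job would use whatever formats appear in the source data.

```python
from datetime import datetime

# Assumed input formats; extend to match the actual source data.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y", "%Y%m%d"]


def normalize_date(text):
    """Return the date in ISO 8601 form (YYYY-MM-DD)."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # Try the next known format.
    raise ValueError(f"unrecognized date format: {text!r}")


raw_dates = ["2024-03-07", "03/07/2024", "07.03.2024", "20240307"]
normalized = [normalize_date(value) for value in raw_dates]
```

Ambiguous inputs such as `03/07/2024` are resolved by the order of `KNOWN_FORMATS`, so the list should reflect the convention each source actually uses.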

Clean and deduplicate data

Lake Formation helps clean and prepare your data for analysis by providing a Machine Learning Transform called FindMatches for deduplication and finding matching records. For example, use Lake Formation's FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows a “Joseph's Pizzeria” at “121 Main.” You don't need to know anything about machine learning to do this. FindMatches will just ask you to label sets of records as either “matching” or “not matching.” The system will then learn your criteria for calling a pair of records a “match” and will build an ML Transform that you can use to find duplicate records within a database or matching records across two databases.
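The restaurant example shows why exact comparison is not enough for deduplication. The sketch below is not FindMatches and involves no machine learning; it uses a naive string similarity from Python's standard library purely to illustrate fuzzy record comparison. The `record_score` helper and its equal weighting of name and address are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(first, second):
    """Average the name and address similarity of two records."""
    name_score = similarity(first["name"], second["name"])
    address_score = similarity(first["address"], second["address"])
    return (name_score + address_score) / 2


record_a = {"name": "Joe's Pizza", "address": "121 Main St."}
record_b = {"name": "Joseph's Pizzeria", "address": "121 Main"}

exact_match = record_a == record_b          # False: the strings differ.
fuzzy_score = record_score(record_a, record_b)  # Between 0 and 1.
```

A fixed threshold on such a score is brittle, which is the gap a transform trained on labeled “matching” and “not matching” examples is meant to close.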

Optimize partitions

Lake Formation also optimizes the partitioning of data in S3 to improve performance and reduce costs. Raw data that is loaded may be in partitions that are too small (requiring extra reads) or too large (reading more data than needed). With Lake Formation, your data is organized by size, time period, relevant keys, or a combination of these. This enables both fast scans and parallel, distributed reads for the most commonly used queries.
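Partitioning by time period and key is commonly expressed in S3 as `key=value` path segments (the Hive-style convention). The page does not state which layout Lake Formation produces, so treat the layout below, the `partition_prefix` helper, and the `region` key as assumptions.

```python
from datetime import date


def partition_prefix(table, event_date, region=None):
    """Build a Hive-style S3 key prefix for one daily partition."""
    parts = [table]
    if region is not None:
        parts.append(f"region={region}")
    parts.extend(
        [
            f"year={event_date.year:04d}",
            f"month={event_date.month:02d}",
            f"day={event_date.day:02d}",
        ]
    )
    return "/".join(parts) + "/"


daily_prefix = partition_prefix("sales", date(2024, 3, 7))
regional_prefix = partition_prefix("sales", date(2024, 3, 7), region="eu")
```

A query that filters on a single day then only needs to list and read objects under one prefix instead of scanning the whole table.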

Simplify security management

Enforce encryption

Lake Formation leverages the encryption capabilities of S3 for data in your data lake. This approach provides automatic server-side encryption with keys managed by the Amazon Key Management Service (KMS). S3 encrypts data in transit when replicating across regions, and lets you use separate accounts for source and destination China regions to protect against malicious insider deletions. These encryption capabilities provide a secure foundation for all data in your data lake.
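Server-side encryption with a KMS key is configured on the S3 bucket itself. The sketch below builds the configuration document accepted by the S3 `PutBucketEncryption` API (`put_bucket_encryption` in boto3) without calling the service. The key ARN and bucket name are hypothetical placeholders.

```python
def build_default_encryption(kms_key_arn):
    """Build a bucket default-encryption configuration that uses SSE-KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


# Hypothetical key ARN, for illustration only.
encryption_config = build_default_encryption(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
)

# With credentials configured, the configuration could be applied like this:
# import boto3
# boto3.client("s3").put_bucket_encryption(
#     Bucket="example-data-lake",
#     ServerSideEncryptionConfiguration=encryption_config,
# )
```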

Define and manage access controls

Lake Formation provides central access controls for data in your data lake. You can define security policy-based rules for your users and applications by role in Lake Formation, and integration with Amazon IAM authenticates those users and roles. Once the rules are defined, Lake Formation enforces your access controls at table-level and column-level granularity for users of Amazon Redshift Spectrum and Amazon Athena. Amazon Glue access is enforced at the table level and is typically for administrators only. EMR integration (in beta) supports authorizing Active Directory, Okta, and Auth0 users for EMR Notebooks and Zeppelin notebooks connected to EMR clusters.
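A column-level rule can be expressed as a grant of `SELECT` on specific columns of a table. The sketch below builds the arguments for the Lake Formation `GrantPermissions` API (`grant_permissions` in boto3) and does not call the service. The role ARN, database, table, and column names are hypothetical.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build a request granting SELECT on selected columns of one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principal and table, for illustration only.
grant = build_column_grant(
    "arn:aws:iam::111122223333:role/analyst",
    "sales_db",
    "eu_sales",
    ["order_id", "amount"],
)

# With credentials configured, the grant could be submitted like this:
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
```

Columns left out of `ColumnNames`, such as one holding customer email addresses, are not covered by this grant.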

Implement audit logging

Lake Formation provides comprehensive audit logs with CloudTrail to monitor access and show compliance with centrally defined policies. You can audit data access history across analytics and machine learning services that read the data in your data lake via Lake Formation. This lets you see which users or roles have attempted to access what data, with which services, and when. You can access audit logs in the same way you access any other CloudTrail logs, using the CloudTrail APIs and console.
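Once CloudTrail records have been retrieved, answering "who accessed what, and when" is a matter of filtering them. The sketch below works on already-parsed records and uses standard CloudTrail record fields (`eventSource`, `eventName`, `eventTime`, `userIdentity`, `errorCode`). The sample records, including the event name and error code values, are invented for illustration and are not real log output.

```python
def summarize_access(records, event_source="lakeformation.amazonaws.com"):
    """Reduce CloudTrail records from one service to who/what/when entries."""
    summary = []
    for record in records:
        if record.get("eventSource") != event_source:
            continue
        summary.append(
            {
                "who": record.get("userIdentity", {}).get("arn"),
                "what": record.get("eventName"),
                "when": record.get("eventTime"),
                # CloudTrail adds errorCode only when the call failed.
                "failed": "errorCode" in record,
            }
        )
    return summary


# Hypothetical records, for illustration only.
sample_records = [
    {
        "eventSource": "lakeformation.amazonaws.com",
        "eventName": "GetDataAccess",
        "eventTime": "2024-03-07T10:15:00Z",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:role/analyst"},
    },
    {
        "eventSource": "s3.amazonaws.com",
        "eventName": "GetObject",
        "eventTime": "2024-03-07T10:16:00Z",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:role/etl"},
    },
    {
        "eventSource": "lakeformation.amazonaws.com",
        "eventName": "GetDataAccess",
        "eventTime": "2024-03-07T10:20:00Z",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:role/intern"},
        "errorCode": "AccessDenied",
    },
]

audit_trail = summarize_access(sample_records)
```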

Provide self-service access to data

Label your data with business metadata

Lake Formation lets you designate data owners, such as data stewards and business units, by adding a field to the table properties as a custom attribute. Your owners can augment the technical metadata with business metadata that further defines appropriate uses for the data. You can specify appropriate use cases and label the sensitivity of your data for enforcement by using Lake Formation security and access controls.
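Table properties are key-value pairs, so adding an owner or a sensitivity label amounts to merging new keys into that map. The sketch below shows the merge on a plain dictionary; the property names `data_owner` and `sensitivity` and the sample values are assumptions, not names defined by Lake Formation.

```python
def with_business_metadata(table_properties, owner, sensitivity):
    """Return a copy of the table properties with business metadata added."""
    updated = dict(table_properties)  # Leave the caller's dictionary untouched.
    updated["data_owner"] = owner
    updated["sensitivity"] = sensitivity
    return updated


# Hypothetical technical properties, for illustration only.
technical_properties = {"classification": "parquet"}

labeled_properties = with_business_metadata(
    technical_properties, owner="eu-sales-stewards", sensitivity="confidential"
)
```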

Enable self-service access

Lake Formation facilitates requesting and vending access to datasets to give your users self-service access to the data lake for a variety of analytics use cases. You can specify, grant, and revoke permissions on tables defined in the central data catalog. The same data catalog is available for multiple accounts, groups, and services.

Discover relevant data for analysis

With Lake Formation, your users get online, text-based search and filtering of the data sets recorded in the central data catalog. They can search for relevant data by name, contents, sensitivity, or any other custom labels you have defined.

Combine analytics approaches for more insights

With Lake Formation, you can give your analytics users the ability to directly query datasets with Athena for SQL, Redshift for data warehousing, and (in beta) EMR for Apache Spark-based big data processing and machine learning (for EMR Notebooks and Zeppelin notebooks). Once you point these services to Lake Formation, the data sets available are shown in the catalog and access controls are enforced consistently, allowing your users to readily combine analytics approaches on the same data.
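From the analyst's side, querying a cataloged table with Athena is an ordinary SQL request. The sketch below builds the arguments for the Athena `StartQueryExecution` API (`start_query_execution` in boto3) without calling the service. The database, table, and results bucket are hypothetical.

```python
def build_athena_query(sql, database, output_location):
    """Build a request that runs one SQL statement against a catalog database."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical query and locations, for illustration only.
query = build_athena_query(
    "SELECT order_id, amount FROM eu_sales LIMIT 10",
    database="sales_db",
    output_location="s3://example-query-results/athena/",
)

# With credentials configured, the query could be started like this:
# import boto3
# boto3.client("athena").start_query_execution(**query)
```

If the caller has only been granted some columns of the table, selecting any other column is expected to be rejected by the access controls described earlier.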
