AWS Lake Formation is a service that makes it easy to set up a highly secure data lake in days. Lake Formation also simplifies ongoing management of data lakes.
Build data lakes quickly
Once you specify where your existing databases are and provide your access credentials, Lake Formation reads the data and its metadata (schema) to understand the contents of the data source. Lake Formation then imports the data to your new data lake and records the metadata in a central catalog. With Lake Formation, you can import data from MySQL, PostgreSQL, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted in Amazon EC2. Both bulk and incremental data loading are supported.
You can use Lake Formation to move data from on-premises databases by connecting with Java Database Connectivity (JDBC). Identify your target sources and provide access credentials in the console, and Lake Formation reads and loads your data into the data lake. To import data from databases other than the ones listed above, you can create custom ETL jobs with AWS Glue.
Using Lake Formation, you can also pull in semi-structured and unstructured data from other S3 data sources. You can identify existing Amazon S3 buckets containing data to copy into your data lake. Once you specify the S3 path to register your data sources and authorize access, Lake Formation reads the data and its schema. Lake Formation can collect and organize data sets such as logs from AWS CloudTrail, Amazon CloudFront, and Elastic Load Balancing, as well as Detailed Billing Reports. You can also load your data into the data lake with Amazon Kinesis or Amazon DynamoDB using custom jobs.
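Registering an S3 path can also be done programmatically. The following is a minimal sketch, assuming the boto3 Lake Formation client and its register_resource call; the bucket name and prefix are hypothetical, and using the service-linked role is an assumption that may not suit every account.

```python
def build_register_request(bucket: str, prefix: str = "") -> dict:
    """Build keyword arguments for Lake Formation's RegisterResource call."""
    cleaned = prefix.strip("/")
    path = f"{bucket}/{cleaned}" if cleaned else bucket
    return {
        "ResourceArn": f"arn:aws:s3:::{path}",
        # Assumption: the account lets Lake Formation use its service-linked role.
        "UseServiceLinkedRole": True,
    }


def register_location(client, bucket: str, prefix: str = ""):
    """Register the S3 location; `client` is boto3.client("lakeformation")."""
    return client.register_resource(**build_register_request(bucket, prefix))
```

For example, register_location(boto3.client("lakeformation"), "example-logs", "cloudtrail/") would register the hypothetical path example-logs/cloudtrail.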
Lake Formation crawls and reads your data sources to extract technical metadata (like schema definitions) and creates a searchable catalog to describe this information for users, so they can discover available data sets. You can also add your own custom labels to your data, at the table and column level, to define attributes such as “sensitive information” and “European sales data.” Lake Formation provides a text-based search over this metadata, so your users can quickly find the data they need to analyze.
Lake Formation can perform transformations on your data, such as rewriting various date formats for consistency, to ensure that the data is stored in an analytics-friendly fashion. Lake Formation creates transformation templates and schedules jobs to prepare your data for analysis. Your data is transformed with AWS Glue and written in columnar formats, such as Parquet and ORC, for better performance. Less data needs to be read for analysis when data is organized into columns versus scanning entire rows. You can create custom transformation jobs with AWS Glue and Apache Spark to suit your specific requirements.
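As an illustration of what rewriting various date formats for consistency can mean, here is a minimal sketch in plain Python. It is not the transformation template Lake Formation generates; the list of accepted input formats is an assumption, and slash-separated dates are assumed to be month-first.

```python
from datetime import datetime

# Assumed input formats, tried in order; extend for your own sources.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%Y%m%d")


def normalize_date(raw: str) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD), whatever the input format."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # Not this format; try the next one.
    raise ValueError(f"unrecognized date format: {raw!r}")
```

A custom AWS Glue or Apache Spark job could apply the same idea column by column before writing the data out in a columnar format.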
Lake Formation helps clean and prepare your data for analysis by providing a Machine Learning Transform called FindMatches for deduplication and finding matching records. For example, use Lake Formation's FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows a “Joseph's Pizzeria” at “121 Main.” You don't need to know anything about machine learning to do this. FindMatches will just ask you to label sets of records as either “matching” or “not matching.” The system will then learn your criteria for calling a pair of records a “match” and will build an ML Transform that you can use to find duplicate records within a database or matching records across two databases.
Lake Formation also optimizes the partitioning of data in S3 to improve performance and reduce costs. Raw data that is loaded may be in partitions that are too small (requiring extra reads) or too large (reading more data than needed). With Lake Formation, your data is organized by size, time period, and/or relevant keys. This enables both fast scans and parallel, distributed reads for the most commonly used queries.
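Organizing data by time period and key is commonly expressed as a Hive-style key=value layout in the S3 object path, which lets query engines skip partitions a query does not need. The sketch below only illustrates that layout; it is not Lake Formation's own partitioning logic, and the table root and the region key are hypothetical.

```python
from datetime import date


def partition_prefix(table_root: str, event_date: date, region: str) -> str:
    """Build a Hive-style partition prefix: key first, then year/month/day."""
    return (
        f"{table_root.rstrip('/')}/region={region}/"
        f"year={event_date:%Y}/month={event_date:%m}/day={event_date:%d}/"
    )
```

With this layout, a query filtered to one region and one month only has to list and read the objects under the matching prefix.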
Simplify security management
Lake Formation leverages the encryption capabilities of S3 for data in your data lake. This approach provides automatic server-side encryption with keys managed by the AWS Key Management Service (KMS). S3 encrypts data in transit when replicating across Regions, and lets you use separate accounts for the source and destination Regions to protect against malicious insider deletions. These encryption capabilities provide a secure foundation for all data in your data lake.
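Server-side encryption with a KMS key can be set as the default for the bucket that backs the data lake. The following is a minimal sketch, assuming the boto3 S3 client and its put_bucket_encryption call; the bucket name and key ARN are placeholders you would supply.

```python
def build_default_encryption(bucket: str, kms_key_arn: str) -> dict:
    """Build keyword arguments for S3's PutBucketEncryption call (SSE-KMS)."""
    return {
        "Bucket": bucket,
        "ServerSideEncryptionConfiguration": {
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": kms_key_arn,
                    }
                }
            ]
        },
    }


def enable_default_encryption(s3_client, bucket: str, kms_key_arn: str):
    """Apply the setting; `s3_client` is boto3.client("s3")."""
    return s3_client.put_bucket_encryption(
        **build_default_encryption(bucket, kms_key_arn)
    )
```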
Lake Formation provides central access controls for data in your data lake. You can define security policy-based rules for your users and applications by role in Lake Formation, and integration with AWS IAM authenticates those users and roles. Once the rules are defined, Lake Formation enforces your access controls at table- and column-level granularity for users of Amazon Redshift Spectrum and Amazon Athena. AWS Glue access is enforced at the table level and is typically for administrators only. EMR integration (in beta) supports authorizing Active Directory, Okta, and Auth0 users for EMR Notebooks and Zeppelin notebooks connected to EMR clusters.
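A column-level rule of the kind described above can be sketched as a single grant. This example assumes the boto3 Lake Formation client and its grant_permissions call; the role ARN, database, table, and column names are hypothetical.

```python
def build_column_grant(principal_arn: str, database: str, table: str, columns) -> dict:
    """Build keyword arguments granting SELECT on specific columns of a table."""
    columns = list(columns)
    if not columns:
        raise ValueError("at least one column is required for a column-level grant")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_columns(client, principal_arn, database, table, columns):
    """Issue the grant; `client` is boto3.client("lakeformation")."""
    return client.grant_permissions(
        **build_column_grant(principal_arn, database, table, columns)
    )
```

The same request shape can be passed to the client's revoke_permissions call to withdraw the access later.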
Lake Formation provides comprehensive audit logs with CloudTrail to monitor access and show compliance with centrally defined policies. You can audit data access history across analytics and machine learning services that read the data in your data lake via Lake Formation. This lets you see which users or roles have attempted to access what data, with which services, and when. You can access audit logs in the same way you access any other CloudTrail logs using the CloudTrail APIs and Console.
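Reading those audit records through the CloudTrail API might look like the sketch below. It assumes the boto3 CloudTrail client and its lookup_events call, and it assumes that filtering on the lakeformation.amazonaws.com event source is the view you want; which events appear depends on how the data was accessed.

```python
from datetime import datetime, timedelta, timezone


def build_audit_lookup(hours: int, now: datetime = None) -> dict:
    """Build keyword arguments for CloudTrail's LookupEvents call."""
    end = now or datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": "lakeformation.amazonaws.com"}
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
    }


def summarize(events) -> list:
    """Reduce LookupEvents results to (who, what, when) tuples."""
    return [(e.get("Username"), e["EventName"], e["EventTime"]) for e in events]


def recent_access(cloudtrail_client, hours: int = 24) -> list:
    """Fetch one page of events; `cloudtrail_client` is boto3.client("cloudtrail")."""
    response = cloudtrail_client.lookup_events(**build_audit_lookup(hours))
    return summarize(response.get("Events", []))
```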
Provide self-service access to data
Lake Formation lets you designate data owners, such as data stewards and business units, by adding a field in table properties as custom attributes. Your owners can augment the technical metadata with business metadata that further defines appropriate uses for the data. You can specify appropriate use cases and label the sensitivity of your data for enforcement by using Lake Formation security and access controls.
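Adding owner attributes to table properties can be sketched as merging extra key-value pairs into a table definition's Parameters map. The example assumes the table lives in the AWS Glue Data Catalog and is updated with the boto3 Glue client's update_table call; the attribute names data_owner and sensitivity are hypothetical.

```python
def with_owner_attributes(table_input: dict, attributes: dict) -> dict:
    """Return a copy of a Glue TableInput with custom attributes merged in."""
    updated = dict(table_input)
    # Existing parameters are kept; new attributes win on key collisions.
    updated["Parameters"] = {**table_input.get("Parameters", {}), **attributes}
    return updated


def tag_table(glue_client, database: str, table_input: dict, attributes: dict):
    """Write the change back; `glue_client` is boto3.client("glue")."""
    return glue_client.update_table(
        DatabaseName=database,
        TableInput=with_owner_attributes(table_input, attributes),
    )
```

Note that a definition read back with get_table contains read-only fields, so it would need to be trimmed to the fields TableInput accepts before being passed here.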
Lake Formation facilitates requesting and vending access to datasets to give your users self-service access to the data lake for a variety of analytics use cases. You can specify, grant, and revoke permissions on tables defined in the central data catalog. The same data catalog is available for multiple accounts, groups, and services.
With Lake Formation, your users enjoy online, text-based search and filtering of data sets recorded in the central data catalog. They can search for relevant data by name, contents, sensitivity, or any other custom labels you have defined.
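A catalog search can also be issued from code. This is a minimal sketch, assuming the boto3 Glue client's search_tables call against the central catalog; the label keys are hypothetical, and whether a given custom label can be used as a filter key is an assumption to verify against your own catalog.

```python
def build_search(text: str, labels: dict = None, max_results: int = 25) -> dict:
    """Build keyword arguments for Glue's SearchTables call."""
    request = {"SearchText": text, "MaxResults": max_results}
    if labels:
        # One equality predicate per label, in a stable (sorted) order.
        request["Filters"] = [
            {"Key": key, "Value": value, "Comparator": "EQUALS"}
            for key, value in sorted(labels.items())
        ]
    return request


def find_tables(glue_client, text: str, labels: dict = None) -> list:
    """Return the names of matching tables; `glue_client` is boto3.client("glue")."""
    response = glue_client.search_tables(**build_search(text, labels))
    return [table["Name"] for table in response.get("TableList", [])]
```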
With Lake Formation, you can give your analytics users the ability to directly query datasets with Athena for SQL, Redshift for data warehousing, and (in beta) EMR for Apache Spark-based big data processing and machine learning (for EMR Notebooks and Zeppelin notebooks). Once you point these services to Lake Formation, the data sets available are shown in the catalog and access controls are enforced consistently, allowing your users to readily combine analytics approaches on the same data.
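Querying a catalog table with Athena might look like the following sketch. It assumes the boto3 Athena client and its start_query_execution call; the database, table, and results location are hypothetical, and the rows returned depend on the permissions granted to the caller.

```python
def build_athena_query(database: str, sql: str, output_location: str) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(athena_client, database: str, sql: str, output_location: str) -> str:
    """Start the query and return its execution ID.

    `athena_client` is boto3.client("athena").
    """
    response = athena_client.start_query_execution(
        **build_athena_query(database, sql, output_location)
    )
    return response["QueryExecutionId"]
```

For example, run_query(client, "sales_db", "SELECT order_id, total FROM orders LIMIT 10", "s3://example-athena-results/") starts the query, and the execution ID can then be polled for status and results.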