What does this Amazon Web Services Solution do?

Many Amazon Web Services customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is an increasingly popular way to store and analyze data because it lets companies manage multiple data types from a wide variety of sources and store this data, both structured and unstructured, in a centralized repository.

The Amazon Web Services Cloud provides many of the building blocks required to help customers implement a secure, flexible, and cost-effective data lake. These include Amazon Web Services managed services that help ingest, store, find, process, and analyze both structured and unstructured data. To support our customers as they build data lakes, Amazon Web Services offers the data lake solution, which is an automated reference implementation that deploys a highly available, cost-effective data lake architecture on the Amazon Web Services Cloud along with a user-friendly console for searching and requesting datasets.

Version 2.3 of the solution uses the most up-to-date Node.js runtime. Version 2.1 uses the Node.js 8.10 runtime, which reached end-of-life on December 31, 2019. To upgrade to version 2.3, you must deploy the solution as a new stack.

Note: This solution is designed for the Amazon Web Services services that are available in the China Regions.

Amazon Web Services Solution Overview

Amazon Web Services offers a data lake solution that automatically configures the core Amazon Web Services services necessary to easily tag, search, share, transform, analyze, and govern specific subsets of data across a company or with other external users. The solution deploys a console that users can access to search and browse available datasets for their business needs. The solution also includes a federated template that allows you to launch a version of the solution that is ready to integrate with Microsoft Active Directory.

The diagram below presents the data lake architecture you can deploy in minutes using the solution's implementation guide and accompanying Amazon CloudFormation template.

Data Lake on Amazon Web Services Solution Architecture

The Amazon CloudFormation template configures the solution's core Amazon Web Services services, which include a suite of Amazon Lambda microservices (functions), Amazon Elasticsearch Service for robust search capabilities, the open-source software Keycloak for user authentication, Amazon Glue for data transformation, and Amazon Athena for analysis.


The solution leverages the security, durability, and scalability of Amazon S3 to manage a persistent catalog of organizational datasets, and Amazon DynamoDB to manage the corresponding metadata. Once a dataset is cataloged, its attributes and descriptive tags become searchable. Users can search and browse available datasets in the solution console and create a list of the data they need access to.
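The catalog-and-search idea described above can be illustrated with a minimal sketch. It assumes an in-memory list standing in for the metadata store and search index; the `Dataset` record, its field names, the `search` function, and the sample entries are all hypothetical and are not taken from the solution's source code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """Hypothetical catalog record; the field names are illustrative only."""
    dataset_id: str
    name: str
    tags: Dict[str, str] = field(default_factory=dict)


def search(catalog: List[Dataset], term: str) -> List[Dataset]:
    """Return datasets whose name, tag keys, or tag values contain the term.

    Matching is a case-insensitive substring test, a simple stand-in for
    what a real search index would do.
    """
    needle = term.lower()
    hits = []
    for ds in catalog:
        haystack = [ds.name, *ds.tags.keys(), *ds.tags.values()]
        if any(needle in text.lower() for text in haystack):
            hits.append(ds)
    return hits


# Invented sample records, used only to exercise the function.
CATALOG = [
    Dataset("ds-001", "Retail orders 2019", {"department": "sales", "format": "csv"}),
    Dataset("ds-002", "Clickstream logs", {"department": "marketing", "format": "json"}),
    Dataset("ds-003", "Sales forecasts", {"department": "finance", "format": "parquet"}),
]

if __name__ == "__main__":
    for ds in search(CATALOG, "sales"):
        print(ds.dataset_id, ds.name)
```

In this sketch a dataset matches when the term appears in its name or in any tag, which mirrors the idea that both attributes and descriptive tags are searchable once a dataset is cataloged.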

The solution keeps track of the datasets a user selects and generates a manifest file with secure access links to the desired content when the user checks out.
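As a rough illustration of the checkout step, the sketch below turns a list of selected items into a JSON manifest with one link per object. The manifest layout, the cart item fields, and the `build_manifest` and `placeholder_link` names are assumptions made for this example; the solution's actual manifest format is not described on this page.

```python
import json
from typing import Callable, Dict, List


def build_manifest(
    cart: List[Dict[str, str]],
    make_link: Callable[[str, str], str],
) -> str:
    """Return a JSON manifest with one entry per selected object.

    `make_link(bucket, key)` is supplied by the caller so that the way
    links are secured stays separate from the manifest layout.
    """
    entries = [
        {
            "dataset": item["dataset"],
            "key": item["key"],
            "url": make_link(item["bucket"], item["key"]),
        }
        for item in cart
    ]
    return json.dumps({"entries": entries}, indent=2)


def placeholder_link(bucket: str, key: str) -> str:
    # Stand-in only: a real deployment would return a secure,
    # time-limited link here instead of a fixed placeholder.
    return f"https://example.invalid/{bucket}/{key}?signature=PLACEHOLDER"


# Hypothetical cart contents; bucket and key names are invented.
cart = [
    {"dataset": "Retail orders 2019", "bucket": "demo-bucket", "key": "orders/2019.csv"},
    {"dataset": "Clickstream logs", "bucket": "demo-bucket", "key": "logs/clicks.json"},
]

manifest = build_manifest(cart, placeholder_link)

if __name__ == "__main__":
    print(manifest)
```

Passing the link generator in as a function keeps the sketch independent of any particular signing mechanism.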

Data Lake on Amazon Web Services

Version 2.3
Last updated: 10/2020
Author: Amazon Web Services 

Estimated deployment time: 30 min

Source code 

Features

Data lake reference implementation

Use this data lake solution out of the box, or as a reference implementation that you can customize to meet unique data management, search, and processing needs.

Managed storage layer

Secure and manage the storage and retrieval of data in a managed Amazon S3 bucket, and use a solution-specific Amazon Key Management Service (KMS) key to encrypt data at rest.

Data access flexibility

Leverage pre-signed Amazon S3 URLs, or use an appropriate Amazon Identity and Access Management (IAM) role for controlled yet direct access to datasets in Amazon S3.
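The general idea behind a pre-signed link (a URL that carries its own expiry time and a signature over the request) can be sketched with the Python standard library. This is a conceptual illustration only: real Amazon S3 pre-signed URLs are produced by the AWS SDKs using Signature Version 4, not by the simplified HMAC scheme below. The secret, the host name, and the `sign_url` and `verify_url` functions are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def sign_url(base_url: str, key: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Append an expiry timestamp and an HMAC-SHA256 signature to a URL.

    Assumes `base_url` has no path of its own and `key` needs no escaping.
    """
    message = f"{key}:{expires_at}".encode()
    sig = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{base_url}/{key}?{urlencode({'expires': expires_at, 'sig': sig})}"


def verify_url(url: str, now: int, secret: bytes = SECRET) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    key = parsed.path.lstrip("/")
    expected = hmac.new(secret, f"{key}:{expires_at}".encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sig, expected) and now < expires_at


if __name__ == "__main__":
    url = sign_url("https://example.invalid", "sales/orders.csv", expires_at=1000)
    print(verify_url(url, now=999))   # still within the validity window
    print(verify_url(url, now=1000))  # expired
```

Because the signature covers both the object key and the expiry time, changing either one in the URL invalidates the link, which is what makes this style of access controlled yet direct.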

User interface

The solution automatically creates an intuitive, web-based console UI hosted on Amazon S3 and delivered by Amazon CloudFront. Use the console to manage data lake users and policies, add or remove data packages, search data packages, and create manifests of datasets for additional analysis.

Federated sign-in

Optionally, you can enable users to sign in through a SAML identity provider (IdP) such as Microsoft Active Directory Federation Services (AD FS).