Amazon EMR is a managed service that makes it fast, easy, and cost-effective to run Apache Hadoop and Spark to process vast amounts of data. Amazon EMR also supports other proven Hadoop ecosystem tools such as Presto, Hive, Pig, and HBase. In this project, you will deploy a fully functional Hadoop cluster, ready to analyze log data, in just a few minutes. You will start by launching an Amazon EMR cluster and then use a HiveQL script to process sample log data stored in an Amazon S3 bucket. HiveQL is a SQL-like scripting language for data warehousing and analysis. You can then use a similar setup to analyze your own log files.


What you'll accomplish:

Launch a fully functional Hadoop cluster using Amazon EMR.

Define the schema and create a table for sample log data stored in Amazon S3.

Analyze the data using a HiveQL script and write the results back to Amazon S3.

Download and view the results on your computer.
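
As a rough illustration of the steps above, the following Python sketch builds a small HiveQL script (define a schema, create a table over data in Amazon S3, write query results back to Amazon S3) and a step definition that could run it on a cluster. It is a sketch under stated assumptions, not the script used in this project: the table name `sample_logs`, its columns, and every `s3://example-...` URI are hypothetical placeholders, and the dictionary follows the step shape that the AWS SDK for Python is understood to accept. No AWS calls are made.

```python
"""Sketch: build a HiveQL script and an EMR step definition (no AWS calls)."""
import json


def build_hive_script(input_uri: str, output_uri: str) -> str:
    """Return HiveQL that defines a table over log files and summarizes them.

    The table name and columns are hypothetical; real log data needs a
    schema that matches its actual format.
    """
    return f"""
-- Define the schema for tab-delimited log files already stored in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS sample_logs (
  request_date STRING,
  request_time STRING,
  uri STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '{input_uri}';

-- Count requests per HTTP status and write the results back to S3.
INSERT OVERWRITE DIRECTORY '{output_uri}'
SELECT status, COUNT(*) AS hits
FROM sample_logs
GROUP BY status;
""".strip()


def build_hive_step(script_uri: str) -> dict:
    """Return a step definition that runs a HiveQL script stored in S3.

    Assumption: this mirrors the step structure used when submitting
    work to an EMR cluster through an SDK; it is only built here, not sent.
    """
    return {
        "Name": "Analyze sample logs with Hive",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args", "-f", script_uri],
        },
    }


if __name__ == "__main__":
    # All URIs below are placeholders, not real buckets.
    print(build_hive_script("s3://example-input/logs/", "s3://example-output/results/"))
    print(json.dumps(build_hive_step("s3://example-bucket/scripts/analyze.q"), indent=2))
```

In practice the script text would be uploaded to Amazon S3 and the step submitted to a running cluster through the console or an SDK; those calls are left out so the sketch stays runnable without credentials.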

What you'll need before starting:

An Amazon Web Services Account: You will need an Amazon Web Services account to begin provisioning resources for your Amazon EMR cluster. Sign up for Amazon Web Services.

IT Experience: Prior experience with Hadoop is recommended, but not required, to complete this project.

Amazon Web Services Experience: Basic familiarity with Amazon S3 and Amazon EC2 key pairs is suggested, but not required, to complete this project.

Learn about features, benefits, and key use cases for Amazon EMR.

Need more resources to get started with Amazon Web Services? Visit the Getting Started Resource Center to learn more.