Apache Hadoop on Amazon EMR
Overview
Apache™ Hadoop® is an open source software project that can be used to efficiently process large datasets. Instead of using one large computer to process and store the data, Hadoop lets you cluster commodity hardware together to analyze massive datasets in parallel. There are many applications and execution engines in the Hadoop ecosystem, providing a variety of tools to match the needs of your analytics workloads. Amazon EMR makes it easy to create and manage fully configured, elastic clusters of Amazon EC2 instances running Hadoop and other applications in the Hadoop ecosystem.
Applications and frameworks in the Hadoop ecosystem
Hadoop commonly refers to the actual Apache Hadoop project, which includes MapReduce (execution framework), YARN (resource manager), and HDFS (distributed storage). You can also install Apache Tez, a next-generation framework which can be used instead of Hadoop MapReduce as an execution engine. Amazon EMR also includes EMRFS, a connector allowing Hadoop to use Amazon S3 as a storage layer.
However, there are also other applications and frameworks in the Hadoop ecosystem, including tools that enable low-latency queries, GUIs for interactive querying, a variety of interfaces like SQL, and distributed NoSQL databases. The Hadoop ecosystem includes many open source tools designed to build additional functionality on Hadoop core components, and you can use Amazon EMR to easily install and configure tools such as Hive, Pig, Hue, Ganglia, Oozie, and HBase on your cluster. You can also run other frameworks, like Apache Spark for in-memory processing, or Presto for interactive SQL, in addition to Hadoop on Amazon EMR.
Hadoop: the basic components
Amazon EMR programmatically installs and configures applications in the Hadoop project, including Hadoop MapReduce, YARN, HDFS, and Apache Tez across the nodes in your cluster.
Hadoop MapReduce and Tez, execution engines in the Hadoop ecosystem, process workloads using frameworks that break down jobs into smaller pieces of work that can be distributed across nodes in your Amazon EMR cluster. They are built with the expectation that any given machine in your cluster could fail at any time and are designed for fault tolerance. If a server running a task fails, Hadoop reruns that task on another machine until completion.
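The idea of splitting a job into tasks and rerunning failed tasks elsewhere can be illustrated with a small, self-contained sketch. This is not how Hadoop or Tez is implemented; `run_job`, `healthy_node`, and `failed_node` are hypothetical names invented for illustration, and a "node" here is just a Python callable.

```python
def healthy_node(task, split):
    """A node that runs the task successfully."""
    return task(split)


def failed_node(task, split):
    """A node that always fails, simulating a lost machine."""
    raise RuntimeError("simulated node failure")


def run_job(splits, task, nodes, max_attempts=3):
    """Run one task per input split; on failure, retry on the next node."""
    results = []
    for i, split in enumerate(splits):
        for attempt in range(max_attempts):
            node = nodes[(i + attempt) % len(nodes)]
            try:
                results.append(node(task, split))
                break  # task finished, move on to the next split
            except RuntimeError:
                continue  # try the same task on a different node
        else:
            raise RuntimeError(f"task {i} failed after {max_attempts} attempts")
    return results


# Break the job (summing 0..9) into smaller pieces of work.
data = list(range(10))
splits = [data[i:i + 3] for i in range(0, len(data), 3)]

# One of the two nodes is down; tasks placed on it are rerun on the other.
partial_sums = run_job(splits, sum, [failed_node, healthy_node])
total = sum(partial_sums)
```

Even though half of the simulated cluster is unavailable, every split is eventually processed and the partial results combine into the same total as a single-machine run.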
You can write MapReduce and Tez programs in Java, use Hadoop Streaming to execute custom scripts in parallel, use Hive and Pig for higher-level abstractions over MapReduce and Tez, or use other tools to interact with Hadoop.
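As a rough illustration of the Hadoop Streaming model, the sketch below writes a word-count mapper and reducer as plain Python functions. Streaming scripts exchange tab-separated key/value lines over standard input and output, and the reducer receives its input sorted by key. The function names and sample input are assumptions made for this example, and the map, sort, and reduce stages are simulated locally rather than run on a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local simulation of the map -> sort -> reduce pipeline.
records = sorted(mapper(["big data", "big clusters"]))
output = list(reducer(records))
```

In an actual Streaming job, each function would typically live in its own script that reads `sys.stdin` and prints its records, and Hadoop would perform the sort between the two stages.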
Starting with Hadoop 2, resource management is handled by Yet Another Resource Negotiator (YARN). YARN keeps track of all the resources across your cluster and ensures that these resources are dynamically allocated to accomplish the tasks in your processing job. YARN can manage Hadoop MapReduce and Tez workloads as well as other distributed frameworks such as Apache Spark.
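To make the notion of tracking and allocating cluster resources more concrete, here is a toy first-fit allocator. It is only a sketch under simplifying assumptions and does not reflect YARN's actual schedulers; the `Node` class, the `allocate` function, and the node sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mb: int      # memory still available on this node
    free_vcores: int  # virtual cores still available on this node


def allocate(nodes, request_mb, request_vcores):
    """Place one container on the first node with enough free memory and vcores."""
    for node in nodes:
        if node.free_mb >= request_mb and node.free_vcores >= request_vcores:
            node.free_mb -= request_mb
            node.free_vcores -= request_vcores
            return node.name
    return None  # no capacity left; a real scheduler would queue the request


cluster = [Node("node-1", 4096, 2), Node("node-2", 8192, 4)]

# Request seven containers of 2 GiB and 1 vcore each.
placements = [allocate(cluster, 2048, 1) for _ in range(7)]
```

The first six requests fit on the two nodes, and the seventh returns `None` because the cluster's tracked capacity is exhausted.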
By using the EMR File System (EMRFS) on your Amazon EMR cluster, you can leverage Amazon S3 as your data layer for Hadoop. Amazon S3 is highly scalable, low cost, and designed for durability, making it a great data store for big data processing. By storing your data in Amazon S3, you can decouple your compute layer from your storage layer, allowing you to size your Amazon EMR cluster for the amount of CPU and memory required for your workloads instead of having extra nodes in your cluster to maximize on-cluster storage. Additionally, you can terminate your Amazon EMR cluster when it is idle to save costs, while your data remains in Amazon S3.
EMRFS is optimized for Hadoop to read and write directly to Amazon S3 in parallel, and it can process objects encrypted with Amazon S3 server-side and client-side encryption. EMRFS allows you to use Amazon S3 as your data lake, and Hadoop in Amazon EMR can be used as an elastic query layer.
Hadoop also includes a distributed storage system, the Hadoop Distributed File System (HDFS), which stores data across local disks of your cluster in large blocks. HDFS has a configurable replication factor (with a default of 3x), giving increased availability and durability. HDFS monitors replication and balances your data across your nodes as nodes fail and new nodes are added.
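The storage cost of replication is simple arithmetic, sketched below. The 3x replication factor comes from the text above; the 128 MiB block size is an assumption for this example (block size is configurable), and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Estimate how many blocks a file occupies and the raw disk it consumes."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division without floats
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": file_bytes * replication,
    }


one_gib = 1024**3
footprint = hdfs_footprint(one_gib)
```

Under these assumptions, a 1 GiB file is split into 8 blocks, stored as 24 block replicas, and consumes 3 GiB of raw disk across the cluster.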
HDFS is automatically installed with Hadoop on your Amazon EMR cluster, and you can use HDFS along with Amazon S3 to store your input and output data. You can easily encrypt HDFS using an Amazon EMR security configuration. Also, Amazon EMR configures Hadoop to use HDFS and local disk for intermediate data created during your Hadoop MapReduce jobs, even if your input data is located in Amazon S3.
Advantages of Hadoop on Amazon EMR
Use cases
Given its massive scalability and lower costs, Hadoop is ideally suited for common ETL workloads such as collecting, sorting, joining, and aggregating big datasets for easier consumption by downstream systems.
Apache and Hadoop are trademarks of the Apache Software Foundation.