Apache Hive is an open-source, distributed, fault-tolerant system that provides data warehouse-like query capabilities. It enables users to read, write, and manage petabytes of data using a SQL-like interface.
Apache Hive is natively supported in Amazon EMR, and you can quickly and easily create managed Apache Hive clusters from the Amazon Web Services Management Console, Amazon Web Services CLI, or the Amazon EMR API. Additionally, you can leverage additional Amazon EMR features, including direct connectivity to Amazon DynamoDB or Amazon S3 for storage, integration with the Amazon Glue Data Catalog, Amazon Lake Formation, Amazon RDS, or Amazon Aurora to configure an external metastore, and EMR Managed Scaling to add or remove instances from your cluster.
Features and benefits
You can launch an EMR cluster with multiple master nodes to support high availability for Apache Hive. Amazon EMR automatically fails over to a standby master node if the primary master node fails or if critical processes, like Resource Manager or Name Node, crash. This means that you can run Apache Hive on EMR clusters without interruption.
Amazon EMR allows you to define EMR Managed Scaling for Apache Hive clusters to help you optimize your resource usage. With EMR Managed Scaling, you can automatically resize your cluster for best performance at the lowest possible cost. With EMR Managed Scaling you specify the minimum and maximum compute limits for your clusters and Amazon EMR automatically resizes them for best performance and resource utilization. EMR Managed Scaling continuously samples key metrics associated with the workloads running on clusters.
Amazon EMR 6.0.0 adds support for Hive LLAP, providing an average performance speedup of 2x over EMR 5.29. You can now use S3 Select with Hive on Amazon EMR to improve performance. S3 Select allows applications to retrieve only a subset of data from an object, which reduces the amount of data transferred between Amazon EMR and Amazon S3. Amazon EMR also enables fast performance on complex Apache Hive queries. EMR uses Apache Tez by default, which is significantly faster. Apache MapReduce uses multiple phases, so a complex Apache Hive query would get broken down into four or five jobs. Apache Tez is designed for more complex queries, so that same job on Apache Tez would run in one job, making it significantly faster.
Flexible metastore options
With Amazon EMR, you have the option to leave the metastore as local or externalize it. EMR provides integration with the Amazon Glue Data Catalog and Amazon Lake Formation, so that EMR can pull information directly from Glue or Lake Formation to populate the metastore.