Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is an open-source, non-relational, versioned database that runs on top of Amazon S3 (using EMRFS) or the Hadoop Distributed Filesystem (HDFS), and it is built for random, strictly consistent, real-time access to tables with billions of rows and millions of columns. Apache Phoenix integrates with Apache HBase to provide low-latency SQL access over Apache HBase tables and secondary indexing for increased performance. Additionally, Apache HBase has tight integration with Apache Hadoop, Apache Hive, and Apache Pig, so you can easily combine massively parallel analytics with fast data access. Apache HBase's data model, throughput, and fault tolerance are a good match for workloads in ad tech, web analytics, financial services, applications using time-series data, and many more.
Apache HBase is natively supported in Amazon EMR, so you can quickly and easily create managed Apache HBase clusters from the Amazon Web Services Management Console, Amazon Web Services CLI, or the Amazon EMR API. You can leverage additional Amazon EMR features, including using Amazon S3 as a data store to reduce costs, creating read-replica clusters for increased availability, choosing from a wide variety of Amazon EC2 instances and Amazon EBS volumes for your cluster's hardware, backing up and restoring to Amazon S3 using the Amazon EMR File System (EMRFS), automatic node replacement, and easy resize commands to add or remove instances from your cluster. Also, you can use Hue to visualize your HBase tables and explore your data. Learn more about Apache HBase and about Apache HBase on Amazon EMR.
Features and benefits
Performance at scale
Apache HBase is designed to maintain performance while scaling out to hundreds of nodes, supporting billions of rows and millions of columns. It uses Amazon S3 (with EMRFS) or the Hadoop Distributed Filesystem (HDFS) as a fault-tolerant datastore. Amazon EMR supports a wide variety of instance types and Amazon EBS volumes, so you can customize the hardware of your cluster to optimize for cost and performance. Additionally, you can use Apache Phoenix to run low-latency SQL queries over massive HBase tables or to create secondary indexes for increased performance.
Through tight integration with projects in the Apache Hadoop ecosystem, you can run massively parallel analytics workloads on data stored in HBase tables. You can install Apache Phoenix, Apache Hadoop, Apache Hive, Apache Pig, and other open-source big data applications on your Amazon EMR cluster alongside Apache HBase, and use these tools to run reporting, SQL queries, or other analytics workloads on your data in Apache HBase. You can also use these tools to bulk import data into and export data from Apache HBase tables, or use Apache Hive to join data from Apache HBase with external tables on Amazon S3.
Integration with Amazon EMR
You can launch a fully configured Amazon EMR cluster running Apache HBase and other Apache Hadoop and Apache Spark ecosystem applications in minutes. Amazon EMR automatically replaces poorly performing nodes, and you can resize your cluster to meet your requirements. You can manage tables and browse data in Apache HBase using the Hue UI, and back up and restore tables to Amazon S3 using EMRFS and Hadoop MapReduce. Additionally, Apache HBase on Amazon EMR can use Amazon EMR’s authorization, Kerberos authentication, and encryption feature sets.
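As a rough illustration of launching such a cluster through the Amazon EMR API, the sketch below builds the request parameters in the shape that boto3's EMR `run_job_flow` call accepts. It only assembles a dictionary and does not contact the service. The helper name `build_hbase_cluster_request`, the cluster name, the release label `emr-6.15.0`, and the instance type `m5.xlarge` are illustrative assumptions; the two role names are the EMR default roles and may differ in your account.

```python
import json


def build_hbase_cluster_request(name, release_label, instance_type, core_count):
    """Assemble keyword arguments shaped for boto3's EMR run_job_flow call.

    Hypothetical helper for illustration: it builds a plain dict and makes
    no network calls.
    """
    if core_count < 1:
        raise ValueError("an HBase cluster needs at least one core node")
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # assumed example release label
        # Install HBase together with Hue (table browsing) and Phoenix (SQL).
        "Applications": [{"Name": "HBase"}, {"Name": "Hue"}, {"Name": "Phoenix"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster running so HBase stays available after startup.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # EMR default role names; assumed to exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_hbase_cluster_request(
        "hbase-demo", "emr-6.15.0", "m5.xlarge", core_count=2
    )
    print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`. That call is left out here so the sketch runs without an AWS account.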
Amazon S3 storage for HBase
Amazon EMR enables you to use Amazon S3 as a data store for Apache HBase using the EMR File System. Separating your cluster’s storage from its compute nodes by using Amazon S3 as a data store provides several advantages over on-cluster HDFS. You can save costs by sizing your cluster for your compute requirements instead of HDFS data storage, get the availability and durability of S3 storage, scale compute nodes without impacting your underlying storage, and terminate your cluster to save costs and quickly restore it. You can also create and configure a read-replica cluster in another Amazon EC2 Availability Zone that provides read-only access to the same data as the primary cluster, so your data stays accessible even if the primary cluster becomes unavailable.
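The S3 storage mode and the read-replica option are typically enabled through EMR configuration classifications supplied at cluster creation. The sketch below builds those classifications as plain Python data. The helper name `hbase_on_s3_configurations` and the bucket path are hypothetical. The property names (`hbase.rootdir`, `hbase.emr.storageMode`, `hbase.emr.readreplica.enabled`) are taken from the Amazon EMR documentation as best recalled and should be checked against the release you use.

```python
import json


def hbase_on_s3_configurations(root_dir, read_replica=False):
    """Build EMR configuration classifications for HBase on Amazon S3.

    Hypothetical helper for illustration; property names should be
    verified against the documentation for your EMR release.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")

    # The 'hbase' classification switches the storage mode to S3.
    hbase_properties = {"hbase.emr.storageMode": "s3"}
    if read_replica:
        # A read-replica cluster points at the same root directory as the
        # primary cluster and serves read-only traffic.
        hbase_properties["hbase.emr.readreplica.enabled"] = "true"

    return [
        # The 'hbase-site' classification sets where HBase keeps its data.
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
        {"Classification": "hbase", "Properties": hbase_properties},
    ]


if __name__ == "__main__":
    root = "s3://example-bucket/hbase-root"  # hypothetical bucket and prefix
    print(json.dumps(hbase_on_s3_configurations(root), indent=2))
    print(json.dumps(hbase_on_s3_configurations(root, read_replica=True), indent=2))
```

The primary and replica configurations differ only in the read-replica flag, which reflects the point above: both clusters share one copy of the data in S3, and only the compute layer is duplicated.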