Posted On: Jan 7, 2021

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning. With a single click, data scientists and developers can spin up SageMaker Studio notebooks to explore and prepare datasets, then build, train, and deploy machine learning models, all from a single pane of glass. Amazon Elastic MapReduce (EMR) is a web service that makes it easy to process vast amounts of data quickly and cost-effectively. Starting today, customers can use Studio notebooks to connect easily and securely to Amazon EMR clusters and prepare vast amounts of data for analysis and reporting, model training, or inference.

Data preparation is a critical step in the machine learning workflow. SageMaker Studio gives you a range of data preparation tools to suit your preference. If you prefer to write code, you can use SageMaker Studio notebooks to prepare data interactively using libraries and SDKs, or process large amounts of data in batch using Amazon SageMaker Processing with a built-in Spark container. However, if you preferred to connect Studio notebooks to existing EMR clusters to access and process data, you previously had to set up the environment manually: bring your own Sparkmagic kernel, configure target cluster information, and install tools such as Kerberos for authentication before running your Spark or Hive jobs.
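To give a sense of what "configure target cluster information" involves in a manual setup, here is a minimal sketch in Python that builds a Sparkmagic-style configuration pointing at a Livy endpoint on an EMR cluster. It is an illustration under stated assumptions, not the setup SageMaker Studio performs: the function name `build_sparkmagic_config` and the host name are hypothetical, port 8998 is assumed to be where Livy listens (its default), and the resulting dictionary is assumed to be saved as `~/.sparkmagic/config.json`.

```python
import json

# Assumption: Livy listens on its default port; your cluster may differ.
LIVY_PORT = 8998


def build_sparkmagic_config(master_dns, use_kerberos=False):
    """Return a Sparkmagic-style config dict for a Livy endpoint.

    master_dns is the (hypothetical) DNS name of the EMR primary node.
    """
    url = f"http://{master_dns}:{LIVY_PORT}"
    # Sparkmagic expects the string "None" when no authentication is used.
    auth = "Kerberos" if use_kerberos else "None"
    credentials = {"username": "", "password": "", "url": url, "auth": auth}
    return {
        # Separate copies so editing one kernel's settings leaves the other alone.
        "kernel_python_credentials": dict(credentials),
        "kernel_scala_credentials": dict(credentials),
    }


if __name__ == "__main__":
    # Hypothetical host name, used only for illustration.
    config = build_sparkmagic_config("emr-primary.example.internal", use_kerberos=True)
    print(json.dumps(config, indent=2))
```

Even with the configuration file in place, a manual setup still needs a Sparkmagic kernel installed in the notebook image and, for Kerberos, a valid ticket for the connecting user.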

Amazon SageMaker Studio now comes with built-in tools that make it quick and easy to securely connect your notebook to an EMR cluster to process large amounts of data. In a few steps, you can create a Studio notebook from a built-in SageMaker image with a PySpark kernel, use built-in commands to connect to an EMR cluster, and start running Spark or Hive jobs to process data. For added security, you can connect to EMR clusters using Kerberos authentication. The feature is now available in both the AWS China (Beijing) Region, operated by Sinnet, and the AWS China (Ningxia) Region, operated by NWCD. For more information, see the Amazon SageMaker Studio documentation.