Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool
Today, we’re pleased to introduce the EMR CLI, an open-source command line tool that simplifies packaging, deploying, and running Spark jobs on Amazon EMR.
In this post, we show how you can use the EMR CLI to create a new PySpark project from scratch and deploy it to EMR Serverless.
Overview of solution
The EMR CLI is an open-source tool to help improve the developer experience of developing and deploying jobs on Amazon EMR. When you’re just getting started with Apache Spark, there are a variety of options with respect to how to package, deploy, and run jobs that can be overwhelming or require deep domain expertise. The EMR CLI provides simple commands for these actions that remove the guesswork from deploying Spark jobs. You can use it to create new projects or alongside existing PySpark projects.
In this post, we walk through creating a new PySpark project that analyzes weather data from the NOAA Global Surface Summary of the Day (GSOD) open dataset. The walkthrough consists of the following steps:
- Initialize the project.
- Package the dependencies.
- Deploy the code and dependencies to Amazon Simple Storage Service (Amazon S3).
- Run the job on EMR Serverless.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An Amazon Web Services account
- An EMR Serverless application in the us-east-1 Region
- An S3 bucket for your code and logs in the us-east-1 Region
- An Amazon Web Services Identity and Access Management (IAM) job role that can run EMR Serverless jobs and access S3 buckets
- Python version >= 3.7
- Docker
If you don’t already have an existing EMR Serverless application, you can use the following emr bootstrap command after you’ve installed the CLI.
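The following is only a minimal sketch of a bootstrap invocation; the --target flag is an assumption on our part, so run emr bootstrap --help to confirm the options available in your version of the CLI.

```
# Assumed flags (verify with "emr bootstrap --help"):
# --target selects the environment to create, EMR Serverless in this case.
emr bootstrap --target emr-serverless
```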
Install the EMR CLI
You can find the source for the EMR CLI in the GitHub repo, and the package is also distributed on PyPI.
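A typical installation looks like the following sketch; we assume the PyPI package name is emr-cli and that you’re working inside a virtual environment.

```
# Create and activate a virtual environment, then install the CLI.
# The package name "emr-cli" is assumed here; adjust if your distribution differs.
python3 -m venv .venv
source .venv/bin/activate
pip install emr-cli
```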
You should now be able to run the emr --help command and see the different subcommands you can use:
If you didn’t already create an EMR Serverless application, the bootstrap command can create a sample environment for you and a configuration file with the relevant settings. Assuming you used the provided CloudFormation stack, set the following environment variables using the information on the Outputs tab of your stack. Set the Region in the terminal to us-east-1 and set a few other environment variables we’ll need along the way:
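For example, something like the following sketch. The variable names (APPLICATION_ID, JOB_ROLE_ARN, S3_BUCKET) are placeholders we reuse later in this post, and the values shown are fake; copy the real ones from your stack’s Outputs tab.

```
# Placeholder values — replace with the outputs from your CloudFormation stack.
export AWS_REGION=us-east-1
export APPLICATION_ID=00f1example23456          # your EMR Serverless application ID
export JOB_ROLE_ARN=arn:aws:iam::123456789012:role/emr-serverless-job-role
export S3_BUCKET=amzn-s3-demo-bucket            # bucket for your code and logs
```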
We use us-east-1 because that’s where the NOAA GSOD data bucket is. EMR Serverless can access S3 buckets and other Amazon Web Services resources in the same Region by default. To access resources in other Regions or other services, you need to configure VPC access for your EMR Serverless application.
Initialize a project
Next, we use the emr init command to initialize a default PySpark project for us in the provided directory. The default templates create a standard Python project that uses pyproject.toml to define its dependencies. In this case, we use Pandas and PyArrow in our script, so those are already pre-populated.
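A minimal sketch of that step, assuming emr init takes the target directory as its argument; the directory name my-project matches the rest of this walkthrough.

```
# Initialize a default PySpark project in the my-project directory.
emr init my-project
```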
After the project is initialized, you can run cd my-project or open the my-project directory in your code editor of choice. You should see the following set of files:
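The layout looks roughly like the listing below. The exact file names (entrypoint.py, the jobs directory) are assumptions based on the default template and may differ slightly in your version; the Dockerfile and pyproject.toml are the files discussed in this post.

```
# Hypothetical listing of the generated project:
tree my-project
# my-project
# ├── Dockerfile        <- used by "emr package" to build dependencies for EMR
# ├── entrypoint.py     <- assumed name of the main script passed as --entry-point
# ├── jobs              <- assumed location for supporting PySpark job code
# └── pyproject.toml    <- project metadata and dependencies (Pandas, PyArrow)
```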
Note that we also have a Dockerfile here. This is used by the package command to ensure that our project dependencies are built on the right architecture and operating system for Amazon EMR.
If you prefer Poetry, you can pass the --project-type poetry flag to the emr init command to create a Poetry project.
If you already have an existing PySpark project, you can use emr init --dockerfile to create the Dockerfile necessary to package things up.
Run the project
Now that we’ve got our sample project created, we need to package our dependencies, deploy the code to Amazon S3, and start a job on EMR Serverless. With the EMR CLI, you can do all of that in one command. Make sure to run the command from the my-project directory:
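The command looks something like the following sketch. The entry point name, the S3 prefix, and the --wait flag are assumptions; the other flags (--application-id, --job-role, --s3-code-uri, --build) are discussed later in this post.

```
# A hypothetical one-shot build/deploy/run; entrypoint.py and the S3 prefix are placeholders.
emr run \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/emr-cli-demo/ \
  --entry-point entrypoint.py \
  --build \
  --wait   # assumed flag that makes the CLI wait for the job to finish
```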
This command performs several actions:
- Auto-detects the type of Spark project in the current directory.
- Initiates a build for your project to package up dependencies.
- Copies your entry point and resulting build files to Amazon S3.
- Starts an EMR Serverless job.
- Waits for the job to finish, exiting with an error status if it fails.
You should now see the following output in your terminal as the job begins running in EMR Serverless:
And that’s it! If you want to run the same code on Amazon EMR on EC2, you can replace --application-id with --cluster-id j-11111111. The CLI will take care of sending the right spark-submit commands to your EMR cluster.
Now let’s walk through some of the other commands.
emr package
PySpark projects can be packaged in numerous ways, from a single .py file to a complex Poetry project with various dependencies. The EMR CLI can help consistently package your projects without having to worry about the details.
For example, if you have a single .py file in your project directory, the package command doesn’t need to do anything. If, however, you have multiple .py files in a typical Python project style, the emr package command will zip these files up as a package that can later be uploaded to Amazon S3 and provided to your PySpark job using the --py-files option. If you have third-party dependencies defined in pyproject.toml, emr package will create a virtual environment archive and start your EMR job with the spark.archives option.
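Packaging on its own might look like the following sketch; the --entry-point flag for emr package is an assumption, so check emr package --help for the exact options.

```
# Build the project artifacts locally (inside the Amazon Linux 2 Docker image).
# The --entry-point flag is assumed here; confirm with "emr package --help".
emr package --entry-point entrypoint.py
```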
The EMR CLI also supports Poetry projects. If your project already has a poetry.lock file, there’s nothing else you need to do. The emr package command will detect your poetry.lock file and automatically build the project using the Poetry Bundle plugin. You can use a Poetry project in two ways:
- Create a project using the emr init command. The command takes a --project-type poetry option that creates a Poetry project for you, as shown in the sketch after this list.
- If you have a pre-existing project, you can use the emr init --dockerfile option, which creates a Dockerfile that is automatically used when you run emr package.
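Here’s what the first option might look like; the directory name is a placeholder.

```
# Initialize a Poetry-based PySpark project (directory name is a placeholder).
emr init --project-type poetry my-poetry-project
```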
Finally, as noted earlier, the EMR CLI provides you a default Dockerfile based on Amazon Linux 2 that you can use to reliably build package artifacts that are compatible with different EMR environments.
emr deploy
The emr deploy command takes care of copying the necessary artifacts for your project to Amazon S3, so you don’t have to worry about it. Regardless of how the project is packaged, emr deploy will copy the resulting files to your Amazon S3 location of choice.
One use case for this is with CI/CD pipelines. Sometimes you want to deploy a specific version of code to Amazon S3 to be used in your data pipelines. With emr deploy, this is as simple as changing the --s3-code-uri parameter.
For example, let’s assume you’ve already packaged your project using the emr package command. Most CI/CD pipelines allow you to access the git tag. You can use that as part of the emr deploy command to deploy a new version of your artifacts. In GitHub Actions, this is github.ref_name, and you can use this in an action to deploy a versioned artifact to Amazon S3. See the following code:
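A sketch of that step inside a GitHub Actions workflow run step; the bucket variable, the code/ prefix, and the entry point name are placeholders, and we assume emr deploy accepts the same --entry-point and --s3-code-uri options used elsewhere in this post.

```
# Deploy versioned artifacts to an S3 prefix named after the git tag.
# ${S3_BUCKET}, "code/", and entrypoint.py are placeholders.
emr deploy \
  --entry-point entrypoint.py \
  --s3-code-uri s3://${S3_BUCKET}/code/${{ github.ref_name }}/
```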
In your downstream jobs, you could then update the location of your entry point files to point to this new location when you’re ready, or you can use the emr run command discussed in the next section.
emr run
Let’s take a quick look at the emr run command. We’ve used it before to package, deploy, and run in one command, but you can also use it to run on already-deployed artifacts. Let’s look at the specific options:
If you want to run your code on EMR Serverless, the emr run command takes --application-id and --job-role parameters. If you want to run on EMR on EC2, you only need the --cluster-id option.
Required for both options are --entry-point and --s3-code-uri. --entry-point is the main script that will be called by Amazon EMR. If you have any dependencies, --s3-code-uri is where they get uploaded to using the emr deploy command, and the EMR CLI will build the relevant spark-submit properties pointing to these artifacts.
There are a few different ways to customize the job (a combined example follows this list):
- --job-name – Allows you to specify the job or step name
- --job-args – Allows you to provide command line arguments to your script
- --spark-submit-opts – Allows you to add additional spark-submit options like --conf spark.jars or others
- --show-stdout – Currently only works with single-file .py jobs on EMR on EC2, but will display stdout in your terminal after the job is complete
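As a combined illustration, a hypothetical invocation might look like the sketch below. The job name, job arguments, and Spark configuration values are all placeholders, and the format of the --job-args value is an assumption.

```
# Placeholder values throughout; the flags are the ones listed above.
emr run \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/code/ \
  --entry-point entrypoint.py \
  --job-name weather-analysis \
  --job-args "2022" \
  --spark-submit-opts "--conf spark.driver.memory=4g"
```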
As we’ve seen before, --build invokes both the package and deploy commands. This makes it easier to iterate on local development when your code still needs to run remotely. You can simply use the same emr run command over and over again to build, deploy, and run your code in your environment of choice.
Future updates
The EMR CLI is under active development, and updates to support additional capabilities are currently in progress.
Clean up
To avoid incurring future charges, stop or delete your EMR Serverless application. If you used the CloudFormation template, be sure to delete your stack.
Conclusion
With the release of the EMR CLI, we’ve made it easier for you to deploy and run Spark jobs on EMR Serverless. The utility is available as open source on GitHub.
About the author
Damon is a Principal Developer Advocate on the EMR team at Amazon Web Services. He’s worked with data and analytics pipelines for over 10 years and divides his time between splitting service logs and stacking firewood.