How to extend the functionality of Amazon Web Services Trainium with custom operators
Deep learning (DL) is a fast-evolving field, and practitioners are constantly innovating DL models and inventing ways to speed them up. Custom operators are one of the mechanisms developers use to push the boundaries of DL innovation by extending the functionality of existing machine learning (ML) frameworks such as PyTorch. In general, an operator describes the mathematical function of a layer in a deep learning model. A custom operator allows developers to build their own mathematical functions for a layer in the deep learning model.
To follow along, familiarity with core Amazon Web Services services such as
Custom operators in PyTorch and their benefits
CustomOps for PyTorch originated in version 1.10, called PyTorch C++ Frontend, and provided an easy-to-use mechanism to register CustomOps written in C++. The following are some of the benefits that CustomOps provide:
- Performance optimization – CustomOps can be optimized for specific use cases, leading to faster model runs and improved performance.
- Improved model expressiveness – With CustomOps, you can express complex computations that aren’t easily expressible using the built-in operators provided by PyTorch.
- Increased modularity – You can use CustomOps as building blocks to create more complex models by creating C++ libraries of reusable components. This makes the development process easier and more modular, and facilitates rapid experimentation.
- Increased flexibility – CustomOps enable operations beyond the built-in operators; that is, they provide a flexible way to define complex operations that aren’t implemented using the standard ones.
Trainium support for custom operators
Trainium (and Amazon Web Services Inferentia2) supports CustomOps in software through the Neuron SDK and accelerates them in hardware using the GPSIMD engine (General Purpose Single Instruction Multiple Data engine). Let’s look at how these enable efficient CustomOps implementation and provide increased flexibility and performance when developing and innovating DL models.
Neuron SDK
The Neuron SDK helps developers train models on Trainium and deploy models on the Amazon Web Services Inferentia accelerators. It integrates natively with frameworks, such as PyTorch and TensorFlow, so you can continue using your existing workflows and application code to train models on Trn1 instances.
The Neuron SDK uses the standard PyTorch interface for CustomOps. Developers can use the standard programming interface in PyTorch to write CustomOps in C++ and extend Neuron’s official operator support. Neuron then compiles these CustomOps to run efficiently on the GPSIMD engine, which is described in more detail in the following section. This makes it easy to implement new experimental CustomOps and accelerate them on purpose-built hardware, without any intimate knowledge of this underlying hardware.
General Purpose Single Instruction Multiple Data engine
At the core of Trainium optimizations resides the NeuronCore architecture, a fully independent, heterogeneous compute unit with four main engines: the tensor, vector, scalar, and GPSIMD engines. The scalar and vector engines are highly parallelized and optimized for floating-point operations. The tensor engine is based on a power-optimized systolic array that supports mixed-precision computation.
The GPSIMD engine is a general-purpose Single Instruction Multiple Data (SIMD) engine designed for running and accelerating CustomOps. This engine consists of eight fully programmable 512-bit wide general-purpose processors, which can run straight-line C-code and have direct inline access to the other NeuronCore-v2 engines, as well as the embedded SRAM and HBM memories. Together, these capabilities help run CustomOps efficiently on Trainium.
Take, for example, operators such as TopK, LayerNorm, or ZeroCompression, which read data from memory and use it for only a minimal number of ALU calculations. Regular CPU systems are completely memory bound for these calculations, and performance is limited by the time required to move the data into the CPU. In Trainium, the GPSIMD engines are tightly coupled with the on-chip caches using a high-bandwidth streaming interface, which can sustain 2 TB/sec of memory bandwidth. Therefore, memory-bound CustomOps like these can run very fast on Trainium.
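To make the memory-bound argument concrete, here is a back-of-the-envelope sketch in plain Python. It assumes a simple elementwise operator that reads each input once, writes each output once, and performs one ALU operation per element. The tensor size is an illustrative assumption, 1 TB is taken as 10^12 bytes, and the only figure from the text is the 2 TB/sec bandwidth. It ignores latency and any compute time, so the result is a lower bound rather than a measurement.

```python
def bytes_moved(num_elements: int, bytes_per_element: int) -> int:
    """Memory traffic of an elementwise op: read each input once, write each output once."""
    return 2 * num_elements * bytes_per_element


def arithmetic_intensity(ops_per_element: float, bytes_per_element: int) -> float:
    """ALU operations performed per byte of memory traffic."""
    return ops_per_element / (2 * bytes_per_element)


def streaming_time_s(total_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on runtime when the op is purely bandwidth bound."""
    return total_bytes / bandwidth_bytes_per_s


NUM_ELEMENTS = 1_000_000_000  # assumed tensor size, for illustration only
FP32_BYTES = 4                # bytes per float32 element
BANDWIDTH = 2e12              # 2 TB/sec, the figure quoted in the text

traffic = bytes_moved(NUM_ELEMENTS, FP32_BYTES)
intensity = arithmetic_intensity(1, FP32_BYTES)
lower_bound = streaming_time_s(traffic, BANDWIDTH)

print(f"traffic: {traffic / 1e9:.1f} GB")
print(f"arithmetic intensity: {intensity} ops/byte")
print(f"bandwidth-bound lower limit: {lower_bound * 1e3:.1f} ms")
```

With so little arithmetic per byte, the time to stream the data dominates, which is why the bandwidth between the engine and memory matters more than raw ALU throughput for operators of this kind.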
Neuron SDK custom operators in practice
For this post, we assume a DLAMI (refer to instructions for either
Similar to the process of integrating PyTorch with C++ code, Neuron CustomOps require a C++ implementation of an operator via a NeuronCore-ported subset of the Torch C++ API. The C++ implementation of the operator is called the kernel function, and the port of the C++ API contains everything required for CustomOps development and model integration, specifically tensor and scalar classes in
The torch.h header needs to be included when defining the kernel for you to have access to a NeuronCore-ported subset of the
Neuron CustomOps also require a shape function. The shape function has the same function signature as the kernel function, but doesn’t perform any computations. It only defines the shape of the output tensor but not the actual values.
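The relationship between the two functions can be illustrated with a plain-Python analogy. This is not the Neuron or Torch C++ API: the names row_sum_kernel and row_sum_shape are hypothetical, and nested lists stand in for tensors. A row-sum is used because its output shape differs from its input shape, which makes the role of the shape function easier to see.

```python
from typing import List

Matrix = List[List[float]]


def row_sum_kernel(t: Matrix) -> List[float]:
    """Kernel function: performs the computation and returns the real output values."""
    return [sum(row) for row in t]


def row_sum_shape(t: Matrix) -> List[float]:
    """Shape function: same signature, no computation.

    Returns a placeholder whose shape matches the kernel's output;
    the values themselves carry no meaning.
    """
    return [0.0] * len(t)


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

values = row_sum_kernel(x)      # real results, one per row
placeholder = row_sum_shape(x)  # only its length (the output shape) matters

print(values, len(placeholder))
```

The point of the split is that the output shape can be determined ahead of time, without running the kernel itself.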
Neuron CustomOps are grouped into libraries, and macros are used to register them with the NEURON_LIBRARY scope from within the shape function. The function will be run on the host at compilation time and will require the register.h header from the torchneuron library:
Finally, the custom library is built by calling the load API. If you supply the build_directory parameter, the library file is stored in the indicated directory:
To use the CustomOp from a PyTorch model, simply load the library by calling the load_library API and call the Neuron CustomOp in the same manner that CustomOps are called in PyTorch via the torch.ops namespace. The format is usually torch.ops.<library_name>.<operator_name>. See the following code:
Note that the custom_op.load API builds the C++ library, whereas the custom_op.load_library API loads an already-built library file.
Example: Neuron CustomOps in MLP training
To get started, perform the following steps:
- Create and launch your EC2 Trn1 instance. Be sure that you use a DLAMI image (either Ubuntu or Amazon Linux, pre-installed with all necessary Neuron software) and that you have specified a root volume size of 512 GB.
- After your instance is up and running, SSH to your instance.
- Install PyTorch Neuron (torch-neuronx) on your running Trn1 instance. For instructions, refer to Neuron Custom C++ Operators in MLP Training.
- Download the sample code from the GitHub repository.
Now that your environment is set up, continue through this post as we describe the implementation of a typical C++ CustomOp in Neuron in the form of Relu forward and backward functions to be used on a simple multilayer perceptron (MLP) model. The steps are described in the
The example code from the repository shows two folders:
- ./customop_mlp/PyTorch – Contains the Relu code that will be compiled for a CPU
- ./customop_mlp/neuron – Contains the Relu code that will be compiled for Trainium
Develop a Neuron CustomOp: The kernel function
The host or dev environment for the development of the kernel function (the Neuron CustomOp) can run PyTorch 1.13 and a C++17 compatible compiler in a Linux environment. This is the same as developing any C++ function for PyTorch, and the only libraries that need to be present in the development environment are those for PyTorch and C++. In the following example, we create a relu.cpp file with the custom Relu forward and backward functions:
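The arithmetic that the forward and backward functions need to implement can be stated compactly. The plain-Python sketch below is only a reference for that math, not the relu.cpp source from the repository. The names relu_forward and relu_backward are illustrative, the gradient at exactly zero is taken to be zero by convention, and a real kernel would operate on tensors through the NeuronCore-ported Torch C++ API rather than on Python lists.

```python
from typing import List


def relu_forward(x: List[float]) -> List[float]:
    """Forward pass: y[i] = max(x[i], 0)."""
    return [v if v > 0.0 else 0.0 for v in x]


def relu_backward(grad_output: List[float], x: List[float]) -> List[float]:
    """Backward pass: let the upstream gradient through where x[i] > 0, else 0."""
    if len(grad_output) != len(x):
        raise ValueError("grad_output and x must have the same length")
    return [g if v > 0.0 else 0.0 for g, v in zip(grad_output, x)]


inputs = [-2.0, 0.0, 3.5]
upstream = [0.1, 0.2, 0.3]  # gradient arriving from the next layer

outputs = relu_forward(inputs)
grads = relu_backward(upstream, inputs)

print(outputs)
print(grads)
```

A reference like this can also serve as a check when comparing the results of a compiled operator against expected values on small inputs.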
When developing a Neuron CustomOp, make sure you take into account the currently supported features and APIs. For more information, refer to
Build and register the Neuron CustomOp: The shape function
The build and runtime environment for the Neuron CustomOp is the Trn1 instance where the training will take place. There, the Neuron CustomOp is compiled and registered as a neuronx-cc library and interpreted by the Neuron runtime to run on the highly optimized GPSIMD engine.
To build and register the Neuron CustomOp, we need to create a shape function (shape.cpp) that will define the input and output tensors and register the operators: the relu_fwd_shape and relu_bwd_shape functions. See the following code:
The relu_fwd_shape and relu_bwd_shape functions define the shape of the output tensor (to be the same size as the input tensor). Then we register the functions in the NEURON_LIBRARY scope.
In the ./customop_mlp/neuron repository example, we have a build.py script that builds and registers the CustomOp by simply calling the load function from the torch_neuronx.xla_impl package:
In the build_directory, we should find the librelu.so library ready to be loaded and used in training our model.
Build the MLP model with the Neuron CustomOp
In this section, we go through the steps to build the MLP model with the Neuron CustomOp.
Define the Relu class
For a detailed explanation of how to train an MLP model, refer to
After we build the CustomOp, we create a Python module called my_ops.py, where we define a Relu PyTorch class that inherits from the torch autograd function. The autograd function implements automatic differentiation, so that the class can be used in a training loop.
First we load the librelu.so library, then we define the new class with the forward and backward functions defined with static method decorators. In this way, the methods can be called directly when we define the model. See the following code:
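The class structure described above can be tried on a CPU with standard PyTorch. In this minimal sketch, the helper functions relu_fwd and relu_bwd are hypothetical stand-ins for the operators that would be reached through the torch.ops namespace after loading librelu.so; they are an assumption made so that the example runs without a Trn1 instance. It is not the my_ops.py file from the repository.

```python
import torch


def relu_fwd(x: torch.Tensor) -> torch.Tensor:
    # CPU stand-in (hypothetical) for the forward custom operator.
    return torch.clamp(x, min=0.0)


def relu_bwd(grad: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # CPU stand-in (hypothetical) for the backward custom operator.
    return grad * (x > 0).to(grad.dtype)


class Relu(torch.autograd.Function):
    """Custom activation with explicit forward and backward static methods."""

    @staticmethod
    def forward(ctx, x):
        # Keep the input so that backward can tell where it was positive.
        ctx.save_for_backward(x)
        return relu_fwd(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return relu_bwd(grad_output, x)


x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
y = Relu.apply(x)   # autograd Functions are invoked through .apply
y.sum().backward()  # triggers Relu.backward

result = y.detach().tolist()
gradient = x.grad.tolist()
print(result, gradient)
```

Because forward and backward are static methods, the class is never instantiated; a model calls Relu.apply wherever it would otherwise call a built-in activation.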
Examine the MLP model
Now we’re ready to write our multilayer perceptron model with our Neuron CustomOp by importing the my_ops module where we have defined the Relu class:
Run the training script
Now we can train our model by using the provided train.py script:
By sending the model to the xla device, the model and Relu custom operator are compiled to be run by the Neuron runtime using the optimized Trainium hardware.
In this example, we showed how to create a custom Relu operator that takes advantage of the hardware engine (GPSIMD) available on the Trainium ML accelerator chip. The result is a trained PyTorch model that can now be deployed for inferencing.
Conclusion
Modern state-of-the-art model architectures require an increasing amount of resources, from engineering staff (data scientists, ML engineers, MLOps engineers, and others) to actual infrastructure, including storage, compute, memory, and accelerators. These requirements increase the cost and complexity of developing and deploying deep learning models. Trainium accelerators deliver a high-performance, low-cost solution for DL training in the cloud. The use of Trainium is facilitated by the Neuron SDK, which includes a deep learning compiler, runtime, and tools that are natively integrated into popular frameworks such as PyTorch and TensorFlow. (Note that at the time of writing, the Neuron SDK 2.9 only supports PyTorch for the development of custom operators.)
As demonstrated in this post, Trainium not only provides the means to train your models performantly and efficiently, but also offers the ability to customize your operators to add flexibility and expressiveness to both training and experimentation.
For more information, refer to the
About the Authors
Lorea Arrizabalaga is a Solutions Architect aligned to the UK Public Sector, where she helps customers design ML solutions with Amazon SageMaker. She is also part of the Technical Field Community dedicated to hardware acceleration and helps with testing and benchmarking Amazon Web Services Inferentia and Amazon Web Services Trainium workloads.
Shruti Koparkar is a Senior Product Marketing Manager at Amazon Web Services. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.
Ashley Miller is a Senior AI/ML Evangelist at Amazon Web Services where he works closely with public sector partners developing AI/ML solutions on Amazon Web Services.