Distributed training libraries

Complete distributed training up to 40% faster

Amazon SageMaker offers the fastest and easiest methods for training large deep learning models on large datasets. Using partitioning algorithms, SageMaker's distributed training libraries automatically split large deep learning models and training datasets across Amazon Web Services GPU instances in a fraction of the time it takes to do so manually. SageMaker achieves these efficiencies through two techniques: data parallelism and model parallelism. Model parallelism splits models too large to fit on a single GPU into smaller parts and distributes them across multiple GPUs for training, while data parallelism splits large datasets so that multiple GPUs can train on them concurrently, improving training speed.

ML use cases such as image classification and text-to-speech place increasingly large demands on compute and training data. For example, BERT, a state-of-the-art natural language processing (NLP) model released in 2018, uses 340 million parameters. Today's state-of-the-art NLP models, such as T5, GPT-3, Turing-NLG, and Megatron, have set new accuracy records but require tens to hundreds of billions of parameters. Training models like T5 or GPT-3 on a single GPU instance can take several days, slowing your ability to deploy the latest iterations into production. Additionally, implementing your own data and model parallelism strategies manually can take weeks of experimentation.

With only a few lines of additional code, you can add either data parallelism or model parallelism to your PyTorch and TensorFlow training scripts and Amazon SageMaker will apply your selected method for you. SageMaker will determine the best approach to split your model by using graph partitioning algorithms to balance the computation of each GPU while minimizing the communication between GPU instances. SageMaker also optimizes your distributed training jobs through algorithms that are designed to fully utilize Amazon Web Services China compute and network infrastructure in order to achieve near-linear scaling efficiency, which allows you to complete training faster than manual implementations.
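
As an illustration, the sketch below shows how a data-parallel training job might be launched with the SageMaker Python SDK. The entry point, IAM role, framework versions, and S3 path are placeholders, and the exact configuration depends on your SDK and framework versions:

```python
from sagemaker.pytorch import PyTorch

# Placeholder values: substitute your own training script, IAM role, and data location
estimator = PyTorch(
    entry_point="train.py",
    role="<your-sagemaker-execution-role-arn>",
    instance_type="ml.p3dn.24xlarge",
    instance_count=2,
    framework_version="1.12",
    py_version="py38",
    # Enable SageMaker's data parallelism library for this training job
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit("s3://your-bucket/training-data")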

Benefits

Data parallelism library

Reduce training time

Amazon SageMaker reduces training time by making it easy to split training data across GPUs. For example, training Mask R-CNN on p3dn.24xlarge runs 25% faster on SageMaker compared to Horovod. The reduction in training time is possible because SageMaker manages the GPUs running in parallel to achieve optimal synchronization.

Optimized for Amazon Web Services

SageMaker’s data parallelism library provides communication algorithms that are designed to fully utilize the Amazon Web Services China network and infrastructure to achieve near-linear scaling efficiency. For example, BERT on p3dn.24xlarge instances achieves a scaling efficiency of 90% using SageMaker, a 26% improvement over the same model trained with open-source tools.

SageMaker provides data parallelism optimizations through the same APIs that are already common for distributed training, so you are not required to learn a new library. To enable data parallelism, you can use the DistributedDataParallel (DDP) API for PyTorch or the Horovod API for TensorFlow.
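
For example, a PyTorch training script can typically be adapted along the following lines. The pattern shown reflects recent versions of the library, which register an "smddp" backend for PyTorch's native DDP; details may differ across library versions, and MyModel is a placeholder for your existing model:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Importing this module registers the "smddp" communication backend
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

dist.init_process_group(backend="smddp")

# LOCAL_RANK is set by the SageMaker launcher for each GPU process
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)             # MyModel: your existing model class (placeholder)
model = DDP(model, device_ids=[local_rank])  # the rest of the training loop stays the same
```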

Model parallelism library

Automatic and efficient model partitioning

Manually partitioning large models can take weeks of effort for even the most experienced data science teams. Amazon SageMaker can split your model in seconds by profiling it and finding the most efficient way to partition it across GPUs.

Minimal code changes

Amazon SageMaker requires changing fewer than 10 lines of code in your TensorFlow or PyTorch training script to split your models across multiple GPUs. You can reuse existing APIs from TensorFlow, PyTorch, and Horovod to quickly get up and running.
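
As a rough sketch, adapting a PyTorch script for the model parallelism library typically looks like the following. Net and train_loader are placeholders for your existing model and data loader, and the exact API may vary by library version:

```python
import torch
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp

smp.init()  # initialize the model parallelism library (configuration comes from the training job)

device = torch.device("cuda", smp.local_rank())

model = Net().to(device)                       # Net: your existing PyTorch model (placeholder)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Wrap the model and optimizer; the library profiles and partitions the model across GPUs
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)

@smp.step  # runs the forward/backward pass as a pipeline over microbatches
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target)
    model.backward(loss)                       # replaces loss.backward() when using the library
    return loss

for data, target in train_loader:              # train_loader: your existing DataLoader (placeholder)
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    loss_mb = train_step(model, data, target)
    loss = loss_mb.reduce_mean()               # average the per-microbatch losses
    optimizer.step()
```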

Optimize resources

Amazon SageMaker maximizes utilization of your GPU instances by splitting your training batches into smaller microbatches. The microbatches are fed to the GPUs in an efficient pipeline that keeps all GPU devices active simultaneously.
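
Microbatching and pipelining are controlled through the model parallelism configuration passed to the training job. The sketch below shows the general shape of that configuration; the values are illustrative and should be tuned for your model and instance type:

```python
# Passed as the `distribution` argument of a SageMaker framework estimator
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 2,            # number of model partitions
                "microbatches": 4,          # split each training batch into 4 microbatches
                "pipeline": "interleaved",  # pipeline schedule used to keep GPUs busy
                "optimize": "speed",        # partition for speed rather than memory
                "ddp": True,                # combine model parallelism with data parallelism
            },
        }
    },
    "mpi": {"enabled": True, "processes_per_host": 8},
}
```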

Use cases

Object detection

For object detection, model training time is often a bottleneck, slowing data science teams down as they wait several days or weeks for results. For example, autonomous vehicle object detection models need to train on large volumes of data to improve vehicle perception. SageMaker’s data parallelism library can help data science teams efficiently split training data and quickly scale to hundreds or even thousands of GPUs, reducing training time from days to minutes.

Natural language processing

In natural language understanding, data scientists often improve model accuracy by increasing the number of layers and the size of the neural network, resulting in models with billions of parameters such as GPT-2, GPT-3, T5, and Megatron. Splitting model layers and operations across GPUs can take weeks, but SageMaker’s model parallelism library automatically analyzes and splits the model efficiently to enable data science teams to start training large models within minutes.

Computer vision

In computer vision, hardware constraints often force data scientists to pick batch sizes or input sizes that are smaller than they would prefer. For example, bigger inputs may improve model accuracy but may cause out-of-memory errors and poor performance with smaller batch sizes. Similarly, larger batch sizes improve GPU utilization and performance but may hinder model accuracy. SageMaker's distributed training libraries offer the flexibility to train models efficiently with smaller batch sizes or with bigger inputs.