Posted On: Mar 18, 2021
Today we are introducing two new distributed training libraries for Amazon SageMaker, providing integrated methods for you to quickly train large deep learning models. Using partitioning algorithms, these SageMaker distributed training libraries automatically split large deep learning models and training datasets across Amazon Web Services GPU instances in a fraction of the time it takes to do manually. SageMaker achieves these efficiencies through two techniques: model parallelism and data parallelism. Model parallelism splits models too large to fit on a single GPU into smaller parts before distributing across multiple GPUs to train, and data parallelism splits large datasets to train concurrently in order to improve training speed.
ML use cases such as image classification and text-to-speech demand increasingly larger computational requirements and datasets. For example BERT, a state-of-the-art natural language processing (NLP) model released in 2018, uses 340 million parameters. Now, state-of-the-art NLP models, such as T5, GPT-3, Turing-NLG, and Megatron, have set new accuracy records, but require tens to hundreds of billions of parameters. Training models like T5 or GPT-3 on a single GPU instance can take several days, slowing your ability to deploy the latest iterations into production. Additionally, implementing your own data and model parallelism strategies manually to ensure your model trains efficiently across a cluster of GPUs can take weeks of experimentation.
With only a few lines of additional code, you can add either data parallelism or model parallelism to your PyTorch and TensorFlow training scripts and Amazon SageMaker will apply your selected method for you, allowing you to train models faster. SageMaker will determine the best approach to split your model by using graph partitioning algorithms to balance the computation of each GPU while minimizing the communication between GPU instances. SageMaker also optimizes your distributed training jobs through algorithms that are designed to fully utilize Amazon Web Services China compute and network infrastructure in order to achieve near-linear scaling efficiency, which allows you to complete training faster than manual implementations.