Accelerate PyTorch with DeepSpeed to train large language models with Intel Habana Gaudi-based DL1 EC2 instances
Training large language models (LLMs) with billions of parameters can be challenging. In addition to designing the model architecture, researchers need to set up state-of-the-art techniques for distributed training, such as mixed precision, gradient accumulation, and checkpointing. With large models, the training setup is even more challenging: the available memory in a single accelerator device bounds the size of models that can be trained using data parallelism alone, and model-parallel training requires an additional level of modification to the training code. Libraries such as DeepSpeed, an open-source deep learning optimization library for PyTorch, address some of these challenges and can help accelerate model development and training.
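These techniques map directly onto a DeepSpeed configuration. The following is a minimal sketch, not the exact configuration used in this post; the batch values mirror the hyperparameters discussed later, the model is a placeholder, and the script is assumed to run under the `deepspeed` launcher:

```python
import torch
import deepspeed

# Illustrative configuration: BF16 mixed precision, gradient accumulation,
# and ZeRO stage 1 optimizer-state sharding.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 24,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}

model = torch.nn.Linear(1024, 1024)  # placeholder for a real LLM

# deepspeed.initialize wraps the model in an engine that applies the
# configured optimizations during forward/backward/step.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```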
In this post, we set up distributed training on Intel Habana Gaudi-based Amazon EC2 DL1 instances using DeepSpeed, and quantify the scaling behavior with BERT 1.5B pre-training as an example.
Training setup
We provisioned a managed compute cluster composed of 16 dl1.24xlarge instances using AWS Batch.
The distributed training workshop illustrates the steps to set up the distributed training cluster. The workshop shows the distributed training setup using AWS Batch, and in particular the multi-node parallel jobs feature, to launch large-scale containerized training jobs on fully managed clusters. More specifically, a fully managed AWS Batch compute environment is created with DL1 instances. The containers are pulled from Amazon Elastic Container Registry (Amazon ECR) and launched onto the instances in the cluster based on the multi-node parallel job definition.
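For readers who want to script this setup, the following is a minimal sketch of registering a multi-node parallel job definition with boto3. The job definition name, image URI, and resource values are hypothetical placeholders, not the exact ones used in the workshop:

```python
import boto3

batch = boto3.client("batch")

# Hypothetical multi-node parallel job definition: 16 nodes, with node 0
# acting as the main node. Replace the image URI and resource values with
# those from your own setup.
response = batch.register_job_definition(
    jobDefinitionName="bert-1-5b-pretraining",  # hypothetical name
    type="multinode",
    nodeProperties={
        "numNodes": 16,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:15",
                "container": {
                    "image": "<account>.dkr.ecr.<region>.amazonaws.com/bert-training:latest",
                    "resourceRequirements": [
                        {"type": "VCPU", "value": "96"},       # dl1.24xlarge has 96 vCPUs
                        {"type": "MEMORY", "value": "700000"}, # MiB, below the 768 GiB on the instance
                    ],
                },
            }
        ],
    },
)
print(response["jobDefinitionArn"])
```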
BERT 1.5B pre-training with DeepSpeed
Habana SynapseAI v1.5/v1.6 supports DeepSpeed optimizations on Gaudi accelerators, including distributed data parallel training, ZeRO stage 1 optimizer state sharding, and the BF16 data type.
All these features are enabled in the BERT 1.5B model reference from the Habana Model-References repository.
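Schematically, a training step with this integration looks like standard DeepSpeed code targeting the `hpu` device. The following is a minimal sketch, assuming the SynapseAI PyTorch bridge (`habana_frameworks.torch`) and Habana's DeepSpeed fork are installed and the script runs under a distributed launcher; the model, data, and backend wiring are illustrative assumptions, not the reference implementation:

```python
import torch
import deepspeed
import habana_frameworks.torch.core as htcore  # SynapseAI PyTorch bridge

# Placeholder model and data for the sketch; a real run would use the
# BERT 1.5B reference model and its data pipeline.
model = torch.nn.Linear(1024, 2)
ds_config = {"train_micro_batch_size_per_gpu": 16, "gradient_accumulation_steps": 24,
             "bf16": {"enabled": True}, "zero_optimization": {"stage": 1}}

# Habana's DeepSpeed fork uses the HCCL backend for collectives.
deepspeed.init_distributed(dist_backend="hccl")
model_engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for _ in range(10):
    x = torch.randn(16, 1024).to("hpu")
    y = torch.randint(0, 2, (16,)).to("hpu")
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)  # DeepSpeed handles scaling and accumulation
    model_engine.step()
    htcore.mark_step()           # flush the lazy-mode graph on Gaudi
```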
Pre-training (phase 1) scaling results
For pre-training large models at scale, we focused mainly on two aspects of the solution: training performance, as measured by the time to train, and the cost-effectiveness of arriving at a fully converged solution. Next, we dive deeper into these two metrics with BERT 1.5B pre-training as an example.
Scaling performance and time to train
We start by measuring the performance of the BERT Large implementation as a baseline for scalability. The following table lists the measured throughput in sequences per second for 1-8 dl1.24xlarge instances (with eight accelerator devices per instance). Using the single-instance throughput as the baseline, we measured the efficiency of scaling across multiple instances, which is an important lever for understanding the price-performance of training (a short calculation follows the table).
| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
| --- | --- | --- | --- | --- |
| 1 | 8 | 1,379.76 | 172.47 | 100.0% |
| 2 | 16 | 2,705.57 | 169.10 | 98.04% |
| 4 | 32 | 5,291.58 | 165.36 | 95.88% |
| 8 | 64 | 9,977.54 | 155.90 | 90.39% |
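Scaling efficiency here is the per-accelerator throughput at a given scale divided by the single-instance per-accelerator baseline. The following minimal sketch reproduces the table's efficiency column from its throughput column:

```python
# Throughput (sequences/second) per number of accelerators, from the table.
throughput = {8: 1379.76, 16: 2705.57, 32: 5291.58, 64: 9977.54}

baseline_per_accel = throughput[8] / 8  # 172.47 seq/s per accelerator

for n, seq_per_s in throughput.items():
    per_accel = seq_per_s / n
    efficiency = per_accel / baseline_per_accel
    print(f"{n:3d} accelerators: {per_accel:6.2f} seq/s each, "
          f"{efficiency:.2%} scaling efficiency")
# Prints 100.00%, 98.04%, 95.88%, 90.39%, matching the table.
```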
The following figure illustrates the scaling efficiency.
For BERT 1.5B, we modified the hyperparameters of the model in the reference repository to ensure convergence. The effective batch size per accelerator was set to 384 (for maximum memory utilization), with micro-batches of 16 per step and 24 steps of gradient accumulation. Learning rates of 0.0015 and 0.003 were used for 8 and 16 nodes, respectively. With these configurations, we achieved convergence of the phase 1 pre-training of BERT 1.5B across 8 dl1.24xlarge instances (64 accelerators) in approximately 25 hours, and in 15 hours across 16 dl1.24xlarge instances (128 accelerators). The following figure shows the average loss as a function of the number of training epochs as we scale up the number of accelerators.
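To make the batch-size bookkeeping explicit: the effective batch size per accelerator is the micro-batch size times the number of gradient accumulation steps, and the global batch size multiplies that by the accelerator count. A quick sketch of the arithmetic:

```python
micro_batch = 16       # sequences per accelerator per micro-step
grad_accum_steps = 24  # micro-steps accumulated per optimizer step
per_accel_batch = micro_batch * grad_accum_steps
assert per_accel_batch == 384  # the effective batch size quoted above

# Global batch sizes at the two cluster scales used for pre-training.
for accelerators in (64, 128):
    global_batch = per_accel_batch * accelerators
    print(f"{accelerators} accelerators -> global batch size {global_batch:,}")
# 64 accelerators -> 24,576 sequences; 128 accelerators -> 49,152 sequences.
```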
With the configuration described earlier, we obtained 85% strong scaling efficiency with 64 accelerators and 83% with 128 accelerators, from a baseline of 8 accelerators in a single instance. The following table summarizes the results.
| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
| --- | --- | --- | --- | --- |
| 1 | 8 | 276.66 | 34.58 | 100.0% |
| 8 | 64 | 1,883.63 | 29.43 | 85.1% |
| 16 | 128 | 3,659.15 | 28.59 | 82.7% |
The following figure illustrates the scaling efficiency.
Conclusion
In this post, we evaluated support for DeepSpeed in Habana SynapseAI v1.5/v1.6 and how it helps scale LLM training on Habana Gaudi accelerators. Pre-training of a 1.5-billion-parameter BERT model took approximately 15 hours to converge on a cluster of 128 Gaudi accelerators, with 83% strong scaling efficiency. We encourage you to take a look at the architecture demonstrated in the distributed training workshop and to consider adopting it to pre-train your own large models on DL1 instances.
About the authors
Pierre-Yves Aquilanti is Head of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry's best cloud-based ML frameworks solutions. His background is in high performance computing, and prior to joining Amazon Web Services, Pierre-Yves worked in the oil and gas industry. He is originally from France and holds a Ph.D. in Computer Science from the University of Lille.