Train a large language model on a single Amazon SageMaker GPU with Hugging Face and LoRA

This post is co-written with Philipp Schmid from Hugging Face.

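We have all heard about the progress being made in the field of large language models (LLMs) and the growing number of problem sets where LLMs provide valuable insights. Large models, when trained over massive datasets and several tasks, are also able to generalize well to tasks that they weren't specifically trained for. Such models are called foundation models, a term first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. Even though these foundation models generalize well, especially with the help of prompt engineering techniques, the use case is often so domain specific, or the task so different, that the model needs further customization. One approach to improving the performance of a large model for a specific domain or task is to further train the model with a smaller, task-specific dataset. Although this approach, known as fine-tuning, successfully improves the accuracy of LLMs, it requires modifying all of the model weights. Fine-tuning is much faster than pre-training thanks to the much smaller dataset size, but it still requires significant computing power and memory. Because fine-tuning modifies all the parameter weights of the original model, it is expensive and results in a model that is the same size as the original.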

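To address these challenges, Hugging Face introduced the Parameter-Efficient Fine-Tuning (PEFT) library. This library allows you to freeze most of the original model weights and replace or extend model layers by training an additional, much smaller set of parameters. This makes training much less expensive in terms of required compute and memory.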

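In this post, we show you how to train the 7-billion-parameter BloomZ model using just a single graphics processing unit (GPU) on Amazon SageMaker. SageMaker is Amazon's machine learning (ML) platform for preparing, building, training, and deploying high-quality ML models. BloomZ is a general-purpose natural language processing (NLP) model. We use PEFT to optimize this model for the specific task of summarizing messenger-like conversations. The single-GPU instance that we use is a low-cost example of the many instance types AWS provides. Training this model on a single GPU highlights AWS's commitment to being the most cost-effective provider of AI/ML services.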

The code for this walkthrough can be found in the Hugging Face notebooks GitHub repository under the sagemaker/24_train_bloom_peft_lora folder.

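Prerequisites

To follow along, you should have the following prerequisites:

An AWS account.
A Jupyter notebook within Amazon SageMaker Studio or a SageMaker notebook instance.
Access to the SageMaker ml.g5.2xlarge instance type, which contains a single NVIDIA A10G GPU. On the AWS Management Console, navigate to Service Quotas for SageMaker and request a 1-instance increase for the following quotas: ml.g5.2xlarge for training job usage and ml.g5.2xlarge for endpoint usage.
After your requested quotas are applied to your account, you can use the default Studio Python 3 (Data Science) image with an ml.t3.medium instance to run the notebook code snippets. For the full list of available kernels, refer to Available Amazon SageMaker Kernels.

Set up a SageMaker session

Use the following code to set up your SageMaker session: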

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

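Load and prepare the dataset

We use the samsum dataset, a collection of 16,000 messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English. The following is an example from the dataset: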

{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}

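To train the model, you need to convert the inputs (text) to token IDs. This is done by a Hugging Face Transformers tokenizer. For more information, refer to Chapter 6 of the Hugging Face NLP Course.

Use the following code to load the tokenizer: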

from transformers import AutoTokenizer

model_id="bigscience/bloomz-7b1"

# Load tokenizer of BLOOMZ
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048 # overwrite wrong value

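Before you can start training, you need to process the data. Once trained, the model takes a set of text messages as input and generates a summary as output. You need to format the data as a prompt (the messages) followed by the correct response (the summary). You also need to pack the examples into longer input sequences to optimize model training. See the following code: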

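from random import randint
from itertools import chain
from functools import partial
from datasets import load_dataset

# Load the training split from the Hugging Face Hub
# (assumed to mirror the test split loaded later in this post)
dataset = load_dataset("samsum", split="train")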

# custom instruct prompt start
prompt_template = f"Summarize the chat dialogue:\n{{dialogue}}\n---\nSummary:\n{{summary}}{{eos_token}}"

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(dialogue=sample["dialogue"],
                                            summary=sample["summary"],
                                            eos_token=tokenizer.eos_token)
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# randint is inclusive on both ends, so subtract 1 to stay in range
print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": []}


def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # number of tokens that fill whole chunks (0 if the batch is shorter than one chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")
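
To make the packing step concrete, here is a minimal, self-contained sketch that uses made-up integer token IDs instead of a real tokenizer. The function name pack_batches and the tiny chunk length are illustrative assumptions and are not part of the training notebook; the logic mirrors the chunk function above by carrying leftover tokens into the next batch.

```python
from itertools import chain


def pack_batches(batches, chunk_length):
    """Concatenate tokenized samples and cut them into fixed-length chunks.

    `batches` is a list of batches; each batch is a list of token-ID lists.
    Tokens that do not fill a whole chunk are carried over to the next batch.
    """
    remainder = []
    chunks = []
    for batch in batches:
        tokens = remainder + list(chain(*batch))
        usable = (len(tokens) // chunk_length) * chunk_length
        chunks.extend(tokens[i:i + chunk_length] for i in range(0, usable, chunk_length))
        remainder = tokens[usable:]
    return chunks, remainder


# Hypothetical token IDs standing in for tokenizer output
toy_batches = [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9]]]
packed, leftover = pack_batches(toy_batches, chunk_length=4)
print(packed)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(leftover)  # [9]
```

Every packed chunk has exactly chunk_length tokens, so no padding is needed; the trade-off is that a chunk can span the boundary between two conversations.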

Now you can use the FileSystem integration to upload the dataset to Amazon Simple Storage Service (Amazon S3):

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/samsum-sagemaker/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")


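Fine-tune BLOOMZ-7B with LoRA and bitsandbytes int-8 on SageMaker

The Hugging Face BLOOMZ-7B model card indicates that its initial training was distributed over 8 nodes, each with 8 A100 80 GB GPUs and 512 GB of CPU memory. This computing configuration is not readily accessible, is cost-prohibitive for consumers, and requires expertise in distributed training performance optimization. SageMaker lowers the barriers to replicating this setup through its distributed training libraries; however, eight comparable on-demand ml.p4de.24xlarge instances would cost $376.88 per hour. Furthermore, the fully trained model consumes about 40 GB of memory, which exceeds the available memory of many individual consumer GPUs and requires strategies for large model inference. As a result, full fine-tuning of the model for your task over multiple model runs and deployments would incur significant compute, memory, and storage costs on hardware that isn't readily accessible to consumers.

Our goal is to find a way to adapt BLOOMZ-7B to our chat summarization use case in a more accessible and cost-effective way while maintaining accuracy. To enable our model to be fine-tuned on a SageMaker ml.g5.2xlarge instance with a single consumer-grade NVIDIA A10G GPU, we employ two techniques to reduce the compute and memory requirements of fine-tuning: LoRA and quantization.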

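LoRA (Low-Rank Adaptation) is a technique that significantly reduces the number of trainable model parameters and the associated compute needed to fine-tune for a new task, with little loss in predictive performance. It freezes your original model weights and, rather than updating the full weights, optimizes much smaller rank-decomposition weight matrices for your new task; these adapted weights are then injected back into the original model. Fewer weight gradient updates mean less compute and GPU memory during fine-tuning. The intuition behind this approach is that the weight changes needed to adapt a pre-trained model to a new task have a low intrinsic rank, so they can be represented by far smaller matrices than the original weights. To deepen your understanding of the LoRA technique, refer to the original paper, LoRA: Low-Rank Adaptation of Large Language Models.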
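
The following is a minimal sketch of the low-rank update idea in plain Python, not the PEFT implementation. It assumes the common LoRA formulation in which a frozen weight matrix W is combined with two small trainable matrices A and B, scaled by alpha / r; the toy dimensions, the random values, and the helper names matvec and lora_forward are illustrative assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path through r dimensions
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_out, d_in, r, alpha = 6, 4, 2, 32  # toy sizes chosen for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen, d_out x d_in
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero init
x = [1.0, -2.0, 0.5, 3.0]

# With B initialized to zero the adapter is a no-op,
# so fine-tuning starts from the behavior of the base model.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))  # True
```

Only A and B receive gradient updates, which is why the number of trainable parameters scales with r rather than with the full size of W.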

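In addition to the LoRA technique, you use the bitsandbytes Hugging Face integration of the LLM.int8() method to quantize the frozen BloomZ model, that is, to reduce the precision of the weight and bias values by rounding them from float16 to int8. Quantization reduces the memory needed for the BloomZ weights by about four times relative to 32-bit floats (about two times relative to float16), which enables you to fit the model on the A10G GPU instance without a significant loss in predictive performance. To deepen your understanding of how int8 quantization works, its implementation in the bitsandbytes library, and its integration with the Hugging Face Transformers library, refer to A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes.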
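
As a rough illustration of the rounding step, the sketch below applies simple symmetric absmax quantization to a handful of made-up weights and estimates weight memory at different precisions. This is a simplified assumption for teaching purposes: the actual LLM.int8() method is more elaborate (it quantizes vector-wise and handles outlier values separately in higher precision), and the helper names here are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to the int8 range using a single absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, ignoring activations and optimizer state."""
    return num_params * bytes_per_param / 1e9


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up example weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
print(q)  # [12, -50, 33, 127, -100]

for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(name, weight_memory_gb(7e9, nbytes), "GB")
# float32 28.0 GB
# float16 14.0 GB
# int8 7.0 GB
```

The round trip is lossy: each restored value can differ from the original by up to half the scale, which is the precision given up in exchange for the smaller footprint.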

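Hugging Face has made LoRA and quantization accessible across a broad range of transformer models through the PEFT library and its integration with the bitsandbytes library. The create_peft_config() function in the prepared script run_clm.py illustrates their usage in preparing your model for training: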

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_int8_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8, # Lora attention dimension.
        lora_alpha=32, # the alpha parameter for Lora scaling.
        lora_dropout=0.05, # the dropout probability for Lora layers.
        target_modules=["query_key_value"],
    )

    # prepare int-8 model for training
    model = prepare_model_for_int8_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model

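With LoRA, the output of print_trainable_parameters() indicates that we were able to reduce the number of trainable model parameters from 7 billion to 3.9 million. This means we only need to update about 0.056% of the original model parameters. This significant reduction in compute and memory requirements allows us to fit and train the model on the GPU without issues.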
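
As a back-of-the-envelope check of the trainable-parameter count, the sketch below counts the LoRA weights added to each query_key_value layer. The hidden size (4096) and layer count (30) are assumptions based on the published BLOOM-7B1 configuration and are not stated in this post; the total of 7 billion is the rounded figure used above.

```python
hidden_size = 4096       # assumed BLOOM-7B1 hidden dimension
num_layers = 30          # assumed number of transformer blocks
rank = 8                 # r from the LoraConfig above

d_in = hidden_size
d_out = 3 * hidden_size  # fused query, key, and value projection

# Each adapted layer adds A (rank x d_in) and B (d_out x rank)
per_layer = rank * (d_in + d_out)
trainable = per_layer * num_layers
total = 7_000_000_000    # rounded model size used in this post

print(trainable)                          # 3932160
print(f"{100 * trainable / total:.3f}%")  # 0.056%
```

Under these assumptions, the count matches the roughly 3.9 million trainable parameters reported by print_trainable_parameters().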

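To create a SageMaker training job, you need a Hugging Face estimator. The estimator handles end-to-end SageMaker training and deployment tasks. SageMaker takes care of starting and managing all the required Amazon Elastic Compute Cloud (Amazon EC2) instances for you. Additionally, it provides the correct Hugging Face training container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at the path /opt/ml/input/data. Then it starts the training job. See the following code: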

import time
# define Training Job Name 
job_name = f'huggingface-peft-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                                # pre-trained model
  'dataset_path': '/opt/ml/input/data/training', # path where sagemaker will save training dataset
  'epochs': 3,                                         # number of training epochs
  'per_device_train_batch_size': 1,                    # batch size for training
  'lr': 2e-4,                                          # learning rate used during training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.2xlarge', # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.26',            # the transformers version used in the training job
    pytorch_version      = '1.13',            # the pytorch_version version used in the training job
    py_version           = 'py39',            # the python version used in the training job
    hyperparameters      =  hyperparameters
)

Now you can start your training job using the .fit() method, passing the S3 path to the training script:

# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as inputs
huggingface_estimator.fit(data, wait=True)

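Using LoRA and quantization makes fine-tuning BLOOMZ-7B for our task affordable and efficient with SageMaker. When using SageMaker training jobs, you only pay for GPUs for the duration of model training. In our example, the SageMaker training job took 20,632 seconds, which is about 5.7 hours. The ml.g5.2xlarge instance we used costs $1.515 per hour for on-demand usage. As a result, the total cost for training our fine-tuned BLOOMZ-7B model was only about $8.63. By comparison, full fine-tuning of the model's 7 billion weights would cost an estimated $600, or about 6,900% more per training run, assuming linear GPU scaling on the original computing configuration outlined in the Hugging Face model card. In practice, this would further vary depending on your training strategy, instance selection, and instance pricing.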
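
The cost arithmetic can be reproduced with a few lines of Python. This sketch only uses the figures quoted in this post (training time, hourly price, and the $600 estimate); note that using the unrounded duration gives a slightly higher total than rounding to 5.7 hours first.

```python
training_seconds = 20_632
price_per_hour = 1.515      # ml.g5.2xlarge on-demand price quoted in this post
full_finetune_cost = 600.0  # full fine-tuning estimate quoted in this post

hours = training_seconds / 3600
lora_cost = hours * price_per_hour
ratio = full_finetune_cost / lora_cost

print(f"{hours:.2f} hours")  # 5.73 hours
print(f"${lora_cost:.2f}")   # $8.68
print(f"{ratio:.0f}x")       # 69x
```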

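We could also reduce our training costs further by using SageMaker managed Spot Instances. However, Spot Instance interruptions could increase the total training time. See Amazon SageMaker Pricing for instance pricing details.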
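
As a hedged sketch of what enabling managed Spot training might look like, the following collects the relevant estimator keyword arguments in a dictionary. The parameter names (use_spot_instances, max_run, max_wait, checkpoint_s3_uri) are from the SageMaker Python SDK estimator; the time limits and the S3 prefix are hypothetical placeholders, and resuming after an interruption also assumes that the training script saves and reloads checkpoints, which this post does not cover.

```python
max_run_seconds = 8 * 3600    # hypothetical cap on actual training time
max_wait_seconds = 12 * 3600  # hypothetical cap on training time plus waiting for Spot capacity

spot_kwargs = {
    "use_spot_instances": True,
    "max_run": max_run_seconds,
    "max_wait": max_wait_seconds,
    # hypothetical S3 prefix where checkpoints would be synced
    "checkpoint_s3_uri": "s3://<your-bucket>/checkpoints/bloomz-peft",
}

# max_wait covers training plus waiting, so it must not be smaller than max_run
assert spot_kwargs["max_wait"] >= spot_kwargs["max_run"]

# Usage (not executed here): HuggingFace(entry_point="run_clm.py", ..., **spot_kwargs)
```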

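Deploy the model to a SageMaker endpoint for inference

With LoRA, you previously adapted a smaller set of weights to your new task. You need a way to combine these task-specific weights with the pre-trained weights of the original model. In the run_clm.py script, the PEFT library's merge_and_unload() method takes care of merging the base BLOOMZ-7B model with the updated adapter weights fine-tuned for your task. This makes the model easier to deploy and introduces no additional inference latency compared to the original model.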
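
To illustrate why merging adds no inference latency, the sketch below folds a toy low-rank update into a dense weight matrix and checks that the merged matrix gives the same output as running the base and adapter paths separately. It is a plain-Python illustration with made-up values, not the PEFT implementation of merge_and_unload(); the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in left]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A as a single dense matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy values chosen so the arithmetic is exact
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # frozen weights, 3 x 2
A = [[0.5, -1.0]]                          # adapter A, r x d_in = 1 x 2
B = [[1.0], [0.0], [-2.0]]                 # adapter B, d_out x r = 3 x 1
alpha, r = 2, 1
x = [1.0, 1.0]

merged = merge_lora(W, A, B, alpha, r)
adapter_out = [b + (alpha / r) * u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged_out = matvec(merged, x)

print(merged_out)                 # [2.0, 7.0, 13.0]
print(merged_out == adapter_out)  # True
```

After merging, inference uses one matrix of the original shape, so the deployed model has the same architecture and per-request cost as the base model.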

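In this section, we walk through the steps to create a SageMaker model from the fine-tuned model artifacts and deploy it to a SageMaker endpoint for inference. First, you create a Hugging Face model from your new fine-tuned model artifact. Because you previously trained the model with a SageMaker Hugging Face estimator, you can deploy the model immediately. Alternatively, you could upload the trained model to an S3 bucket and use it to create a model package later. See the following code: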

from sagemaker.huggingface import HuggingFaceModel

# 1. create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=huggingface_estimator.model_data,
   #model_data="s3://hf-sagemaker-inference/model.tar.gz",  # Change to your model path
   role=role, 
   transformers_version="4.26", 
   pytorch_version="1.13", 
   py_version="py39",
   model_server_workers=1
)

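As with any SageMaker model, you can deploy it using the deploy() method of the Hugging Face model object, passing in the desired number and type of instances. In this example, we use a G5 instance type (ml.g5.4xlarge in the following code), which is equipped with a single NVIDIA A10G GPU, the same GPU type the model was fine-tuned on in the previous step. If your endpoint quota covers a different G5 size, such as the ml.g5.2xlarge listed in the prerequisites, adjust instance_type accordingly: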

# 2. deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type= "ml.g5.4xlarge"
)

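It can take 5 to 10 minutes for the SageMaker endpoint to bring your instance online and download your model so that it is ready to accept inference requests.

When the endpoint is running, you can test it by sending a sample dialogue from the test split of the dataset. First, load the test split using the Hugging Face Datasets library. Next, select a random integer to index a single test sample from the dataset. Using string formatting, combine the test sample with a prompt template into a structured input that guides the model's response. This structured input is then combined with additional model input parameters into a formatted sample JSON payload. Finally, invoke the SageMaker endpoint with the formatted sample and print the model's output, which summarizes the sample dialogue. See the following code: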

from random import randint
from datasets import load_dataset

# 1. Load dataset from the hub
test_dataset = load_dataset("samsum", split="test")

# 2. select a random test sample (randint is inclusive on both ends)
sample = test_dataset[randint(0, len(test_dataset) - 1)]

# 3. format the sample
prompt_template = f"Summarize the chat dialogue:\n{{dialogue}}\n---\nSummary:\n"

formatted_sample = {
  "inputs": prompt_template.format(dialogue=sample["dialogue"]),
  "parameters": {
    "do_sample": True, # sample from the predicted token probabilities
    "top_p": 0.9, # nucleus (top-p) sampling, Holtzman et al. (2019)
    "temperature": 0.1, # low temperature favors high-probability tokens
    "max_new_tokens": 100, # maximum number of tokens to generate
  }
}

# 4. Invoke the SageMaker endpoint with the formatted sample
res = predictor.predict(formatted_sample)


# 5. Print the model output
print(res[0]["generated_text"].split("Summary:")[-1])
# Sample model output: Kirsten and Alex are going bowling this Friday at 7 pm. They will meet up and then go together.

Now let's compare the model's summary output to the reference summary of the test sample:

print(sample["summary"])
# Sample reference summary: Kirsten reminds Alex that the youth group meets this Friday at 7 pm to go bowling.

Clean up

Now that you've tested your model, make sure that you clean up the associated SageMaker resources to prevent continued charges:

predictor.delete_model()
predictor.delete_endpoint()

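Summary

In this post, you used the Hugging Face Transformers, PEFT, and bitsandbytes libraries with SageMaker to fine-tune a BloomZ large language model on a single GPU for about $8, and then deployed the model to a SageMaker endpoint for inference on a test sample. SageMaker offers multiple ways to use Hugging Face models; for more examples, check out the AWS Samples GitHub.

To continue using SageMaker to fine-tune foundation models, try out some of the techniques in the post Architect personalized generative AI SaaS applications on Amazon SageMaker. We also encourage you to learn more about Amazon generative AI capabilities by exploring JumpStart, Amazon Titan models, and Amazon Bedrock.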

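About the Authors

Philipp Schmid is a Technical Lead at Hugging Face with the mission to democratize good machine learning through open source and open science. Philipp is passionate about productionizing cutting-edge and generative AI machine learning models. He loves to share his knowledge of AI and NLP at various meetups, such as Data Science on AWS, and on his technical blog.

Robert Fisher is a Sr. Solutions Architect for Healthcare and Life Sciences customers. He works closely with customers to understand how AWS can help them solve problems, especially in the AI/ML space. Robert has many years of software engineering experience across a range of industry verticals, including medical devices, fintech, and consumer-facing applications.

Doug Kelly is a Sr. Solutions Architect at AWS who serves as a trusted technical advisor for top machine learning startups in verticals ranging from machine learning platforms and autonomous vehicles to precision agriculture. He is a member of the AWS ML technical field community, where he specializes in supporting customers with MLOps and ML inference workloads.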
