Retrain ML models and automate batch predictions in Amazon SageMaker Canvas using updated datasets

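You can now retrain machine learning (ML) models and automate batch prediction workflows with updated datasets in Amazon SageMaker Canvas, making it easier to learn continuously, improve model performance, and work more efficiently. An ML model is only as effective as the quality and relevance of the data it is trained on. Over time, the underlying patterns, trends, and distributions in the data may change. By updating the dataset, you make sure the model learns from the most recent and representative data, which improves its ability to make accurate predictions. Canvas now supports updating datasets both automatically and manually, so you can train your ML models on the latest version of tabular, image, and document datasets.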

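After a model is trained, you may want to run predictions on it. Running batch predictions on an ML model processes multiple data points at once instead of making predictions one by one. Automating this process improves efficiency and scalability and supports timely decision-making. After the predictions are generated, they can be further analyzed, aggregated, or visualized to gain insights, identify patterns, or make informed decisions based on the predicted outcomes. Canvas now supports setting up an automated batch prediction configuration and associating a dataset with it. When the associated dataset is refreshed, either manually or on a schedule, a batch prediction workflow is triggered automatically on the corresponding model. The prediction results can be viewed inline or downloaded for later review.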

In this post, we show how to retrain ML models and automate batch predictions using updated datasets in Canvas.

Solution overview

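In our use case, we play the part of a business analyst for an ecommerce company. Our product team wants us to determine the most critical metrics that influence a shopper's purchase decision. To do this, we train an ML model in Canvas with the company's customer website online session dataset. We evaluate the model's performance and, if needed, retrain the model with additional data to see whether it improves on the existing model. For the retraining, we use the auto update dataset capability in Canvas and retrain our existing ML model with the latest version of the training dataset. Then we configure an automatic batch prediction workflow: when the corresponding prediction dataset is updated, it automatically triggers a batch prediction job on the model and makes the results available for us to review.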

The workflow steps are as follows:

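Upload the downloaded customer website online session data to Amazon Simple Storage Service (Amazon S3) and create a new training dataset in Canvas. For the full list of supported data sources, refer to Importing data in Amazon SageMaker Canvas.
Build an ML model and analyze its performance metrics. Refer to the steps on how to build a custom ML model in Canvas and evaluate a model's performance.
Set up auto update on the existing training dataset and upload new data to the Amazon S3 location backing this dataset. Upon completion, this should create a new dataset version.
Retrain the ML model with the latest version of the dataset and analyze its performance.
Set up automatic batch predictions on the better-performing model version and view the prediction results.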

You can perform these steps in Canvas without writing a single line of code.
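Canvas itself needs no code, but the two upload steps (the initial data and the later refresh to the Amazon S3 location backing the dataset) could optionally be scripted. The following is a minimal sketch, assuming the boto3 library is installed and AWS credentials are configured. The function name `upload_dataset`, the bucket name, and the prefix are hypothetical and not part of the original walkthrough.

```python
import os


def upload_dataset(local_path, bucket, prefix, s3_client=None):
    """Upload a local file under a prefix in an S3 bucket and return its S3 URI.

    `bucket` and `prefix` are placeholders (assumptions): use the Amazon S3
    location that backs your own Canvas dataset.
    """
    if s3_client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3  # assumption: boto3 is installed and credentials are set

        s3_client = boto3.client("s3")

    filename = os.path.basename(local_path)
    # Join prefix and file name without producing a leading or doubled slash.
    key = "/".join(part for part in (prefix.strip("/"), filename) if part)

    # upload_file(Filename, Bucket, Key) is the standard boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    uri = upload_dataset(
        "online_sessions_new.csv", "my-canvas-bucket", "canvas/training/"
    )
    print(uri)
```

Whether a newly uploaded file actually produces a new dataset version depends on the auto update settings configured in Canvas; the script only places the file in the bucket.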

Data overview

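The dataset consists of feature vectors belonging to 12,330 sessions. It was formed so that each session belongs to a different user within a 1-year period, to avoid any tendency toward a specific campaign, special day, user profile, or period. The following table outlines the data schema.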

Column Name	Data Type	Description
`Administrative`	Numeric	Number of pages visited by the user for user account management-related activities.
`Administrative_Duration`	Numeric	Amount of time spent in this category of pages.
`Informational`	Numeric	Number of pages of this type (informational) that the user visited.
`Informational_Duration`	Numeric	Amount of time spent in this category of pages.
`ProductRelated`	Numeric	Number of pages of this type (product related) that the user visited.
`ProductRelated_Duration`	Numeric	Amount of time spent in this category of pages.
`BounceRates`	Numeric	Percentage of visitors who enter the website through that page and exit without triggering any additional tasks.
`ExitRates`	Numeric	Average exit rate of the pages visited by the user. This is the percentage of people who left your site from that page.
`PageValues`	Numeric	Average page value of the pages visited by the user. This is the average value for a page that a user visited before landing on the goal page or completing an ecommerce transaction (or both).
`SpecialDay`	Numeric	The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (such as Mother’s Day or Valentine’s Day) in which the sessions are more likely to be finalized with a transaction.
`Month`	Categorical	Month of the visit.
`OperatingSystems`	Categorical	Operating systems of the visitor.
`Browser`	Categorical	Browser used by the user.
`Region`	Categorical	Geographic region from which the session has been started by the visitor.
`TrafficType`	Categorical	Traffic source through which user has entered the website.
`VisitorType`	Categorical	Whether the customer is a new user, returning user, or other.
`Weekend`	Binary	If the customer visited the website on the weekend.
`Revenue`	Binary	If a purchase was made.
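
Before uploading a new file for a dataset update, it can help to confirm that its header matches the schema above. The following is a minimal sketch using only the Python standard library. The column names come from the table (compared with whitespace ignored), while the helper name `check_header` and the `require_target` option are assumptions for illustration, for example to allow a prediction file that has no `Revenue` column.

```python
import csv
import io

# Column names as listed in the data schema table above.
EXPECTED_COLUMNS = [
    "Administrative", "Administrative_Duration",
    "Informational", "Informational_Duration",
    "ProductRelated", "ProductRelated_Duration",
    "BounceRates", "ExitRates", "PageValues", "SpecialDay",
    "Month", "OperatingSystems", "Browser", "Region",
    "TrafficType", "VisitorType", "Weekend", "Revenue",
]


def _normalize(name):
    # Drop all whitespace so "Page Values" and "PageValues" compare equal.
    return "".join(name.split())


def check_header(csv_text, target="Revenue", require_target=True):
    """Return (missing, unexpected) column names for the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    found = {_normalize(col) for col in header}

    expected = list(EXPECTED_COLUMNS)
    if not require_target:
        # A file used only for predictions may omit the target column.
        expected.remove(target)

    missing = [col for col in expected if col not in found]
    unexpected = sorted(found - set(EXPECTED_COLUMNS))
    return missing, unexpected


if __name__ == "__main__":
    sample = "Administrative,Page Values,Foo\n1,0.0,x\n"
    missing, unexpected = check_header(sample)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

The check covers only column names; it does not validate data types or values, which Canvas infers on import.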