We use machine learning technology to do auto-translation. Click "English" on top navigation bar to check Chinese version.
Hosting ML Models on Amazon SageMaker using Triton: XGBoost, LightGBM, and Treelite Models
One of the most popular models available today is XGBoost. With the ability to solve various problems such as classification and regression, XGBoost has become a popular option that also falls into the category of tree-based models. In this post, we dive deep to see how
SageMaker provides
Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the
Deep dive into the FIL backend
Triton supports the
The FIL backend uses cuML’s libraries to use CPU or GPU cores to accelerate learning. In order to use these processors, data is referenced from host memory (for example, NumPy arrays) or GPU arrays (uDF, Numba, cuPY, or any library that supports the __cuda_array_interface__
) API. After the data is staged in memory, the FIL backend can run processing across all the available CPU or GPU cores.
The FIL backend threads can communicate with each other without utilizing shared memory of the host, but in ensemble workloads, host memory should be considered. The following diagram shows an ensemble scheduler runtime architecture where you have the ability to fine-tune the memory areas, including CPU addressable shared memory that is used for inter-process communication between Triton (C++) and the Python process (Python backend) for exchanging tensors (input/output) with the FIL backend.
Triton Inference Server provides configurable options for developers to tune their workloads and optimize model performance. The configuration dynamic_batching
allows Triton to hold client-side requests and batch them on the server side in order to efficiently use FIL’s parallel computation to inference the entire batch together. The option max_queue_delay_microseconds
offers a fail-safe control of how long Triton waits to form a batch.
There are a number of other FIL-specific
storage_type
. When running the backend on GPU, FIL creates a new memory/data structure that is a representation of the tree for which FIL can impact performance and footprint. This is configurable via the environment parameter storage_type
, which has the options dense, sparse, and auto. Choosing the dense option will consume more GPU memory and doesn’t always result in better performance, so it’s best to check. In contrast, the sparse option will consume less GPU memory and can possibly perform as well or better than dense. Choosing auto will cause the model to default to dense unless doing so will consume significantly more GPU memory than sparse.
When it comes to model performance, you might consider emphasizing the threads_per_tree
option. One thing that you may overserve in real-world scenarios is that threads_per_tree
can have a bigger impact on throughput than any other parameter. Setting it to any power of 2 from 1–32 is legitimate. The optimal value is hard to predict for this parameter, but when the server is expected to deal with higher load or process larger batch sizes, it tends to benefit from a larger value than when it’s processing a few rows at a time.
Another parameter to be aware of is algo
, which is also available if you’re running on GPU. This parameter determines the algorithm that’s used to process the inference requests. The options supported for this are ALGO_AUTO
, NAIVE
, TREE_REORG
, and BATCH_TREE_REORG
. These options determine how nodes within a tree are organized and can also result in performance gains. The ALGO_AUTO
option defaults to NAIVE
for sparse storage and BATCH_TREE_REORG
for dense storage.
Lastly, FIL comes with Shapley explainer, which can be activated by using the treeshap_output
parameter. However, you should keep in mind that Shapley outputs hurt performance due to its output size.
Model format
There is currently no standard file format to store forest-based models; every framework tends to define its own format. In order to support multiple input file formats, FIL imports data using the open-source
model_type
configuration value specified in the config.pbtxt
file.
Config.pbtxt
Each model in a
config.pbtxt
file specified as
- max_batch_size – This determines the maximum batch size that can be passed to this model. In general, the only limit on the size of batches passed to a FIL backend is the memory available with which to process them. For GPU runs, the available memory is determined by the size of Triton’s CUDA memory pool, which can be set via a command line argument when starting the server.
- input – Options in this section tell Triton the number of features to expect for each input sample.
- output – Options in this section tell Triton how many output values there will be for each sample. If the
predict_proba
option is set to true, then a probability value will be returned for each class. Otherwise, a single value will be returned, indicating the class predicted for the given sample. - instance_group – This determines how many instances of this model will be created and whether they will use GPU or CPU.
- model_type – This string indicates what format the model is in (
xgboost_json
in this example, butxgboost
,lightgbm
, andtl_checkpoint
are valid formats as well). - predict_proba – If set to true, probability values will be returned for each class rather than just a class prediction.
- output_class – This is set to true for classification models and false for regression models.
- threshold – This is a score threshold for determining classification. When
output_class
is set to true, this must be provided, although it won’t be used ifpredict_proba
is also set to true. - storage_type – In general, using AUTO for this setting should meet most use cases. If AUTO storage is selected, FIL will load the model using either a sparse or dense representation based on the approximate size of the model. In some cases, you may want to explicitly set this to SPARSE in order to reduce the memory footprint of large models.
Triton Inference Server on SageMaker
SageMaker
When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance
as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, depending on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
For a list of NVIDIA Triton Deep Learning Containers (DLCs) supported by SageMaker inference, refer to
SageMaker notebook walkthrough
ML applications are complex and can often require data preprocessing. In this notebook, we dive into how to deploy a tree-based ML model like XGBoost using the FIL backend in Triton on a SageMaker multi-model endpoint. We also cover how to implement a Python-based data preprocessing inference pipeline for your model using the ensemble feature in Triton. This will allow us to send in the raw data from the client side and have both data preprocessing and model inference happen in a Triton SageMaker endpoint for optimal inference performance.
Triton model ensemble feature
Triton Inference Server greatly simplifies the deployment of AI models at scale in production. Triton Inference Server comes with a convenient solution that simplifies building preprocessing and postprocessing pipelines. The Triton Inference Server platform provides the ensemble scheduler, which is responsible for pipelining models participating in the inference process while ensuring efficiency and optimizing throughput. Using ensemble models can avoid the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to Triton.
In this notebook, we show how to use the ensemble feature for building a pipeline of data preprocessing with XGBoost model inference, and you can extrapolate from it to add custom postprocessing to the pipeline.
Set up the environment
We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the
Create a Conda environment for preprocessing dependencies
The Python backend in Triton requires us to use a
- We show how to create a Conda environment for your dependencies and how to package it in the
format expected by Triton’s Python backend. - By showing the preprocessing model running in the Python backend on the CPU while the XGBoost runs on the GPU in the FIL backend, we illustrate how each model in Triton’s ensemble pipeline can run on a different framework backend as well as different hardware configurations.
- It highlights how the RAPIDS libraries (cuDF, cuML) are compatible with their CPU counterparts (Pandas, scikit-learn). For example, we can show how
LabelEncoders
created in cuML can be used in scikit-learn and vice versa.
We follow the instructions from the
After we run the preceding script, it generates preprocessing_env.tar.gz
, which we copy to the preprocessing directory:
Set up preprocessing with the Triton Python backend
For preprocessing, we use Triton’s
The Python backend enables preprocessing, postprocessing, and any other custom logic to be implemented in Python and served with Triton. Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. We have already set up a model for Python data preprocessing called preprocessing in cpu_model_repository
and gpu_model_repository
.
Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. The value 1 represents version 1 of our Python preprocessing model. Each model is run by a specific backend, so within each version subdirectory there must be the model artifact required by that backend. For this example, we use the Python backend, which requires the Python file you’re serving to be called model.py, and the file needs to implement
The
Every Triton model must also provide a config.pbtxt
file describing the model configuration. To learn more about the config settings, refer to
Set up a tree-based ML model for the FIL backend
Next, we set up the model directory for a tree-based ML model like XGBoost, which will be using the FIL backend.
The expected layout for cpu_memory_repository
and gpu_memory_repository
are similar to the one we showed earlier.
Here, FIL
is the name of the model. We can give it a different name like xgboost
if we want to. 1
is the version subdirectory, which contains the model artifact. In this case, it’s the xgboost.json
model that we saved. Let’s create this expected layout:
We need to have the configuration file config.pbtxt
describing the model configuration for the tree-based ML model, so that the FIL backend in Triton can understand how to serve it. For more information, refer to the latest generic
Create config.pbtxt
for model_cpu_repository
:
Similarly, set up config.pbtxt
for model_gpu_repository
(note the difference is USE_GPU = True
):
Set up an inference pipeline of the data preprocessing Python backend and FIL backend using ensembles
Now we’re ready to set up the inference pipeline for data preprocessing and tree-based model inference using an
The expected layout for the ensemble
model directory is similar to the ones we showed previously:
We created the ensemble model’s
config.pbtxt
, which specifies the data flow between models within the ensemble. The ensemble scheduler collects the output tensors in each step, and provides them as input tensors for other steps according to the specification.
Package the model repository and upload to Amazon S3
Finally, we end up with the following model repository directory structure, containing a Python preprocessing model and its dependencies along with the XGBoost FIL model and the model ensemble.
We package the directory and its contents as model.tar.gz
for uploading to
Create and upload the model package for a CPU-based instance (optimized for CPU) with the following code:
Create and upload the model package for a GPU-based instance (optimized for GPU) with the following code:
Create a SageMaker endpoint
We now have the model artifacts stored in an S3 bucket. In this step, we can also provide the additional environment variable SAGEMAKER_TRITON_DEFAULT_MODEL_NAME
, which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to Amazon S3. This variable is optional in the case of a single model. In the case of ensemble models, this key has to be specified for Triton to start up in SageMaker.
Additionally, you can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT
and SAGEMAKER_TRITON_THREAD_COUNT
for optimizing the thread counts.
We use the preceding model to create an endpoint configuration where we can specify the type and number of instances we want in the endpoint
We use this endpoint configuration to create a SageMaker endpoint and wait for the deployment to finish. With SageMaker MMEs, we have the option to host multiple ensemble models by repeating this process, but we stick with one deployment for this example:
The status will change to InService
when the deployment is successful.
Invoke your model hosted on the SageMaker endpoint
After the endpoint is running, we can use some sample raw data to perform inference using JSON as the payload format. For the inference request format, Triton uses the KFServing
community standard
The notebook referred in the blog can be found in the
Best practices
In addition to the options to fine-tune the settings of the FIL backend we mentioned earlier, data scientists can also ensure that the input data for the backend is optimized for processing by the engine. Whenever possible, input data in row-major format into the GPU array. Other formats will require internal conversion and take up cycles, decreasing performance.
Due to the way FIL data structures are maintained in GPU memory, be mindful of the tree depth. The deeper the tree depth, the larger your GPU memory footprint will be.
Use the instance_group_count
parameter to add worker processes and increase the throughput of the FIL backend, which will result in larger CPU and GPU memory consumption. In addition, consider SageMaker-specific variables that are available to increase the throughput, such as HTTP threads, HTTP buffer size, batch size, and max delay.
Conclusion
In this post, we dove deep into the FIL backend that Triton Inference Server supports on SageMaker. This backend provides for both CPU and GPU acceleration of your tree-based models such as the popular XGBoost algorithm. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to use this capability with single and multi-model endpoints to balance of performance and cost savings.
We encourage you to take the information in this post and see if SageMaker can meet your hosting needs to serve tree-based models, meeting your requirements for cost reduction and workload performance.
The notebook referenced in this post can be found in the SageMaker examples
About the Authors
Raghu Ramesha is an Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on Amazon Web Services, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.
Dhawal Patel is a Principal Machine Learning Architect at Amazon Web Services. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.
Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.
The mentioned AWS GenAI Services service names relating to generative AI are only available or previewed in the Global Regions. Amazon Web Services China promotes AWS GenAI Services relating to generative AI solely for China-to-global business purposes and/or advanced technology introduction.