Posted On: Sep 12, 2023
We are introducing the general availability of Amazon SageMaker Asynchronous Inference in the Amazon Web Services China (Beijing) Region, operated by Sinnet, and the Amazon Web Services China (Ningxia) Region, operated by NWCD. Asynchronous inference is a new inference option in Amazon SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for inferences with large payload sizes (up to 1 GB) and/or long processing times (up to 15 minutes) that need to be processed as requests arrive. Asynchronous inference also helps you save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.
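The scale-to-zero behavior described above is configured through Application Auto Scaling rather than SageMaker itself. The sketch below shows one plausible shape of that configuration, with `MinCapacity` set to 0 and a target-tracking policy on the endpoint's backlog metric; the endpoint name, variant name, and capacity limits are placeholders, not values from this announcement.

```python
# Sketch: allow an asynchronous endpoint's variant to scale in to zero
# instances when its request backlog drains. Names below are placeholders.

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-async-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 0,  # zero instances permitted when idle
    "MaxCapacity": 4,
}

scaling_policy = {
    "PolicyName": "async-backlog-scaling",
    "ServiceNamespace": "sagemaker",
    "ResourceId": scalable_target["ResourceId"],
    "ScalableDimension": scalable_target["ScalableDimension"],
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 5.0,  # target queued requests per instance (assumed)
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
            "Statistic": "Average",
        },
    },
}

# With AWS credentials configured, these would be applied via boto3:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**scalable_target)
#   client.put_scaling_policy(**scaling_policy)
```

Because the minimum capacity is 0, the policy can remove the last instance during idle periods, which is what makes the pay-only-while-processing cost model possible.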
Creating an asynchronous inference endpoint is similar to creating a real-time endpoint. You can use your existing Amazon SageMaker models and only need to specify additional asynchronous-inference-specific configuration parameters when creating your endpoint configuration. To invoke the endpoint, you place the request payload in Amazon S3 and provide a pointer to that payload as part of the invocation request. Upon invocation, Amazon SageMaker queues the request for processing and returns an output location in the response. Once processing completes, Amazon SageMaker places the inference result in that Amazon S3 location. You can optionally receive success or error notifications via Amazon Simple Notification Service (Amazon SNS).
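The flow above can be sketched with the boto3 request shapes involved: an endpoint configuration carrying the async-specific `AsyncInferenceConfig` block (output location plus optional SNS topics), and an `invoke_endpoint_async` call that passes an S3 pointer instead of the payload body. Every model, endpoint, bucket, and topic name here is a placeholder for illustration, not from the announcement.

```python
# Sketch of an asynchronous endpoint configuration and invocation.
# All resource names and ARNs below are placeholders.

endpoint_config_request = {
    "EndpointConfigName": "my-async-endpoint-config",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "my-existing-model",  # reuse an existing SageMaker model
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
    # The asynchronous-inference-specific block: where results are written,
    # and optional SNS topics for success/error notifications.
    "AsyncInferenceConfig": {
        "OutputConfig": {
            "S3OutputPath": "s3://my-bucket/async-output/",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws-cn:sns:cn-north-1:111122223333:success-topic",
                "ErrorTopic": "arn:aws-cn:sns:cn-north-1:111122223333:error-topic",
            },
        },
    },
}

invoke_request = {
    "EndpointName": "my-async-endpoint",
    # The payload is staged in S3; only a pointer is sent with the request.
    "InputLocation": "s3://my-bucket/async-input/payload.json",
}

# With AWS credentials configured, the calls would look like:
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(**endpoint_config_request)
#   ...create the endpoint from this configuration, then...
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint_async(**invoke_request)
#   # response["OutputLocation"] points at the S3 object where the
#   # inference result will appear once processing completes.
```

Note that, unlike a real-time `invoke_endpoint` call, the async invocation returns immediately with the output location; the caller polls S3 or subscribes to the SNS topics to learn when the result is ready.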