Posted On: Sep 12, 2023
We are introducing the general availability of Amazon SageMaker Asynchronous Inference in the Amazon Web Services China (Beijing) Region, operated by Sinnet, and the Amazon Web Services China (Ningxia) Region, operated by NWCD. Asynchronous inference is a new inference option in Amazon SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for inferences with large payload sizes (up to 1 GB) and/or long processing times (up to 15 minutes) that need to be processed as requests arrive. Asynchronous inference also helps you save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.
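The scale-to-zero behavior described above is configured through Application Auto Scaling rather than SageMaker itself. The sketch below shows one plausible shape of that configuration, with `MinCapacity` set to 0 and a target-tracking policy on the endpoint's backlog metric; the endpoint name, variant name, and capacity limits are placeholders, not values from this announcement.

```python
# Sketch: allow an asynchronous endpoint's variant to scale in to zero
# instances when its request backlog drains. Names below are placeholders.

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-async-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 0,  # zero instances permitted when idle
    "MaxCapacity": 4,
}

scaling_policy = {
    "PolicyName": "async-backlog-scaling",
    "ServiceNamespace": "sagemaker",
    "ResourceId": scalable_target["ResourceId"],
    "ScalableDimension": scalable_target["ScalableDimension"],
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 5.0,  # target queued requests per instance (assumed)
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
            "Statistic": "Average",
        },
    },
}

# With AWS credentials configured, these would be applied via boto3:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**scalable_target)
#   client.put_scaling_policy(**scaling_policy)
```

Because the minimum capacity is 0, the policy can remove the last instance during idle periods, which is what makes the pay-only-while-processing cost model possible.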
Creating an asynchronous inference endpoint is similar to creating a real-time endpoint. You can use your existing Amazon SageMaker models and only need to specify additional asynchronous-inference-specific configuration parameters when creating your endpoint configuration. To invoke the endpoint, you place the request payload in Amazon S3 and provide a pointer to that payload as part of the invocation request. Upon invocation, Amazon SageMaker queues the request for processing and returns an output location in the response. Once processing completes, Amazon SageMaker places the inference result in that Amazon S3 location. You can optionally receive success or error notifications via Amazon Simple Notification Service (Amazon SNS).
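The flow above can be sketched with the boto3 request shapes involved: an endpoint configuration carrying the async-specific `AsyncInferenceConfig` block (output location plus optional SNS topics), and an `invoke_endpoint_async` call that passes an S3 pointer instead of the payload body. Every model, endpoint, bucket, and topic name here is a placeholder for illustration, not from the announcement.

```python
# Sketch of an asynchronous endpoint configuration and invocation.
# All resource names and ARNs below are placeholders.

endpoint_config_request = {
    "EndpointConfigName": "my-async-endpoint-config",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "my-existing-model",  # reuse an existing SageMaker model
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
    # The asynchronous-inference-specific block: where results are written,
    # and optional SNS topics for success/error notifications.
    "AsyncInferenceConfig": {
        "OutputConfig": {
            "S3OutputPath": "s3://my-bucket/async-output/",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws-cn:sns:cn-north-1:111122223333:success-topic",
                "ErrorTopic": "arn:aws-cn:sns:cn-north-1:111122223333:error-topic",
            },
        },
    },
}

invoke_request = {
    "EndpointName": "my-async-endpoint",
    # The payload is staged in S3; only a pointer is sent with the request.
    "InputLocation": "s3://my-bucket/async-input/payload.json",
}

# With AWS credentials configured, the calls would look like:
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(**endpoint_config_request)
#   ...create the endpoint from this configuration, then...
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint_async(**invoke_request)
#   # response["OutputLocation"] points at the S3 object where the
#   # inference result will appear once processing completes.
```

Note that, unlike a real-time `invoke_endpoint` call, the async invocation returns immediately with the output location; the caller polls S3 or subscribes to the SNS topics to learn when the result is ready.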