Implementing Amazon Web Services Lambda error handling patterns
This post is written by Jeff Chen, Principal Cloud Application Architect, and Jeff Li, Senior Cloud Application Architect
Handling errors becomes even more important when designing distributed applications. A service may fail if it cannot handle an invalid payload, if dependent resources are unavailable, or if it times out. Permission errors can also cause failures. Amazon Web Services services provide many features to handle error conditions, which you can use to improve the resiliency of your applications.
This post explores three use-cases and design patterns for handling failures.
Overview
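The post covers the three Lambda invocation types: synchronous, asynchronous, and poll-based.
Lambda’s integration with Amazon API Gateway is an example of a synchronous invocation. API Gateway has no built-in retry, so the client is responsible for retrying a failed request.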
Lambda’s integrations with SNS and EventBridge are examples of asynchronous invocations. SNS, for example, sends an event to Lambda for processing. When Lambda receives the event, it places the event on an internal event queue and returns an acknowledgment to SNS that it has received the message. A separate Lambda process reads events from the internal queue and invokes your Lambda function. If SNS cannot deliver an event to your Lambda function, it automatically retries the delivery according to its retry policy.
Lambda’s integration with SQS uses poll-based invocations. Lambda runs a fleet of pollers that poll your SQS queue for messages. The pollers read the messages in batches and invoke your Lambda function once per batch.
You can apply this pattern in many scenarios. For example, your operational application can add sales orders to an operational data store. You may then want to load the sales orders to your data warehouse periodically so that the information is available for forecasting and analysis. The operational application can batch completed sales as events and place them on an SQS queue. A Lambda function can then process the events and load the completed sale records into your data warehouse.
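A minimal sketch of the function side of this example in Python, assuming each SQS message body is a JSON sale record with hypothetical `order_id` and `amount` fields. `load_sales` is a hypothetical stand-in for the data warehouse load, and `load_sales_handler` follows the Lambda Python handler signature `(event, context)`. The event fields used (`Records`, `messageId`, `body`) are those of the SQS event Lambda passes to the function.

```python
import json
from typing import Any, Dict, List

# Hypothetical in-memory target standing in for the data warehouse.
LOADED: List[Dict[str, Any]] = []
REQUIRED_FIELDS = ("order_id", "amount")  # hypothetical sale-record schema


def load_sales(rows: List[Dict[str, Any]]) -> None:
    """Hypothetical bulk load into the data warehouse."""
    LOADED.extend(rows)


def load_sales_handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda invokes this once per batch of SQS messages."""
    rows = []
    for record in event["Records"]:
        sale = json.loads(record["body"])  # raises on malformed JSON
        missing = [field for field in REQUIRED_FIELDS if field not in sale]
        if missing:
            # Raising fails the whole batch, so the pollers delete none of its messages.
            raise ValueError(f"message {record['messageId']} is missing {missing}")
        rows.append(sale)
    load_sales(rows)  # only reached when every record in the batch is valid
    # Returning without an exception marks the batch as processed successfully.
    return {"loaded": len(rows)}
```

Because the handler raises on the first bad record, one malformed message fails the whole batch, which ties into the retry behavior described in the next paragraph.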
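If your function processes the batch successfully, the pollers delete the messages from the SQS queue. If the batch is not processed successfully, the pollers do not delete the messages from the queue. Once the visibility timeout expires, the messages become visible in the queue again and Lambda retries them until they are processed successfully or the message retention period expires.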
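To make the retry timing concrete, the following is a simplified Python model, not SQS itself. It assumes the pollers receive a message the instant it becomes visible and that processing takes no time; `simulate_message` is a hypothetical helper.

```python
from typing import Callable, Tuple


def simulate_message(
    process: Callable[[int], bool], visibility_timeout: int, retention_period: int
) -> Tuple[str, int]:
    """Model one message's life on an SQS queue polled by Lambda (simplified).

    process(n) returns True if the batch containing the message succeeds on receive n.
    Returns ("deleted" or "dropped", number of times the message was received).
    """
    now = 0  # seconds since the message was sent
    receives = 0
    while now < retention_period:
        receives += 1
        if process(receives):
            return "deleted", receives  # success: the pollers delete the message
        # Failure: the message stays on the queue, hidden until the visibility timeout expires.
        now += visibility_timeout
    return "dropped", receives  # retention period expired: SQS drops the message
```

With a 30-second visibility timeout and a 100-second retention period, a message that always fails is received at 0, 30, 60, and 90 seconds and is then dropped.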
The following table shows the invocation types and retry behavior of the Amazon Web Services services mentioned.
Amazon Web Services service example | Invocation type | Retry behavior |
Amazon API Gateway | Synchronous | No built-in retry, client attempts retries. |
Amazon SNS, Amazon EventBridge | Asynchronous | Built-in retries with exponential backoff. |
Amazon SQS | Poll-based | Retries after visibility timeout expires until message retention period expires. |
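Because synchronous invocations through API Gateway have no built-in retry, retrying is up to the client. A minimal client-side sketch in Python, assuming the client wraps its request in a callable and treats a hypothetical `TransientError` as retryable. `call_with_backoff` and its defaults are illustrative only; they are not part of any Amazon Web Services SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type the client treats as retryable (for example, HTTP 429 or 5xx)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying TransientError with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The backoff ceiling doubles each attempt (0.2 s, 0.4 s, 0.8 s, ...) up to max_delay;
            # sleeping a random fraction of it ("full jitter") keeps clients from retrying in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the helper testable; in production you would keep the default `time.sleep`.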
Several design patterns for poll-based and asynchronous invocations retain failed messages for additional processing. These patterns help you recover from delivery or processing failures.
You can explore the patterns and test the scenarios by deploying the code from the accompanying repository.
Lambda poll-based invocation pattern
When using Lambda with SQS, if Lambda isn’t able to process a message before the message retention period expires, SQS drops the message. Processing can fail because of function errors, including time-outs or invalid payloads. It can also fail when the destination function does not exist or has incorrect permissions.
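You can configure a separate SQS queue as a dead-letter queue (DLQ) for the source queue. Instead of dropping messages that repeatedly fail processing, SQS moves them to the DLQ so that you can inspect and reprocess them.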
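The DLQ is attached through the source queue's `RedrivePolicy` attribute, a JSON string containing `deadLetterTargetArn` and `maxReceiveCount`. The sketch below builds the `Attributes` map for the SQS `SetQueueAttributes` call without calling Amazon Web Services; the helper names and the queue ARN are hypothetical. With boto3, you could pass the result as `sqs.set_queue_attributes(QueueUrl=..., Attributes=...)`.

```python
import json
from typing import Dict


def redrive_attributes(dlq_arn: str, max_receive_count: int) -> Dict[str, str]:
    """Build the Attributes map for SQS SetQueueAttributes on the *source* queue."""
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receive_count)}
    return {"RedrivePolicy": json.dumps(policy)}


def moves_to_dlq(receive_count: int, max_receive_count: int) -> bool:
    """SQS moves a message to the DLQ once its receive count exceeds maxReceiveCount."""
    return receive_count > max_receive_count


# Hypothetical DLQ ARN, for illustration only.
attributes = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 5)
```

A higher `maxReceiveCount` gives transient failures more chances to succeed, at the cost of keeping a poison message in the source queue longer.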
The following shows the design pattern when you configure Lambda to poll events from an SQS queue and invoke a Lambda function.
To explore this pattern, deploy the code in the repository.
Lambda asynchronous invocation pattern
With asynchronous invocations, there are two kinds of failure to consider when using Lambda: the event source cannot deliver the message to Lambda, or the Lambda function errors while processing the event.
Event sources vary in how they handle failures delivering messages to Lambda. If SNS or EventBridge cannot deliver the event to Lambda after exhausting all retry attempts, the service drops the event. To hold dropped events, you can configure a DLQ on the SNS subscription or on the EventBridge rule target. This works in the same way as the poll-based invocation pattern with SQS.
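For SNS, the DLQ is set on the subscription through its `RedrivePolicy` attribute; for EventBridge, it is set on each rule target through `DeadLetterConfig`, alongside an optional `RetryPolicy`. The sketch below only builds request parameters for SNS `SetSubscriptionAttributes` and EventBridge `PutTargets`; the helper names, IDs, and ARNs are hypothetical. The DLQ's queue policy must also allow SNS or EventBridge to send messages to it.

```python
import json
from typing import Any, Dict


def sns_subscription_dlq_params(subscription_arn: str, dlq_arn: str) -> Dict[str, str]:
    """Keyword arguments for SNS SetSubscriptionAttributes that attach a DLQ to one subscription."""
    return {
        "SubscriptionArn": subscription_arn,
        "AttributeName": "RedrivePolicy",
        "AttributeValue": json.dumps({"deadLetterTargetArn": dlq_arn}),
    }


def eventbridge_target(
    target_id: str,
    function_arn: str,
    dlq_arn: str,
    max_retries: int = 185,
    max_age_seconds: int = 86400,
) -> Dict[str, Any]:
    """One entry for the Targets list of EventBridge PutTargets, with a retry policy and a DLQ."""
    if not 0 <= max_retries <= 185:
        raise ValueError("MaximumRetryAttempts must be between 0 and 185")
    if not 60 <= max_age_seconds <= 86400:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 86400")
    return {
        "Id": target_id,
        "Arn": function_arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retries,
            "MaximumEventAgeInSeconds": max_age_seconds,
        },
        "DeadLetterConfig": {"Arn": dlq_arn},
    }
```

With boto3, these would be passed as `sns.set_subscription_attributes(**params)` and `events.put_targets(Rule=..., Targets=[target])`.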
Lambda functions may then fail because of input payload syntax errors, duration time-outs, or exceptions thrown by the function, such as when a data resource is unavailable.
For asynchronous invocations, you can configure how long Lambda retains an event in its internal queue, up to 6 hours. You can also configure how many times Lambda retries when the function errors, between 0 and 2. Lambda discards the event when the maximum age passes or all retry attempts fail. To retain a copy of discarded events, you can configure either a DLQ or, preferably, a failed-event destination.
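These settings map to the Lambda `PutFunctionEventInvokeConfig` API (`put_function_event_invoke_config` in boto3), whose parameters include `MaximumRetryAttempts`, `MaximumEventAgeInSeconds`, and `DestinationConfig`. The sketch below only builds and validates the keyword arguments, and adds a hypothetical `is_discarded` helper that models the discard rule described above in simplified form; it does not call Amazon Web Services.

```python
from typing import Any, Dict, Optional


def event_invoke_config(
    function_name: str,
    max_retries: int = 2,
    max_age_seconds: int = 21600,
    on_success: Optional[str] = None,
    on_failure: Optional[str] = None,
) -> Dict[str, Any]:
    """Keyword arguments for boto3's lambda_client.put_function_event_invoke_config(...)."""
    if not 0 <= max_retries <= 2:
        raise ValueError("MaximumRetryAttempts must be between 0 and 2")
    if not 60 <= max_age_seconds <= 21600:  # 21600 s = 6 hours
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 21600")
    params: Dict[str, Any] = {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_age_seconds,
    }
    destinations: Dict[str, Any] = {}
    if on_success:
        destinations["OnSuccess"] = {"Destination": on_success}
    if on_failure:
        destinations["OnFailure"] = {"Destination": on_failure}
    if destinations:
        params["DestinationConfig"] = destinations
    return params


def is_discarded(failed_attempts: int, age_seconds: int, max_retries: int, max_age_seconds: int) -> bool:
    """Simplified model: an event is discarded once it is older than the maximum age
    or has failed on the initial attempt plus every retry."""
    return age_seconds > max_age_seconds or failed_attempts >= 1 + max_retries
```

With the default of 2 retries, the function runs at most three times for one event before Lambda discards it or routes it to the failure destination.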
A Lambda destination enables you to specify what to do next if an asynchronous invocation succeeds or fails. You can configure a destination to send invocation records to SQS, SNS, EventBridge, or another Lambda function. Destinations are preferred for failure processing as they support additional targets and include additional information. A DLQ holds the original failed event. With a destination, Lambda also passes details of the function’s response in the invocation record. This includes stack traces, which can be useful for analyzing the root cause.
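An on-failure destination receives an invocation record rather than the bare event. A minimal sketch of a consumer that reads such a record from an SQS destination, assuming the record fields `requestContext`, `requestPayload`, and `responsePayload` as described in the Lambda destinations documentation. `summarize_failure` is hypothetical, and its `.get` calls tolerate fields that may be absent.

```python
import json
from typing import Any, Dict


def summarize_failure(message_body: str) -> Dict[str, Any]:
    """Extract the fields useful for root-cause analysis from an on-failure invocation record."""
    record = json.loads(message_body)
    request = record.get("requestContext", {})
    response = record.get("responsePayload")
    if not isinstance(response, dict):
        response = {}  # the payload is not always a JSON object
    return {
        "request_id": request.get("requestId"),
        "condition": request.get("condition"),  # for example "RetriesExhausted"
        "attempts": request.get("approximateInvokeCount"),
        "error_type": response.get("errorType"),
        "error_message": response.get("errorMessage"),
        "stack_trace": response.get("stackTrace", []),
        "original_event": record.get("requestPayload"),  # the event Lambda originally received
    }
```

The `stack_trace` and `condition` fields are what make a destination more useful than a DLQ for root-cause analysis.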
Using both a DLQ and Lambda destinations
You can apply this pattern in many scenarios. For example, many of your applications may contain customer records. To comply with the
The following shows the design pattern when you configure an SNS topic as the event source for a Lambda function, which uses destination queues for success and failure processing.
You configure a DLQ on the SNS subscription to capture messages that SNS cannot deliver to Lambda. When the function processes an event successfully, Lambda sends details of the invocation to an on-success SQS destination. You can use this pattern to route an event to multiple services for simpler use cases. For orchestrating multiple services,
Lambda can also send details of unsuccessfully processed messages to an on-failure SQS destination.
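A minimal sketch of a function in this pattern, written in Python. It assumes each SNS message is a JSON customer record with a hypothetical `customer_id` field; `CUSTOMERS` is a hypothetical in-memory store standing in for an application's customer data. The event fields used (`Records`, `Sns`, `Message`, `MessageId`) are those of the SNS event Lambda receives.

```python
import json
from typing import Any, Dict, List

# Hypothetical in-memory store standing in for an application's customer records.
CUSTOMERS: Dict[str, Dict[str, Any]] = {}


def customer_update_handler(event: Dict[str, Any], context: Any) -> Dict[str, List[str]]:
    """Invoked asynchronously by SNS; each message carries one customer record."""
    updated = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        customer_id = message.get("customer_id")
        if not customer_id:
            # Raising marks the invocation as failed; once retries are exhausted,
            # Lambda sends an invocation record with this error to the on-failure destination.
            raise ValueError(f"SNS message {record['Sns'].get('MessageId')} has no customer_id")
        # An upsert is idempotent, which matters because Lambda may retry the same event.
        CUSTOMERS[customer_id] = message
        updated.append(customer_id)
    # On success, Lambda includes this return value as responsePayload in the on-success record.
    return {"updated": updated}
```

Keeping the handler idempotent means a retried event produces the same result instead of a duplicate side effect.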
A variant of this pattern replaces an SQS destination with an EventBridge destination, so that multiple consumers can each process the same invocation record through their own EventBridge rules.
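In the EventBridge variant, each consumer selects events with a rule pattern. This sketch assumes destination events arrive with `source` set to `lambda` and a `detail-type` of `Lambda Function Invocation Result - Failure` (or `- Success`), with the invocation record in `detail`; verify those values against the Lambda destinations documentation. `matches` is a hypothetical helper that implements only exact-value matching, a small subset of EventBridge's pattern syntax.

```python
import json
from typing import Any, Dict

# Assumed pattern for failure records that Lambda sends to an EventBridge destination.
FAILURE_PATTERN: Dict[str, Any] = {
    "source": ["lambda"],
    "detail-type": ["Lambda Function Invocation Result - Failure"],
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Tiny subset of EventBridge matching: every pattern key must be present in the event;
    list values match if the event value equals one of them, and nested dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# The JSON string you would pass as EventPattern when creating the rule.
event_pattern = json.dumps(FAILURE_PATTERN)
```

Each additional consumer adds its own rule with its own pattern, so the function does not need to know who reads its results.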
To explore how to use an SQS DLQ and Lambda destinations, deploy the code in the repository.
Using a DLQ
Although destinations are the preferred method for handling function failures, you can also use DLQs.
The following shows the design pattern when you configure an SNS topic as the event source for a Lambda function, which uses SQS queues for failure processing.
You configure a DLQ on the SNS subscription to capture the messages that SNS cannot deliver to the Lambda function. You also configure a separate DLQ for the Lambda function. When the function still fails after the maximum number of retry attempts, Lambda sends the event to this DLQ.
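With a Lambda DLQ, the message body is the original event, and Lambda adds the `RequestID`, `ErrorCode`, and `ErrorMessage` message attributes; unlike a destination record, it carries no stack trace. The sketch below builds the `DeadLetterConfig` argument for the Lambda `UpdateFunctionConfiguration` API and unpacks one message as returned by SQS `ReceiveMessage` with `MessageAttributeNames=["All"]`. The helper names and sample values are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def dlq_config_params(function_name: str, dlq_arn: str) -> Dict[str, Any]:
    """Keyword arguments for boto3's lambda_client.update_function_configuration(...)."""
    return {"FunctionName": function_name, "DeadLetterConfig": {"TargetArn": dlq_arn}}


def read_dlq_message(message: Dict[str, Any]) -> Dict[str, Any]:
    """Unpack one SQS message from a Lambda function's DLQ."""
    attrs = message.get("MessageAttributes", {})

    def attr(name: str) -> Optional[str]:
        return attrs.get(name, {}).get("StringValue")

    return {
        "original_event": json.loads(message["Body"]),  # Lambda stores the event as-is
        "request_id": attr("RequestID"),
        "error_code": attr("ErrorCode"),
        "error_message": attr("ErrorMessage"),
    }
```

Comparing this output with the destination record shows why destinations are preferred: the DLQ gives you the event and an error message, but not the stack trace.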
To explore how to use a Lambda DLQ, deploy the code in the repository.
Conclusion
This post explains three patterns that you can use to design resilient event-driven serverless applications. Error handling during event processing is an important part of designing serverless cloud applications.
You can deploy the code from the repository to explore how to use poll-based and asynchronous invocations. See how poll-based invocations can send failed messages to a DLQ. See how to use DLQs and Lambda destinations to route and handle unsuccessful events.
Learn more about event-driven architecture on