Lowering MTTR with Amazon CloudWatch and Amazon Web Services X-Ray
Customers running microservice-based workloads in a serverless environment frequently have issues with troubleshooting incidents as the data they need can be distributed across hundreds or thousands of components. In this blog post, I will demonstrate how you can reduce the mean time to resolution (MTTR, or the average time it takes to repair or mitigate the impact of a failure) for your serverless workloads using
Understanding your dependencies
CloudWatch ServiceLens helps you visualize your distributed environment and understand the dependencies between services. It brings metrics, logs, and traces together into a service map so that you can
By default, the size of each node and the edges between them represent the number of requests they are receiving. To change this, select the gear icon to open the Preferences dialog box and choose an alternate metric to base sizing on, such as Latency. Selecting a node opens a panel with additional metrics showing latency, requests, and faults over time, along with links to your logs, traces, and a more detailed dashboard.
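The same per-node figures the map uses for sizing (requests, faults, latency) can also be read programmatically. The following is a minimal sketch, assuming a boto3 X-Ray client is passed in as `xray_client` and that its `get_service_graph` response carries the `SummaryStatistics` fields used here. The function names `fetch_services` and `summarize_nodes` are illustrative and not part of any AWS library.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def fetch_services(xray_client: Any, minutes: int = 30) -> list[dict[str, Any]]:
    """Collect every service node X-Ray reports for the last `minutes` minutes."""
    end = datetime.now(timezone.utc)
    kwargs = {"StartTime": end - timedelta(minutes=minutes), "EndTime": end}
    services: list[dict[str, Any]] = []
    while True:
        page = xray_client.get_service_graph(**kwargs)
        services.extend(page.get("Services", []))
        token = page.get("NextToken")
        if not token:
            return services
        kwargs["NextToken"] = token  # follow pagination until exhausted


def summarize_nodes(services: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Reduce each node to request count, fault rate, and mean latency."""
    rows = []
    for service in services:
        stats = service.get("SummaryStatistics", {})
        total = stats.get("TotalCount", 0)
        faults = stats.get("FaultStatistics", {}).get("TotalCount", 0)
        total_time = stats.get("TotalResponseTime", 0.0)
        rows.append({
            "name": service.get("Name", "<unnamed>"),
            "requests": total,
            "fault_rate": faults / total if total else 0.0,
            "avg_latency_s": total_time / total if total else 0.0,
        })
    # Busiest nodes first, mirroring the default request-based sizing.
    return sorted(rows, key=lambda row: row["requests"], reverse=True)
```

In real use, `xray_client` would typically be created with `boto3.client("xray")`, and sorting on `avg_latency_s` instead of `requests` would approximate the Latency sizing option.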
This panel will also surface any CloudWatch alarms associated with the resource. In the preceding image, you can see that two alarms for a DynamoDB table are currently in the ALARM state.
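Outside the console, a similar check can be sketched with the CloudWatch `describe_alarms` API. This example assumes a boto3 CloudWatch client is passed in as `cloudwatch_client`, and it matches alarms to a resource by metric dimension, which is one plausible approach and not necessarily how ServiceLens associates them. The function name and the table name in the usage note are hypothetical.

```python
from typing import Any


def firing_alarms(cloudwatch_client: Any, dimension_name: str,
                  dimension_value: str) -> list[str]:
    """Return names of metric alarms in ALARM state for one resource dimension."""
    kwargs: dict[str, Any] = {"StateValue": "ALARM"}
    matches: list[str] = []
    while True:
        page = cloudwatch_client.describe_alarms(**kwargs)
        for alarm in page.get("MetricAlarms", []):
            # Alarms built on metric math may have no top-level Dimensions.
            dims = {d["Name"]: d["Value"] for d in alarm.get("Dimensions", [])}
            if dims.get(dimension_name) == dimension_value:
                matches.append(alarm["AlarmName"])
        token = page.get("NextToken")
        if not token:
            return matches
        kwargs["NextToken"] = token  # continue with the next page of alarms
```

For a DynamoDB table, a call might look like `firing_alarms(boto3.client("cloudwatch"), "TableName", "my-table")`, where `my-table` is a placeholder.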
Returning to the service map, let’s look at one of the nodes experiencing faults and select View connections to focus on all of the incoming and outgoing requests for that specific resource. Alternatively, you can use
From the preceding image, you can see that three nodes are making requests to the back-end PetSearch microservice and only two of them are experiencing issues. Selecting one of these edges brings up a panel with a Response time distribution graph summarizing the traffic along that path. If you were troubleshooting latency, you could highlight the portion of the graph showing slow response times and choose View filtered traces to get a list of traces in the selected time window. In this case, I am more concerned with the 24% error rate from the front-end PetSite service, so I select the checkbox next to 24% Faults (5xx) and choose View filtered traces.
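A comparable query can be sketched against the X-Ray API with a filter expression. This example assumes a boto3 X-Ray client is passed in as `xray_client` and that the service is reported to X-Ray under the name `PetSite`; the actual name depends on how the workload is instrumented. The helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def fault_filter(service_name: str) -> str:
    """Build an X-Ray filter expression for traces where the service faulted (5xx)."""
    return f'service("{service_name}") {{ fault }}'


def faulted_trace_ids(xray_client: Any, service_name: str,
                      minutes: int = 15) -> list[str]:
    """List trace IDs from the last `minutes` minutes that match the fault filter."""
    end = datetime.now(timezone.utc)
    kwargs: dict[str, Any] = {
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "FilterExpression": fault_filter(service_name),
    }
    trace_ids: list[str] = []
    while True:
        page = xray_client.get_trace_summaries(**kwargs)
        trace_ids.extend(s["Id"] for s in page.get("TraceSummaries", []))
        token = page.get("NextToken")
        if not token:
            return trace_ids
        kwargs["NextToken"] = token  # fetch the next page of summaries
```

With a client from `boto3.client("xray")`, calling `faulted_trace_ids(client, "PetSite")` would return IDs that can then be opened individually. For latency investigations, a condition such as `responsetime > 2` could be used inside the braces in place of `fault`.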
This loads a new page where you can further refine the query if needed. Selecting the first trace from the table, you are brought to the Trace details page.
With the Segment Timeline, you get a clear picture of what happened across the distributed system: the status of each response and how long each request took to answer. Selecting a segment from the timeline loads a panel with additional details, allowing you to see on the Exceptions tab that the root cause of the failure is a permission issue leading to an authorization error. This page also displays a correlated list of log messages for all of the services associated with the trace, allowing you to see logs from different log groups in a single place.
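The exception details shown on that tab come from the trace's segment documents, which can also be inspected in code. The sketch below assumes a trace shaped like one entry of the `Traces` list returned by the X-Ray `batch_get_traces` API, where each segment carries a JSON `Document` string, and it assumes exceptions are recorded under `cause.exceptions` as in the X-Ray segment document format. The function names are illustrative.

```python
import json
from typing import Any, Optional


def collect_exceptions(node: dict[str, Any],
                       found: Optional[list[tuple]] = None) -> list[tuple]:
    """Walk a segment and its subsegments, gathering (name, type, message)."""
    found = [] if found is None else found
    cause = node.get("cause")
    # `cause` may also be a bare exception ID string; only dicts carry details.
    if isinstance(cause, dict):
        for exc in cause.get("exceptions", []):
            found.append((node.get("name"), exc.get("type"), exc.get("message")))
    for sub in node.get("subsegments", []):
        collect_exceptions(sub, found)
    return found


def exceptions_in_trace(trace: dict[str, Any]) -> list[tuple]:
    """Parse every segment document in one trace and return its exceptions."""
    results: list[tuple] = []
    for segment in trace.get("Segments", []):
        results.extend(collect_exceptions(json.loads(segment["Document"])))
    return results
```

Applied to a trace like the one described above, the output would be expected to include an entry whose message describes the authorization failure, alongside the name of the segment that raised it.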
Integrating with Amazon Web Services X-Ray
ServiceLens builds its ServiceMap using traces from X-Ray to understand dependencies between services. While some interactions you want to capture will require
To enable X-Ray for your Amazon Web Services Lambda functions
- Open the Amazon Web Services Lambda console
- Select your function from the list
- Choose the Configuration tab and then choose Monitoring and operations tools
- Choose Edit and then toggle Active tracing under Amazon Web Services X-Ray
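The console steps above can also be sketched with the Lambda `update_function_configuration` API. This example assumes a boto3 Lambda client is passed in as `lambda_client`; the function name `enable_active_tracing` and the Lambda function name in the usage note are hypothetical.

```python
from typing import Any


def enable_active_tracing(lambda_client: Any, function_name: str) -> str:
    """Turn on X-Ray active tracing for one function and return the new mode."""
    response = lambda_client.update_function_configuration(
        FunctionName=function_name,
        TracingConfig={"Mode": "Active"},  # "PassThrough" would disable active tracing
    )
    return response["TracingConfig"]["Mode"]
```

A call might look like `enable_active_tracing(boto3.client("lambda"), "my-function")`. When tracing is enabled through the API instead of the console, the function's execution role may need X-Ray write permissions (such as `xray:PutTraceSegments` and `xray:PutTelemetryRecords`) added separately.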
While on the page to enable active tracing for your Lambda functions, you’ll also see an option to enable
Conclusion
In this article, I reviewed some of the ways you can use CloudWatch ServiceLens and Amazon Web Services X-Ray to monitor and observe your distributed workloads, focusing on the integrations between services. Once you’ve been alerted to an issue, ServiceLens can help you drill down into the relevant traces so that you can diagnose the failure and start working on a plan to return to normal business operations.
Interested in learning more about monitoring and observability? Review the
About the author