May 26, 2023

Treating errors in AWS Step Functions

Step Functions is a service provided by AWS to help developers build applications with complex workflows or pipelines. In essence, it allows you to describe your workflow as a state machine (or flowchart) of steps to be performed. The steps can be anything from a lambda function, a batch job, an Athena query or even starting some other state machine.

Step Functions also has functionalities related to the orchestration of distributed applications, with support for parallel execution of tasks and ‘fan-out’ for dynamic parallelism (for example, calling a lambda function for every N objects in a list). It also has support for asynchronous execution of tasks, and facilities for handling exceptions and re-executing states.

We can use a JSON format, through CloudFormation for example, to define these workflows (or state machines). We also have the option of using the visual editor through the AWS console.

We at Tenchi Security take advantage of the ease of integration of Step Functions with the entire AWS ecosystem, for example, IAM for access control, CloudTrail for auditing and CloudWatch for monitoring.

Step Functions really shines when you want to integrate pieces of logic that are scattered across multiple AWS services, in different languages, and with complex coordination.

For example, the following state machine (from the documentation):

‍

Is defined by the following code:

{
  “Comment”: “An example of the Amazon States Language for notification on an AWS Batch job completion”,
  “StartAt”: “Submit Batch Job”,
  “TimeoutSeconds”: 3600,
  “States”: {
    “Submit Batch Job”: {
      “Type”: “Task”,
      “Resource”: “arn:aws:states:::batch:submitJob.sync”,
      “Parameters”: {
        “JobName”: “BatchJobNotification”,
        “JobQueue”: “arn:aws:batch:us-east-1:123456789012:job-queue/BatchJobQueue-7049d367474b4dd”,
        “JobDefinition”: “arn:aws:batch:us-east-1:123456789012:job-definition/BatchJobDefinition-74d55ec34c4643c:1”
      },
      “Next”: “Notify Success”
    },
    “Notify Success”: {
      “Type”: “Task”,
      “Resource”: “arn:aws:states:::sns:publish”,
      “Parameters”: {
        “Message”: “Batch job submitted through Step Functions succeeded”,
        “TopicArn”: “arn:aws:sns:us-east-1:123456789012:batchjobnotificatiointemplate-SNSTopic-1J757CVBQ2KHM”
      },
      “End”: true
    }
  }
}

At Tenchi, we use Step Functions to power Zanshin‘s scanning engine. This allows us to describe the steps required to run a scan of a company’s environment as a flowchart of steps encompassing many different services that need to be accessed, with components that can be developed independently.

Like any complex computing environment, errors, exceptions, and timeouts happen from time to time. Depending on the nature of the application, these errors need to be caught and handled. How they are handled depends on the application. Tenchi’s engine is idempotent, so rerunning is a viable option, but other applications may want to abort the entire process or take a step to deal with the error.

Step Functions gives us a few different ways to catch errors:

1) Catching errors with the ‘catch’ statement

Each ‘step’ in a Step Functions state machine, represented by a state, can define its own ‘catch’ statement, for which you can define a set of expected errors or simply expect ALL errors.

In code, it looks like this:

  “Catch”: [
          {
            “ErrorEquals”: [ “States.ALL” ],
            “Next”: “Notify Failure”
          }
      ]

In our previous example, it would look like this (we created another state to treat the error):

{
  “Comment”: “An example of the Amazon States Language for notification on an AWS Batch job completion”,
  “StartAt”: “Submit Batch Job”,
  “TimeoutSeconds”: 3600,
  “States”: {
    “Submit Batch Job”: {
      “Type”: “Task”,
      “Resource”: “arn:aws:states:::batch:submitJob.sync”,
      “Parameters”: {
        “JobName”: “BatchJobNotification”,
        “JobQueue”: “arn:aws:batch:us-east-1:123456789012:job-queue/BatchJobQueue-7049d367474b4dd”,
        “JobDefinition”: “arn:aws:batch:us-east-1:123456789012:job-definition/BatchJobDefinition-74d55ec34c4643c:1”
      },
      “Next”: “Notify Success”,
      “Catch”: [
          {
            “ErrorEquals”: [ “States.ALL” ],
            “Next”: “Notify Failure”
          }
      ]
    },
    “Notify Success”: {
      “Type”: “Task”,
      “Resource”: “arn:aws:states:::sns:publish”,
      “Parameters”: {
        “Message”: “Batch job submitted through Step Functions succeeded”,
        “TopicArn”: “arn:aws:sns:us-east-1:123456789012:batchjobnotificatiointemplate-SNSTopic-1J757CVBQ2KHM”
      },
      “End”: true
    },
    “Notify Failure”: {
      “Type”: “Task”,
      “Resource”: “arn:aws:states:::sns:publish”,
      “Parameters”: {
        “Message”: “Batch job submitted through Step Functions failed”,
        “TopicArn”: “arn:aws:sns:us-east-1:123456789012:batchjobnotificatiointemplate-SNSTopic-1J757CVBQ2KHM”
      },
      “End”: true
    }
  }
}

Which produces the following diagram:

Shortfalls:

This method will work whenever an error is produced within a step, as long as we remember to add the catch block to every step.

However, this adds some unpleasant complexity to the state machine definition and diagram. Not so much in our example, but when things start to scale.

Furthermore, it will not catch errors produced outside the steps, but by Step Functions itself. For example, if it tries to access a parameter that doesn’t exist.

This method will also not be able to detect when the execution of the Step Functions state machine is interrupted because it has exceeded its timeout. In our case, this can happen when we scan in an environment that is too large.

2) Catching errors with the CloudWatch Events

CloudWatch Events is the event streaming service associated with the AWS monitoring service, CloudWatch. It is possible to use it to register lambda functions that listen and react to certain types of events.

A good option for monitoring failures for Step Functions is to register a lambda to listen for the events associated with an execution failure.

To do this, we create a new lambda and register the right trigger. On the AWS console, it should look like this:

In this example, we are using the same lambda to capture three events. FAILED, TIMED_OUT and ABORTED.

A list of all available events is available in the documentation.

This option is great because it detects any failure in the execution of the state machine, whether due to an execution error in one of the states, failures between state transitions, timeouts, or external interruptions (abort).

Problem solved, right? Not yet.

According to the documentation, there is no guarantee of delivery of these events. Which means we can’t 100% trust them to handle every error in our app.

3) Combining multiple methods

To decrease the probability of errors not being properly handled by our product, we can combine several methods.

In Zanshin, we use catch statements to catch most errors inside Step Functions, but we also have a lambda function registered to listen for FAILED and TIMED_OUT events. We also have a lambda that runs periodically and alerts us when we have an execution that has been running longer than a certain threshold, in case the two previous methods fail. This covers the unlikely but not impossible scenario of: error in the state machine definition that led to a lack of catch, failure in the Step Functions service that causes the catch not to be honored, and/or failure to deliver the associated EventBridge events.

The caveat is that if your error-handling logic isn’t idempotent, you’ll need to ensure it only runs once to avoid unwanted side effects.

Conclusions

In any sufficiently complex application, one should expect errors to happen eventually. Monitoring and handling these errors correctly is essential to ensure system reliability and a good user experience.

In this article, we show two ways to monitor errors in AWS Step Functions and how they can be combined to improve the reliability of an application. Remember to also consider what kind of ‘sanity checks’ you can run on your app, to catch issues when all else fails.

Credits

Lucas Silva Schanner