Generic Asynchronous Retry Architecture

Murphy’s law says:

Whatever can go wrong, will go wrong.

It’s so true in software engineering.

To build resilient software applications, we should consider all error scenarios when architecting the integration points with downstream services. Robust error handling is essential, and retrying remote API calls is an important part of it.

A retry can be done either synchronously or asynchronously. If the clients require the execution status in the response, not just an acknowledgement that the request was received, it is appropriate to implement synchronous retries with limits on the total number of retries and the total time spent. On the other hand, if the clients don’t care about the actual execution status, or have ways to receive responses asynchronously, it is almost always a good idea to adopt an asynchronous retry architecture. Of course, before handing a request to the asynchronous retry process, we can always try synchronous retries first whenever it makes sense.
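As an illustration of the synchronous case, here is a minimal sketch of a retry helper bounded by both attempt count and total time. The function name `retry_sync`, its parameters, and the exponential backoff policy are assumptions made for this example, not something prescribed by the architecture.

```python
import time

def retry_sync(call, max_attempts=3, max_total_seconds=5.0, base_delay=0.1):
    """Call `call()` until it succeeds, attempts run out, or time runs out.

    Returns the result of the first successful call; re-raises the last
    exception once either limit is hit.
    """
    deadline = time.monotonic() + max_total_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:  # a real system would catch only retryable errors
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            out_of_attempts = attempt == max_attempts
            out_of_time = time.monotonic() + delay > deadline
            if out_of_attempts or out_of_time:
                raise  # the caller can now hand the request to async retry
            time.sleep(delay)
```

When the helper finally raises, the calling service is at the point where it can fall back to the asynchronous retry process.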

In this article, we will focus on the asynchronous retry architecture.

2. Queuing for Asynchronous Retry Architecture

A queuing mechanism is at the center of the Asynchronous Retry Architecture.

The originating service constructs a Retry Message that includes the original request info, the destination URL and other metadata, and puts the Retry Message into an Async Retry Queue in the chosen queuing system. A trigger can be configured on the queue to invoke a processor. Alternatively, an Async Retry Processor can poll the queuing system for new messages. The Async Retry Processor then uses the message received from the Async Retry Queue to make another call to the destination downstream service.

A Dead Letter Queue is used to hold Retry Messages for a certain period of time after a (configurable) maximum number of retries has been reached.
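The originating-service side of this flow can be sketched as follows. This is only an illustration: an in-memory `queue.Queue` stands in for the real queuing system, `send` is a hypothetical callable that raises on failure, and the message keys other than `receivedCount` are assumed names.

```python
import json
import queue

def call_or_enqueue(send, request_info, destination_url, retry_queue):
    """Try the downstream call once; on failure, queue a Retry Message.

    `send(destination_url, request_info)` is a hypothetical callable that
    raises on failure. Returns True if the call succeeded directly.
    """
    try:
        send(destination_url, request_info)
        return True
    except Exception:  # a real system would check that the error is retryable
        retry_message = {
            "request": request_info,            # original request info
            "destinationUrl": destination_url,  # where to retry
            "receivedCount": 0,                 # no retry attempted yet
        }
        retry_queue.put(json.dumps(retry_message))
        return False
```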

The figure below shows the high-level workflow and message flow:

3. Asynchronous Retry Architecture Diagram

[Figure: Asynchronous Retry Architecture]

In the above diagram, Service A is the calling service and Service B is the destination downstream service. If the initial call in Step 1 fails, Service A will put a Retry Message into the Async Retry Queue.

Depending on which queuing system is chosen, either a trigger can be configured on the Async Retry Queue to invoke the Processor (3.1), or a Processor can be configured to poll the Async Retry Queue (3.2). If AWS SQS is chosen as the queuing system, the queue can be configured as an event source for a Lambda function, so that the function runs as (or invokes) the processor when new messages arrive.
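For the SQS and Lambda case, a handler might look roughly like the sketch below. The `Records`, `body`, `messageId` and `ApproximateReceiveCount` fields are part of the SQS event that Lambda delivers; `process_retry_message` is a hypothetical stand-in for the actual retry call, and returning `batchItemFailures` only has an effect if partial batch responses are enabled on the event source mapping.

```python
import json

def process_retry_message(message, receive_count):
    """Hypothetical stand-in for the Async Retry Processor's retry call."""
    if message.get("destinationUrl") is None:
        raise ValueError("retry message has no destination URL")

def lambda_handler(event, context):
    """Entry point that Lambda invokes with a batch of SQS records."""
    failures = []
    for record in event["Records"]:
        try:
            message = json.loads(record["body"])
            # SQS tracks how many times this message has been received.
            receive_count = int(record["attributes"]["ApproximateReceiveCount"])
            process_retry_message(message, receive_count)
        except Exception:
            # Report only this record as failed so it alone is redelivered.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

With this shape, messages that keep failing stay in the queue until its own redrive settings move them to the dead letter queue.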

Once the Async Retry Processor receives the Retry Message, it can use the request info in the message to reconstruct the request, and send the request to the destination URL that is also included in the retry message.
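Reconstructing the request could be as simple as the following sketch, which uses the standard library's `urllib.request.Request`. The message layout (`request`, `destinationUrl`, `method`, `headers`, `body`) is assumed for illustration only.

```python
import json
import urllib.request

def build_request(retry_message):
    """Rebuild an HTTP request object from a Retry Message dict."""
    info = retry_message["request"]
    body = info.get("body")
    # Serialize the body as JSON bytes; leave it out entirely if absent.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=retry_message["destinationUrl"],
        data=data,
        headers=info.get("headers", {}),
        method=info.get("method", "POST"),
    )
```

The resulting object can then be passed to `urllib.request.urlopen` (ideally with a timeout) to make the actual retry call.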

A Retry Message is moved to the Dead Letter Queue once the maximum number of retry attempts has been reached, as detected by either the Async Retry Processor or the Async Retry Queue.
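The case where the processor itself detects the limit can be sketched like this. In-memory queues stand in for the real queuing system, and `send`, `max_retries` and the returned status strings are assumptions for the example.

```python
import queue

def process_next(retry_queue, dead_letter_queue, send, max_retries=3):
    """Process one Retry Message; return 'sent', 'requeued', or 'dead'."""
    message = retry_queue.get_nowait()
    message["receivedCount"] += 1            # count this retry attempt
    try:
        send(message["destinationUrl"], message["request"])
        return "sent"
    except Exception:
        if message["receivedCount"] >= max_retries:
            dead_letter_queue.put(message)   # give up; keep for inspection
            return "dead"
        retry_queue.put(message)             # leave it for a later attempt
        return "requeued"
```

A production processor would normally also delay the requeued message (for example with a visibility timeout or delay setting) rather than retrying immediately.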

4. Generic Data Model for Retry Message

A Retry Message can have a generic data model as below:

    "receivedCount": "$number",

With this generic data model for the Retry Message, a single Async Retry Processor can be designed to process any Retry Message constructed by any originating service (Service A) for any destination service (Service B).
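One possible shape for such a message, expressed as a Python dataclass, is sketched below. Only `receivedCount` comes from the data model above; every other field name (`destinationUrl`, `request`, `method`, `headers`, `body`) is a hypothetical choice for illustration.

```python
import json
from dataclasses import dataclass, field

@dataclass
class RetryMessage:
    destination_url: str                    # where the processor retries
    method: str = "POST"                    # hypothetical request fields
    headers: dict = field(default_factory=dict)
    body: str = ""
    received_count: int = 0                 # "receivedCount" on the wire

    def to_json(self):
        """Serialize to the JSON form placed on the Async Retry Queue."""
        return json.dumps({
            "destinationUrl": self.destination_url,
            "request": {"method": self.method, "headers": self.headers, "body": self.body},
            "receivedCount": self.received_count,
        })

    @classmethod
    def from_json(cls, text):
        """Parse a message previously produced by to_json()."""
        raw = json.loads(text)
        request = raw.get("request", {})
        return cls(
            destination_url=raw["destinationUrl"],
            method=request.get("method", "POST"),
            headers=request.get("headers", {}),
            body=request.get("body", ""),
            received_count=int(raw.get("receivedCount", 0)),
        )
```

Because nothing in the message is specific to one service pair, the same processor code can handle messages from any originating service.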

5. Retryable Errors

Only non-functional errors are retryable. Below are some examples:

a. No response at all;

b. Temporary network or server-side issues, usually surfacing as 5xx HTTP status errors;

c. Request timeout: HTTP status 408 errors;

d. Conflict: HTTP status 409 errors;

e. Too many requests: HTTP status 429 errors;

f. Any response from the downstream service that carries a Retry-After header;

g. Unauthorized: HTTP status 401 errors with an expired-token error code or message. This kind of error usually requires a new token, so the Async Retry Processor is responsible for obtaining a valid token before retrying.
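The categories above can be turned into a small classification function. This is a sketch only: the function name, its parameters, and the `token_expired` flag (which the caller would derive from the downstream error code or message) are assumptions.

```python
RETRYABLE_STATUSES = {408, 409, 429}  # timeout, conflict, too many requests

def is_retryable(status=None, headers=None, token_expired=False):
    """Return True if a failed call falls into one of the listed categories.

    `status` is None when no response was received at all.
    """
    if status is None:
        return True                      # no response at all
    if 500 <= status <= 599:
        return True                      # temporary server or network issue
    if status in RETRYABLE_STATUSES:
        return True
    if headers and any(name.lower() == "retry-after" for name in headers):
        return True                      # downstream asked us to retry later
    if status == 401 and token_expired:
        return True                      # retry after obtaining a new token
    return False
```

For example, `is_retryable(503)` and `is_retryable(401, token_expired=True)` are retryable under these rules, while a plain `is_retryable(400)` is not.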

6. Conclusion

The Asynchronous Retry Architecture can be used to handle all retryable errors when the client does not expect the execution result in the response to the call. It is especially useful when a call may need to be retried many times over a long period of time.

The maximum number of retry attempts, the async retry queue name/URL, and the dead letter queue name/URL can all be made configurable. These configurable values make the architecture flexible enough for many different applications.
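As one way to keep these values configurable, they could be read from environment variables at startup. The variable names and the default of 5 attempts below are assumptions chosen for this sketch.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryConfig:
    max_retry_attempts: int
    retry_queue_url: str
    dead_letter_queue_url: str

def load_config(env=os.environ):
    """Read retry settings from an environment-style mapping."""
    return RetryConfig(
        # Hypothetical variable names; the default of 5 is arbitrary.
        max_retry_attempts=int(env.get("MAX_RETRY_ATTEMPTS", "5")),
        retry_queue_url=env["ASYNC_RETRY_QUEUE_URL"],
        dead_letter_queue_url=env["DEAD_LETTER_QUEUE_URL"],
    )
```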