Understanding SQS at-least-once delivery
SQS guarantees at-least-once delivery, which means a message might be delivered multiple times. To handle this, we typically use DynamoDB to track message status. Before processing, we check if the message is already being handled. If it's marked as "in progress" or "completed," we skip it to avoid duplicate processing.
The status flow looks like this: a message starts as "pending," transitions to "in progress" when processing begins, and finally becomes "completed" or "failed" depending on the outcome. This state machine works well - until Lambda times out.
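In practice, the check and the "in progress" update work best as a single conditional write, so two concurrent deliveries can't both claim the same message. Here's a minimal sketch of that idea; the table name, key schema, and helper are illustrative, not a prescribed design:

```python
import boto3
from botocore.exceptions import ClientError

# Illustrative table: partition key "message_id", plus a "status" attribute.
table = boto3.resource("dynamodb").Table("message-status")


def try_claim(message_id: str) -> bool:
    """Atomically mark a message "in progress"; False means skip it."""
    try:
        table.put_item(
            Item={"message_id": message_id, "status": "in progress"},
            # Only claim messages we haven't seen yet, or that previously failed.
            ConditionExpression=(
                "attribute_not_exists(message_id) OR #s IN (:pending, :failed)"
            ),
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":pending": "pending", ":failed": "failed"},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already "in progress" or "completed"
        raise
```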
Impact of Lambda timeouts
AWS Lambda functions have a maximum execution timeout that you configure. When this timeout is reached, AWS doesn't send a notification or raise an exception you can catch. It simply kills the process immediately. Your code stops mid-execution, wherever it happens to be. The impact of this can be illustrated by the following scenario:
- Lambda receives a batch of messages from SQS. It processes the first message, marking it as "in progress" in DynamoDB.
- The processing takes longer than expected; maybe an external API is slow, or there's more data than usual.
- Lambda reaches its timeout limit and AWS terminates the function.
- The message status in DynamoDB is still "in progress." It never got marked as completed or failed.
- Because the message was never marked "failed," when SQS delivers it again your code checks DynamoDB, sees "in progress," and skips it to avoid duplicate processing.
- The message is now stuck. It won't be processed, and it won't be reported as a batch failure, so it never reaches the DLQ.
 
The root cause
The fundamental issue is that you have no way to run cleanup code when Lambda times out. Unlike a normal exception that you can catch and handle, the timeout termination happens at the OS level. Your database transaction isn't committed (or rolled back cleanly), your status update doesn't happen, and you can't tell SQS which messages failed.
This isn't just a theoretical problem. In production systems processing thousands of messages per hour, these stuck messages accumulate. You might not notice immediately (most messages process fine), but over time you'll have a growing number of messages stuck in an "in progress" state with no way forward.
Proactive timeout handling
You can raise an exception before AWS terminates your Lambda function. With a few seconds' warning, you can clean up your state, mark messages as failed, and tell SQS which items need retry.
Unix signals make this possible. Using Python's signal module, we can set an alarm that fires before the Lambda timeout. When the alarm goes off, it raises a TimeoutError that we can catch and handle like any other exception.
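A minimal sketch of such a handler might look like the following; the class and parameter names are illustrative, and setitimer is used rather than signal.alarm so the deadline can be a fraction of a second:

```python
import signal


class TimeoutHandler:
    """Raises TimeoutError shortly before AWS would kill the invocation."""

    def __init__(self, context, buffer_seconds: float = 5.0):
        self._context = context
        self._buffer_seconds = buffer_seconds
        # SIGALRM handlers must be installed from the main thread.
        signal.signal(signal.SIGALRM, self._on_alarm)

    def _on_alarm(self, signum, frame):
        raise TimeoutError("Lambda is about to hit its timeout")

    def start(self) -> None:
        # Time left in this invocation, minus the safety buffer.
        remaining = self._context.get_remaining_time_in_millis() / 1000.0
        deadline = max(remaining - self._buffer_seconds, 0.1)
        # setitimer accepts fractional seconds, unlike signal.alarm().
        signal.setitimer(signal.ITIMER_REAL, deadline)

    def stop(self) -> None:
        # Cancel any pending alarm, e.g. after a message finishes cleanly.
        signal.setitimer(signal.ITIMER_REAL, 0)
```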
The timeout handler calculates how much time remains in the Lambda execution (via context.get_remaining_time_in_millis()), subtracts a safety buffer, and sets a Unix alarm. When that alarm fires, it interrupts execution and raises an exception. The buffer gives you time to handle the exception, update DynamoDB, and return a proper response to SQS.
In your Lambda handler, you set the timeout at the start of processing each message. If processing completes successfully, you clear the alarm. If a timeout occurs, your exception handler catches it, marks the message as failed in DynamoDB, and adds it to the batch item failures list for SQS.
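A sketch of what that handler could look like, reusing the TimeoutHandler above; the status helpers, process_message, and the choice to fail the rest of the batch after a timeout are illustrative rather than prescriptive:

```python
import json

import boto3

# Illustrative helpers; swap in your own persistence and business logic.
status_table = boto3.resource("dynamodb").Table("message-status")


def set_status(message_id: str, status: str) -> None:
    status_table.put_item(Item={"message_id": message_id, "status": status})


def process_message(body: dict) -> None:
    ...  # your actual processing


def handler(event, context):
    timeout = TimeoutHandler(context, buffer_seconds=5.0)
    batch_item_failures = []
    records = event["Records"]

    for i, record in enumerate(records):
        message_id = record["messageId"]
        try:
            # Re-arm the alarm for every message: the remaining invocation
            # time shrinks as the batch progresses.
            timeout.start()

            set_status(message_id, "in progress")  # or the conditional claim shown earlier
            process_message(json.loads(record["body"]))
            set_status(message_id, "completed")

            # Success path: clear the alarm before the next message.
            timeout.stop()
        except TimeoutError:
            # We got here before AWS killed the process, so we can clean up.
            set_status(message_id, "failed")
            batch_item_failures.append({"itemIdentifier": message_id})
            # No time left for real work: report the unprocessed messages as
            # failures too so SQS redelivers them, then return while we can.
            batch_item_failures.extend(
                {"itemIdentifier": r["messageId"]} for r in records[i + 1:]
            )
            break
        except Exception:
            # Any other error: clear the alarm and let SQS retry this message.
            timeout.stop()
            set_status(message_id, "failed")
            batch_item_failures.append({"itemIdentifier": message_id})

    # Partial batch response (requires ReportBatchItemFailures on the
    # event source mapping): SQS retries only the listed messages.
    return {"batchItemFailures": batch_item_failures}
```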
Now when Lambda is about to timeout, your code has control. The message gets marked as "failed" in DynamoDB, and SQS is explicitly told to retry it.
This pattern transforms Lambda timeouts from a source of data inconsistency into just another type of retryable error. Your system becomes more resilient because every execution path—success, failure, or timeout—updates DynamoDB correctly and communicates properly with SQS.
The observability improves too. Instead of stuck messages that you have to debug later, you get logs showing which messages timed out and why. You can set up alarms on timeout frequency to catch performance degradation early.
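One way to drive such an alarm is a custom CloudWatch metric published from the timeout path; a sketch, with a made-up namespace and metric name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def record_timeout(queue_name: str) -> None:
    # Emit one data point per timeout; an alarm on the metric's Sum then
    # flags rising timeout rates before stuck work piles up.
    cloudwatch.put_metric_data(
        Namespace="MessagePipeline",  # made-up namespace
        MetricData=[{
            "MetricName": "ProcessingTimeouts",
            "Dimensions": [{"Name": "Queue", "Value": queue_name}],
            "Value": 1.0,
            "Unit": "Count",
        }],
    )
```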
For systems processing millions of messages, this matters. A timeout rate of 0.1% means thousands of stuck messages every day. With proactive timeout handling, those become failures that retry automatically.
Implementation details
The timeout buffer defaults to 5 seconds, which works well for most use cases. This gives you enough time to update DynamoDB, construct your SQS response, and return from the handler. If your cleanup logic is particularly complex (perhaps you're closing multiple database connections or making API calls), you can increase the buffer.
For batch processing, it's important to reset the timeout for each message. Lambda's remaining time decreases as you process the batch, so you need to recalculate before each message. This ensures earlier messages in the batch don't unfairly consume time that later messages need.
Clear the alarm on both success and error paths. If you don't, the alarm might fire while processing the next message, causing a spurious timeout. Set the alarm at the start of the try block and clear it in both the success path and exception handler.
Limitations and trade-offs
This approach relies on Unix signals, specifically `SIGALRM`, which are available in Lambda's Linux execution environment. Python delivers signals only to the main thread, so the technique is simple and reliable for single-threaded code, which describes most Lambda functions.
The main trade-off is that you're reducing your effective processing time by the buffer duration. A 5-second buffer on a 60-second timeout means you really have 55 seconds to work with. This is usually acceptable, but if you're already pushing against timeout limits, you might need to either increase your Lambda timeout or optimize your processing code.
Conclusion
Serverless architectures introduce new failure modes. Lambda timeouts combined with SQS's at-least-once delivery and DynamoDB status tracking can result in stuck messages. Proactive timeout handling ensures that every message, whether it succeeds, fails, or times out, leaves your system in a consistent state.
The implementation is straightforward: a small timeout handler class and careful exception handling in your Lambda function. You get a more reliable system with better observability. When messages fail, they fail cleanly and retry automatically. Your DynamoDB status stays accurate, and you don't need manual cleanup scripts to fix stuck messages.
