On October 20, 2025, the internet came to a halt.
Amazon Web Services’ (“AWS”) most popular cloud region, us-east-1, effectively went offline and took many businesses’ applications down with it. The outage, which unfolded in two phases, affected consumers worldwide and made major headlines.
But perhaps even more significant than the number of household names that made the news was the number of names that didn’t. Applications that had a multi-region high availability/disaster recovery (HA/DR) strategy continued running during the outage, even if they had a dependency on the affected cloud region.
How to survive an outage
At Temporal, we’ve built a platform for “Durable Execution”: running workflows while handling the inevitable errors that hit production software at scale. A Durable Execution platform must operate under all failure scenarios, including a cloud region failure. That’s why Temporal Cloud offers Multi-region and Multi-cloud Replication.
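For readers new to Temporal, here is a minimal sketch of what durable execution looks like in the Go SDK. All names here are hypothetical, and the example is illustrative rather than production code: workflow progress is persisted by the Temporal service, and failures in side-effecting calls are handled with declared timeouts and retries instead of crashing the application.

```go
package app

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// ChargeCustomer is a hypothetical Activity: a side-effecting call
// (payment API, database write) that can fail transiently.
func ChargeCustomer(orderID string) error {
	return nil // a real implementation would call an external service
}

// OrderWorkflow is durable: Temporal persists its progress, so a Worker
// crash or redeploy resumes the Workflow where it left off instead of
// losing in-flight state.
func OrderWorkflow(ctx workflow.Context, orderID string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    10, // transient failures retry instead of failing the Workflow
		},
	})
	return workflow.ExecuteActivity(ctx, ChargeCustomer, orderID).Get(ctx, nil)
}
```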
These high availability features make it simple for Temporal Cloud customers to execute a multi-region HA/DR strategy. Once replication is enabled on a Temporal “Namespace,” Temporal Cloud syncs its Workflows to a different cloud region. Then, if the primary cloud region becomes unstable, as us-east-1 did on October 20th, the user can trigger a failover on the Namespace; Temporal will also trigger a failover automatically if it detects a region outage.
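A failover is a control-plane action: it can be triggered from the Temporal Cloud UI or from Temporal’s tcld CLI, which ships a namespace failover command (check your version’s exact flags). Application code typically does not change at all, because clients address the Namespace rather than a region. Here is a hedged sketch using the Go SDK, where the Namespace name, endpoint, and certificate paths are placeholders:

```go
package main

import (
	"context"
	"crypto/tls"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Hypothetical Namespace "payments.a1b2c" secured with mTLS client certs.
	cert, err := tls.LoadX509KeyPair("client.pem", "client.key")
	if err != nil {
		log.Fatal(err)
	}

	// The endpoint names the Namespace, not a cloud region; Temporal Cloud
	// routes the connection to whichever region is currently active, so this
	// code is unchanged before, during, and after a failover.
	c, err := client.Dial(client.Options{
		HostPort:  "payments.a1b2c.tmprl.cloud:7233",
		Namespace: "payments.a1b2c",
		ConnectionOptions: client.ConnectionOptions{
			TLS: &tls.Config{Certificates: []tls.Certificate{cert}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Start a Workflow as usual; replication syncs it to the replica region.
	_, err = c.ExecuteWorkflow(context.Background(),
		client.StartWorkflowOptions{ID: "order-42", TaskQueue: "orders"},
		"OrderWorkflow", "42")
	if err != nil {
		log.Fatal(err)
	}
}
```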
One such application that benefited from these features was FireHydrant, an “All-in-One Alerting, On-Call, and Incident Management” platform that depends on Temporal Cloud. FireHydrant incorporated Temporal’s multi-region product into their HA/DR strategy last year. They enabled Multi-region Replication on the Namespaces that serve business-critical workflows (replication must be enabled ahead of time, not mid-outage). They provisioned multi-region deployments for the rest of their infrastructure, including Temporal Workers and other data systems. Finally, they tested the whole process with a “game day,” proving they could move their architecture to a different region.
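What might “provisioning multi-region Workers” look like? One common shape, sketched here under the same hypothetical names as above rather than as FireHydrant’s actual setup, is to deploy the exact same Worker binary in both regions. Workers in both regions poll the same Namespace endpoint, so the fleet in the replica region keeps processing tasks after a failover; you then scale it up to absorb the full load.

```go
package main

import (
	"crypto/tls"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	cert, err := tls.LoadX509KeyPair("client.pem", "client.key")
	if err != nil {
		log.Fatal(err)
	}
	c, err := client.Dial(client.Options{
		HostPort:  "payments.a1b2c.tmprl.cloud:7233",
		Namespace: "payments.a1b2c",
		ConnectionOptions: client.ConnectionOptions{
			TLS: &tls.Config{Certificates: []tls.Certificate{cert}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// This same binary runs in us-east-1 AND in the replica region. Both
	// deployments poll the Namespace endpoint, so the replica-region fleet
	// continues to receive tasks after a failover.
	w := worker.New(c, "orders", worker.Options{})
	w.RegisterWorkflow(OrderWorkflow) // OrderWorkflow from the earlier sketch
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatal(err)
	}
}
```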
The October 20 us-east-1 outage (Temporal’s version)
At 12:11 AM (PDT), AWS announced an increase in error rates in us-east-1, and soon the DynamoDB service went offline. Temporal had detected the incident 12 minutes before AWS reported it. The on-call Temporal Cloud engineers quickly confirmed that customer workflows continued to execute normally.
Unfortunately, this issue blocked the creation of new Temporal Namespaces in us-east-1. Since it did not affect customer Workflows, Temporal did not fail over Namespaces in us-east-1. However, some Temporal customers, concerned about the events in the region, proactively failed over their us-east-1 Namespaces that had Multi-region Replication. This turned out to be a wise decision, as more regional issues were coming.
AWS restored DynamoDB after 3 hours, but lingering effects in us-east-1 caused a cascading series of failures in hundreds of critical AWS services. This affected virtually every service that runs in the region, including Temporal Cloud.
At 8 AM (PDT), Temporal detected network instability that prevented a small percentage of requests from reaching Temporal Namespaces in us-east-1. Temporal’s on-call engineers monitored the situation; because the network error rate was small, they left Multi-region and Multi-cloud replicated Namespaces active in us-east-1, as did Temporal’s auto-failover detection rules.
Around this time, FireHydrant detected errors from various customer-facing vendors in their stack, affecting end users. At 9:34 AM (PDT), they triggered a failover of their Temporal Namespaces. The multi-region Temporal architecture performed exactly as designed, keeping workflows uninterrupted throughout the entire event.
“Thanks so much for being so damn rock solid with Temporal Cloud [during the AWS outage]. Our on-call didn’t flinch when us-east-1 went down, and being able to trigger a failover to us-east-2 was 🤌” — Robert Ross, CEO of FireHydrant
Over the next hour, the network error rate grew rapidly on one Temporal cell in us-east-1. Temporal’s on-call engineers decided that this warranted a failover. To validate that moving to another region would resolve the observed issues, they first triggered failovers on Temporal’s internal Namespaces in us-east-1. While these failovers progressed, Temporal’s automation requested approval to fail over all replicated Namespaces in us-east-1; the on-call team chose to wait for validation from the internal failovers first.
Eight minutes after triggering failover on internal Namespaces, the on-call engineers validated that these workflows were running as expected in their failover regions. They began Temporal’s proactive auto-failover for customer Namespaces in us-east-1 that had Multi-region or Multi-cloud Replication.
As failovers progressed, the network error rate in the degraded cell grew, eventually hitting 100%: a full network partition. Most customer Namespaces completed their failovers without issue and resumed processing customer workflows in other regions, but a few Namespaces lingered in the unhealthy cell. These stalled because, as we later discovered, the Temporal Workflow that triggers auto-failovers had a previously undetected dependency on the source region, us-east-1. When the unhealthy cell became network-partitioned, calls into that region blocked the Workflow from triggering further auto-failovers, prolonging the recovery time for the affected Namespaces by several minutes. The on-call engineers detected this and used internal tooling to trigger failover on the remaining Namespaces.
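The lesson travels well beyond Temporal’s internals. The following is emphatically not Temporal’s internal auto-failover code, just a generic sketch of the pattern with hypothetical Activity names: when an orchestrating Workflow must call into a region that may be partitioned, bound the call with tight timeouts and a capped retry policy so the orchestration can give up on that call and proceed instead of blocking indefinitely.

```go
package failover

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// FailoverOrchestration is a hypothetical sketch, not Temporal's code.
// Calls into a possibly-partitioned region are bounded, so a partition
// cannot stall the orchestration indefinitely.
func FailoverOrchestration(ctx workflow.Context, namespace string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second, // fail fast if the region stops answering
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval: time.Second,
			MaximumAttempts: 3, // bounded: a partition cannot block us forever
		},
	})

	// "DrainSourceRegion" is a hypothetical Activity that must talk to the
	// (possibly partitioned) source region; its failure is tolerated.
	if err := workflow.ExecuteActivity(ctx, "DrainSourceRegion", namespace).Get(ctx, nil); err != nil {
		workflow.GetLogger(ctx).Warn("source region unreachable; proceeding", "error", err)
	}

	// "PromoteReplicaRegion" is a hypothetical Activity that touches only the
	// healthy replica region, so it can succeed even during the partition.
	return workflow.ExecuteActivity(ctx, "PromoteReplicaRegion", namespace).Get(ctx, nil)
}
```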
In the end, all customer Namespaces with Multi-region or Multi-cloud Replication evacuated us-east-1. After the Namespace failover, Temporal customers like FireHydrant only had to scale up their Temporal Workers in the replica region to keep their Workflows running. According to the Temporal status page, the incident lasted three hours.
Over the following days, most customer Namespaces returned to us-east-1 with a seamless “failback.” Customers moved their Temporal Workers back to us-east-1 at their own convenience, independently of the Namespace, since Temporal Cloud forwards requests between the active and replica regions. Some customers chose to keep their Namespaces active in other regions, a practice known as “fail-forward” or “fail-and-stay.” No matter which region each Namespace ultimately settled in, Temporal Cloud kept the active and replica regions in sync.
Looking ahead
At Temporal, the October 20th outage strengthened our confidence in our multi-region product and revealed several ways to harden it. While the vast majority of customer Namespaces executed a failover without issue, the auto-failover call that blocked on the source region prolonged the recovery time for a handful of customers.
We will fix this urgently and test the fix during our internal “game days” that simulate real-world disruptions, so that all users of Temporal Cloud replication can beat their Recovery Time Objectives (RTOs) in the next outage. We will also make our automation detect this type of network partition more quickly and trigger auto-failover earlier. For Namespaces with Multi-region and Multi-cloud Replication, Temporal stands by its sub-20-minute RTO for cloud region outages.
You can get started on your own multi-region journey by signing up for Temporal Cloud, if you haven’t already, or by reaching out to your Temporal account manager. To make your Namespace multi-region or multi-cloud, simply add a Replica when creating the Namespace or later from its settings page, and choose the replica’s region or cloud provider.
We get it
If you were on call during the us-east-1 outage, we understand the adrenaline, the confusion, and the stressful triage you went through. We built Temporal Cloud for these moments because it’s a product built for developers, by developers.
To us, “durable” is the dividing line between code that crumbles when shipped and software that scales in real-world environments.
This is why we obsess over durability and recovery. We know that behind every workflow is you: a developer who deserves to know that your system won’t fold under pressure.
Don’t take our word for it. Start building with Temporal Cloud to find out what it can do for your workflows and applications.