Let’s face it: distributed systems are tough. Just when you think you’ve got everything working perfectly, bam: a network interruption or a mysteriously slow service reminds you that failures, while never a desirable feature of your application, are an unavoidable characteristic of the distributed computing landscape you build on.
The good news is that with the right patterns and mindset, you can build systems that don’t just survive these failures but thrive despite them.
Why distributed systems are different (and why you should care)
Moving from monolithic to distributed architectures fundamentally changes how we think about errors. Here’s what makes them special (and occasionally maddening):
Partial failures: Your new normal
In a monolith, when something breaks, everything breaks. Simple, if painful. In distributed systems? Service A might be having a great day while Service B is on fire. Your e-commerce site could be processing orders perfectly while the recommendation engine is temporarily unavailable.
This partial failure mode creates a fascinating challenge. Instead of dealing with binary states, you’re managing a spectrum of degradation. Maybe your system is 87% healthy — what does that even mean for your users? Can they still check out? Will they see personalized recommendations or generic ones?
The mantra here is “design for failure.” You don’t merely want to handle errors; you want to expect them and prepare for them. Build your architecture to isolate failures, prevent cascades, and enable autonomous recovery wherever possible. When you assume things will break, you build systems that gracefully degrade rather than dramatically collapse.
The network: Your unreliable friend
Here’s a truth that seasoned distributed systems engineers know in their bones: the network will betray you.
Networks introduce latency that varies wildly based on everything from physical distance to how many people are streaming the latest Netflix series. They drop packets like a clumsy waiter drops plates. And sometimes, they partition, creating isolated islands where parts of your system can’t talk to each other. When a service doesn’t respond, you’re stuck with a mystery.
- Did your request never arrive?
- Did the service receive it but crash?
- Did it process successfully but the response got lost?
- Is it just slow today?
This ambiguity is why patterns like retries (with idempotency!) and well-tuned timeouts are essential. Such uncertainty has led to approaches like Durable Execution, where the platform tracks every step of your workflow execution, eliminating the guesswork about what succeeded and what failed.
Asynchronous chaos
Async communication is great for decoupling and scaling. It lets your services work independently, processing messages at their own pace without blocking each other. But it turns error handling into a puzzle.
When an error occurs three services deep in an async flow, tracing it back to the source is like following breadcrumbs through a hurricane. By the time Service C fails to process a message, Service A has long since moved on, not realizing that its original request has gone sideways. While you can still get stack traces from individual service logs, the problem is correlating them across the correct host and transaction for that particular “flow.” You need new tools and patterns to maintain visibility.

On top of that, you have to know which delivery guarantee your messaging layer actually gives you:
- At-most-once: A message will be delivered once or not at all. This is fast but risks data loss.
- At-least-once: A message will be delivered one or more times. This prevents data loss, but your system must be able to handle duplicate messages.
- Exactly-once: A message will be delivered exactly one time. This is the ideal state, but it's arguably impossible to guarantee in any distributed system framework due to physics (e.g., a crash occurring at the exact moment a response is being sent). The goal is to get as close as possible through careful, stateful coordination.
This complexity multiplies when you have long-running processes that span hours or days (and sometimes weeks or more). A payment reconciliation that runs overnight, a document processing pipeline that waits for human approval, or a subscription renewal that triggers monthly — these patterns demand that level of stateful coordination. This is where Temporal’s Durable Execution shines. It can provide an effectively exactly-once guarantee for an entire business process. It achieves this by ensuring your Workflow logic is executed to completion once, even if it requires multiple replays behind the scenes. Its Activities (the functions that do the work) are executed at least once, and you use idempotency patterns to ensure there are no unintended side effects from retries.
A key advantage of Temporal is that it allows you to write code that looks and feels like simple, synchronous, local code, while the platform executes it as durable, asynchronous code. This gives you the benefits of scalable, event-driven systems without you having to build and manage the underlying complexity.
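To make that concrete, here is a minimal sketch of a human-approval step written with Temporal’s TypeScript SDK. The Workflow, signal name, and one-week timeout are illustrative, not part of any real system:

import { condition, defineSignal, setHandler } from '@temporalio/workflow';

export const approveSignal = defineSignal('approve');

export async function documentApproval(documentId: string): Promise<string> {
  let approved = false;
  setHandler(approveSignal, () => {
    approved = true;
  });

  // The Workflow can safely wait for days: its state is persisted by the
  // platform and survives worker restarts and deployments.
  const gotApproval = await condition(() => approved, '7 days');
  return gotApproval
    ? `${documentId}: approved`
    : `${documentId}: approval timed out`;
}

The code reads like a simple blocking wait, but the platform is doing the durable, asynchronous work underneath.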
Data (in)consistency
In distributed systems, you’re constantly trading off between consistency, availability, and performance. Strong consistency simplifies your life but can tank availability during network partitions. Eventual consistency keeps things running but forces you to handle stale or conflicting data.
When operations span multiple services and traditional distributed transactions aren’t feasible, the saga pattern becomes your friend, using compensating transactions to handle failures. Imagine an account opening process: user created, address added, client profile... failed. Now what?
You can’t leave the user with a partially created account. The saga lets you roll back the journey, deleting the address and user to restore logical consistency. It’s not as clean as an ACID transaction, but it’s often the best you can do in a distributed world.
Here’s how a saga might look in a Temporal Workflow:
import { log, proxyActivities } from '@temporalio/workflow';
// Assumed Activities module exposing createAccount, addAddress, addClient,
// clearPostalAddresses, and removeClient; adjust the import path and timeout
// to match your project.
import type * as activities from './activities';

const { createAccount, addAddress, addClient, clearPostalAddresses, removeClient } =
  proxyActivities<typeof activities>({ startToCloseTimeout: '30 seconds' });

// Illustrative types and helper for this example.
interface OpenAccount {
  accountId: string;
  address: string;
  clientEmail: string;
}

interface Compensation {
  message: string;
  fn: () => Promise<void>;
}

// Simple pass-through here; a real helper might add formatting or error details.
const prettyErrorMessage = (message: string) => message;

// Workflow implementation as a sequential execution with saga compensations.
export async function openAccount(params: OpenAccount): Promise<void> {
// This array holds the compensating transactions for our saga.
const compensations: Compensation[] = [];
try {
// Step 1: Create the account.
// This is a critical first step. If it fails, the saga doesn't even start.
await createAccount({ accountId: params.accountId });
} catch (err) {
log.error('Fatal error: creating account failed. Stopping workflow.', { err });
// No compensations are needed because nothing has been done yet.
throw err;
}
try {
// The saga begins now. From this point on, any failure will trigger a rollback.
// Step 2: Add the user's address.
await addAddress({
accountId: params.accountId,
address: params.address,
});
// If successful, add its compensating transaction to the stack.
compensations.unshift({
message: prettyErrorMessage('reversing add address'),
fn: () => clearPostalAddresses({ accountId: params.accountId }),
});
// Step 3: Add the user as a client.
await addClient({
accountId: params.accountId,
clientEmail: params.clientEmail,
});
// If successful, add its compensating transaction to the stack.
compensations.unshift({
message: prettyErrorMessage('reversing add client'),
fn: () => removeClient({ accountId: params.accountId }),
});
} catch (err) {
// A step in the saga has failed. We must now roll back.
log.error('A downstream step failed. Starting saga rollback.', { err });
// Execute all compensating transactions in the reverse order they were added.
for (const compensation of compensations) {
log.info(`Executing compensation: ${compensation.message}`);
try {
await compensation.fn();
} catch (compensationErr) {
log.error('Compensation function failed.', { compensationErr });
// Depending on requirements, you might want to handle this failure,
// e.g., by logging for manual intervention.
}
}
// After rolling back, re-throw the original error to fail the workflow.
throw err;
}
}
The beauty is that the Temporal platform durably executes your Workflow code. This ensures that your catch block and all the compensation logic within it are guaranteed to run, even if your worker process crashes mid-rollback. You write standard-looking try...catch logic, and the platform makes it fault-tolerant.
Essential resilience patterns that actually work
Now for the fun part: the patterns that’ll save your bacon when things go wrong.
Timeouts: Your first line of defense
Without timeouts, a single slow service can bring down your entire system through resource exhaustion. But setting them is an art:
Two types to configure:
- Connection timeout: How long to wait to establish a connection.
- Request timeout: How long to wait for a response after connecting.
Smart timeout strategies:
- Percentile-based: Set timeouts based on your p99 or p99.9 latency. If 99% of requests complete in 100ms, maybe timeout at 150ms.
- Deadline propagation: If your user request has 2 seconds total, and you’ve already used 500ms, tell the next service it only has 1.5 seconds. No point processing requests that’ll timeout anyway.
The Goldilocks problem:
- Too short? You’ll timeout healthy services and trigger unnecessary retries.
- Too long? You’ll exhaust resources waiting for dead services.

Remember: timeouts trigger other patterns. They’re often the canary that tells your circuit breaker to start counting failures (more on that further down!).
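To make this concrete, here is roughly how those knobs look when proxying Activities with Temporal’s TypeScript SDK. The values are illustrative rather than recommendations, and the fetchQuote Activity is hypothetical:

import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities'; // hypothetical Activities module

const { fetchQuote } = proxyActivities<typeof activities>({
  scheduleToStartTimeout: '30 seconds', // how long a task may wait in the queue for a Worker
  startToCloseTimeout: '5 seconds',     // how long a single attempt may run (the request timeout)
  scheduleToCloseTimeout: '1 minute',   // overall deadline across all retries
});

// In the Workflow body: const quote = await fetchQuote(symbol);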
Retries: Second chances for transient failures
Retries help you recover from temporary glitches, but they’re a double-edged sword.
Strategies that work:
Strategy | Description | When to use |
---|---|---|
Immediate retry | Try again instantly | Almost never (seriously) |
Fixed interval | Wait a constant time between retries | When you know recovery time |
Exponential backoff | Double wait time each retry (1s, 2s, 4s…) | Most network calls |
Exponential backoff + jitter | Add randomness to prevent thundering herds | The gold standard |
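To make the gold standard concrete, here is a minimal backoff-with-full-jitter helper. The base delay, cap, and attempt limit are illustrative:

// Full jitter: pick a random delay between 0 and an exponentially growing cap.
function backoffWithJitter(attempt: number, baseMs = 1_000, capMs = 30_000): number {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exponential;
}

async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffWithJitter(attempt)));
    }
  }
}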
The golden rule of retries: Idempotency
Never, ever retry non-idempotent operations. Charging a credit card twice because of a timeout isn’t a “transient error,” it’s a lawsuit waiting to happen.
Know when to quit:
- Don’t retry client errors (400s) — they won’t magically fix themselves.
- Don’t retry an already overloaded service — you’ll make things worse.
- Watch for retry amplification in deep call chains (one failure times three retries per layer equals exponential pain).
Here’s how retry policy configuration looks in practice with Temporal’s Python SDK:
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# ... inside the run method of a workflow ...
activity_result = await workflow.execute_activity(
your_activity,
YourParams(greeting, "Retry Policy options"),
start_to_close_timeout=timedelta(seconds=10),
# Retry Policy
retry_policy=RetryPolicy(
backoff_coefficient=2.0,
maximum_attempts=5,
initial_interval=timedelta(seconds=1),
maximum_interval=timedelta(seconds=2),
# Don't retry on application errors like ValueError
non_retryable_error_types=["ValueError"],
),
)
This declarative approach means you configure once and let the platform handle retry timing and backoff calculations.
Circuit breakers: Stop hitting yourself
Circuit breakers are your system’s self-preservation instinct. When a downstream service is clearly struggling, the circuit breaker says “enough is enough” and stops calling it entirely. This prevents your system from wasting resources on doomed requests and gives the struggling service breathing room to recover.
The pattern takes its name from electrical circuit breakers, and the analogy is apt. Just as an electrical breaker prevents your house from burning down when there’s a short circuit, a software circuit breaker prevents cascading failures from taking down your entire system.
Three states:
- Closed: Everything’s fine, requests flow normally.
- Open: Service is dead to us, fail fast without calling.
- Half-open: Let’s cautiously check if it’s recovered.
Key settings to tune:
- Failure threshold (e.g., 50% errors in 10 seconds).
- Reset timeout (how long to wait before checking again).
- Request volume threshold (don’t trip the breaker on just two failed requests).
Circuit breakers and retries can work together beautifully. Retries handle transient blips, while circuit breakers handle persistent problems. When the circuit is open, smart retry logic knows not to bother — there’s no point retrying when the circuit breaker has already declared the service unavailable.
Some platforms achieve circuit breaker behavior through retry policies. In Temporal, you can cap the maximum retry attempts and mark specific error types as non-retryable so those calls stop immediately, and you can layer your own monitoring and control on top of the metrics the SDKs emit.
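If you do roll your own, here is a minimal sketch of the three-state logic. The thresholds and timings are illustrative, and production implementations typically track error rates over a window rather than consecutive failures:

type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,    // consecutive failures before opening
    private readonly resetTimeoutMs = 30_000, // how long to stay open before probing
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'half-open'; // cautiously allow a probe request through
    }
    try {
      const result = await fn();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}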
Fallbacks: Plan B (and C)
When things fail despite retries and circuit breakers, you need alternatives:
- Cache it: Serve stale data rather than no data.
- Simplify it: Show generic recommendations when personalization fails.
- Disable it: Turn off non-essential features.
- Fail fast: Sometimes a quick, graceful error is better than a long wait.
What’s key here is that fallback logic must be simpler and more reliable than what it’s replacing. A fallback that queries three other services and performs complex calculations defeats the purpose. The best fallbacks often involve static data, simple caches, or straightforward business rules that can execute without external dependencies.
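Here is a small sketch of that idea. The personalization helper is a hypothetical stand-in for a real service call:

type Product = { id: string; name: string };

// Static fallback data: no external dependencies, so it can't fail the way the
// personalization service can.
const GENERIC_RECOMMENDATIONS: Product[] = [
  { id: 'bestseller-1', name: 'Popular item' },
  { id: 'bestseller-2', name: 'Another popular item' },
];

// Stub standing in for a real call to a personalization service.
async function fetchPersonalizedRecommendations(userId: string): Promise<Product[]> {
  throw new Error(`personalization unavailable for ${userId}`);
}

async function getRecommendations(userId: string): Promise<Product[]> {
  try {
    return await fetchPersonalizedRecommendations(userId);
  } catch {
    // Plan B: simpler and more reliable than the thing it replaces.
    return GENERIC_RECOMMENDATIONS;
  }
}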
Dead Letter Queues: Where messages go to think about what they’ve done
In async systems, some messages are destined to fail. Maybe they’re malformed, maybe they’re trying to process an order for a product that no longer exists, or maybe they’re perfectly fine but arrived during a service outage. Dead Letter Queues (DLQs) give these problematic messages a place to go instead of clogging up your main processing pipeline. While it’s easy to think of them like error bins, they’re actually diagnostic goldmines.
Best practices:
- Monitor DLQ depth and arrival rate.
- Analyze patterns (same message type failing? Same error?).
- Implement automated or manual reprocessing.
- Set retention policies — don’t hoard failures forever.
Modern workflow engines like Temporal take a different approach that removes the need for DLQs entirely. Instead of moving a failed task to a separate queue where it loses its context, Temporal keeps the “stuck” Workflow in a failed state, but its full execution history remains visible and queryable.
This is a huge advantage. It means you can handle these “stuck” events with far more powerful patterns:
- Alerting: Set up monitoring to alert you when a Workflow has a high failure count or has been running for an unusually long time.
- Debugging: Inspect the live, failed Workflow's state and complete history to understand exactly what went wrong.
- Hotfixing: Fix the underlying bug in your worker code.
- Resume/replay: Once the fix is deployed, you can use Temporal’s tooling to replay the failed Workflow. It will then re-execute on the fixed code, often continuing from the point of failure without losing any progress.
This turns a potential data loss scenario (a message in a DLQ) into a recoverable operational flow.
The Bulkhead pattern: Compartmentalize your failures
The Bulkhead pattern recognizes a simple truth: shared resources create shared fate. When all your services share the same thread pool, one slow service can exhaust all the threads, starving every other operation.
By isolating resources into separate pools, you contain the blast radius of failures. Service A gets 50 threads, Service B gets 30, Service C gets 20. Now when Service A decides to have a bad day, it can only exhaust its own threads. Services B and C continue operating normally, blissfully unaware of A’s struggles.
This principle of isolation is fundamental to how Temporal designs resilient systems. While traditional bulkheads operate at the resource level (like thread pools), Temporal applies the pattern at an architectural level using Task Queues. By assigning different types of activities to different Task Queues, each served by a dedicated pool of Workers, you create a robust bulkhead. If activities on one queue become slow or start failing, they only consume the resources of their dedicated Worker pool, leaving other critical business functions completely unimpaired.
This approach is fundamentally more resilient than simple resource pooling because of Durable Execution. The work itself (the workflow state) is not tied to a specific worker or thread; it’s durably persisted by the Temporal cluster. This means even if an entire worker pool crashes and needs to be restarted, the work is never lost.
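As a rough sketch with Temporal’s TypeScript SDK (the activity modules and queue names are illustrative), two isolated Worker pools might look like this:

import { Worker } from '@temporalio/worker';
// Hypothetical activity modules; adjust the paths to your project.
import * as paymentActivities from './activities/payments';
import * as emailActivities from './activities/email';

async function run() {
  // Dedicated Worker per Task Queue: slow or failing email activities can only
  // exhaust the email Worker's resources, never the payment Worker's.
  const paymentWorker = await Worker.create({
    taskQueue: 'payments',
    activities: paymentActivities,
  });
  const emailWorker = await Worker.create({
    taskQueue: 'notifications',
    activities: emailActivities,
  });
  await Promise.all([paymentWorker.run(), emailWorker.run()]);
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});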
Idempotency: The unsung hero
Making operations idempotent is the foundation of reliable retries and at-least-once delivery. As we discussed, Temporal Activities provide an “at-least-once” execution guarantee, and your idempotent Activity implementation provides the “no-more-than-once” business effect. Together, they allow you to achieve an effective “exactly-once” execution of your business logic.
Implementation approach:
- Client generates a unique idempotency key.
- Server checks if the key was seen before.
- If new: process and store result with key.
- If duplicate: return stored result without reprocessing.
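Here is a minimal sketch of that check. The types and createOrder helper are hypothetical, and the in-memory Map stands in for a durable key store with a TTL:

type OrderPayload = { orderId: string; amountCents: number };

const processedResults = new Map<string, string>();

// Hypothetical business operation.
async function createOrder(payload: OrderPayload): Promise<string> {
  return `order-${payload.orderId}-created`;
}

async function handleCreateOrder(idempotencyKey: string, payload: OrderPayload): Promise<string> {
  const existing = processedResults.get(idempotencyKey);
  if (existing !== undefined) {
    // Duplicate request: return the stored result without reprocessing.
    return existing;
  }
  const result = await createOrder(payload);
  processedResults.set(idempotencyKey, result); // Store the result under the key.
  return result;
}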
Pro tips:
- Use UUIDs or meaningful keys (e.g., order-create-{orderId}).
- Store keys with TTL — don’t keep them forever.
- Make the key capture the operation’s intent precisely.
In Temporal Workflows, idempotency is particularly elegant. Since the platform tracks every step of execution, you can implement this pattern reliably.
from dataclasses import dataclass
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


# Illustrative payload. The client that starts the Workflow generates the
# idempotency key and passes it along with the rest of the request.
@dataclass
class PaymentRequest:
    amount_cents: int
    idempotency_key: str


@activity.defn
async def process_payment(data: PaymentRequest) -> str:
    # This activity would connect to a payment provider like Stripe,
    # passing data.idempotency_key with the API request.
    # The payment provider handles step 2 (checking the key).
    ...
    return "payment_processed_ok"


@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, data: PaymentRequest) -> str:
        # NOTE: Generating random data directly in a Workflow is a non-deterministic
        # operation and will break replay. A unique key must be generated in a
        # deterministic way: have the client that starts the Workflow generate it
        # (as we assume here, via the `data` object), or use workflow.uuid4().
        #
        # The Workflow calls the activity, passing the key along with the payload.
        # If the Workflow process were to crash and resume here, the *same*
        # idempotency key would be used on the next attempt,
        # preventing a duplicate charge.
        return await workflow.execute_activity(
            process_payment,
            data,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
Observability: Seeing through the chaos
You can’t fix what you can’t see. Three pillars light your way:
Structured logging + correlation IDs
Every request gets a unique ID that follows it everywhere. Combined with structured (JSON/YAML) logs, you can trace a request’s entire journey with a simple query.
{
"timestamp": "2024-01-10T10:30:45Z",
"level": "ERROR",
"service": "payment-service",
"correlationId": "abc-123-def",
"message": "Payment processing failed",
"userId": "user-789",
"error": "Gateway timeout"
}
Distributed tracing
While logs show what happened in each service, traces show the big picture — how services called each other, where time was spent, and where failures occurred.
A trace might reveal that your API call took 5 seconds not because your service was slow, but because it spent 4.8 seconds waiting for a database query that normally takes 50ms. Without tracing, you’d be optimizing the wrong thing.
OpenTelemetry is becoming the standard here, letting you instrument and choose your backend (Jaeger, Zipkin, etc.) later. But Temporal can make your life easier: it provides built-in visibility into execution history, making it easy to debug failures without instrumenting every service call.
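As a small illustration, instrumenting an outbound call with OpenTelemetry’s TypeScript API might look like this; the tracer name and attribute are arbitrary:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

// Wrap the outbound call in a span so the slow gateway or database query shows
// up in the trace instead of being blamed on your service.
async function chargeCard(orderId: string): Promise<void> {
  await tracer.startActiveSpan('charge-card', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      // ... call the payment gateway here ...
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}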
Metrics that matter
Monitor your resilience patterns to know if they’re helping or hurting:
Pattern | Key metrics | Alert when |
---|---|---|
Timeouts | Timeout rate, p99 latency | Rate > X% or latency approaching timeout |
Retries | Retry rate, max retries hit | High retry rate or many hitting max |
Circuit breaker | State changes, rejected calls | Stuck open or flapping |
DLQ | Queue depth, message age | Growing depth or old messages |
Fallbacks | Invocation rate, success rate | High usage or fallback failures |
While these three pillars are essential for any distributed system, a platform like Temporal bakes observability in as a first-class feature. Temporal automatically injects correlation IDs like WorkflowId and RunId into logs and traces, making it trivial to track a single execution’s journey. The SDKs emit the very metrics mentioned above (retries, latencies, timeouts) out-of-the-box and integrate with OpenTelemetry for distributed tracing.
But Temporal’s greatest contribution to observability is its “fourth pillar”: the Event History. Every Workflow execution provides a complete, queryable, and replayable audit trail of every event (WorkflowExecutionStarted, ActivityTaskScheduled, ActivityTaskCompleted, TimerFired, etc.). This provides an infallible source of truth for debugging that makes stitching together disparate logs and traces unnecessary, allowing you to see exactly what happened, step-by-step, long after the execution is complete.
Performance: The cost of resilience
Every resilience pattern has overhead. The trick is making smart trade-offs:
- Retries increase load — use backoff and jitter.
- Idempotency adds lookup latency — optimize your key storage.
- Circuit breakers can have false positives — tune thresholds carefully.
- Tracing adds overhead — use sampling in production.
The question isn’t whether these patterns have costs, but whether those costs are worth paying. In almost every case, they are. A system that takes 50ms longer to process requests but stays up during failures beats a system that’s 50ms faster but falls over when you sneeze at it. Your users won’t thank you for those saved milliseconds if they can’t complete their transactions.
Testing your resilience
Don’t wait for production to test your error handling. Use:
- Load testing with failure injection.
- Chaos engineering to break things on purpose.
- Game days to practice incident response.
The Durable Execution alternative
While the patterns we’ve discussed are essential knowledge for any distributed systems engineer, there’s a new approach that fundamentally changes how we handle failures: Durable Execution. Temporal embeds these resilience patterns directly into the execution model.
With Durable Execution, your workflow state is automatically persisted, retries happen without your intervention, and the platform handles the complexity of coordinating distributed transactions. Instead of writing defensive code full of error handling, you write business logic that looks almost sequential:
async function processOrder(orderId: string) {
// Each step is an Activity call: automatically retried on failure, with its
// result durably recorded before the Workflow moves on
const items = await checkInventory(orderId);
const payment = await processPayment(orderId);
await shipOrder(orderId, items);
await sendConfirmationEmail(orderId);
}
Behind the scenes, the platform handles retries, maintains state across failures, and ensures exactly-once execution for Workflows and at-least-once execution for Activities. Companies like Netflix, Stripe, and Snap have adopted this model for critical workflows, reporting dramatic reductions in error-handling code and faster development cycles.
Conclusion: Embracing the chaos
Distributed systems will fail. And that’s not “dooming,” it’s physics. Networks have latency. Servers crash. Bugs happen. But with the right patterns, observability, and mindset, you can build systems that bend without breaking.
Start simple. Add timeouts to your external calls. Make your critical operations idempotent. Implement retries with exponential backoff and jitter. As you gain confidence, layer in circuit breakers and bulkheads. Invest heavily in observability. Or try Temporal: we bake these resilience patterns directly into our platform, so you can focus on your business logic.
Remember: the goal isn’t to prevent all failures, but to handle them so gracefully that your users never notice. That’s the art of distributed systems.
Now go forth and build something resilient with Temporal Cloud. Start your free trial today and get $1000 in free credits. Your future on-call self will thank you.