How to Handle Fault Tolerance in Microservices

There’s a saying: “Amateurs study tactics, while professionals study logistics.” In software, this translates to: “Amateurs focus on algorithms, while professionals focus on failures.”

At J on the Beach, I took time in my talk to expand on this saying and explain that real-world systems don’t just need code that works on the “happy path” — they need a safety net for when things go wrong.

Modern software development has layers of complexity. You’re not just writing code; you’re connecting systems across time and space, handling data that doesn’t sleep, and ensuring flawless performance at scale. What sets top developers apart is how they manage failures. Building resilience means ensuring reliability when things inevitably go wrong, not just maintaining uptime.

In this post, we’ll walk through three common approaches to handling failures in software, each with its own strengths and weaknesses. Then we’ll introduce Temporal’s approach, workflow-as-code, which makes it easier to build reliability into your systems from day one.

Three Ways to Handle Failure in Your Software

Failures are inevitable in distributed systems. When a network link fails, a server times out, or a service crashes, your systems need strategies to respond properly and keep operations reliable.

Below, we’ll explore three common approaches to coordination between systems — Remote Procedure Calls (RPCs), persistent queues, and workflows — and their relationship to failure management.

1. Request-Response (RPC)

The request-response, or RPC, model is a classic approach. A client makes a request, and the server processes it and sends back a response. In the best-case scenario, the “happy path,” everything works smoothly. Imagine a money transfer request: one service debits the sender while another credits the receiver. If all goes as planned, the transfer completes with no issues.

Pros of the RPC Model

Simplicity: The direct client-server connection makes this model easy to implement for straightforward workflows.
Efficiency on the “happy path”: When things go smoothly, RPC provides fast, efficient responses and low latency.

Cons of the RPC Model

Limited resilience for partial failures: If the client’s request succeeds but the response never arrives, or a step in the process fails partway through, the client cannot tell what state the system is in. Covering these cases often requires extensive error-handling code on the client side.
Heavy client burden: Clients must handle errors, recovery, and retries, complicating systems as they scale.

The RPC model works well for simple, synchronous tasks. For resilience, however, it falls short: it places the onus on both the developers of the RPCs and those consuming them to manage every failure scenario, which is no trivial matter.
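To make the client burden concrete, here is a minimal sketch of what a caller ends up writing around a single debit call. Everything in it is hypothetical and in-process: `FlakyLedger` stands in for a remote service, `TransientError` for a lost response, and `call_with_retries` for the retry loop the client has to own. It assumes the server deduplicates requests by an idempotency key, which is one common way to make retries safe, not the only one.

```python
import time
import uuid


class TransientError(Exception):
    """Hypothetical error: the call failed or its response was lost."""


class FlakyLedger:
    """Hypothetical in-process stand-in for a remote debit service."""

    def __init__(self, balance, lose_first_n_responses=0):
        self.balance = balance
        self.seen = set()  # idempotency keys that were already applied
        self._to_lose = lose_first_n_responses

    def debit(self, key, amount):
        if key not in self.seen:  # apply each logical request at most once
            self.seen.add(key)
            self.balance -= amount
        if self._to_lose > 0:  # the work is done, but the reply never arrives
            self._to_lose -= 1
            raise TransientError("response lost")
        return self.balance


def call_with_retries(fn, max_attempts=3, base_delay=0.01):
    """Retry fn on TransientError with exponential backoff, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # the caller now has to decide what to do
            time.sleep(base_delay * 2 ** (attempt - 1))


ledger = FlakyLedger(balance=100, lose_first_n_responses=2)
key = str(uuid.uuid4())  # one key for the whole logical request
remaining = call_with_retries(lambda: ledger.debit(key, 30))
print(remaining)
```

Without the idempotency key, each retry after a lost response would debit the sender again. This is the kind of failure scenario that both the RPC author and the RPC consumer have to anticipate by hand.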

2. Persistent Queues

Persistent queues add a degree of flexibility by decoupling the client from the server. Messages are placed in a queue, and the system processes them asynchronously. Queues help distribute workloads: they support automatic retries and asynchronous processing, which can smooth out demand spikes.

Pros of Persistent Queues

Automatic retries: Persistent queues often support automatic retries, attempting tasks multiple times if they initially fail.
Load distribution: Queues smooth out processing under heavy load by distributing requests over time, which improves system reliability.
Producer-consumer separation: Decoupling producers and consumers allows the queue to function independently, improving fault tolerance.

Cons of Persistent Queues

Loss of ordering: Since queues process messages independently, tasks may execute out of order, causing unexpected issues for dependent operations.
Dead-letter queues: Tasks that continuously fail may require a separate “dead-letter” queue, adding complexity and, typically, manual intervention.
Limited visibility into status: A queue on its own doesn’t show where a given task stands. Visibility becomes even more challenging in systems that use multiple queues, which require additional tooling and infrastructure.

Queues work well when you need flexibility and decoupling, but they lack the control and visibility needed for comprehensive failure management.
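The trade-offs above can be seen in a small sketch. It uses Python's in-memory `queue.Queue` as a stand-in for a persistent broker, and the `drain` function, the `handler`, and the message names are all hypothetical. A failed message is re-enqueued at the back of the queue, and after `max_attempts` failures it is parked in a dead-letter queue.

```python
import queue


def drain(work, handler, max_attempts=3):
    """Process every message; park repeated failures in a dead-letter queue."""
    dead_letters = queue.Queue()
    processed = []
    while not work.empty():
        body, attempts = work.get()
        try:
            processed.append(handler(body))
        except Exception:
            if attempts + 1 >= max_attempts:
                dead_letters.put(body)  # give up; someone must inspect it
            else:
                work.put((body, attempts + 1))  # retry, behind newer messages
    return processed, dead_letters


failed_once = set()


def handler(body):
    """Hypothetical consumer: 'poison' always fails, 'debit' fails once."""
    if body == "poison":
        raise ValueError("cannot be processed")
    if body == "debit" and body not in failed_once:
        failed_once.add(body)
        raise RuntimeError("transient failure")
    return body


work = queue.Queue()
for body in ["debit", "credit", "poison"]:
    work.put((body, 0))  # each message carries its attempt count

processed, dead_letters = drain(work, handler)
print(processed)
```

In this run the retried `debit` completes after `credit`, even though it was enqueued first, and `poison` ends up in the dead-letter queue waiting for someone to look at it. Nothing in the queue itself reports how far the overall transfer has progressed.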

3. Workflows

Workflows provide a robust solution for orchestrating complex processes across distributed systems. Unlike RPC or queue-based models, workflows manage retries, state, and error handling automatically, making them ideal for long-running or multi-step processes.

Pros of Workflows

Built-in resilience: Workflows handle retries, recovery, and compensation steps automatically, reducing the need for custom error-handling code.
Support for long-running processes: Workflows accommodate processes that span minutes, hours, or even days, making them well-suited for complex tasks.
Enhanced visibility: Workflow systems enable real-time tracking and querying, so both clients and developers can see exactly where each process stands.

Cons of Workflows

Infrastructure requirements: Workflows require a solid infrastructure to manage states, retries, and tracking, which some teams may lack.
Setup complexity: Workflow systems can be complex to set up, especially when building custom solutions to manage workflows.

For complex processes that demand reliability and transparency, workflows provide the most comprehensive solution, though they require dedicated infrastructure to deploy effectively.
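As a rough illustration of what a workflow engine takes off your hands, here is a toy orchestrator. It is a sketch under heavy assumptions, not any real product's API: the `Workflow` class, its `history` list, and the step functions are hypothetical, and the history lives in memory where a real system would persist it durably. It retries each step, records every outcome, and runs compensation steps in reverse order when a step fails for good.

```python
class Workflow:
    """Toy orchestrator: retries steps, records status, and compensates."""

    def __init__(self, steps, max_attempts=3):
        self.steps = steps  # list of (name, action, compensation)
        self.max_attempts = max_attempts
        self.history = []  # in-memory stand-in for a durable event log

    def _attempt(self, name, action):
        """Run one step with retries; return True once it succeeds."""
        for attempt in range(1, self.max_attempts + 1):
            try:
                action()
                self.history.append((name, "completed"))
                return True
            except Exception:
                self.history.append((name, f"failed attempt {attempt}"))
        return False

    def run(self):
        done = []
        for name, action, compensate in self.steps:
            if self._attempt(name, action):
                done.append((name, compensate))
                continue
            # Undo the completed steps, most recent first.
            for prev_name, undo in reversed(done):
                undo()
                self.history.append((prev_name, "compensated"))
            return "failed"
        return "completed"


accounts = {"sender": 100, "receiver": 0}


def debit():
    accounts["sender"] -= 30


def refund():
    accounts["sender"] += 30


def credit():
    raise ConnectionError("receiver service unavailable")


transfer = Workflow([("debit", debit, refund), ("credit", credit, lambda: None)])
status = transfer.run()
print(status, accounts, transfer.history)
```

Here the credit step never succeeds, so the debit is refunded and the sender's balance returns to its starting value. The `history` list is what gives you the visibility described above: at any point it shows which steps completed, which were retried, and which were compensated.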

Resilience Without Extra Overhead

At Temporal, we addressed these challenges by designing a platform that handles resilience, error handling, and state management so you don’t have to.

With Temporal, you write workflows as code, with no extra XML, JSON, or YAML definitions of workflow logic that are difficult to understand and debug down the line. Define your steps in regular code, and Temporal does the rest: managing retries, maintaining state, and ensuring that your workflows are reliable and simple to create.

Companies like ANZ Bank, one of the largest banks in the Asia-Pacific region, rely on Temporal to strengthen the resilience and reliability of critical financial processes. With Temporal, ANZ orchestrates and manages complex operations across distributed systems, ensuring tasks are retried automatically, failures are handled, and long-running processes are tracked seamlessly. This has enabled ANZ to boost system reliability, reduce operational complexity, and uphold strict compliance standards in their high-stakes FinServ environment.

Failure Management Is a Strategy, Not a Setback

Any complex system will encounter failures. But how you handle those failures makes all the difference. For developers, focusing on failure management from the start distinguishes exceptional teams from average ones. Building resilience into your system sets your project up for long-term success.

Discover how to make your app resilient with examples from leading companies like Snap and Coinbase, or start a free trial of Temporal Cloud today.
