Durable Execution in Distributed Systems: Increasing Observability, Reliability, and Development Velocity

temporal logo meteor indigo

Loren Sands-Ramshaw & Irina Belova

Developer Relations Engineer, Product Marketing Manager

Durable Execution is a concept used by companies such as Stripe, Netflix, HashiCorp, Datadog, and many others to address a variety of challenges in distributed systems. In this post, we look at those challenges, how durable execution can solve them, the new programming possibilities it introduces, and how it makes distributed system patterns easier.

Distributed Systems

Before the rise of cloud computing and microservices in the early 2010s, most applications used a monolithic architecture. In this model, all components were deployed as one unit and shared a single database, usually located in a local datacenter. This approach made development slow and risky because changes in one part could cause problems in others. Scalability was limited by hardware, and adding new servers often took weeks or months.

With cloud computing, companies moved to virtualized infrastructure on platforms like AWS, Azure, or GCP. Applications also shifted to microservices, breaking down into smaller, independent components. This made it easier to update and scale parts of the system quickly and safely.

Today, most applications are distributed systems, where components communicate over networks. This shift brings new challenges, as remote connections introduce potential points of failure, making it harder for developers to keep everything consistent and reliable. Let's say you’re working in a distributed system and you need to charge a credit card and update the database at the same time:

public async Task<int> HandleRequestAsync()
{
    await paymentAPI.ChargeCardAsync();
    await database.InsertOrderAsync();
    return 200;
}

This post uses C# as an example.

If the card is charged but the database update fails, the system ends up in an inconsistent state: the customer is charged, but there’s no record of it. You might try to fix this by retrying the database update until it works.

This could work if InsertOrderAsync() can safely be retried and the database issue is temporary. But if the process that’s retrying crashes before it finishes, the system is still left in an inconsistent state.

To avoid this, your application needs to:

  • Save order details.
  • Keep track of which steps are completed.
  • Have a worker process to finish any incomplete tasks.

Now imagine that the application has to do even more things, like updating inventory, creating a shipping label, or assigning a delivery driver. Writing the code to handle all the retries, timeouts, and saving state for each step can be overwhelming, and it’s easy to miss some edge cases or failure scenarios (see the full, scalable architecture. )

Durable execution helps by automatically handling these complexities, so you can focus on building your application without worrying about all the failure handling details.

Durable Execution

Durable Execution guarantees that code runs to completion, regardless of hardware reliability, network stability, or service availability. This means that even if the hardware crashes, the network fails, or downstream services are temporarily unavailable, the code execution continues seamlessly. Timeouts and retries are handled automatically and transparently, and resources are efficiently managed by freeing them up when the system is idle, such as when waiting for a downstream service to become available again.

This is possible because Durable Execution systems like Temporal persist each step our code takes. If the process running the code fails, another process takes over, maintaining all the state information, including the call stack and local variables. For example, if an execution is blocked on an external API call and the machine that hosts it crashes, when the API call returns a few days later, the execution is recovered on a different machine still blocked on the same API call. Then, the call returns, and the execution continues to the next line. This ensures that the code execution continues without any loss of progress or data.

New Possibilities

Durable execution is programming on a higher level of abstraction, where we don’t have to be concerned about transient faults in our infrastructure or dependencies. It opens up new possibilities like:

1. Writing code that sleeps for a month.

We can realistically have a method that sleeps for an arbitrary length of time, be it weeks, months, or years. Thanks to durable execution, we don’t need to be concerned about whether the process can safely be expected to run for that period of time—we can be confident that another process will continue running the method at the specified time. For example, a subscription method can charge the user’s credit card every 30 days in a loop:

  public async Task RunAsync(string userId)
  {
      while (true)
      {
          await Workflow.DelayAsync(TimeSpan.FromDays(30));

          await Workflow.ExecuteActivityAsync(
              () => MyActivities.Charge(userId),
              new() { StartToCloseTimeout = TimeSpan.FromMinutes(5) });
      }
  }

After it waits 30 days, it calls ExecuteActivityAsync to execute an activity. Activities are methods with normal code. They are the steps that are automatically timed out and retried. Here, if the Charge(userId) activity method fails or times out, it will be re-executed with exponential backoff, unless the error thrown is marked as non-retryable, like a “card declined” error.

2. Workflows can receive RPCs and respond to queries.

A workflow is a single instance of durable code execution. In the below example, a signal can be sent to notify of an order item, a query can be used to get the items, and a signal can be sent to perform checkout which will run an activity and complete the workflow:

[Workflow]
public class ShoppingCartWorkflow
{
	private PaymentDetails? paymentDetails;

	[WorkflowRun]
	public async Task<CompletedOrder> RunAsync(ShopperInfo shopperInfo)
	{
    	// Wait for checkout
    	await Workflow.WaitConditionAsync(() => paymentDetails != null);

    	// Do checkout and return
    	return await Workflow.ExecuteActivityAsync(
        	() => OrderActivities.Checkout(shopperInfo, paymentDetails, Items),
        	new() { StartToCloseTimeout = TimeSpan.FromMinutes(5) });
	}

	[WorkflowQuery]
	public List<Item> Items { get } => new();

	[WorkflowSignal]
	public async Task AddItem(Item item) =>
    	Items.Add(item);

	[WorkflowSignal]
	public async Task CheckoutAsync(PaymentDetails paymentDetails) =>
    	this.paymentDetails = paymentDetails;
}

3. Store state in instance variables instead of a database.

Since a Workflow can be long running, and we can trust that an instance variable will always be there and be accurate, we can send an RPC to get the variable’s value instead of storing it in a DB:

var items = await handle.QueryAsync(workflow => workflow.Items);

A query is a type of RPC that only returns data and doesn't mutate state.

Distributed System Design Patterns the Temporal Way

Durable execution makes it trivial or unnecessary to implement many distributed systems patterns, including event-driven architecture, task queues, sagas, cron jobs, state machines, circuit breakers, and transactional outboxes. For a more in-depth explanation of each of these, you can watch System Design on Easy ,Mode.

Durable execution makes implementing patterns like event-driven architecture, task queues, and sagas straightforward:

  • Event-Driven Architecture: Temporal is event-driven, maintaining loose coupling and simplifying development and debugging.
  • Task queues: Temporal handles task distribution, reducing the need for custom queue management.
  • Sagas: Temporal orchestrates long-running transactions, automatically compensating for failures.

Event-Driven Architecture

The great benefit of Event-Driven Architecture (EDA)—using a message bus to communicate between services—is loose coupling at runtime. A service running step B can go down without losing requests or messing up the service running step A, since service A can keep publishing messages to the bus. When service B comes back up, it can read the messages it missed while it was down.

However, EDA is tightly coupled when implemented at scale: making a breaking change to a message sent to a bus means finding all the other teams that depend on that message and getting them to deploy an update to their code before we can deploy our change.

Durable execution is loosely coupled at runtime: Temporal is event-driven under the hood, and any piece can go down and come back up without dropping work. But it has a much better developer experience when building and evolving systems. Not only is it easier to code and make changes, but developers are also able to understand the system much better. Instead of studying disparate instances event-handling code to try to figure out what it does with events in which circumstances, or setting up and combing through distributed tracing a diagram, we can:

  • read the durable code to see what it does, and
  • view in the Temporal UI the steps every production function execution took—or even the current state of an in-progress execution.

For more on this topic, check out the keynote from the 2023 Replay conference: The way forward for event-driven architecture.

Task Queues

Anything that we would normally use task queues for can instead be accomplished with durable execution. Under the hood, every durable function–and every step the function takes–is put on a task queue and distributed across a pool of workers. All we need to do is provide our code to Temporal’s worker library and ensure we’re running enough worker processes to get through all the work in the desired time frame.

Sagas

Sagas are long-running transactions that don’t hold locks; instead, each step is executed sequentially, and if a step fails, previous steps are undone with compensating steps. This is a common pattern when we need to alter state that’s stored across multiple data stores or other systems, and it requires either choreography (event-based) or orchestration (central coordinator) to be accurately implemented. The Microservices Patterns book recommends using orchestration for non-trivial use cases (and I'd argue for all use cases) due to the complexity of choreography (see event-driven architecture above), and durable execution is developer-friendly, automatic orchestration—it orchestrates each step our code takes.

In our first example, we can replace this request handler:

public async Task<int> HandleRequestAsync()
{
    await paymentAPI.ChargeCardAsync();
    await database.InsertOrderAsync();
    return 200;
}

with executing a Workflow:

public async Task<int> HandleRequestAsync(order)
{
    await client.ExecuteWorkflowAsync(
        (ProcessOrder workflow) => workflow.RunAsync(order),
        new(id: order.OrderId, taskQueue: "my-store"));
    return 200;
}

and the Workflow is a saga with two steps:

[WorkflowRun]
public async Task RunAsync(Order order)
{
    await Workflow.ExecuteActivityAsync(
        () => MyActivities.ChargeCardAsync(order.paymentInfo),
        new() { StartToCloseTimeout = TimeSpan.FromMinutes(5) });

    try
    {
        await Workflow.ExecuteActivityAsync(
          () => MyActivities.CreateOrderAsync(order),
          new() { StartToCloseTimeout = TimeSpan.FromMinutes(5) });
    }
    catch (ActivityFailureException ex)
    {
        Workflow.Logger.LogWarning(ex, "Creating order failed, compensating");
            () => MyActivities.RefundCardAsync(order.paymentInfo),
            new() { StartToCloseTimeout = TimeSpan.FromMinutes(5) });
        throw;
    }
}

If the second step—CreateOrderAsync—has a non-retryable failure (like “item out of stock”), then we compensate by refunding the charge. (If it has a transient failure, like it's unable to reach the database, it will be retried.) The compensation, combined with the guarantee that the method will complete executing, makes this small, simple method a reliable long-running transaction!

Conclusion

This blog covered consistency in distributed systems, durable execution, and its benefits: new programming possibilities, and the simplification and acceleration of building distributed systems.

Durable Execution offers the following advantages:

  • It increases development velocity by simplifying the coordination of services, APIs, and data stores, allowing features that once took 20 weeks to be built in just two weeks, with some completed in a day using Temporal.
  • It enhances reliability because a simpler codebase leads to fewer bugs. Durable Execution has been tested at scale by numerous companies, effectively handling much of the complexity for developers.
  • It significantly improves testability by providing a structured environment for testing, which allows developers to simulate various failure scenarios. Temporal’s Workflow history and replay capabilities ensure comprehensive test coverage and eliminate flakiness.
  • It enhances observability, as every step of the code execution can be tracked and viewed in Temporal’s UI. This visibility enables quick diagnosis of issues and aids in monitoring system health.
  • It makes debugging easier by providing better visibility, allowing developers to download execution logs and replay them locally. Fast-forwarding time during testing simplifies debugging scenarios.
  • It contributes to a positive user experience when systems operate reliably without failures, resulting in a smoother and more satisfying interaction for users.

Durable execution significantly simplifies software development. There are systems like Temporal, Azure Durable Functions, AWS Simple Workflow Service, and Uber Cadence that provide durable execution. Temporal's founders, Maxim and Samar, have extensive experience with all these technologies. Samar developed the Durable Task Framework, which was later adopted by Azure Durable Functions. Maxim was the tech lead for the public release of AWS Simple Workflow Service. Both of them led the Cadence project at Uber before they founded Temporal in 2019.

Temporal is open source (MIT license) with a large team of full-time engineers working to improve it. Temporal is used by thousands of companies for critical applications. If you’ve streamed a show on Netflix, ordered food from DoorDash, or made a payment through Stripe, you’ve experienced a Temporal Workflow.

Learn more about the use cases of Stripe, Datadog, Coinbase, and others.

For more information, check out the recommended resources:

Temporal also has runtimes for Go, Java, Python, TypeScript, and PHP. You can also use multiple languages, for instance having a .NET Workflow call activities implemented in Go and Java.

Ask us any questions in the Temporal Community Slack. Get started today by downloading Temporal or signing up for Temporal Cloud.

Thank you to Rick Ross, Erica Sadun, Tom Wheeler, Drew Gorton, Chad Retz, and all the other Temporal team members for reviewing this post!