Improving Latency with Eager Workflow Start

Eager Workflow Start (EWS) is an experimental latency optimization, currently in Pre-release, with the goal of reducing the time it takes to start a workflow. The start response with EWS includes the initial workflow task when there is a local worker ready to process it.

The target use case is short-lived workflows that interact with other services using local activities, ideally initiating this interaction in the first workflow task, and deployed close to the Temporal Server. These workflows have a happy path that needs to initiate interactions within low tens of milliseconds, but they also want to take advantage of server-driven retries, and reliable compensation processes, for those less happy days.

Applications of this use case include financial transactions that interact with customers in real-time, collaborative apps that need reliable replay, or remote interactions with the physical world, such as configuring drone settings, or controlling factory equipment.

In this post, we will first describe how it works, and what performance improvement you can expect. Then, we will discuss some limitations of the current implementation, and finally, wrap up with a “hello world” example in Go.

Discussion

The traditional way to start a workflow decouples the starter program from the worker by sharing a task queue name between them, similar to a publish/subscribe pattern. This has many advantages: for example, we can reliably schedule a workflow execution without a running worker, or separate the worker and workflow implementation from the starter application and host them independently.

But decoupling also makes it harder to optimize for latency. In contrast, when the starter and worker are co-located and aware of each other, they can interact directly rather than through a task queue, saving a few time-intensive operations.

Figure: Implementation of Eager Workflow Start

The above figure shows EWS in action:

1. The process begins with the starter setting EnableEagerStart to true in the start workflow options.
2. Then, the SDK will try to locate a local worker that is willing to execute the first workflow task, and reserve an execution slot for it.
3. If successful, the SDK will provide a hint to the server that eager mode is preferred for the new workflow.
4. The server not only registers the start of the workflow in history, it also assigns the first workflow task to the starter, all in the same DB update.
5. The first task is included in the server response, no matching step required.
6. The SDK extracts the task from the response, and dispatches it to the local worker.

To recover from errors, EWS falls back to the non-eager path. For example, when the first task is returned eagerly, but the local worker refuses to honor the reserved slot, the server retries this task non-eagerly after WorkflowTaskTimeout.
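To make the reserve-then-fall-back decision concrete, here is a toy model in Go. It is only a sketch and not the SDK's actual code: slotPool, tryReserve, release, and startMode are hypothetical names, and the server side is reduced to a boolean that says whether it returned the first task in the start response.

```go
package main

import "fmt"

// slotPool is a hypothetical stand-in for a local worker's execution slots.
// A buffered channel acts as a counting semaphore.
type slotPool chan struct{}

// tryReserve attempts to take a slot without blocking.
func (p slotPool) tryReserve() bool {
	select {
	case p <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a previously reserved slot.
func (p slotPool) release() { <-p }

// startMode models the decision described above. serverReturnedTask is an
// assumption that stands in for "the start response included the first task".
func startMode(p slotPool, serverReturnedTask bool) string {
	if !p.tryReserve() {
		// No local slot, so no eager hint is sent to the server.
		return "non-eager: no local slot available"
	}
	if !serverReturnedTask {
		// The hint was not honored; give the slot back and fall back.
		p.release()
		return "non-eager: server did not return a task"
	}
	// The slot stays reserved while the local worker runs the first task.
	return "eager: first task dispatched locally"
}

func main() {
	pool := make(slotPool, 1)
	fmt.Println(startMode(pool, true))  // takes the only slot
	fmt.Println(startMode(pool, true))  // slot is still held
	pool.release()                      // first task finished
	fmt.Println(startMode(pool, false)) // server declined the hint
}
```

The non-blocking reservation is the important design choice in this sketch: the starter never waits for a slot, so a busy worker simply means the workflow starts through the regular task queue path.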

What are the savings? One database update plus the matching operation that associates polling workers with messages in task queues. It can also reduce latency variation because polling worker connections are not always ready when you need them.

This translates into significantly lower latency. For example, a few months back we did a test measuring the time it takes to create a workflow and start executing its first task, a local activity. The goal was to estimate the minimum latency to interact with an external service, using a newly-created workflow, and the Temporal Cloud. The starter and worker were in the same AWS region as our Temporal Cloud namespace. The p50 latency was 16.7 ms (eager) vs 29.3 ms (non-eager), a 43% improvement. For p99 latency, we saw 30.9 ms (eager) vs 51.6 ms (non-eager), a 40% improvement.
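As a quick check of the arithmetic, the improvement is the latency saved divided by the non-eager latency. The helper name improvement below is only illustrative; the inputs are the measurements quoted above.

```go
package main

import "fmt"

// improvement returns the relative latency reduction, in percent,
// of the eager path compared with the non-eager baseline.
func improvement(nonEagerMs, eagerMs float64) float64 {
	return (nonEagerMs - eagerMs) / nonEagerMs * 100
}

func main() {
	fmt.Printf("p50: %.1f%%\n", improvement(29.3, 16.7)) // prints "p50: 43.0%"
	fmt.Printf("p99: %.1f%%\n", improvement(51.6, 30.9)) // prints "p99: 40.1%"
}
```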

Note that these numbers are for illustrative purposes only, just to put EWS potential improvements in perspective.

Limitations

The most intrusive limitation of this implementation is that the worker and the starter need to share a client connection to discover each other. This means that they have to run in the same process, and share a common lifecycle. In the next section, we discuss how to do that with a Go example.

Also, the current implementation does not support worker versioning with build IDs, another feature in Pre-release. Our goal is to solve this problem before General Availability.

Currently, we are supporting the following Temporal SDKs:

We will not immediately support:

TypeScript

EWS is enabled with the dynamic configuration server flag system.enableEagerWorkflowStart.

For debugging, we can start a local server with EWS enabled using:

temporal server start-dev --dynamic-config-value system.enableEagerWorkflowStart=true

To enable EWS in your namespace in Temporal Cloud, open a ticket.

Hello World Example

package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"

	"github.com/temporalio/samples-go/helloworld"
)

const taskQueueName = "eager_wf_start"

func main() {
	// 1. Create the shared client.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("Unable to create client", err)
	}
	defer c.Close()

	// 2. Start the worker in a non-blocking manner before the workflow.
	workerOptions := worker.Options{
		OnFatalError: func(err error) { log.Fatalln("Worker error", err) },
	}
	w := worker.New(c, taskQueueName, workerOptions)
	w.RegisterWorkflow(helloworld.Workflow)
	w.RegisterActivity(helloworld.Activity)
	err = w.Start()
	if err != nil {
		log.Fatalln("Unable to start worker", err)
	}
	defer w.Stop()

	workflowOptions := client.StartWorkflowOptions{
		ID:        "eager_wf",
		TaskQueue: taskQueueName,

		// 3. Set this flag to true to enable Eager Workflow Start.
		EnableEagerStart: true,
	}

	// 4. Reuse the client connection.
	we, err := c.ExecuteWorkflow(context.Background(), workflowOptions,
		helloworld.Workflow, "Temporal")
	if err != nil {
		log.Fatalln("Unable to execute workflow", err)
	}

	// 5. Wait for workflow completion.
	var result string
	err = we.Get(context.Background(), &result)
	if err != nil {
		log.Fatalln("Unable to get workflow result", err)
	}
	log.Println("Workflow result:", result)
}

The code above comes from the samples-go repository. It is a single-file, EWS-enabled variant of the “Hello World” sample in the same repository.

To share the client between worker and starter, we need to start the worker first in non-blocking mode, i.e., using Start() instead of Run(), and set up a handler for worker errors with the OnFatalError option. Then, we just need to set EnableEagerStart to true in the start workflow options, and reuse the same client to start the workflow.

Conclusion

EWS is an experimental optimization that reduces the time to start a workflow significantly. The current implementation has some limitations, and it requires changes to the structure of your application, but, if low latency with reliability really matters to you, give it a try, and tell us what you think in our Community Slack!

Acknowledgments

Roey Berman is the architect of EWS, and he also implemented the required server-side changes. Quinn Klassen implemented EWS for the Go SDK. Dmitry Spikhalskiy did the Java SDK implementation, and Chad Retz the .NET one. Loren Sands-Ramshaw, Drew Hoskins, Paul Nordstrom, and Tasha Alfano made many insightful comments on this post.