Need a closer look? Download and review slides presented at Replay 2023 here.

Temporal is a key building block of automation at Datadog, which makes more ergonomic tooling for operating it a priority. With existing tools, the onus is on the user to accurately provide all the information required to start or interact with a Workflow Execution. By making this information programmatically accessible, we have been able to build next-generation tooling for both developing on Temporal and ergonomically interacting with automation workflows built with Temporal.

Transcript

This transcript has been edited for clarity and formatting.

Eric Chee: Hello everybody. My name is Eric Chee and I’m a software engineer here at Datadog. Today I’ll be sharing with you how we built next generation Temporal tooling with worker reflection. To start, we’ll talk about how Temporal is used at Datadog, before discussing tooling ergonomics when interacting with Temporal. We will then take a look at the next generation Temporal tooling we built and how it improves the user experience before walking through how it’s accomplished and exploring areas for further improvement.

Let’s begin with how we use Temporal here at Datadog. To start, we built a deployment system on top of Temporal and this quickly expanded to a number of other use cases, including automating the lifecycle on infrastructures like Kubernetes, creating control planes to allow self-service onboarding to entirely managed services, and automating complex operations around stateful systems and databases, like Kafka, Cassandra, and Postgres. Temporal is also leveraged to build a backend for developer tooling, including our merge queue and incident response follow-up.

We’ve been running Temporal for about three years now, and currently we execute over 150 million Workflow Executions in a month, across about 1000 different Workflow types, which are owned by dozens of teams. As a result, over 1000 engineers interact with Temporal Workflows every month. For some of you, this looks familiar. It was originally from Jacob LeGrone’s talk at Replay 2022. To briefly recap, at Datadog, we provide a development SDK on top of Temporal, which in addition to allowing us to standardize things such as deployments, observability, and auth, provides the core Temporal enablement team more control over things like Worker registration, Workflow replays, and other aspects of how the Temporal SDK is used.

This development platform is centered around our Protobuf SDK, in which users can define Workers, Workflows, Activities, Signals, and Queries as RPCs. There are a couple of ways that this approach benefits the user experience. Firstly, the development cycle is nearly the same as gRPC, which many are already familiar with. And, because service owners must define the Workflows, Activities, Signals, and Queries in proto, we have a source of truth for a Worker’s capabilities. Furthermore, service owners are able to set the default Workflow and Activity start options. For example, this is a snippet of the proto Workflow Definition of a Workflow that tracks the lifecycle of a virtual pet dog. The Dog Workflow has a default parent close policy of abandon, has a timeout of 30 days, and has some Signals and Queries which are also defined in proto; one of each is shown in this snippet. And from this Definition, we can generate our Worker framework code, typesafe clients, and documentation.

In order to drive the adoption of Temporal across Datadog, it was important for us to provide an ergonomic user experience not only for creating and operating Temporal Workers, but also for interacting with Workers and Workflow Executions. To start, we’ll look at the user journey of starting a Temporal Workflow Execution from the Temporal CLI. First, we need to provide the connection parameters to connect to Temporal. Then we need to specify the namespace, task queue, and Workflow Type of the Workflow that we wish to start. Next, we want to set the start options. And since we want to use the default values set by the Workflow owner, we can get them from the Workflow’s Protobuf definition. And finally, we need to provide the Workflow input, which we can do as a JSON payload.

Now, this is a lot of information that users need to collect and provide. Doing so accurately can be quite challenging, especially for new users who are not familiar with Temporal. Thus, we want to improve the user experience, which we can do through better cognitive ergonomics. We can define this as understanding the behavior of humans as they interact with machines or systems like Temporal, in order to design interfaces and controls that support their needs by reducing the cognitive burden and promoting awareness of their operations.

How do we apply this to the CLI experience? Well, the first way you can achieve this is by inferring required information. Instead of users explicitly having to pass all the required information as parameters, can we infer it based on other values? Reducing the amount of information the user needs to provide directly reduces the opportunities for error. The environment manager built into the Temporal CLI is a good example of this. It enables users to omit connection parameters that can be inferred based on the environment flag.

Next, we want to have comprehensive input validation. It is not always obvious why a command does not work as expected, and this can be a very frustrating user experience. As an example, this workflow start command will indeed start a Workflow Execution, but it will get stuck and not progress. In this case, we have a typo in the Workflow type, resulting in an execution being created for a Workflow type that does not exist. There are plenty of other similar errors that can occur. For example, we could have just as easily scheduled the Workflow onto a task queue that has no Workers, or provided an invalid input payload.
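
The kind of pre-flight check described here can be sketched in a few lines of Python. This is an illustration only, not the tool from the talk: the `KNOWN_WORKFLOW_TYPES` table, the `inuotchi-tasks` task queue name, and the `validate_start` function are all hypothetical, and a real tool would fetch this information from the Workers rather than hard-coding it.

```python
import difflib

# Hypothetical snapshot of what Workers report about themselves:
# task queue name -> Workflow types registered on it.
KNOWN_WORKFLOW_TYPES = {
    "inuotchi-tasks": ["Dog", "NewDog"],
}


def validate_start(task_queue: str, workflow_type: str) -> list[str]:
    """Return human-readable problems; an empty list means the input looks valid."""
    types = KNOWN_WORKFLOW_TYPES.get(task_queue)
    if types is None:
        return [f"no Workers are known for task queue {task_queue!r}"]
    if workflow_type not in types:
        # Suggest the closest registered type to help with typos.
        hint = difflib.get_close_matches(workflow_type, types, n=1)
        suffix = f"; did you mean {hint[0]!r}?" if hint else ""
        return [f"unknown Workflow type {workflow_type!r}{suffix}"]
    return []


for problem in validate_start("inuotchi-tasks", "Dogg"):
    print(problem)
```

Rejecting the command before anything is sent to the server means the user sees the typo immediately, instead of discovering later that an execution is stuck.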

Finally, we want to bring required information closer to the user by providing contextual help. Ideally, our users should be provided with enough information that they can determine the desired input parameters without needing to look up additional information from other sources, such as documentation or a Worker’s implementation. In a CLI tool, this can be done through autocompletion. However, to make these improvements, we need more contextual information, such as the task queues on a namespace, the Workflow and Activity types on a task queue, and details such as argument types or default start options. Much of this information is not available at runtime to the Temporal CLI, as the Temporal Server is largely agnostic to the implementation details of any given Worker. As a result, many of our users started creating custom CLI tools and scripts to interact with their Workflows, thus fragmenting the tool ecosystem used to interact with Temporal. This introduced a bunch of problems, such as duplication of work and user overchoice, as many of these tools had overlapping functionality. It is also a challenge to maintain and distribute these tools across hundreds of developer machines, especially because these tools need to be manually updated to reflect any changes made to a Worker.
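
To make the autocompletion idea concrete, here is a minimal Python sketch. It assumes the contextual information has already been collected into a dictionary; the `REFLECTED` structure, its keys, and the `complete` function are invented for illustration and do not come from the talk.

```python
# Hypothetical contextual information gathered for one task queue.
REFLECTED = {
    "workflow_types": ["Dog", "NewDog"],
    "signal_types": ["Feed", "Walk"],
    "query_types": ["Visualize"],
}


def complete(kind: str, prefix: str) -> list[str]:
    """Return candidates of the given kind that start with prefix (case-insensitive)."""
    candidates = REFLECTED.get(kind, [])
    return sorted(c for c in candidates if c.lower().startswith(prefix.lower()))


print(complete("signal_types", "f"))
print(complete("workflow_types", ""))
```

The completion logic itself is trivial. The hard part, as the talk goes on to explain, is obtaining the candidate list at runtime.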

And even if these tools are always updated and redistributed in a timely manner, issues can still arise around versioning. Let’s look at a clear example of this problem. Here we have an example of a v1 of a Temporal Worker and a CLI client. Now, let’s say we deployed an updated version of this Worker, which modifies some Workflow definitions, adds new Workflow and Signal types, and changes some input parameters. The existing CLI tool will not be compatible with the changes; we will need to manually update and redistribute new versions of the tool. But this is not a perfect solution, either. If you have an old version of the Worker running in a different environment, there could be compatibility issues between the new tool and the old Worker. And while this problem may be manageable for narrowly scoped tools with a handful of Worker and Workflow types, it does not scale well to the 1000-plus engineers who may want to interact with any of the thousands of Workflow types here at Datadog.

What we ended up building to address this was a universal Temporal CLI tool, which leverages contextual information to provide contextual help to the user, infer values based on available information, and validate user input, all without being directly tied to the implementation of any specific Worker. Let us look at the user experience improvements that this tool enables.

First, we introduced the concept of a context, which allows users to specify a single namespace on a given cluster with a single parameter. Next, we can infer the Workflow start options based on the Workflow type, using the defaults defined in Protobuf. This means that users no longer need to specify them unless they wish to override the default values. We also found that many users often perform consecutive operations on a single task queue in a single context, so we allow users to persist default values between executions. This simplifies the start input for the same Dog Workflow to this, which is not bad. But we want anyone to be able to create a new dog easily, and crafting a JSON payload in the terminal does not make for a good user experience.

Users will likely need to refer back to the message definitions to figure out the data model, and the process is very prone to error. To address this, you can try to generate flags for all top-level fields in the request. But in this case, it doesn’t really help much. Looking at the Request Definition, we can see why: the parameters we want to set are nested in the Dog type. Our first approach was to try to recursively generate flags. But this does not scale well, as the number of flags can become too large with even a small number of repeated or nested structures, and can even balloon to infinity if we have recursively nested structures. And not all parameters need to be set by the user. Some fields like neediness or happiness can be seeded or set to a default value, while other values here are used to track the state that allows the Workflow to be continued as new, so they probably should not be exposed to users at all. So we created a wrapper Workflow that starts the Dog Workflow as a Child Workflow and only exposes the information that we need from the user, which is the name and, optionally, the appetite and neediness.
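
The scaling problem with recursive flag generation is easy to reproduce in a toy model. The sketch below uses plain dictionaries in place of Protobuf message descriptors. The message shapes, the field names (`parent`, `continued_state`), and the `max_depth` cap are all hypothetical.

```python
# Hypothetical message shapes: field name -> scalar type or nested message name.
MESSAGES = {
    "DogRequest": {"dog": "Dog", "continued_state": "string"},
    "Dog": {"name": "string", "appetite": "int", "parent": "Dog"},  # self-referential
}


def generate_flags(message: str, prefix: str = "", max_depth: int = 2) -> list[str]:
    """Flatten a message into dotted flag names, descending at most max_depth levels."""
    flags = []
    for field_name, field_type in MESSAGES[message].items():
        path = f"{prefix}{field_name}"
        if field_type in MESSAGES:
            # Nested message: recurse only while depth budget remains, otherwise
            # the self-referential "parent" field would recurse forever.
            if max_depth > 0:
                flags.extend(generate_flags(field_type, f"{path}.", max_depth - 1))
        else:
            flags.append(f"--{path}")
    return flags


for depth in range(4):
    print(depth, generate_flags("DogRequest", max_depth=depth))
```

Each extra level adds more flags, and without the depth cap the self-referential field would never terminate. This is one reason a purpose-built wrapper with a small, flat request can be a better fit than exhaustive flag generation.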

In addition to abstracting away unnecessary information, this NewDog Workflow gives us a lot of flexibility to set default values, perform input transformation and validation, or even enrich inputs with data from other sources. For example, if a dog needs a registration number, in this wrapper Workflow we can provision one on behalf of the user and use it to start the Child Workflow. Now, with this wrapper Workflow, we can simplify the command to start the same Dog Workflow to this, with generated flags for name, appetite, and neediness. And if you need to do some debugging, or have a more special use case that’s not covered by this wrapper, you can always start the original target Workflow directly. To give a better sense of how everything fits together, I put together a short demo. The name Inuotchi was inspired by Tamagotchi, which translates to “egg watch”, and so becomes “Dogwatch”. The idea was to model the lifecycle of a virtual pet as a Workflow, with the ability to interact with the pet through Signals and Queries.
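
The responsibilities of the wrapper (defaults, validation, and enrichment) can be sketched as a plain Python function that turns a small user-facing request into the full request for the Child Workflow. This is not Temporal SDK code and not the implementation from the talk: the field names, the 1 to 10 value range, the starting happiness, and the `REG-` registration number format are all assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class NewDogRequest:
    """What the user supplies: a name, plus optional tuning values."""
    name: str
    appetite: Optional[int] = None
    neediness: Optional[int] = None


@dataclass
class DogRequest:
    """The full input for the wrapped Workflow (hypothetical shape)."""
    name: str
    appetite: int
    neediness: int
    happiness: int
    registration_number: str


def build_dog_request(req: NewDogRequest, rng: random.Random) -> DogRequest:
    """Validate the user input, seed unset fields, and enrich the request."""
    name = req.name.strip()
    if not name:
        raise ValueError("name must not be empty")
    for label, value in (("appetite", req.appetite), ("neediness", req.neediness)):
        if value is not None and not 1 <= value <= 10:
            raise ValueError(f"{label} must be between 1 and 10")
    return DogRequest(
        name=name,
        appetite=req.appetite if req.appetite is not None else rng.randint(1, 10),
        neediness=req.neediness if req.neediness is not None else rng.randint(1, 10),
        happiness=10,  # assumed starting value, never exposed to the user
        registration_number=f"REG-{rng.randrange(10**6):06d}",  # provisioned on the user's behalf
    )


print(build_dog_request(NewDogRequest(name="Bits", appetite=3), random.Random()))
```

In a real wrapper Workflow, any randomness or external lookup would need to go through the SDK's deterministic facilities or an Activity. The sketch ignores that concern.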

To start, we will start up a local Temporal dev server and the Inuotchi Worker, and while those start up, we can set a default context and task queue. We can start by inspecting the available Workflow types with workflow types list. This gives us two: the NewDog and Dog Workflows. We can also list Activity types on this Worker. Here we also get two: one that simulates eating food and one that simulates going on a walk. We can also describe a Workflow type to learn more about it. In this case, we can describe the Dog Workflow type.

And this will give us a bunch of information about the request and response types of the Workflow, as well as the Signals to feed and walk the dog, and a Query to visualize the dog. Now we want to create a virtual dog named Bits with the NewDog Workflow. As you’ll see, we get autocompletion suggestions for the Workflow types on a task queue. And then we get some generated flags based on the Workflow type selected.

And this starts up the Workflow. In the trace, you can see that NewDog creates a Dog Workflow for Bits, and that triggers a thirty-second timer. The way that this Dog Workflow works is that every 30 seconds there is a chance your dog may get more hungry or less happy. To visualize this, we can query Bits. Here we can see that we get autocompletion of the Query type for the Workflow ID provided, and we can wrap the whole thing in a watch so we can see changes to our dog’s state over time. Now we want to take Bits out for a walk, and we can do so by sending a Signal. Once again, we get autocompletion of Signal types and also get generated flags based on the Signal type that we specified.

After we send a Signal, you see in the trace that the Signal is received and it starts the walk Activity. After the walk, Bits got hungrier. We can feed the dog with another Signal. This time we also get autocompletions for the valid values of the flag for food type. After sending this Signal, we can see in the trace that the Signal is received and it started the eat food Activity. Now let’s start a new version of the Worker that adds the ability to play fetch with our dog. We needed to add an Activity and a Signal to the Dog Workflow to enable this. With the new Worker started, we can resume our watch of the dog. Without making any changes to the CLI tool, we can see that the new Activity type is reflected when we try to list Activity types. And we’re able to get completions for the new Signal that’s been added as well.

As shown in the CLI demo, the CLI is able to pick up new additions and modifications to a Worker automatically without needing to be updated. For this to be possible, we needed a way to dynamically retrieve the required contextual information at runtime. As mentioned earlier, this information is not available from the Temporal Server. Thus, we needed to look elsewhere for inspiration. Thankfully, we didn’t have to look far, as the gRPC ecosystem, which is also centered on Protobuf, faces the same challenges we just looked at. Yet there exist many universal gRPC clients, like gRPC UI and the Evans CLI. How do they work? Well, under the hood, all these tools rely on gRPC server reflection, which enables us to introspect a gRPC server to answer questions such as: What services are available? What methods does a service support? For a particular method, how do we interact with it? And for a particular message, what are its arguments?

This information is conveyed using Protobuf descriptors, in particular file descriptors, which contain the exact information of a .proto file. The file descriptors can be combined into sets containing all the dependencies for a given service. And because we define our Temporal Workers in Protobuf, we can also leverage reflection with Protobuf descriptors through a reflection Workflow on every Worker. This allows us to determine a Worker’s Workflows, Activities, Signals, Queries, and start options directly, without relying on imported generated code. And because we control Worker registration, we’re able to determine what a Worker’s capabilities are at startup by intercepting the Worker registration calls. To populate the reflection registry, we’re also able to register the reflection Workflow on every Worker without user intervention.
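
The registration-interception idea can be illustrated with a small stand-in class. This sketch records plain names rather than Protobuf descriptors, and it does not use the Temporal SDK. `ReflectingWorker`, its method names, and the `Reflection` Workflow name are hypothetical.

```python
from typing import Callable, Iterable


class ReflectingWorker:
    """Stand-in for a Worker wrapper that records everything registered on it."""

    def __init__(self, task_queue: str) -> None:
        self.task_queue = task_queue
        self._workflows: dict[str, dict] = {}
        self._activities: dict[str, Callable] = {}
        # Registered automatically, so every Worker can describe itself
        # without the service owner having to opt in.
        self.register_workflow("Reflection", self.describe)

    def register_workflow(
        self,
        name: str,
        fn: Callable,
        signals: Iterable[str] = (),
        queries: Iterable[str] = (),
    ) -> None:
        # Interception point: note the capability before handing off to the real SDK.
        self._workflows[name] = {"fn": fn, "signals": list(signals), "queries": list(queries)}

    def register_activity(self, name: str, fn: Callable) -> None:
        self._activities[name] = fn

    def describe(self) -> dict:
        """Return what this Worker can do, as of its current registrations."""
        return {
            "task_queue": self.task_queue,
            "workflows": {
                name: {"signals": w["signals"], "queries": w["queries"]}
                for name, w in sorted(self._workflows.items())
            },
            "activities": sorted(self._activities),
        }


worker = ReflectingWorker("inuotchi-tasks")
worker.register_workflow("Dog", lambda: None, signals=["Feed", "Walk"], queries=["Visualize"])
worker.register_activity("EatFood", lambda: None)
print(worker.describe())
```

Because the description is built from the registrations themselves, deploying a Worker version with a new Signal or Activity changes the output with no change to the client, which is the behavior shown in the demo.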

But there are still more ergonomic challenges that our users face with our tooling. The first centers around information density. Here’s the earlier example: describing the Dog Workflow presents us with a pretty dense wall of text. There’s simply a practical limit to how much information you can present in a terminal while maintaining readability. The second relates to complex input. While the CLI and dynamic flag generation work great for interactions with simple Workflows, there are limits to usability around complex Workflows.

For example, if we update the Dog definition to be more comprehensive, with information such as breed, vaccinations, ownership, tricks they can perform, and more, creating a JSON payload can be quite a challenge with all the repeated and nested structures. A UI is much better suited for managing this type of complexity, and will allow for more interactive experiences, such as a form builder for crafting complex input payloads. We could also possibly aggregate all the Workflow reflection information into a catalog to provide better visibility and discovery for users. Unfortunately, there are no existing protoreflect-style libraries in JavaScript that would allow us to leverage our protoreflect-based implementation of reflection directly.

So our solution was to convert these Protobuf descriptors to JSON Schema. You can see some of this included in the describe output from earlier. This can be used in conjunction with tools and libraries in the JSON Schema ecosystem to create our form builder for an interactive UI experience for interacting with Workflows. We’re also interested in extending these capabilities to the open source community. While our implementation of Worker reflection is centered around Protobuf descriptors, because our internally developed Temporal development platform is centered around Protobuf, the core idea of Worker reflection, which is simply the ability for Workers to describe their capabilities, is generally applicable to Temporal. Protobuf descriptors are simply a convenient interchange format to do so. For non-Protobuf-based systems, another format like JSON Schema can be used to convey the same information. And thank you for listening.
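
As a rough illustration of the descriptor-to-JSON-Schema conversion mentioned above, the following Python sketch maps a simplified, dictionary-based message description onto a JSON Schema object. It is a toy model under stated assumptions: the `MESSAGES` layout, the `required` marker, and the type mapping are invented for the example, and real Protobuf descriptors carry far more detail. Self-referential messages are deliberately not handled; they would need `$ref`-based definitions.

```python
# Hypothetical mapping from Protobuf scalar type names to JSON Schema types.
SCALARS = {"string": "string", "int32": "integer", "bool": "boolean", "double": "number"}

# Hypothetical, simplified stand-in for message descriptors.
MESSAGES = {
    "NewDogRequest": [
        {"name": "name", "type": "string", "required": True},
        {"name": "appetite", "type": "int32"},
        {"name": "tricks", "type": "Trick", "repeated": True},
    ],
    "Trick": [
        {"name": "name", "type": "string", "required": True},
    ],
}


def to_json_schema(message: str) -> dict:
    """Convert one message description into a JSON Schema object definition."""
    properties = {}
    required = []
    for field in MESSAGES[message]:
        if field["type"] in SCALARS:
            schema = {"type": SCALARS[field["type"]]}
        else:
            schema = to_json_schema(field["type"])  # nested message, inlined
        if field.get("repeated"):
            schema = {"type": "array", "items": schema}
        properties[field["name"]] = schema
        if field.get("required"):
            required.append(field["name"])
    result = {"type": "object", "properties": properties}
    if required:
        result["required"] = required
    return result


print(to_json_schema("NewDogRequest"))
```

A schema in this shape can be handed to off-the-shelf JSON Schema form generators, which is what makes it attractive as a bridge to a browser-based UI.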