(Part of our Spooky Stories series; there's also a video version available.)

Meet Daniel Abraham, Founding Engineer at autokitteh, with a career history that has him standing at the crossroads of backend services, infrastructure, and developer productivity. Through his stints at DataDog and Google, he's accumulated battle-worn insights into managing deployment complexity—a topic both terrifying and thrilling for those brave enough to navigate the depths.

In contrast to some of our other Spooky Stories where our theme has been things you should avoid doing in Temporal, this talk is about a common engineering nightmare that has nothing specifically to do with Temporal. However, as you will see, using Temporal will allow you to deal with this complexity much more easily, providing useful building blocks based on hard-won experience.

In a similar vein to Maxim Fateev’s Designing a Workflow Engine from First Principles talk, let us descend together through the 7 Levels of Deployment Hell.

Level 0: Building a Single Repository

Let’s start with a use case which many of you can probably recognize as very simple and trivial: I have a repo in GitHub I want to build.

Here is an example from GitHub’s own website on how to build and deploy a simple Go application:

name: Go
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.21.x'
      - name: Install dependencies
        run: go get .
      - name: Build
        run: go build -v ./...
      - name: Test with the Go CLI
        run: go test

As you can see, it's a simple YAML file that contains a few lines of code, and you can actually read it. You check out the code, you set up Go, you build it, you test it, and you're done!

Level 1: Deploying a Single Repository

Now, if you want to deploy that repo somewhere, such as a Cloud provider, there are numerous starter workflow deployment examples available. Generally speaking, you’re no longer doing ~10 lines of YAML for this; now it’s more like ~60 lines of YAML, requiring another ~30 lines of explanation to describe the file and how to configure it.

So while there’s a little bit more trickery, once you know what you’re doing, you’re golden. So we can end this talk now and give everyone 50 minutes back, right?

…Not so fast! ;-)

Once you take this simple and trivial example out into the real world, especially for large companies, it gets much more complicated, very quickly, and starts to look like this:

(Image: the classic “How to Draw an Owl” meme.)

Level 2: Deploying Multiple Repositories

Let’s start with just a taste of this complexity: instead of building a single repository, let's say we have multiple repositories in a single organization. For example, one for the UI, one for the backend, maybe a few different backend services, and they all interact with some common libraries.

You can no longer build a single repo and be done with it. Now you have to build multiple repos, and in order to do that, you end up with what’s known as a DAG (Directed Acyclic Graph), where you have multiple starting nodes because some of these repos can be built concurrently; they don’t interact with each other.

You don't want to build them sequentially—that would be too slow—but also, some of them are dependencies for some other repo, so you have to join some of these builds as inputs for another build at a higher level, and ultimately you end up with some big binary based on builds from multiple repos.

Now consider building for multiple operating systems or hardware architectures: take all of this complexity and multiply it several times over.


You ask yourself: Can’t I just build one thing after another sequentially? While you can, with enough dependencies, this could potentially take hours to build a single binary, so this approach is a non-starter from a developer experience perspective.
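To make the shape of the problem concrete, here is a minimal fan-out/fan-in sketch in plain Go (outside of any CI system), using the errgroup package. The repository names and the make targets are purely illustrative:

package build

import (
	"context"
	"os/exec"

	"golang.org/x/sync/errgroup"
)

// buildRepo is a stand-in for "clone and build this repository".
func buildRepo(ctx context.Context, repo string) error {
	cmd := exec.CommandContext(ctx, "make", "-C", repo, "build")
	return cmd.Run()
}

// BuildAll sketches the DAG idea: independent repos build concurrently (fan-out),
// and the final binary waits for all of them to finish (fan-in).
func BuildAll(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)

	// Fan-out: these repos don't depend on each other, so build them in parallel.
	for _, repo := range []string{"ui", "auth-service", "billing-service", "common-libs"} {
		repo := repo
		g.Go(func() error { return buildRepo(ctx, repo) })
	}

	// Fan-in: only once every dependency has built do we assemble the final binary.
	if err := g.Wait(); err != nil {
		return err // one failed build fails the whole graph
	}
	return buildRepo(ctx, "main-binary")
}

A real build system also needs caching, throttling, and provenance tracking on top of this, which is exactly where the complexity starts to pile up.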

The GitHub Actions approach we saw in the initial example for a single repo build still “sort of” works. But, imagine you have to run a few things in parallel. Considerations include:

  • How do you synchronize workflows in different repositories? (While there are triggers you can use, they relate to the beginning of something or progress within it, not necessarily the completion.)
  • How do you support fan-out and fan-in workflows that spawn multiple workflows in multiple repositories? (Repositories don’t talk to each other!)
  • How do you differentiate between the success of a build and failure of a build? (Not just the fact that it’s started and has had some progress, or is finished.)

Taken together, all of these things mean that if you want to solve these problems in a really robust way, you need to create a system for managing this. A system that requires software developers getting together as a team, with a charter, calling themselves “Developer Excellence” maybe, and undertaking a huge effort.

And even if you think of this effort as only two people, or timeboxed to three months—which is optimistic, but possible, depending on the size of your builds—even that is a lot of time when you sit down and calculate it.

You also need to factor in considerations for this system:

  • The system needs to be as fast as possible
  • But, it also can’t waste resources infinitely; you need to throttle it
  • But this throttling has to be subtle, in a way that isn’t perceivable by developers, in order to maintain velocity
  • You also need to maintain artifacts and provenance, like “this build depended on that build, and that source code, and this version”

If you think about designing the UI for all of this, the database for all of this… it quickly becomes a nightmare in itself.

And remember, we haven’t even gotten to a lot of the hard stuff yet. ;-)

Level 3: Blue Green Deployments

(Image: blue-green deployment model. Source: redhat.com)

For people not familiar with the concept of blue green deployments, it means that instead of having one set of infrastructure components that you update whenever you do a release, you have two mirror environments: one of them is running the production code, and you have another backup environment. (Resulting in the famous adage: why pay for one system when you can pay twice for two systems? ;-))

The advantage of this approach is that when you deploy new code, you can release it to the backup environment and then route users to the new environment gradually as you make sure everything’s working properly. This ensures the product is always up and responsive.

A deployment in this scenario is a procedure: you start in some state, deploy and install things in a specific place, configure the router to switch to that place, and finally wait anywhere from a few moments to a few hours before the procedure is complete.

This type of procedure is exactly what a Temporal Workflow can do. If you capture your deployment logic in a Temporal Workflow and Activities, you get all sorts of advantages for free. For instance, the deployment of a specific node could be an Activity. If that specific node fails, the Activity can be retried. And if something more fundamental fails, the whole Workflow can fail; that’s very easy to implement, as opposed to needing an entire “system” that does this.

To be fair, Cloud providers usually do provide some kind of blue green deployment options, but if you stray just a little bit off the trodden path, you find that what you need to do is impractical and sometimes impossible, unless again you have a “system” for it.

So now we have a system of Builds, and we’re extending it to be a system of Builds and Deployments. This is a lot of work. Unless you can use Temporal, and then all of this is basically just two Workflows, one to handle Builds and one to handle Deployments, and you’re done! All of the configuration information, such as where it’s being deployed and where it’s being built, is just input parameters for that Workflow. The different repos in a build can be represented by different Activities or Child Workflows.
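As a rough illustration, here is what the deployment half could look like as a Temporal Workflow in Go. The Activity names (DeployToEnv, SwitchRouter, CheckHealth, RollbackRouter) and the input fields are assumptions for this sketch, not a prescribed design:

package deploy

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// BlueGreenInput describes one release; field names are illustrative.
type BlueGreenInput struct {
	Version  string
	IdleEnv  string        // the environment NOT currently serving traffic
	BakeTime time.Duration // how long to watch before declaring success
}

// BlueGreenDeploy sketches the blue green procedure as a single Temporal Workflow.
// The Activities are referenced by (hypothetical) name; you would implement them
// against your own infrastructure and register them on a Worker.
func BlueGreenDeploy(ctx workflow.Context, in BlueGreenInput) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute, // each Activity is retried per its retry policy
	})

	// Install the new version on the idle environment.
	if err := workflow.ExecuteActivity(ctx, "DeployToEnv", in.IdleEnv, in.Version).Get(ctx, nil); err != nil {
		return err
	}

	// Point the router at the freshly deployed environment.
	if err := workflow.ExecuteActivity(ctx, "SwitchRouter", in.IdleEnv).Get(ctx, nil); err != nil {
		return err
	}

	// Durable wait: minutes or hours, and it survives Worker or server restarts.
	if err := workflow.Sleep(ctx, in.BakeTime); err != nil {
		return err
	}

	// Verify health; if the new environment looks bad, send traffic back.
	var healthy bool
	if err := workflow.ExecuteActivity(ctx, "CheckHealth", in.IdleEnv).Get(ctx, &healthy); err != nil || !healthy {
		return workflow.ExecuteActivity(ctx, "RollbackRouter").Get(ctx, nil)
	}
	return nil
}

The build Workflow would look similar, with each repo build as an Activity or Child Workflow instead of a deployment step.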

Level 4: Manual Approval Steps


I mentioned before that we can wait during deployments, for example while monitoring the results of a blue green deployment to determine whether the old deployment is still needed, in case we want to roll back.

A common way to do that is to have manual approval steps. You can have a person such as an Ops Engineer, or a Site Reliability Engineer, signing off on some checklist of conditions and saying “Ok, now we can declare this deployment done and move on to the next step.” (Or, alternately, “Now we can start the deployment, because I’m sure that the code is correct.”) This sort of manual approval process can also sometimes be a legal requirement.

So at the end of the day, you have a big red button and you need to press it in order to get it to do something.

If you’ve ever tried to create a system that allows for manual approval with people, you know that you will have:

  • A UI for that, because God Forbid you let people do something in a non-cool UI. ;-) (even if it’s internal)
  • A database, to keep the state of their choices and map it to all the things that they approve
  • A history of actions taken that can be audited

That’s quite a lot for something that ought to be a “simple” system.

How would you do this in Temporal?

Temporal doesn’t have a built-in fancy UI for manual approvals, but it turns out you don’t need one. For example, you can use Slack as your medium for approvals. People can send a Slack command or approve it as a message in some kind of channel, and all you have to do is have a Slack worker built in Temporal that receives events from your workspace and allows the code to send interactive messages.

So the Workflow we discussed earlier, the one that handles the deployment, can start with an Activity that sends a message to the relevant channel or person, asking “Do you approve this? Is it ok?”

And that’s about it! Because once they click that button or reply to that message, the Workflow gets that response as a Signal (an Event) and continues.

What happens if our Temporal server crashes during that time? Nothing. It will keep that knowledge. What happens if anything else crashes in the middle and comes back up? Nothing. We keep that State in Temporal.

The Workflow lives on practically forever (or however you configure it) waiting for people to make the choice and knowing how to map that choice to the right destination.
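Here is a minimal sketch of that approval gate in Go; the “RequestSlackApproval” Activity, the “approval” Signal name, and the one-week timeout are all hypothetical choices for illustration:

package deploy

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// ApprovalGate sketches the "big red button": ask a human over Slack, then block
// on a Signal. The waiting state is durable, so crashes on either side lose nothing.
func ApprovalGate(ctx workflow.Context, deploymentID string) (bool, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	// Ask for sign-off (e.g. an interactive Slack message with Approve/Reject buttons).
	if err := workflow.ExecuteActivity(ctx, "RequestSlackApproval", deploymentID).Get(ctx, nil); err != nil {
		return false, err
	}

	// Durably wait for the human's answer.
	approved := false
	sigCh := workflow.GetSignalChannel(ctx, "approval")
	sel := workflow.NewSelector(ctx)
	sel.AddReceive(sigCh, func(c workflow.ReceiveChannel, more bool) {
		c.Receive(ctx, &approved)
	})
	// Treat "no answer within a week" as a rejection.
	sel.AddFuture(workflow.NewTimer(ctx, 7*24*time.Hour), func(f workflow.Future) {})
	sel.Select(ctx)

	return approved, nil
}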

Level 5: Cloud Regions and Availability Zones

Now, let’s say your company is really successful, so much so that it’s becoming a “victim” of its own success. And now you have to deploy that product not just on that laptop that’s hidden away somewhere in a closet, but rather in a real data center.

People who have dealt with this for real, such as DevOps Engineers, know that this is not something for the meek. There’s a lot of support, a lot of help, a lot of examples on how to do things… but this is where complexity really rears its ugly head.


Here is a map of AWS regions across the world. Google and Azure have something similar. You have multiple regions per continent (some in the US, some in Europe, some in Asia) and within each region, there are multiple availability zones.

Let’s focus on availability zones first. We’re not that successful yet; we’re just working in the “US East” region, as a common example. But, you have to deploy your code in multiple availability zones in that region, because if something happens to that physical data center—a natural disaster, a network outage, a fire in the building, etc.—you want the other two data centers in its vicinity to continue working, load balancing the traffic between them.

Therefore, you always want to deploy your code concurrently to multiple availability zones. Cloud providers have some solutions for that, but stray a little and once again you have to start deploying things on your own.

This might mean you need to manage deploying your own Kubernetes nodes to multiple availability zones in a region. You have to make sure the deployments happen not just concurrently but in a very synchronous way. You want to make sure that as much as possible the versions are the same in the same region. Maybe there’s a backward compatibility bug where the UI for the latest version changes something that breaks with the older infrastructure. You want to make sure that this risk is averted as much as possible.
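A rough sketch of that per-region fan-out with Temporal’s Go SDK: one Child Workflow per availability zone, all started concurrently, and the region only succeeds when every zone does. The “DeployToAZ” Child Workflow and the zone naming are assumptions:

package deploy

import (
	"go.temporal.io/sdk/workflow"
)

// DeployRegion fans out one Child Workflow per AZ and waits for all of them,
// so the region either fully converges on the new version or fails as a unit.
func DeployRegion(ctx workflow.Context, region, version string) error {
	zones := []string{region + "a", region + "b", region + "c"} // illustrative

	futures := make([]workflow.ChildWorkflowFuture, 0, len(zones))
	for _, az := range zones {
		cwo := workflow.ChildWorkflowOptions{WorkflowID: "deploy-" + version + "-" + az}
		childCtx := workflow.WithChildOptions(ctx, cwo)
		futures = append(futures, workflow.ExecuteChildWorkflow(childCtx, "DeployToAZ", az, version))
	}

	// Fan-in: all AZs must succeed before we declare the region deployed.
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}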

Level 6: Rolling Deployments

(Image: tsunami woodblock print by Hokusai, 19th century.)

Now, what happens with regions? Your business is now even more successful, and you have customers in the US, in Europe, and in Asia. Some of them have requirements related to data residency, where they want to make sure their data doesn’t leave the borders of their country.

Initially, it’s easy to think of this naively: “I’ve seen this problem before with Availability Zones, so I’ll just deploy everything at the same time! Right?”

Enter the concept of rolling deployments, another tactic for minimizing risk. This is a deployment that goes to US East first, then US West, then EU North, then EU East, and so on. If one of them fails, you pause, or maybe you roll back. But you don’t deploy everything everywhere. Because when you have an outage, or bad code, or bugs, you limit the exposure of that bug by deploying it as a sequence of deployments rather than all at once.

In Temporal, this now evolves to a “main” workflow that controls subordinate Child Workflows. Once you know how to do a deployment in a given region, you can sequence deployments in a series. If one of them fails, you fail the entire Workflow for the entire world. The Saga Pattern is appropriate to use here, since you don’t want to merely abort a Workflow if it errors out; you also need mitigations to undo those deployments and return the system to its most recent working state.
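A minimal sketch of such a main Workflow, assuming the per-region deployment Workflow from the previous level and a hypothetical “RollbackRegion” compensating Activity:

package deploy

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// RollingDeploy drives the rollout one region at a time. If any region fails,
// it compensates Saga-style by rolling back the regions already deployed.
func RollingDeploy(ctx workflow.Context, version string, regions []string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
	})

	var deployed []string // regions that already took the new version

	for _, region := range regions {
		cwo := workflow.ChildWorkflowOptions{WorkflowID: "deploy-" + version + "-" + region}
		childCtx := workflow.WithChildOptions(ctx, cwo)

		err := workflow.ExecuteChildWorkflow(childCtx, "DeployRegion", region, version).Get(ctx, nil)
		if err != nil {
			// Saga compensation: undo in reverse order, most recent region first.
			for i := len(deployed) - 1; i >= 0; i-- {
				_ = workflow.ExecuteActivity(ctx, "RollbackRegion", deployed[i]).Get(ctx, nil)
			}
			return err
		}
		deployed = append(deployed, region)
	}
	return nil
}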

In DIY land, we’re now adding another “system” for Rolling Deployments that needs to be maintained. This system would need to show the progression of deployments as they make their way around the globe, and also provide insight if something broke, so you can see which deployments are the latest and which rolled back to their previous state. You also probably need another system to analyze logs of various components to find out what went wrong and how to fix it.

Meanwhile, on the Temporal side, we are now up to three workflows. And we get deployment insight and error inspection for free, as the Temporal UI has the full history of all workflow execution.

Side note: Companies who are able to afford this type of development of internal systems spend absurd amounts of money and time on them because they are necessary to be able to develop quickly. These systems are also generally seen as less valuable and important than the “real” customer-facing product, and so it becomes one uphill battle after another to not only build the internal systems in the first place, but also maintain them over time, let alone redesign them for lessons learned. The amount of time and money being poured into internal systems is staggering, and has an entirely non-trivial impact on a company’s bottom line.

Level 7: Canaries


(This is a concept that might not be as familiar to junior engineers.)

What are canaries? How do they relate to deployments? How are they being implemented? And how does this relate to Temporal?

The term comes from the “canary in a coal mine”: when there was a risk of suffocation, miners brought a caged canary down with them. Since a canary is a small bird, it would suffer the consequences first, so if you saw the canary in distress, you knew it was time to leave the mine.

As a metaphor in software engineering, a canary means that we’re deploying something in a minimal way and want to see what happens with it. Is it good? Is it bad? Is it incompatible with the previous version, so we need a more careful deployment? Does it have extra errors that we didn’t have before? Does it display an unexpected issue to users? Or, it might have a cool feature people were waiting for, and they’re happy to experiment with it once they see it.

In order to implement canaries, there’s not really a good system that I’m familiar with. You always develop this internally, in house, because you need to manage that kind of experimentation. It’s not just the building blocks of “I’ll deploy XYZ” but rather “I’ll deploy them but I also need to monitor them” to understand what’s happening with them.

This also applies to the related term A/B Testing, where you deploy two versions of the same thing, such as the front-end, and each version looks a little different. You measure the engagement of both UIs to determine which one gets better engagement and better feedback, and you choose that one. This is also a type of canary.

Why do you need to handle canaries as special kinds of deployments? Because, you usually need to define the terms of these experiments in a very detailed way, such as exposing a specific percentage of users to a given version, or a specific group (segment) of users, or a random sampling of users.

You have to be careful about the percentage because you want to minimize risk. So you will start small, with only 1-5% of users seeing something new. (How do you tell that’s happening in practice? You guessed it! Another system, this one an “Experimentation Framework” which allows you to define experiment variables and look at them like a lab experiment and measure the results). How do you measure an experiment’s effectiveness? You can use a monitoring system, such as DataDog or Prometheus, and look at metrics that you’ve defined, see them in a dashboard. And based on what you see, you decide to move forward with the experiment or not.

All of this is very manual, and if you manage to automate it, you need yet another system! And that system needs to implement the API to talk to your metrics systems that are measuring your experiments’ effectiveness, and that’s yet another layer of complexity.

While it’s not fair to say that a simple Workflow or a couple of Workflows could capture the complexity of canaries, there are many building blocks in Temporal that make this easier: for example, allowing for gradual changes, or undoing a previous step if it failed (once again drawing on the Saga pattern). There are other helpers as well, such as talking to your metrics API: you can add that API into a Worker and it basically becomes a retrieval Activity.
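For illustration, here is what a gradual canary could look like as a Temporal Workflow in Go. The Activities (“SetTrafficPercent”, “QueryErrorRate”, “RollbackCanary”), the percentages, and the error-rate threshold are all hypothetical:

package deploy

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// CanaryRollout shifts a small slice of traffic to the new version, waits, reads
// the metrics backend (DataDog, Prometheus, ...) through an Activity, and either
// widens the exposure or rolls back.
func CanaryRollout(ctx workflow.Context, version string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
	})

	for _, percent := range []int{1, 5, 25, 100} {
		// Expose `percent` of users to the new version.
		if err := workflow.ExecuteActivity(ctx, "SetTrafficPercent", version, percent).Get(ctx, nil); err != nil {
			return err
		}

		// Let the experiment run, durably, before judging it.
		if err := workflow.Sleep(ctx, time.Hour); err != nil {
			return err
		}

		// The metrics API wrapped in a Worker is just a retrieval Activity.
		var errorRate float64
		if err := workflow.ExecuteActivity(ctx, "QueryErrorRate", version).Get(ctx, &errorRate); err != nil {
			return err
		}
		if errorRate > 0.01 { // illustrative threshold
			return workflow.ExecuteActivity(ctx, "RollbackCanary", version).Get(ctx, nil)
		}
	}
	return nil
}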

Wrap-Up: Bringing it Back to Temporal

So how does all of this relate to Temporal?

Temporal provides infrastructure that allows you to just plug in your own code and logic and see it run robustly and reliably for you… this is a huge step up that a lot of companies usually don’t have. Temporal helps to democratize this kind of technology and make it more widely available.

As we saw throughout, what requires numerous fully-fledged (and fully-staffed) internal “systems” to accomplish can be done with adjustments to just a handful of Workflows in Temporal. And having managed Developer Excellence and Infrastructure teams at DataDog and Google before, I can say the quality of the two approaches is comparable; in some ways Temporal is better than systems I’ve used at previous companies.

Temporal also supports the notion of composability. Workflows can be superimposed on top of other Workflows, and now you have a composition of logic rather than interconnection. Composition is almost always easier and safer than connecting systems together.

Temporal components support the concept of blackboxing, where responsibility for one or more Workers can be assigned to specific developers. (For example, this developer is in charge of the workflow that handles blue green deployments and that developer is in charge of the rolling deployments.) While it’s expected these developers will talk to each other, they don’t actually need to synchronize anything between them as long as the contracts are clear between their workers. They don’t need to care how all the nitty-gritty is implemented under the hood, as long as the specified inputs and outputs match expectations.

Temporal really is an ecosystem. I’ve been to the last two Replay conventions, and you can see the change. It’s not just developers scoping it out, you also have a whole ecosystem of service providers, people who know how to use Temporal and are ready, willing, and able to help companies integrate it into their systems. (Temporal themselves are working in this way of thinking with Nexus.)