This transcript has been edited for clarity and formatting. You can follow along with a recording of the original content at Replay 2023 in the video above.
Anthony Davis: Hi, my name is Anthony, and with me is Chris. We’re senior engineers in the HashiCorp infrastructure platform organization. Today we’re going to be sharing a little bit about our journey of adopting Temporal at HashiCorp. We’ll explain how we got started with Temporal and what things look like today, provide an overview of four of the notable challenges that we hit during our journey, and discuss some of the specifics of how we solved them.
Despite some of the challenges, I want to preface this by saying our journey has been spectacularly painless for adopting a new technology, thanks to this entire community, and the lessons that were shared at last year’s Replay. So if our talk seems like it’s actually three talks in a trench coat masquerading as one talk, it’s because we just really wanted to give back as much as we could of the lessons that we’ve learned. We’d really like to give a shout out to Jacob (LeGrone) for last year’s talk. Temporal @ Datadog was a huge influence on our design decisions, and it definitely helped us avoid some of what would have probably been early stumbling blocks. So thanks, Jacob. And finally, we’ll close by giving you an end-to-end example of everything put together. So with that agenda, let’s dive right in.
The first part of our story is about Terraform Cloud. It’s a story about blast radius, how it impacted us as an infrastructure organization, and how we dealt with it by embracing cellular architecture. We’ll explain how Temporal was instrumental in enabling this architectural change, and then we’ll discuss how we use Temporal today.
Since the beginning, we’ve run most of Terraform Cloud (everything except for the asynchronous bits, like Terraform runs) on Nomad, using one big cluster per environment. We manage the provisioning and configuration of these Nomad clusters using Terraform, and use a custom CLI to safely roll the clusters during deployments. After we released Terraform Cloud agents two years ago, we decided to refactor our asynchronous bits to also use Terraform Cloud agents, and to run those on Nomad as well. While we were fairly confident we could scale up our existing Nomad clusters, we took a different approach, and instead decided to run these agents on a second hardened cluster that was designed specifically for what is essentially remote code execution as a service. And this worked great, actually, until it didn’t. We soon found that the highly variable, bursty nature of customer workloads on this second cluster posed some unique challenges for us, and impacted shared resources on the cluster.
Chris Ludden: So yeah, at that point, we did what anyone would do, right, and we split that second cluster into a bunch of smaller clusters. And we utilized a form of cellular architecture known as “shuffle sharding,” to schedule customer workloads across this pool of smaller clusters. With shuffle sharding, what we do is we assign each customer a random but deterministic subset of clusters from the larger pool. And kind of where the magic comes in is that the odds that any two customers share the exact same virtual shard assignment is exceedingly small. And that helps us reduce the likelihood that any one customer would impact another.
To give a quick example, imagine this scenario with two customers, each with a virtual shard assignment of four independent Nomad clusters. And in this particular example, the two virtual shards overlap by a single cluster. Now, imagine that the second customer causes an issue that results in a cluster degradation or complete outage. This can happen for a variety of reasons. Sometimes it’s malicious workloads. Sometimes it’s pathological workloads that result in thundering herds, and things like that. In this situation, the second customer’s workloads are going to be rescheduled onto the next available cluster in their shard, where the same issue is probably going to happen again. And as you can imagine, this pattern will continue until the second customer has exhausted all of the clusters in their virtual shard, resulting in a complete outage for that customer. The awesomeness of shuffle sharding is that customer one is redirected to the next available cluster in their virtual shard, which was not impacted by this issue with customer two. That magic is what allows us to offer a single-tenant-like experience on top of multi-tenant infrastructure, and it has helped us reduce the blast radius for a large class of issues. It ultimately has helped us sleep better at night as an infrastructure team.
On the other hand, it did give us some new problems, like: how the heck do we manage all of these Nomad clusters? As we continued to expand the number of clusters in our fleet, we found some gaps in our management tooling that, while they used to be rare, started to become more and more common as we scaled both the number of clusters in our fleet and the frequency of cluster deployments. That forced us to rethink our approach to managing these clusters, so we worked with our friends on the Nomad team, who were awesome. They helped us identify where the gaps were in our current process, and helped us devise a new process for rolling clusters that largely mirrors the official product documentation for Nomad, but includes some additional safety mechanisms for our particular use case.
Then we started looking for a tool or technology that could help meet the requirements of this new process. We started with an early POC of AWS Step Functions. We really liked the integration with AWS, but we found the DSL to be lacking in terms of its readability, testability, and maintainability. No offense to any AWS folks in the house. Next, we considered, I think the colloquial term is “the workflow engine that shall not be named.” We do have pretty deep expertise with this other tool at HashiCorp. But when we looked into it, we found that we would need to make a significant investment in order to repurpose the existing implementation of that tool for our use case. So we took a step back and said: if we’re going to have to invest time and resources into running a solution like this ourselves, we’d like to find an alternative that gives us a path to a managed offering in the future. At the time, Temporal Cloud was still in beta, but we felt pretty confident that it provided a roadmap for us to migrate workloads in the future. So, ultimately, we landed on Temporal.
Anthony Davis: Our first foray into using Temporal was a single Temporal service for managing our growing fleet of Nomad clusters. And immediately, we found it to be an exceedingly good fit for that use case. It really only took a couple of days to implement the first draft of the management policies as Temporal Workflows, given the huge head start provided by the Go samples repo, Temporalite running locally, and lots of other great community resources. So we felt productive pretty quickly. And we also found translating our management policies into Temporal Workflows to be extremely natural.
We also had some great early learnings. For example, in our previous world, automation bugs would often just be totally unrecoverable. But with Temporal, we found that whenever a Workflow was stuck on some unexpected case, or maybe just buggy logic that was totally our fault, usually we could just push a fix in real time. There would be some latency, you know, retries happening, but generally we could fix things in flight without losing any work. When we realized this, it felt like we had just gained a superpower. Historically, it was very stressful to worry about automation failing, and now it’s like, well, we’ll get around to it, it’s fine. We also found that Temporal gave us an entirely new set of tools for dealing with unexpected state edge cases, which only increased as we scaled up. We now feel we have the tools to identify these things and work them into the Workflow logic, and we’ve been gaining confidence in the automation that we run ever since.
As we continue to embrace cellular architecture, we were also very cognizant of the potentially large impact it could have on our internal customers. It was important to us to not force engineering teams to have to reason about our growing and dynamically changing compute topology during their routine infrastructure interactions. And we also recognize that we would kind of lose all of the excellent resiliency properties of cellular architecture if we were just to change all cells simultaneously. So at this point, we realized, “Okay, we have a new Workflow problem.” And again, we turned to Temporal to help us build a compute API to allow our customers to manage their workloads across the many Nomad clusters in our growing fleet. So this was the first time that a Temporal service had been exposed to anyone outside of the infrastructure organization. And it ended up being a great opportunity for us to revisit some of the early decisions with Temporal for a new use case, which was essentially offering self-service infrastructure for our internal customers. The implementation and capability of that new use case ended up quickly exceeding our expectations. And from those first two Temporal services, we were soon joined by several others to form what is now pretty much the beginnings of our next generation infrastructure platform, which is powered by Temporal.
Chris Ludden: I’d like to switch gears and get into the meat of our talk, which is sharing some of the scars, lessons learned, and challenges that we faced on our journey from a POC with Nomad all the way to building an internal infrastructure platform with Temporal. In particular, we’ll dig into how we simplify onboarding and adoption of Temporal by extending Temporal’s official SDKs with our own internal SDK; how we reduce the opportunities for making mistakes with Temporal by embracing code generation with protocol buffers; how we provide granular authorization via policy-based access control for securing our Temporal APIs; and last, but not least, we’ll talk about how we stopped reinventing the wheel by embracing Terraform and reusable Temporal Workflows and Activities.
The very first challenge we ran into, as Temporal usage was starting to grow organically, was: how do we make it easy to onboard new teams and new services to Temporal? In order to truly appreciate this challenge from our perspective, it’s helpful if I share a little bit about what a production-ready Temporal service looks like at HashiCorp. Today, we run multiple Temporal clusters across several different logical environments and geographic regions. In order to provide high availability and disaster recovery capabilities to the services that depend on Temporal, we rely on global namespaces and multi-cluster replication to ensure that every service deployment’s namespace is replicated to at least one primary and one secondary zone in different geographic regions. We also leverage Vault’s PKI secrets engine to provide short-lived, SPIFFE-flavored TLS credentials that provide mutual authentication and encryption of the underlying TCP connections between Temporal clusters, Clients, and Workers.
We rely heavily on Temporal’s remote Codec Server integration to provide end-to-end encryption of Activity and Workflow payloads, because a lot of our Workflows require highly sensitive and privileged credentials for interacting with production systems. And last, but not least, we took Jacob’s advice from last year and decided to have services expose their APIs directly as Temporal Workflows, Queries, Signals, and Updates. We found this to be incredibly powerful in practice. In order to ease the learning curve and reduce the risk of misconfigurations, we decided to offer an internal Go SDK that provides types, utilities, and methods that make it easy and secure to integrate with our globally distributed Temporal architecture. It’s used by both our Clients and Workers for several important capabilities: basic initialization of Temporal Clients with all the necessary configuration for credential management; header providers; context propagators; payload codecs and converters; Client and Worker interceptors for common validation tasks; and, most importantly, cluster discovery to locate the active cluster for a particular service namespace, with automatic failover in the event of a cluster outage. It also provides some helpful utilities for authorization, unit and integration testing, and Workflow replay. This has helped simplify some of the more tedious parts of securing and testing Temporal Workflows.
One thing I really do want to call out here, and I think this differs from what a lot of folks in the community have advocated for, is: we deliberately chose not to hide or wrap the Temporal SDKs. Instead, we continue to make them available for our more intermediate or advanced users. But we do keep it out of the way as much as possible for those who are just getting their feet wet.
Anthony Davis: Yes, we do want the Temporal SDK to be available. But we were really inspired by the talks last year that advocated for code generation. At HashiCorp, we love Go. And while Temporal does make things easier, we found that working with the official Go SDK was sometimes repetitive, and sometimes left opportunities for developers to trip up. Some of these trip-ups were trivial, like forgetting to register a Workflow or an Activity before trying to call it, or not knowing the task queue name and having to figure it out. But there are other problems that you may not notice until you get to production, like having the wrong ID reuse policy or the wrong retry policy. And that’s not theoretical. Probably the most embarrassing incident for our team internally was caused by just a really tiny retry interval generating a lot of work. I always do that: the wrong decimal place.
Most of all, we really missed static typing and all the benefits that come with it. No offense to anyone here who’s not into that. We believed there were enough footguns here that code generation could really pay off for us. So to address these shortcomings, and to make it easier for us to onboard other people, both Workflow authors and Workflow consumers, we’ve implemented a fully open source protoc plugin that generates Temporal Clients and Workers in Go from Protobuf schemas.
And we’re going to be showing you a little bit about how it works today. We chose protocol buffers for many reasons, including the ability to leverage the vibrant ecosystem, our extensive use of them internally at HashiCorp, the excellent tooling that we have built up around them, and their existing support within Temporal, for example in the default data converter. We were also able to lean heavily on prior art in our initial iterations, thanks to having stumbled upon the very inspiring temporal-sdk-go-advanced project created by Temporal’s own Chad Retz. I don’t know if Chad’s in the room, but thank you. Thank you, Chad. Our plugin allows Workflow authors to configure sensible defaults and guardrails, simplifies the implementation and testing of Temporal Workers, and streamlines integrations by providing Client SDKs and even an optionally generated CLI interface. It works by having Workflow authors annotate their proto service schemas with options for Temporal primitives: Workflows, Activities, Queries, Signals, and so on. These annotations can include default timeouts, default Workflow and Update IDs (or ID patterns for generating Workflow IDs), search attributes, policies for ID reuse, et cetera. In other words, many of the Activity and Workflow options that you may already be familiar with.
We liked this approach because we feel it keeps what is really critical logic in the hands of the service author, rather than leaving it up to the consumer to figure out the right values. From the schema, the plugin generates statically typed helpers for implementing and registering Temporal primitives from your Workers. These helpers include friendly methods for executing Activities and Child Workflows, handling Signals, or signaling Workflows in flight. We found that this removes some of the boilerplate that tends to be common with Temporal code. Ultimately, the effect is that your Workflow logic becomes a lot easier to grok, with a lot of this boilerplate hidden away. It also generates typed Client SDKs that can be used to interact with your Temporal Workers or durable executions. These Clients include methods for executing and fetching Workflows, interacting with executions in flight, executing Queries, and sending Signals to running Workflows.
And, last but not least, the plugin can optionally generate a typed command-line interface for interacting with your Temporal Workers. It documents the CLI based on comments in your Protobuf schema, and generates typed and validated command-line flags, along with other helper methods for integrating your CLI with other CLIs, et cetera.
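As a rough illustration of the annotation approach described above (the option and field names below are illustrative sketches, not the plugin’s exact schema; consult its documentation for the real syntax), a service definition might look something like this:

```protobuf
syntax = "proto3";

package example.cluster.v1;

// NOTE: illustrative import path and option names only.
import "temporal/v1/temporal.proto";

service ClusterService {
  // Service-level defaults, e.g. the task queue all Workers listen on,
  // so consumers never have to know or guess it.
  option (temporal.v1.service) = {
    task_queue: "cluster-v1"
  };

  // A Workflow with a deterministic ID pattern and explicit policies,
  // owned by the service author rather than each consumer.
  rpc ProvisionCluster(ProvisionClusterRequest) returns (ProvisionClusterResponse) {
    option (temporal.v1.workflow) = {
      id: "provision-cluster/{name}"
      execution_timeout: { seconds: 3600 }
      id_reuse_policy: WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE
    };
  }
}

message ProvisionClusterRequest {
  string name = 1;
}

message ProvisionClusterResponse {
  string cluster_id = 1;
}
```

From a schema like this, the plugin can emit typed Worker registration helpers, a Client SDK, and optionally a CLI, all carrying the same defaults.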
Chris Ludden: Another challenge that we faced while building an internal infrastructure platform with Temporal was how to handle authorization for services that expose their APIs directly as Temporal Workflows. While mTLS provides authentication at the connection level, we really needed the ability to provide granular authorization within individual Workflow Executions. We actually found that Temporal posed unique constraints in this area. Unlike traditional HTTP or gRPC requests, Temporal Workflows are durable operations: they’re often long-lived; they can be archived for even longer periods of time; they can be rescheduled across Workers, and even reset or replayed for testing or troubleshooting purposes. And, unlike HTTP or gRPC request headers, the metadata associated with Workflows, which includes context propagation payloads, is passed as Workflow headers and recorded in plain text in the Workflow history. Also, the fact that many of these Workflows are long-lived requires that any authorization process be deterministic, to avoid non-determinism errors if the Workflow is reset, rescheduled, or replayed.
So, we set out to find an authorization solution that solved for Temporal’s unique constraints, but also something that we could implement in a familiar fashion for the engineers on our team who are more used to working with traditional web-based authorization frameworks. And we found our solution in biscuits, which is one of my favorite breakfast foods, though I think the token is actually named for the European biscuit.
Anthony Davis: Fancy cookies. Cookies for breakfast.
Chris Ludden: Yeah, so for those unfamiliar with biscuits, they’re an authorization token designed to address many of the shortcomings inherent in a lot of the other alternatives in this space. Specifically, like JSON Web Tokens, they use public key cryptography, which means that any application or user with the public key can validate a token offline. But, unlike JSON Web Tokens, biscuits are capabilities-based, so they can carry rights information instead of, or in addition to, traditional identity metadata. And then there are macaroons. I don’t know if anyone knows what macaroons are, but they’re probably the token that most inspired biscuits. Like macaroons, biscuits support offline attenuation, which means that any valid token can be used to issue a new, less capable valid token by attenuating its rights, without communicating with the issuing party. I think the biggest difference from macaroons is that biscuits use a standardized logic language for defining capabilities.
Similar to Rego, this highly capable Datalog-based language can model almost any authorization paradigm with ease: role-based, group-based, attribute-based, even policy-based access control. And, I think one of the coolest things about biscuits’ logic language is that it has tools that allow you to evolve the underlying authorization model over time as your requirements inevitably change. This is something that we’ve already taken advantage of a few times as we’ve matured our Temporal practices and evolved our authorization strategies.
And last but not least, biscuits are actually pretty small. They use a pretty clever combination of efficient serialization strategies like protocol buffers and symbol tables to remain incredibly compact on disk.
Anthony Davis: I want to call out that second-to-last point, the fact that you can evolve the model without breaking your existing tokens, which is amazing.
Chris Ludden: Okay, so here’s how we use biscuits with Temporal at HashiCorp: we have a tiny IAM service that allows administrators to define roles, which represent a collection of capabilities that are available to a principal or set of principals under a certain set of conditions. Our SDK then exchanges a user’s identity token for a short-lived biscuit that contains the Datalog program compiled from the role definition. Then, prior to executing a Temporal command, the SDK generates a single-use child token from that short-lived biscuit, which it passes to the Worker via context propagation. When the Workflow is received by the Worker, its inbound interceptors deserialize the biscuit, verify it using the token’s public key, and then initialize an authorizer with a standard collection of authorization context. At that point, the Workflow is ready for you to inject any additional authorization context that’s relevant to the particular Workflow, Signal, or Update before evaluating the authorization decision. The decision itself is side-effect free, and it doesn’t need to communicate with an external party.
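To sketch what that flow might look like in biscuit’s Datalog (the fact names, service names, and rule shapes here are hypothetical illustrations of the pattern, not our actual policy model):

```datalog
// Authority block: capabilities compiled from the IAM role definition.
right("workflow", "cluster.v1.ClusterService", "ProvisionCluster");
right("workflow", "cluster.v1.ClusterService", "GetCluster");

// Attenuation added when minting the single-use child token: bind it
// to one Workflow ID and give it a short expiry.
check if workflow_id("provision-cluster/orders-db");
check if time($t), $t < 2023-09-20T00:00:00Z;

// Authorizer policy evaluated by the Worker's inbound interceptor,
// using context facts (service, method, time) injected by the SDK.
allow if
  service($svc),
  method($m),
  right("workflow", $svc, $m);
```

Because the checks travel inside the token itself, the Worker can authorize the command offline, and a failed check can report exactly which rule was not satisfied.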
One last cool thing is that, because all of the relevant authorization details are included in the authorizer, any authorization failures are self-describing. That’s really helpful when troubleshooting authorization logic in Workflows. So, despite their somewhat intimidating first appearance, biscuits’ unique combination of capabilities has been incredibly powerful, and a really great fit for our use case.
Anthony Davis: With those first three challenges out of the way, we were able to get pretty far along in our journey. And the farther along we got, the more we started to recognize obvious patterns. One of these patterns was that a lot of our Workflows in the infrastructure domain revolve around managing cloud infrastructure. These Workflows are often just calling a bunch of cloud provider APIs. And really, as experienced Terraform practitioners, we recognized that we were starting to rewrite a lot of code that already exists in the vast ecosystem of Terraform providers. Similarly, many of the Workflows themselves began to resemble the familiar create, read, update, delete pattern, and were starting to resemble cloud provider APIs themselves. So we saw an exciting opportunity to simplify our Workflows, expose them to our consumers, and take advantage of all the investment in Terraform that has been made within HashiCorp.
In practice, Terraform has been a useful tool both for exposing consumer-facing Workflows and for simplifying the internal implementation of our own Workflows that manage cloud infrastructure. For the external interface, we leverage our code generation capabilities to expose a Temporal service’s Workflows as Terraform resources managed through our own custom Terraform provider. On the internal implementation side, we published a reusable collection of Temporal Workflows and Activities for executing Terraform using the Terraform Enterprise API and the thousands of open source providers that it supports. Taken together, this has made it incredibly easy for platform teams to offer managed infrastructure as a service, using their own domain expertise and existing Terraform configurations, without many of the shortcomings inherent to Terraform modules, such as requiring their consumers to track module versions, or requiring their consumers to acquire privileges for managing sensitive cloud infrastructure that maybe they shouldn’t actually have access to manage. This approach allows platform teams to focus on what they’re good at: managing and evolving the underlying infrastructure, independent of downstream application deployment pipelines.
It’s just Terraform under the hood.
Chris Ludden: So we’re nearing the end of our talk. To wrap up, I’m going to try to tie it all together by walking through a quick end-to-end journey of what a Temporal service looks like today at HashiCorp. We’ll start with the platform team implementing infrastructure Workflows on day zero. We’ll follow that with an end user of our platform consuming those Workflows on day one, and then with platform operators evolving those same Workflows on day two.
So it begins with the platform team deciding to offer some managed infrastructure capability via one or more Temporal Workflows, which they define using Protobuf schemas along with the Temporal annotations provided by our protoc plugin. In many cases, these Workflows take the familiar form of create, read, update, delete operations around some managed infrastructure primitive. And even more often, the primitives themselves are configured via Terraform, and they might have some auxiliary lifecycle automation Workflows. After defining their service schema, they implement the generated Workflow and Activity interfaces produced by code generation. They’re then able to leverage our reusable collection of Terraform Workflows and Activities to simplify the integration between Temporal and Terraform, before publishing their service with some help from our internal SDK. Once their service is deployed, they’re able to expose their infrastructure Workflows as one or more Terraform resources via our internal Terraform provider, again through code generation.
Then, on day one, an end user of our platform chooses a managed infrastructure resource as part of their service’s provisioning code. They’re able to lean on our provider documentation to understand the supported configurations, and they’re also able to get started easily by copying some of the code samples that we include in our provider documentation. At deployment time, our custom provider utilizes the service’s generated Client to execute the corresponding lifecycle Workflow. The Client’s outbound interceptor validates the request and then generates a single-use biscuit, which it passes to the Worker via Workflow metadata. The Worker uses that to authorize the Workflow. Once authorized, the Workflow executes the underlying Terraform code using our reusable Terraform Workflow libraries.
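From the end user’s point of view, consuming one of these managed resources might look like ordinary Terraform. The provider, resource type, and attribute names below are hypothetical, purely to illustrate the shape:

```hcl
# Hypothetical resource exposed by the internal provider; applying it
# executes the owning service's lifecycle Workflow via its generated
# Temporal Client instead of calling a cloud API directly.
resource "internal_managed_database" "orders" {
  name        = "orders-db"
  environment = "production"
  size        = "large"
}
```

The consumer never sees the Temporal Workflow, the biscuit exchange, or the underlying cloud credentials; they just plan and apply.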
On day two, a platform operator is asked to evolve the underlying infrastructure (which is inevitable, right?), and they do that by adjusting the Terraform configuration or modifying the operational Workflows. By leaning on Temporal as an abstraction between consumer-facing and platform-owned Terraform, we enable two distinct and independent software development life cycles for platform and product teams. That allows platform teams to deliver high-leverage, force-multiplying capabilities to the organization, while freeing up product teams to focus on delivering value for our customers.
If I could leave you with a handful of takeaways, I would encourage you to consider not hiding or wrapping the Temporal SDKs, but maybe just extending them for your use case. I’d highly encourage the use of code generation; if you’re into Protobufs or Go, check out our protoc plugin. If you’re going to dive in the deep end and start using Temporal directly as your API, check out biscuits for securing your Workflows. And last but not least, Terraform and Temporal are “better together”: Terraform is a great user interface mechanism for exposing self-service Workflows, and also a great implementation tool for managing cloud infrastructure within Temporal. And that’s our talk. Thank you so much for listening. Thanks to all the Temporal folks for this great conference. Thanks. Questions?
Guest: Hi, there. I know you mentioned you were inspired by last year’s talk about using Temporal Workflows and Signals and such as APIs. But I wonder if you could just walk through the pros and cons of why you actually made that decision versus, like, putting a small API in front of it or something like that?
Chris Ludden: I think, for us, because we deal with predominantly infrastructure-focused Workflows, a lot of these operations are very long-lived, when we’re working with, you know, database clusters, or search index clusters, or things like that. And we found that if we were to wrap our Temporal Workflows behind, like, a gRPC service, we would end up having to reinvent the polling mechanisms and all those other things that go along with it, which we just get out of the box for free with Temporal. But I will caveat that and say, we have use cases where we expose some of these Temporal APIs outside of our internal network, and in those cases, we do wrap Temporal with gRPC Connect.
Guest: I’m trying to understand your philosophy about extending and not hiding the SDK. So, one example that I’ve seen you do is manage the task queues for the users. We actually do the same thing at Instacart, because in most cases a Workflow is kind of tied to a particular task queue. So would an example be: by default, you send the workload to that task queue, but since you’re not hiding the SDK, the user can just set their own task queue, and, like, shoot themselves in the foot in the process?
Chris Ludden: That’s a good question. I think the approach we’ve taken is to bake as much into code generation as possible, and that is owned by the service provider. So they’re the ones that set values for retry policies, or ID formats, and things like that. And code generation, in combination with our SDK, takes all of those questions that most people are confronted with when they’re getting started with Temporal out of the picture. We do allow folks to override those things, but it’s more of an escape-hatch model; it’s not something that we push up front and make an obvious thing to do, if that makes sense.
Anthony Davis: I think another thing worth saying is just that all the generated code is also able to be imported. So it’s like, it’s also kind of a mechanism for sharing what would otherwise be kind of just a random string. It’s a mechanism for making the process of sharing things across different Workers robust, as well.
Chris Ludden: I guess I’d add one more caveat: we have a very specific convention for how we think about task queues, namespaces, clusters, and things like that. And so we try really hard not to leave that up to individual teams to make their own decisions; we bake that into our code generation and SDK processes.
Anthony Davis: All right, I think we have time for one last question in the back.
Guest: I wanted to just cover the last portion again – was I seeing that the SDK was generating a Workflow specification that represented a cloud resource? And then at the end, there was a TerraForm provider being generated?
Chris Ludden: Yeah, so it’s kind of Terraform all the way down, right? For us, Terraform is definitely the lingua franca of the organization. To summarize: what we’ve found is that for a lot of the infrastructure we manage on behalf of product teams at HashiCorp, it’s more than just the getting-started Terraform code you find in the public registry for things like Amazon RDS. A lot goes into making a cloud service production-ready for us at HashiCorp. We include all of the Datadog resources that we use to monitor it, Vault integrations, Consul integrations, things like that, retention policies, deletion protection, and so on.
We have in the past used Terraform modules to bundle all this stuff up. But what we found is that it creates a lot of mental overhead, and it requires teams to have access to a bunch of different systems that we don’t necessarily have great tools for making very granular ACLs for. So, by being able to package up our managed infrastructure as a managed Terraform resource and use Temporal as the mechanism to enforce that, we’ve found a really powerful pattern. Thanks. Alright, how about a round of applause? Thanks. Thank you all.