Failure is a fact of life; it's a certainty. Temporal provides an incredibly capable set of tools for handling failure.
Industry: High Tech
Use Case: Infrastructure Management
Company Size: 250-2000
SDK: Go
Temporal: Self-Hosted
NOTE: This is a summary of a talk given by the HashiCorp team at Replay 2023.
Terraform Cloud from HashiCorp uses Nomad clusters to deploy customer instances. This talk covers the blast radius challenges HashiCorp faced and how the team implemented a cellular architecture with shuffle sharding to address them. Temporal was then introduced to manage the Nomad clusters.
Here are the key points:
Challenges with Blast Radius: HashiCorp Terraform Cloud runs on Nomad clusters. Initially, there was one big cluster per environment. This created a blast radius issue: a problem in that cluster could impact every customer running on it. Think of all users being connected to a single power grid, where one outage in the grid affects everyone at once.
Cellular Architecture with Shuffle Sharding: To mitigate the blast radius problem, the team implemented a cellular architecture with shuffle sharding. This involves splitting the Nomad clusters into smaller clusters and assigning customers to random subsets of these clusters. Cellular architecture is a way of partitioning a system into smaller, isolated cells, each with its own set of resources and responsible for a specific set of tasks. Shuffle sharding assigns each tenant a random subset of the available resources, so that few tenants share exactly the same subset and a failure in one subset affects only a small fraction of them. In HashiCorp's case, customers are assigned to random Nomad clusters, which isolates customer workloads and reduces the blast radius.
Temporal for Nomad Cluster Management: HashiCorp looked for a tool to manage the Nomad clusters at scale. Temporal is a workflow orchestration platform that allows you to define complex workflows as code, with features like distributed execution, retries, and observability. The team considered AWS Step Functions and a workflow engine from HashiCorp, but ultimately decided on Temporal because it had a roadmap for a managed service, which would eventually save them time and effort.
Compute API for Self-Service Infrastructure: HashiCorp built a compute API using Temporal to allow internal customers to manage their workloads across the Nomad clusters. This API allows internal customers to spin up and down workloads on the Nomad clusters in a self-service manner. Temporal workflows are used behind the scenes to orchestrate the provisioning and management of these workloads.
Proto Plugin for Temporal Clients and Workers: To make it easier for others to use Temporal, HashiCorp implemented a fully open-source Proto plugin that generates typed Temporal clients and workers in Go from Protobuf schemas. Protobuf is a language-neutral, platform-neutral mechanism for serializing structured data. Generating statically typed code for Temporal clients and workers improves type safety and reduces the risk of errors.
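To make the shuffle sharding idea from the key points above more concrete, here is a minimal Go sketch. It is not HashiCorp's implementation: the cluster names, the `assignShard` and `shardCount` helpers, and the choice to seed the shuffle from a hash of the customer ID (so that assignments are repeatable) are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// assignShard picks shardSize clusters for a customer (hypothetical helper).
// The customer ID seeds the random source, so the same customer always
// receives the same subset of clusters.
func assignShard(customerID string, clusters []string, shardSize int) []string {
	if shardSize > len(clusters) {
		shardSize = len(clusters)
	}
	h := fnv.New64a()
	h.Write([]byte(customerID))
	rng := rand.New(rand.NewSource(int64(h.Sum64())))

	// Shuffle a copy so the caller's slice is left untouched.
	shuffled := append([]string(nil), clusters...)
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})

	shard := shuffled[:shardSize]
	sort.Strings(shard)
	return shard
}

// shardCount returns C(n, k): how many distinct shards of size k exist
// across n clusters. More distinct shards means fewer customers that
// share exactly the same set of clusters.
func shardCount(n, k int) int {
	if k < 0 || k > n {
		return 0
	}
	result := 1
	for i := 1; i <= k; i++ {
		result = result * (n - k + i) / i
	}
	return result
}

func main() {
	// Hypothetical cluster names, for illustration only.
	clusters := []string{
		"nomad-1", "nomad-2", "nomad-3", "nomad-4",
		"nomad-5", "nomad-6", "nomad-7", "nomad-8",
	}
	for _, customer := range []string{"customer-a", "customer-b", "customer-c"} {
		fmt.Println(customer, assignShard(customer, clusters, 2))
	}
	fmt.Println("distinct shards:", shardCount(len(clusters), 2)) // 28
}
```

With 8 clusters and 2 clusters per customer there are 28 possible shards, so losing a pair of clusters fully takes out only the customers that landed on exactly that pair, rather than every customer as in the single-cluster setup.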
HashiCorp found Temporal to be a good fit for managing their Nomad clusters. Its workflow orchestration capabilities made it easy for the team to define and manage the complex workflows involved, and features like retries gave them a way to deal with unexpected states and edge cases, which made their Nomad cluster management more reliable.