In today’s world of managed cloud services, delivering exceptional user experiences often requires rethinking traditional architecture and operational strategies. At Temporal, we faced this challenge head-on, navigating complex decisions about tenancy models, resource management, and durable execution to build a reliable, scalable cloud service. This post explores our approach and the lessons we learned while creating Temporal Cloud.
The Case for Managed Cloud Services
Managed services have become the default for delivering hosted solutions to customers. Whether it’s a database, queueing system, or another server-side technology, hosting a service not only provides a better user experience but also opens doors for monetization, especially for open-source projects. The challenge is how to do it effectively while maintaining reliability and scalability.
One of the first decisions we made was about tenancy models. Should we pursue single-tenancy — provisioning dedicated clusters for each customer — or opt for multi-tenancy, which allows multiple customers to share the same resources? While single-tenancy offers simplicity and isolation, its inefficiencies quickly become apparent. Customers end up paying for unused capacity, and providers shoulder higher operational costs. Multi-tenancy, though harder to implement, emerged as the clear winner. It optimizes resource usage, allows customers to pay for actual usage, and creates shared headroom for handling traffic spikes.
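The capacity argument can be made concrete with a little arithmetic. The sketch below uses invented load figures (the tenant names and numbers are hypothetical, not Temporal Cloud data) to compare the capacity needed when each tenant gets a dedicated cluster sized for its own peak with the capacity needed when all tenants share one pool sized for the combined peak.

```python
# Hypothetical hourly load (requests per second) for three tenants.
# The numbers are invented for illustration only.
tenant_load = {
    "tenant_a": [10, 80, 20, 10],
    "tenant_b": [70, 10, 15, 20],
    "tenant_c": [15, 20, 90, 10],
}


def dedicated_capacity(load):
    """Single-tenancy: every tenant's cluster is sized for that tenant's own peak."""
    return sum(max(series) for series in load.values())


def shared_capacity(load):
    """Multi-tenancy: one pool is sized for the peak of the combined load."""
    combined = [sum(hour) for hour in zip(*load.values())]
    return max(combined)


dedicated = dedicated_capacity(tenant_load)
shared = shared_capacity(tenant_load)
print(f"dedicated clusters: {dedicated}, shared pool: {shared}")
```

With these made-up numbers the dedicated clusters need 240 units of capacity while the shared pool needs 125, because the tenants peak at different hours. A real provider would add safety headroom on top, but that headroom is also shared rather than duplicated per tenant.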
Data Plane vs. Control Plane: Defining Responsibilities
Splitting a managed service into a data plane and a control plane is an industry best practice. We followed it, giving each plane a clearly defined role in our cloud architecture.
- Data Plane: This is where the actual work happens — processing transactions, executing workflows, and handling customer data. It must maintain high availability, low latency, and resilience to failures. For Temporal Cloud, we adopted a cell-based architecture to isolate resources and minimize the blast radius of potential failures.
- Control Plane: This acts as the brain of the system, managing resources, provisioning namespaces, and handling configurations. While its performance is less critical than the data plane’s, reliability here still matters for customer experience. For instance, provisioning a namespace may not be urgent, but delays or errors in this process can frustrate users.
Implementing the Data Plane: A Cell-Based Architecture
For the data plane, we applied a cell-based architecture to achieve strong isolation and scalability. Each cell operates as a self-contained unit with its own AWS account, VPC, EKS cluster, and supporting infrastructure, so failures or updates in one cell do not impact others and the risk of cascading outages is reduced. Although we describe the design here in AWS terms, we have applied the same principles on Google Cloud Platform (GCP), using its equivalent primitives to keep behavior consistent and reliable across cloud providers.
Each cell in Temporal Cloud includes:
- Compute Pods: Running Temporal services and infrastructure tools for observability, ingress management, and certificate handling.
- Databases: Both primary databases and Elasticsearch for enhanced visibility.
- Additional Components: Load balancers, private connectivity endpoints, and other supporting infrastructure that ensures smooth operation and integration across environments.
Currently, Temporal Cloud operates across 14 AWS regions, and we’ve also added support for GCP. This architecture allows us to meet the diverse needs of our customers while maintaining reliability at scale.
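To make the isolation idea more tangible, here is a minimal sketch of how cells might be modeled and how a namespace could be placed into one. Everything in it is an assumption for illustration: the `Cell` fields, the `select_cell` and `blast_radius` helpers, and the least-utilized placement policy are hypothetical and do not describe Temporal Cloud's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One isolated unit of the data plane (fields are illustrative)."""
    cell_id: str
    cloud: str          # e.g. "aws" or "gcp"
    region: str
    max_namespaces: int
    namespaces: list = field(default_factory=list)

    def utilization(self) -> float:
        return len(self.namespaces) / self.max_namespaces


def select_cell(cells, cloud, region):
    """Pick the least-utilized cell with free room in the requested region."""
    candidates = [
        c for c in cells
        if c.cloud == cloud and c.region == region
        and len(c.namespaces) < c.max_namespaces
    ]
    if not candidates:
        raise LookupError(f"no cell with capacity in {cloud}/{region}")
    return min(candidates, key=Cell.utilization)


def blast_radius(cell):
    """Namespaces affected if this one cell fails; other cells are untouched."""
    return list(cell.namespaces)


cells = [
    Cell("cell-1", "aws", "us-east-1", 4, ["ns-a", "ns-b", "ns-c"]),
    Cell("cell-2", "aws", "us-east-1", 4, ["ns-d"]),
    Cell("cell-3", "gcp", "us-central1", 4, []),
]
chosen = select_cell(cells, "aws", "us-east-1")
chosen.namespaces.append("ns-new")
print(chosen.cell_id, blast_radius(cells[0]))
```

Because each namespace lives in exactly one cell, the set returned by `blast_radius` is the upper bound on who is affected when that cell has a problem.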
Durable Execution: The Foundation of the Control Plane
Building the control plane presented its own set of challenges, particularly around reliability and maintainability. Control plane tasks, such as provisioning namespaces or rolling out updates, involve complex long-running processes with many interdependent steps. Writing this logic as traditional, ad-hoc code often leads to brittle systems that are hard to debug and evolve.
This is where Temporal’s durable execution model shines. Informed by experience with earlier systems such as Amazon Simple Workflow Service and Azure Durable Functions, Temporal’s approach separates business logic from state management and failure handling. Developers can write workflows as straightforward, happy-path code without hand-writing retries, failure recovery, or state persistence. The system manages these concerns automatically, allowing workflows to recover from failures.
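The following toy simulation illustrates the general idea of separating happy-path logic from state persistence. It is a loose analogy in plain Python, not the Temporal SDK and not how Temporal is implemented internally: the `Journal` class, the step names, and the simulated crash are all hypothetical, and an in-memory dict stands in for durable storage.

```python
import json


class SimulatedCrash(Exception):
    """Stands in for the worker process dying mid-workflow."""


class Journal:
    """Records each completed step's result; a dict stands in for durable storage."""

    def __init__(self):
        self.results = {}

    def run_step(self, name, action):
        if name in self.results:          # finished before the crash: reuse result
            return json.loads(self.results[name])
        result = action()
        self.results[name] = json.dumps(result)
        return result


executed = []  # side effects that actually ran, across all attempts


def do_work(name, value, crash_at):
    if name == crash_at:
        raise SimulatedCrash(name)
    executed.append(name)
    return value


def workflow(journal, crash_at=None):
    """Happy-path logic only: no retry loops or checkpoint code in here."""
    a = journal.run_step("step_a", lambda: do_work("step_a", 1, crash_at))
    b = journal.run_step("step_b", lambda: do_work("step_b", 2, crash_at))
    c = journal.run_step("step_c", lambda: do_work("step_c", 3, crash_at))
    return a + b + c


journal = Journal()
try:
    workflow(journal, crash_at="step_c")   # first attempt dies at step_c
except SimulatedCrash:
    pass
total = workflow(journal)                  # rerun resumes from the journal
print(total, executed)
```

On the second run, `step_a` and `step_b` are answered from the journal instead of being executed again, so each side effect happens once even though the workflow function ran twice.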
Namespace Provisioning: A Real-World Example
Consider the process of creating a new namespace in Temporal Cloud. When a user clicks “Create Namespace” on the web interface, the control plane orchestrates a series of tasks:
- Selecting a suitable cell within the chosen region.
- Creating database records and roles.
- Generating and provisioning mTLS certificates.
- Configuring ingress routes and verifying connectivity.
Each step involves external API calls, DNS propagation, and other potential points of failure.
Without durable execution, managing retries, backoffs, and state persistence would result in a tangle of brittle code. With Temporal, these tasks are encapsulated in workflows, which transparently handle retries and maintain state across failures. Developers can focus on the high-level logic, confident that the system will handle the edge cases.
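As a rough picture of what gets factored out of the business logic, the sketch below runs the provisioning steps listed above under a generic retry policy with exponential backoff. It is plain Python rather than the Temporal SDK; `run_with_retry`, its parameters, the step functions, and the returned values are hypothetical names chosen for illustration, and the connectivity check is rigged to fail twice to mimic slow DNS propagation.

```python
import time


class TransientError(Exception):
    """A failure worth retrying, such as a DNS record that has not propagated."""


def run_with_retry(step, max_attempts=5, initial_interval=1.0,
                   backoff_coefficient=2.0, sleep=time.sleep):
    """Call step() until it succeeds, waiting longer after each transient failure."""
    delay = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= backoff_coefficient


dns_lookups = {"count": 0}


def verify_connectivity():
    # Simulated flakiness: the first two checks fail, the third succeeds.
    dns_lookups["count"] += 1
    if dns_lookups["count"] < 3:
        raise TransientError("DNS record not visible yet")
    return "reachable"


def provision_namespace(name, sleep=time.sleep):
    """The steps read as a plain list; retry behavior lives in one place."""
    steps = [
        ("select_cell", lambda: "cell-2"),
        ("create_db_records", lambda: f"{name}-role"),
        ("issue_mtls_cert", lambda: f"{name}.crt"),
        ("verify_connectivity", verify_connectivity),
    ]
    return {label: run_with_retry(step, sleep=sleep) for label, step in steps}


delays = []  # record the waits instead of actually sleeping
result = provision_namespace("orders-prod", sleep=delays.append)
print(result, delays)
```

Passing `delays.append` as the `sleep` function keeps the example fast and makes the backoff schedule visible. This loop still loses its progress if the process dies, which is the gap durable execution is meant to close.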
Rolling Upgrades: Ensuring Safe Deployments
Another common control plane scenario is rolling out updates to the Temporal Cloud fleet. Our deployment strategy organizes cells into deployment rings that progress from pre-production environments to customer-facing cells carrying increasingly high-priority traffic.
The rollout process is carefully staged:
- Ring 0: Synthetic traffic only, no customer impact. Changes are monitored here for at least a week.
- Ring 1: Low-priority traffic namespaces, allowing for additional testing with minimal risk.
- Higher Rings: Gradually expanding to critical, high-priority traffic customers.
Within each ring, updates are applied in batches, with pauses between batches to observe for potential issues like memory leaks or race conditions. Temporal workflows handle this process, ensuring that even long-running deployments (which can span weeks) are resilient to failures or restarts.
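The ring-and-batch progression described above can be sketched as a simple loop. This is an illustrative outline only: the `rollout` function, the cell names, the batch size, and the health check are hypothetical, and the pause between batches is injected as a no-op so the example runs instantly.

```python
def batches(items, size):
    """Split a list of cells into consecutive batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def rollout(rings, version, deploy, is_healthy, batch_size=2, pause=lambda: None):
    """Upgrade ring by ring, batch by batch, halting at the first unhealthy batch."""
    upgraded = []
    for ring_name, cells in rings:
        for batch in batches(cells, batch_size):
            for cell in batch:
                deploy(cell, version)
                upgraded.append(cell)
            pause()  # soak time between batches to let problems surface
            unhealthy = [c for c in batch if not is_healthy(c)]
            if unhealthy:
                return {"status": "halted", "ring": ring_name,
                        "unhealthy": unhealthy, "upgraded": upgraded}
    return {"status": "complete", "upgraded": upgraded}


rings = [
    ("ring-0", ["synthetic-1"]),
    ("ring-1", ["low-1", "low-2", "low-3"]),
    ("ring-2", ["critical-1", "critical-2"]),
]
deployed = {}
outcome = rollout(
    rings, "v2",
    deploy=lambda cell, version: deployed.update({cell: version}),
    is_healthy=lambda cell: cell != "low-3",   # pretend low-3 regresses
)
print(outcome)
```

In this run the rollout halts in ring-1 after `low-3` reports unhealthy, so the cells in ring-2 are never touched. In a real deployment the pause could last hours or days, which is why the surrounding process needs to survive restarts.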
Entity Workflows: A Powerful Pattern
Temporal’s durable execution also enables powerful patterns like entity workflows. These are workflows tied to specific resources, such as cells or namespaces, providing a natural way to model state and operations. For example, each cell in Temporal Cloud has an entity workflow that manages its lifecycle, from provisioning to upgrades. This approach ensures consistency and simplifies concurrency control.
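One way to picture the entity pattern is as a long-lived object per resource that applies incoming requests one at a time. The sketch below is a plain-Python analogy under that assumption, not Temporal SDK code: `CellEntity`, its operation names, and the in-memory inbox are hypothetical stand-ins for a workflow that receives requests and owns the state of a single cell.

```python
from collections import deque


class CellEntity:
    """Owns the state of one cell and applies operations strictly in order."""

    def __init__(self, cell_id):
        self.cell_id = cell_id
        self.state = "provisioning"
        self.version = None
        self.inbox = deque()   # pending operations, oldest first
        self.log = []          # (operation, resulting version) pairs

    def signal(self, operation, **args):
        """Queue a request; callers never mutate the cell's state directly."""
        self.inbox.append((operation, args))

    def drain(self):
        """Process queued operations one at a time, so they cannot interleave."""
        handlers = {
            "mark_provisioned": self._mark_provisioned,
            "upgrade": self._upgrade,
        }
        while self.inbox:
            operation, args = self.inbox.popleft()
            handlers[operation](**args)
            self.log.append((operation, self.version))

    def _mark_provisioned(self, version):
        self.state, self.version = "active", version

    def _upgrade(self, version):
        if self.state != "active":
            raise RuntimeError(f"{self.cell_id} is not ready for upgrades")
        self.version = version


cell = CellEntity("cell-2")
cell.signal("mark_provisioned", version="v1.0")
cell.signal("upgrade", version="v1.1")
cell.signal("upgrade", version="v1.2")
cell.drain()
print(cell.state, cell.version, cell.log)
```

Because a single owner processes the queue sequentially, two upgrade requests for the same cell cannot race each other, which is the concurrency benefit the pattern is aiming for.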
Developer Happiness and Productivity
One of the biggest benefits of Temporal’s approach is the impact on developer experience. By eliminating the need to write boilerplate code for retries, backoffs, and state management, developers can focus on delivering business value. Temporal’s built-in tools for observing and debugging workflows further enhance productivity, making it easier to understand and troubleshoot complex systems.
Happy developers are productive developers, and Temporal’s approach fosters this by reducing the cognitive load and frustration associated with traditional workflow coding.
Why Durable Execution Matters
Durable execution is more than a technical innovation; it’s a paradigm shift for building cloud-native systems. By decoupling business logic from state management and failure handling, Temporal empowers developers to build reliable, scalable systems with less effort. Whether you’re managing control planes, provisioning resources, orchestrating complex workflows, performing money transfers, training AI models, or processing social media posts, this approach delivers clear benefits.
At Temporal, we’ve seen firsthand how durable execution transforms the development process, enabling us to deliver a robust managed service that scales with our customers’ needs.
Ready to Transform Your Control Plane?
Temporal isn’t just a tool for building cloud systems; it’s a better way to think about workflows and application architecture. If you’re building or planning a managed cloud service, consider how durable execution can simplify your journey and unlock new possibilities. For more insights into our approach, check out my full talk at QCon.
Ready to explore what Temporal Cloud can do? Now’s the perfect time — get $1,000 in Temporal Cloud credits and start building today!