Temporal Cloud is now a multi-cloud platform. In this post, we’ll explore how we leveraged Temporal’s own capabilities to expand our infrastructure from AWS to Google Cloud, the challenges we faced along the way, and how we solved them using cloud-agnostic workflows. Whether you’re considering a multi-cloud strategy or interested in scaling distributed systems, our experience offers valuable insights into managing complexity while maintaining consistency across cloud providers.
Software-as-a-Service (SaaS) and multi-cloud
In today's SaaS ecosystem, multi-cloud isn't just a technical choice — it's a business imperative driven by evolving customer needs and market dynamics.
For SaaS providers, multi-cloud generally means deploying and running your infrastructure and systems on different cloud providers without major modifications, while providing the same core functionality and user experience.
Several factors drive the SaaS industry to move toward multi-cloud solutions:
- Customer preferences: Customers often have existing relationships with specific cloud providers. They prefer SaaS solutions that align with their cloud strategy.
- Data residency and compliance: Different regions have varying data sovereignty laws, requiring data to be stored in specific geographic locations. Some regions might be available in some cloud providers but not others.
- Risk mitigation: Some industries require backup deployments with other providers. Offering multiple cloud providers strengthens your customers' disaster recovery posture and reduces your dependency on a single cloud provider.
- Market expansion: Accessing new markets where certain cloud providers have a stronger presence or are preferred/mandated.
- Leveraging features: Some cloud providers offer unique services or capabilities you can use to create a great product experience.
- Future-proofing: Pricing, features, and other factors might cause customers to move to a different cloud in the future. Going multi-cloud gives your customers the flexibility to adapt to future cloud strategies without needing to change SaaS providers.
Fundamentally, multi-cloud for SaaS is about adapting to a landscape where the choice of cloud infrastructure is increasingly driven by customer needs and preferences, rather than solely by the SaaS provider's technical considerations.
This is also true for Temporal Cloud: between 2023 and 2024, we took a journey to expand our presence beyond Amazon Web Services (AWS) and get closer to the customers who trust Google Cloud with their services.
Cloud, we have a problem
The path to multi-cloud isn't without its hurdles, from increased architectural complexity as you try to handle all the “special cases”, to ensuring consistency across environments with different features and APIs.
On top of that, managing multiple cloud platforms requires a broader skill set from engineering teams, and can complicate cost optimization efforts. Adding to the mix, you now need to maintain feature parity between cloud providers, which lengthens your development cycles.
When deploying Temporal Servers for Temporal Cloud, we follow a few common and effective strategies for deploying consistent and repeatable infrastructure. However, when trying to leverage those for multi-cloud deployments, we hit their limitations pretty fast.
Containerization and orchestration technologies like Kubernetes help standardize the deployment process for compute resources, but Kubernetes itself needs to be deployed first, along with the nodes and network resources for your cluster, and those deployments are Cloud Provider-specific. There are also the Kubernetes annotations that providers offer to configure special behaviors, such as load balancer configurations, which differ between Cloud Providers. As an example, we can look at the manifest to deploy a service in AWS Elastic Kubernetes Service (EKS):
```yaml
# AWS EKS annotation example
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-attributes: load_balancing.cross_zone.enabled=true
```
To expose our service in EKS, we need to add three (3) different annotations for the `aws-load-balancer-controller`: to indicate that we want a network load balancer to expose our service (`nlb`), that the NLB should be publicly accessible (`internet-facing`), and that we authorize cross-zone load balancing, which means the load balancer can use all available backends, even ones in different Availability Zones.
If we want to expose a similar service in Google Kubernetes Engine (GKE), we would need the following manifest:
```yaml
# GCP GKE annotation example
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    cloud.google.com/l4-rbs: enabled
```
To expose our service in GKE, we only need a single (1) annotation, which instructs GKE to create a backend service-based external passthrough Network Load Balancer: more simply put, the equivalent of what we did above for EKS.
Infrastructure-as-Code (IaC) tools, such as Terraform or Pulumi, are invaluable for provisioning and maintaining infrastructure. However, when a new Cloud Provider enters the equation, net new code often needs to be written. This is because IaC tools, for good reasons, tend to align closely with each Cloud Provider's specific APIs and resource models. As an example, the following is the HCL necessary for deploying an AWS load balancer with TLS termination using Terraform. Note that for simplicity, this does not include the declaration of the linked resources (subnets, certificate, service receiving the traffic, …).
```hcl
# AWS Application Load Balancer in Terraform
resource "aws_lb" "example" {
  name               = "example-lb"
  load_balancer_type = "application"
  subnets            = ["subnet-1", "subnet-2"]
}

resource "aws_lb_listener" "front_end" {
  load_balancer_arn = aws_lb.example.arn
  port              = "443"
  protocol          = "HTTPS"
  certificate_arn   = "arn:aws:acm:region:account:certificate/cert-id"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.front_end.arn
  }
}
```
We need to create two (2) resources. First, a load balancer using the `aws_lb` Terraform resource, defining the type of the load balancer and the list of subnets to attach to it. We also create a load balancer listener using the `aws_lb_listener` resource, which configures the port on which to receive traffic, the protocol for that traffic, and links to both the certificate to use for terminating TLS and the service to forward traffic to (`target_group_arn`).
Now to set up a similar HTTPS load balancer in Google Cloud, we would need the following HCL file:
```hcl
# GCP HTTPS Load Balancer in Terraform
resource "google_compute_global_forwarding_rule" "default" {
  name       = "global-rule"
  target     = google_compute_target_https_proxy.default.id
  port_range = "443"
}

resource "google_compute_target_https_proxy" "default" {
  name             = "test-proxy"
  url_map          = google_compute_url_map.default.id
  ssl_certificates = ["projects/my-project/global/sslCertificates/my-cert"]
}

resource "google_compute_url_map" "default" {
  name            = "url-map"
  default_service = google_compute_backend_service.default.id
}
```
This time we need to create three (3) resources. First, a URL map using the `google_compute_url_map` resource, which lets us define rules and route requests to a backend service; in our case, we just use the default service. Then, a target HTTPS proxy using the `google_compute_target_https_proxy` resource, which links to the certificate to use for terminating TLS and to the URL map we just created. Finally, a global forwarding rule using the `google_compute_global_forwarding_rule` resource, which exposes the target proxy we defined on a specific port.
We can observe here that the difference in model between Google Cloud and AWS is not only in the name and type of resources, but in the chain of dependencies to create those resources: in AWS we define a load balancer and we attach listeners to it, while in Google Cloud we define routing rules and we attach a load balancer to those.
And no matter what you do, there are sometimes clear differences in features between Cloud Providers. For instance, Temporal depends on a visibility layer, for which we use AWS OpenSearch; Google Cloud has no clear equivalent directly compatible with Elasticsearch 7. This required us to work with another vendor to deploy our visibility layer in Google Cloud, a clearly different path from the one taken for AWS deployments.
Developing abstraction or encapsulation layers to centralize the handling of discrepancies between providers becomes inevitable to keep sane codebases and processes. We have talked on a few occasions about how Temporal Cloud uses Temporal Cloud to deploy Temporal Cloud, and when building our services on Google Cloud, this definitely came in handy.
Failure is not an option: using Temporal
… quick interlude: what is Temporal?
Now if you’re here reading about this, you probably know about Temporal or are at least intrigued by it. Let’s put it this way: if you’re building distributed systems, or even just distributed applications, you already know the headaches that come with it. State management, error handling, retries — the list goes on. That’s where Temporal comes in: it lets you write your business logic as if it were a monolithic, linear program.
Temporal brings Durable Execution. This means that no matter where a failure happens, whether it's bad code, an unavailable API, or a burning computer, your code will continue to execute (in the latter case, probably on another machine, unless you succeed at salvaging it). Temporal allows for cleaner code, easier debugging, and more resilient systems. It's not the silver bullet that will find the question to life, the universe, and everything, but it sure makes life easier when you're dealing with complex, distributed processes.
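To give a taste of what that looks like in code, here is a minimal, hypothetical sketch using the Temporal Go SDK; `OrderWorkflow`, `ChargeCustomer`, and the payment scenario are illustrative assumptions, not anything Temporal ships:

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// OrderWorkflow reads like a plain, linear program, but every step is
// persisted: if the Worker crashes mid-way, another Worker replays the
// event history and resumes exactly where execution left off.
func OrderWorkflow(ctx workflow.Context, orderID string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})
	// Activities are retried automatically on failure, per the retry policy.
	return workflow.ExecuteActivity(ctx, ChargeCustomer, orderID).Get(ctx, nil)
}

// ChargeCustomer is a regular function holding the failure-prone side
// effects, e.g. calling a payment API.
func ChargeCustomer(ctx context.Context, orderID string) error {
	// Call the payment provider here; returning an error triggers a retry.
	return nil
}
```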
And if you’re curious, I gave a presentation about Temporal itself at PyCon Sweden last year, which you can watch here.
… ok, now we’re back in orbit!
So let’s go back to what I was saying: we use Temporal to deploy Temporal Cloud, and this brought us a few advantages when setting up a new Cloud Provider. Internally, we have two control planes: one that is the user-facing control plane (User CP), which handles resources logically, and another that is the infrastructure control plane (Infra CP), which handles resources “physically” (as physically as cloud resources can be).
Thanks to these two control planes, we already have an abstraction layer in place. The Infra CP is responsible for communicating with the external providers and provisioning (and deprovisioning) the required resources. It provides APIs to the User CP, which in turn does not need to communicate with any provider. Those two control planes are actually Temporal Namespaces with separate workers, and they exchange messages which generally contain all the information to identify a Temporal cluster, including the Cloud Provider in which it is deployed.
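For illustration, here is a hypothetical sketch of the kind of identifier those messages carry; the `Cluster` type matches the one used by the factory shown later, but the exact fields are assumptions, not our actual schema:

```go
// Cluster identifies a Temporal cluster in messages exchanged between the
// User CP and the Infra CP. Fields are illustrative assumptions.
type Cluster struct {
	ID            string // unique identifier of the Temporal cluster
	CloudProvider string // e.g. "aws" or "gcp"
	Region        string // provider-specific region, e.g. "us-east-1" or "us-central1"
}
```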
At the beginning of our journey, we identified a few limitations in the way we were doing things: the Infra CP assumed that all requests from the User CP were for AWS, since the latter was not provider-aware and we only used a single provider. Some of our Workflows were also written around AWS specifics, which required different inputs for Google Cloud deployments; for instance, the AWS OpenSearch deployment parameters did not match one-to-one with the required parameters for our other search vendor.
As we started writing parallel, Google Cloud-specific Workflows, we paused and looked at what we were building and how we could keep our codebase from becoming messy, while also making life easier for our future selves. Instead of cluttering our code with if-else statements, we created cloud-agnostic Temporal Workflows. These Workflows are built around a factory that takes the Cloud Provider received as input from the User CP and returns an object implementing a known interface, which can then be used to spawn Child Workflows from our cloud-agnostic Workflow.
Let’s take a look at a trimmed-down example in Go of how we use this factory pattern in our cloud-agnostic Workflows. We have defined an interface `TemporalClusterProvider` with all the methods we expect our providers to implement:
```go
// TemporalClusterProvider is an interface representing a provider that
// can be used to deploy, manage and destroy a Temporal Cluster
type TemporalClusterProvider interface {
	// DeployPersistenceStore deploys the persistence store used
	// by the Temporal Server, e.g. cassandra
	DeployPersistenceStore(workflow.Context, DeployPersistenceStoreInput) (DeployPersistenceStoreOutput, error)

	// DeployVisibilityPersistenceStore deploys the visibility persistence
	// store used by the Temporal Server, e.g. elasticsearch
	DeployVisibilityPersistenceStore(workflow.Context, DeployVisibilityPersistenceStoreInput) (DeployVisibilityPersistenceStoreOutput, error)
}
```
And we wrote a factory function `GetTemporalClusterProvider` that returns the implementation of `TemporalClusterProvider` matching the cloud provider needed to handle the cluster passed as parameter. In this example, we support only AWS and Google Cloud, and raise an error if the provider is unsupported.
```go
// GetTemporalClusterProvider is a factory function that returns the
// implementation of a TemporalClusterProvider that matches the Cloud
// Provider of the provided Cluster
func GetTemporalClusterProvider(cluster Cluster) (TemporalClusterProvider, error) {
	switch cluster.CloudProvider {
	case "aws":
		return NewAwsTemporalClusterProvider(), nil
	case "gcp":
		return NewGcpTemporalClusterProvider(), nil
	default:
		return nil, fmt.Errorf("unsupported cloud provider: %s", cluster.CloudProvider)
	}
}
```
We can then use `GetTemporalClusterProvider` in the context of one of our cloud-agnostic Workflows to get a matching provider for the functionality we are looking for. Depending on the cloud provider, the operations can be very different, but from the point of view of our cloud-agnostic Workflow, we are following the same set of tasks, as shown in `DeployTemporalCluster` below.
```go
// DeployTemporalCluster is a cloud-agnostic Workflow to deploy all the
// resources needed for a Temporal Cluster, from the networks to the
// persistence stores to the Temporal Server itself
func (w *Workflows) DeployTemporalCluster(ctx workflow.Context, input DeployTemporalClusterInput) (DeployTemporalClusterOutput, error) {
	// Get the provider for the provided cluster details
	provider, err := GetTemporalClusterProvider(input.Cluster)
	if err != nil {
		return DeployTemporalClusterOutput{}, err
	}

	// Deploy the persistence store
	persistenceOut, err := provider.DeployPersistenceStore(ctx, DeployPersistenceStoreInput{
		// ...
	})
	if err != nil {
		return DeployTemporalClusterOutput{}, err
	}

	// Deploy the visibility persistence store
	visibilityOut, err := provider.DeployVisibilityPersistenceStore(ctx, DeployVisibilityPersistenceStoreInput{
		// ...
	})
	if err != nil {
		return DeployTemporalClusterOutput{}, err
	}

	// Perform zero-gravity operations and other galactic activities

	// Return the details of the deployed Temporal Cluster
	return DeployTemporalClusterOutput{
		// ...
	}, nil
}
```
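Under the hood, each provider implementation can fan out to cloud-specific Child Workflows. Here is a hypothetical sketch of what the AWS side might look like; the Child Workflow name is an assumption for illustration, not our actual code:

```go
// DeployPersistenceStore satisfies TemporalClusterProvider for AWS by
// delegating to an AWS-specific Child Workflow, which owns all the
// provider details (which services to provision, and how).
func (p *AwsTemporalClusterProvider) DeployPersistenceStore(
	ctx workflow.Context,
	input DeployPersistenceStoreInput,
) (DeployPersistenceStoreOutput, error) {
	var out DeployPersistenceStoreOutput
	// "DeployAwsPersistenceStore" is a hypothetical Child Workflow name.
	err := workflow.ExecuteChildWorkflow(ctx, "DeployAwsPersistenceStore", input).Get(ctx, &out)
	return out, err
}
```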
We get cloud-specific behaviors while keeping our code clean, easy to maintain, and extensible. It also helps ensure we don't miss feature parity, at least at the infrastructure level, since adding any new method to our provider interface reminds us to add it for the other cloud providers.
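In Go, that reminder can even be enforced at compile time. A small sketch, assuming our provider implementations are pointer types (an assumption about the wiring, not necessarily our exact code):

```go
// These compile-time assertions fail the build as soon as a provider
// implementation no longer satisfies the full interface.
var (
	_ TemporalClusterProvider = (*AwsTemporalClusterProvider)(nil)
	_ TemporalClusterProvider = (*GcpTemporalClusterProvider)(nil)
)
```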
As we combined this approach with Temporal’s handling of retries and replay-based recovery, we saved a lot of time and headaches. When we missed some permissions that our Temporal Workers required, or overlooked that little detail in our code, Temporal was our escape shuttle. Instead of restarting entire processes from the beginning for each issue or mistake (and some resources can take time to deploy!), we could add the missing permissions, update the code, and wait for the next retry to happen. This streamlined our development and debugging processes, making our multi-cloud adaptation much more efficient.
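As a sketch of how that retry behavior is configured in the Go SDK (the values here are illustrative, not our production settings, and the helper name is hypothetical):

```go
import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// withDeploymentRetries returns a Context whose Activities keep retrying
// with exponential backoff; fixing permissions or deploying new Worker
// code lets the next attempt pick up where the deployment left off.
func withDeploymentRetries(ctx workflow.Context) workflow.Context {
	return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    5 * time.Minute,
			// No MaximumAttempts: retry until success or Workflow timeout.
		},
	})
}
```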
I see Temporal Cloud on Google Cloud! It is so beautiful.
Yes! Temporal Cloud is now available on Google Cloud. You can go to your account and create a new Namespace in any of our currently available regions, or talk with our sales team about any other region you would be interested in seeing.
The final frontier?
We are still working on getting to feature parity with AWS. Coming soon are features such as private connectivity (a great example of something requiring a different architecture between AWS and GCP, especially with GCP’s limitations regarding proxy protocol v2 headers), API keys support, export, Nexus, and Multi-Region Namespaces.
We also have plans to get to Azure next, where we will be able to once again leverage our cloud-agnostic Workflows to build the underlying infrastructure. Our factory pattern and interface-based approach will let us easily extend our system to support Azure-specific services while maintaining a consistent API for our cloud-agnostic Workflows. However, this does not mean that the Cloud Provider itself won’t bring its share of challenges!
Key Takeaways
- Multi-cloud strategies are becoming essential for SaaS providers to meet diverse customer needs.
- Challenges in multi-cloud deployments include managing different APIs, ensuring consistency, and maintaining feature parity.
- Temporal's durable execution model can significantly reduce the complexity of managing distributed systems across multiple cloud providers.
- Using cloud-agnostic Workflows with provider-specific implementations can help maintain a clean, extensible codebase in multi-cloud environments.
- Temporal Cloud's journey to multi-cloud serves as a real-world example of these principles in action.