How to build a resilient platform control plane with Temporal

In Part 1, we covered why Temporal is a natural fit for platform control planes. Now let’s build one.

We’ll start with a basic infrastructure Workflow and then scale up to the Entity Workflow pattern: one long-running Workflow per managed resource. Along the way, we’ll cover message handling, Continue-as-New, error handling, compensation, and the design choices that matter once your control plane moves beyond a demo.

The examples use the Temporal Python SDK.

From scripts to Workflows

A typical infrastructure operation, such as provisioning a database, configuring networking, or registering DNS, maps cleanly onto a Temporal Workflow. Each interaction with an external API becomes an Activity. The Workflow coordinates the sequence, timeouts, retries, and error behavior.

Before the Temporal version, picture how this usually starts. A shell script runs the steps in order:

#!/usr/bin/env bash
set -e

# scale_database.sh: resize compute, repoint the load balancer, shift DNS.

# No retries, no logging, no record of progress, no rollback.
resize_compute "$DB_ID" "$INSTANCE_CLASS"
update_load_balancer "$DB_ID" "$INSTANCE_CLASS"
update_dns_weights "$DB_ID"

If that script dies after resizing compute but before the DNS update, nothing records how far it got, nothing retries the failed step, and nothing rolls the change back. You rerun it and hope every step is idempotent. The Temporal version below makes each step durable, observable, and recoverable.

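from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Shared Retry Policy for control-plane Activities.
CONTROL_PLANE_RETRY = RetryPolicy(
    maximum_attempts=5,
    backoff_coefficient=2.0,
)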

@workflow.defn
class ProvisionDatabaseWorkflow:
    @workflow.run
    async def run(self, request: ProvisionRequest) -> ProvisionResult:
        resource = await workflow.execute_activity(
            create_database,
            request.spec,
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=CONTROL_PLANE_RETRY,
        )

        await workflow.execute_activity(
            update_config,
            args=[resource.id, request.spec.config],
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=CONTROL_PLANE_RETRY,
        )

        await workflow.execute_activity(
            register_dns,
            args=[resource.id, request.dns_name],
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=CONTROL_PLANE_RETRY,
        )

        await workflow.execute_activity(
            notify_team,
            args=[request.team_slack_channel, resource.id],
            start_to_close_timeout=timedelta(minutes=2),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        return ProvisionResult(resource_id=resource.id, status="provisioned")

This is already a big improvement over a shell script. If a Worker crashes, the Workflow state is durable. If a cloud API times out, the Activity can retry. If the process spans minutes or hours, nobody has to keep a terminal session open.

But this is still a one-shot verb: provision this database and stop. Real infrastructure has a lifecycle. A database is provisioned, then scaled, reconfigured, health-checked, upgraded, rotated, and eventually deleted. That lifecycle can stretch across months or years. For that, you want a Workflow that lives as long as the resource does.

The Entity Workflow pattern

An Entity Workflow is a long-running Workflow Execution that maps one-to-one with a specific resource. If the one-shot Workflow is a verb (“provision this database”), an Entity Workflow is a noun (“this database”).

The Entity Workflow becomes the durable control-plane record for that resource. It tracks the resource’s current state, accepts commands, serializes side effects, reports status, and delegates complex operations to Activities, dedicated methods, or Child Workflows.

The state you keep inside the Workflow should be compact. Store the information needed to resume the lifecycle, not an unbounded audit log.

@dataclass
class ResourceState:
    status: str = "provisioning"
    resource_id: str = ""
    config: dict[str, Any] = field(default_factory=dict)

    marked_for_deletion: bool = False
    pending_commands: list[PendingCommand] = field(default_factory=list)

    # Bounded idempotency/result cache for recent client requests.
    completed_operations: dict[str, OperationResult] = field(default_factory=dict)

A real implementation would also define request types such as ScaleRequest, ConfigUpdateRequest, PendingCommand, and OperationResult. The important design choice is the shape of the state: current resource status, desired/current configuration, queued commands, and a bounded record of recently completed requests.
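
As a rough illustration of those request types, here is a minimal sketch using plain dataclasses. The field names, the `payload` dictionary, and the `from_command` / `scale` helpers are assumptions chosen to line up with how the snippets in this post call them; they are not a published schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScaleRequest:
    request_id: str
    instance_class: str
    storage_gib: int

    @classmethod
    def from_command(cls, command: "PendingCommand") -> "ScaleRequest":
        # Rebuild the typed request from the queued command's payload.
        return cls(request_id=command.request_id, **command.payload)


@dataclass
class ConfigUpdateRequest:
    request_id: str
    changes: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_command(cls, command: "PendingCommand") -> "ConfigUpdateRequest":
        return cls(request_id=command.request_id, changes=dict(command.payload))


@dataclass
class PendingCommand:
    kind: str  # "scale" or "config" in this sketch
    request_id: str
    payload: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def scale(cls, request: ScaleRequest) -> "PendingCommand":
        return cls(
            kind="scale",
            request_id=request.request_id,
            payload={
                "instance_class": request.instance_class,
                "storage_gib": request.storage_gib,
            },
        )


@dataclass
class OperationResult:
    request_id: str
    status: str
    message: str = ""
```

Keeping the queued command as a small `kind` plus `payload` record means the state that crosses Continue-as-New stays compact and easy to serialize.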

Message handling: Updates, Signals, and Queries

External systems interact with the Entity Workflow through Temporal messages.

Client intent	Temporal primitive	Control-plane use
Read current state	Query	Get status, observed config, pending work, last operation result
Submit a tracked mutation	Update	Scale, reconfigure, rotate credentials, request an upgrade
Submit a fire-and-forget command	Signal	Begin deletion, cancel a background loop, notify of an external event

Use Updates when the caller needs validation, acknowledgement, and a result. Use Signals when the caller only needs the server to accept the message. Use Queries for read-only state. For an operation that can run for minutes, return from the Update once the work is accepted and let callers poll a Query for the result, since a single Workflow Execution allows only 10 in-flight Updates at a time.

@workflow.defn
class DatabaseEntityWorkflow:
    def __init__(self) -> None:
        self.state = ResourceState()

    @workflow.update
    async def scale(self, request: ScaleRequest) -> OperationResult:
        self._enqueue_once(PendingCommand.scale(request))

        # Wait briefly for a fast result. Don’t hold an Update slot open for
        # a long operation; a Workflow allows only 10 in-flight Updates.
        try:
            await workflow.wait_condition(
                lambda: request.request_id in self.state.completed_operations,
                timeout=timedelta(seconds=5),
            )
            return self.state.completed_operations[request.request_id]
        except asyncio.TimeoutError:
            return OperationResult(
                request_id=request.request_id,
                status="accepted",
                message="scale in progress; poll get_status for the result",
            )

    @scale.validator
    def validate_scale(self, request: ScaleRequest) -> None:
        if self.state.marked_for_deletion:
            raise ApplicationError(
                "database is deleting",
                type="InvalidState",
                non_retryable=True,
            )
        if request.storage_gib <= 0:
            raise ApplicationError(
                "storage_gib must be positive",
                type="InvalidScaleRequest",
                non_retryable=True,
            )

    @workflow.signal
    def delete(self) -> None:
        self.state.marked_for_deletion = True

    @workflow.query
    def get_status(self) -> ResourceStatus:
        return ResourceStatus.from_state(self.state)

There are three details worth calling out.

The Update handler enqueues a command; it does not run Activities directly. The main Workflow loop owns side effects. That makes ordering explicit and avoids multiple async handlers changing infrastructure concurrently.
The Update validator rejects bad requests before the Update is accepted into the Event History. You can still validate inside the handler if the validation depends on later state, but simple state checks and input checks are cleaner as validators.
The request carries an application-level request_id. Temporal can deduplicate retried Update calls by Update ID, and the client should set that ID. The application-level ID is still useful because it lets the Workflow recognize the same operation after Continue-as-New and lets different client surfaces correlate the same logical request.
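
The `_enqueue_once` helper used by the Update handler is not shown in this post. One possible shape for it, written as a standalone function so it can run without a Temporal Worker, is sketched below. The `QueueState` type and the `in_flight_request_id` field are assumptions added for the sketch: because the main loop pops a command before processing it, a duplicate that arrives mid-operation would otherwise look new.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Command:
    kind: str
    request_id: str


@dataclass
class QueueState:
    # Hypothetical, trimmed-down view of the Workflow state.
    pending_commands: list[Command] = field(default_factory=list)
    completed_request_ids: set[str] = field(default_factory=set)
    in_flight_request_id: Optional[str] = None


def enqueue_once(state: QueueState, command: Command) -> bool:
    """Queue the command unless its request_id is already known.

    Returns True if the command was queued, False if it was a duplicate.
    """
    seen = (
        command.request_id in state.completed_request_ids
        or command.request_id == state.in_flight_request_id
        or any(p.request_id == command.request_id for p in state.pending_commands)
    )
    if not seen:
        state.pending_commands.append(command)
    return not seen
```

Inside the Entity Workflow the same check would run against `self.state`, and the duplicate path would fall through to the existing wait on `completed_operations`.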

The main loop: Serialize side effects in one place

The Entity Workflow’s main loop is the heart of the control plane. It waits until there is work to do, processes one command at a time, runs periodic health checks, and rotates history with Continue-as-New when appropriate.

In Python, workflow.wait_condition(..., timeout=...) raises asyncio.TimeoutError when the timeout expires. Catching that exception gives the Workflow a clean timer tick without failing the Execution.

@workflow.run
async def run(self, input: DatabaseEntityInput) -> None:
    self.state = input.state or await self._provision(input.spec)

    while True:
        health_check_due = False

        try:
            await workflow.wait_condition(
                lambda: self.state.marked_for_deletion
                or bool(self.state.pending_commands),
                timeout=timedelta(minutes=10),
                timeout_summary="database-entity-tick",
            )
        except asyncio.TimeoutError:
            health_check_due = True

        if self.state.marked_for_deletion:
            break

        if self.state.pending_commands:
            await self._process_next_command()

        if health_check_due:
            await self._run_health_check()

        if self._ready_to_continue_as_new():
            await workflow.wait_condition(workflow.all_handlers_finished)
            workflow.continue_as_new(
                DatabaseEntityInput(spec=input.spec, state=self.state)
            )

    await self._delete_database()

The command processor keeps infrastructure mutations serialized. A small config update can stay an Activity. A larger operation, such as scaling a database, runs as a dedicated async method on the Entity Workflow that records completed steps and compensates on failure.

async def _process_next_command(self) -> None:
    command = self.state.pending_commands.pop(0)

    try:
        if command.kind == "config":
            request = ConfigUpdateRequest.from_command(command)
            self.state.status = "updating"

            await workflow.execute_activity(
                apply_config_update,
                args=[self.state.resource_id, request.changes],
                start_to_close_timeout=timedelta(minutes=10),
                retry_policy=CONTROL_PLANE_RETRY,
            )
            self.state.config |= request.changes
            self._remember_result(request.request_id, "applied")

        elif command.kind == "scale":
            request = ScaleRequest.from_command(command)
            self.state.status = "scaling"

            await self._scale_database(request)
            self._apply_scale_to_state(request)
            self._remember_result(request.request_id, "applied")

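    except ApplicationError as err:
        # _scale_database raises ApplicationError after compensating. Catch it
        # here so one failed scale does not fail the whole Entity Workflow.
        self._remember_result(command.request_id, "failed", str(err))
    except (ActivityError, ChildWorkflowError) as err:
        self._remember_result(command.request_id, "failed", str(err.cause))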
    finally:
        if not self.state.marked_for_deletion:
            self.state.status = "running"

This is the central move: handlers accept messages, but the main loop performs side effects. That gives the Entity Workflow a single place to enforce ordering, deduplication, state transitions, and Continue-as-New.

Handling long-running Workflows with Continue-as-New

Entity Workflows are meant to live a long time, so their Event Histories keep growing. Every Activity, Update, Signal, timer, and Child Workflow event adds history. Continue-as-New lets a Workflow close the current run and atomically start a new run with the same Workflow ID, a new Run ID, and a fresh Event History.

Build Continue-as-New into the Entity Workflow from the first version. Retrofitting it after long-lived entities are already in production is possible, but it is harder than starting with the pattern in place.

A few practical rules help:

Continue-as-New from the main Workflow method, not from a Signal or Update handler.
Wait for active handlers to finish before continuing as new.
Carry forward only compact state needed to resume the lifecycle.
Keep large audit logs, plans, diffs, and payloads outside Workflow state, and store references to them instead.
Keep addressing the entity by Workflow ID, not Run ID. Continue-as-New changes the Run ID.
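
To make the "store references" rule concrete, here is a small sketch that writes a large plan to an external store and returns only a key for the Workflow state. The in-memory `dict` stands in for object storage, and the `offload_payload` / `load_payload` names and the content-hash key scheme are assumptions for illustration. In a real control plane this I/O would run inside an Activity, not in Workflow code.

```python
import hashlib
import json
from typing import Any


def offload_payload(store: dict[str, bytes], prefix: str, payload: dict[str, Any]) -> str:
    """Write the payload to the store and return a compact reference key."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    # A content hash makes the write idempotent: the same payload yields the same key.
    key = f"{prefix}/{hashlib.sha256(body).hexdigest()}"
    store[key] = body
    return key


def load_payload(store: dict[str, bytes], key: str) -> dict[str, Any]:
    """Fetch a payload back by its reference key."""
    return json.loads(store[key].decode("utf-8"))
```

The Workflow state then carries a short string per plan or diff, which keeps the Continue-as-New input small no matter how large the stored document grows.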

The request_id cache should also be bounded. You want enough history to answer duplicate client requests, not an ever-growing map of every operation the resource has ever seen.

def _remember_result(
    self,
    request_id: str,
    status: str,
    message: str = "",
) -> None:
    self.state.completed_operations[request_id] = OperationResult(
        request_id=request_id,
        status=status,
        message=message,
    )

    while len(self.state.completed_operations) > MAX_COMPLETED_OPERATIONS:
        oldest_request_id = next(iter(self.state.completed_operations))
        del self.state.completed_operations[oldest_request_id]

def _ready_to_continue_as_new(self) -> bool:
    return (
        not self.state.pending_commands
        and workflow.info().is_continue_as_new_suggested()
    )

Continue-as-New also helps with deployability. Long-running Workflows naturally span code versions. Periodic state handoff reduces how long any one run stays on an older version of your code.

Interacting with Entity Workflows

External systems such as a developer portal, CLI, or Slack bot interact with each resource through the Temporal Client. Derive the Workflow ID from a stable, non-sensitive resource identifier so there can be only one open Entity Workflow for that resource in the Namespace.

client = await Client.connect("localhost:7233")
workflow_id = f"database/{database_id}"  # Use a stable, non-PII ID.

try:
    handle = await client.start_workflow(
        DatabaseEntityWorkflow.run,
        DatabaseEntityInput(spec=database_spec),
        id=workflow_id,
        task_queue="control-plane",
    )
except WorkflowAlreadyStartedError:
    handle = client.get_workflow_handle_for(
        DatabaseEntityWorkflow.run,
        workflow_id,
    )

status = await handle.query(DatabaseEntityWorkflow.get_status)

scale_result = await handle.execute_update(
    DatabaseEntityWorkflow.scale,
    ScaleRequest(
        request_id="req-1042",
        instance_class="db.r6g.xlarge",
        storage_gib=200,
    ),
    id="req-1042",  # Temporal Update ID for client retry deduplication.
)

await handle.signal(DatabaseEntityWorkflow.delete)

The Workflow ID is a core design decision. Temporal guarantees that only one Workflow Execution with a given Workflow ID can be open in a Namespace at a time. If a second caller tries to start the same entity, the client can reconnect to the existing Workflow instead of creating a competing control-plane record.
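
Because every client surface has to arrive at the same Workflow ID, it can help to centralize the derivation in one helper. The sketch below is one way to do that; the function name and the allowed character set are assumptions, and the `type/id` layout simply mirrors the `database/{database_id}` format used above.

```python
import re

# Assumed policy for this sketch: letters, digits, underscores, and hyphens only.
# Rejecting anything else keeps emails and free-form names out of the ID.
_SAFE_PART = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def entity_workflow_id(resource_type: str, resource_id: str) -> str:
    """Build a deterministic Workflow ID such as 'database/db-7f3a'."""
    for part in (resource_type, resource_id):
        if not _SAFE_PART.fullmatch(part):
            raise ValueError(f"unsupported characters in Workflow ID part: {part!r}")
    return f"{resource_type}/{resource_id}"
```

A portal, CLI, and Slack bot that all call the same helper will address the same Entity Workflow for a given resource.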

For lazy creation, use Signal-With-Start or Update-With-Start. That lets the first message create the Entity Workflow if it does not already exist. This is useful when resources can be discovered from the outside or when the developer portal should not pre-create a control-plane record for every possible resource. For self-hosted clusters, check your Temporal Server version before relying on Update-With-Start.

Error handling and compensation

Infrastructure changes fail in messy ways. A scale request might resize compute successfully and then fail while reconfiguring a load balancer. Your control plane needs two layers of defense.

The first layer is Activity retry. Network glitches, rate limits, and transient cloud API failures usually belong in Activity Retry Policies. That keeps the happy path clean and avoids filling Workflow code with low-level retry loops.

The second layer is Workflow-level recovery. When an operation has several steps and only some of them succeed, retrying the whole Workflow is usually the wrong answer. You need compensation.

A good way to structure this is to move the multi-step operation into a dedicated async method on the Entity Workflow and let that method own the saga logic.

The method records which forward steps completed and compensates in reverse order.

    async def _scale_database(self, request: ScaleRequest) -> None:
        completed_steps: list[str] = []
        try:
            await workflow.execute_activity(
                resize_compute,
                args=[
                    self.state.resource_id,
                    request.instance_class,
                    request.storage_gib,
                ],
                start_to_close_timeout=timedelta(minutes=10),
                retry_policy=CONTROL_PLANE_RETRY,
            )
            completed_steps.append("compute")
            await workflow.execute_activity(
                update_load_balancer,
                args=[self.state.resource_id, request.instance_class],
                start_to_close_timeout=timedelta(minutes=5),
                retry_policy=CONTROL_PLANE_RETRY,
            )
            completed_steps.append("load_balancer")
            await workflow.execute_activity(
                update_dns_weights,
                self.state.resource_id,
                start_to_close_timeout=timedelta(minutes=5),
                retry_policy=CONTROL_PLANE_RETRY,
            )
            completed_steps.append("dns")
        except ActivityError as err:
            await self._compensate(completed_steps, request)
            raise ApplicationError(
                f"scale failed and was rolled back: {err.cause}",
                type="ScaleOperationFailed",
                non_retryable=True,
            ) from err

Compensation should undo the work that actually happened, not blindly run a fixed rollback sequence. Both forward Activities and compensating Activities must be idempotent. Temporal may retry them, and retry behavior is only safe if repeated execution preserves a valid end state.
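
The `_compensate` method is not shown in this post. The sketch below illustrates the reverse-order idea with plain `asyncio` so it runs without a Temporal Worker. The `undo_by_step` mapping and the demo names are assumptions; in the Entity Workflow each undo would be a `workflow.execute_activity` call to an idempotent compensating Activity with its own Retry Policy.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical type: one undo coroutine per forward step name.
Undo = Callable[[], Awaitable[None]]


async def compensate(completed_steps: list[str], undo_by_step: dict[str, Undo]) -> list[str]:
    """Undo only the steps that completed, newest first."""
    undone: list[str] = []
    for step in reversed(completed_steps):
        await undo_by_step[step]()
        undone.append(step)
    return undone


def run_demo() -> list[str]:
    """Simulate a scale where compute and load_balancer succeeded and dns failed."""
    log: list[str] = []

    def make_undo(name: str) -> Undo:
        async def undo() -> None:
            log.append(f"undo:{name}")

        return undo

    undo_by_step = {name: make_undo(name) for name in ("compute", "load_balancer", "dns")}
    asyncio.run(compensate(["compute", "load_balancer"], undo_by_step))
    return log
```

In this simulation the DNS step never completed, so its undo never runs. The load balancer is reverted first and compute second, the mirror image of the forward order.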

Also notice what is not retried here: the Workflow itself. In most control-plane cases, you do not want a Workflow Retry Policy on the Entity Workflow. Retries belong on Activities; multi-step recovery belongs in explicit Workflow logic.

Activities, methods, and Child Workflows: When to use each

As your control plane grows, you will repeatedly make one architectural call: should this operation be an Activity, a method on the Entity Workflow, or a Child Workflow?

Use an Activity when...	Use a Child Workflow when...
The operation is one side effect	The operation has its own lifecycle
The timeout and retry policy are straightforward	The operation needs its own Task Queue or Worker pool
The result can be handled directly by the parent	The operation runs independently or in parallel with its own retries
The Event History added to the parent is acceptable	The operation deserves its own Event History
Example: update DNS, call a cloud API, write metadata	Example: migrate a cluster, perform a staged upgrade, fan out across regions

In the example above, apply_config_update stays an Activity because it is conceptually one operation. scale stays in the Entity Workflow as a dedicated async method because it is a bounded multi-step operation: resize compute, update the load balancer, update DNS, and compensate in reverse order if needed. Reach for a Child Workflow when an operation needs its own Event History, lifecycle, or Task Queue.

That separation keeps the Entity Workflow focused on what it should own: resource state, command ordering, idempotency, and high-level orchestration.

Production design checklist

A resilient control plane is more than a happy-path Workflow. Before you ship the pattern, pressure-test these choices.

Idempotency. Every Activity that talks to a cloud API, IaC tool, DNS provider, ticketing system, or chat system should be safe to retry. Use provider-supported idempotency keys where possible. Otherwise, design the Activity to check current external state before mutating it.
State size. Workflow state should contain compact control-plane facts: resource IDs, desired config, observed status, pending commands, and bounded result caches. Store large plans, diffs, logs, generated manifests, and audit history in object storage or a database, and keep references in Workflow state.
Message ordering. Decide whether commands should be strictly serialized, partially concurrent, or coalesced. Many infrastructure resources should serialize mutations. Some operations, such as health checks or read-only drift detection, can run independently if they do not change the same external state.
Deletion semantics. Treat deletion as a lifecycle transition, not just a final API call. Decide what happens to pending commands, in-flight Child Workflows, final notifications, cleanup failures, and resources that are already partially gone.
Observability. Use resource-oriented Workflow IDs, Search Attributes, logs, and metrics so operators can answer questions like: Which databases are scaling? Which upgrades are stuck? Which team owns this resource? Which operation failed and why?
Security. Do not put secrets or sensitive user data in Workflow IDs, Task Queue names, Activity names, Update names, Search Attributes, or other identifiers that appear in UI, CLI output, logs, or Event History. Pass sensitive payloads through an appropriate Data Converter or keep them outside Temporal and pass references.
IaC integration. Terraform, Pulumi, and similar tools are still valuable. Temporal does not replace them. Temporal gives you durable orchestration around them: approvals, retries, timeouts, compensation, scheduling, health checks, and cross-system coordination.
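
For the idempotency item in the checklist above, the "check current external state before mutating it" approach might look like the following sketch. `FakeDnsProvider` is an in-memory stand-in invented for the example, not a real client library, and `register_dns_idempotent` is a hypothetical body for a DNS Activity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeDnsProvider:
    """In-memory stand-in for a DNS API, used only for this sketch."""

    records: dict[str, str] = field(default_factory=dict)
    writes: int = 0

    def get(self, name: str) -> Optional[str]:
        return self.records.get(name)

    def put(self, name: str, target: str) -> None:
        self.records[name] = target
        self.writes += 1


def register_dns_idempotent(provider: FakeDnsProvider, name: str, target: str) -> str:
    """Create or update a record, skipping the write if it already matches."""
    current = provider.get(name)
    if current == target:
        return "unchanged"  # A retry after a successful write lands here.
    provider.put(name, target)
    return "created" if current is None else "updated"
```

If a Worker crashes after the provider accepted the write but before the Activity reported success, the retry reads the record, sees it already matches, and returns without writing again.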

Putting it all together

A Temporal-backed control plane has a simple shape:

A developer portal, CLI, Slack bot, or API server derives a deterministic Workflow ID from the managed resource.
The Client starts the Entity Workflow, reconnects to it, or uses Signal-With-Start / Update-With-Start.
Queries expose current state.
Updates submit tracked mutations and return results.
Signals submit fire-and-forget lifecycle events.
The Entity Workflow serializes side effects in its main loop.
Activities perform single external operations.
Dedicated methods on the Entity Workflow own multi-step operations and compensation. Reach for Child Workflows when an operation needs its own Event History, lifecycle, or Task Queue.
Continue-as-New keeps long-lived entities healthy over time.

That is the core pattern: one durable Workflow per resource, compact state, explicit messages, serialized side effects, idempotent Activities, and dedicated methods for complex multi-step operations.

Wrap-up

If you are building a platform control plane, model each managed resource as an Entity Workflow. Give it a deterministic Workflow ID. Make it the durable control-plane record for desired state, observed status, and in-flight work. Prefer Updates when callers need validation and a result. Use Signals when they do not. Keep Queries read-only. Keep side effects in the main loop. Keep complex multi-step operations in dedicated methods on the Entity Workflow that own their compensation logic, and reach for Child Workflows when an operation needs its own Event History, lifecycle, or Task Queue.

Most importantly, design for the long haul on day one. Long-lived entities need Continue-as-New, bounded state, idempotent Activities, explicit recovery paths, and a message model that still makes sense after thousands of commands. Temporal gives you the building blocks. The control plane becomes much easier to reason about once you use them in the right places.

Want to see these ideas in action? Explore the Infrastructure Provisioning sample with Terraform, the Infrastructure Provisioning sample with Pulumi, and the Terraforming Temporal Cloud Resources sample. And if you are ready to build your control plane on Temporal Cloud, get started here.