Deploying Temporal Workers to Amazon ECS

Author: Ebenezer Ankrah
Date: Apr 02, 2026
Duration: 6 min

Temporal Workers operate differently from typical web services. They long-poll the Temporal Server for tasks and execute them locally. Because Workers don’t receive inbound HTTP traffic, there’s no load balancer to route requests and no request-per-second metric to scale on. The Worker must stay running, connected, and ready.

Amazon ECS with Fargate is a natural fit for this kind of workload. With the right task definition, IAM configuration, and health checks, getting a Temporal Worker running on ECS Fargate is simple. This guide covers the required ECS configuration and key decisions. A complete working example is available in the temporal-ecs repository.

Configuring your application

The Worker container

A Temporal Worker is a long-running process that imports your Workflows and Activities, connects to Temporal Cloud, and calls worker.run(). Here’s a minimal Dockerfile:

FROM python:3.13-slim

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY --from=ghcr.io/astral-sh/uv:0.9.9 /uv /usr/local/bin/uv
WORKDIR /app

COPY pyproject.toml uv.lock ./
RUN --mount=type=cache,target=/root/.cache/uv uv sync --locked --no-dev
COPY . .

HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

CMD ["uv", "run", "-m", "worker"]

Copying pyproject.toml and uv.lock before the source code lets Docker cache the dependency layer across builds: source-only changes reuse it, and only a change to pyproject.toml or uv.lock triggers a reinstall.

Health checks

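ECS needs to know if your Worker is alive. Since there’s no load balancer, you rely on a container-level health check. Note that ECS does not act on a HEALTHCHECK directive baked into the image; it only evaluates the healthCheck command declared in the container definition of the task definition, so declare the same command there. Add a lightweight HTTP endpoint inside your Worker process: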

from aiohttp import web

async def _start_health_server():
    async def handle_health(_):
        return web.Response(text="OK")

    app = web.Application()
    app.router.add_get("/health", handle_health)
    runner = web.AppRunner(app)
    await runner.setup()
    site = web.TCPSite(runner, "0.0.0.0", 8080)
    await site.start()

Start this alongside your Worker with asyncio.create_task(_start_health_server()). If the Worker process crashes, the health endpoint stops responding, and ECS replaces the task.

Graceful shutdown

When ECS stops a task during a deployment, scale-in, or Spot reclamation, it sends SIGTERM. Handle it by draining the Worker:

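import asyncio
import signal

async def shutdown_handler(worker):
    # Stop polling and drain in-flight work.
    await worker.shutdown()

def install_signal_handlers(worker):
    # Call this from inside the running event loop (for example at the top of
    # your async main()); asyncio.get_running_loop() raises RuntimeError otherwise.
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(
            sig,
            lambda: asyncio.create_task(shutdown_handler(worker)),
        )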

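worker.shutdown() stops polling and gives in-progress Activities time to finish. In the Python SDK that window is the Worker’s graceful_shutdown_timeout, after which Activities that are still running are cancelled. ECS sends SIGKILL 30 seconds after SIGTERM by default. Raise stopTimeout in the task definition if your Activities need more; on Fargate the maximum is 120 seconds.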
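To see the whole signal flow in one runnable piece, here is a sketch that wires the handlers around worker.run(). It assumes only that the worker object exposes async run() and shutdown() methods, as the Temporal Python SDK's Worker does. FakeWorker is a hypothetical stand-in so the example runs without a Temporal connection, and run_with_graceful_shutdown is an illustrative name.

```python
import asyncio
import signal


async def run_with_graceful_shutdown(worker) -> None:
    """Run `worker` until it exits, draining it on SIGINT or SIGTERM."""
    loop = asyncio.get_running_loop()
    shutdown_tasks = set()  # strong references so the tasks are not garbage collected

    def request_shutdown() -> None:
        if not shutdown_tasks:  # ignore repeated signals
            shutdown_tasks.add(asyncio.create_task(worker.shutdown()))

    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, request_shutdown)
    try:
        await worker.run()  # returns once the worker has shut down
        if shutdown_tasks:
            await asyncio.gather(*shutdown_tasks)
    finally:
        for sig in (signal.SIGINT, signal.SIGTERM):
            loop.remove_signal_handler(sig)


class FakeWorker:
    """Hypothetical stand-in for a Temporal Worker, used only for this demo."""

    def __init__(self) -> None:
        self._stop = asyncio.Event()
        self.drained = False

    async def run(self) -> None:
        await self._stop.wait()  # a real Worker would poll its Task Queue here

    async def shutdown(self) -> None:
        self.drained = True  # a real Worker would wait for in-flight work here
        self._stop.set()
```

Keeping the shutdown task in a set avoids relying on the event loop's weak reference to it, and the guard makes a second SIGTERM a no-op. loop.add_signal_handler is only available on Unix, which matches the Linux containers Fargate runs.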

Infrastructure

The ECS task definition

The task definition is where you declare what the Worker looks like to ECS. Two things stand out compared to a typical web service:

No port mappings. The Worker makes only outbound connections like gRPC to Temporal Cloud (port 7233) and HTTPS to AWS services (port 443). Nothing connects inbound.

Secrets via SSM Parameter Store. ECS natively integrates with SSM. The execution role reads parameters at task startup and injects them as environment variables:

secrets:
  - name: TEMPORAL_HOST
    valueFrom: arn:aws:ssm:us-west-2:123456789012:parameter/prod/temporal/host
  - name: TEMPORAL_NAMESPACE
    valueFrom: arn:aws:ssm:us-west-2:123456789012:parameter/prod/temporal/namespace
  - name: TEMPORAL_TLS_CERT
    valueFrom: arn:aws:ssm:us-west-2:123456789012:parameter/prod/temporal/tls-cert
  - name: TEMPORAL_TLS_KEY
    valueFrom: arn:aws:ssm:us-west-2:123456789012:parameter/prod/temporal/tls-key

Your application reads them with os.environ["TEMPORAL_TLS_CERT"]. Never put secrets in Terraform variables or .env files.
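As a rough illustration of how these pieces fit into one container definition, the sketch below builds the dictionary you might pass in the containerDefinitions list of a task definition, for example through boto3's register_task_definition. The account ID, region, and parameter paths mirror the secrets example above. The container name, log group, stop timeout, and health check timings are assumptions chosen for illustration, and worker_container_definition is a hypothetical helper.

```python
# Hypothetical values: the SSM prefix mirrors the secrets example; the
# container name and log group are placeholders.
SSM_PREFIX = "arn:aws:ssm:us-west-2:123456789012:parameter/prod/temporal"
SECRET_PARAMETERS = {
    "TEMPORAL_HOST": "host",
    "TEMPORAL_NAMESPACE": "namespace",
    "TEMPORAL_TLS_CERT": "tls-cert",
    "TEMPORAL_TLS_KEY": "tls-key",
}
FARGATE_MAX_STOP_TIMEOUT = 120  # seconds


def worker_container_definition(image: str, stop_timeout: int = 60) -> dict:
    """Build one entry for the task definition's containerDefinitions list."""
    if not 0 <= stop_timeout <= FARGATE_MAX_STOP_TIMEOUT:
        raise ValueError("stopTimeout on Fargate must be between 0 and 120 seconds")
    return {
        "name": "temporal-worker",
        "image": image,
        "essential": True,
        # No "portMappings" key: nothing connects inbound.
        "secrets": [
            {"name": env_name, "valueFrom": f"{SSM_PREFIX}/{suffix}"}
            for env_name, suffix in SECRET_PARAMETERS.items()
        ],
        # ECS evaluates this command, not a HEALTHCHECK baked into the image.
        "healthCheck": {
            "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
            "interval": 30,
            "timeout": 5,
            "retries": 3,
            "startPeriod": 10,
        },
        # Seconds between SIGTERM and SIGKILL when ECS stops the task.
        "stopTimeout": stop_timeout,
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/ecs/temporal-worker",
                "awslogs-region": "us-west-2",
                "awslogs-stream-prefix": "worker",
            },
        },
    }
```

The same structure maps directly onto the container_definitions JSON in Terraform if that is how you manage the task definition.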

IAM: Two roles, two purposes

ECS uses two distinct IAM roles, and confusing them is one of the most common deployment mistakes:

  • Execution role: Used by the ECS agent (not your code) to pull the Docker image from ECR, read secrets from SSM, and ship logs to CloudWatch.
  • Task role: Assumed by the running container. This is what your Worker uses at runtime to talk to S3, Athena, or any other AWS service.

The mental model: execution role = ECS plumbing. Task role = your application logic. Keep them separate and scoped tightly.
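The split can be made concrete with two policy documents. This is a sketch, not a complete policy set: the ARNs are supplied by the caller, the S3 read permission on the task role is a hypothetical example of application access, and the function names are illustrative. If your parameters are SecureStrings encrypted with a customer-managed KMS key, the execution role would also need kms:Decrypt on that key.

```python
import json

# Both roles are assumed by ECS tasks through the same service principal.
ECS_TASKS_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def execution_role_policy(repository_arn: str, parameter_arns: list, log_group_arn: str) -> dict:
    """ECS plumbing: pull the image, read SSM parameters, write logs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Registry login is not resource-scoped, so it needs "*".
            {"Effect": "Allow", "Action": "ecr:GetAuthorizationToken", "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": repository_arn,
            },
            {"Effect": "Allow", "Action": "ssm:GetParameters", "Resource": parameter_arns},
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group_arn}:*",
            },
        ],
    }


def task_role_policy(bucket_arn: str) -> dict:
    """Application logic: only what the Worker's own code calls at runtime."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:GetObject", "Resource": f"{bucket_arn}/*"}
        ],
    }


def to_policy_document(policy: dict) -> str:
    """IAM APIs take policy documents as JSON strings."""
    return json.dumps(policy, indent=2)
```

Note that the task role contains no ECR, SSM, or CloudWatch Logs permissions at all. If the Worker's code never calls an AWS service, the task role policy can be empty.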

The ECS service

desired_count: 2
capacity_provider_strategy:
  - capacity_provider: FARGATE_SPOT
    weight: 80
  - capacity_provider: FARGATE
    weight: 20
    base: 1
network_configuration:
  subnets: [private-subnet-ids]
  security_groups: [worker-sg]
  assign_public_ip: false

There’s no load_balancer block. The Worker doesn’t accept inbound connections, which dramatically simplifies the infrastructure compared to a typical ECS service.

Fargate Spot

Temporal Workers are excellent candidates for Fargate Spot:

  • Workers are stateless. All durable state lives in the Temporal Server. If a Spot task is reclaimed, Temporal retries any interrupted Activities on another Worker once they time out, according to their Retry Policy.
  • Graceful shutdown covers interruptions. Fargate Spot sends SIGTERM with a two-minute warning, so a Worker that handles the signal can finish or hand off its current tasks before the task stops.
  • Cost savings are substantial. Fargate Spot is typically 50–70% cheaper than on-demand.

The base = 1 on the FARGATE provider ensures you always have at least one on-demand task as a reliability floor.
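To get a feel for how base and weight interact, the back-of-the-envelope sketch below computes the expected share of tasks per provider: the base is satisfied first, then the remaining tasks are divided in proportion to the weights. This is an approximation under that assumption. ECS places whole tasks, so real counts are rounded, and Spot capacity may not always be available. expected_split is a hypothetical helper.

```python
STRATEGY = [
    {"capacity_provider": "FARGATE_SPOT", "weight": 80},
    {"capacity_provider": "FARGATE", "weight": 20, "base": 1},
]


def expected_split(desired_count: int, strategy: list) -> dict:
    """Return the expected (fractional) number of tasks per capacity provider."""
    split = {}
    placed = 0
    # Step 1: satisfy each provider's base before any weighting applies.
    for provider in strategy:
        base = min(provider.get("base", 0), desired_count - placed)
        split[provider["capacity_provider"]] = float(base)
        placed += base
    # Step 2: distribute what is left in proportion to the weights.
    remaining = desired_count - placed
    total_weight = sum(provider["weight"] for provider in strategy)
    for provider in strategy:
        split[provider["capacity_provider"]] += remaining * provider["weight"] / total_weight
    return split
```

With the strategy above and desired_count = 2, the one task beyond the base lands on Spot roughly 80% of the time. At 10 tasks the sketch gives about 7.2 Spot and 2.8 on-demand on average.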

Scaling

Scaling Temporal Workers requires different thinking than scaling web services. There’s no request rate to track, and Workers doing I/O-bound work won’t show high CPU even when fully loaded. Standard metrics can be misleading.

The simplest starting point is ECS Application Auto Scaling on CPU utilization:

  • Set a low CPU target (~25%). A low target scales out before the Worker becomes a bottleneck, which matters especially for I/O-heavy workloads where CPU alone understates load.
  • Use asymmetric cooldowns. Scale out fast (60s) and scale in slow (300s), so the service responds quickly to spikes without thrashing.
  • Set min_capacity to 2. If one Worker is replaced during deployment, the other keeps processing.
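The three settings above map onto two Application Auto Scaling calls. The sketch below only builds the request parameters; it assumes you would pass them as keyword arguments to the register_scalable_target and put_scaling_policy operations of boto3's application-autoscaling client. The cluster and service names, the policy name, and the max_capacity of 10 are hypothetical.

```python
def cpu_scaling_config(cluster: str, service: str, max_capacity: int = 10) -> tuple:
    """Return (scalable_target, scaling_policy) request parameters."""
    common = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
    }
    scalable_target = {
        **common,
        "MinCapacity": 2,  # one Worker keeps processing while the other is replaced
        "MaxCapacity": max_capacity,
    }
    scaling_policy = {
        **common,
        "PolicyName": f"{service}-cpu-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": 25.0,  # low CPU target, in percent
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            "ScaleOutCooldown": 60,  # seconds: scale out fast
            "ScaleInCooldown": 300,  # seconds: scale in slow
        },
    }
    return scalable_target, scaling_policy
```

Building the parameters separately from the API call keeps them easy to review and to mirror in Terraform's aws_appautoscaling_target and aws_appautoscaling_policy resources.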

Observability

ECS gives you observability with minimal setup:

  • CloudWatch logs: Attach a log group to the task definition and structure your Worker logs with a consistent format.
  • Container insights: Enable it on the ECS cluster for CPU, memory, and network metrics without additional instrumentation.
  • Temporal Cloud metrics: Workflow completion rates, Activity latencies, Task Queue depth, and Schedule lag are available in the Temporal UI.

If a Workflow is slow, check Temporal’s UI for Activity retries and timeouts. If a task won’t start, check CloudWatch for OOM kills or failed health checks.
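For the "consistent format" part, one common choice is one JSON object per line on stdout, which the awslogs driver forwards to CloudWatch where it can be filtered by field. A minimal sketch using only the standard library follows; the field names and the JsonFormatter and configure_logging helpers are illustrative choices, not a Temporal or AWS convention.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same JSON line.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.INFO) -> None:
    """Send all logs to stdout as JSON, replacing any existing root handlers."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

Call configure_logging() once at Worker startup, before creating the Worker, so that library loggers inherit the same handler.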

When to Use ECS vs. EKS

We recommend starting with ECS and Fargate. It works well when your team doesn’t already run Kubernetes, you have a small number of Workers (1–10 tasks), and you want tight AWS integration with IAM, SSM, and CloudWatch working natively.

Consider graduating to EKS when you’re already running Kubernetes, need the Temporal Worker Versioning operator, or your fleet exceeds ~50 tasks and requires granular scheduling.

The Worker code doesn’t change; only the infrastructure around it does.


Have questions about deploying Temporal Workers on ECS? Join us in the #operations channel on the Temporal Community Slack.
