| This guest post was co-written by Jacques Lemire (Founding Principal Engineer), Volkan Gurel (CEO & Co-founder), and Alex Engel (VP of Product) at Layer.ai |
|---|
Last year, we talked with the Temporal team and published a great article about how we use this platform to power our AI asset generation pipeline. If you read it, then you know the story: mobile game studios need prod-ready assets and they need them fast, so we needed a way to run heavy async workloads without completely melting our servers… and Temporal's.
That story is still true, and now it has a sequel!
In the year since that blog went live, Layer went from a tool game artists use to generate a single asset to something more like a creative operating system for entire studios.
We've grown a lot, and the best part is that Temporal was able to grow with us. What started out as a handful of workflows is now around 50 workflow types running in production, orchestrating everything from image generation to LoRA fine-tuning to multi-modal workflow pipelines to billing syncs.
This article is an honest technical account of what changed, why it worked, and what we'd do differently. I'll start with the product context (because you can't understand the architecture without it) and then get into the parts engineers like yourself care about most.
The year everything got more interesting#
A year ago, nearly 100% of our generations happened through simple inference forms. It went a little something like this: a user uploads a reference image, picks a model, and hits generate. That was the whole loop. While it was extremely satisfying, it's not something you could really call a pipeline with confidence.
We knew we needed to make a shift to meet customer demand. Volkan Gurel, our CEO and Co-founder, put it best:
"The moment studios stopped asking 'can AI make this asset?' and started asking 'how do we run this at 10,000 assets a month?' — that's when we knew we'd crossed a threshold. Layer had to grow up to meet that demand."
We realized we had an operations problem on our hands, and that solving it meant building something that looked a lot less like a generation form and a lot more like a production line.
So we built Workflows — a node-based canvas where teams assemble complex production pipelines for images, video, audio, and 3D assets. Node-based systems are tried and true; any Unreal developer has spent quality time inside a node graph. What's different about ours is the ability to parallelize generation at massive scale, so studios can render hundreds or thousands of assets in the time it previously took to generate just one. Under the hood, every node in a Workflow is a Temporal child workflow, but we'll come back to that.
Then video showed up and broke all our assumptions#
Image generation going from "oh, cool demo?" to "nice, we can actually use this in production" was one turning point, but video was an entirely different beast.
Video game studios collectively spend billions on user acquisition, and the dominant format in high-performing UA campaigns is short-form video across platforms like TikTok, YouTube, Meta, AppLovin, and UnityAds. The creative teams running those campaigns need volume to achieve their goals — we're talking dozens or hundreds of variants to A/B test at any given time. Traditional production pipelines can't move that fast. Even when the creative is conceptually simple, the coordination overhead absolutely kills velocity.
When models like Veo, Kling, and Seedance became capable enough for production use, video generation suddenly became viable inside a studio pipeline. What surprised us was how studios started using this feature. The art teams, already trained on Layer for creating in-game assets, also started using it for marketing — and as a result, studios are now producing video ads, game trailers, and seasonal LiveOps campaigns in a fraction of the time it took before.
For us on the engineering side, video meant a whole new tier of constraints: larger files, longer processing times, FFmpeg-based composition, multi-track timeline rendering, stitching, reframing. The good news is that Temporal was able to adapt with us — more on that later.
The last human in the loop#
The third big shift is one we're still in the early innings of, but it might end up being the most consequential.
We launched an MCP connector: a server that lets LLM agents directly invoke Layer workflows. Previously, an engineer had to log in, configure a workflow, and trigger it. With MCP, an agent can do all of that — through natural language, via an LLM that has been given access to Layer's tools. Jacques Lemire, Founding Principal Engineer at Layer:
"MCP is the layer that turns Layer from a platform you log into to a system that works for you autonomously. When an LLM can directly invoke a workflow, you've removed the last human bottleneck from the creative loop."
The use case that has us most excited is what we're calling the full autonomous loop: an agent monitors campaign performance data, identifies which creatives are fatiguing, triggers new variant generation in Layer, and pushes updated assets to ad networks — overnight, without a single human in the loop.
The format we think benefits most from this is playable ads. You've definitely encountered them if you're an avid TikTok scroller: a lightweight, interactive experience where the user plays a mini-version of the game inside the ad unit before deciding to install. The conversion data is unambiguous — higher engagement, better CVR, stronger Day 1 retention. The barrier has always been production cost, which has historically run $5,000+ per creative and required custom development for every variant.
We believe playable ads are at a similar turning point to where video ads were about 18 months ago. Alex Engel, VP of Product:
"The data on playable ads is unambiguous — higher engagement, better CVR, stronger D1 retention from installs. The barrier has never been demand. It's been the cost and complexity of production. Removing that barrier changes the math entirely."
With agent-driven generation, we expect studios to bring that cost below $20 per creative.
50 workflows and counting#
Our original blog centered on FileUploadWorkflow. That workflow still exists and still earns its keep, but a solid foundation is one you keep building on. Jacques Lemire, who has lived inside this architecture longer than anyone:
"Temporal became our async execution backbone almost by accident — we started with file uploads and ended up with 50 workflow types orchestrating everything from LoRA fine-tuning to multi-modal Workflow pipelines. The durability guarantees are what made it possible to build a visual pipeline builder on top of it without constantly worrying about what happens when things fail mid-run."
"Almost by accident" is right. The rule of thumb we've landed on: if something takes more than a few seconds or needs to survive a server restart, it's a workflow. Here's what that looks like across the platform today:
- Workflow execution (RunBlueprintWorkflow) — orchestrates our visual DAG pipeline builder
- Inference (RunInferenceWorkflow) — every generation request across all modalities
- Model training — LoRA fine-tuning across multiple providers
- File processing — the original workflows, plus a growing family of post-processing variants
- Billing and entitlements — usage tracking, credit management
- CRM and analytics sync — event forwarding to downstream systems
- Scheduled maintenance — cleanup, cache warming, health checks
The one where we built a visual pipeline builder on top of Temporal#
The Workflow system is the most architecturally interesting thing we've built on Temporal, and probably a more novel use of the platform than anything we've come across in the wild.
Users build pipelines visually: a series of nodes (generate image, trim video, mix audio, compose layers, add text overlay, export) connected by edges that define data flow. Each node is an independent Temporal child workflow. The parent RunBlueprintWorkflow fans out all nodes via asyncio.gather() and resolves inter-node dependencies using workflow.wait_condition() against a shared all_values dict that acts as a durable data bus:
# Parent workflow fans out all nodes concurrently (the futures are awaited
# together via asyncio.gather)
node_futures = [
    workflow.execute_child_workflow(RunNodeWorkflow.run, node_input)
    for node_input in dag.nodes
]

# Each node blocks on its inputs via wait_condition
await workflow.wait_condition(
    lambda: all(dep in all_values for dep in node.dependencies)
)

# Node executes, writes output back to shared values
all_values[node.output_key] = node_output
What this gives us is a visual pipeline builder backed by Temporal's durability guarantees. If a node fails mid-pipeline, Temporal's retry logic handles it. If the worker goes down, the pipeline picks up where it left off. Users see a clean visual canvas; the chaos lives safely inside the workflow history where it belongs.
We also support partial re-runs: if a user tweaks one node in an existing pipeline, only the invalidated downstream nodes re-execute. Previous outputs for unchanged nodes are restored from the work plan. It's one of those things users don't notice until it's gone.
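To make the invalidation step concrete, here's a simplified sketch of the idea (not our exact work-plan code; the function and argument names are illustrative):

from collections import deque

# Simplified sketch: walk the DAG from the edited node and collect every
# downstream node that depends on it. Only these need to re-execute; the
# remaining outputs can be restored from the stored work plan.
def invalidated_nodes(edges: dict[str, list[str]], changed: str) -> set[str]:
    """edges maps node_id -> list of directly downstream node_ids."""
    dirty, queue = {changed}, deque([changed])
    while queue:
        node = queue.popleft()
        for downstream in edges.get(node, []):
            if downstream not in dirty:
                dirty.add(downstream)
                queue.append(downstream)
    return dirty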
One workflow, four modalities, zero redesigns#
Video generation is fundamentally heavier than image generation — longer processing times, larger files, more complex post-processing. The real win for us is that we didn't have to redesign anything to support it.
Every generation on the platform, whether it's a FLUX image, a Kling video, a Hunyuan 3D model, or an ElevenLabs audio clip, flows through the same RunInferenceWorkflow:
@workflow.defn
class RunInferenceWorkflow:
    TASK_QUEUE = "inference"

    def __init__(self) -> None:
        self.queue_sync_version = 0
        self.last_queue_sync_version = 0
        self.queue_status: InferenceStatus | None = None
        self.cancel_requested = False
        self.inference_provider: InferenceProvider | None = None

    @workflow.run
    async def run(self, input: RunInferenceWorkflowInput) -> RunInferenceWorkflowOutput:
        inference_context = await workflow.execute_local_activity_method(
            RunInferenceActivities.extract_inference_context, ...)

        await workflow.execute_activity_method(
            RunInferenceActivities.preprocess_inputs, ...)

        if inference_context.requires_translation:
            await workflow.execute_activity_method(
                RunInferenceActivities.translate_prompt, ...)

        # Modality-aware timeout: 5min text, 15min image, 30min video
        execution_timeout = WORKFLOW_EXECUTION_TIMEOUT_BY_MODALITY.get(
            inference_context.generated_modality)
        output = await asyncio.wait_for(
            self.generate_files(input.inference_id, ...),
            execution_timeout.total_seconds(),
        )

        # Post-creation steps run in parallel
        if len(inference_context.inference_steps) > 1:
            await asyncio.gather(*[
                apply_post_creation_steps(fid) for fid in output.file_ids
            ])

        await workflow.execute_activity_method(
            RunInferenceActivities.mark_inference_complete, ...)
        return RunInferenceWorkflowOutput(
            inference_id=input.inference_id, status=InferenceStatus.COMPLETE)

    async def generate_files(self, inference_id, ...):
        """Hybrid webhook+polling for external provider completion."""
        output = await workflow.execute_activity_method(
            RunInferenceActivities.generate_files, ...)
        if output.status == InferenceStatus.IN_PROGRESS:
            for _ in range(output.external_queue_handles.poll_attempts):
                try:
                    await workflow.wait_condition(
                        lambda: self.last_queue_sync_version != self.queue_sync_version,
                        timeout=output.external_queue_handles.poll_period,
                    )
                except asyncio.TimeoutError:
                    pass  # No webhook received, fall back to polling
                sync_output = await workflow.execute_activity_method(
                    RunInferenceActivities.sync_queue_status, ...)
                if sync_output.status != InferenceStatus.IN_PROGRESS:
                    break
            else:
                raise ApplicationError("Polling Timeout", non_retryable=True)
        return output

    @workflow.signal
    async def queue_status_updated(self, input: QueueStatusUpdatedSignal):
        self.queue_sync_version += 1
        self.queue_status = input.status

    @workflow.signal
    async def cancel(self):
        self.queue_sync_version += 1
        self.cancel_requested = True
Three things worth calling out:
The webhook+polling hybrid. External providers give you two options: webhooks (fast, but lossy if your service is down) or polling (reliable, but slow). We use both via workflow.wait_condition() with the poll period as a timeout. Webhook arrives, the condition resolves immediately. Doesn't arrive, we fall through to polling. Webhook latency with polling reliability — this has saved us a lot of debugging time.
Modality-aware timeouts. Five minutes for text, fifteen for images, thirty for video — all stored in a dict keyed by modality. Adding a new modality means adding one entry. Simple, maybe even a little boring, but exactly what you want in timeout logic.
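For illustration, a minimal sketch of that table (the Modality enum here is assumed, not our exact type; other modalities get entries the same way):

from datetime import timedelta
from enum import Enum, auto

# Illustrative enum; the real one lives in our codebase
class Modality(Enum):
    TEXT = auto()
    IMAGE = auto()
    VIDEO = auto()

# Adding a new modality means adding one entry here
WORKFLOW_EXECUTION_TIMEOUT_BY_MODALITY = {
    Modality.TEXT: timedelta(minutes=5),
    Modality.IMAGE: timedelta(minutes=15),
    Modality.VIDEO: timedelta(minutes=30),
}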
Signal-based cancellation. The cancel signal sets a flag the workflow checks at safe points. No orphaned tasks, no zombie processes — which matters more than you'd think once studios are running batch pipelines.
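Here's a hypothetical sketch of what a safe-point check can look like; the activity name and status value below are illustrative rather than our exact identifiers:

# Hypothetical sketch of a safe-point check between pipeline stages.
# mark_inference_canceled and InferenceStatus.CANCELED are illustrative names.
if self.cancel_requested:
    await workflow.execute_activity_method(
        RunInferenceActivities.mark_inference_canceled, ...)
    return RunInferenceWorkflowOutput(
        inference_id=input.inference_id,
        status=InferenceStatus.CANCELED,
    )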
50 ad creatives and one short window#
UA pipelines have a completely different throughput profile than single-asset generation. A studio needs 50 ad creatives (5 backgrounds × 5 text overlays × 2 aspect ratios) fast, not queued behind each other.
We handle parallelism at two levels. At the Workflow level, independent nodes run concurrently via asyncio.gather(). Within each generation node, all inference requests for a given batch run as parallel child workflows. Fifty variants means 50 RunInferenceWorkflow instances running simultaneously, subject to rate limits.
One thing Temporal doesn't give you out of the box: batch flow control. Dropping 500 child workflows on your cluster at once is a bad idea, so we built our own batching layer to stagger the fan-out. Temporal handles reliability; throttling is our job.
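As a rough sketch of what that batching layer does (the chunk size and helper name here are made up for illustration):

# Illustrative sketch of staggered fan-out: start child workflows in bounded
# batches instead of all at once. BATCH_SIZE and the helper name are made up.
BATCH_SIZE = 25

async def fan_out_in_batches(inference_inputs):
    results = []
    for i in range(0, len(inference_inputs), BATCH_SIZE):
        batch = inference_inputs[i : i + BATCH_SIZE]
        results += await asyncio.gather(*[
            workflow.execute_child_workflow(RunInferenceWorkflow.run, inp)
            for inp in batch
        ])
    return results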
We also isolate workloads by task queue. Workflow batch inference runs on blueprint.inference, separate from the direct API queue. A studio running a 1,000-asset UA batch doesn't affect the artist doing single-asset generation in the next tab. Nobody wants to be the noisy neighbor.
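In the Python SDK, queue isolation comes down to which task queue a worker pool polls and which queue a workflow is started on; a minimal sketch, with registration details simplified:

import asyncio

from temporalio.client import Client
from temporalio.worker import Worker

async def run_batch_inference_worker() -> None:
    # Connection target is illustrative
    client = await Client.connect("localhost:7233")

    # This worker pool only polls the batch queue; interactive inference is
    # served by a separate pool polling its own queue.
    worker = Worker(
        client,
        task_queue="blueprint.inference",
        workflows=[RunInferenceWorkflow],
        # activity registration elided
    )
    await worker.run()

if __name__ == "__main__":
    asyncio.run(run_batch_inference_worker())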
Meet the Sheriff#
DevOps at Layer is a shared responsibility. We run a rotating "Sheriff" — one engineer per rotation who owns live issues. For this to work, you need to get from "something is wrong" to "we know exactly what failed" as fast as possible. Every minute spent correlating logs manually is a minute a studio pipeline sits idle.
When a workflow fails, Temporal's history gives us the exact activity that failed, the input it received, and how many retries were attempted. We pair this with a custom LoggingContextInterceptor that injects workflow_id, run_id, task_queue, and parent workflow info into every log line. Given a workflow_id from a user complaint, you jump straight to the correlated logs in GCP and errors in Sentry. No grep archaeology.
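For anyone wanting to build something similar, here's a rough sketch of the activity-side half of such an interceptor, using the Python SDK's worker interceptors plus a contextvars-backed logging filter. It's illustrative rather than our production code, and the workflow-side injection (including parent workflow info) is omitted:

import contextvars
import logging

from temporalio import activity
from temporalio.worker import (
    ActivityInboundInterceptor,
    ExecuteActivityInput,
    Interceptor,
)

# Holds per-task workflow metadata so every log line can pick it up
_log_context: contextvars.ContextVar[dict] = contextvars.ContextVar(
    "log_context", default={}
)

class ContextFilter(logging.Filter):
    """Copies the current workflow metadata onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in _log_context.get().items():
            setattr(record, key, value)
        return True

class _LoggingActivityInterceptor(ActivityInboundInterceptor):
    async def execute_activity(self, input: ExecuteActivityInput):
        info = activity.info()
        token = _log_context.set({
            "workflow_id": info.workflow_id,
            "run_id": info.workflow_run_id,
            "task_queue": info.task_queue,
            "activity_type": info.activity_type,
        })
        try:
            return await self.next.execute_activity(input)
        finally:
            _log_context.reset(token)

class LoggingContextInterceptor(Interceptor):
    """Worker interceptor that injects workflow metadata into activity logs."""

    def intercept_activity(
        self, next: ActivityInboundInterceptor
    ) -> ActivityInboundInterceptor:
        return _LoggingActivityInterceptor(next)

Wiring it up is roughly one line on the Worker (interceptors=[LoggingContextInterceptor()]) plus attaching ContextFilter to your log handlers.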
We also have OpenTelemetry across HTTP, GraphQL, and MCP tool calls, per-node success/error counters and execution time histograms for workflows, and inference events flowing through Pub/Sub into BigQuery via Apache Beam with Looker dashboards on top. Our CS team sees model health, latency, and failure rates without filing an engineering ticket. We've also started using Claude with BigQuery MCP and Sentry to investigate issues — using AI to debug AI, which feels appropriate.
Jacques Lemire on where the real return came from:
"The investment in observability paid off faster than anything else we did — being able to jump from a workflow ID straight to logs, errors, and the exact activity that failed means our Sheriff rotation actually works. If I had one piece of advice for teams building AI infrastructure on Temporal: encode your cleanup logic into the workflow from day one, not as an afterthought. We learned that the hard way."
The good and the "oh yeah, we set alerts for that"#
We run a self-hosted Temporal cluster for more control. Temporal doesn't autoscale out of the box, so we feed its telemetry into our cloud monitoring stack with proactive alerts:
- Acquire Shard Latency P95 > 30ms
- Activity Schedule to Start Latency P95 increased > 50%
When either fires, we upscale before users feel it. Factor this into your operational planning if you're considering self-hosting.
Task queue isolation has been the other big win. Interactive inference, Workflow batch inference, video rendering, and training all run on separate queues with separate worker pools. Separate queues also mean separate deployment targets — rolling updates get a lot cleaner.
Don't skip asyncio.shield()#
We got burned by this, and you probably will too if you don't read this section.
Temporal workflows often need cleanup steps to run no matter what — clearing a credit reservation, writing a failure status. If you don't protect those activities, a cancellation will kill them before they complete. What you get: orphaned credit reservations, painful to reconcile at scale.
The fix:
# Without this, cancellation kills cleanup before it runs
await asyncio.shield(
    workflow.execute_activity_method(
        BillingActivities.clear_credit_reservation,
        ClearCreditReservationInput(inference_id=input.inference_id),
        start_to_close_timeout=timedelta(seconds=30),
        retry_policy=RetryPolicy(maximum_attempts=5),
    )
)
asyncio.shield() on every cleanup activity is now on our code review checklist. Add it from the start.
What's next#
The MCP connector is live. Agents can invoke Layer workflows directly. The next step is closing the loop: agents that monitor campaign performance, identify fatiguing creatives, trigger generation, and push assets to ad networks — autonomously, on a cadence, while everyone sleeps.
Workflow execution will keep growing as a share of total generations — the mix has shifted significantly since Workflows launched, and it's not slowing down. As agentic workflows become more common, orchestration complexity grows with them: more workflow types, more inter-system coordination, more places where durability guarantees aren't optional.
We plan to keep Temporal in the middle of all of it. Building reliable systems on top of unreliable components is what it was designed for. We've tested that claim pretty thoroughly at this point. It holds.
If you want to see the production pipeline in action, here's what top-performing UA creative generation looks like end to end. And if you're building something similar, find us on LinkedIn — let's chat!