If you’ve ever built a little AI feature that worked beautifully in a demo and then behaved like a haunted appliance in production, you’re in good company.
The first time through, everything lines up: the model answers, the tool call returns, the app produces the output, and you start imagining all the places this could go. Then you ship it. A worker restarts at the wrong moment, a network request stalls, or a rate limit shows up uninvited, and suddenly your system is doing the equivalent of flipping the board game table because it can’t remember what it already did.
It’s a familiar arc, and it’s basically the engineering version of Campbell’s Hero’s Journey. The “ordinary world” is the demo, the “call to adventure” is shipping it, and from there you’re pulled into unknown territory where old assumptions stop working.
The good news is that, like any hero’s journey, there’s a path through. You come back changed — not because you’ve eliminated failure, but because you’ve learned to build systems that handle it gracefully.
Here’s a quick sanity check. If any of these sound familiar, you’ve already heard the call:
- If a process restarts halfway through, do I lose the work I already paid for?
- Do I have steps that must run exactly once, even if I retry?
- Do I need to wait on humans, approvals, or external systems?
- Do I need to show progress or partial output while the work is ongoing?
When people say they want AI features to be “reliable,” they usually don’t mean “nothing ever fails.” They mean: when something fails, it recovers in a way that doesn’t waste time, money, or user trust, and it doesn’t require you to invent an entire reliability framework in your application code.
That’s what durability is about. Let’s walk through what it looks like to actually get there.
## Crossing the threshold
A lot of AI-powered features are multi-step flows: call a model, call a few tools, combine results, generate a file or update a system, maybe ask a human to approve something, then keep going later.
The tricky part is that those steps don’t live in one place. They touch networks, APIs, databases, file systems, and user actions, and any of those can fail in ways you can’t fully control. The moment you ship and start dealing with real failures, you’ve crossed into unfamiliar territory. Your cozy local-dev assumptions don’t apply anymore.
This is the threshold every developer crosses eventually. You can’t un-know what production taught you, and you can’t go back to pretending happy paths are enough.
Temporal is built for exactly this kind of work: it lets you write long-running, multi-step logic as a Workflow that can survive restarts, while the actual side effects happen in Activities that can time out and retry cleanly.
Here’s the rule of thumb:
- Workflows hold the plan and the state (“what step am I on, and what did I learn so far?”)
- Activities do the external work (“call the model, hit the API, write the PDF, update the DB”)
Once you start using Temporal, a bunch of messy edge cases stop being bespoke problems. But knowing the tool exists isn’t the same as knowing how to use it well — and knowing when not to use it matters too.
A quick calibration: If your flow is a single model call with no side effects, you probably don’t need Temporal. The value kicks in the moment you add a second step that can fail independently, or when you need to wait for something external (a human, a webhook, a rate limit cooldown). If you’re unsure, ask yourself: “What happens if this process dies right here?” If the answer is “we lose work and have to start over,” durability is worth the investment.
Let’s look at the patterns that matter most.
## Don’t pay twice for the same model call
The central crisis in AI reliability usually arrives the first time you watch a model call succeed, followed by a crash, followed by the same model call running again because your system forgot it already happened.
One of our Learn tutorials uses a simple research flow that makes this concrete: you call an LLM to generate content and then you generate a PDF. The failure mode is exactly what you’d expect. If the app crashes during PDF creation, you don’t want to rerun the model call just because the last step failed.
The implementation pattern that gets you through:
- Put the model call in an Activity
- Put the PDF generation in an Activity
- Orchestrate both from a Workflow with reasonable timeouts
In Python, that looks like this:
```python
from temporalio import activity


@activity.defn
def llm_call(prompt: str) -> str:
    # Call your model provider here
    ...


@activity.defn
def create_pdf(content: str) -> str:
    # Write the PDF and return a filename/path
    ...
```
Then the Workflow coordinates the steps:
```python
from datetime import timedelta

from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    from activities import llm_call, create_pdf


@workflow.defn
class GenerateReportWorkflow:
    @workflow.run
    async def run(self, prompt: str) -> str:
        text = await workflow.execute_activity(
            llm_call,
            prompt,
            start_to_close_timeout=timedelta(seconds=30),
        )
        pdf_path = await workflow.execute_activity(
            create_pdf,
            text,
            start_to_close_timeout=timedelta(seconds=60),
        )
        return pdf_path
```
What changes here isn’t that failures go away. What changes is that you stop treating a process crash as total amnesia. If the system restarts after the model call succeeds, Temporal replays the Workflow history, recognizes that the first Activity already completed, and picks up where you left off, finishing the PDF step without repeating the expensive work.
That’s the reward: your system now remembers what it's already done, and you stop paying twice for the same work.
If you want the full walkthrough and runnable code, this is the Learn tutorial that drives that example.
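To make the wiring concrete, here’s a rough sketch of the Worker and client side. The module names (`activities.py`, `workflows.py`), task queue, Workflow ID, and prompt are illustrative, and it assumes a Temporal Service reachable at localhost:7233:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from temporalio.client import Client
from temporalio.worker import Worker

from activities import llm_call, create_pdf
from workflows import GenerateReportWorkflow


async def main():
    client = await Client.connect("localhost:7233")

    # Sync (non-async) Activities need a thread pool to run on.
    worker = Worker(
        client,
        task_queue="report-generation",
        workflows=[GenerateReportWorkflow],
        activities=[llm_call, create_pdf],
        activity_executor=ThreadPoolExecutor(max_workers=4),
    )

    async with worker:
        pdf_path = await client.execute_workflow(
            GenerateReportWorkflow.run,
            "Write a short research summary",
            id="generate-report-example",
            task_queue="report-generation",
        )
        print(pdf_path)


if __name__ == "__main__":
    asyncio.run(main())
```

In a real deployment the Worker runs as its own long-lived process, separate from whatever starts the Workflow; the combined script above just keeps the sketch self-contained.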
## Human feedback without losing your way
You’ve built something that works, but now a human needs to weigh in. That might be approval, edits, selecting among options, or simply saying “try again.” The awkward part is that people don’t respond on your timeline, and their decision often arrives after your process has restarted, scaled, or moved to a different machine.
Without durability, this is where state turns into confetti: scattered across webhooks, database flags, and ad hoc timers that nobody fully understands. You end up building a bespoke state machine for every feature, and each one has slightly different bugs.
Temporal gives you a clean model for human-in-the-loop flows:
- Use a Signal to send a decision into a running Workflow
- Use a Query to ask the Workflow what it currently has (so you can show progress or partial results)
Here’s a minimal sketch of “generate a draft, wait for feedback, then proceed”:
```python
import asyncio
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class DraftWorkflow:
    def __init__(self):
        self.decision = None
        self.latest_text = ""

    @workflow.signal
    async def set_decision(self, decision: str):
        self.decision = decision

    @workflow.query
    def get_latest_text(self) -> str:
        return self.latest_text

    @workflow.run
    async def run(self, prompt: str) -> str:
        self.latest_text = await workflow.execute_activity(...)

        try:
            await workflow.wait_condition(
                lambda: self.decision is not None,
                timeout=timedelta(days=5),
            )
        except asyncio.TimeoutError:
            # Decide what “no response” means for your product:
            # return the current draft, mark as expired, notify, etc.
            return self.latest_text

        if self.decision == "keep":
            return self.latest_text
        if self.decision == "redo":
            self.latest_text = await workflow.execute_activity(...)
            return self.latest_text
        return self.latest_text
```

On timeouts: how long should that wait be? Pick a value that matches your product promise and operational reality. Five days might be right for contract review pipelines where humans need time to consult. One hour might be right for customer support triage where staleness kills value. The key questions: How long are you willing to keep the Workflow “live”? What should the system do when the deadline passes: auto-complete, expire, notify, escalate?
You can build pretty friendly UX on top of this: “show me what you have so far,” “approve,” “redo with edits,” and so on, without needing to glue together timers, webhooks, and ad hoc database state. The Workflow is your state machine, and it survives restarts by design.
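For a sense of what the client side of that UX could look like, here’s a hedged sketch; the Workflow ID, module layout, and decision string are illustrative:

```python
from temporalio.client import Client

from workflows import DraftWorkflow  # assumed module layout


async def review(client: Client) -> str:
    # "draft-123" is an illustrative Workflow ID from wherever you started the run.
    handle = client.get_workflow_handle("draft-123")

    # “Show me what you have so far” — Query the running Workflow.
    current_draft = await handle.query(DraftWorkflow.get_latest_text)
    print(current_draft)

    # “Approve” — Signal the decision into the Workflow.
    await handle.signal(DraftWorkflow.set_decision, "keep")

    # The Workflow’s run method returns once the decision is handled.
    return await handle.result()
```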
## The tools you bring back: Durable tools with MCP
Once you’ve internalized durability for your own code, you start seeing the same gaps in the tools your AI agents use. MCP (Model Context Protocol) is a useful standard for exposing tools to AI clients. It gives you a clean interface for tool discovery and invocation. But MCP doesn’t magically make tool execution reliable. Your tools can still time out, hit rate limits, or fail mid-run. And if you’re chaining multiple tool calls, a failure partway through can leave you in an awkward partial state.
The practical pattern is to run MCP tool execution through Temporal:
- Wrap each tool call in an Activity so retries and timeouts are handled consistently. If a tool fails transiently, Temporal retries it according to your policy. If it fails persistently, you get clean error handling rather than a hung process.
- Use a Workflow to orchestrate multi-step tool chains so restarts don’t wipe progress. If your agent calls three tools in sequence and the Worker dies after tool two, you resume at tool three, not back at the beginning.
- Persist intermediate results in Workflow state so you can resume without repeating earlier steps, and so you can query the Workflow to show users what’s happened so far.
When MCP tool calls live inside Activities, a tool failure doesn’t have to mean “the whole agent run is toast.” The Workflow can retry the failing call, branch to a fallback tool, or pause and resume later, all while keeping the overall execution and conversation state coherent.
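Here’s a rough sketch of that shape. The Temporal pieces are real, but the MCP side is elided: `run_mcp_tool` is a hypothetical Activity wrapping whatever MCP client you use, and the tool names and timeouts are made up:

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def run_mcp_tool(tool_name: str, arguments: dict) -> dict:
    # Hypothetical wrapper: invoke the named tool through your MCP client
    # here and return its result as a plain dict.
    ...


@workflow.defn
class ToolChainWorkflow:
    @workflow.run
    async def run(self, query: str) -> dict:
        results: dict[str, dict] = {}
        # Each tool call is its own Activity, so a crash between calls
        # resumes at the next tool rather than back at the beginning.
        for tool in ["search_docs", "fetch_page", "summarize"]:
            results[tool] = await workflow.execute_activity(
                run_mcp_tool,
                args=[tool, {"query": query, "so_far": results}],
                start_to_close_timeout=timedelta(minutes=2),
                retry_policy=RetryPolicy(
                    initial_interval=timedelta(seconds=2),
                    maximum_attempts=5,
                ),
            )
        return results
```

Intermediate results live in `results`, which is Workflow state: it survives restarts and can be exposed through a Query if you want to show users progress mid-run.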
One small detail that’s easy to miss: if you need to pause for rate limits or backoff, use Temporal’s workflow.sleep() rather than asyncio.sleep(). That timer is durable. If your Worker restarts during the wait, Temporal replays the Workflow history and the remaining wait time continues from where it left off rather than starting over.
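As a minimal sketch (the Workflow and the cooldown length are illustrative):

```python
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class CooldownWorkflow:
    @workflow.run
    async def run(self) -> str:
        # Durable timer: if the Worker restarts mid-wait, the remaining
        # time is honored rather than the clock restarting from zero.
        await workflow.sleep(timedelta(minutes=5))
        return "cooldown complete"
```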
Explore our two-part MCP tutorial to see this in action: Part 1 | Part 2
This blog post by Josh Smith is another resource you might want to check out.
## The transformation
In Campbell’s framework, the hero returns home, but home looks different now. The ordinary world hasn’t changed; the hero has. They see possibilities and dangers that were invisible before.
If you’ve followed this journey, you've had a version of that experience. The demo that once felt complete now looks fragile. The production failures that once felt like emergencies now feel like expected terrain. You've internalized a different way of thinking about distributed work.
Here’s what’s actually changed:
| Before | Now |
|---|---|
| A process restart meant starting over. | Your system has memory that survives its components. |
| Waiting for a human meant building custom webhook handlers, database flags, and polling loops. | Waiting for a human means await workflow.wait_condition(). The complexity is absorbed by the platform. |
| A tool failure in a multi-step chain meant unwinding everything or accepting inconsistent state. | A tool failure means retry, fallback, or clean error handling, without losing the work that already succeeded. |
| You tested the happy path and hoped for the best. | You design for “what happens when this dies right here?” and the answer is never “we lose everything.” |
The ordinary world (the demo, the single-process prototype, the happy path) is still where ideas start. You can’t skip it, and you shouldn’t want to. But you can’t stay there either. Production is the threshold you have to cross, and durability is what lets you cross it without getting lost.
When you see a multi-step flow, you automatically think about where it can fail and how it should recover. The haunted appliances become predictable machines.
That’s the transformation. Welcome home.