Building AI agents that overcome the complexity cliff

Author: Ethan Ruhe
Date: Mar 10, 2026
Duration: 10 min

Agent capability is a function of time and tools. The more steps an agent can think through (i.e., “time”) and the more systems it can reliably interact with (i.e., “tools”), the more valuable it is for your business. Consider the difference between two customer service agents:

  • A chatbot that answers one-off questions based on a static knowledge base: a simple one-step, one-system design.
  • A chatbot that can look up a user’s order history in a database, check inventory in an external system, and initiate a refund or exchange: just a few more steps and systems, but dramatically more powerful.
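The second design can be sketched as a short pipeline of tool calls. Everything below is illustrative: the function names and the in-memory dicts are hypothetical stand-ins for the real order, inventory, and payment systems.

```python
# Toy sketch of the multi-step support agent. Dicts stand in for the
# order database, inventory system, and refund service.
ORDERS = {"o-1": {"user": "u-1", "item": "mug", "status": "delivered"}}
INVENTORY = {"mug": 3}
REFUNDS = []

def lookup_order(order_id: str) -> dict:
    return ORDERS[order_id]

def check_inventory(item: str) -> int:
    return INVENTORY.get(item, 0)

def initiate_refund(order_id: str) -> str:
    REFUNDS.append(order_id)
    return f"refund-started:{order_id}"

def handle_exchange_request(order_id: str) -> str:
    order = lookup_order(order_id)              # step 1: one system
    if check_inventory(order["item"]) > 0:      # step 2: a second system
        return f"exchange-ok:{order['item']}"   # replacement in stock
    return initiate_refund(order_id)            # step 3: fall back to refund
```

Three steps and three systems instead of one, and already each step is a new place for state to be lost or a call to fail.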


Increasing agent capability with more time and tools creates two compounding challenges:

  • Experiencing failures becomes increasingly certain.
  • Iteration speed slows with each marginal step.

We’ve created a taxonomy of escalating agent capability levels to help you understand the challenges you’re likely to face as you build increasingly powerful agents. At a certain point, which we call “the complexity cliff,” most frameworks break down and Durable Execution becomes required. We’ll explain how you can take your agents beyond the cliff, all the way to the autonomous level.

The problem: Agent capability is a function of time and tools

Perhaps the least controversial thing you can say about agents is that they tend to get more useful as they can do more work. Frontier coding assistants are impressive because they can iteratively read a codebase, propose changes, and test the changes. Deep research systems are useful because they can fan out and investigate many related topics before fanning in and aggregating findings.

Building agents with increasingly valuable capabilities will require more execution time, more tools, and more coordination. As a simple heuristic: the value of an agent tends to increase with the number of steps it is capable of performing. And the value tends to be superlinear. A 100-step workflow isn’t twice as useful as a 50-step workflow; it’s often an order of magnitude more useful.

This raises two challenges that compound with time.

Challenge 1: Failures become (essentially) inevitable


Just as entropy increases with time, the odds an AI agent experiences some kind of failure increase with each marginal step. As capability increases, failure becomes not just possible but asymptotically certain.

Dependent services flake. APIs rate-limit. Clusters die. Networks partition. The longer your agent runs, the more of these your agent will have to navigate.

So, what happens when failures occur? For a single prompt-response inference call, the answer is easy: retry. For a 4-hour research agent that’s already made 582 API calls and written intermediate results to three different systems, the answer has to be better.
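For the single-call case, “just retry” really is a small wrapper. A minimal sketch, where `call_llm` is a hypothetical flaky inference call (not a real API):

```python
import time

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a single prompt-response inference call
    # that sometimes fails transiently.
    raise ConnectionError("transient network failure")

def with_retries(fn, *args, attempts=3, base_delay=0.01):
    # Retry with exponential backoff: fine for a stateless single call,
    # useless for a 4-hour agent whose side effects already committed.
    for attempt in range(attempts):
        try:
            return fn(*args)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

The wrapper works precisely because nothing survives between attempts; the long-running agent breaks that assumption.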

Challenge 2: Iteration becomes increasingly slow

Development velocity slows as agent sophistication grows. Building agents is an empirical exercise. You experiment with prompts, swap tools, and try different strategies. You look for what works and what doesn’t.

But longer workflows mean more expensive experiments. You’re paying both the token costs of re-execution and the time cost to developer productivity. Every prompt tweak at step 30 of a 30-step workflow means re-running steps 1 through 29 first, and that can take quite a while.

The complexity cliff: Where most agent frameworks fail


What makes these challenges especially acute is that there isn’t a gradual progression of concern; there’s a step change as sophistication increases. The odds of failure compound with each marginal step, and the iteration problem grows quadratically with the number of steps. We call this step change “the complexity cliff.”

Below the cliff, restarts are viable. Your workflow is cheap enough that re-running from scratch is annoying but acceptable.

Above the cliff, restarts range from painful to catastrophic because they are:

  • Prohibitively expensive: You’ve already burned significant compute, API costs, and wall-clock time.
  • Dangerous: Side effects have already happened. Emails sent. Records created. Actions taken.
  • Impossible: External state has changed. The world moved on while your agent was running.

And above the cliff, testing one change requires re-running everything that came before it. You burn tokens and compute re-running steps that already work just to reach the one you’re trying to fix.

Let’s dive deeper into these capability levels and how to cross the complexity cliff.

Levels of agent capability

We’ve developed a framework for thinking about escalating levels of agent capability and the infrastructure each level demands.

Landscape at a glance

| Level | Duration | Failure cost | Iteration cost | Core challenge |
|-------|----------|--------------|----------------|----------------|
| L1 | ms | Negligible | Trivial | None |
| L2 | sec–min | Low | Low | Session state |
| *The complexity cliff* | | | | |
| L3 | min–hrs | High | High | Need for Durable Execution |
| L4 | hrs–days | Very high | Very high | Cross-system coordination |
| L5 | days–∞ | Existential | Existential | Self-governance |

Level 1: Reflexive

Duration: Milliseconds
Pattern: Single inference, 1–2 tools
Examples: Chat completion, RAG query, classification
Failure cost: Negligible; user retries
Iteration cost: Trivial; batch thousands of variants
Infrastructure need: HTTP is sufficient

At this stage, a request comes in, an inference happens, and a response goes out. Complexity, the cost of failure, and the speed of iteration are not concerns.

Level 2: Conversational

Duration: Seconds to minutes
Pattern: Multi-turn session, in-memory state
Examples: Customer service bot, Q&A system, troubleshooting assistant
Failure cost: Low; lost context, restart conversation
Iteration cost: Low; simulate conversations
Infrastructure need: Session state (e.g., database)

Now we’re managing state across turns. The user expects continuity, but if something fails, the worst case is a frustrated user who has to start their conversation over. Annoying, but not catastrophic.
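A minimal version of that session layer, with a dict standing in for the real database (names here are illustrative, not a specific framework’s API):

```python
# Minimal session store for a Level 2 conversational bot. Losing this
# state costs one conversation, not hours of agent work.
SESSIONS: dict[str, list[dict]] = {}

def append_turn(session_id: str, role: str, text: str) -> None:
    SESSIONS.setdefault(session_id, []).append({"role": role, "text": text})

def build_context(session_id: str) -> str:
    # Replay prior turns into the prompt so the model sees continuity.
    return "\n".join(
        f"{t['role']}: {t['text']}" for t in SESSIONS.get(session_id, [])
    )
```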

THE COMPLEXITY CLIFF


Level 3: Orchestrated

Duration: Minutes to hours (longer with human-in-the-loop pauses)
Pattern: Multi-step workflow, external integrations, sub-agents
Examples: Deep research, coding agents, data pipelines, document processing
Failure cost: High; partial completion, side effects, expensive restart
Iteration cost: High; re-running 29 steps to test a new prompt at step 30
Infrastructure need: Durable Execution, checkpointing, replay

This is where most frameworks get difficult to work with or break.

An orchestrated agent pursues a goal through multiple steps. It might spawn sub-agents or child workflows, but it's still a single system with one owner. The complexity is in its depth: many steps and many tool calls. This is both a lot of steps to tune and a lot of opportunities for something to go wrong.

Failure here can mean partial completion with external side effects. Your agent sent three emails, updated two records, and then crashed. Resuming requires knowing exactly where you left off and avoiding duplicate consequences.
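One common defense against duplicate side effects on resume is an idempotency key per step. A toy version, with a dict standing in for a durable store:

```python
# Toy idempotency guard: each side effect is keyed, and a replayed step
# with an already-recorded key returns the stored result instead of
# acting again.
COMPLETED: dict[str, str] = {}
SENT_EMAILS: list[str] = []

def send_email_once(key: str, to: str) -> str:
    if key in COMPLETED:             # this step ran before the crash
        return COMPLETED[key]
    SENT_EMAILS.append(to)           # the real side effect, exactly once
    COMPLETED[key] = f"sent:{to}"
    return COMPLETED[key]
```

Production systems push this record into the same durable layer that tracks workflow progress, so "did it happen?" and "where was I?" have a single answer.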

Level 4: Coordinated

Duration: Hours to days
Pattern: Multiple independent agents, each pursuing distinct goals, that coordinate
Examples: Deal desk (sales + legal + finance agents), supply chain orchestration, CI/CD with cross-team handoffs
Failure cost: Very high; cascade failures, cross-boundary rollback
Iteration cost: Very high; must test integration, not just individual agents
Infrastructure need: Cross-system signaling, saga patterns, distributed compensation

The challenge here is both the depth of any particular agent, and its integration across boundaries. Multiple independent agents with different objectives must coordinate. When something fails, you need rollback strategies that span agents and systems. When you iterate, you're testing both individual agent logic and integration behavior.
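Those cross-agent rollback strategies are usually saga-style compensations: each forward step registers an undo, and a failure runs the undos in reverse. A minimal sketch, with toy closures standing in for real cross-agent actions:

```python
# Minimal saga runner: each completed step pushes its compensation;
# on failure, compensations run in reverse order, then the error
# propagates to the caller.
def run_saga(steps):
    compensations = []
    try:
        for action, undo in steps:
            action()
            compensations.append(undo)
    except Exception:
        for undo in reversed(compensations):
            undo()          # roll back everything that committed
        raise
```

The hard part above the cliff isn’t this loop; it’s making the compensation list itself survive a crash mid-rollback, which is exactly what a durable layer provides.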

Level 5: Autonomous

Duration: Indefinite
Pattern: Self-directing system that spawns, monitors, and adjusts its own workflows
Examples: AI employee, self-improving pipeline, autonomous business process
Failure cost: Existential
Iteration cost: Essentially impossible with re-execution; requires replay from event history
Infrastructure need: Governance, audit trails, backtest capability

An autonomous agent manages its own lifecycle. It decides what to do, when to spawn work, and when to escalate. Failure here could be a literal crash, or just a drift in efficacy over time. How do you know your autonomous agent is still pursuing the right goals after days of independent operation? You need complete audit trails, and the ability to replay and backtest a large corpus of execution history.

How Durable Execution unlocks agent capability

Teams crossing the complexity cliff discover that their architecture can’t handle two fundamental requirements:

  1. Failure recovery: Resuming from the exact point of failure with full context intact.
  2. Efficient iteration: Testing prompt variants without re-running entire workflows.

Standard approaches fall short:

  • Cron restarts jobs; it doesn’t resume them.
  • Queues lose in-flight state when workers die.
  • Checkpointing doesn’t account for progress and consequences that occur between checkpoints.
  • Experimentation requires re-running entire workflows in parallel, even when only testing a single step.

Inevitably, developers end up rolling their own durability layers on top of their frameworks. The result is tedious, error-prone, and insufficient.
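The hand-rolled version usually looks something like this; the comment marks the gap it leaves. This is a generic sketch of DIY checkpointing, not any particular framework:

```python
import json
import os

# A typical DIY durability layer: serialize progress to disk after each
# step. A restart resumes at the last checkpoint, but anything that
# happened between the checkpoint and the crash (retries, side effects,
# partial writes) is simply lost.
def run_steps(steps, ckpt_path):
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)        # resume from last checkpoint
    else:
        state = {"done": 0, "out": []}  # fresh run
    for i in range(state["done"], len(steps)):
        state["out"].append(steps[i]())
        state["done"] = i + 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)         # hope we don't die mid-write
    return state["out"]
```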

Durable Execution as the solution

What’s actually needed above the cliff is:

  • Automatic state persistence: Every single step is immutably logged by default. Progress survives process death without additional code or “checkpointing” boundaries.
  • Replay capability: Execution can resume from any point in execution history. State is reconstructed from event history without requiring re-execution.
  • History branching / replay-powered experimentation: Execution history forking enables extremely fast and efficient parallel experiments that don’t require re-running completed work.
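The core mechanism behind all three properties can be illustrated with a toy event log. This is a deliberately simplified sketch of the idea, not Temporal's implementation:

```python
# Toy event-sourced executor: every step result is appended to a
# history, and a re-run replays recorded results instead of
# re-executing. Branching is copying the history and diverging past it.
def execute(steps, history):
    executed = []                    # which steps actually ran this time
    for i, step in enumerate(steps):
        if i < len(history):
            result = history[i]      # replay: reconstruct state, no work
        else:
            result = step()          # first execution: do the work
            history.append(result)   # persist before moving on
            executed.append(i)
    return history, executed
```

Re-running with the same history executes nothing; re-running with a truncated copy of the history branches execution from that point, which is the basis for both crash recovery and cheap experiments.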

Durable Execution solves both the reliability problem and the velocity problem.

Consider iteration velocity. You have a 100-step agent, and you’re tuning the prompt at step 92. Now imagine doing it without Durable Execution, compared to doing it with Temporal:

  • Without Temporal: Testing 100 prompt variants means running 10,000 steps (steps 1–100, 100 times). Every variant re-executes the 91 steps that haven’t changed.
  • With Temporal: Testing 100 variants means running (100-91)*100 = 900 steps, a 91% reduction! Replay steps 1–91 from history (no re-execution, just state reconstruction), then branch into 100 parallel experiments that run steps 92–100. The advantage compounds with workflow length.
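The step counts above generalize. For an n-step workflow with the change at step k and v variants, the arithmetic is:

```python
def steps_without_replay(n: int, k: int, v: int) -> int:
    # Every variant re-executes the whole workflow.
    return n * v

def steps_with_replay(n: int, k: int, v: int) -> int:
    # Steps 1..k-1 are reconstructed from history; only k..n re-run.
    return (n - (k - 1)) * v
```

For n = 100, k = 92, v = 100 this gives 10,000 versus 900 steps, and the gap widens as k grows relative to n.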

Now consider failure recovery. Your agent summarizes 10,000 SEC filings and fails after processing 9,999.

  • Without Temporal: You restart at filing #1 — potentially hours of work and thousands of API calls, all repeated.
  • With Temporal: The history for the first 9,999 filings is recorded. State is rebuilt from the event log and the last filing is processed.
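In toy form, resuming from the event log looks like this; it's a simulation of the behavior, not the SDK:

```python
# A restart replays recorded summaries from the event log and only
# executes the filings that never finished. `summarize` stands in for
# the expensive API call.
def process_filings(filings, summarize, event_log):
    newly_processed = []
    for f in filings:
        if f in event_log:
            continue                 # rebuilt from history, no API call
        event_log[f] = summarize(f)  # record the result durably
        newly_processed.append(f)
    return newly_processed
```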

Again, this advantage compounds with workflow length. The longer your workflow, the more the value proposition of Durable Execution grows.

Companies like OpenAI, Abridge, Lovable, Replit, and Hebbia already use Temporal to build some of the most sophisticated agents on the market. Temporal gives them the building blocks to move faster and build deeper.

The complexity cliff is real. Most frameworks get you to the edge and leave you there. If you’re building agents that need to operate above it, Durable Execution is your ideal foundation.

Ready to build agents that operate above the cliff? Explore our code samples and AI Cookbook to get started with Durable Execution, or join the #topic-ai channel in our Community Slack to connect with other developers building agentic systems on Temporal.
