Building AI agents that overcome the complexity cliff

Author: Ethan Ruhe
Date: Mar 10, 2026
Duration: 10 min

Agent capability is a function of time and tools. The more steps an agent can think through (i.e., “time”) and the more systems it can reliably interact with (i.e., “tools”), the more valuable it is for your business. Consider the difference between two customer service agents:

  • A chatbot that answers one-off questions based on a static knowledge base: a simple one-step, one-system design.
  • A chatbot that can look up a user’s order history in a database, check inventory in an external system, and initiate a refund or exchange: just a few more steps and systems, but dramatically more powerful.
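The second design can be sketched as a short pipeline of tool calls. Everything below is illustrative: the function names and the in-memory dicts are hypothetical stand-ins for the real order, inventory, and payment systems.

```python
# Toy sketch of the multi-step support agent. Dicts stand in for the
# order database, inventory system, and refund service.
ORDERS = {"o-1": {"user": "u-1", "item": "mug", "status": "delivered"}}
INVENTORY = {"mug": 3}
REFUNDS = []

def lookup_order(order_id: str) -> dict:
    return ORDERS[order_id]

def check_inventory(item: str) -> int:
    return INVENTORY.get(item, 0)

def initiate_refund(order_id: str) -> str:
    REFUNDS.append(order_id)
    return f"refund-started:{order_id}"

def handle_exchange_request(order_id: str) -> str:
    order = lookup_order(order_id)              # step 1: one system
    if check_inventory(order["item"]) > 0:      # step 2: a second system
        return f"exchange-ok:{order['item']}"   # replacement in stock
    return initiate_refund(order_id)            # step 3: fall back to refund
```

Three steps and three systems instead of one, and already each step is a new place for state to be lost or a call to fail.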


Increasing agent capability with more time and tools creates two compounding challenges:

  • Experiencing failures becomes increasingly certain.
  • Iteration speed slows with each marginal step.

We’ve created a taxonomy of escalating agent capability levels to help you understand the challenges you’re likely to face as you build increasingly powerful agents. At a certain point, which we call “the complexity cliff,” most frameworks break down and Durable Execution becomes required. We’ll explain how you can take your agents beyond the cliff, all the way to the autonomous level.

The problem: Agent capability is a function of time and tools

Perhaps the least controversial thing you can say about agents is that they tend to get more useful as they can do more work. Frontier coding assistants are impressive because they can iteratively read a codebase, propose changes, and test the changes. Deep research systems are useful because they can fan out and investigate many related topics before fanning in and aggregating findings.

Building agents with increasingly valuable capabilities will require more execution time, more tools, and more coordination. As a simple heuristic: the value of an agent tends to increase with the number of steps it is capable of performing. And the value tends to be superlinear. A 100-step workflow isn’t twice as useful as a 50-step workflow; it’s often an order of magnitude more useful.

This raises two challenges that compound with time.

Challenge 1: Failures become (essentially) inevitable


Just as entropy increases with time, the odds an AI agent experiences some kind of failure increase with each marginal step. As capability increases, failure becomes not just possible but asymptotically certain.

Dependent services flake. APIs rate-limit. Clusters die. Networks partition. The longer your agent runs, the more of these your agent will have to navigate.

So, what happens when failures occur? For a single prompt-response inference call, the answer is easy: retry. For a 4-hour research agent that’s already made 582 API calls and written intermediate results to three different systems, the answer has to be better.
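For the single-call case, “just retry” really is a small wrapper. A minimal sketch, where `call_llm` is a hypothetical flaky inference call (not a real API):

```python
import time

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a single prompt-response inference call
    # that sometimes fails transiently.
    raise ConnectionError("transient network failure")

def with_retries(fn, *args, attempts=3, base_delay=0.01):
    # Retry with exponential backoff: fine for a stateless single call,
    # useless for a 4-hour agent whose side effects already committed.
    for attempt in range(attempts):
        try:
            return fn(*args)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

The wrapper works precisely because nothing survives between attempts; the long-running agent breaks that assumption.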

Challenge 2: Iteration becomes increasingly slow

Development velocity slows as agent sophistication grows. Building agents is an empirical exercise. You experiment with prompts, swap tools, and try different strategies. You look for what works and what doesn’t.

But longer workflows mean more expensive experiments. You’re paying both the token costs of re-execution and the time cost to developer productivity. Every prompt tweak at step 30 of a 30-step workflow means re-running steps 1 through 29 first, and that can take quite a while.

The complexity cliff: Where most agent frameworks fail


What makes these challenges especially acute is that there isn’t a gradual progression of concern; there’s a step change as sophistication increases. The odds of failure compound with each marginal step, and the iteration problem grows quadratically with the number of steps. We call this step change “the complexity cliff.”

Below the cliff, restarts are viable. Your workflow is cheap enough that re-running from scratch is annoying but acceptable.

Above the cliff, restarts range from painful to catastrophic because they are:

  • Prohibitively expensive: You’ve already burned significant compute, API costs, and wall-clock time.
  • Dangerous: Side effects have already happened. Emails sent. Records created. Actions taken.
  • Impossible: External state has changed. The world moved on while your agent was running.

And above the cliff, testing one change requires re-running everything that came before it. You burn tokens and compute re-running steps that already work just to reach the one you’re trying to fix.

Let’s dive deeper into these capability levels and how to cross the complexity cliff.

Levels of agent capability

We’ve developed a framework for thinking about escalating levels of agent capability and the infrastructure each level demands.

Landscape at a glance

| Level | Duration | Failure cost | Iteration cost | Core challenge |
|-------|----------|--------------|----------------|----------------|
| L1 | ms | Negligible | Trivial | None |
| L2 | sec–min | Low | Low | Session state |
| *The complexity cliff* | | | | |
| L3 | min–hrs | High | High | Need for Durable Execution |
| L4 | hrs–days | Very high | Very high | Cross-system coordination |
| L5 | days–∞ | Existential | Existential | Self-governance |

Level 1: Reflexive

Duration: Milliseconds
Pattern: Single inference, 1–2 tools
Examples: Chat completion, RAG query, classification
Failure cost: Negligible; user retries
Iteration cost: Trivial; batch thousands of variants
Infrastructure need: HTTP is sufficient

At this stage, a request comes in, an inference happens, and a response goes out. Complexity, the cost of failure, and the speed of iteration are not concerns.

Level 2: Conversational

Duration: Seconds to minutes
Pattern: Multi-turn session, in-memory state
Examples: Customer service bot, Q&A system, troubleshooting assistant
Failure cost: Low; lost context, restart conversation
Iteration cost: Low; simulate conversations
Infrastructure need: Session state (e.g., database)

Now we’re managing state across turns. The user expects continuity, but if something fails, the worst case is a frustrated user who has to start their conversation over. Annoying, but not catastrophic.
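A minimal version of that session layer, with a dict standing in for the real database (names here are illustrative, not a specific framework’s API):

```python
# Minimal session store for a Level 2 conversational bot. Losing this
# state costs one conversation, not hours of agent work.
SESSIONS: dict[str, list[dict]] = {}

def append_turn(session_id: str, role: str, text: str) -> None:
    SESSIONS.setdefault(session_id, []).append({"role": role, "text": text})

def build_context(session_id: str) -> str:
    # Replay prior turns into the prompt so the model sees continuity.
    return "\n".join(
        f"{t['role']}: {t['text']}" for t in SESSIONS.get(session_id, [])
    )
```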

THE COMPLEXITY CLIFF


Level 3: Orchestrated

Duration: Minutes to hours (longer with human-in-the-loop pauses)
Pattern: Multi-step workflow, external integrations, sub-agents
Examples: Deep research, coding agents, data pipelines, document processing
Failure cost: High; partial completion, side effects, expensive restart
Iteration cost: High; re-running 29 steps to test a new prompt at step 30
Infrastructure need: Durable Execution, checkpointing, replay

This is where most frameworks get difficult to work with or break.

An orchestrated agent pursues a goal through multiple steps. It might spawn sub-agents or child workflows, but it's still a single system with one owner. The complexity is in its depth: many steps and many tool calls. This is both a lot of steps to tune and a lot of opportunities for something to go wrong.

Failure here can mean partial completion with external side effects. Your agent sent three emails, updated two records, and then crashed. Resuming requires knowing exactly where you left off and avoiding duplicate consequences.
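One common defense against duplicate side effects on resume is an idempotency key per step. A toy version, with a dict standing in for a durable store:

```python
# Toy idempotency guard: each side effect is keyed, and a replayed step
# with an already-recorded key returns the stored result instead of
# acting again.
COMPLETED: dict[str, str] = {}
SENT_EMAILS: list[str] = []

def send_email_once(key: str, to: str) -> str:
    if key in COMPLETED:             # this step ran before the crash
        return COMPLETED[key]
    SENT_EMAILS.append(to)           # the real side effect, exactly once
    COMPLETED[key] = f"sent:{to}"
    return COMPLETED[key]
```

Production systems push this record into the same durable layer that tracks workflow progress, so "did it happen?" and "where was I?" have a single answer.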

Level 4: Coordinated

Duration: Hours to days
Pattern: Multiple independent agents, each pursuing distinct goals, that coordinate
Examples: Deal desk (sales + legal + finance agents), supply chain orchestration, CI/CD with cross-team handoffs
Failure cost: Very high; cascade failures, cross-boundary rollback
Iteration cost: Very high; must test integration, not just individual agents
Infrastructure need: Cross-system signaling, saga patterns, distributed compensation

The challenge here is both the depth of any particular agent, and its integration across boundaries. Multiple independent agents with different objectives must coordinate. When something fails, you need rollback strategies that span agents and systems. When you iterate, you're testing both individual agent logic and integration behavior.
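Those cross-agent rollback strategies are usually saga-style compensations: each forward step registers an undo, and a failure runs the undos in reverse. A minimal sketch, with toy closures standing in for real cross-agent actions:

```python
# Minimal saga runner: each completed step pushes its compensation;
# on failure, compensations run in reverse order, then the error
# propagates to the caller.
def run_saga(steps):
    compensations = []
    try:
        for action, undo in steps:
            action()
            compensations.append(undo)
    except Exception:
        for undo in reversed(compensations):
            undo()          # roll back everything that committed
        raise
```

The hard part above the cliff isn’t this loop; it’s making the compensation list itself survive a crash mid-rollback, which is exactly what a durable layer provides.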

Level 5: Autonomous

Duration: Indefinite
Pattern: Self-directing system that spawns, monitors, and adjusts its own workflows
Examples: AI employee, self-improving pipeline, autonomous business process
Failure cost: Existential
Iteration cost: Essentially impossible with re-execution; requires replay from event history
Infrastructure need: Governance, audit trails, backtest capability

An autonomous agent manages its own lifecycle. It decides what to do, when to spawn work, and when to escalate. Failure here could be a literal crash, or just a drift in efficacy over time. How do you know your autonomous agent is still pursuing the right goals after days of independent operation? You need complete audit trails, and the ability to replay and backtest a large corpus of execution history.

How Durable Execution unlocks agent capability

Teams crossing the complexity cliff discover that their architecture can’t handle two fundamental requirements:

  1. Failure recovery: Resuming from the exact point of failure with full context intact.
  2. Efficient iteration: Testing prompt variants without re-running entire workflows.

Standard approaches fall short:

  • Cron restarts jobs; it doesn’t resume them.
  • Queues lose in-flight state when workers die.
  • Checkpointing doesn’t account for progress and consequences that occur between checkpoints.
  • Experimentation requires re-running entire workflows in parallel, even when only testing a single step.

Inevitably, developers end up rolling their own durability layers on top of their frameworks. The result is tedious, error-prone, and insufficient.
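The hand-rolled version usually looks something like this; the comment marks the gap it leaves. This is a generic sketch of DIY checkpointing, not any particular framework:

```python
import json
import os

# A typical DIY durability layer: serialize progress to disk after each
# step. A restart resumes at the last checkpoint, but anything that
# happened between the checkpoint and the crash (retries, side effects,
# partial writes) is simply lost.
def run_steps(steps, ckpt_path):
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)        # resume from last checkpoint
    else:
        state = {"done": 0, "out": []}  # fresh run
    for i in range(state["done"], len(steps)):
        state["out"].append(steps[i]())
        state["done"] = i + 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)         # hope we don't die mid-write
    return state["out"]
```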

Durable Execution as the solution

What’s actually needed above the cliff is:

  • Automatic state persistence: Every single step is immutably logged by default. Progress survives process death without additional code or “checkpointing” boundaries.
  • Replay capability: Execution can resume from any point in execution history. State is reconstructed from event history without requiring re-execution.
  • History branching / replay-powered experimentation: Execution history forking enables extremely fast and efficient parallel experiments that don’t require re-running completed work.
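The core mechanism behind all three properties can be illustrated with a toy event log. This is a deliberately simplified sketch of the idea, not Temporal's implementation:

```python
# Toy event-sourced executor: every step result is appended to a
# history, and a re-run replays recorded results instead of
# re-executing. Branching is copying the history and diverging past it.
def execute(steps, history):
    executed = []                    # which steps actually ran this time
    for i, step in enumerate(steps):
        if i < len(history):
            result = history[i]      # replay: reconstruct state, no work
        else:
            result = step()          # first execution: do the work
            history.append(result)   # persist before moving on
            executed.append(i)
    return history, executed
```

Re-running with the same history executes nothing; re-running with a truncated copy of the history branches execution from that point, which is the basis for both crash recovery and cheap experiments.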

Durable Execution solves both the reliability problem and the velocity problem.

Consider iteration velocity. You have a 100-step agent, and you’re tuning the prompt at step 92. Now imagine doing it without Durable Execution, compared to doing it with Temporal:

  • Without Temporal: Testing 100 prompt variants means running 10,000 steps (steps 1–100, 100 times). Every variant re-executes the 91 steps that haven’t changed.
  • With Temporal: Testing 100 variants means running (100-91)*100 = 900 steps, a 91% reduction! Replay steps 1–91 from history (no re-execution, just state reconstruction), then branch into 100 parallel experiments that run steps 92–100. The advantage compounds with workflow length.
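The step counts above generalize. For an n-step workflow with the change at step k and v variants, the arithmetic is:

```python
def steps_without_replay(n: int, k: int, v: int) -> int:
    # Every variant re-executes the whole workflow.
    return n * v

def steps_with_replay(n: int, k: int, v: int) -> int:
    # Steps 1..k-1 are reconstructed from history; only k..n re-run.
    return (n - (k - 1)) * v
```

For n = 100, k = 92, v = 100 this gives 10,000 versus 900 steps, and the gap widens as k grows relative to n.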

Now consider failure recovery. Your agent summarizes 10,000 SEC filings and fails after processing 9,999.

  • Without Temporal: You restart at filing #1 — potentially hours of work and thousands of API calls, all repeated.
  • With Temporal: The history for the first 9,999 filings is recorded. State is rebuilt from the event log and the last filing is processed.
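In toy form, resuming from the event log looks like this; it's a simulation of the behavior, not the SDK:

```python
# A restart replays recorded summaries from the event log and only
# executes the filings that never finished. `summarize` stands in for
# the expensive API call.
def process_filings(filings, summarize, event_log):
    newly_processed = []
    for f in filings:
        if f in event_log:
            continue                 # rebuilt from history, no API call
        event_log[f] = summarize(f)  # record the result durably
        newly_processed.append(f)
    return newly_processed
```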

Again, this advantage compounds with workflow length. The longer your workflow, the more the value proposition of Durable Execution grows.

Companies like OpenAI, Abridge, Lovable, Replit, and Hebbia already use Temporal to build some of the most sophisticated agents on the market. Temporal gives them the building blocks to move faster and build deeper.

The complexity cliff is real. Most frameworks get you to the edge and leave you there. If you’re building agents that need to operate above it, Durable Execution is your ideal foundation.

Ready to build agents that operate above the cliff? Explore our code samples and AI Cookbook to get started with Durable Execution, or join the #topic-ai channel in our Community Slack to connect with other developers building agentic systems on Temporal.
