AI reliability is a decade-old problem. And we’re still only solving half of it

AUTHORS
Melanie Warrick
DATE
Apr 01, 2026
DURATION
7 MIN

The AI agents being deployed today can reason through complex tasks, chain together dozens of tool calls, and operate autonomously for hours. What most of them can’t do is survive something going wrong halfway through.

Even if an agent were 85% reliable at each step, a 10-step workflow would succeed end-to-end only about 20% of the time. Scale that to the longer workflows that production agents actually run, and even strong step-level performance produces cascading failure. Not because the model got something wrong, but because the system had no way to checkpoint progress, recover from a partial failure, or resume where it left off.
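The compounding math is easy to verify. A quick sketch, assuming each step succeeds independently with the same probability (the step counts and reliability figures here are illustrative):

```python
# Probability that an n-step workflow finishes end-to-end when every
# step is independent and succeeds with probability p.
def end_to_end_success(p: float, n: int) -> float:
    return p ** n

# 85% per-step reliability over 10 steps: roughly a 20% completion rate.
print(round(end_to_end_success(0.85, 10), 3))  # ~0.197

# Longer workflows make it worse: even 99% per-step reliability over
# 50 steps fails end-to-end roughly 4 times in 10.
print(round(end_to_end_success(0.99, 50), 3))  # ~0.605
```

The point isn’t the exact numbers; it’s that end-to-end reliability decays exponentially with workflow length unless something outside the model breaks the chain of compounding failures.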

That gap between a system that can reason about a problem and a system that can survive one captures where AI reliability stands today. The industry is investing heavily in one half and largely ignoring the other.

A widely reported incident in late 2025 made this concrete. A developer using Google’s Antigravity AI coding assistant asked it to clear a project’s cache folder. Instead, the agent reportedly wiped the user’s entire D: drive. The data was unrecoverable. The AI could diagnose exactly what had gone wrong. It could articulate the failure in detail. What it could not do was recover.

The intelligence was there. The resilience was not.

How we got here

This isn’t a new problem. AI reliability has been a challenge for more than a decade, and I’ve watched it evolve firsthand. When I started working in AI in 2015, breakthroughs like Microsoft’s ImageNet result (a 4.94% top-5 error rate that edged past the commonly cited human benchmark) made the field feel like it was crossing an important threshold. I was building neural network platforms and implementing ML algorithms for recommendation systems. Even then, I learned fast that the gap between “impressive” and “dependable” was enormous.

The first generation of AI reliability problems was about the models themselves.

Could they think correctly?

Could they avoid fabricating information, encoding bias, or confidently presenting wrong answers?

Sound familiar?

In the mid-2010s, image recognition models were confidently identifying objects that weren’t there. In 2016, Microsoft’s Tay chatbot was generating toxic content within hours of launch, not because someone was careless, but because unsupervised learning from uncurated data does exactly what you’d expect. In 2018, IBM discovered Watson for Oncology was suggesting blood thinners for patients at risk of severe bleeding; MD Anderson Cancer Center shut down the $62 million project.

None of these failures happened because the teams weren’t trying. They happened because making AI reliable in everyday conditions is a genuinely hard problem, one with consequences that scale with the technology’s capability.

Genuine progress followed. One study found that 55% of citations generated by GPT-3.5 were fabricated, a rate that dropped to 18% with GPT-4. Guardrails, RLHF, and reasoning chains are making models meaningfully more trustworthy. But ask a leading model to solve a multi-step algebra problem ten times and you’ll get different answers, some correct, some wildly off, all presented with equal confidence. The problem still isn’t solved.

And while model reliability was improving, something else was shifting underneath it.

The ground moved

Somewhere around 2024, AI crossed a threshold. We moved from systems that suggest to systems that act. Agents began browsing the web, writing and executing code, managing calendars, interacting with APIs, and orchestrating multi-step workflows on behalf of users. This wasn’t an incremental capability gain. It changed the nature of what “AI reliability” means.

When a chatbot hallucinates, someone reads a wrong answer. When an AI agent hallucinates mid-workflow, it might wipe a hard drive, make an unauthorized purchase, or fabricate records to cover its tracks. That’s the difference the Antigravity incident illustrated: not just a bad answer, but an irreversible action in a system with no mechanism to recover.

Recent research on agent reliability argues that traditional evaluations miss the operational qualities that determine whether agents hold up in practice: consistency across runs, robustness to perturbations, predictability, and bounded failure severity. In that work, capability gains translated into only small reliability improvements.

METR’s research puts numbers on this. They tested frontier models on real tasks of varying length and found that models succeeded reliably on tasks that took human experts a few minutes, but success rates dropped sharply as tasks stretched to hours. It’s not that the models were less capable on the longer tasks. They just couldn’t hold it together across the full sequence of steps required to complete them. That’s the compound failure problem in practice. The 2026 International AI Safety Report, authored by over 100 experts, identifies persistent unreliability as a core challenge for the foundation models underpinning these systems.

AI reliability and system reliability have always coexisted in production. But for most of that history, model researchers worked on accuracy, bias, and hallucination while infrastructure engineers focused on serving a prediction and returning a result. The failure modes were bounded. AI agents changed the equation. The moment AI started executing autonomous, multi-step workflows in production, the infrastructure had to do something it was never designed for: keep an unpredictable system reliable over long-running operations.

The foundation that’s missing

Almost all of the AI reliability conversation today centers on the model layer: better training, better guardrails, better benchmarks. That work is essential. But production AI agents need something else entirely.

What happens if the process crashes halfway through a ten-step workflow? What happens if a downstream service times out? What happens if a human needs to approve a step two days later? What happens if a tool call succeeds but the acknowledgment fails?

These are infrastructure questions, not model questions. And right now, most AI systems don’t have good answers for them.

I think of the answer as a digital bookmark for your workflow: a checkpoint that captures exactly where you are, what’s already happened, and what’s left to do, so recovery means resuming, not rebuilding.

An agent crashes mid-tool-call and wakes up with full context of what already succeeded, what failed, and where to pick up. No re-execution. No lost state. No silent corruption.
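The bookmark pattern can be sketched in a few lines of plain Python. This is a minimal illustration of checkpointed execution, not Temporal’s actual API; the journal file, step names, and step functions are all hypothetical:

```python
import json
from pathlib import Path

class Checkpoint:
    """A hypothetical journal: each completed step's result is persisted
    before the workflow moves on, so a restart resumes from the record
    instead of re-executing work that already succeeded."""

    def __init__(self, path: str):
        self.path = Path(path)
        self.done = json.loads(self.path.read_text()) if self.path.exists() else {}

    def run(self, step_name, fn, *args):
        if step_name in self.done:      # already succeeded: skip, don't redo
            return self.done[step_name]
        result = fn(*args)              # execute the step
        self.done[step_name] = result
        self.path.write_text(json.dumps(self.done))  # persist before continuing
        return result

# Usage: if the process dies after "fetch", rerunning the same script
# resumes at "transform" with the fetched result intact.
ckpt = Checkpoint("workflow_state.json")
data = ckpt.run("fetch", lambda: "raw-data")
out = ckpt.run("transform", lambda d: d.upper(), data)
```

A real durable-execution system does far more (distributed state, retries, timers, human-in-the-loop waits), but the core contract is the same: record what happened before acting on it, so recovery means resuming, not rebuilding.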

That’s the principle behind Durable Execution, and it’s what we build at Temporal. The same infrastructure that has kept mission-critical workflows running at companies like Snap, Netflix, and Stripe is now underpinning AI agent orchestration in production, because the reliability problem an AI agent faces mid-workflow is fundamentally the same problem any long-running distributed process faces. It just matters more now, because the system is making decisions, not just moving data.

I’ve spent a decade in AI, building through each wave of the field’s development. Deploying AI without the right infrastructure is how we end up with agents that can diagnose their own failures in perfect detail and do nothing to recover from them. That’s not a model problem, but an infrastructure one. And it’s solvable.


