We built a durable agent that debugs durable agents

AUTHORS
Almog Baku
DATE
Jun 18, 2026
CATEGORY
  • AI/ML
  • Durable Execution
  • Architecture
  • Code Samples
DURATION
7 MIN
This is a guest post written by Almog Baku, CEO and co-founder of Kelet AI.

When regular software breaks, a harness agent (Claude Code, Codex, whatever) follows the stack trace and fixes it. AI fails silently across hundreds or thousands of traces, with no stack trace to follow. That's why we had to stop building harnesses and start building proper, durable workflows.

TL;DR: Kelet is an AI that continuously diagnoses quality failures in AI agents, and it's built on Temporal. This post covers the architecture, why it had to be structured this way, why a naive agent loop over your traces cannot solve the problem, and how you can use Kelet to debug your own Temporal-based agents.

Look at a single failing session in your trace viewer. You'll see what went wrong. You won't see why it keeps happening.

AI failures don't show up like ordinary bugs. The same input succeeds ten times and fails on the eleventh, and no two failures look quite alike. The actual root cause is a fuzzy cluster: a pattern that only becomes visible across hundreds of sessions, when you start asking what the failures have in common.

We worked with an insurance team whose two-agent pipeline kept misclassifying claims. Every individual trace looked like a scoring error in the second agent. The real problem was the first agent: it was stripping chronological order from call transcripts, and in this particular insurance company, the sequence of events determines the claim type. One session would have sent you patching the wrong agent entirely. It took hundreds of sessions before the shape emerged.

[Figure: Kelet AI session architecture]

That's the problem Kelet solves. And it's the reason we built it on Temporal.

Durable orchestration, not an agent loop

Coding agents (like Claude Code, Codex, and even OpenClaw) can easily diagnose software bugs. Give one the trace or the log, and it'll find and fix the exception. That part works.

Root causes of AI quality failures are the harder case, because they don't live in an individual session or a trace. They emerge from the overlap pattern across many occurrences. To find that overlap, you need to do three things: process each session as it arrives, accumulate hypotheses about what's going wrong, and then reason across the accumulated set. Those are three different jobs with their own latency profiles, inputs, and coordination requirements, and you can't fold them into a single LLM call.
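To make the "overlap pattern" concrete, here is a toy sketch of the accumulate-then-reason step. It assumes each failing session has already been reduced to a set of hypothesis labels; the labels, session IDs, and the `rank_hypotheses` helper are illustrative assumptions, not Kelet's actual schema or algorithm.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical per-session output: each failing session yields a set of
# suspected causes. Labels are invented for illustration only.
session_hypotheses: Dict[str, Set[str]] = {
    "s1": {"scoring_error", "transcript_order_lost"},
    "s2": {"transcript_order_lost", "missing_policy_field"},
    "s3": {"transcript_order_lost", "timeout"},
    "s4": {"scoring_error"},
}


def rank_hypotheses(per_session: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Count how many sessions mention each hypothesis, most common first."""
    counts: Counter = Counter()
    for hypotheses in per_session.values():
        counts.update(hypotheses)  # one vote per session per hypothesis
    # Sort by count descending, then by name, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_hypotheses(session_hypotheses)
```

No single session singles out `transcript_order_lost`, but it leads the ranking once the sessions are counted together. A real pipeline would need fuzzier matching than exact label equality; this only shows why the reasoning has to run over accumulated state.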

Dumping thousands of traces into one LLM call doesn't fix this either. The bottleneck isn't context length. It's that pattern-matching across sessions requires building up state over time, gating the next stage on the previous one, and surviving restarts in the middle. That's pipeline infrastructure, and that's what we needed Temporal for. Specifically, we needed these Temporal primitives:

  • Long-running state: Sessions accumulate over hours and days; analysis can't be synchronous.
  • Durable Execution: Workers restart mid-analysis; each stage must resume cleanly from where it left off.
  • Event-driven coordination: A new Signal should trigger reprocessing immediately, without polling.
  • Cross-Workflow gating: Cluster analysis only fires once enough hypotheses have accumulated.

A cron runner can't do the event-driven part. A generic task queue can't hold the durable long-running state. We needed Workflow hierarchy, Signals, wait_condition, and continue_as_new (to reset Event History for long-running Workflows without losing state). Temporal gave us all four.
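The gating idea can be sketched in plain asyncio, without Temporal, to show the shape of "wake on a signal, proceed only when a predicate holds". The `HypothesisGate` class and its threshold are hypothetical stand-ins; unlike a Temporal Workflow, this toy version is not durable and would lose its state on a restart.

```python
import asyncio
from typing import List


class HypothesisGate:
    """Plain-asyncio stand-in for a Workflow that gates on accumulated state."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # hypothetical tuning knob
        self.hypotheses: List[str] = []
        self._changed = asyncio.Condition()

    async def signal(self, hypothesis: str) -> None:
        # Analogous to a Signal handler: mutate state, then wake any waiter.
        async with self._changed:
            self.hypotheses.append(hypothesis)
            self._changed.notify_all()

    async def wait_until_ready(self) -> List[str]:
        # Analogous to wait_condition: suspend until the predicate holds.
        async with self._changed:
            await self._changed.wait_for(
                lambda: len(self.hypotheses) >= self.threshold
            )
            return list(self.hypotheses)


async def demo() -> List[str]:
    gate = HypothesisGate(threshold=3)
    waiter = asyncio.create_task(gate.wait_until_ready())
    await asyncio.sleep(0)  # let the waiter start and block on the gate
    for name in ("h1", "h2", "h3"):
        await gate.signal(name)
    return await waiter


result = asyncio.run(demo())
```

The waiter never polls: it is resumed only when a signal changes the state it depends on, which is the property a cron runner cannot provide.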

How we built it

The algorithmic problem above drove the architecture directly. You need analysis stages that are isolated (tractable input → tractable output), coordinated (each stage gates on the previous), and durable (any stage can restart cleanly after a Worker crash).

Temporal gave us the primitives to build those properties without writing infrastructure from scratch. The system that runs in production today is built around a four-level Workflow hierarchy.

[Figure: Event router Workflow architecture diagram]

This is a simplified view of what we actually run. The real system has more Workflow types, more nesting, and additional coordination paths.

Session Workflow — one per session, but not triggered immediately. It waits for a 5-minute silence window before starting analysis, avoiding partial-state work when a session is still in progress:

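from datetime import timedelta
from temporalio import workflow

DEBOUNCE_WINDOW: timedelta = timedelta(minutes=5)  # attribute of the Workflow class
# In the run loop: block until the session has been silent for the full window.
# wait_condition raises asyncio.TimeoutError if wait_time elapses first.
await workflow.wait_condition(self._should_process, timeout=wait_time)
# Predicate for wait_condition; workflow.now() is deterministic Workflow time.
def _should_process(self) -> bool:
    return (workflow.now() - self._last_message) >= self.DEBOUNCE_WINDOW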
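The snippet above does not show how `wait_time` is derived. One plausible reading, sketched here with plain datetimes, is that it is the time left until the silence window closes. The `remaining_wait` and `should_process` helpers are assumptions for illustration, not Kelet's code.

```python
from datetime import datetime, timedelta, timezone

DEBOUNCE_WINDOW = timedelta(minutes=5)


def remaining_wait(last_message: datetime, now: datetime) -> timedelta:
    """How much longer the session must stay silent before analysis may start."""
    elapsed = now - last_message
    return max(DEBOUNCE_WINDOW - elapsed, timedelta(0))


def should_process(last_message: datetime, now: datetime) -> bool:
    """True once the silence window has fully elapsed."""
    return remaining_wait(last_message, now) == timedelta(0)


t0 = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
early = remaining_wait(t0, t0 + timedelta(minutes=2))  # two minutes of silence so far
ready = should_process(t0, t0 + timedelta(minutes=5))  # full window elapsed
# A new message at minute 4 restarts the window, so minute 5 is too early.
reset = should_process(t0 + timedelta(minutes=4), t0 + timedelta(minutes=5))
```

Each new message pushes the deadline out, so a session that is still active never reaches analysis with partial state.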

The Workflow rebuilds all state from the database on startup — no in-memory state that can't survive a Worker restart.

Signal Workflows — Signal enrichment, merging, and agent interrogation run in parallel per session. They're fully independent; there's no reason to serialize them.
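As a rough illustration of that fan-out, the sketch below runs three independent stages concurrently with `asyncio.gather`. The three coroutine names are hypothetical placeholders for the stages named above, and the bodies do no real work.

```python
import asyncio
from typing import List


# Hypothetical stand-ins for the three per-session stages.
async def enrich_signals(session_id: str) -> str:
    await asyncio.sleep(0)  # placeholder for real work
    return f"{session_id}:enriched"


async def merge_signals(session_id: str) -> str:
    await asyncio.sleep(0)
    return f"{session_id}:merged"


async def interrogate_agent(session_id: str) -> str:
    await asyncio.sleep(0)
    return f"{session_id}:interrogated"


async def analyze_session(session_id: str) -> List[str]:
    # The stages share no inputs or outputs, so they can run concurrently.
    # gather returns results in the order the awaitables were passed.
    return list(
        await asyncio.gather(
            enrich_signals(session_id),
            merge_signals(session_id),
            interrogate_agent(session_id),
        )
    )


results = asyncio.run(analyze_session("s1"))
```

In a Temporal Python Workflow the same fan-out is typically expressed by gathering child Workflow or Activity calls instead of local coroutines, which adds durability to each branch.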

Agent Aggregation Workflow: collects hypothesis attributions across sessions for a given agent. This is where individual session analyses become a cross-deployment failure profile, and it is the stage that solves the second-order problem from earlier.

Investigate Issue Workflow: takes a failure cluster, reasons over it, produces a root cause with evidence, and generates a prompt patch. This stage only fires once enough sessions have accumulated in the aggregation layer, which is the gate that makes the reasoning tractable.

We use Kelet on Kelet, and you can too

We monitor Kelet's own Temporal Workflows in production with Kelet. Temporal-native integration was the starting point, not an afterthought — we needed a system that could work on itself.

The integration is a single Temporal plugin that does two things:

  1. Configure Temporal's built-in OpenTelemetryPlugin so your traces are wired up out of the box.
  2. Propagate Kelet's session attributes (session ID, user ID, metadata) across Workers, Workflows, and Activities — including child Workflows and continue_as_new — by stamping Temporal headers on the way out and reading them on the way in to attach as OTel span attributes.

[Figure: Kelet AI Workflow diagram]

Headers become part of Event History, so propagation replays deterministically and survives Worker restarts and continue_as_new for free. On the Activity side, the inbound interceptor opens agentic_session(...) around the body, and every OTel span emitted inside picks up the session as an attribute: your LLM calls, retrieval, and tool spans land in the right Kelet session without your code ever touching it.

One detail that matters in our own prod: the interceptor filters out Kelet's own monitoring Workflows so they don't get re-ingested as sessions. Otherwise self-monitoring becomes an infinite loop — every diagnosis spawns a session, which spawns another diagnosis, which spawns another session.
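A minimal sketch of such a filter, assuming internal monitoring Workflows can be recognized by a type-name prefix. The prefix and Workflow names are invented for illustration; the post does not say how the real interceptor identifies them.

```python
# Hypothetical naming convention: internal monitoring Workflows share a prefix.
INTERNAL_PREFIX = "kelet."


def should_ingest(workflow_type: str) -> bool:
    """Skip the monitor's own Workflows so diagnosis never observes itself."""
    return not workflow_type.startswith(INTERNAL_PREFIX)


observed = [
    "ClaimsTriageWorkflow",
    "kelet.SessionWorkflow",
    "kelet.InvestigateIssueWorkflow",
]
ingested = [w for w in observed if should_ingest(w)]
```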

If you're building agents on Temporal, the fastest path:

> npx skills add Kelet-ai/skills

The skill registers the plugin for you. Hand-wired, it's three lines:

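import kelet
from kelet.temporal import KeletPlugin
from temporalio.client import Client
from temporalio.worker import Worker

# Run inside an async function; MyWorkflow and my_activity are your own definitions.
kelet.configure(api_key="...", project="my-agent")
client = await Client.connect("localhost:7233", plugins=[KeletPlugin()])
worker = Worker(client, task_queue="ai", workflows=[MyWorkflow], activities=[my_activity])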

Point Kelet at your own Workflows and the durable analysis pipeline — cross-session aggregation, root-cause clustering, prompt patches — runs on top of the traces you're already producing.

What this looks like in production

We diagnose thousands of sessions per day. No human in the loop. No polling, no cron jobs, no manual triggers — new sessions flow in via Temporal Signals, and the pipeline picks them up from there.

Worker restarts don't drop anything. Every Workflow resumes from where it stopped. The infrastructure we didn't have to build: a scheduler, a distributed state machine, per-stage retry logic, and a coordination layer. Temporal handles all of that.

That durability rests on a contract with Temporal: Activities are idempotent, and Signals carry an idempotency key so a Worker restart that re-delivers a Signal doesn't double-process the session.
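The idempotency-key half of that contract can be illustrated with a toy receiver that remembers which keys it has already handled. The `SessionInbox` class and the key format are assumptions for illustration; in a real Workflow the seen-key set would have to live in durable state.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SessionInbox:
    """Toy receiver that processes each idempotency key at most once."""

    seen_keys: Set[str] = field(default_factory=set)
    processed: List[str] = field(default_factory=list)

    def on_signal(self, idempotency_key: str, payload: str) -> bool:
        # A redelivered message carries the same key, so it is dropped here.
        if idempotency_key in self.seen_keys:
            return False
        self.seen_keys.add(idempotency_key)
        self.processed.append(payload)
        return True


inbox = SessionInbox()
first = inbox.on_signal("msg-1", "user: hello")
duplicate = inbox.on_signal("msg-1", "user: hello")  # simulated redelivery
second = inbox.on_signal("msg-2", "agent: hi")
```

The duplicate is acknowledged but changes nothing, so delivering a message twice has the same effect as delivering it once.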

The work that took us months was the analysis pipeline itself: getting the stage boundaries right, tuning the debounce windows, figuring out how to aggregate across sessions, and getting the second-order reasoning to turn thousands of individual diagnoses into a single, named root cause.

Temporal freed us to spend that time on the actual problem instead of on coordination plumbing. Kelet is what we built with that time. If you're running agents or LLM applications on Temporal, pointing Kelet at them gets you continuous Failure Analysis on top of the Durable Execution you already have, which is what surfaces the cross-session patterns your traces won't show.
