Building observable AI agents: Temporal now integrates with Braintrust

AUTHORS
Ethan Ruhe, Ornella Altunyan
DATE
Jan 20, 2026
DURATION
6 MIN

AI agents are rapidly becoming the dominant pattern for LLM applications. But as these agents grow more sophisticated — orchestrating multiple models, calling external APIs, and executing multi-step workflows — two challenges emerge: durability and observability.

What happens when your agent crashes mid-task? How do you know which prompt version produced better results? How do you debug a workflow that touches five different models across dozens of API calls?

Today, we’re excited to announce the integration between Braintrust and Temporal, bringing together Durable Execution and LLM observability to make AI agents easier to build, monitor, and operate in production.

The problem: AI agents in the real world

Getting an agent to production means dealing with some familiar headaches:

  • Agents fail mid-execution. API rate limits, network timeouts, model errors — your 30-minute research workflow dies at minute 29.
  • Prompts are black boxes. You tweak a system prompt, but you can’t quantify whether outputs improved or regressed.
  • Debugging is archaeology. Something went wrong, but reconstructing what happened across multiple LLM calls is painful.
  • Iteration is slow. Changing a prompt requires a code deploy, even when the logic hasn’t changed.

What is Temporal?

Temporal provides Durable Execution for your code. When you write a Temporal Workflow, you get:

  • Automatic retries: Failed Activities retry with configurable backoff
  • State persistence: If your Worker crashes, another picks up exactly where it left off
  • Visibility: See every Workflow execution, its state, and its history

For AI agents, this means your multi-step research Workflow survives infrastructure failures. If the synthesis step fails after 20 successful web searches, you don’t re-run those searches; Temporal replays the cached results and retries only the failed step.
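
As a rough sketch (with a hypothetical Activity name, search_web, not the example's actual code), this is what durable retry looks like in a Temporal Python Workflow:

# Sketch only: search_web is a hypothetical Activity name.
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class SearchSketchWorkflow:
    @workflow.run
    async def run(self, query: str) -> str:
        # If this Activity fails, Temporal retries it with backoff;
        # if the Worker crashes, a replacement Worker replays Event
        # History and resumes without re-running completed steps.
        return await workflow.execute_activity(
            "search_web",
            query,
            start_to_close_timeout=timedelta(seconds=60),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_attempts=5,
            ),
        )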

Some of the most sophisticated agentic applications on the market already run on Temporal, including those at OpenAI, Scale AI, and Replit.

What is Braintrust?

Braintrust is the AI observability platform. By connecting evals and observability in one workflow, Braintrust gives builders the visibility to understand how AI behaves in production and the tools to improve it. Teams at Notion, Stripe, Zapier, Vercel, and Ramp use Braintrust to compare models, test prompts, and catch regressions — turning production data into better AI with every release.

The integration: Better together

The Braintrust-Temporal integration connects these capabilities:

YOUR AI AGENT

  Temporal                      Braintrust
  --------------------------    -------------------------
  Durable workflow execution    LLM call tracing
  Automatic retries             Prompt versioning
  State persistence             Offline evals
  Workflow visibility           Cost and latency metrics

With a few lines of code, every Temporal Workflow and Activity becomes a Braintrust span. Every LLM call is traced with full context. And prompts can be managed in Braintrust’s UI while your Temporal Workers pull the latest versions automatically.

Demo: Deep research agent

Let’s look at a real example: a deep research agent that takes a question, plans a research strategy, fans out into parallel web searches, and synthesizes findings into a report.

The agent orchestrates four specialized sub-agents:

  • Planning agent: Decomposes the question into research aspects
  • Query generation agent: Creates optimized search queries
  • Web search agent: Executes searches and extracts findings (multiple run in parallel)
  • Synthesis agent: Produces the final research report

This workflow can take several minutes and involves dozens of API calls. Without Temporal, a failure at any point means starting over. Without Braintrust, you’re flying blind.
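
The orchestration has roughly this shape (a sketch with hypothetical Activity names, not the repo's exact code):

# Orchestration sketch: plan_research, generate_queries, search_web,
# and synthesize_report are hypothetical Activity names.
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class DeepResearchSketch:
    @workflow.run
    async def run(self, question: str) -> str:
        opts = {"start_to_close_timeout": timedelta(seconds=120)}
        plan = await workflow.execute_activity("plan_research", question, **opts)
        queries = await workflow.execute_activity("generate_queries", plan, **opts)
        # Fan out: web searches run as parallel Activities, each with
        # its own retry behavior and durable result.
        findings = await asyncio.gather(*(
            workflow.execute_activity("search_web", q, **opts) for q in queries
        ))
        return await workflow.execute_activity("synthesize_report", findings, **opts)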

Setting up the integration

First, install the dependencies:

uv pip install temporalio braintrust openai

Step 1: Wrap your OpenAI client

One line to trace every LLM call:

# activities/invoke_model.py
from braintrust import wrap_openai
from openai import AsyncOpenAI

# Every call through this client is now traced in Braintrust
# max_retries=0 because Temporal handles retries
client = wrap_openai(AsyncOpenAI(max_retries=0))
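
Calls made through the wrapped client are then captured automatically. For example (a hypothetical request inside an Activity; the model name and messages are placeholders):

# Hypothetical usage: this request is logged to Braintrust with its
# input, output, token counts, and latency.
response = await client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the findings."}],
)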

Step 2: Add the Braintrust plugin to Temporal

Initialize the logger and add the plugin to your Worker:

# worker.py
import asyncio
import os

from braintrust import init_logger
from braintrust.contrib.temporal import BraintrustPlugin
from temporalio.client import Client
from temporalio.worker import Worker

# Adjust these imports to your project layout
from activities.invoke_model import invoke_model
from workflows import DeepResearchWorkflow

# Initialize Braintrust before creating the worker
init_logger(project=os.environ.get("BRAINTRUST_PROJECT", "deep-research"))

async def main():
    client = await Client.connect("localhost:7233")

    worker = Worker(
        client,
        task_queue="deep-research-task-queue",
        workflows=[DeepResearchWorkflow],
        activities=[invoke_model],
        plugins=[BraintrustPlugin()],  # ← This is all you need
    )

    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())

Add the plugin to your Client for full trace propagation:

# start_workflow.py
import asyncio
import sys
import uuid

from braintrust import init_logger, start_span
from braintrust.contrib.temporal import BraintrustPlugin
from temporalio.client import Client

from workflows import DeepResearchWorkflow  # adjust to your project layout

init_logger(project="deep-research")

async def main():
    research_query = sys.argv[1]
    workflow_id = f"deep-research-{uuid.uuid4()}"

    client = await Client.connect(
        "localhost:7233",
        plugins=[BraintrustPlugin()],
    )

    with start_span(name="deep-research-request", type="task") as span:
        span.log(input={"query": research_query})

        result = await client.execute_workflow(
            DeepResearchWorkflow.run,
            research_query,
            id=workflow_id,
            task_queue="deep-research-task-queue",
        )

        span.log(output={"result": result})

if __name__ == "__main__":
    asyncio.run(main())

That’s it. Now every Workflow execution appears in Braintrust with full hierarchy.

Step 3: Manage prompts with load_prompt()

Instead of hardcoding prompts, you can load them from Braintrust:

# activities/invoke_model.py
import os

import braintrust
from braintrust import wrap_openai
from openai import AsyncOpenAI
from temporalio import activity

# InvokeModelRequest / InvokeModelResponse are request/response types
# defined elsewhere in the example project.

@activity.defn
async def invoke_model(request: InvokeModelRequest) -> InvokeModelResponse:
    instructions = request.instructions  # Fallback

    # Load prompt from Braintrust if available
    if request.prompt_slug:
        try:
            prompt = braintrust.load_prompt(
                project=os.environ.get("BRAINTRUST_PROJECT", "deep-research"),
                slug=request.prompt_slug,
            )
            built = prompt.build()
            for msg in built.get("messages", []):
                if msg.get("role") == "system":
                    instructions = msg["content"]
                    activity.logger.info(
                        f"Loaded prompt '{request.prompt_slug}' from Braintrust"
                    )
                    break
        except Exception as e:
            activity.logger.warning(f"Failed to load prompt: {e}")

    client = wrap_openai(AsyncOpenAI(max_retries=0))
    # ... use instructions in your LLM call

Your synthesis agent can then reference a Braintrust-managed prompt:

# agents/research_report_synthesis.py
result = await workflow.execute_activity(
    invoke_model,
    InvokeModelRequest(
        model=COMPLEX_REASONING_MODEL,
        instructions=REPORT_SYNTHESIS_INSTRUCTIONS,  # Fallback
        prompt_slug="report-synthesis",  # ← Load from Braintrust
        input=synthesis_input,
        response_format=ResearchReport,
    ),
    start_to_close_timeout=timedelta(seconds=300),
)

The prompt iteration loop looks like this:

  1. Develop prompts in code, see results in Braintrust traces
  2. Create a prompt in Braintrust UI from your best version
  3. Evaluate different versions using Braintrust’s eval tools
  4. Deploy by pointing your code at the Braintrust prompt
  5. Iterate in the UI — changes go live without code deploys
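
For step 3, a minimal offline eval might look like this (a sketch only: the dataset row and the run_with_candidate_prompt task function are hypothetical, and the scorer comes from the separate autoevals library):

# eval_synthesis.py: sketch of an offline eval comparing prompt versions.
from braintrust import Eval
from autoevals import Factuality

Eval(
    "deep-research",
    data=lambda: [
        {
            "input": "What are the latest advances in quantum computing?",
            "expected": "Covers hardware scaling and error correction.",
        },
    ],
    task=run_with_candidate_prompt,  # hypothetical: runs the synthesis step
    scores=[Factuality()],
)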

What you can see in Braintrust

Once integrated, you get complete visibility into your agent’s behavior:

  • Full trace hierarchy: From client request through every Workflow step
  • LLM call details: Inputs, outputs, token counts, latency for each call
  • Prompt versions: Which version was used for each execution
  • Cost metrics: What each research query costs across models

What you can see in Temporal

And on the Temporal side:

  • Workflow executions: Every research request with full Event History
  • Activity retries: Which Activities failed and how they recovered
  • Execution state: Where each Workflow is in its lifecycle

Try it yourself

The complete deep research example is available on GitHub:

View the deep research example on GitHub

To run it locally:

# Terminal 1: Start Temporal
temporal server start-dev

# Terminal 2: Start the Worker
export BRAINTRUST_API_KEY="your-api-key"
export OPENAI_API_KEY="your-api-key"
export BRAINTRUST_PROJECT="deep-research"
uv run python -m worker

# Terminal 3: Run a research query
uv run python -m start_workflow "What are the latest advances in quantum computing?"

We’re excited to see what you build. If you have questions or feedback, reach out; we’d love to hear from you.

Ethan Ruhe is the Product Lead for AI at Temporal. Ornella Altunyan is a Developer Marketer at Braintrust.
