When I told a colleague that I was working on a post that would present a mental model for AI agents, he asked, “What will make it different from the dozens of others out there already?”
A great question. I noodled on it for a bit, and here’s what it came down to: I read a whole bunch of pieces, yet still didn’t feel like I had a working model in my head. They were either very high-level or down in the weeds. I couldn’t see how it all fit together: goals, tools, loops, LLMs.
So I kept studying. I played with LangGraph. I vibe coded with Cursor (coding agents are a great example of AI agents). I built a new agent from a sample that my colleague Steve created. I drew some pictures (which I will share here), and I socialized all of this with a bunch of people who are well ahead of me on the learning curve, as well as some who were also learning. And now, I get it. Don’t get me wrong, I still have a ton to learn, but I now have a framing around which I can extend that learning.
This post is the mental model I wish I’d had when I started. It’s for my fellow devs who want to understand how agentic AI applications actually work: what you need to build, how the pieces fit together, and why it’s not as magical (or as mysterious) as it might seem initially.
I’m Talking about LLMs and Tools Here
First, let me be clear about the flavor of agent I’m talking about: the kind that leverages an LLM to drive the flow of an application; that is, the LLM determines the next steps that the application will take, and the agent’s job is to execute tools at the LLM’s direction. (Not to worry, I’ll say more about tools in just a moment.)
I’m not going to talk about whether we should use the term “agent” to describe apps that invoke LLMs and downstream services through code that has a predetermined flow. It’s an interesting debate, for sure, but this is something that is quite adequately covered in some of the aforementioned pieces (I particularly like this one from LangChain).
From here on out, when I use the term “AI agent”, or just “agent”, I’m talking about an application that operates roughly like this:
- The agent is generally implemented as an event loop that is kicked off with an expression of some goal.
- In the loop, it:
  - Asks the LLM to determine the next steps in the flow, and then
  - Invokes one or more tools to perform those actions.
- It keeps looping until the LLM determines the goal has been met or the user stops it.
An agent has a fixed set of tools available to it; for now, think of a tool as just an API (more to come in a moment). Each time the LLM responds with some instruction, it is the job of the agent to take that response and make preparations for tool execution. The agent may ask the user for confirmation before executing the tool. And after a tool is executed, it is the job of the agent to update the content that is fed to the LLM for the next turn.
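To make the shape of that loop concrete, here is a minimal Python sketch. It is not a real framework: `call_llm` and `execute_tool` are hypothetical stubs you would replace with an actual LLM client and real API calls; the point is only the structure of the loop described above.

```python
# A minimal sketch of the agent event loop (illustrative only, not a real SDK).
from dataclasses import dataclass, field

@dataclass
class Decision:
    """What the agent extracts from one LLM response."""
    goal_met: bool = False
    final_answer: str = ""
    tool_name: str = ""
    tool_args: dict = field(default_factory=dict)

def call_llm(context: list[dict]) -> Decision:
    """Hypothetical stub: send the context to an LLM and parse its reply."""
    raise NotImplementedError("plug in a real LLM client here")

def execute_tool(name: str, args: dict, tools: dict) -> str:
    """Hypothetical stub: invoke the named tool (an API) with structured arguments."""
    return tools[name](**args)

def run_agent(goal: str, tools: dict) -> str:
    # Bootstrap the LLM input with the goal (tool descriptions would be added too).
    context = [{"role": "system", "content": goal}]

    while True:                                  # the event loop
        decision = call_llm(context)             # the LLM decides the next step
        if decision.goal_met:                    # goal reached (a user could also stop it)
            return decision.final_answer

        result = execute_tool(decision.tool_name, decision.tool_args, tools)
        # Update the content fed to the LLM for the next turn.
        context.append({"role": "tool", "content": result})
```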
These are effectively the pieces that you, the AI agent developer, are responsible for building:
- A prompt that will bootstrap the LLM to kick off the work of the agent.
- A library of tools that may be invoked by the agent.
- A mechanism for turning LLM decisions into a concrete tool execution.
- A mechanism for updating the input to the LLM on each turn through the loop.

And then you have to put all of these pieces together, so you will need:
- An application that durably implements the event loop. A durable event loop is one that can recover from a crash, picking up where it left off in the event of failure.
- And finally, a way to durably invoke LLMs and tools. Durable invocations are ones that will survive intermittent network and service outages, rate-limited LLMs, and more.
This, right here, is my mental model for an AI agent.
In the rest of this post, I will cover each of the things on this list in a bit more detail.
The Language of LLMs and Tools
By now, you likely know the basics of how an LLM works: it takes in some content as a sequence of tokens, and it outputs content, usually as a sequence of tokens. The output tokens are determined from the large amounts of content that the models have been trained on (say, the content of the entire internet). The training of these models isn’t germane to our conversation here, but the language of the input and output is.
Just a few scenarios:
- Natural language as input, natural language as output — for example, getting a summary of a meeting transcript.
- Natural language as input, an image as output — for example, the generation of your latest cat meme 😸
- Natural language plus code as input, code as output — for example, asking Claude Code, GitHub Copilot, Codex, or Cursor to make updates to a code base.
That is, the interface to an LLM is the language for the input and the language for the output. As the AI agent developer, you get to decide what that interface is. Most often, the input will include at least some natural language that expresses the goal that the agent is to fulfill ← remember this for a moment.
Now let’s turn to the interface for tools.
Ultimately, a tool turns into an invocation of a service that has rigorously defined inputs and outputs that might be described through specifications like OpenAPI, GraphQL SDL, or Protobuf for gRPC. But earlier I suggested there was a bit more to it than this — and that is all about language.
In order for an LLM to determine which tools must be executed, and in which order, it must have a way to associate the words that make up the goal with the right tools. This is done by associating a set of words with each of the tools — namely, each tool will have natural language descriptions of:
- The overall function of the tool — for example, a service that will search for flights with available seats.
- The meaning of each of the input parameters — for example, the departure and arrival locations, and date of travel.
- The meaning of each of the output parameters — for example, a flight number and cost.
The Model Context Protocol (MCP) has become the de facto standard for tool specifications.
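To make that concrete, here is what such a description might look like for the flight-search tool used in the example later in this post, sketched as a Python dict in the JSON-Schema style that MCP and most LLM tool-calling APIs follow. The field names are illustrative rather than copied from any particular spec.

```python
# Illustrative only: a tool description in roughly the shape MCP and most
# LLM tool-calling APIs use. The natural-language "description" fields are what
# let the LLM connect the words in the goal to the right tool.
search_flights_tool = {
    "name": "search_flights",
    "description": "Search for flights with available seats between two airports.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "departure":   {"type": "string", "description": "Departure airport code, e.g. LAX"},
            "destination": {"type": "string", "description": "Arrival airport code, e.g. YYZ"},
            "date":        {"type": "string", "description": "Travel date in YYYY-MM-DD format"},
        },
        "required": ["departure", "destination", "date"],
    },
    # Output description (field name is illustrative): what the tool returns.
    "output": "A list of flights, each with a flight number and a cost.",
}
```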
The goal should be quite detailed, and must be expressed with language that makes it easy for the LLM to associate it with the set of available tools. While LLMs are quite adept at navigating synonyms and alternate phrasing — and indeed, this is one of the major advantages of agentic systems over the rigid, form-based experiences that have dominated our digital experiences in the past — the more closely aligned the language of your goal is to the descriptions of the tools, the better the outcome is likely to be.
To bootstrap the agentic loop, the goal, the tool descriptions, and some additional context (which I will go into more detail on in just a bit) are fed to the LLM, which will return output indicating which tool is to be invoked.
Let’s now talk about taking that output and readying it for tool invocation.
Preparing for Tool Invocation
Let’s walk through a concrete example. Say the goal is to search for flights, book one, and then charge the user. With a properly crafted goal, upon invocation, the LLM will decide that the first thing the agent will do is search for relevant flights, using the flight search tool. That tool takes in arguments including departure and destination cities, and a travel date. So before running the tool, the agent must collect those arguments and structure them in such a way that the tool can be invoked with the right values. The agent handles the preparation for invoking the tool.
How the agent does this preparation can vary. If necessary, such as in the above example, it could present the user with a form that allows them to provide the necessary information, or it may engage in a chat with the user to get that data; the latter is depicted in the following diagram.
The lower left depicts the main event loop, but then you can see another loop that is used to prepare for tool execution. In this second loop, the LLM on the left is used to generate prompts to the user, guiding them toward the needed inputs, and the LLM on the right (hmm, that kind of looks like the tool I’m using is an LLM!! 👀) is used to validate the response from the user. To be clear, this agent structure is not prescriptive — you will decide the structure of your agents — but I’m finding the mental model of the primary event loop helpful in keeping my agents well structured.
Once all the inputs are gathered, an API will be invoked, and for that, the data must be structured. But what better way to transform unstructured data (“I’d like to go from LAX to Toronto on April 16th”) into structured form (`{"departure": "LAX", "destination": "YYZ", "date": "2025-04-16"}`) than an LLM? You achieve this by supplying additional instructions to the LLM.
For example, the loop at the upper right would include instructions that might say something like “once you have gathered all the inputs needed for the current tool, structure it as JSON, using the parameter names you find in the tool description.” Remember the breadcrumb I left earlier that suggested the LLM would take in context beyond the goal and tool descriptions? This is a perfect example. As an agent developer, it behooves you to get LLMs to do a lot of work for you, and the more instruction and context you provide, the better.
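Here is a rough sketch of what that might look like in code. The instruction text, the `complete` helper, and the JSON parsing are all hypothetical; in practice you would likely lean on your LLM provider’s structured-output or tool-calling support to do the same thing.

```python
import json

def complete(messages: list[dict]) -> str:
    """Hypothetical stub: send messages to an LLM and return its raw text reply."""
    raise NotImplementedError("plug in a real LLM client here")

# Hypothetical context instructions, added once the conversation has gathered
# everything the current tool needs.
STRUCTURING_INSTRUCTIONS = (
    "Once you have gathered all the inputs needed for the current tool, respond "
    "with only a JSON object, using the parameter names from the tool description."
)

def prepare_tool_args(conversation: list[dict]) -> dict:
    """Ask the LLM to turn the free-form conversation into structured tool arguments."""
    reply = complete(conversation + [{"role": "system", "content": STRUCTURING_INSTRUCTIONS}])
    # e.g. {"departure": "LAX", "destination": "YYZ", "date": "2025-04-16"}
    return json.loads(reply)
```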
And speaking of supplying data to the LLM, let’s now turn our attention to the lower part of the event loop where we will take stock of where things sit after the tool execution, and craft the input to the next turn with the LLM.
Updating the LLM Input
You might have noticed that the further we got into this post, the more I spoke about prompts. And indeed, prompt engineering is one of the main jobs of the AI agent developer. I have found it helpful to think about the LLM input as containing the following chunks of data:
- The goal — this provides an overall description of the function of the agent; this should be quite detailed and expressed in such a way that the LLM can make decisions in the context of the available tools.
- The tools — the APIs that will be executed in achieving a goal, along with natural language descriptions of the function, inputs, and outputs.
- Example conversation — which demonstrates what an ideal interaction looks like.
- Context instructions — additional instructions that guide the LLM’s output, for example providing output formats, or input validation instructions.
- Conversation history — which records the dialog between the user and the agent; it is up to the agent to keep track of the conversation as LLMs do not preserve any state over multiple invocations.
I already talked about bootstrapping the event loop with the goal and tools, and while I’ve called it out separately in this list, the example conversation can be thought of as a detailed part of the goal; these pieces should be included during bootstrapping and remain throughout all the iterations of the event loop.
When you are updating the input to the LLM on subsequent turns, you will primarily be working with the context instructions and the conversation history. The latter is quite straightforward — you will append both the question asked of the user (which was, of course, generated by the LLM) and the user’s response to the conversation history.
The context instructions will be updated with the results of the tool execution and perhaps some additional instructions, allowing the LLM to assess the progress against the goal and choose the next step. One of my favorite examples here is that when coding agents have made some changes to source code, the context instructions must be updated to reflect the new state of those source files. Pretty cool to think about it that way, huh?
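As a rough illustration, here is one way those chunks might be organized and updated each turn. The dict layout and helper names are just an assumption for the sketch, not a prescribed format.

```python
# Illustrative only: one way to organize the chunks of LLM input.
def bootstrap_input(goal: str, tools: list[dict], example_conversation: list[dict]) -> dict:
    return {
        "goal": goal,                      # stays constant across turns
        "tools": tools,                    # tool descriptions, also constant
        "examples": example_conversation,  # what an ideal interaction looks like
        "context_instructions": [],        # updated with tool results and extra guidance
        "conversation_history": [],        # the agent tracks this; the LLM keeps no state
    }

def record_exchange(llm_input: dict, question: str, user_reply: str) -> None:
    """Append both the LLM-generated question and the user's answer."""
    llm_input["conversation_history"].append({"role": "assistant", "content": question})
    llm_input["conversation_history"].append({"role": "user", "content": user_reply})

def record_tool_result(llm_input: dict, tool_name: str, result: str) -> None:
    """Reflect the new state of the world so the LLM can assess progress next turn."""
    llm_input["context_instructions"].append(f"Result of {tool_name}: {result}")
```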
Okay, we’ve covered all of the pieces and parts; let’s now put it all together.
Putting It All Together
As a reminder, here are those pieces and parts:
- Goals and tools, expressed using language that allows the LLM to leverage the right tool at the right time.
- Mechanisms for preparing for tool invocation, following direction from the LLM.
- Mechanisms for updating the LLM input with the new state of the world following tool invocation.
Of course, we also need the agent to invoke the LLM and the tools.
This sounds like an orchestration problem to me, captured in the following diagram:
It’s just an app that you can write in whatever programming language you fancy. Python? TypeScript? Java? Golang? .NET? Ruby? Even PHP? Take your pick!
I will draw your attention to two things that need special care.
First, notice that the invocations of the LLM and the tools are external calls, and of course you know that all sorts of things can go wrong. Rate limiting on the LLM might cause that call to fail, even if only intermittently. Networks can be flaky. Even tool APIs could fail. But in the cloud-native era we are all living in, your application is expected to be resilient to these types of failures. You need to ensure you have durability for these external calls.
And second, the AI agent app — that which is depicted in the above flowchart — could itself go down, yet you will want your application to pick up where it left off after something like Kubernetes resurrects it. In other words, you need to ensure durability for the AI agent itself. This means we must preserve the agent state as well as an understanding of where, exactly, the agent was when the crash occurred.
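One way to get both kinds of durability is to run the event loop as a durable workflow and wrap every LLM and tool call in a retried activity. As a hedged sketch (this is the general pattern, not the sample agent mentioned earlier), here is roughly what that could look like with the Temporal Python SDK; the activity bodies are stubs you would replace with a real LLM client and real tool calls.

```python
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

# Activities wrap the unreliable external calls; the platform retries them for you.
@activity.defn
async def call_llm(llm_input: dict) -> dict:
    raise NotImplementedError("hypothetical stub: plug in a real LLM client")

@activity.defn
async def execute_tool(tool_name: str, tool_args: dict) -> str:
    raise NotImplementedError("hypothetical stub: plug in a real tool (API) call")

@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, goal: str) -> str:
        llm_input = {"goal": goal, "history": []}
        retries = RetryPolicy(maximum_attempts=5)

        while True:  # the durable event loop: progress survives worker crashes
            decision = await workflow.execute_activity(
                call_llm,
                llm_input,
                start_to_close_timeout=timedelta(minutes=2),
                retry_policy=retries,
            )
            if decision.get("goal_met"):
                return decision.get("final_answer", "")

            result = await workflow.execute_activity(
                execute_tool,
                args=[decision["tool_name"], decision["tool_args"]],
                start_to_close_timeout=timedelta(minutes=2),
                retry_policy=retries,
            )
            llm_input["history"].append({"tool": decision["tool_name"], "result": result})
```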
And there you have it, the details of my mental model for AI agents.
Takeaways
As I hinted, I’m building some agents, both to explore a variety of ideas and to demo some Temporal capabilities in an AI context. This work not only led to the formation of the mental model I’ve just presented, but also a deeper appreciation for three elements of AI agent development.
First, is prompt engineering. Prompt engineering was one of the first AI programming techniques that emerged after the release of ChatGPT. Remember all of the talk about “one-shot” and “few-shot”? (And can you believe that was only 4–5 years ago?!) While this area of development has continued to evolve to include concepts like chain-of-thought, and tree-of-thoughts, now that robots are doing the prompting, not just humans, the “science” of prompt engineering has become even more important. Prompts are at the heart of how agents work, and we are seeing standards like MCP emerge around them.
Second, I know where I stand on the question of what the best language for agent construction is. Answer: whatever programming language you like! Even though a significant portion of the agent implementation is contained within the prompts (goals, tools, context instructions, etc.), you still need an engine that drives things. While I am suggesting that the mental model I present here is a good framing to use when designing your agents, the details of what happens in the various parts of the event loop will vary. For example, you may need to include security and compliance checks before you invoke a tool. You need to have complete control over that logic, and there is no better way to do this than with a general-purpose programming language. Simply put, you need flexibility.
And finally, ensuring durability for these agentic systems is absolutely critical. The distributed, cloud-native applications that have come to dominate the landscape over the last 10–15 years have already proven that a host of patterns (like retries and externalized state) were needed to deliver reliable applications over an inherently unreliable infrastructure. Agentic AI applications take that challenge to the next level, easily increasing the number of downstream calls to LLMs and tools by an order of magnitude. Check out this short video for more details.
With that I invite you to have a look at the sample AI agent that I mentioned at the start. And here is a short video of Steve demoing it.
I’d also love to hear from you. One of the best ways you can do this is by finding me on the Temporal Community Slack — I’m the only Cornelia Davis there! 🙂