The Customer

Replit is known as an industry leader in AI coding, although the company got its start in 2016 selling a cloud integrated development environment (IDE). They launched Replit Agent in 2024: an agent that automates software development, letting anyone build applications more easily. After a massive surge in adoption, Replit pivoted their business around the agent product. The company's momentum continues to grow with the recent release of Agent 3.

The Challenge

It's a pretty bad user experience to have the agent get super far into something and then hit a catastrophic error, and you lose everything and have to restart.

Replit Agent was initially built with custom orchestration and reliability logic. After the product launched, the Platform team faced the tasks of cleaning up tech debt and improving system reliability. Two main challenges were:

Orchestrating a control plane layer. All agent processes must be tracked by a control plane layer that knows where the agents are running, ensures there’s only one agent process per user session, and handles agent failures. The control plane must also manage the same requirements for the cloud containers where the code is running, and support multiplayer coding. These are big distributed systems problems with many edge cases that require fine-tuning.
Reliability. The agent encountered common reliability challenges like running out of memory, model provider outages, tool calls failing, and so on. The team knew they needed something to supervise the agents and respond to reliability issues. They decided to search for an off-the-shelf infrastructure technology or a primitive to address these challenges.

The Solution

We launched Replit Agent in September 2024, and in November we looked into Temporal. It took a couple of weeks to get pretty much the whole thing migrated onto Temporal. So it’s very easy to use, with a very good DevEx.

An engineer on Replit’s Platform Team was familiar with Temporal from past work experience. He knew Temporal as a leader in the orchestration space and evaluated the product to see how easy it was to use.

Here’s what stood out to the team:

Quick to get a prototype live
Easy, intuitive Workflow interface that creates a good DevEx
SDKs are idiomatic, letting you use existing code and plug it into Workflows
Recovery from reliability issues like the agent crashing
Managed service (Temporal Cloud) with good SLA

I appreciate the SDKs tend to be idiomatic for the language. The Python SDK really excels at that. For example, you can use regular “async/await.” And so it was really easy to just take our existing code, and start plugging pieces of it into the Workflow and Activities, and to get something working end-to-end.

The Platform team built a prototype on Temporal. After seeing good results, they migrated the control plane layer of Replit Agent.

Temporal solves Replit Agent’s orchestration challenges through the following architecture: Every Agent is its own Temporal Workflow. Temporal allows only one running Workflow per Workflow ID, so there’s only one active Workflow, and therefore only one agent process, per user session. Temporal coordinates all the steps of the agent lifecycle, like spinning up the agent and turning it off.
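To make the "one active Workflow per ID" idea concrete, here is a minimal sketch in plain Python `asyncio`. It is not the Temporal SDK and not Replit's code: `WorkflowRegistry`, `AlreadyRunningError`, `agent_lifecycle`, and the ID `agent-session-42` are all hypothetical names, and the registry lives in memory, whereas Temporal enforces the same rule durably on the server side.

```python
import asyncio


class AlreadyRunningError(Exception):
    """Raised when a run with the same ID is still active."""


class WorkflowRegistry:
    """Toy, in-memory stand-in for an orchestrator: one active run per ID."""

    def __init__(self):
        self._active = {}  # workflow_id -> asyncio.Task

    def start(self, workflow_id, coro_fn, *args):
        existing = self._active.get(workflow_id)
        if existing is not None and not existing.done():
            # A run with this ID is still in flight, so refuse a second one.
            raise AlreadyRunningError(workflow_id)
        task = asyncio.create_task(coro_fn(*args))
        self._active[workflow_id] = task
        return task


async def agent_lifecycle(session_id):
    # Hypothetical lifecycle: spin up the agent, do work, shut it down.
    await asyncio.sleep(0.01)
    return f"agent for {session_id} finished"


async def demo():
    registry = WorkflowRegistry()
    events = []
    first = registry.start("agent-session-42", agent_lifecycle, "session-42")
    try:
        registry.start("agent-session-42", agent_lifecycle, "session-42")
    except AlreadyRunningError:
        events.append("duplicate rejected")
    events.append(await first)
    # Once the first run has completed, the same ID can be used again.
    second = registry.start("agent-session-42", agent_lifecycle, "session-42")
    events.append(await second)
    return events
```

Running `asyncio.run(demo())` shows the second start being rejected while the first run is active, then accepted after it completes. Deriving the ID from the user session is what turns ID uniqueness into "one agent per session".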

The Workflow runs Activities, which contain failure-prone, non-deterministic logic. Temporal retries failed Activities, so this logic recovers automatically. With the Workflow Updates feature, messages can be pushed into a running Workflow, including human-in-the-loop interactions. For example, Replit Agent can pause and wait for the user to accept a consent message before the Workflow continues driving the agent.
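The two behaviors described here, retrying a failure-prone step and pausing until a human responds, can be sketched with plain `asyncio`. This is an illustration of the pattern only, not the Temporal API: `run_with_retries`, `AgentRun`, `flaky_tool_call`, and `accept_consent` are hypothetical names, and unlike a Temporal Workflow this in-memory version would lose its state if the process crashed.

```python
import asyncio


async def run_with_retries(step, *, max_attempts=3, base_delay=0.01):
    """Retry a failure-prone async step with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, surface the failure
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


class AgentRun:
    """Toy agent run that pauses until a consent message arrives."""

    def __init__(self):
        self._consent = asyncio.Event()
        self.waiting = asyncio.Event()  # set once the run reaches the gate
        self.calls = 0
        self.log = []

    def accept_consent(self):
        # Message pushed in from outside, e.g. the user clicking "accept".
        self._consent.set()

    async def flaky_tool_call(self):
        # Hypothetical tool call that fails twice before succeeding.
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("model provider unavailable")
        return "tool result"

    async def run(self):
        self.log.append(await run_with_retries(self.flaky_tool_call))
        self.log.append("waiting for consent")
        self.waiting.set()
        await self._consent.wait()  # pause here until the user responds
        self.log.append("consent accepted, continuing")
        return self.log


async def demo():
    agent = AgentRun()
    task = asyncio.create_task(agent.run())
    await agent.waiting.wait()  # the run is now parked at the consent gate
    agent.accept_consent()
    return await task
```

In this sketch the retry loop plays the role of an Activity's retry behavior, and `accept_consent` plays the role of a message delivered into the running Workflow.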

The Results

Temporal gives us a lot more confidence to build the product and know that it's not going to have lots of edge cases that lead to bad user experiences.

With Temporal, the Platform Team has improved Replit Agent’s orchestration and reliability, at increasingly high scale. They also cited additional unexpected benefits of adopting Temporal. For example, the team has more confidence to build great products, and they can move faster because non-infrastructure engineers don’t need to worry about plumbing.

Replit Agent is just one of many use cases. The Platform team also uses Temporal for:

Previews. Replit’s “Previews” feature lets you preview the application you’re building with Replit Agent. When the agent develops an app, it takes checkpoints (snapshots of the app and the database at a given point in time). Temporal orchestrates this flow.
Cloud services. Once a user has built an application with Replit Agent, they can deploy it and share it. Replit’s deployment product automatically builds a container image of the product and ships it to a cloud provider managed by Replit. This is a classic Temporal infrastructure management Workflow: Temporal orchestrates the build step, spins up the infra, deploys the service, etc., and then reports the status back to the user. Previously, Replit “hand-rolled” this flow using queues, but the manual process was challenging. With Temporal, they simply write the imperative code, and Temporal handles every step, coordinates long-running processes, and waits for user approval.
Domain lifecycle. Replit’s platform also lets users purchase a website domain before deploying their application. Temporal orchestrates the full domain lifecycle, from purchasing to renewal to cancellation. Each domain lifecycle is its own Temporal Workflow, and Schedules are used to handle the proper timing.
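The "simply write the imperative code" point from the Cloud services item above can be illustrated with a straight-line `asyncio` function. This is a sketch of the coding style only, with no durability or retries: `build_image`, `provision_infra`, `deploy_service`, the `report` callback, and the `example.test` URL are all hypothetical placeholders rather than Replit's or Temporal's APIs.

```python
import asyncio


async def build_image(app):
    return f"{app}-image"  # hypothetical build step


async def provision_infra(image):
    return f"infra-for-{image}"  # hypothetical provisioning step


async def deploy_service(infra):
    return f"https://{infra}.example.test"  # placeholder URL


async def deploy(app, report):
    """Straight-line deployment flow: each step awaits the previous one."""
    report("building")
    image = await build_image(app)
    report("provisioning")
    infra = await provision_infra(image)
    report("deploying")
    url = await deploy_service(infra)
    report("live")
    return url


statuses = []
url = asyncio.run(deploy("todo-app", statuses.append))
```

The steps read top to bottom in the order they happen, which is the contrast with a queue-based design where each stage is a separate consumer and the overall flow is implicit.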

Replit started out using the Python SDK, and now they’ve adopted the TypeScript and Go SDKs for the other use cases.

In the time we've used it, we haven't had any major incidents that trace back to Temporal Cloud, which is great.

As hoped, the team hasn’t experienced any major incidents that trace back to Temporal Cloud. They also started using Multi-Region Replication, a feature that replicates their primary Namespaces to a backup region. This feature recently helped them avoid an incident during a cloud provider degradation in their primary region.

The Takeaways

We've been able to scale up, and Temporal has never been the bottleneck. The agent has massively increased in its usage, and not having to rebuild our entire orchestration engine is great.

Replit Agent is a game-changer for application development. With Temporal, the Replit Platform Team can provide users with the experience they deserve, and do so with confidence.

Looking to build reliable, scalable agents like Replit? Start today with a free trial of Temporal Cloud and $1,000 in credits.

Replit uses Temporal to power Replit Agent reliably at scale
