How Monk migrated 100+ workflows from Inngest to Temporal

This is a guest post written by Francesco Coacci, Founding Engineer at Monk.

1. Where we started

Monk is an AI-powered platform for accounts receivable (AR). Our bespoke AI agents help customers get paid faster via an asynchronous collections agent and a cash application agent for core AR functions. Most of the product is autonomous and async: sending invoices, running Intelligent Collections to follow up on outstanding invoices, applying bank transactions to invoices, syncing into accounting systems and ERPs, and managing the automation layer that connects all of it.

By the time we started thinking about Temporal, that async surface was somewhere north of a hundred jobs supporting internal and external functionality at Monk.

Inngest had been great for going fast on day one, but with time it started hurting us at scale. The decision to move was easy. The how was not. You do not migrate a hundred running workflows in a weekend, and you should not try. We moved them incrementally, one reversible pull request at a time, with both systems live the entire way and no migration freeze.

This post is about the system that made that safe, and the parts of Temporal that surprised us once we were inside it.

2. Each workflow took four to seven PRs

A migration this size fails in one of two ways. The first is the full rewrite: you build the new system on a side branch for months, then flip everything at once. It rarely ships well, because there are always minor orchestration differences or edge cases. The second is the half-migration, and it is worse. You move the easy jobs, lose momentum, and the intermediate state sets like concrete. Now every engineer has to remember which jobs live where, forever. An incomplete migration does not just add debt, it compounds it.

So we never let any single workflow sit half-moved. They went one at a time, and each one followed the same sequence:

1. Characterize the Inngest behavior with full test coverage around critical business logic and major orchestration requirements.
2. Add the Temporal scaffold alongside the Inngest job, not wired up.
3. Cut over the dispatch behind a feature flag.
4. Remove the Inngest receiver once the cutover is stable.

Each step was independently reversible and shipped small. You could stop at any of the four stages and the system was still coherent. Step 3 rolled back by flipping a flag; step 4 rolled back by reverting a delete. There was never a single PR where the migration was half-done in production.

During step 3, both runtimes were live at the same time, and the feature flag decided which one handled a given tenant. What made that safe was that the two paths were not two implementations of the logic. They were two wrappers around the same one.

The Activity was a thin shell over the same domain service the Inngest job already called. The business logic stayed in one place throughout, and the flag only chose which runtime wrapped it. That is what let both run in production at once, and let us trust that a tenant on Temporal and a tenant still on Inngest behaved identically. The characterization tests from step 1 held the service steady while the wrapper changed underneath it.

The pattern repeats so clearly in our Git log you can read it linearly: every workflow we moved has the same four commit titles, in the same order. Once it was muscle memory, the per-workflow cycle was a day or two.

2.1 Characterize

This is the step engineers want to skip. Do not skip it.

Before you touch anything, write tests that lock in what the current Inngest job actually does: inputs, outputs, side effects, retry behavior, idempotency assumptions, branch logic that nobody remembers anymore, and the behavior that is implicit in the order of step.run calls.

This is the safety net for steps 2 through 4. The Temporal scaffold has to pass the same tests. The cutover has to keep passing them. The Inngest removal cannot break them. When you find behavior that is wrong on the Inngest side, that is the right time to decide whether to carry it over or fix it.

Two patterns help here:

Write the tests against the service layer, not the framework. Inngest job bodies should be thin wrappers around domain services anyway. If they are not, refactor first so the tests do not have to know about Inngest.
Capture real production payloads from the last 30 days and use them as test fixtures. The cases you remember are not the ones that will break in cutover.
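A characterization test over captured payloads could look roughly like the sketch below. Everything in it is hypothetical: `classifyInvoice` stands in for whatever domain service the Inngest job already calls, and the inline fixtures stand in for payloads captured from production together with the output the current job produced for them.

```typescript
// Hypothetical domain service. It knows nothing about Inngest or Temporal,
// which is what lets the same tests guard every step of the migration.
interface InvoicePayment {
  invoiceTotalCents: number;
  amountPaidCents: number;
}

type InvoiceStatus = 'open' | 'partially_paid' | 'paid';

function classifyInvoice(input: InvoicePayment): InvoiceStatus {
  if (input.amountPaidCents <= 0) return 'open';
  if (input.amountPaidCents < input.invoiceTotalCents) return 'partially_paid';
  return 'paid';
}

// A fixture pairs a captured payload with the output observed today.
interface Fixture {
  label: string;
  input: InvoicePayment;
  expected: InvoiceStatus;
}

// Illustrative fixtures; real ones would be loaded from captured payloads.
const fixtures: Fixture[] = [
  { label: 'no payment', input: { invoiceTotalCents: 10000, amountPaidCents: 0 }, expected: 'open' },
  { label: 'partial payment', input: { invoiceTotalCents: 10000, amountPaidCents: 2500 }, expected: 'partially_paid' },
  { label: 'overpayment', input: { invoiceTotalCents: 10000, amountPaidCents: 12000 }, expected: 'paid' },
  { label: 'negative amount', input: { invoiceTotalCents: 10000, amountPaidCents: -500 }, expected: 'open' },
];

// Returns the labels of fixtures whose output no longer matches what was captured.
function findRegressions(cases: Fixture[]): string[] {
  return cases
    .filter((c) => classifyInvoice(c.input) !== c.expected)
    .map((c) => c.label);
}
```

Because the test only touches the service, it keeps passing unchanged while the wrapper around the service moves from one runtime to the other.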

2.2 Scaffold

The Temporal Workflow and its Activities land next, with no production traffic pointed at them yet. The dispatcher still sends everything to Inngest.

This PR is the largest of the four. It is also the safest, because the new code is dark.

A few choices paid off in every scaffold:

Workflow Type lives in a string constant exported from a *.types.ts file next to the Workflow.
Activities are thin shells over the same domain services the Inngest job called. Anything more than that goes in the service.
If the Workflow runs on a schedule, the Schedule definition ships in this PR and lands in a registry in Git that CI syncs on deploy. We deploy Temporal end to end via IaC to keep both migration and maintenance safe.
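One way to picture the registry sync is as a reconcile step: compare the Schedule definitions checked into Git with the Schedules that currently exist, and work out what to create, update, or delete. The sketch below is an assumption about how such a step could be shaped, not the sync Monk runs in CI. `ScheduleDef`, `SyncPlan`, and `planScheduleSync` are hypothetical names, and the definition is trimmed to an ID and a cron string.

```typescript
// Hypothetical, trimmed-down Schedule definition. A real registry entry
// would carry the full Schedule options object.
interface ScheduleDef {
  scheduleId: string;
  cron: string;
}

interface SyncPlan {
  create: string[];
  update: string[];
  remove: string[];
}

// Compares the registry in Git (desired) with what exists today (actual)
// and returns the Schedule IDs to create, update, or delete.
function planScheduleSync(desired: ScheduleDef[], actual: ScheduleDef[]): SyncPlan {
  const actualById = new Map<string, ScheduleDef>();
  for (const s of actual) actualById.set(s.scheduleId, s);
  const desiredIds = new Set<string>(desired.map((s) => s.scheduleId));

  const plan: SyncPlan = { create: [], update: [], remove: [] };
  for (const want of desired) {
    const have = actualById.get(want.scheduleId);
    if (have === undefined) {
      plan.create.push(want.scheduleId); // in Git, not deployed yet
    } else if (have.cron !== want.cron) {
      plan.update.push(want.scheduleId); // deployed, but the spec drifted
    }
  }
  for (const have of actual) {
    if (!desiredIds.has(have.scheduleId)) {
      plan.remove.push(have.scheduleId); // deployed, but deleted from Git
    }
  }
  return plan;
}
```

Keeping the plan as plain data makes it easy to print in CI before anything is applied.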

2.3 Cut over

The dispatch flips. Inngest stops receiving the event. Temporal starts receiving it. The behavior tests written in step 1 keep passing.

We always did this behind a feature flag scoped to a tenant ID. The flag let us route one canary tenant first, watch it for as long as we needed, and progressively roll forward. It also let us roll back in seconds if anything looked off.

A few rules:

Build the feature flag before the migration, not during. The temptation to skip it on workflow 6 because “it’s small” is the same temptation that costs you a Friday on workflow 7.
The cutover PR should change one thing: the dispatch. No technical debt cleanups. The smaller the PR, the easier it is to read in the rollback meeting nobody wants to have.
Watch the Workflow in the Temporal UI for at least one full firing cycle before declaring success. For a scheduled job that fires Monday through Friday at 10 a.m. ET, that is a full business day. For a webhook job, watch through one peak.
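The tenant-scoped flag can be sketched as a small routing function. This is a minimal illustration under assumptions: `CutoverFlag`, `pickRuntime`, and the in-memory `flags` object are hypothetical, and a real dispatcher would read from a feature-flag service and then either send the Inngest event or start the Temporal Workflow.

```typescript
type Runtime = 'inngest' | 'temporal';

// Hypothetical flag shape: one entry per workflow being cut over.
interface CutoverFlag {
  enabledForAll: boolean;
  enabledTenants: string[];
}

// Decides which runtime handles this tenant's work. With no flag defined,
// traffic stays on Inngest, so a missing flag can never cut anything over.
function pickRuntime(
  flags: Record<string, CutoverFlag>,
  workflow: string,
  tenantId: string
): Runtime {
  const flag: CutoverFlag | undefined = flags[workflow];
  if (flag === undefined) return 'inngest';
  const enabled = flag.enabledForAll || flag.enabledTenants.indexOf(tenantId) !== -1;
  return enabled ? 'temporal' : 'inngest';
}

// Canary stage: a single tenant is routed to Temporal.
const flags: Record<string, CutoverFlag> = {
  'drift-check': { enabledForAll: false, enabledTenants: ['tenant-canary'] },
};
```

Rolling back is then a data change: empty `enabledTenants` and every tenant is back on the Inngest path without a deploy.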

2.4 Remove

After the cutover has been stable for a sprint or so, the Inngest receiver gets deleted. The job stops existing.

This is the PR with the highest line count (large deletes) and the lowest risk (the code has not been receiving traffic). It is also the one that often gets skipped because the team has moved on. Resist that. Code that is unreachable today becomes confusing tomorrow when a new engineer asks why both implementations exist.

3. One workflow, mapped piece by piece

The clearest way to show what the migration actually involved is one workflow, end to end, with the business logic stripped out. The job is a scheduled drift check: every few hours it scans recent records for a kind of inconsistency and raises an error if it finds any. Boring, which is exactly why it is a good first migration.

On Inngest, it is one function. The trigger, the concurrency cap, and the retry policy are all config on the same object as the body:

export const driftCheck = inngest.createFunction(
  { id: 'drift-check', retries: 0, concurrency: 1 },
  [{ cron: '0 */4 * * *' }],
  async ({ step }) => {
    const current = await step.run('check-current', () =>
      DriftService.scan(currentWindow)
    );
    const previous = await step.run('check-previous', () =>
      DriftService.scan(previousWindow)
    );

    const issues = [...current, ...previous];
    if (issues.length > 0) {
      throw new Error(`drift check failed: ${issues.length} records`);
    }
  }
);

On Temporal, the same job is three small files. The Workflow orchestrates. Retry and timeout live on the Activity proxy, not in the body:

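import { ApplicationFailure, proxyActivities } from '@temporalio/workflow';

// Timeout and retry policy apply to every Activity called through this proxy.
const { checkCurrent, checkPrevious } = proxyActivities<DriftActivities>({
  startToCloseTimeout: '60 seconds',
  retry: { maximumAttempts: 1 }, // total attempts, so 1 means no retries
});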

export const driftCheckWorkflow = async (): Promise<void> => {
  const current = await checkCurrent();
  const previous = await checkPrevious();

  const issues = [...current, ...previous];
  if (issues.length > 0) {
    throw ApplicationFailure.nonRetryable(
      `drift check failed: ${issues.length} records`
    );
  }
};

The Activities are deliberately thin. Each is a shell over the same service method the Inngest step called:

export const checkCurrent = () => DriftService.scan(currentWindow);
export const checkPrevious = () => DriftService.scan(previousWindow);

The Schedule is its own object:

import type { ScheduleOptions } from '@temporalio/client';

export const driftCheckSchedule: ScheduleOptions = {
  scheduleId: 'drift-check',
  spec: { cronExpressions: ['0 */4 * * *'], timezone: 'America/New_York' },
  action: {
    type: 'startWorkflow',
    workflowType: 'driftCheckWorkflow', // must match the exported Workflow function name
    workflowId: 'drift-check',
    taskQueue: TASK_QUEUES.MAIN,
  },
  policies: { overlap: 'SKIP', catchupWindow: '1 year' },
};

Every Inngest concept has a home on the Temporal side. They just stop living in one config object and spread across the Workflow, the Activity proxy, and the Schedule:

| Inngest | Temporal | What actually changes |
| --- | --- | --- |
| Cron trigger | Schedule object, autosynced with GitHub Actions | A separate object, not a field on the function |
| Event trigger | `client.workflow.start(type, ...)` from your dispatcher | Event matching becomes an explicit start call you write |
| `concurrency: 1` on a cron | Schedule overlap: `SKIP` | Skip the next tick while the last run is still going |
| `concurrency: { key }` per tenant | No native equivalent | The one knob with no drop-in replacement |
| `retries: 0` | Activity `retry.maximumAttempts: 1` | Counts differently, see below |
| `step.run('name', fn)` | Activity behind `proxyActivities` | Memoized-by-name becomes replayed-from-history, see below |
| `step.sleep('3 days')` | `await sleep('3 days')` | Durable on both sides |
| `step.waitForEvent(...)` | Signal handler plus `condition()` | Match-by-expression becomes Signal-by-Workflow-ID |
| Event idempotency key | Workflow ID plus reuse policy | Dedupe moves off the event and onto the Workflow ID |
| `throw new Error(...)` | `ApplicationFailure.nonRetryable(...)` | You choose whether the failure is retryable |

Two rows in that table are quiet traps, and both bit us before we learned to watch for them.

Retries count differently. Inngest’s retries: N means N attempts after the first failure. Temporal’s maximumAttempts: N is the total, first attempt included. Port retries: 2 straight across to maximumAttempts: 2, and you have silently dropped an attempt. The correct mapping is maximumAttempts: N + 1. Our drift check ran no retries, so retries: 0 became maximumAttempts: 1.
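The off-by-one is small enough to pin down in a helper. The function below is a hypothetical convenience, not part of either SDK. It also guards against a second trap: on the Temporal side a `maximumAttempts` of 0 means unlimited attempts, so copying `retries: 0` across verbatim would turn "never retry" into "retry forever".

```typescript
// Converts an Inngest `retries` count (retries after the first attempt)
// into a Temporal `maximumAttempts` value (total attempts, first included).
function toMaximumAttempts(inngestRetries: number): number {
  if (!Number.isInteger(inngestRetries) || inngestRetries < 0) {
    throw new RangeError(`retries must be a non-negative integer, got ${inngestRetries}`);
  }
  return inngestRetries + 1;
}

// retries: 0 (no retries)  -> maximumAttempts: 1
// retries: 2 (two retries) -> maximumAttempts: 3
const driftCheckAttempts = toMaximumAttempts(0);
const twoRetriesAttempts = toMaximumAttempts(2);
```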

step.run and an Activity are not the same primitive. Inngest re-invokes your function over HTTP once per step and memoizes each step’s result by its name. Temporal replays the whole Workflow function from Event History and hands back the recorded result of each Activity. The consequence shows up the moment you port a function body: anything that sat between step.run calls in Inngest, such as Date.now(), a random pick, or a quick database read, was harmless there because each step ran in a fresh invocation. The same line in a Temporal Workflow body runs again on every replay, and if it can produce a different answer the second time, the replay diverges from history and determinism breaks. The TypeScript SDK’s Workflow sandbox swaps Date and Math.random for deterministic versions, but I/O such as a database read has no such safety net. It has to move into an Activity. This is the difference between copying the body across and actually migrating it.
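A toy model makes the replay difference concrete. The code below is not the Temporal SDK and skips almost everything a real Worker does. `makeReplayer`, `readDb`, and `workflowBody` are hypothetical names for a sketch in which "Activity" results are recorded on the first run and read back on replay, while an inline read simply runs again.

```typescript
// Toy model of replay: results of "Activities" are appended to a history
// array on first execution and returned from it on replay.
type ActivityFn<T> = () => T;

function makeReplayer(recorded: unknown[]) {
  let cursor = 0;
  return function activity<T>(fn: ActivityFn<T>): T {
    if (cursor < recorded.length) {
      return recorded[cursor++] as T; // replay: hand back the recorded result
    }
    const result = fn(); // first execution: run it and record the result
    recorded.push(result);
    cursor++;
    return result;
  };
}

// Stand-in for an external read whose answer changes over time.
let dbValue = 1;
const readDb = (): number => dbValue;

// A Workflow body with one read inline and one behind an "Activity".
function workflowBody(activity: <T>(fn: ActivityFn<T>) => T) {
  const inline = readDb(); // re-executed on every replay
  const recorded = activity(readDb); // executed once, then replayed
  return { inline, recorded };
}

const eventHistory: unknown[] = [];
const firstRun = workflowBody(makeReplayer(eventHistory));
dbValue = 2; // the outside world changes between the run and the replay
const replayed = workflowBody(makeReplayer(eventHistory));
```

In this model the replay sees the same `recorded` value as the first run but a different `inline` value, which is the kind of divergence that moving the read into an Activity removes.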

What did not change was the point. DriftService.scan is untouched. The migration swapped the wrapper around it: a createFunction config object became a Workflow plus a Schedule, and each step.run became an Activity that just calls the service. The risky code stayed where it was, under test. That is what made doing this 100+ times mechanical instead of terrifying.

4. The thing you will miss most: Flow control

There is one row in that table with no clean answer, and it required the most engineering rigor.

In Inngest, capping concurrency per tenant is two lines of config on the function:

{
  concurrency: { limit: 5, key: 'event.data.orgId' },
}

Inngest keeps a separate virtual queue per org and never runs more than five of that org’s steps at a time. Throttle, debounce, rate limiting, priority, and event batching all work the same way: declarative config, no code.

Temporal isn’t built the same way. A Worker can cap how many Activities or Workflow Tasks a single process runs at once, but there is no native “run at most N Workflows of this type” or “run at most N per tenant.” At the time of our migration, that gap showed up in a long-running, heavily upvoted request on the Temporal repo. So you build it. Two levers got us back what we gave up.

Coarse: Separate Namespaces. Our highest-volume integrations can produce more events in an hour than most other jobs produce in a day. We gave them their own Namespace, their own Worker pool, and their own ECS service. A flood on that side cannot starve the Workers running billing and customer-facing flows. One Docker image, two entry points, picked per service by the container command override:

# core worker
command = ["node", "dist/worker.js"]

# integrations worker
command = ["node", "dist/integrations-worker.js"]

The blast radius of a burst is the Namespace it lives in, not the whole platform. If your traffic is uniform and light, you do not need this. The day a workload’s burst profile pulls away from the rest, the split is a second entry point and a Terraform module.

Fine: A coordinator Workflow. Inside a pool, when you actually need a per-tenant limit, you write it. One Workflow lists the work and fans out Child Workflows behind a sliding window:

let inFlight = 0;
const running: Promise<unknown>[] = [];

for (const item of work) {
  await condition(() => inFlight < MAX_CONCURRENT);
  inFlight++;
  running.push(
    executeChild(processItemWorkflow, { args: [item] }).finally(() => {
      inFlight--;
    })
  );
}

await Promise.all(running);

condition() parks the Workflow until a slot frees, deterministically, with no busy loop and no Redis. This is the concurrency key Inngest handed you for free, except now it is code you own and can shape however the workload needs.

The one bug everyone writes first: decrement inFlight on the failure path too, not just success. Miss it, and slots leak until the coordinator wedges. The .finally() above is load-bearing. For very large fan-outs, the coordinator’s Event History grows with every child, so Continue-as-New periodically rather than looping forever in one run.
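For the Continue-as-New part, one possible shape is to hand each coordinator run a bounded batch and carry a cursor into the next run. The helper below is a sketch under assumptions: `takeBatch` and the batch size are hypothetical, and the plain loop at the bottom only simulates successive runs. In a real coordinator, each iteration would be one Workflow run that fans out its batch and then calls `continueAsNew` with the next cursor.

```typescript
interface Batch<T> {
  items: T[];
  nextCursor: number | null; // null means all work has been handed out
}

// Returns the slice of work one coordinator run should process, plus the
// cursor the next run would start from.
function takeBatch<T>(work: T[], cursor: number, batchSize: number): Batch<T> {
  if (batchSize <= 0) throw new RangeError('batchSize must be positive');
  const end = Math.min(cursor + batchSize, work.length);
  return {
    items: work.slice(cursor, end),
    nextCursor: end < work.length ? end : null,
  };
}

// Simulate successive coordinator runs over five items, two per run.
const work = ['a', 'b', 'c', 'd', 'e'];
const runs: string[][] = [];
let cursor: number | null = 0;
while (cursor !== null) {
  const batch = takeBatch(work, cursor, 2);
  runs.push(batch.items); // a real run would fan these out as Child Workflows
  cursor = batch.nextCursor; // and continue as new from this cursor
}
```

Each run's Event History then grows with the batch size rather than with the total amount of work.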

Temporal is closing the gap. Newer versions include Task Queue Priority and Fairness with per-key rate limits, the closest thing yet to Inngest’s concurrency key landing natively. If you are reading this later, check whether it covers your case before you reach for a coordinator.

5. How we deployed it

The Worker runs on ECS Fargate: two services, one shared Docker image, and two entry points selected per service via the container command override. Workers autoscale on CPU. Most of the pool is on Fargate Spot with one always-on Fargate task as a floor. Spot is close to free headroom because losing a Worker costs little: once the Activity’s timeout lapses, Temporal reschedules any Activity whose Worker was reclaimed mid-run, subject to that Activity’s retry policy.

6. What changed

The hard wins:

Workflow-level visibility. Inngest’s CEL search was slow and limited. Temporal gives every run Search Attributes, full Event History, and a status, so finding the one that misbehaved is quick.
Idempotency by construction. Deterministic Workflow IDs replaced an entire class of “did we already do this?” checks across the codebase.
Workload isolation by Namespace. High-volume webhook ingestion is structurally separated from core Workflows. A burst on one cannot pressure the other.
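Idempotency by construction comes down to deriving the Workflow ID from the business identity of the work, so that a duplicate trigger maps to the same ID instead of a new run. The helper below is a minimal sketch. `workflowIdFor` and the `type:tenant:entity` layout are illustrative assumptions, not a Temporal requirement and not necessarily the format Monk uses.

```typescript
// Builds a Workflow ID from the identifiers that define "the same work".
// Parts are URI-encoded so a ':' inside an identifier cannot collide with
// the separator.
function workflowIdFor(workflowType: string, tenantId: string, entityId: string): string {
  return [workflowType, tenantId, entityId]
    .map((part) => encodeURIComponent(part))
    .join(':');
}

// The same inputs always yield the same ID, so a second start request for
// the same invoice targets the same Workflow instead of creating a new one.
const firstStartId = workflowIdFor('apply-payment', 'org_1', 'inv_42');
const duplicateStartId = workflowIdFor('apply-payment', 'org_1', 'inv_42');
```

What happens on that second start request, whether it is rejected or allowed once the first run has closed, is governed by the Workflow ID reuse policy chosen when starting the Workflow.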

The soft wins:

New engineers shipped their first Workflow on day two. The repeating shape of *.workflow.ts, *.types.ts, and *.schedule.ts, plus a working local Temporal setup, meant the only real learning curve was the determinism constraint.
Migration discipline compounded. The four-PR pattern is now muscle memory across the team. The same shape applies to a future migration off any other framework.

7. Wrapping up

We never had a big cutover. Just the same loop, one workflow at a time: characterize, scaffold, flip it behind a flag, delete the old one. Small PRs, easy to roll back, both systems running side by side the whole time.

Keep both systems live, go one workflow at a time, and have both paths call the same service so the new version behaves just like the old one. It is not exciting, but that is kind of the point. Slow and boring is how you move a hundred workflows without breaking things.

Good luck :)