| This blog is a guest post written by Francesco Coacci, Founding Engineer at Monk |
|---|
1. Where we started
Monk is an AI-powered platform for accounts receivable (AR). Our bespoke AI agents help customers get paid faster via an asynchronous collections agent as well as a cash application agent for core AR functions. Most of the product is autonomous and async: sending invoices, running Intelligent Collections to follow up on outstanding ones, applying bank transactions to invoices, syncing into accounting systems and ERPs, and the automation layer that connects all of it.

By the time we started thinking about Temporal, that async surface was somewhere north of a hundred jobs supporting Monk's internal and external functionality. Inngest had been great for going fast on day one, but over time it started hurting us at scale.

The decision to move was easy. The how was not. You do not migrate a hundred running workflows in a weekend (and you should not try). We moved them incrementally, one reversible pull request at a time, with both systems live the entire way and no migration freeze. This post describes the system that made that safe, and the parts of Temporal that surprised us once we were inside it.
2. Each workflow took four to seven PRs
A migration this size fails in one of two ways. The first is the full rewrite: you build the new system on a side branch for months, then flip everything at once. It rarely ships well, because there are always minor orchestration differences or edge cases. The second is the half-migration, and it is worse. You move the easy jobs, lose momentum, and the intermediate state sets like concrete. Now every engineer has to remember which jobs live where, forever. An incomplete migration does not just add debt, it compounds it. So we never let any single workflow sit half-moved. They went one at a time, and each one followed the same sequencing:
- Characterize the Inngest behavior with full test coverage on critical and major business logic and orchestration requirements
- Add the Temporal scaffold alongside the Inngest job, not wired up
- Cut over the dispatch behind a feature flag
- Remove the Inngest receiver once the cutover is stable
Each step is independently reversible and ships small. You can stop at any of the four stages and the system is still coherent. Step 3 rolls back by flipping a flag, step 4 by reverting a delete. There is never a single PR where the migration is half-done in production.

During step 3, both runtimes are live at the same time and the feature flag decides which one handles a given tenant. What makes that safe is that the two paths are not two implementations of the logic. They are two wrappers around the same one. The Temporal activity is a thin shell over the same domain service the Inngest job already called. The business logic keeps one home the whole way through, and the flag only chooses which runtime wraps it. That is what lets both run in production at once and lets you trust that a tenant on Temporal and a tenant still on Inngest behave identically. The characterization tests from step 1 hold the service still while the wrapper changes underneath it.

The shape is so repeated in our git log you can read it linearly: every workflow we moved has the same four commit titles, in the same order. Once it was muscle memory, the per-workflow cycle was a day or two.
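To make the "two wrappers, one service" idea concrete, here is a minimal sketch. It uses no SDK at all, and every name in it (`Invoice`, `ReminderService`, `inngestStepBody`, `findOutstandingActivity`) is hypothetical and chosen only for illustration. It is not our production code.

```typescript
// Hypothetical domain type and service: the single home of the business logic.
type Invoice = { id: string; amountCents: number; paidCents: number };

const ReminderService = {
  // Returns the ids of invoices that still have a balance due.
  findOutstanding(invoices: Invoice[]): string[] {
    return invoices
      .filter((invoice) => invoice.paidCents < invoice.amountCents)
      .map((invoice) => invoice.id);
  },
};

// Wrapper 1: the shape of the callback you would pass to an Inngest step.
const inngestStepBody = (invoices: Invoice[]): string[] =>
  ReminderService.findOutstanding(invoices);

// Wrapper 2: the shape of a Temporal activity. Same call, nothing else.
const findOutstandingActivity = (invoices: Invoice[]): string[] =>
  ReminderService.findOutstanding(invoices);
```

Because neither wrapper contains logic of its own, a test written against `ReminderService` covers both runtimes at once.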
2.1 Characterize
This is the step engineers want to skip. Do not skip it. Before you touch anything, write tests that lock in what the current Inngest job actually does. Inputs, outputs, side effects, retry behavior, idempotency assumptions. Branch logic that nobody remembers anymore. The behavior that is implicit in the order of step.run calls. This is the safety net for steps 2 through 4. The Temporal scaffold has to pass the same tests. The cutover has to keep passing them. The Inngest removal cannot break them. When you find behavior that is wrong on the Inngest side, that is the right time to decide whether to carry it over or fix it. Two patterns help here:
- Write the tests against the service layer, not the framework. Inngest job bodies should be thin wrappers around domain services anyway. If they are not, refactor first so the tests do not have to know about Inngest
- Capture real production payloads from the last 30 days and use them as test fixtures. The cases you remember are not the ones that will break in cutover
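The two patterns above can be sketched as a fixture-driven check against the service layer. This is an illustration under assumptions: `classifyInvoice`, the payload shape, and the fixtures are all made up, and in practice the fixtures would be captured production payloads paired with the output the current job produced for them.

```typescript
type Payload = { invoiceId: string; amountCents: number; paidCents: number };
type Outcome = { status: 'paid' | 'partial' | 'open'; remainingCents: number };

// Hypothetical service under test. It knows nothing about Inngest or Temporal.
const classifyInvoice = (payload: Payload): Outcome => {
  const remainingCents = Math.max(payload.amountCents - payload.paidCents, 0);
  const status =
    remainingCents === 0 ? 'paid' : payload.paidCents > 0 ? 'partial' : 'open';
  return { status, remainingCents };
};

// Each fixture records an input and what the current implementation returned,
// including behavior nobody designed on purpose (the overpayment case).
const fixtures: { name: string; input: Payload; expected: Outcome }[] = [
  {
    name: 'fully paid',
    input: { invoiceId: 'inv_1', amountCents: 1000, paidCents: 1000 },
    expected: { status: 'paid', remainingCents: 0 },
  },
  {
    name: 'partially paid',
    input: { invoiceId: 'inv_2', amountCents: 1000, paidCents: 250 },
    expected: { status: 'partial', remainingCents: 750 },
  },
  {
    name: 'overpaid',
    input: { invoiceId: 'inv_3', amountCents: 1000, paidCents: 1200 },
    expected: { status: 'paid', remainingCents: 0 },
  },
];

// Returns the names of fixtures whose behavior has drifted.
const runCharacterization = (): string[] =>
  fixtures
    .filter((fixture) => {
      const got = classifyInvoice(fixture.input);
      return (
        got.status !== fixture.expected.status ||
        got.remainingCents !== fixture.expected.remainingCents
      );
    })
    .map((fixture) => fixture.name);
```

An empty result means the service still behaves as recorded. The same check runs unchanged before the scaffold, after the cutover, and after the Inngest receiver is deleted.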
2.2 Scaffold
The Temporal Workflow and its activities land next, with no production traffic pointed at them yet. The dispatcher still sends everything to Inngest. This PR is the largest of the four. It is also the safest, because the new code is dark. A few choices that paid off in every scaffold:
- Workflow type lives in a string constant exported from a *.types.ts file next to the workflow
- Activities are thin shells over the same domain services the Inngest job called. Anything more than that goes in the service
- If the Workflow runs on a schedule, the schedule definition ships in this PR and lands in a registry-in-Git that CI syncs on deploy. We manage Temporal as infrastructure as code end to end, which keeps both the migration and ongoing maintenance safe
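For the registry-in-Git idea, the core of a sync step is a diff between the schedules declared in the repository and the schedules that currently exist. The sketch below shows only that diff as a pure function. The `ScheduleDef` shape and the `planScheduleSync` name are assumptions for illustration, and a real sync would also have to call the Temporal client to apply the plan.

```typescript
// Simplified, hypothetical view of a schedule: just an id and a cron string.
type ScheduleDef = { scheduleId: string; cron: string };
type SyncPlan = { create: string[]; update: string[]; remove: string[] };

// Compare what Git declares (desired) with what is deployed (existing).
const planScheduleSync = (
  desired: ScheduleDef[],
  existing: ScheduleDef[]
): SyncPlan => {
  const existingById = new Map<string, ScheduleDef>();
  existing.forEach((schedule) => existingById.set(schedule.scheduleId, schedule));
  const desiredIds = new Set<string>(desired.map((schedule) => schedule.scheduleId));

  const plan: SyncPlan = { create: [], update: [], remove: [] };
  desired.forEach((schedule) => {
    const current = existingById.get(schedule.scheduleId);
    if (current === undefined) {
      plan.create.push(schedule.scheduleId); // declared but not deployed yet
    } else if (current.cron !== schedule.cron) {
      plan.update.push(schedule.scheduleId); // deployed with a stale definition
    }
  });
  existing.forEach((schedule) => {
    if (!desiredIds.has(schedule.scheduleId)) {
      plan.remove.push(schedule.scheduleId); // deployed but no longer declared
    }
  });
  return plan;
};
```

Keeping the diff pure makes it easy to print as a dry run in CI before anything touches the cluster.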
2.3 Cut over
The dispatch flips. Inngest stops receiving the event. Temporal starts receiving it. The behavior tests written in step 1 keep passing. We always did this behind a feature flag scoped to a tenant ID. The flag let us route one canary tenant first, watch it for as long as we needed, and progressively roll forward. It also let us roll back in seconds if anything looked off. A few rules:
- Build the feature flag before the migration, not during. The temptation to skip it on workflow #6 because "it's small" is the same temptation that costs you a Friday on workflow #7
- The cutover PR should change one thing: the dispatch. No technical debt cleanups. The smaller the PR, the easier it is to read in the rollback meeting nobody wants to have
- Watch the Workflow in the Temporal UI for at least one full firing cycle before declaring success. For a scheduled job that fires Mon-Fri at 10am ET, that is a full business day. For a webhook job, watch through one peak
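A tenant-scoped cutover flag can be very small. The sketch below is one possible shape, not our actual flag system: `CutoverFlag`, `chooseRuntime`, and `dispatch` are hypothetical names, and the two senders are injected so the example stays free of SDK calls.

```typescript
type Runtime = 'inngest' | 'temporal';

// Hypothetical flag: an allow-list of canary tenants plus a "roll forward" switch.
type CutoverFlag = { everyone: boolean; tenants: ReadonlySet<string> };

// One sender per runtime. In a real dispatcher these would send the Inngest
// event or start the Temporal workflow.
type Senders = Record<Runtime, (tenantId: string) => void>;

const chooseRuntime = (flag: CutoverFlag, tenantId: string): Runtime =>
  flag.everyone || flag.tenants.has(tenantId) ? 'temporal' : 'inngest';

// The only thing the cutover PR changes: which sender gets called.
const dispatch = (flag: CutoverFlag, tenantId: string, senders: Senders): Runtime => {
  const runtime = chooseRuntime(flag, tenantId);
  senders[runtime](tenantId);
  return runtime;
};

// Canary: one tenant on Temporal, everyone else still on Inngest.
const canaryFlag: CutoverFlag = { everyone: false, tenants: new Set(['tenant_canary']) };
// Rollback: empty the flag and every tenant is back on Inngest.
const rolledBackFlag: CutoverFlag = { everyone: false, tenants: new Set<string>() };
```

Rolling forward means adding tenants to the set and eventually setting `everyone` to true. Rolling back is a flag change, with no deploy involved.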
2.4 Remove
After the cutover has been stable for a sprint or so, the Inngest receiver gets deleted. The job stops existing. This is the PR with the highest line count (large deletes) and the lowest risk (the code has not been receiving traffic). It is also the one that often gets skipped because the team has moved on. Resist that. Code that is unreachable today becomes confusing tomorrow when a new engineer asks why both implementations exist.
3. One workflow, mapped piece by piece
The clearest way to show what the migration actually involved is one workflow, end to end, with the business logic stripped out. The job is a scheduled drift check: every few hours it scans recent records for a kind of inconsistency and raises if it finds any. Boring, which is exactly why it is a good first migration. On Inngest it is one function. The trigger, the concurrency cap, and the retry policy are all config on the same object as the body:
```typescript
export const driftCheck = inngest.createFunction(
  { id: 'drift-check', retries: 0, concurrency: 1 },
  [{ cron: '0 */4 * * *' }],
  async ({ step }) => {
    const current = await step.run('check-current', () =>
      DriftService.scan(currentWindow)
    );
    const previous = await step.run('check-previous', () =>
      DriftService.scan(previousWindow)
    );
    const issues = [...current, ...previous];
    if (issues.length > 0) {
      throw new Error(`drift check failed: ${issues.length} records`);
    }
  }
);
```
On Temporal the same job is three small files. The Workflow orchestrates. Retry and timeout live on the Activity proxy, not in the body:
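```typescript
import { ApplicationFailure, proxyActivities } from '@temporalio/workflow';

// DriftActivities is the type of the activities module.
const { checkCurrent, checkPrevious } = proxyActivities<DriftActivities>({
  startToCloseTimeout: '60 seconds',
  retry: { maximumAttempts: 1 }, // total attempts, so 1 means no retries
});

export const driftCheckWorkflow = async (): Promise<void> => {
  const current = await checkCurrent();
  const previous = await checkPrevious();
  const issues = [...current, ...previous];
  if (issues.length > 0) {
    // Fail the workflow run without asking Temporal to retry it.
    throw ApplicationFailure.nonRetryable(
      `drift check failed: ${issues.length} records`
    );
  }
};
```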
The activities are deliberately thin. Each is a shell over the same service method the Inngest step called:
```typescript
export const checkCurrent = () => DriftService.scan(currentWindow);
export const checkPrevious = () => DriftService.scan(previousWindow);
```
The schedule is its own object:
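```typescript
import type { ScheduleOptions } from '@temporalio/client';

export const driftCheckSchedule: ScheduleOptions = {
  scheduleId: 'drift-check',
  spec: { cronExpressions: ['0 */4 * * *'], timezone: 'America/New_York' },
  action: {
    type: 'startWorkflow',
    // Must match the name the workflow is registered under, which by default
    // is the exported function name.
    workflowType: 'driftCheckWorkflow',
    workflowId: 'drift-check',
    taskQueue: TASK_QUEUES.MAIN,
  },
  // SKIP drops a tick while the previous run is still going.
  policies: { overlap: 'SKIP', catchupWindow: '1 year' },
};
```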
Every Inngest concept has a home on the Temporal side. They just stop living in one config object and spread across the workflow, the activity proxy, and the schedule:
| Inngest | Temporal | What actually changes |
|---|---|---|
| cron trigger | a Schedule object, autosynced with GitHub Actions | a separate object, not a field on the function |
| event trigger | client.workflow.start(type, ...) from your dispatcher | event-matching becomes an explicit start call you write |
| concurrency: 1 on a cron | schedule overlap: 'SKIP' | skip the next tick while the last run is still going |
| concurrency: { key } per tenant | no native equivalent (see §4) | the one knob with no drop-in replacement |
| retries: 0 | activity retry.maximumAttempts: 1 | counts differently, see below |
| step.run('name', fn) | an activity behind proxyActivities | memoized-by-name becomes replayed-from-history, see below |
| step.sleep('3 days') | await sleep('3 days') | durable on both sides |
| step.waitForEvent(...) | a signal handler plus condition() | match-by-expression becomes signal-by-workflow-id |
| event idempotency key | workflow id plus reuse policy | dedupe moves off the event and onto the workflow id |
| throw new Error(...) | ApplicationFailure.nonRetryable(...) | you choose whether the failure is retryable |
Two rows in that table are quiet traps, and both bit us before we learned to watch for them.

Retries count differently. Inngest's retries: N means N more attempts after the first one fails. Temporal's maximumAttempts: N is the total, first attempt included. Port retries: 2 straight across to maximumAttempts: 2 and you have silently dropped an attempt. The correct mapping is maximumAttempts: N + 1. Our drift check ran no retries, so retries: 0 became maximumAttempts: 1.

step.run and an activity are not the same primitive. Inngest re-invokes your function over HTTP once per step and memoizes each step's result by its name. Temporal replays the whole workflow function from event history and hands back the recorded result of each activity. The consequence shows up the moment you port a function body. Anything that sat between step.run calls in Inngest (a Date.now(), a random pick, a quick read off the database) was harmless there because each step ran in a fresh invocation. The same line in a Temporal workflow body runs again on every replay. A database read there breaks determinism, and in the TypeScript SDK Date.now() and Math.random() are replaced inside the workflow sandbox with deterministic versions, so they no longer give you the wall clock or real randomness. Anything that needs the outside world has to move into an activity. This is the difference between copying the body across and actually migrating it.

What did not change is the point. DriftService.scan is untouched. The migration swapped the wrapper around it: a createFunction config object became a workflow plus a schedule, and each step.run became an activity that just calls the service. The risky code stayed where it was, under test. That is what made doing this 100+ times mechanical instead of terrifying.
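The retry mapping is simple enough to put behind a helper so nobody ports the number by hand. A minimal sketch, where `toMaximumAttempts` is a hypothetical name of our choosing rather than part of either SDK:

```typescript
// Convert Inngest's "retries after the first attempt" into Temporal's
// "total attempts, first one included".
const toMaximumAttempts = (inngestRetries: number): number => {
  if (!Number.isInteger(inngestRetries) || inngestRetries < 0) {
    throw new RangeError(`retries must be a non-negative integer, got ${inngestRetries}`);
  }
  // Never copy 0 straight across: in a Temporal retry policy,
  // maximumAttempts: 0 means unlimited attempts, not zero retries.
  return inngestRetries + 1;
};

// retries: 0 -> maximumAttempts: 1 (no retries, as in the drift check)
// retries: 2 -> maximumAttempts: 3 (one attempt plus two retries)
const driftCheckRetryPolicy = { maximumAttempts: toMaximumAttempts(0) };
```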
4. The thing you will miss most: flow control
One row in that table has no clean answer, and it demands the most engineering rigor. In Inngest, capping concurrency per tenant is two lines of config on the function:
```typescript
{
  concurrency: { limit: 5, key: 'event.data.orgId' },
}
```
Inngest keeps a separate virtual queue per org and never runs more than five of that org's steps at a time. Throttle, debounce, rate limiting, priority, and event batching all work the same way: declarative config, no code.

Temporal isn't built the same way. A worker can cap how many Activities or Workflow tasks a single process runs at once, but there is no native "run at most N workflows of this type" or "at most N per tenant." It has been an open, heavily upvoted request on the Temporal repo for years. So you build it. Two levers got us back what we gave up.

Coarse: separate namespaces. Our highest-volume integrations can produce more events in an hour than most other jobs produce in a day. We give them their own namespace, their own worker pool, and their own ECS service. A flood on that side cannot starve the workers running billing and customer-facing flows. One Docker image, two entry points, picked per service by the container command override:
```hcl
# core worker
command = ["node", "dist/worker.js"]

# integrations worker
command = ["node", "dist/integrations-worker.js"]
```
The blast radius of a burst is the namespace it lives in, not the whole platform. If your traffic is uniform and light you do not need this. The day one workload's burst profile pulls away from the rest, the split is a second entry point and a Terraform module.

Fine: a coordinator workflow. Inside a pool, when you actually need a per-tenant limit, you write it. One workflow lists the work and fans out children behind a sliding window:
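```typescript
import { condition, executeChild } from '@temporalio/workflow';

// Body of the coordinator workflow. work, MAX_CONCURRENT and
// processItemWorkflow are defined elsewhere.
let inFlight = 0;
const running: Promise<unknown>[] = [];
for (const item of work) {
  // Park until a slot is free. No polling, and safe under replay.
  await condition(() => inFlight < MAX_CONCURRENT);
  inFlight++;
  running.push(
    executeChild(processItemWorkflow, { args: [item] }).finally(() => {
      // Release the slot whether the child succeeded or failed.
      inFlight--;
    })
  );
}
await Promise.all(running);
```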
condition() parks the workflow until a slot frees, deterministically, with no busy loop and no Redis. This is the concurrency key Inngest handed you for free, except now it is code you own and can shape however the workload needs.

The one bug everyone writes first is releasing the slot only on success. Decrement inFlight on the failure path too. Miss it and slots leak until the coordinator wedges. The .finally() above is load-bearing. For very large fan-outs the coordinator's event history grows with every child, so continue-as-new periodically rather than looping forever in one run.

Temporal is closing the gap. Recent versions ship task queue fairness with per-key rate limits, the closest thing yet to Inngest's concurrency key landing natively. If you are reading this later, check whether it covers your case before you reach for a coordinator.
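The slot-release rule is easy to check outside Temporal. The sketch below is a plain-TypeScript analogue of the coordinator, not workflow code: it swaps condition() for a hand-rolled waiter queue, and `runBounded` is a hypothetical helper written only for this illustration. It shows that with the release in .finally(), a failing item still frees its slot and the loop runs to completion.

```typescript
type BoundedResult = { peak: number; succeeded: number; failed: number };

const runBounded = async <T>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<void>
): Promise<BoundedResult> => {
  let inFlight = 0;
  let peak = 0;
  const waiters: Array<() => void> = [];

  // Wait until a slot is free, then take it.
  const acquire = async (): Promise<void> => {
    while (inFlight >= limit) {
      await new Promise<void>((resolve) => waiters.push(resolve));
    }
    inFlight++;
    peak = Math.max(peak, inFlight);
  };

  // Give the slot back and wake the loop if it is parked.
  const release = (): void => {
    inFlight--;
    const next = waiters.shift();
    if (next) next();
  };

  const running: Promise<boolean>[] = [];
  for (const item of items) {
    await acquire();
    running.push(
      worker(item)
        .finally(release) // runs on success and on failure
        .then(() => true, () => false) // record the outcome instead of rejecting
    );
  }

  const outcomes = await Promise.all(running);
  const succeeded = outcomes.filter((ok) => ok).length;
  return { peak, succeeded, failed: outcomes.length - succeeded };
};
```

Moving `release` from .finally() into a success-only .then() is enough to reproduce the wedge: once `limit` items have failed, `acquire` never returns.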
5. How we deployed it
The workers run on ECS Fargate. Two services share one Docker image, with the entry point selected per service via the container command override. Workers autoscale on CPU. Most of the pool is on Fargate Spot with one always-on Fargate task as a floor. Spot is essentially free reliability headroom because Temporal reschedules any activity whose worker was reclaimed mid-run, once the activity times out and as long as its retry policy allows another attempt.
6. What changed
The hard wins:
- Per-workflow visibility. Inngest's CEL search was slow and limited. Temporal gives every run search attributes, full history, and a status, so finding the one that misbehaved is quick
- Idempotency by construction. Deterministic workflow IDs replaced an entire class of "did we already do this" checks across the codebase
- Workload isolation by namespace. High-volume webhook ingestion is structurally separated from core workflows. A burst on one cannot pressure the other
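As an illustration of the idempotency point, a deterministic workflow ID is just a stable function of the business keys. In the sketch below, `workflowIdFor` and `startOnce` are hypothetical names, and the `Set` is only an in-memory stand-in for the uniqueness check that Temporal performs on workflow IDs according to the configured reuse policy.

```typescript
// Build the same id for the same unit of work, every time.
const workflowIdFor = (kind: string, tenantId: string, entityId: string): string =>
  [kind, tenantId, entityId].join(':');

// Stand-in for "start the workflow unless this id was already used".
// Returns true if the work was started, false if it was a duplicate.
const startOnce = (started: Set<string>, workflowId: string): boolean => {
  if (started.has(workflowId)) return false;
  started.add(workflowId);
  return true;
};

const startedIds = new Set<string>();
const firstDelivery = startOnce(startedIds, workflowIdFor('apply-payment', 'org_1', 'txn_42'));
// A redelivered webhook for the same transaction maps to the same id.
const redelivery = startOnce(startedIds, workflowIdFor('apply-payment', 'org_1', 'txn_42'));
```

The dedupe check lives in the ID itself, so callers do not need a separate "did we already do this" lookup.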
The soft wins:
- New engineers ship their first workflow on day two. The repeating shape of *.workflow.ts + *.types.ts + *.schedule.ts, plus a working local Temporal, means the only real learning curve is the determinism constraint
- Migration discipline that compounded. The four-PR pattern is now muscle memory across the team. The same shape applies to a future migration off any other framework
7. Wrapping up
We never had a big cutover. Just the same loop, one workflow at a time: characterize, scaffold, flip it behind a flag, delete the old one. Small PRs, easy to roll back, both systems running side by side the whole time. Keep both systems live, go one workflow at a time, and have both paths call the same service so the new version behaves just like the old one. It is not exciting, but that is kind of the point. Slow and boring is how you move a hundred workflows without breaking things. Good luck :)