At Apartment 304, we're fortunate to work on interesting problems, like stabilizing a financial institution's sweep network system. This client, offering cash management accounts akin to checking and savings, sought to attract customers and mitigate financial risk. Like many financial institutions, they joined a sweep network, allowing them to offer higher interest rates and maximize their FDIC insurance.

Although their initial integration was successful, it came with stability issues and operational hurdles that ideally should be resolved automatically, rather than requiring manual intervention from the engineering team. We worked with their team to prioritize the issues and made a plan to incrementally resolve them, instead of rewriting the whole process from scratch.

Temporal's flexibility and convenient message passing made it easy to lift and shift workflow steps piece by piece, which reduced the time it took to see impactful stability improvements, and eliminated the risk of a full rewrite's single cut-over event.

Before diving into the technical challenges and how we let Temporal do the heavy lifting, let's explore sweep networks.

Sweep networks in a nutshell

A sweep network is a group of financial institutions that pool their deposited funds together. This lets them collectively manage their cash flow and balances while increasing financial stability across the network. For example, a financial institution might want to increase their lending business, which requires additional cash on hand. The institution can draw those funds from the sweep network. Conversely, some financial institutions carry a higher cash balance than they need, so they contribute their excess to the network.

How does this benefit the people banking at these financial institutions? They get higher interest rates and additional FDIC insurance coverage on their accounts.

From the depositor's point of view, nothing changes: they deposit money into their financial institution as they normally would, and the institution then allocates the funds across various institutions in the network.

[Figure: animated sweep]

Upon withdrawal, the reverse happens. The depositor withdraws from their account, and the financial institution recalculates the amount they need to withhold in their institution and rebalances the amounts shared across the network.
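
To make the allocation idea concrete, here is a minimal sketch of spreading one deposit across institutions so that no single institution holds more than a per-bank cap. The fill-in-order rule, the allocate function, and its parameters are our own illustrative assumptions, not the sweep network's actual allocation logic. The cap in the usage note is the standard FDIC limit of $250,000 per depositor, per insured bank.

```go
package main

// allocate splits a deposit (in cents) across banks, filling each bank up to
// perBankCapCents in order. It returns one amount per bank plus any leftover
// that could not be placed under the cap. This greedy rule is an illustrative
// assumption; a real sweep network applies its own allocation rules.
func allocate(depositCents, perBankCapCents int64, banks int) (placed []int64, leftover int64) {
	if banks < 0 {
		banks = 0
	}
	placed = make([]int64, banks)
	remaining := depositCents
	for i := 0; i < banks && remaining > 0; i++ {
		amount := remaining
		if amount > perBankCapCents {
			amount = perBankCapCents
		}
		placed[i] = amount
		remaining -= amount
	}
	return placed, remaining
}
```

Under these assumptions, allocate(60_000_000, 25_000_000, 3) places a $600,000 deposit as $250,000, $250,000, and $100,000, with nothing left over. Amounts are kept in integer cents to avoid floating-point rounding.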

Challenges in sweep network operations

The sweep process runs daily, sending updated account lists and balances to the network each morning. This sounds simple enough, but there are a few hurdles. As a regulated industry, banks and financial institutions tend to move slowly and sometimes use technologies that a cloud-first approach might otherwise ignore.

I don't mean that they still move physical money around. It's all digital. No more cowboys and stagecoaches hightailing it through the hills. The money transfers electronically, but, in this case, the sweep network uses SFTP file uploads to communicate with the financial institutions in the network. Specifically, banks and the sweep network use files to communicate in both directions, several times each day.

Unfortunately, these SFTP uploads and downloads were a primary pain point. At times, the SFTP server, overwhelmed by requests from across the network, needed our system to pause operations until it recovered. Other times, the sweep network's internal processes were slow, cascading to our system, which wasn't resilient to such delays. The SFTP uploads and downloads failed fairly frequently, and, though the issues were usually temporary, they required human intervention by the engineering team.

There were also many queries needed to calculate sweep balances, any of which could fail if the database was already under load. Add in a Slack integration and networking issues, and there were a number of intermittent errors.

To round it out, the sweep network's strict deadlines had the engineering team rushing to resolve errors, auditing the workflow's progress and manually resuming operations. Before migrating to Temporal, preventable errors needed manual intervention several times each month.

A new plan

After identifying the issues, we set project goals:

  1. Resolve stability issues.
  2. Minimize the risk of migrating by avoiding a single cutover event.
  3. Gracefully handle humans in the loop — aka finance team approval.
  4. Improve the workflow's audit trail.

Temporal was a natural fit, especially since the original codebase was in Go, a language we love. Temporal has a Go SDK, which let us reuse code from the existing workflow, saving time and effort while migrating individual workflow steps. Temporal offers SDKs for several languages, so we could have switched languages during the migration, but sticking with Go let us keep much of the original code.

Being able to incrementally migrate (instead of rewriting the whole legacy process using a more strict workflow system) was a major factor in the project's approval and success. It let us make intentional and methodical changes, testing as we went, while benefiting from early stability improvements.

We prioritized migrating the SFTP operations to a Temporal Workflow first, followed by the database-intensive operations, and finally the Slack human-in-the-loop integration.

We also benefited from Temporal's event history, which tracks all events during a Workflow execution's lifetime. Should errors occur in the future, we would be well positioned to investigate the events that preceded them.

Our general approach in the migration was to move decision-making logic to a Temporal Workflow and move external operations, like SFTP file uploads and database queries, into Activities. A Temporal Workflow defines the overall flow and logic of the process, which it carries out by calling Activities. Each Activity executes a single action, such as calling an external service.

Sweep workflow

The workflow was scheduled to run after the daily account balance job finished each morning. After account balances were set, the sweep process would:

  1. Send account info & changes to the sweep network via SFTP upload
    1. Retrieve account info from database and generate the sweep network account file
    2. Send the file via SFTP upload
    3. Wait for acknowledgement via SFTP response file download
  2. Calculate account amounts to reserve at the "home bank" and remaining balances to send to the network.
    1. Retrieve account balances, calculating the amounts to reserve and send.
    2. Generate a balance file
    3. Store the file, wait for internal approval
  3. Finance team reviews and approves the balances
    1. Send Slack message to finance team
    2. Finance team reviews and approves. (Disapproval is infrequent and requires additional review from finance and engineering)
  4. Send the balance file to the sweep network via SFTP upload
    1. Send the balance file via SFTP upload
    2. Wait for acknowledgement via SFTP response file download

This sequence diagram shows how the process used to work.

[Figure: Legacy Workflow]

SFTP stabilization

The SFTP upload process had multiple steps: Upload the file to the SFTP server, wait for an acknowledgement file to be uploaded by the sweep network, and download the acknowledgement file. The Workflow maintained the decision logic, while each specific operation (SFTP interaction) was implemented in an Activity.

If you've used Temporal before, the benefit is obvious: Temporal automatically retries failed Activities. Once the SFTP operations were wrapped in Activities, any failure caused by the finicky SFTP server was retried automatically and usually succeeded on a subsequent attempt. The sweep process's main pain point was resolved simply by using Temporal primitives.

As an added benefit of using Temporal's Go SDK, the Activity implementations were copied and pasted from the original process with minimal changes, which saved time and increased confidence in the new implementation.

Once the Temporal Workflow was ready to handle the SFTP operations, we updated the legacy process to start the Temporal Workflow and send it a Signal when the account file was ready to upload. The legacy process then waited for the account file SFTP upload steps to complete by Querying the Temporal Workflow status. The Temporal Workflow would run, successfully complete the SFTP steps, update its status, and wait for a Signal to send the final balance file.

At that point, control returned to the legacy workflow, which would calculate balances and send a Slack notification to the finance team for approval. After the balances were approved, the legacy workflow sent a second Signal to the Temporal Workflow, which triggered the balance file SFTP upload steps.

This sequence diagram shows the updated process, with the Temporal Workflow managing the SFTP uploads and downloads.

[Figure: New Workflow]

Let's look at the Temporal implementation.

The Activities manage the SFTP uploads and downloads. UploadFile accepts an S3 file key, referencing the file to upload, and the SFTP upload file path. ReceiveAck downloads the SFTP file stored at the specified path and verifies that the upload was acknowledged.

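import (
	"context"
	"errors"

	// Assumed to be the Go CDK blob package, whose Bucket.NewReader
	// matches the call in UploadFile below.
	"gocloud.dev/blob"
)

// sweepActivity holds the dependencies shared by the SFTP Activities.
// SFTPHandler is the project's own SFTP wrapper (not shown).
type sweepActivity struct {
	bucket     *blob.Bucket
	sftpClient *SFTPHandler
}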

type UploadRequest struct {
	S3FileKey    string
	SFTPFilePath string
}

func (sa *sweepActivity) UploadFile(ctx context.Context, req UploadRequest) error {
	// Open reader for file
	file, err := sa.bucket.NewReader(ctx, req.S3FileKey, nil)
	if err != nil {
		return err
	}
	defer file.Close()

	// Upload file to SFTP
	err = sa.sftpClient.Upload(ctx, req.SFTPFilePath, file)
	if err != nil {
		return err
	}

	return nil
}

type FileWaitRequest struct {
	SFTPFilePath string
}

func (sa *sweepActivity) ReceiveAck(ctx context.Context, req FileWaitRequest) (bool, error) {
	body, err := sa.sftpClient.Download(ctx, req.SFTPFilePath)
	if err != nil {
		return false, err
	}

	if body != "success" {
		return false, errors.New("SFTP ack error")
	}

	return true, nil
}

The Workflow calls these Activities.

After starting, the Workflow waits to receive an "upload-account-file" Signal from the legacy process, which contains a reference to the file that needs uploading. Once received, the Temporal Workflow uploads the file to the SFTP server and waits for acknowledgement. After the file is acknowledged, the Temporal Workflow returns control to the legacy process, and waits for the "upload-balance-file" Signal, sent by the legacy workflow. Again, this triggers the SFTP upload and acknowledgement process.

After each of these steps, the Temporal Workflow updates its status. The legacy process Queries the Temporal Workflow status to know when it should resume processing. The Query handler is registered at the start of the Temporal Workflow and simply returns the current Workflow status, which is stored in the SweepWorkflow struct.

Temporal Workflows are just code, so we encapsulated the SFTP upload and acknowledgement steps in the function uploadFileAndWait, which handles the account and balance exchanges. Temporal is doing the heavy lifting, automatically retrying failed steps. The retries automatically back off, gracefully handling the common error cases when the SFTP server was overloaded and unable to respond, or the sweep network's own processes were delayed. This alone prevented the most frequent errors in the system.

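import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// sweepActivities is a nil pointer used only to reference the Activity
// methods when calling workflow.ExecuteActivity. (Assumed: the declaration
// isn't shown in the original listing.)
var sweepActivities *sweepActivity

// SweepWorkflow holds the state exposed through the "current-status" Query.
type SweepWorkflow struct {
	Status string // "running", "account-file-uploaded", or "balance-file-uploaded"
}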

func RunSweepWorkflow(ctx workflow.Context) error {
	// Configure status query
	work := SweepWorkflow{Status: "running"}
	err := workflow.SetQueryHandler(ctx, "current-status", func() (string, error) {
		return work.Status, nil
	})
	if err != nil {
		return err
	}

	// Wait for account file upload signal
	uploadAccountFileSig := workflow.GetSignalChannel(ctx, "upload-account-file")

	var uploadSignalData UploadRequest
	uploadAccountFileSig.Receive(ctx, &uploadSignalData)

	err = uploadFileAndWait(ctx, uploadSignalData)
	if err != nil {
		return err
	}

	// Update status
	work.Status = "account-file-uploaded"

	// Wait for balance file upload signal
	uploadBalanceFileSig := workflow.GetSignalChannel(ctx, "upload-balance-file")
	uploadBalanceFileSig.Receive(ctx, &uploadSignalData)

	err = uploadFileAndWait(ctx, uploadSignalData)
	if err != nil {
		return err
	}

	// Update status
	work.Status = "balance-file-uploaded"

	return nil
}

func uploadFileAndWait(ctx workflow.Context, uploadReq UploadRequest) error {
	// Upload file
	uploadCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 5},
	})

	err := workflow.ExecuteActivity(uploadCtx, sweepActivities.UploadFile, uploadReq).Get(uploadCtx, nil)
	if err != nil {
		// Human intervention is needed. Notify finance and engineering stakeholders.
		return err
	}

	// Wait for acknowledgement
	waitCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 15 * time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			MaximumAttempts: 15,
			InitialInterval: 10 * time.Second,
		},
	})

	waitReq := FileWaitRequest{
		SFTPFilePath: "resp-file",
	}
	var success bool
	err = workflow.ExecuteActivity(waitCtx, sweepActivities.ReceiveAck, waitReq).Get(waitCtx, &success)
	if err != nil {
		// Human intervention is needed. Notify finance and engineering stakeholders.
		return err
	}

	return nil
}

We'd prevented the most frequent recoverable errors in the legacy system, but we weren't finished. With a successful Temporal Workflow launch, we were ready to migrate and stabilize the rest of the workflow.

Calculating sweep amounts

The natural next step was to migrate the account and sweep balance queries and calculations, which generated the account and sweep balance files. While the queries themselves were effective and performant, there were many queries, and other intensive jobs would occasionally run simultaneously, leading to query failures. Instead of letting the sweep process halt altogether, we leaned on Temporal's automatic retry backoff to gracefully continue.

We lifted and shifted the queries into Temporal Activities, and called them from the Temporal Workflow. The Workflow similarly needed few changes, as the decision logic didn't change.

Conveniently, the account file generation happened just before the account file SFTP upload, and the balance calculations happened just after it. This meant the Temporal Workflow now owned a longer, continuous stretch of the process before handing control back to the legacy workflow. The legacy workflow needed minimal changes: removing the account and balance calculations, and changing the Temporal Workflow status it queried for to "calculation-complete".

[Figure: New Workflow With Sweep Calculation]

Humans in the loop

We were moving and grooving. The final step integrated the Slack approval process into the Temporal Workflow. The integration used the Slack API to send daily balance summaries along with an 'approve' button to the finance team. Upon review, clicking the 'approve' button would signal the Workflow to proceed.

Like the previous migration steps, we moved the decision logic over to the Temporal Workflow, while external Slack API calls moved to Activities. This step showcases a great Temporal benefit — Temporal Workflows are designed to run indefinitely, if you need them to. It makes them well suited for workflows that need input from us slow humans. In this case, we only needed it to run for as long as it took for the finance team to review and respond, and we could easily add subsequent reminders should the approval deadline approach.

And with that, the workflow was migrated! We stopped running the legacy workflow, and let the Temporal Workflow drive the process going forward.

[Figure: New Workflow With Humans In The Loop]

Parting thoughts

By transitioning to Temporal, the financial institution not only addressed immediate operational challenges but also established a scalable, resilient, and efficient foundation for future operations. The incremental migration reduced project risks and accelerated the stability improvements. We were able to see the positive impact almost immediately, without waiting for a complete system overhaul. By tackling critical issues first, we ensured that each step delivered substantial benefits.

Since migrating to Temporal, the need for manual interventions, which were once a frequent necessity, has been virtually eliminated. For me, knowing that our solution has led to a dependable system certainly contributes to a good night's sleep.