EvenUp is transforming the personal injury legal space with AI-powered technology that helps law firms secure better outcomes for their clients. Our innovative mission requires a less visible but still crucial layer: reliable workflow orchestration.
As our processing volume and product complexity grew, it became clear that our engineers' time was better spent building business logic than maintaining and patching the existing workflow execution system.
That’s why we decided to migrate our existing workloads to Temporal.
From fighting problems to building product#
Before adopting Temporal, our infrastructure team frequently discussed how to better monitor, scale, and enable service owners to use the existing workflow execution system.
Those discussions were vital: as customer usage of our products grew, service owners found themselves fighting the newly discovered inefficiency or bottleneck of the week in the existing workflow execution system.
These recurring issues consumed valuable engineering time. Time that could have gone into improving our AI models or building customer-facing capabilities was instead spent stabilizing workflow plumbing.
We know no silver bullet will eliminate all workflow processing problems or bottlenecks. Still, the robust set of metrics Temporal emits for its workloads makes it significantly easier for our engineers to triage and eliminate these issues, and to scale and monitor asynchronous workloads.
Scaling without the guesswork#
We primarily use two different sets of metrics to scale Temporal workers up and down as workloads start and finish:
Slot-based scaling#
The first scaling strategy uses temporal_worker_task_slots_used.
Temporal workers are configured to scale up when the number of Activity or Workflow Task slots in use exceeds a service-owner-defined percentage of total configured slots. This ensures existing workers retain enough headroom to absorb sudden workload spikes.
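The slot-based rule above can be sketched as a simple utilization check. This is an illustrative sketch, not our actual autoscaler: the thresholds, step size, and worker bounds are hypothetical placeholders for the service-owner-defined values, and in production the slot counts come from the temporal_worker_task_slots_used metric rather than function arguments.

```python
# Hypothetical sketch of the slot-based scaling rule. Thresholds and
# worker counts are illustrative, not our actual configuration.

def desired_workers(
    slots_used: int,
    slots_total: int,
    current_workers: int,
    scale_up_pct: float = 0.8,    # service-owner-defined headroom threshold
    scale_down_pct: float = 0.3,  # assumed idle threshold for scale-down
    min_workers: int = 1,
    max_workers: int = 20,
) -> int:
    """Scale up when used slots exceed the threshold, down when mostly idle."""
    utilization = slots_used / slots_total if slots_total else 0.0
    if utilization >= scale_up_pct:
        return min(current_workers + 1, max_workers)
    if utilization <= scale_down_pct:
        return max(current_workers - 1, min_workers)
    return current_workers
```

Keeping the scale-up threshold below 100% is what preserves headroom: new workers come online before the existing ones are fully saturated.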
The slots-used metric proved less effective for scaling Workflow Tasks and shorter-lived Activities, so we incorporated additional latency-based metrics into the scaling configuration.
Latency-based fallbacks#
The metrics temporal_workflow_task_schedule_to_start_latency and temporal_activity_schedule_to_start_latency became important fallbacks.
Our use cases require most workloads to process in near real time. That means schedule-to-start latency must remain low. Our team ensures processing delays remain minimal by adding workers whenever the 95th-percentile schedule-to-start latency exceeds a defined threshold.
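The latency fallback reduces to one question per evaluation interval: is the 95th-percentile schedule-to-start latency above a threshold? A minimal sketch of that check, assuming recent latency samples are already in hand (in practice they would be queried from a metrics backend as temporal_activity_schedule_to_start_latency); the 500 ms threshold is a hypothetical example:

```python
# Hypothetical sketch of the latency-based scaling fallback: add workers
# whenever p95 schedule-to-start latency breaches a defined threshold.
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def should_scale_up(latencies_ms: list[float], threshold_ms: float = 500.0) -> bool:
    """True when recent p95 schedule-to-start latency exceeds the threshold."""
    return bool(latencies_ms) and p95(latencies_ms) > threshold_ms
```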
Together, these Temporal-provided metrics allow our scaling and traffic patterns to align more closely than they did in the existing system. As a result, our platform falls behind on queued work less frequently and does not over-scale as often.
Monitoring that eliminates firefighting#
When we first adopted Temporal, many existing workloads were migrated directly onto it with minimal refactoring. Even without changes to the workloads themselves, Temporal orchestrated them better than the previous system: it is far more feature-rich than our in-house tooling and immediately eliminated some long-standing inefficiencies.
The quick, direct migration meant that there was room for improvement in our initial implementation that became clearer as our experience with Temporal grew. In some cases, Workers would become “stuck” due to failed Workflow cache evictions that slowly ate up Workflow task slots. The culprit was usually too much synchronous CPU-intensive work performed in Workflow tasks.
The SDK metrics and logs emitted by Temporal allowed engineers to pinpoint and fix many of these issues. SDK metrics also enabled the infrastructure team to create a liveness probe that Kubernetes uses to automatically restart containers that are no longer processing Workflows.
The liveness probe tracks a combination of:

- temporal_num_pollers
- temporal_worker_task_slots_available
- temporal_request{operation="RespondWorkflowTaskCompleted"}

Together, these metrics determine whether the Worker container is healthy.
The probe declares a container unhealthy if either of the following conditions holds for an extended period of time:
- Activities: If the number of pollers is zero and the number of available Activity Task slots has remained constant at a non-zero value.
- Workflows: If the number of pollers is zero and the number of completed Workflow Tasks has not changed.
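The two conditions above can be sketched as a comparison between two metric samples taken some interval apart. This is an illustrative sketch, assuming the probe reads the three metrics from the Worker's metrics endpoint; the Sample type, field names, and two-sample model of "for an extended period" are all hypothetical simplifications:

```python
# Hypothetical sketch of the liveness check. "Stuck for an extended period"
# is modeled by comparing the current sample against one taken earlier.
from dataclasses import dataclass

@dataclass
class Sample:
    num_pollers: int               # temporal_num_pollers
    activity_slots_available: int  # temporal_worker_task_slots_available
    workflow_tasks_completed: int  # temporal_request{operation="RespondWorkflowTaskCompleted"}

def is_healthy(earlier: Sample, now: Sample) -> bool:
    # Activities: no pollers, and available slots stuck at a non-zero constant.
    activities_stuck = (
        now.num_pollers == 0
        and now.activity_slots_available != 0
        and now.activity_slots_available == earlier.activity_slots_available
    )
    # Workflows: no pollers, and the completed-task counter has not advanced.
    workflows_stuck = (
        now.num_pollers == 0
        and now.workflow_tasks_completed == earlier.workflow_tasks_completed
    )
    return not (activities_stuck or workflows_stuck)
```

Wired into a Kubernetes livenessProbe, a False result for long enough triggers an automatic container restart.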
This liveness probe eliminated periodic slowdowns in Temporal Workers (and with them, the associated on-call pages) while still allowing engineers to investigate and fix underlying causes.
Refocusing engineering on what matters most#
Our engineering team could have expanded the existing orchestration system by adding more metrics, improving its scaling logic, and recreating features similar to Temporal’s.
Improving the existing workflow system would have required ongoing infrastructure investment at the expense of core product innovation.
By adopting Temporal, our engineers kept their focus on building differentiated AI capabilities for legal professionals, while gaining the scaling, observability, and reliability they need, out of the box.
Check out what we’re doing and how we’re evening up the playing field.