How groundcover built a durable alert dispatch system with Temporal

This is a guest post from Omer Karjevsky and Yosi Zelensky, groundcover Software Engineering.

At groundcover, we provide a full observability stack for metrics, logs, traces, profiling, Kubernetes events, and application views. The platform is delivered as managed BYOC (Bring Your Own Cloud), which means the stack runs inside each customer's Kubernetes cluster and keeps customer data inside their cloud boundary.

That deployment model shaped the way we built our alert dispatch system.

When a monitor condition triggers, the resulting alert state is written to ClickHouse. Our Dispatch Center reads those pre-evaluated issues, matches them against notification routes with gcQL (groundcover Query Language), enriches them with context, and sends notifications to the right destination: Slack, PagerDuty, OpsGenie, incident.io, Rootly, Microsoft Teams, or a generic webhook.

The system also has to handle the full alert lifecycle. A new alert may need an initial notification, later renotifications, and finally a resolution notification that links back to an external incident.

Because this all runs inside our managed BYOC footprint, we needed the dispatch layer to recover cleanly from infrastructure failures without depending on an external managed orchestration service.

This is the story of how we rebuilt the Dispatch Center on Temporal.

What we were replacing

Before Temporal, the Dispatch Center used an open-source Python/FastAPI alerting platform. It gave us a useful starting point with provider integrations and a rules engine, but the fit got worse as groundcover scaled.

The first problem was operational. Most of our backend is written in Go, so this service had different deployment, dependency, and debugging paths from the rest of the system.

The second problem was durability. Workflow execution depended on an in-process scheduler. If the process crashed during notification delivery, there was no durable execution state to recover from and no built-in way to detect or retry a missed delivery.

The third problem was observability. Debugging a notification issue meant correlating logs across several systems. We did not have one place to see the path from alert detection, to route matching, to provider delivery.

And finally, the architecture was not built for the volume of a Kubernetes-native observability platform. We needed high-throughput fan-out across thousands of monitored clusters while preserving ordering inside each alert stream.

We looked at three options: build a custom state machine on top of queues, build a general-purpose workflow engine ourselves, or adopt Temporal. Temporal was the cleanest fit for three reasons.

Durable Execution gave us the failure model we wanted. Instead of manually checkpointing progress or reconciling external state, we could write workflow code and let Temporal preserve execution history.
SignalWithStartWorkflow matched the dispatch pattern almost exactly. For each alert route, we needed to either start a new Workflow or Signal an existing one, atomically.
The Go SDK let us build the service in the same language as the rest of our backend. Workflows and Activities are just Go functions, which made the model easy for the team to adopt.

We did consider Temporal Cloud, but chose self-hosted Temporal because of our BYOC architecture. Temporal runs as a Helm subchart inside the groundcover backend deployment, with PostgreSQL-backed persistence and the standard frontend, history, matching, and Worker services. That keeps orchestration inside the same Kubernetes-native footprint as the rest of the platform.

The architecture: Two Workflow levels

The design rule was simple: keep the number of live Workflows low, push throughput through Signals and goroutines, and never reorder events for the same alert.

The Dispatch Center has two Workflow levels.

Level 1: The orchestrator

A single long-running MonitorInstancesWorkflow acts as the heartbeat. About every 15 seconds, it runs a ReadAndDispatch Activity. That Activity queries ClickHouse for new, continuing, and resolved issues since the last scan. It matches those issues against notification routes using gcQL, groups them by a deterministic key, and dispatches each group with SignalWithStartWorkflow.

The grouping key is:

alert fingerprint x route x connected app

The orchestrator is intentionally minimalist. It scans, dispatches, sleeps, and carries the cursor timestamp forward. Every 200 iterations, it uses Continue-as-New to keep Event History bounded.

Level 2: Per-alert Workflows

Each MonitorInstanceAlert Workflow handles one unique combination of fingerprint, route, and connected app. Its Workflow ID is deterministic:

alert-{fingerprint}-{route}-{connectedApp}

That ID is the routing layer. We do not need an external lookup table to find the right Workflow. SignalWithStartWorkflow either sends the Signal to the current execution or starts one if none is running.
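As a minimal sketch of that idea, the ID can be built from the three key parts and used directly as the grouping key. The `issue` type and both function names are hypothetical, and the sketch uses only the standard library; each resulting group would then be handed to SignalWithStartWorkflow.

```go
package main

import "fmt"

// issue is a hypothetical, trimmed-down view of one matched issue.
type issue struct {
	Fingerprint  string
	Route        string
	ConnectedApp string
	Status       string // e.g. "new", "continuing", "resolved"
}

// alertWorkflowID builds the deterministic ID in the format shown above:
// alert-{fingerprint}-{route}-{connectedApp}.
func alertWorkflowID(i issue) string {
	return fmt.Sprintf("alert-%s-%s-%s", i.Fingerprint, i.Route, i.ConnectedApp)
}

// groupByWorkflowID buckets issues by Workflow ID and keeps arrival order
// inside each bucket. It also returns the IDs in first-seen order, so callers
// never depend on Go's randomized map iteration.
func groupByWorkflowID(issues []issue) (map[string][]issue, []string) {
	groups := make(map[string][]issue)
	var order []string
	for _, i := range issues {
		id := alertWorkflowID(i)
		if _, seen := groups[id]; !seen {
			order = append(order, id)
		}
		groups[id] = append(groups[id], i)
	}
	return groups, order
}
```

Because the same inputs always produce the same ID, two scans that see the same alert on the same route land on the same Workflow without any shared lookup state.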

Per-alert Workflows do not poll. They wake up when signaled, drain buffered Signals, run a notification state machine, and deliver to the configured provider. When the Signal channel is empty, the Workflow completes. If more Signals arrive later, a fresh execution starts under the same deterministic ID.

That last point matters. The live Workflow count tracks active alert volume, not all alerts we have ever seen.

Figure 1: Two-level Temporal Workflow architecture. The orchestrator reads pre-evaluated issues from ClickHouse, matches routes, and fans out with SignalWithStartWorkflow to per-alert Workflows that handle deduplication, delivery, and state recording.

Signal batching in the per-alert Workflow

The most important implementation detail is how each per-alert Workflow handles Signals.

A simple implementation would process one Signal per Workflow Task. We chose not to do that. Instead, the Workflow drains everything currently buffered in the Signal channel with ReceiveAsync(), then processes the batch.

In simplified form:

func MonitorInstanceAlert(ctx workflow.Context, input AlertInput) error {
    // Load persisted notification state once, at Workflow start.
    state := fetchNotificationState(ctx, input.WorkflowID)
    signalChan := workflow.GetSignalChannel(ctx, "issue")

    for {
        // Drain everything currently buffered without blocking.
        var signals []IssueSignal
        for {
            var sig IssueSignal
            if ok := signalChan.ReceiveAsync(&sig); !ok {
                break
            }
            signals = append(signals, sig)
        }

        // Nothing arrived during the last window: complete the Workflow.
        if len(signals) == 0 {
            return nil
        }

        now := workflow.Now(ctx) // deterministic clock, safe for replay
        for _, sig := range signals {
            if isStale(sig, now) {
                recordDroppedSignal(ctx, sig)
                continue
            }
            decision := state.ShouldNotify(sig, now)
            if !decision.Notify {
                continue
            }
            err := executeSignalNotification(ctx, sig)
            state.RecordAttempt(now, err)
        }

        // Short batching window; return if the Workflow is canceled.
        if err := workflow.Sleep(ctx, 100*time.Millisecond); err != nil {
            return err
        }
    }
}

The 100 ms sleep creates a small batching window. In practice, a single Workflow invocation often processes five to ten Signals that arrived during the previous cycle. During an alert storm, one Workflow Task can handle all buffered Signals for that alert stream. That cuts down Temporal Event and Task Queue load compared with waking once per Signal.

The Workflow also uses workflow.GetVersion() to evolve behavior safely. For example, when we added suppression bypass for previously suppressed alerts, we put the new path behind a version gate. Old executions replay on the original path. New executions take the updated path.

Figure 2: Signal processing sequence in MonitorInstanceAlert. The Workflow drains all buffered Signals, evaluates each through the shouldNotify state machine, delivers notifications to the provider API, and records delivery state to PostgreSQL. Deduplicated, stale, or intentionally dropped Signals are skipped and recorded for auditability.

Fan-out, ordering, and backpressure

The fan-out path lives in ReadAndDispatch. It groups Signals by Workflow ID, then sends groups concurrently through a semaphore-controlled goroutine pool. Inside one Workflow ID group, Signals are sent sequentially.

That gives us both properties we care about: independent alerts fan out concurrently, while a single alert's new, continuing, and resolved sequence is never reordered.
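The pattern can be sketched with a buffered channel as the semaphore. This is an illustration under assumptions, not the production code: `dispatchGroups` and the `send` callback are hypothetical names, and `send` stands in for whatever call signals one Workflow.

```go
package main

import "sync"

// dispatchGroups sends each group's signals in order while running up to
// maxParallel groups at the same time. It returns the Workflow IDs of groups
// whose dispatch failed.
func dispatchGroups(groups map[string][]string, maxParallel int, send func(workflowID, signal string) error) []string {
	if maxParallel < 1 {
		maxParallel = 1 // a zero-capacity semaphore would block forever
	}
	sem := make(chan struct{}, maxParallel) // counting semaphore
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed []string
	)
	for id, signals := range groups {
		wg.Add(1)
		sem <- struct{}{} // blocks once maxParallel groups are in flight
		go func(id string, signals []string) {
			defer wg.Done()
			defer func() { <-sem }()
			// Sequential inside the group, so new, continuing, and resolved
			// signals for one alert keep their order.
			for _, s := range signals {
				if err := send(id, s); err != nil {
					mu.Lock()
					failed = append(failed, id)
					mu.Unlock()
					return // stop here so later signals cannot overtake the failed one
				}
			}
		}(id, signals)
	}
	wg.Wait()
	return failed
}
```

Concurrency is bounded across groups, never inside one, which is what keeps a resolved signal from arriving before the new signal it closes.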

The Activity also has two small correctness rules that have paid off.

When we cap the number of issues in a scan, we extend the cap to include every issue with the boundary timestamp. That avoids splitting simultaneously inserted issues across scan cycles.

When some Signals in a batch fail to dispatch, we rewind the cursor to the earliest failed issue timestamp. The next scan will pick them up again. It is a simple way to get at-least-once dispatch without a separate dead-letter queue.
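Both rules are small enough to sketch in a few lines. The names here (`scanned`, `applyCap`, `nextCursor`) are hypothetical, timestamps are plain Unix milliseconds, and the rewind assumes the next scan treats the cursor as an inclusive lower bound. With an exclusive bound, the rewind would need to land just before the failed timestamp.

```go
package main

// scanned is a hypothetical row from one scan, ordered by timestamp ascending.
type scanned struct {
	ID        string
	Timestamp int64 // Unix milliseconds
}

// applyCap keeps at most limit rows, then extends the cut so that every row
// sharing the boundary timestamp stays in the same scan cycle.
func applyCap(rows []scanned, limit int) []scanned {
	if limit <= 0 {
		return nil
	}
	if len(rows) <= limit {
		return rows
	}
	boundary := rows[limit-1].Timestamp
	end := limit
	for end < len(rows) && rows[end].Timestamp == boundary {
		end++
	}
	return rows[:end]
}

// nextCursor returns where the next scan should start. With no failures it is
// the end of this scan; otherwise it rewinds to the earliest failed timestamp.
func nextCursor(scanEnd int64, failedTimestamps []int64) int64 {
	cursor := scanEnd
	for _, ts := range failedTimestamps {
		if ts < cursor {
			cursor = ts
		}
	}
	return cursor
}
```

A rewind means some issues are dispatched twice, so this only works because the per-alert Workflow deduplicates what it receives.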

For backpressure, we use Temporal's Visibility API. ReadAndDispatch calls CountWorkflow to check how many per-alert Workflows are currently running, then computes the next scan limit as:

maxConcurrentWorkflows - currentlyRunning

Our configured ceiling is 5,000 concurrent per-alert Workflows. As the system approaches that ceiling, scan limits shrink automatically. Since per-alert Workflows complete when their Signal channels drain, the live count is usually far below the ceiling and stays tied to current alert activity.
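The arithmetic is simple, but one detail is worth making explicit: the result should never go negative. A minimal sketch follows, where the function name is hypothetical and the clamp is an assumption about how an overshoot would be handled. An overshoot is plausible if the running count lags slightly behind reality.

```go
package main

// nextScanLimit returns how many new issues the next scan may pull, given the
// ceiling on live per-alert Workflows and the current running count.
// The clamp at zero is an assumption of this sketch: a reported count above
// the ceiling pauses intake instead of producing a negative limit.
func nextScanLimit(maxConcurrentWorkflows, currentlyRunning int) int {
	limit := maxConcurrentWorkflows - currentlyRunning
	if limit < 0 {
		return 0
	}
	return limit
}
```

With the 5,000 ceiling, 4,900 running Workflows leave room for 100 issues in the next scan, and anything at or above 5,000 leaves room for none.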

Figure 3: One orchestrator iteration. ReadAndDispatch checks backpressure, queries ClickHouse, matches routes, groups by Workflow ID, and fans out Signals concurrently while preserving per-alert ordering.

State management

Each per-alert Workflow maintains an in-memory workflowState struct tracking notification count, last notification timestamp, and third-party provider alert IDs. This state is initialized from the persistent DB store at Workflow start and updated in-memory as notifications are sent.

This two-tier design is an optimization that Temporal’s Durable Execution model makes possible. Within a single Workflow Execution, the in-memory state is updated after each notification and is immediately visible to subsequent Signals in the same batch, with no database round-trip. The persistent DB state only needs to be read once (at Workflow start) and written once per actual notification sent.

Because Temporal rebuilds the in-memory state after a crash by deterministically replaying the Workflow's Event History, we get the performance of in-memory decisions with the durability of persistent storage.

Even if the DB write fails, the in-memory state is still updated, so the next Signal in the batch is correctly deduplicated.
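The ordering that makes this work (update memory first, persist second) can be sketched as follows. The struct fields mirror the ones listed above, but their types, the method names, and the `persist` callback are assumptions; in a real Workflow, `persist` would be an Activity call.

```go
package main

// workflowState mirrors the fields described above; the exact types are assumptions.
type workflowState struct {
	NotificationCount int
	LastNotifiedAt    int64             // Unix milliseconds
	ProviderAlertIDs  map[string]string // provider name -> external alert ID
}

// recordNotification updates the in-memory state first and only then calls
// persist, a hypothetical stand-in for the Activity that writes to the DB.
// A persist error is returned to the caller, but the in-memory update stays.
func (s *workflowState) recordNotification(at int64, provider, externalID string, persist func(workflowState) error) error {
	s.NotificationCount++
	s.LastNotifiedAt = at
	if s.ProviderAlertIDs == nil {
		s.ProviderAlertIDs = make(map[string]string)
	}
	s.ProviderAlertIDs[provider] = externalID
	return persist(*s)
}

// alreadyNotified reports whether a later Signal in the same batch should be
// treated as following a notification that was already sent.
func (s *workflowState) alreadyNotified() bool {
	return s.NotificationCount > 0
}
```

If `persist` fails, the caller can log or retry the write, while the next Signal in the batch still sees the updated count.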

Handling retries and rate limits

We use Temporal Retry Policies where they fit well. Short Activities such as config fetches and state lookups use a small number of attempts with short timeouts. Notification delivery Activities use Temporal-managed backoff for transient non-rate-limit errors.

For notification freshness, we use time windows rather than fixed retry counts. Each Signal has a SignalCreatedAt timestamp and a maximum age, which defaults to one hour. For renotifications, the limit tightens to the renotification interval. A renotification that arrives after the next one is already due is stale.
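A minimal sketch of that freshness rule is below. The function names are hypothetical, and the sketch assumes the renotification interval only ever tightens the window, so an interval longer than one hour falls back to the default.

```go
package main

import "time"

// defaultMaxSignalAge matches the one-hour default described above.
const defaultMaxSignalAge = time.Hour

// effectiveMaxAge returns the freshness window for a Signal. Renotifications
// use the renotification interval when it is shorter than the default.
func effectiveMaxAge(isRenotification bool, renotifyInterval time.Duration) time.Duration {
	if isRenotification && renotifyInterval > 0 && renotifyInterval < defaultMaxSignalAge {
		return renotifyInterval
	}
	return defaultMaxSignalAge
}

// isStaleAge reports whether a Signal of the given age is past its window.
func isStaleAge(age, maxAge time.Duration) bool {
	return age > maxAge
}

// isStale is the timestamp-based form: age runs from SignalCreatedAt to now.
// Inside Workflow code, now should come from workflow.Now, not time.Now.
func isStale(createdAt, now time.Time, isRenotification bool, renotifyInterval time.Duration) bool {
	return isStaleAge(now.Sub(createdAt), effectiveMaxAge(isRenotification, renotifyInterval))
}
```

With a ten-minute renotification interval, a renotification Signal that is twelve minutes old is dropped, while an initial notification of the same age is still delivered.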

Provider rate limits get special handling. When a provider returns HTTP 429, we do not bubble that error back to Temporal for retry scheduling. The Activity sleeps for the provider's Retry-After value and tries again inside the same Activity:

for {
    if time.Since(signalCreatedAt) > maxSignalAge {
        return nonRetryable("signal expired after rate limiting")
    }
    err := provider.Send(ctx, secret, issue, metadata)
    if err == nil {
        return nil
    }
    if !isRateLimit(err) {
        return err
    }
    time.Sleep(retryAfter(err))
}

This keeps rate-limit waits from creating extra Workflow events, new Task Queue dispatches, and new Activity executions. A five-second provider throttle is just a sleep inside the Activity. The loop is still bounded by Signal lifetime, so stale notifications drop cleanly instead of retrying forever. Each provider integration parses its own rate-limit response and returns the appropriate wait duration.
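One caveat with the simplified loop above is that a plain time.Sleep does not notice when the Activity's context is canceled. A possible variant, sketched here with a hypothetical helper name, waits on a timer and a done channel together, so cancellation or Worker shutdown ends the wait early.

```go
package main

import "time"

// waitOrDone waits for d and returns true, or returns false as soon as done
// is closed. In an Activity, done would be ctx.Done().
func waitOrDone(done <-chan struct{}, d time.Duration) bool {
	timer := time.NewTimer(d)
	defer timer.Stop() // release the timer if we leave early
	select {
	case <-timer.C:
		return true
	case <-done:
		return false
	}
}
```

Inside the retry loop, the sleep line would become something like `if !waitOrDone(ctx.Done(), retryAfter(err)) { return ctx.Err() }`, leaving the rest of the loop unchanged.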

Testing

We test the Dispatch Center at four levels.

The ShouldNotify state machine has table-driven unit tests for every important transition: new alert, continuing alert, renotification, resolved without prior notification, and resolved after notification.
Workflow integration tests use Temporal's testsuite.WorkflowTestSuite. These tests control time, mock Activities, simulate provider failures, and verify Signal processing order.
End-to-end lifecycle tests stand up real ClickHouse and PostgreSQL instances, then exercise the scan, match, dispatch, notify, and record path through Temporal's test Activity environment.
Finally, Workflow replay tests use WorkflowReplayer against saved Event Histories. If a code change introduces nondeterminism, the replay test catches it before production. When a change is intentional, such as a version-gated behavior update, we regenerate the fixture. The fixtures are generated by a dedicated test that spins up a local Temporal dev server, executes each Workflow with mock Activities, and saves the resulting Event History as JSON. This gives us a safety net for every code change that touches Workflow logic.
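To illustrate the first of those levels, here is what a table of transitions can look like. The rules in `shouldNotify` are an assumed, simplified stand-in, not groundcover's actual state machine, and every name in the block is hypothetical. In a real test file the loop would live in a `Test...` function and use `t.Run` per row.

```go
package main

import "time"

type status int

const (
	firing status = iota
	resolved
)

// notifyState is a hypothetical slice of the per-alert state.
type notifyState struct {
	Sent          int           // notifications already sent for this alert
	SinceLastSent time.Duration // time since the last notification
}

// shouldNotify is an assumed rule set: notify on the first firing Signal,
// renotify once the interval has elapsed, and send a resolution only if
// something was sent before.
func shouldNotify(st notifyState, s status, renotifyEvery time.Duration) bool {
	switch {
	case s == resolved:
		return st.Sent > 0
	case st.Sent == 0:
		return true
	default:
		return renotifyEvery > 0 && st.SinceLastSent >= renotifyEvery
	}
}

// transitionCases is the table; each row names one lifecycle transition.
var transitionCases = []struct {
	name  string
	state notifyState
	in    status
	want  bool
}{
	{"new alert", notifyState{}, firing, true},
	{"continuing alert", notifyState{Sent: 1, SinceLastSent: time.Minute}, firing, false},
	{"renotification", notifyState{Sent: 1, SinceLastSent: 30 * time.Minute}, firing, true},
	{"resolved without prior notification", notifyState{}, resolved, false},
	{"resolved after notification", notifyState{Sent: 1}, resolved, true},
}

// failedCases runs the table with a 30-minute renotification interval and
// returns the names of rows whose result differs from want.
func failedCases() []string {
	var failed []string
	for _, tc := range transitionCases {
		if got := shouldNotify(tc.state, tc.in, 30*time.Minute); got != tc.want {
			failed = append(failed, tc.name)
		}
	}
	return failed
}
```

Keeping the decision function pure, with state and time passed in, is what lets the table cover every transition without a Temporal test environment.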

Debugging with groundcover

We monitor the Dispatch Center with groundcover itself. Temporal metrics, application metrics, traces, logs, and profiles all flow through our own observability pipeline.

Temporal's OpenTelemetry interceptor gives us spans for Workflow Tasks and Activity executions. We add application spans for scan, dispatch, and notification steps, with attributes such as alert fingerprint, route ID, connected app ID, provider, Workflow ID, and notification duration.

We also create each per-alert Workflow root span with WithNewRoot(). That gives each alert stream its own trace instead of burying it inside the orchestrator's long-running trace. The traces can still be linked by shared Workflow attributes.

When a customer reports a missed notification, we can query the full lifecycle in groundcover's UI with gcQL:

issue.fingerprint:2823f61ada06aa1c | fields _time, span_name, attributes
notification.duration_ms > 5000 | fields notification.provider, workflow.id
span_name:record_dropped_signals | fields _time, workflow.id, issue.fingerprint, attributes

That gives us one place to see the ClickHouse scan, route match, Temporal Signal dispatch, state machine decision, provider API call, and PostgreSQL record write. Dropped Signals are recorded with structured attributes, so staleness, deduplication, and rate-limit drops are inspectable instead of disappearing into logs.

For ad-hoc Workflow history inspection, we use the Temporal CLI rather than deploying the Web UI. In practice, gcQL-based trace analysis is usually faster and more expressive for debugging notification issues, with Workflow history available as the fallback.

What we learned

This was our first production use of Temporal at groundcover. A few lessons were clear.

Determinism takes practice. Workflow code cannot call time.Now(), iterate over maps when order matters, or make network calls directly. We wrote those rules down early, and they became normal team habits.
Continue-as-New should be part of the initial design. Long-running Workflows accumulate event history. Decide up front what state must cross the Continue-as-New boundary and structure inputs around it.
Versioning was easier than expected. workflow.GetVersion() let us ship suppression bypass, staleness checks, and a resolve-only path without draining in-flight Workflows.
The Visibility API is worth exploring. We did not start with CountWorkflow as our backpressure plan, but it turned out to be the simplest way to keep intake tied to current Workflow load.
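The map-iteration rule from the determinism lesson is the easiest one to trip over, because Go randomizes map iteration order on purpose. A small helper like the following (the name is hypothetical) gives Workflow code a stable order to range over.

```go
package main

import "sort"

// sortedKeys returns the map's keys in ascending order. Ranging over the map
// directly could visit keys in a different order on replay, which matters
// whenever each key triggers an Activity or another Workflow command.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

Ranging over `sortedKeys(m)` and indexing back into the map keeps the command sequence identical between the original run and any replay.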

Most importantly, Temporal let us avoid building durable orchestration infrastructure ourselves. Replacing the old notification backend meant we needed state management, retries, crash recovery, ordering, backpressure, and enough visibility to debug customer issues. Temporal gave us the foundation, so we could spend our time on groundcover-specific logic: routing, deduplication, provider behavior, and the operating tools around alert delivery.

This is our first use of Temporal at groundcover inside our BYOC offering. Additional use cases are already planned, and we expect delivery times to decrease further as the team grows more acquainted with Temporal’s primitives and patterns.