Cloudflare: Production Readiness Checks at scale with Temporal and Temporal Schedules

Abstract

Production readiness matters for every modern service, and it becomes more important every day. Every service in an organization should meet the production-ready standard, but not every team has the same level of expertise in what production ready means, so it is important to help service owners get there. Useful production readiness metrics include unit test coverage, CI/CD pipeline setup, known vulnerabilities, and whether the service is resilient across multiple clusters in case of a disaster.

At Cloudflare, we built a production readiness dashboard that displays each service's health and production readiness information, along with what its owners can do to improve it. Cloudflare has a very large number of services and repositories, and having one big instance check all of them would be neither scalable nor reliable. That is where Temporal comes into play: it helps us solve this problem at scale by running the production readiness checks in parallel to save time.

We used Temporal to coordinate a pool of instances that share the workload of those production readiness checks, and we used Temporal Schedules to re-check the metrics at a defined interval and update our production readiness dashboard in real time.

With the help of Temporal, we can coordinate and complete the production readiness checks for thousands of repositories in just a few hours, well before the next scheduled refresh, and we can configure a much shorter refresh interval if needed. Temporal's out-of-the-box automatic retries also let our workflows recover from failures when needed. This gives us the scalability and reliability we need for our dashboard.
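
To make the pattern above concrete, here is a minimal sketch of fanning readiness checks out in parallel with a bounded number of concurrent workers and retrying transient failures. It uses only Python's standard library and is not the Temporal SDK or Cloudflare's implementation. Temporal would take the place of the hand-written concurrency and retry logic here. The names `run_check`, `check_with_retry`, `check_all`, the repository names, and the pass/fail rule are all hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CheckResult:
    repo: str
    passed: bool
    attempts: int


async def run_check(repo: str, attempt: int) -> bool:
    """Hypothetical readiness check; a real one would query CI, coverage, or scanners."""
    # Simulate a transient failure on the first attempt for one repository.
    if repo == "flaky-service" and attempt == 1:
        raise ConnectionError("transient failure")
    # Toy rule for the demo: repositories named "legacy..." fail the check.
    return not repo.startswith("legacy")


async def check_with_retry(
    repo: str, sem: asyncio.Semaphore, max_attempts: int = 3
) -> CheckResult:
    """Run one check under the concurrency limit, retrying transient errors."""
    async with sem:
        for attempt in range(1, max_attempts + 1):
            try:
                return CheckResult(repo, await run_check(repo, attempt), attempt)
            except ConnectionError:
                await asyncio.sleep(0)  # placeholder for a real backoff delay
        # Every attempt raised: record the repository as failing.
        return CheckResult(repo, False, max_attempts)


async def check_all(repos: list[str], max_parallel: int = 2) -> list[CheckResult]:
    """Fan out over all repositories; results come back in input order."""
    sem = asyncio.Semaphore(max_parallel)
    return await asyncio.gather(*(check_with_retry(r, sem) for r in repos))


if __name__ == "__main__":
    for result in asyncio.run(
        check_all(["api-gateway", "flaky-service", "legacy-billing"])
    ):
        print(result)
```

In a Temporal-based design, each check would more likely be an activity with a retry policy, and a schedule would start the fan-out workflow at the configured interval. That mapping is an assumption about a typical setup, not a description of the system presented in the talk.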

About the Presenter

Sijie is an experienced systems engineer at Cloudflare and a core designer of and contributor to the production readiness metrics dashboard, as well as many internal productivity and infrastructure tools at Cloudflare. Before Cloudflare, Sijie worked as a software engineer at Expedia Group and Microsoft.
