Actor Workflows: Reliably orchestrating thousands of Flink clusters at Netflix

At Netflix, we operate over 12,000 Apache Flink clusters, processing over 60 PB of data per day. Reliably managing these clusters poses challenges such as fault tolerance, concurrency control, and keeping the actual infrastructure state consistent with the desired state.

In this talk, we'll present how we used Temporal to build a reliable and scalable control plane for the Flink platform at Netflix. We designed our solution around the actor model, implemented via long-running Temporal workflows. We'll discuss the benefits of this approach and the challenges we encountered while building the architecture.
