Need a closer look? Download and review slides presented at Replay 2023 here.

At Netflix, we operate over 12,000 Apache Flink clusters, processing over 60 PB of data per day. Reliably managing these clusters poses various challenges, such as fault tolerance, concurrency control, and consistency between actual and desired infrastructure state.

In this talk, we’ll present how we leveraged Temporal to build a reliable and scalable control plane for the Flink platform at Netflix. We designed our solution around the actor model, implemented via long-running Temporal workflows. We’ll discuss the benefits and the challenges we encountered while building this architecture.
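
For readers unfamiliar with the actor model, here is a minimal sketch of the idea in plain Python. It is not the code used at Netflix and does not use the Temporal SDK; the `ClusterActor` class, its fields, and the state names are all hypothetical. It shows an actor that owns the state of one cluster, receives requests only through a mailbox, and reconciles actual state toward desired state one message at a time. In a Temporal-based design, a long-running workflow would presumably play the actor's role, with signals acting as the mailbox; that mapping is an assumption here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ClusterActor:
    """Hypothetical actor that owns the state of a single Flink cluster."""
    cluster_id: str
    desired: str = "STOPPED"   # state most recently requested
    actual: str = "STOPPED"    # state last applied to the infrastructure
    mailbox: deque = field(default_factory=deque)
    log: list = field(default_factory=list)

    def send(self, desired_state: str) -> None:
        # Callers never mutate state directly; they only enqueue messages.
        self.mailbox.append(desired_state)

    def run_until_idle(self) -> None:
        # Messages are handled strictly one at a time, so no locks are needed.
        while self.mailbox:
            self.desired = self.mailbox.popleft()
            self.reconcile()

    def reconcile(self) -> None:
        # Drive actual state toward desired state; a no-op when they agree.
        if self.actual != self.desired:
            self.log.append(f"{self.actual} -> {self.desired}")
            self.actual = self.desired


actor = ClusterActor("cluster-1")
actor.send("RUNNING")
actor.send("RUNNING")   # duplicate request, causes no extra work
actor.send("STOPPED")
actor.run_until_idle()
print(actor.actual, actor.log)
```

Because each actor processes its mailbox sequentially, concurrent requests against the same cluster are serialized by construction, and a repeated request leaves the state unchanged.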