Durable Large-scale Media Workflows: Insights from Netflix’s Plato Platform

Abstract

Media processing workflows are inherently complex, often requiring extensive state management and continuous updates to ensure media encodes remain current. At Netflix, we have developed the Plato media workflow platform, a key component of our larger Cosmos media processing system. This talk will delve into how we process millions of media workflow events daily, ensuring durability and scalability while enhancing the developer experience.

Enhancing RPC Durability in Media Workflows: Cosmos is a microservices-based platform in which each processing component is implemented as a microservice or a serverless function. Workflows therefore have to issue remote procedure calls (RPCs) to execute tasks asynchronously and at scale. Plato handles these RPCs with message-passing techniques, which makes otherwise flaky calls more durable and reliable. This gives our users a resilient RPC client foundation to build on and mitigates the impact of failures on workflow continuity.
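As a rough illustration of the idea of routing RPCs through messages, the sketch below queues each request and re-enqueues it when the remote call fails, so a transient error does not lose the work. It is not Plato's implementation: the names `RpcMessage`, `DurableRpcClient`, `make_flaky_encoder`, and the `max_attempts` parameter are all hypothetical, and an in-memory `deque` stands in for whatever durable message store a real system would use.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RpcMessage:
    """A queued request; hypothetical shape, not Plato's message format."""
    request_id: str
    payload: str
    attempts: int = 0


class DurableRpcClient:
    """Sends calls as messages and redelivers them after failures."""

    def __init__(self, handler, max_attempts=5):
        self.handler = handler          # stand-in for the remote service
        self.max_attempts = max_attempts
        self.queue = deque()            # assumption: a durable queue in practice
        self.results = {}
        self.dead_letters = []

    def call(self, request_id, payload):
        # Enqueue instead of invoking the service directly.
        self.queue.append(RpcMessage(request_id, payload))

    def drain(self):
        # Deliver messages until the queue is empty.
        while self.queue:
            msg = self.queue.popleft()
            msg.attempts += 1
            try:
                self.results[msg.request_id] = self.handler(msg.payload)
            except Exception:
                if msg.attempts < self.max_attempts:
                    self.queue.append(msg)         # redeliver later
                else:
                    self.dead_letters.append(msg)  # give up, keep for inspection


def make_flaky_encoder(failures_before_success):
    """Builds a fake service that fails a fixed number of times first."""
    calls = {"count": 0}

    def encode(payload):
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise ConnectionError("transient failure")
        return f"encoded:{payload}"

    return encode
```

A caller would create `DurableRpcClient(make_flaky_encoder(2))`, submit work with `call(...)`, and run `drain()`; the request succeeds on the third delivery without the caller handling the two failures.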

Scaling to Millions of Workflow Events: Media processing work is bursty, and the demand for producing encodes often exceeds the available compute resources. To address this, Plato incorporates features like priority-based task queues, execution avoidance, and a combination of dynamic and static graph execution models. Together, these features enable us to process millions of workflow events daily. We will present real-world scenarios that show how these features allow Plato to scale up efficiently and execute those events durably.
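Two of the features named above can be sketched in a few lines. The example assumes that "execution avoidance" means skipping a task whose result already exists for identical inputs; the abstract does not define the term, so treat that reading, along with the names `PriorityTaskQueue`, `fingerprint`, and `run_all`, as hypothetical rather than as Plato's design.

```python
import hashlib
import heapq
import itertools


class PriorityTaskQueue:
    """Min-heap task queue: lower priority numbers are served first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def fingerprint(task):
    """Stable hash of a task's inputs, used as the avoidance cache key."""
    canonical = repr(sorted(task.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_all(queue, execute, cache):
    """Drains the queue, executing only tasks with no cached result."""
    completed = []
    while len(queue):
        task = queue.pop()
        key = fingerprint(task)
        if key not in cache:
            cache[key] = execute(task)  # the expensive work happens once
        completed.append((task["title"], cache[key]))
    return completed
```

With this shape, an urgent encode pushed at priority 1 runs before backlog work pushed at priority 5, and a duplicate request for the same title and codec reuses the cached result instead of consuming compute again.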

Prioritizing Developer Experience: While ensuring durability is crucial for our users, it cannot come at the cost of developer experience. The Plato platform allows users to seamlessly bring their own strongly typed data models. This ensures that workflow execution state can be stored and retrieved reliably, lets users test workflows against strong contracts, and lowers the barrier to entry by making the platform easier to use. We will highlight case studies that demonstrate how Plato provides a good developer experience and discuss some of the open challenges we are working on.
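To make the benefit of typed models concrete, here is a minimal sketch of a user-defined model whose state is serialized and then validated on the way back in. The `EncodeRequest` model, its fields, and the `save_state` and `load_state` helpers are invented for illustration, and JSON is only an assumed storage format; none of this reflects Plato's actual API.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class EncodeRequest:
    """Hypothetical user-supplied model for one workflow step."""
    title_id: int
    codec: str
    bitrate_kbps: int


def save_state(model):
    """Serializes the model to a JSON string for storage."""
    return json.dumps(asdict(model), sort_keys=True)


def load_state(raw):
    """Rebuilds the model and rejects values of the wrong type."""
    model = EncodeRequest(**json.loads(raw))
    for field in fields(model):
        value = getattr(model, field.name)
        if not isinstance(value, field.type):
            raise TypeError(
                f"{field.name} should be {field.type.__name__}, "
                f"got {type(value).__name__}"
            )
    return model
```

Because the contract is explicit, a test can round-trip a model through `save_state` and `load_state` and assert equality, and stored state with a wrong field type fails on load rather than deep inside a workflow.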

This talk will provide an overview of how Netflix implements durable executions to process media encodes at scale. Attendees will gain insights into the challenges and techniques that Netflix uses in the media processing space, with practical examples from the Plato platform that highlight our approach to durability, scalability, and developer experience.

About the Presenters

Naveen Mareddy is a Senior Staff Engineer in Netflix's Content Infrastructure Solutions (CIS) group, where he works at the intersection of media processing platforms and large-scale distributed cloud computing systems. His team is responsible for building and managing the infrastructure that powers the encoding of various media assets, including movies, TV shows, trailers, ads, and image artwork, to create seamless viewing experiences for over 260 million Netflix users worldwide.

Dmitry Vasilyev is a graduate of BSU in Minsk, Belarus. He spent 7 years building an online marketplace before joining Netflix in 2016. At Netflix, he has been working on distributed workflow orchestration systems for media processing. His current interests also include serverless multi-tenant systems.
