Need a closer look? Download and review slides presented at Replay 2023 here.

Businesses are looking to build apps on Serverless cloud platforms to get the benefits of elastic scale, micro billing, and scale-to-zero. However, even with Temporal Cloud, it’s not yet possible for durable apps to be 100% serverless. Or is it?

This talk explores how this could be achieved. We’ll start by looking at the architecture of Azure Durable Functions, which shares the same lineage as Temporal. In particular, we’ll explore how the extensibility of the Azure Functions runtime was exploited to enable durable execution and elastic scale on top of ephemeral compute with scale-to-zero support. We’ll then show a prototype of running Temporal workflows on Azure’s serverless compute platforms in conjunction with Temporal Cloud and walk through how this was accomplished. After this talk, attendees should have a better understanding of whether or how they can build Temporal Cloud apps today on Serverless platforms, and how it might be achieved to an even greater degree tomorrow.

Transcript

This transcript has been edited for clarity and formatting. You can follow along with a recording of the original content at Replay 2023 in the video above.

Chris Gillum: My name is Chris, I’m a Software Architect at Microsoft, and I’m going to be talking to you about durable execution on serverless platforms. So, definitely a technical talk. A little bit more about me. Let’s see, I’ve been an engineer at Microsoft for about 17 years now. I helped design and build much of the Azure serverless platform, and I’m also the creator of Durable Functions and maintainer of the Durable Task Framework – if folks have maybe heard of that before. Also a little bit of a generative AI fanboy, so I really enjoyed Ryland’s talk yesterday. I don’t know if anyone got to see that. I also study Japanese in my spare time, as well. Fun fact, actually: before today, I’ve given more public talks in Japanese than I have in English, despite the fact that my Japanese is pretty terrible. But if you can distract them with good slides, it usually works out okay.

A little bit more about where Durable Functions fits into the durable execution family tree. If any of you came to Replay last year, you may have seen a similar graphic that Maxim showed, where basically Simple Workflow Service was the root of durable execution programming frameworks. Samar and Maxim worked together on that. Samar left for Microsoft, where he created the Durable Task Framework, and then later went to Uber to create Cadence with Maxim. A few years after that, I took the Durable Task Framework and created Durable Functions out of it, which is basically the serverless version of the Durable Task Framework.

So that’s what I want to talk about today. There are a few other interesting durable execution platforms out there as well. I’ve mentioned Dapr Workflow at the bottom there, which is sort of an open-source variant of the Durable Task Framework. There are also a few other interesting startups that many folks may have seen, such as Infinitic and, more recently, Restate. Definitely some very interesting projects. But that’s just sort of where we fit into the durable execution space. We’re kind of cousin technologies with Temporal. That’s how I like to think about it.

So what is Azure Durable Functions? It’s basically the Durable Task Framework, which is a library, packaged as an Azure Functions extension. The way that Azure Functions works is that there are a bunch of different trigger types that you can use to run your functions. These functions are like Lambdas, and the set of trigger types that can run in our serverless environment is extensible. So basically, I took the Durable Task Framework, created a Functions extension out of it, and arrived at Azure Durable Functions that way. We don’t call them Workflows, we call them orchestrations, but we also have the concept of activities. You can see a couple of code snippets there on the bottom, the left side being C# and the right side being Python. The programming model is very similar to what you would see in Temporal, just a little bit different syntactically. You might notice, for example, that we have these context objects. But otherwise, the basic idea is exactly the same, just a slightly different syntax: in C# we use async/await, in Python we use generator functions, but you sort of get the idea of how that might work.
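Since you can’t see the slide in the transcript, here’s a minimal sketch of what those snippets look like, using the C# side with the .NET isolated worker model (names like HelloSequence and SayHello are illustrative):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class HelloSequence
{
    // The orchestrator is Durable Functions' analog of a Temporal Workflow.
    // Note the context object mentioned above: durable operations go through it.
    [Function(nameof(HelloSequence))]
    public static async Task<List<string>> RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var outputs = new List<string>
        {
            await context.CallActivityAsync<string>(nameof(SayHello), "Tokyo"),
            await context.CallActivityAsync<string>(nameof(SayHello), "Seattle"),
            await context.CallActivityAsync<string>(nameof(SayHello), "London"),
        };
        return outputs;
    }

    // The activity is the analog of a Temporal Activity: ordinary code with no
    // determinism restrictions.
    [Function(nameof(SayHello))]
    public static string SayHello([ActivityTrigger] string city) => $"Hello, {city}!";
}
```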

So, a little bit more about serverless and Functions-as-a-Service. I often use them interchangeably. But when I’m talking about serverless, I’m really talking about FaaS or Functions-as-a-Service, which is sort of a highly opinionated subset of serverless. It’s known as an event-driven programming model, right? So you have some event, whether it’s a queue event, or HTTP, that can then trigger your function to run. That’s typically characterized by elastic scale, right? You don’t provision your compute up front. The platform provides the compute for you and expands it or retracts it as necessary.

Pay-per-execution. That’s a very important characteristic of FaaS: you’re not billed based on CPU time, but per execution of your functions, which is really great if you’re not running a lot of functions, because you can run them really cheaply. And of course, it’s available on major clouds. Most folks are probably familiar with AWS Lambda as sort of the first to really coin the terms “serverless” and “FaaS”, but then, of course, there’s Azure Functions and Google Cloud Functions.

Now, Google is interesting. They also have a product called Google Cloud Run, which many of you have probably heard of. I sort of see it as definitely serverless, but maybe not quite FaaS, because it doesn’t have the same programming model or the per-execution pricing model. It still sort of falls into that bucket, though. Their Cloud Functions is really the pure FaaS play.

When serverless really became popular, the community got together and came up with a set of principles, or best practices, for how you should write your functions in FaaS. One of them was that functions must not be long-running. They must be stateless. They may not call other functions. And they should only do one thing. So if you’re familiar with something like Temporal, or in my case the Durable Task Framework, you look at those rules and your next thought is: well, one does not simply execute durably on serverless, right? I have a workflow that needs to run for a week; that might exceed the function timeout, maybe.

To break it down a little bit more, let’s talk about what some of the key problems are that need to be overcome if you want to get something like Durable Execution working on serverless. One of them, as I just mentioned, is function timeouts, right? Most clouds have a default of five minutes for your function execution. You can extend it out to maybe like 15 minutes, I believe, in AWS. Same with Microsoft.

There’s also this concept of double billing. When I talked about that best practice of functions not calling other functions, it’s usually referring to this. Say you have one function, whose execution time you’re paying for, calling another function, and it has to wait for that function to return some value back. You’re paying both for the function that’s actively running and for the function that’s waiting on it. You’re being billed for two things, even though only one function is actively doing anything. So that’s a bit of an issue. And you can imagine, if you have a Workflow calling an Activity, and that Workflow’s waiting for that Activity to complete, you don’t want to be billed for the time that the Workflow sits there waiting, in addition to the Activity. Then there’s scaling to zero. That’s one of the biggest challenges, and also one of the biggest benefits, of FaaS: how do you scale to zero?

You know, if you have a Workflow-type programming model that needs to be listening for new events, how do you make that happen? How do you go from zero to one, or one to zero? And elastic scale: what are the parameters that you use to determine whether or not it’s time to scale out? Do you rely on CPU and memory? Well, typically, FaaS platforms don’t rely on things like CPU and memory; they look at the event load. And so there’s a question of, okay, with a Workflow, Durable Execution-type system, how do you do that? What metrics do you use to make those real-time scale decisions?

And then, lastly, ephemeral compute: the VMs or the containers that you’re given to run your code are there, but they can be taken away at any time. So if you need to do things like caching, that becomes a really big problem, especially for something like durable execution, where you may have a Workflow, and it’s nice to keep it in memory so that as it gets new events, signals, whatever, it can work with the in-memory copy rather than needing to reload the history from the data store every single time. So these are some of the key problems.

I thought it might be interesting to talk a little bit about how we solved some of those problems in Durable Functions, but of course, talking about it in a general way. So let’s see, function timeouts and double billing. Let’s start with those. The solution that we came up with is basically to take that Workflow function – or the orchestrator, in our terminology – and break it up into multiple executions, stopping at each new yield statement in the case of this Python function here. So, for example, the very first time we invoke this orchestration, we just run that highlighted line there up until the yield statement, and once we hit that yield statement, we end the function execution right there. The execution that you get billed for stops at the yield and doesn’t resume until the function gets called a second time, in which case we replay the first line, run the part of the second line, hit that next yield, and then stop. And we continue this process until we eventually get through the whole thing.

So we’ve taken this simple Workflow here and actually broken it down into four distinct function invocations. And it turns out that each of those invocations is actually extremely fast, just a few milliseconds at most, because we’re not actually stopping to wait for that Activity to finish. This is important for avoiding the short timeouts that you get with a lot of the FaaS platforms. Let’s say, for example, that the Activity down below wasn’t a simple hello, but something that took two minutes, and maybe we have a five-minute limit for functions. If each of the three calls in the orchestrator function takes that two minutes, the whole sequence takes six minutes, and you’re going to exceed your timeout. Because we’re able to break this down into small chunks, each of those orchestrator function invocations is just a few milliseconds. And so that allows people to build a Workflow that can run for really long periods of time, bypassing the timeout limits. That’s something that gets a lot of people excited: they’re bought into serverless, they really want it, but they want to understand, how can I avoid these timeout limitations?
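To make that concrete, here’s the same idea using the C# hello sequence sketched earlier (the talk’s slide shows it in Python), annotated with where each of the four short, separately billed invocations begins and ends:

```csharp
[Function(nameof(HelloSequence))]
public static async Task<List<string>> RunOrchestrator(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    // Invocation 1 runs to this first await, schedules the activity,
    // and ends; billing stops after a few milliseconds.
    string tokyo = await context.CallActivityAsync<string>(nameof(SayHello), "Tokyo");

    // Invocation 2 replays the line above from history (the activity is not
    // re-executed), runs to this await, schedules the next activity, and ends.
    string seattle = await context.CallActivityAsync<string>(nameof(SayHello), "Seattle");

    // Invocation 3 does the same for the third activity call.
    string london = await context.CallActivityAsync<string>(nameof(SayHello), "London");

    // Invocation 4 replays everything from history and completes the orchestration.
    return new List<string> { tokyo, seattle, london };
}
```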

Okay, so what about some of the other problems that I mentioned, elastic scale and ephemeral compute? First, I thought it might be interesting to talk about a more naive approach to how you might solve those. One option: let’s say you have a workflow engine, and I’ll be a little bit vague about what that means, and whenever it needs to run Workflows or Activities, it just calls out over HTTP to whatever FaaS platform you happen to be using, which spins up Lambda functions or whatever to do the work. Conceptually, it’s very simple, which is one of the pros, and it solves the scale-to-zero problem, right? You’d still have to figure out what to do with that workflow engine thing on the left, but all of the other pieces of code can scale to zero, which is nice. And this works with any FaaS cloud; they all support HTTP.
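As a rough sketch, the naive approach boils down to the workflow engine doing something like this for each Activity (the endpoint URL here is hypothetical):

```csharp
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

public static class NaiveHttpDispatcher
{
    private static readonly HttpClient Http = new();

    // Invoke an Activity by calling a FaaS HTTP endpoint and waiting for the
    // response. Simple, and the function side scales to zero, but the engine
    // is blocked on this call for the Activity's full duration.
    public static async Task<string> InvokeActivityAsync(string activityName, object input)
    {
        HttpResponseMessage response = await Http.PostAsJsonAsync(
            $"https://my-faas-app.example.com/api/activities/{activityName}", input);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```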

But there are some problems with this design approach. One of them is API gateway timeouts, which I believe for Lambda is 30 seconds; for Azure Functions it’s a little bit more than that. But basically, it’s shorter than the actual function timeout itself. So even if you were to keep your Activities and things to less than five minutes, you still have a gateway timeout that you have to deal with, which can be challenging. There are the function execution timeouts themselves. Network isolation can be a little interesting: if that workflow engine needs to talk to those Lambdas, and those Lambdas are in a private network, there are some interesting challenges there. And there’s no container affinity for the workflow tasks. This goes back to the point about caching. If you have a workflow that’s loaded on one particular container, and you need to dispatch a task to it, how do you make sure that it goes to the correct container, so that we don’t have to reload the history every single time? That can be a little bit more challenging than you might expect.

The way that we did it in Azure Functions, just to give you a reference example, is with a component that we call the Azure Functions scale controller. Basically, it’s a piece of infrastructure that runs in the cloud and periodically polls the underlying data source. We call it a task hub; more generically, we could talk about it as a workflow state store. It’s polling and asking this question: how much work is there to do? Are there any Activities that have been scheduled? Or any timers that have been scheduled? Or new Workflows that have been scheduled? And how many are there that need to be executed? If it finds some, and only when it finds some, it can go and start provisioning compute to run the Workflows. And in this case, the workers actually run directly within the container that runs your code. So once they’re up, those workers are able to start the polling process and start pulling in the work that they need. Because it’s a pull model instead of a push model, you’re able to get the affinity that you need, making sure that the Workflows that you’ve loaded are able to stay in memory and benefit from the caching, those sorts of things. And they can just pull in messages as they need.
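This is roughly the shape of the loop. To be clear, this is an illustrative sketch, not the actual scale controller code; the ITaskHub and ICompute interfaces here are hypothetical stand-ins for the task hub and the compute provisioning layer:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-ins for the task hub and the compute layer.
public record Backlog(int PendingActivities, int PendingOrchestrations, int ReadyTimers);
public interface ITaskHub { Task<Backlog> GetBacklogAsync(); }
public interface ICompute { Task SetWorkerCountAsync(int count); }

public class ScaleController
{
    private const int MaxWorkers = 100;
    private const int TargetBacklogPerWorker = 20; // illustrative tuning knob

    public async Task RunAsync(ITaskHub taskHub, ICompute compute)
    {
        while (true)
        {
            // Poll the workflow state store: how much work is there to do?
            Backlog backlog = await taskHub.GetBacklogAsync();
            int pending = backlog.PendingActivities
                        + backlog.PendingOrchestrations
                        + backlog.ReadyTimers;

            // No pending work means zero workers; this is what enables scale-to-zero.
            int target = pending == 0
                ? 0
                : Math.Min(MaxWorkers,
                           (pending + TargetBacklogPerWorker - 1) / TargetBacklogPerWorker);

            await compute.SetWorkerCountAsync(target);
            await Task.Delay(TimeSpan.FromSeconds(10)); // then poll again
        }
    }
}
```

Note how the scale decision is driven entirely by the event load in the state store, not by CPU or memory, which is the point made earlier about how FaaS platforms make scale decisions.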

There’s another interesting aspect to it as well, which is dealing with the ephemeral compute. When we do need to scale down, one of the things we’re able to do is send a drain message to these workers. We basically tell them, “Hey, you need to stop polling now, because I’m going to take away your container or your VM. So please stop pulling messages and please finish any work that you’re doing.” Once we’re able to detect that that particular worker is no longer executing, we can go ahead and deprovision it without interrupting the in-flight work, which would otherwise result in duplicate invocations. With that, we’ve been pretty much able to solve the key problems that I mentioned as far as durable execution goes. And I mention this not just to say that Azure Functions is cool, but because this is a recipe that I think even the Temporal engineering team thinks about when it comes to strategies for serverless. These are some of the things that we’ve done which have helped enable these FaaS-type platforms to run Workflows.
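Worker-side drain handling could be sketched like this. Again, this is illustrative only; in Azure Functions this logic lives inside the platform, and the work-fetching delegate here is a hypothetical stand-in for the queue polling:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class DrainingWorker
{
    // tryDequeueAsync polls for the next work item and is expected to return
    // null once the drain signal fires; what a work item actually does is
    // abstracted behind the Func<Task> it returns.
    public static async Task RunAsync(
        Func<CancellationToken, Task<Func<Task>?>> tryDequeueAsync,
        CancellationToken drainSignal)
    {
        var inFlight = new List<Task>();
        while (!drainSignal.IsCancellationRequested)
        {
            Func<Task>? work = await tryDequeueAsync(drainSignal);
            if (work != null)
            {
                inFlight.Add(work()); // start executing the dequeued work item
            }
            inFlight.RemoveAll(t => t.IsCompleted); // prune finished work
        }

        // Drained: polling has stopped; let in-flight work finish so that
        // deprovisioning the container doesn't interrupt executions and cause
        // duplicate invocations.
        await Task.WhenAll(inFlight);
    }
}
```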

So with that, I thought it’d be cool to actually show a demo, where I could show you how we did Durable Functions. But I thought it’d be even more interesting, since this is the Temporal conference, to take the Temporal SDK and enable it to run directly on Azure Functions in a serverless way. What I have here is a project that I created in Visual Studio, which is the primary editor that .NET developers and C/C++ developers use. So I have a function app. And here you can see, I’m using the brand new .NET SDK, because .NET and C#, those are my main things.

Oh, and I forgot to mention: in this project, one thing that I did is I actually created an extension specifically for Temporal to work with Azure Functions. In this case, it’s a project reference, but in the fullness of time, it would be something like a NuGet package. One of the things that we also have in Functions is this concept of “bindings.” In this extension, I’ve defined a Temporal client binding, where just by saying, “Hey, I’ve got this parameter in my function, and I want it to give me a Temporal client,” the extension will take care of automatically setting up that client, connecting it to the Temporal server, and making it available in your Function. There’s a JSON file that you can configure that I’ll actually show you. Here it is, on the top, where you can configure, for example, the target host for connecting to Temporal, the task queue, the namespace – those sorts of things.
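Here’s a sketch of what that looks like in the prototype. The [TemporalClient] binding attribute and the HelloWorkflow class are hypothetical names from my proof of concept, not a shipped API; StartWorkflowAsync and WorkflowOptions are the real Temporal .NET SDK surface:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Temporalio.Client;

public static class HelloApi
{
    // The hypothetical [TemporalClient] binding: the extension reads the target
    // host, namespace, and task queue from the JSON configuration and injects a
    // client that is already connected to the Temporal server.
    [Function("StartHello")]
    public static async Task<HttpResponseData> StartHello(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = "hello")] HttpRequestData req,
        [TemporalClient] ITemporalClient client)
    {
        string name = await req.ReadAsStringAsync() ?? "Temporal";

        // HelloWorkflow is assumed to be a [Workflow]-annotated class defined
        // elsewhere in the project.
        var handle = await client.StartWorkflowAsync(
            (HelloWorkflow wf) => wf.RunAsync(name),
            new(id: $"hello-{Guid.NewGuid():N}", taskQueue: "my-task-queue"));

        var response = req.CreateResponse(System.Net.HttpStatusCode.Accepted);
        await response.WriteStringAsync($"Started workflow {handle.Id}");
        return response;
    }
}
```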

So anyways, you can take that client and then call the usual methods. In this case, we just call the run method. We have a few other functions, such as one for getting information: again, I bind to this Temporal client. One other thing that I forgot to mention is that you can define a route for these HTTP triggers. So in this case, if you do a GET on hello/ and pass in a Workflow ID, you can take that Workflow ID in your function and use it to get that Workflow. We return the description, which just comes back as JSON. Same thing if you want to get the results; we have something for that. So basically, I’m just exposing my Workflow over some HTTP APIs. We can use Signals to call update name, we can do Queries to get the greetings. And, you know, when Update comes to the .NET SDK, we could do something like that, too. But anyways, that is the set of functions we’re going to be defining.
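The read-side endpoints follow the same pattern. Continuing the hypothetical HelloApi class above, these use real Temporal .NET client calls (GetWorkflowHandle, DescribeAsync, SignalAsync), while the route shapes and the update_name Signal name are assumptions based on the demo:

```csharp
// GET hello/{workflowId}: return the Workflow description as JSON.
[Function("GetHello")]
public static async Task<HttpResponseData> GetHello(
    [HttpTrigger(AuthorizationLevel.Function, "get", Route = "hello/{workflowId}")] HttpRequestData req,
    string workflowId,
    [TemporalClient] ITemporalClient client)
{
    WorkflowHandle handle = client.GetWorkflowHandle(workflowId);
    WorkflowExecutionDescription description = await handle.DescribeAsync();

    var response = req.CreateResponse(System.Net.HttpStatusCode.OK);
    await response.WriteAsJsonAsync(new
    {
        description.Id,
        description.RunId,
        Status = description.Status.ToString(),
    });
    return response;
}

// POST hello/{workflowId}/updateName: send the update-name Signal.
[Function("UpdateName")]
public static async Task<HttpResponseData> UpdateName(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "hello/{workflowId}/updateName")] HttpRequestData req,
    string workflowId,
    [TemporalClient] ITemporalClient client)
{
    string newName = await req.ReadAsStringAsync() ?? string.Empty;
    await client.GetWorkflowHandle(workflowId).SignalAsync("update_name", new object?[] { newName });
    return req.CreateResponse(System.Net.HttpStatusCode.Accepted);
}
```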

Behind the scenes, Azure Functions doesn’t know anything about Temporal, right? So I have this Workflow, and it doesn’t know what to do with it. But I actually have a source generator, which generates a sort of fake function, which Azure Functions then understands as a Temporal Workflow trigger. If I were to deploy this – which you can do pretty easily from here, you can just right-click and publish – you get a dialog that lets you publish it to Azure. So I’ve already done that. I’m going to switch over to an already published copy of this Temporal demo. I deployed this to Linux. And you can see all the functions that we have in this application – one of them being the Temporal Workflow trigger, which was that special auto-generated one, and then all these HTTP triggers.

Why don’t we go ahead and run this? I’ve got Postman here. We’re going to start a new Workflow. Let’s call it “Hello, Temporal.” I’m going to go ahead and send that. And then what it’s going to do – hopefully this is going to work; I’ve been running into some timeout issues with the .NET SDK. And I think I’m hitting that timeout issue. I could go ahead and restart this app – that usually fixes it. Chad was telling me about this, and I think that might be the problem I’m running into. But the cool thing about Azure Functions is that, if you want, you can also run everything locally. So why don’t I do that, so that we don’t have to sit here and wait too long? I’m going to go ahead and just run this project. Let’s see if I can get everything working here. If you press F5 in Visual Studio, you can run the Functions runtime locally, and you get basically the same sort of experience that you would get in the cloud, except locally.

So you can see here, it’s discovered all my functions and understands the special new Workflow trigger. If I switch over here, I think I might have something in my history where I can go ahead and send an HTTP POST. Sorry, I don’t have this set up in Postman, but I can do it here. And I can say something like “Hello, Temporal.” And we should see a very similar response that basically says, yep, we started a Workflow. If you actually look at the logs here, you can see that yes, we executed that HTTP trigger function, but we also executed this Workflow function called Hello, Temporal. And you can see that it actually executed really fast – 31 milliseconds – even though, if I now go and query that Workflow, just by doing a GET – I actually need to pass in this Workflow ID here – it’s going to dump a description of the running Workflow. We can see that it’s still running, but the function isn’t still running. The function completed, as it showed here.

Let’s see. I could do a POST. I can update the greeting to be something else, which I believe is called update name. Let’s do “Replay” instead. So, that was accepted. We can see some action: if we look at the logs here, the Workflow code ran again, basically processing the Signal. Very fast, it ran for just nine milliseconds. So you can see that we’re able to interact with this Workflow in a serverless way. Each of these executions goes very quickly, and the user basically isn’t being charged for this long-running Workflow that’s being held there. And then we can go ahead and call finish.

Then, if we go ahead and query the status of that again, we can see that the Workflow completed. In this case, it was running against the local dev server, but it also works with Temporal Cloud, which, unfortunately, is where I ran into that timeout issue with Postman. I think if I hit restart it might work now. And yeah, now it’s working after I did the restart. So I can actually go and see it running in Temporal Cloud, if I do a refresh here. Yeah, there’s that new Hello Workflow that I ran. So this project works, and it’s compatible with Temporal Cloud. Just to confirm that it indeed works and is seen by Azure as proper function invocations, let’s look at the logs.

Let’s see, I think the timing is a little bit off, so I don’t know if the logs are going to be here. Okay, it looks like they are. You can see that we actually ran this Hello Workflow function, which is the Workflow itself. We can even scroll over and see that the latency was pretty fast, just 46 milliseconds on this first run. So it’s seen as a proper function execution, and the user is only going to get billed for those 46 milliseconds, not for the full duration of however long the Workflow runs. So anyways, we can go ahead and switch back to the presentation.

That’s the demo. I wanted to show some of the things that we were able to do with the Temporal SDK to get it running on a serverless platform. Now, just to be clear, it’s completely a proof of concept. This is nothing that’s necessarily going to ship, but I just wanted to show it as an illustration of the type of thing that we do with Durable Functions, which could apply to other durable execution platforms as well.

What’s missing from the proof of concept that I showed? Support for other languages is one thing. What I showed you was .NET; slightly different work would need to be done to get it to work with all the other Temporal SDKs. Event-based scaling was another thing that I didn’t have time to get into this proof of concept: basically updating that scale controller piece that I showed you before, which is kind of a platform piece anyway. One of the things we would also need for that is something on the Temporal Server that can tell us how much work is pending, so that we can know how many VMs should be allocated. That’s another thing that could be explored more. And finally, flexible authentication options. It turns out setting this up was really painful: setting up the mTLS certificates on the function itself. We definitely didn’t follow best practices when setting that up. Once we have API keys for the data plane, this would work a lot more beautifully. I’m excited to hear that it’s at least coming for the control plane.

So, switching gears a little bit, I thought it might be cool to talk about some of the business impact that we’re seeing with Durable Functions, to paint a picture that this durable execution thing is legit even from our perspective; it’s not just Temporal hyping things up. We’re seeing some really great things with Durable Functions, including billions of executions per day in the Azure public cloud, and 65% of our top 500 enterprise FaaS customers using Durable Execution, which tells us that they find a lot of value in this beyond just the mainline scenarios: okay, I’ve got my HTTP trigger, I’ve got my queue trigger.

In the last year, there has been a 30% increase in the number of Azure subscriptions using Durable Functions. So there’s some nice growth there. Even on revenue, we’re seeing nice growth. And I apologize, I can’t give you specific numbers for any of this, but I just want to show you that there is a trend of growth going on. And it’s happening very organically, too, which is exciting. We don’t have a marketing and sales team to help us push this. What we’re finding is that our customers are looking to solve the problems that many folks are running into, and they’re finding out about Durable Functions, trying it, and loving it. So, a similar story to what you’re hearing from other folks’ talks.

We have a lot of different industry verticals that are also finding a lot of value in durable execution. Again, just reinforcing the same message that’s already being told: a lot of people are getting value out of durable execution, and it’s making a lot of teams more productive. There are some pain points with Durable Functions, though, specifically the fact that it’s linked to FaaS. The highly opinionated programming model is one thing that folks sometimes complain about. They want to have control over the main method, for example, and be able to control the worker themselves, which we don’t really give you the option to do. And FaaS, frankly, is not ideal for all workloads. If you have a low-volume workload, it’s amazing. Oftentimes, you don’t have to pay anything to host your stuff.

But as you get to higher-volume workloads, sometimes it’s better to have dedicated compute. We have that option. Something you need to be aware of is that we’re more of a library; we don’t have a server product that sits in the back. We’re just a library that points to whatever storage account you tell it to point to. That’s nice, because it’s a lightweight setup, but it can also be a pain, because you have to bring your own storage account, and we have to try to support you on your storage account. That’s not a problem with something like Temporal Cloud. So, I’m a little bit jealous of Temporal Cloud.

The key takeaways I want folks to leave here with are that durable execution is fully compatible with FaaS, in spite of the best practices that I showed before. It’s possible to make it work. And I wanted to demonstrate that durable execution has big business potential on FaaS as well. That’s what we’re seeing. So I’m really excited, as Temporal continues to grow, to see how it makes its way into the FaaS space. Running Temporal Workflows on FaaS is absolutely possible. There are a few things we identified that will make it more realistic, and hopefully that’s something we can continue to explore in the future. That’s basically it. Thanks for your patience with me. If you want to learn more about Durable Functions, feel free to follow me on Twitter, too, if you’re interested in hearing what I have to tweet about. So alright, thanks so much.