A dedicated state management platform for microservices gives us the same kind of abstraction that we see in atomic database transactions.
Microservices were meant to be a blessing, but for many, they’re a burden. Some developers have even moved away from them after negative experiences. Operational complexity becomes a headache for this distributed, granular software model in production. Is it possible to solve microservices’ problems while retaining their advantages?
Microservices shorten development cycles. Changing a monolithic code base is a complex affair that risks unexpected ramifications. It’s like unraveling a sweater so that you can change its design. Breaking that monolith down into lots of smaller services managed by two-pizza teams can make software easier to develop, update, and fix. It’s what helped Amazon grow from a small e-commerce outfit to the beast it is today.
Microservices also introduce new challenges. Their distributed nature exposes developers to complex state management issues.
The Yoda Principle
Ideally, developers shouldn’t deal with state management at all. Instead, the platform should handle it as a core abstraction. Database transaction management is a good example; many database platforms support atomic transactions, which group a set of smaller operations into a single unit and ensure that either all of them happen or none of them do. To achieve this behavior, the database uses transaction isolation, which hides each operation in a transaction from other clients until the entire transaction commits. If an operation fails, the application using the database sees only the pre-transaction state, as though none of the operations happened.
This transactionality enables the developer to concentrate on their business logic while the database platform handles the underlying state. A database transaction doesn’t fail half complete and then leave the developer to sort out what happened. An account won’t be debited without the corresponding party’s account being credited, for example. As Yoda said: “Do. Or do not. There is no try.” Appreciated ACID-compliant SQL databases, he would have.
“Phew,” you think. “Thank goodness I don’t have to write code to unravel half-completed operations just to work out the transaction state.” Unfortunately, microservices developers are still living in that era. This is why Yoda never used Kubernetes.
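To make the database guarantee described above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are assumptions invented for illustration. The credit is applied first so that a failing debit visibly forces the database to undo work that had already succeeded.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` atomically; return True on success."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back every statement in it if the block raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects an overdraft here, after the
            # credit above has already been applied inside the transaction.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical schema and starting data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)          # enough funds: commits
overdraft = transfer(conn, "alice", "bob", 500)  # constraint fails: rolls back
print(ok, overdraft, balances(conn))
```

The second call returns False and leaves both balances exactly as the first transfer left them. The application never has to work out whether the credit to bob needs reversing, because the database has already reversed it.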
I’ve Got a Bad Feeling about This
In microservice architectures, a single business process interacts with multiple services, each of which operates and fails autonomously. There is no single monolithic engine to manage and maintain state in the event of a failure.
This lack of transactionality between independent services leaves developers holding the bag. Instead of just focusing on their own applications’ functionality, they must also handle application resilience by managing what happens when things go wrong. What was once abstracted is now their problem.
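As a rough sketch of what "holding the bag" looks like in code, consider the same debit-and-credit operation split across two independent services. Everything here is hypothetical: AccountService is an in-memory stand-in for a remote service, and transfer_funds is the kind of hand-written recovery logic a developer ends up owning when no shared transaction exists.

```python
class ServiceError(Exception):
    """Raised when a (hypothetical) remote service call fails."""


class AccountService:
    """In-memory stand-in for an independently deployed account service."""

    def __init__(self, balance, healthy=True):
        self.balance = balance
        self.healthy = healthy

    def apply(self, delta):
        """Adjust the balance, or fail the way an unavailable service would."""
        if not self.healthy:
            raise ServiceError("service unavailable")
        self.balance += delta


def transfer_funds(source, target, amount):
    """Debit `source`, credit `target`, and undo the debit by hand on failure."""
    source.apply(-amount)      # step 1: the debit takes effect immediately
    try:
        target.apply(amount)   # step 2: the credit can fail on its own
    except ServiceError:
        # Nothing rolls back step 1 for us, so the caller must compensate.
        # If this refund call also fails, the system stays inconsistent.
        source.apply(amount)
        return "compensated"
    return "completed"


alice = AccountService(balance=100)
bob = AccountService(balance=50, healthy=False)
outcome = transfer_funds(alice, bob, 30)
print(outcome, alice.balance, bob.balance)
```

Even this toy version leaves questions open: what happens if the refund fails, or if the process running transfer_funds dies between the two steps. Answering them is the resilience work that lands on the developer.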
In practice, things can go wrong quickly in microservice architectures, with cascading failures that cause performance and reliability problems. For example, a service that one development team updates with new error types can cause other services to fail if they haven’t also been updated to handle those new errors.
The brittle complexity of microservices is a challenge, in part because of the weakest link effect. An application’s overall reliability is only as good as its least reliable microservice. The whole thing becomes a lot harder with asynchronous primitives. State management is more difficult if a microservice’s response time is uncertain.
Look at the Size of That Thing
Another aspect of this problem is that managing state on your own doesn’t scale well. The more microservices a company runs, the more time-consuming managing their state becomes. Companies often have thousands of microservices in production, outnumbering their developers. This is what we noticed as early developers at Uber, which had 4,000 microservices even back in 2018. In this environment, we spent most of our time writing code to manage microservice state.
Developers have taken several homegrown approaches to state management. Some use Kafka event streams hidden behind an API to queue messages between microservices, but the lack of diagnostics makes root cause analysis a nightmare. Others use databases and timers to keep track of the system state.
Monitoring and tracing can help, but only up to a point. Monitoring tools oversee platform services and infrastructure health, while tracing makes it easier to troubleshoot bottlenecks and unexpected anomalies. There are many on offer. For example, Prometheus offers open-source monitoring that developers can query, while Grafana, which is often paired with it, adds dashboards for visualizing system behavior.
These solutions can be useful, providing at least some observability into microservices-based systems. However, monitoring tools don’t help with the task of state management, leaving that burden with the developer. That’s why developers spend way too much time writing state management code instead of highly differentiated business logic. In an ideal world, something else would abstract state management for them.
Use the Microservices State Management Platform, Luke
The answer to simplifying state management in microservices is to offer it as a core abstraction for distributed systems.
We worked on a state management platform after spending far too much time manually managing microservices state at Uber. We wanted a product that would enable us to define workflows that make calls to different microservices (in the language of the developer’s choosing), and then execute them without worrying about it afterwards.
In our solution, which we originally called Cadence, a workflow (a function that defines your high-level business logic in code) automatically maintains state while waiting for potentially long-running microservices to respond. Its concurrent nature also enables the workflow to continue with other non-dependent operations in the meantime.
The system manages disruption in state without requiring developer intervention. For example, in the event of a hardware failure, the state management platform will continue running a Workflow on another machine in the same state without the developer needing to do anything.
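One way to picture a workflow resuming "in the same state" on another machine is to journal the result of every completed step and replay that journal on restart. The sketch below is a toy built on that assumption, using only the Python standard library. Journal, transfer_workflow, and the step functions are hypothetical names, and none of this is the Cadence or Temporal API or a description of their internals.

```python
import json
import os
import tempfile


class Journal:
    """Append-only record of completed steps, persisted as JSON lines."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.results[entry["step"]] = entry["result"]

    def run(self, step, fn, *args):
        """Return the recorded result of `step`, or execute and record it."""
        if step in self.results:
            return self.results[step]  # replay: skip work already done
        result = fn(*args)
        with open(self.path, "a") as f:
            f.write(json.dumps({"step": step, "result": result}) + "\n")
        self.results[step] = result
        return result


def transfer_workflow(journal, debit, credit, amount):
    """Business logic as straight-line code; the journal carries the state."""
    journal.run("debit", debit, amount)
    journal.run("credit", credit, amount)
    return "done"


# Simulated services that record every call they actually receive.
calls = []

def debit(amount):
    calls.append(("debit", amount))
    return amount

def credit(amount):
    calls.append(("credit", amount))
    return amount

def crashing_credit(amount):
    raise RuntimeError("machine lost")  # stands in for a hardware failure


path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
try:
    transfer_workflow(Journal(path), debit, crashing_credit, 30)
except RuntimeError:
    pass  # the first "machine" dies after the debit was journaled

# A fresh Journal on the same file plays the role of a second machine.
status = transfer_workflow(Journal(path), debit, credit, 30)
print(status, calls)
```

On the second run the debit is not repeated, because its result is read back from the journal, and only the credit is executed. The toy still has an obvious gap: a crash after a step returns but before its journal entry is written would cause that step to run again, which is the kind of edge case a real platform has to close on the developer's behalf.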
Do. Don’t Not Do.
A dedicated state management platform for microservices gives us the same kind of abstraction that we see in atomic database transactions. Developers can be certain that a Workflow will run once, to completion. Temporal, the platform that grew out of Cadence, takes care of any failures and restarts under the hood. Now, microservices-based applications can guarantee, in just a couple of lines of code, that a debit from one account will always be matched by a credit to the other. Developers get the best of both worlds: the shorter development cycles of microservices with the exactly-once execution of database transactions.
This fixes a long-standing problem with microservices and supercharges developer productivity, especially now that developers are typically responsible for operating their applications as well as developing them. Finally, developers who want the benefits of microservices can enjoy them without having to go to the dark side.