How Vodafone aims to orchestrate value added services across devices with Temporal

Temporal has paid off in terms of our speed: our ability to deliver new features and new capabilities, without having to write code every time to cope with retries and error conditions.


Industry: Telecom
Use Case: Software orchestration
Company Size: Megacorp
SDK: Java
Temporal: Cloud


Vodafone, a leading multinational telecommunications company, is in the business of connecting people through technology.

Recently, Vodafone rolled out a new generation of customer premises equipment (CPEs) to improve the customer experience. These CPEs (which include devices like routers) allow customers to add services like security and IoT to their networks.

As part of this initiative, the engineering division of the Home Technologies and Products team needed to build a new platform to install software across the CPEs. In the following story, we’ll explore the challenges the team faced, and why and how they adopted Temporal.

Orchestrating software installs across devices

The team was on a tight deadline to complete their project. The goal was to create a cloud-native CPE management system that leveraged a new protocol from the Broadband Forum called User Services Platform (USP), also known as TR-369.

Vodafone’s new CPE management system needed to orchestrate software installs, both when the CPE is initially onboarded and throughout the CPE’s lifetime. The system is composed of three services called USP Controllers, each responsible for a different domain.

From an engineering standpoint, there were numerous challenges:

  • Asynchronous work: there is potential for long delays between when the USP Controller asks the CPE to install the software and when the Controller receives a response that the task is completed. This increases the probability of intermittent failure.
  • Potential for split brain: there can be situations where two brains are trying to orchestrate the CPE: the local system and the new CPE management system.
  • Two very different workloads: the system had to support two workloads with drastically different scale requirements.
    • Onboarding: When a customer orders a new device, unboxes it, and plugs it in, the CPE onboards to the platform and any required software is installed or updated. This workload is low-scale and naturally spread out over time.
    • Campaigns: When new features and capabilities are rolled out, the platform needs to be capable of updating the entire fleet of CPEs (or subsets thereof) in one night. This workload is high-volume as the change may target many devices.
  • Scheduling work: the team was looking for a way to schedule recurring tasks across the fleet of devices with little, if any, ongoing supervision.

The team started searching for a tool that could manage intermittent failure and orchestrate their asynchronous workflows.
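To make the asynchronous-work challenge concrete, here is a minimal sketch of the kind of retry code a team would otherwise write by hand around each device call. It uses only the JDK; the class and method names (`RetrySketch`, `backoff`, `callWithRetry`) and the timing values are hypothetical and are not taken from Vodafone's codebase or from the Temporal SDK.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of Temporal or any Vodafone code.
final class RetrySketch {

    // Delay after failed attempt number `attempt` (1-based):
    // initial * 2^(attempt - 1), capped at max.
    static Duration backoff(int attempt, Duration initial, Duration max) {
        int shift = Math.max(0, Math.min(attempt - 1, 20)); // bound the shift so it cannot overflow
        Duration delay = initial.multipliedBy(1L << shift);
        return delay.compareTo(max) > 0 ? max : delay;
    }

    // Runs the task until it succeeds or maxAttempts is exhausted, sleeping between attempts.
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, Duration initial, Duration max)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff(attempt, initial, max).toMillis());
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated device call that fails twice before succeeding.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("CPE did not respond");
            }
            return "installed";
        }, 5, Duration.ofMillis(10), Duration.ofMillis(100));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A loop like this only survives as long as the process running it does. If the service restarts while waiting on a slow device, the attempt count and progress are lost, which is the gap a durable workflow engine is meant to close.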

A code-first, scalable workflow tool that supports Java

One of the reasons we chose Temporal is because we can use Java. We don’t need to declare workflows in XML or something else and then write code that plugs it together. –James Irwin, Principal Manager, Software & Product Design Practice

The team had several north stars they needed to follow for their tooling and technology stack: Java, cloud-native, horizontal scale, and good support for CI/CD. They evaluated a variety of workflow tools with these requirements in mind.

When they found Temporal, they ran a proof of concept and several spikes. They immediately realized Temporal’s value and soon chose it because it:

  • Lets developers code purely in Java.
  • Allows developers to “focus on writing code for fulfilling business needs.”
  • Originated at Uber, giving the team confidence in the technology.
  • Is open-source and in use by well-respected companies.

An architecture aligned to the business logic

When you go to the extreme of microservices architecture, and you decompose it into different domains, you end up with microservices calling microservices calling microservices. Dealing with failures and the complexity of that becomes challenging. Whereas with Temporal, we can focus on use cases and business logic execution.

When Vodafone started building with Temporal, one immediate benefit was that it let them design their system in a new, more effective way.

The CPE management system leverages a “multi-controller paradigm,” which is a new feature that USP enables. The system is divided into three separate controllers. Each controller is responsible for one domain, offers a limited set of functionality, and can access only a subset of the CPE Data Model, following the principle of least privilege.
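The least-privilege idea can be sketched as a simple allow-list of data model path prefixes per controller. This is only an illustration of the principle, not how USP or Vodafone's platform encodes permissions: the controller names and the path prefixes assigned to them are assumptions made up for the example.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: controller names and allowed prefixes are hypothetical.
final class ControllerAccessSketch {

    // Each controller may touch only the parts of the data model listed here.
    static final Map<String, List<String>> ALLOWED_PREFIXES = Map.of(
            "software-controller", List.of("Device.SoftwareModules."),
            "wifi-controller", List.of("Device.WiFi."),
            "diagnostics-controller", List.of("Device.DeviceInfo.", "Device.IP.Diagnostics."));

    // True if the controller has an allowed prefix that matches the requested path.
    // Unknown controllers get no access at all.
    static boolean mayAccess(String controller, String path) {
        return ALLOWED_PREFIXES.getOrDefault(controller, List.of()).stream()
                .anyMatch(path::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(mayAccess("software-controller", "Device.SoftwareModules.ExecutionUnit.1.Status"));
        System.out.println(mayAccess("software-controller", "Device.WiFi.SSID.1.SSID"));
    }
}
```

The first call is allowed and the second is refused, because the software controller has no prefix covering the Wi-Fi subtree.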

With a traditional microservices architecture, the engineers would have spent many cycles mapping out the domains and the actions each service needed to accomplish, and ensuring the microservices communicated properly and coped with failures between them.

With Temporal, the engineers created four Namespaces: one for each controller and another to manage updates (or campaigns) across the fleet of CPEs. Then they defined Workflows to encompass the required actions, without hand-writing retry and recovery logic for each failure scenario. Their distributed fleet of Workers scales horizontally to execute the business logic. They can invoke Workflows in one Namespace from another, and they intend to migrate this cross-Namespace communication to Nexus.

Vodafone architecture

Faster time from inception to reality

With Temporal, we just write code for fulfilling business needs, rather than writing code for fulfilling business needs in the face of durability problems. The time to realize a new use case is much, much lower than if we were using a traditional microservices architecture.

With Temporal, the Vodafone team hit their tight deadline. They got their new platform up and running and in production on time. They’ve also gained a number of other benefits from Temporal:

  • Durable code execution in the face of failures: They’ve encountered numerous instances where an issue with a controller would normally have impacted end customers and caused an operational headache. With Temporal’s durability and built-in retries, the system kept operating.
  • Horizontal scaling for high-volume campaigns: Temporal lets the system easily scale up to hundreds of thousands of Workflows per minute in order to upgrade devices for campaigns.
  • Faster, more agile realization of new features: The team just needs to write business logic now, instead of code for failure handling, so they are much quicker at rolling out new features.
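To give a feel for what the campaign workload implies, the sketch below works out the start rate needed to cover a fleet within a maintenance window and splits a device list into waves. The fleet size, window length, wave size, and all names (`CampaignPlanSketch`, `requiredStartsPerSecond`, `waves`) are placeholders chosen for illustration; they are not Vodafone's figures, and the article does not say the team batches devices this way.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Illustrative planning arithmetic; all inputs are hypothetical.
final class CampaignPlanSketch {

    // Workflow starts per second needed to cover the fleet within the window, rounded up.
    static long requiredStartsPerSecond(long devices, Duration window) {
        long seconds = window.getSeconds();
        if (seconds <= 0) {
            throw new IllegalArgumentException("window must be at least one second");
        }
        return (devices + seconds - 1) / seconds;
    }

    // Splits device IDs into consecutive waves of at most waveSize devices.
    static List<List<String>> waves(List<String> deviceIds, int waveSize) {
        if (waveSize < 1) {
            throw new IllegalArgumentException("waveSize must be at least 1");
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < deviceIds.size(); i += waveSize) {
            int end = Math.min(i + waveSize, deviceIds.size());
            result.add(List.copyOf(deviceIds.subList(i, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder fleet of 1,000,000 devices and an 8-hour overnight window.
        System.out.println(requiredStartsPerSecond(1_000_000, Duration.ofHours(8))); // prints 35
        System.out.println(waves(List.of("a", "b", "c", "d", "e"), 2)); // prints [[a, b], [c, d], [e]]
    }
}
```

Each device update involves more than one step, so the actual load on the orchestration layer is a multiple of the start rate. That is why horizontal scaling of Workers matters for campaigns.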

Migration to Temporal Cloud

Temporal Cloud’s new custom storage backend was attractive in terms of performance.

After self-hosting Temporal for some time, the Vodafone team decided to migrate to Temporal Cloud. They never encountered issues with their self-hosted cluster, but Temporal Cloud’s advantages were appealing.

They looked at their typical Temporal usage to project what the cost of Temporal Cloud Actions would be. They also factored in the cost of their current database, compute resources, and the operational overhead of managing the cluster. The database costs in particular were substantial as they continued scaling their use case.
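The comparison described above comes down to simple arithmetic, sketched here. Every number in `main` is a placeholder chosen to make the sums easy to follow; none are Vodafone's figures or Temporal's published prices. The model is also deliberately simplified, since a real estimate would cover further line items such as storage.

```java
// Illustrative cost comparison; all inputs are hypothetical placeholders.
final class CostProjectionSketch {

    // Consumption-based estimate: monthly Actions times an assumed price per million Actions.
    static double cloudMonthly(long actionsPerMonth, double pricePerMillionActions) {
        return actionsPerMonth / 1_000_000.0 * pricePerMillionActions;
    }

    // Self-hosted estimate: database + compute + engineering time spent operating the cluster.
    static double selfHostedMonthly(double database, double compute,
                                    double opsHours, double hourlyRate) {
        return database + compute + opsHours * hourlyRate;
    }

    public static void main(String[] args) {
        double cloud = cloudMonthly(200_000_000L, 25.0);           // placeholder usage and price
        double selfHosted = selfHostedMonthly(3000.0, 1500.0, 40.0, 80.0); // placeholder costs
        System.out.println("cloud estimate:       " + cloud);
        System.out.println("self-hosted estimate: " + selfHosted);
    }
}
```

The value of writing the model down is that the operational-overhead term, which is easy to leave out, is made explicit alongside the infrastructure costs.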

After crunching the numbers, the Temporal Cloud pricing estimate looked favorable. There were other compelling reasons Vodafone chose Temporal Cloud as well:

  • Reduced operational complexity and easier upgrades: While self-hosting, their team needed to understand the Temporal Service at a low level, ensuring it was constantly tuned to align with the evolving needs of their use case. This effort, combined with their desire to keep up with Temporal’s frequent releases, took up engineering time.
  • Optimized storage backend: They predicted their database, Postgres, would eventually become a bottleneck based on their scale goals for the coming years. They would have had to migrate to Cassandra to support the necessary scale, which was undesirable. Temporal Cloud’s optimized storage backend offered the scale they needed.
  • Automatic horizontal scale: While self-hosting, if they wanted to run a campaign, the team had to manually scale up their Postgres minimum capacity units so they were warm and ready to handle large numbers of Workflows per second. With Temporal Cloud, they expect the new custom storage backend to handle this scale without human intervention, so their team only needs to handle Worker scaling, which can easily be automated.
  • Reduced costs when not at peak scale: While self-hosting, the Vodafone team had to over-provision the capacity of their Temporal Service even during times when they weren’t running campaigns. This meant paying for compute and other infrastructure that sat idle, whereas Temporal Cloud costs are consumption-based.
  • Secure data model: Temporal Cloud passed the security review more easily than expected due to its data converter architecture. Temporal Cloud doesn’t see any customer data (besides Workflow metadata), and all data is encrypted.
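The security point rests on encrypting payloads on the client side, so that only ciphertext leaves the customer's environment. The sketch below shows that idea with AES-GCM from the JDK alone. In the Temporal Java SDK this kind of logic would typically sit behind a custom payload codec, but no SDK wiring is shown here, and the class name, key handling, and message layout are assumptions made for illustration, not Vodafone's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustrative client-side payload encryption; key management is out of scope here.
final class PayloadEncryptionSketch {
    private static final int IV_BYTES = 12;   // nonce length used in this sketch
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates a fresh 256-bit AES key. A real system would load keys from a key store.
    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        return generator.generateKey();
    }

    // Returns IV || ciphertext+tag, using a fresh random IV for every message.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        RANDOM.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        return ByteBuffer.allocate(iv.length + ciphertext.length).put(iv).put(ciphertext).array();
    }

    // Reads the IV from the front of the message, then decrypts and verifies the rest.
    static byte[] decrypt(SecretKey key, byte[] message) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
        return cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SecretKey key = newKey();
        byte[] sealed = encrypt(key, "install package X on device 42".getBytes(StandardCharsets.UTF_8));
        // Only `sealed` would be handed to the orchestration service; the key stays with the client.
        System.out.println(new String(decrypt(key, sealed), StandardCharsets.UTF_8));
    }
}
```

Because the key never leaves the client, the service that stores and routes the payload cannot read it. This is the property that tends to make a security review simpler.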

What's next

As Vodafone expands to new geographies and continues to scale up, Temporal Cloud will support their journey and enable the team to continue innovating.
