They say something strange happens deep within Snap Tower, hidden behind the pristine walls and endless lines of code. Not everyone hears about it, but developers know — whispering late into the night over debug logs.
At first, everything runs perfectly. Snap’s microservices hum like a chorus, sending ephemeral messages that vanish just as quickly as they appear. But sometimes... sometimes, messages don’t just disappear on purpose — they disappear wrong.
One foggy October evening, Emily, an engineer on-call, sat alone at her desk. Her team had deployed a new workflow using Temporal earlier that day — routine updates, nothing out of the ordinary. But tonight, her alerts app began to glow with unread notifications.
500 Internal Server Error
StatusRuntimeException: UNAVAILABLE: io exception
503 Service Unavailable
Emily groaned. “Not now.” She had planned to head home, maybe carve a pumpkin, and unwind with a true crime podcast. But it looked like the system had other plans.
She stared at the cascading failures in front of her. The ephemeral messaging service, Snap’s pride and joy, was throwing errors everywhere. It wasn’t just one queue — it was everything, all at once.
Her fingers danced over the keys, digging into traces, following cross-service calls and state changes from one span to the next. But as she reached the root cause, her terminal froze. A chill filled the air. The overhead lights flickered. A whisper curled out from the darkness beyond her screen, soft but unmistakable.
“The messages are not yours to retrieve…”
Emily’s pulse quickened. She could have sworn she heard it. A voice — no, a presence. Something ancient, something vast, watching from the depths of her system, where code and chaos intertwined.
She tried restarting her tools, but every attempt spiraled into deeper anomalies. Message traces pointed to workflows that didn’t exist. Dead-letter queues spat out strange symbols, glyphs Emily had never seen before.
She knew the message queue was implemented to help the system weather network hiccups and other brief disruptions. It was supposed to make these failures manageable. But this… this was something else. This wasn’t just a system error — it felt alive, and it didn’t want her meddling. Her terminal buzzed to life once more, but the output was... wrong. An endless loop of timestamps from the future, each accompanied by the same ominous message:
The void remembers.
Desperate, Emily escalated to her team. But no one answered. Every chat, every call, every request for help was met with silence. Outside the window, the fog pressed tighter against the glass, as if sealing her in with the haunted code.
She had no choice. With a trembling hand, she triggered the emergency state designed to purge and restart the system. The screen flickered, and for a moment, everything went dark.
Then, a notification appeared:
Workflow Complete: All systems operational.
Relief washed over Emily, though unease lingered in the pit of her stomach. She packed her things quickly, eager to escape the oppressive silence of the office.
As she left, her phone buzzed once more — a final status page? A new incident? She glanced down at a message from an unknown number:
“You fixed nothing. The void remembers.”
Emily felt her heart skip a beat. Somewhere deep in the system — in the unseen space between ephemeral messages and retries — something was still watching. And it wasn’t done with her yet.
The Real Story
While Emily’s tale is fiction, the eerie struggle she faced mirrors Snap’s real-world challenges with managing ephemeral messaging across a microservices architecture. With millions of users posting stories from around the world, the Snap engineering team had to contend with high volumes of data interacting across multiple services, databases, and cloud providers.
Ensuring resiliency and consistent event delivery under these conditions is no small feat. With a modern microservices architecture, the team at Snap faced a familiar challenge: they were spending considerable time and effort orchestrating services and writing code to handle errors and track state, all to keep the system resilient.
While message queues are essential for decoupling services and handling asynchronous processes, they come with inherent limitations, especially for ephemeral messaging. In the event of a network outage or a high-volume failure, queued messages can get backed up or even lost, undermining reliability. Additionally, message queues can be tricky to scale efficiently. Snap’s team encountered these challenges firsthand: notifications would sometimes fail to deliver, and duplicate data processing jobs strained resources, driving up costs. While message queues provide short-term solutions, they don’t solve the fundamental problem of tracking message states reliably across services.
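To make the duplicate-processing problem above concrete, here is a minimal sketch of tracking message state by ID so that a redelivered message is not processed twice. It is an illustration only, not Snap's implementation: the `NotificationTracker` class, the in-memory store, and the message IDs are all hypothetical, and it assumes a queue that can deliver the same message more than once.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    DELIVERED = "delivered"


@dataclass
class NotificationTracker:
    """Hypothetical in-memory state store keyed by message ID."""
    states: dict = field(default_factory=dict)
    sent: list = field(default_factory=list)

    def handle(self, message_id: str, payload: str) -> bool:
        """Process a queued message once; return False for a duplicate."""
        if self.states.get(message_id) is State.DELIVERED:
            return False  # redelivery from the queue: skip the work
        self.states[message_id] = State.PENDING
        self.sent.append(payload)  # stand-in for the real notification send
        self.states[message_id] = State.DELIVERED
        return True


tracker = NotificationTracker()
# Assumed scenario: the queue redelivers "m1" after a timeout.
deliveries = [("m1", "hello"), ("m2", "world"), ("m1", "hello")]
results = [tracker.handle(mid, body) for mid, body in deliveries]
```

In a real system the state store would have to be durable and shared across services, which is exactly the bookkeeping that becomes costly to write and maintain by hand.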
Recognizing that reliability was key to offering high-quality experiences for users and advertisers, Snap turned to Temporal to solve these issues. Temporal’s workflows allowed Snap’s engineering team to create durable systems that manage retries automatically, track message states, and ensure no data gets lost, even when parts of the system falter. With Temporal, Snap could recover gracefully from disruptions — avoiding the nightmare of dropped messages and failed queues.
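The idea of automatic retries can be sketched in a few lines of plain Python. This is not Temporal's API and not Snap's code. It is a hand-rolled illustration of retrying a failing step with exponential backoff while recording each attempt, and the names `run_with_retries`, `TransientError`, and `flaky_send` are hypothetical.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying."""


def run_with_retries(step, max_attempts=5, initial_delay=0.01,
                     backoff=2.0, sleep=time.sleep):
    """Call step() until it succeeds or attempts run out.

    Returns (result, history), where history records each attempt.
    """
    history = []
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
            history.append((attempt, "ok"))
            return result, history
        except TransientError:
            history.append((attempt, "failed"))
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            sleep(delay)
            delay *= backoff  # wait longer before the next attempt


calls = {"count": 0}


def flaky_send():
    """Stand-in for a service call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("service unavailable")
    return "delivered"


# Pass a no-op sleep so the example runs instantly.
result, history = run_with_retries(flaky_send, sleep=lambda _: None)
```

The sketch keeps its attempt history in memory, so a crash of the calling process would lose it. A workflow engine that persists this state removes the need to write and maintain such loops in every service.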
In the end, adopting Temporal ensured not only peace of mind but also a reliable platform to build on, free from the lurking horrors of ephemeral chaos.
For a more detailed exploration of this story, please see: Build a Reliable System in a Microservices World at Snap.