Temporal Ruby — crash-proof fibers

After a long effort, Temporal Ruby is now GA. The Temporal Ruby SDK allows Ruby developers to author durable software with a native Ruby feel. Ruby at Temporal is a fully-supported language with the same level of support and features as every other official Temporal-supported language.

In this post, we will cover what Temporal Ruby is at a high level. Then we will delve into some advanced technical details of how we used Rust, how we made durable fibers work, how we prevent illegal calls, and some interesting challenges we faced along way.

Introduction to Temporal and Temporal Ruby#

Temporal is a system and programming model for writing durable code that can run a long time and can survive crashes. In a nutshell, Temporal Workflows are sets of deterministic code that record side-effecting actions as events and then, upon crash or for other reasons, can use those events to resume from where the code left off. There are many other components to Temporal, such as Activities which are an abstraction over those side-effecting actions. See the high-level explainer docs and evaluation docs for more understanding of the system.

To support this in Ruby, Temporal offers the Ruby SDK. Here are some important links:

Ruby SDK repository
Ruby developer guide — with a quick-start guide
Ruby getting started
Ruby samples
Ruby API documentation
#ruby-sdk channel on Slack

It is MIT-licensed like all Temporal open-source software and is much more than a smart client, it translates imperative Ruby Workflow code into durable software.

To give you a taste of the Ruby SDK, we’ll create a simple one-click buying Workflow in Ruby where a purchase is started and then, unless cancelled, will be performed in 10 seconds.

Implementing an Activity#

Activities are automatically retryable functions in Temporal that allow arbitrary side-effecting code to run. In Ruby, these are classes. Here’s an implementation of a sample Purchase Activity:

require 'json'
require 'net/http'
require 'temporalio/activity'
require 'uri'

class Purchase < Temporalio::Activity::Definition
  def execute(purchase_details)
    # Could HTTP.start in initialize and reuse session as an optimization
    resp = Net::HTTP.post(
      URI('https://api.example.com/purchase'),
      JSON.generate(purchase_details),
      { 'Content-Type' => 'application/json' }
    )
    # Fail if not 2xx
    unless (200..299).include?(resp.code.to_i)
      raise Temporalio::ApplicationError.new(
        "Client error #{resp.code}: #{resp.body}",
        # 4xx is considered non-retryable
        non_retryable: (400..499).include?(resp.code.to_i)
      )
    end
  end
end

This Activity posts to HTTP and raises an exception on failure, taking care to not make that exception retryable if it’s not a retryable HTTP status code.

Implementing a Workflow#

Now that we have an Activity that performs a purchase, we can write the Workflow to do time-limited one-click buy:

require 'temporalio/workflow'
require_relative 'purchase' # Our Activity

class OneClickBuy < Temporalio::Workflow::Definition
  workflow_query_attr_reader :purchase_status

  def initialize
    @purchase_status = :pending
  end

  def execute(purchase_details)
    @current_purchase ||= purchase_details

    # Give user 10 seconds to cancel or update before we send it through
  begin
	  Temporalio::Workflow.sleep(10)
  rescue Temporalio::Error::CanceledError
	  # Canceled workflow/purchase
	  return @purchase_status = :canceled
  end

    # Update the status, perform the purchase, update the status again
    @purchase_status = :confirmed
    Temporalio::Workflow.execute_activity(
      Purchase,
      purchase_details,
      schedule_to_close_timeout: 2 * 60 # 2 minutes
    )
    return @purchase_status = :completed
  rescue
    @purchase_status = :failed
    raise
  end

  workflow_update
  def update_purchase(purchase_details)
    # Disallow if not pending (this logic could be in an update validator)
    raise Temporalio::ApplicationError, 'No longer pending' unless @purchase_status == :pending
    # Update current purchase (no need to return it)
    @current_purchase = purchase_details
    nil
  end
end

Workflows have to be deterministic and Temporal has a custom fiber scheduler to make sure all asynchronous work is deterministic (explained later).

That sleep is a durable timer. If the process this ran on crashed, no problem, it picks right back up where it left off. Sleeping for weeks is not uncommon.

This relies on Workflow cancellation to cancel the pending purchase and relies on Workflow update to be able to update the purchase in that window. See the Ruby developer guide for more details on everything a Workflow can do.

Running a Worker#

Workflows and Activities are run via Workers, like so:

require 'temporalio/client'
require 'temporalio/worker'
require_relative 'purchase' # Our Activity
require_relative 'one_click_buy' # Our Workflow

# Create a Client to localhost on "default" Namespace
client = Temporalio::Client.connect('localhost:7233', 'my-namespace')

# Create a Worker with the Client, Activities, and Workflows
worker = Temporalio::Worker.new(
  client:,
  task_queue: 'my-task-queue',
  workflows: [OneClickBuy],
  # This could be an instance if it needs state
  activities: [Purchase]
)

# Run the Worker until SIGINT
worker.run(shutdown_signals: ['SIGINT'])

When run, this Worker will communicate with the Temporal Server and handle all Workflow and Activity work given to it by the server.

Executing a Workflow#

A client can run our Workflow like so:

require 'temporalio/client'
require_relative 'one_click_buy' # Our Workflow

# Create a client to localhost on "default" Namespace
client = Temporalio::Client.connect('localhost:7233', 'my-namespace')

# Start a Workflow
handle = client.start_workflow(
  OneClickBuy,
  { item_id: 'item1', user_id: 'user1' }, # This is the input to the workflow
  id: 'my-workflow-id',
  task_queue: 'my-task-queue'
)

# We can update the purchase if we want
handle.execute_update(
  OneClickBuy.update_purchase,
  { item_id: 'item2', user_id: 'user1' }
)

# We can cancel it if we want
handle.cancel

# We can query its status, even if the Workflow is complete
puts "Status: #{handle.query(OneClickBuy.purchase_status)}"

# We can also wait on the result (which for our example is the same as query)
puts "Result: #{handle.result}"

This is a small sample showing some of the Ruby features, but there are many many more. See the Ruby developer guide and Ruby SDK README for more details.

Advanced SDK implementation details#

With the Temporal Ruby primer covered, the sections below provide some advanced, interesting implementation details about the SDK.

Rust Core + Ruby C extension#

Like TypeScript, Python, and .NET before it, Temporal Ruby leverages Temporal’s common Rust Core to handle complexities with gRPC clients, Worker state machines, and more. In addition to not having to reimplement the advanced Workflow state machines and Worker logic in each language, using a Rust Core allows the SDK to reduce its dependency count. In this case, the Ruby SDK does not have to take on a Ruby gRPC dependency which means it will not transitively impose gRPC library version constraints on its users.

To leverage this Core library, we create a Ruby C extension with Rust. Like wasmtime and others, we use Magnus and rb-sys to provide a relatively easy-to-use bridge between Ruby and Rust. We include the sdk-core repository as a submodule and reference it via our Ruby-specific Rust project.

The Rust Core uses async/Tokio extensively, and on the Ruby side, we expect all of our constructs to be asynchronous (via thread or fiber, user choice). So we must bridge the two. After research, it became apparent that a single-use Ruby Queue was both thread-aware and fiber-aware when it blocked waiting for response. So having the Ruby side pass in a queue as essentially a “completable promise” allows us to resolve the Ruby call asynchronously.

However, invoking a Queue#push even from Ruby C code requires holding the GVL. The GVL can only be acquired in a Ruby thread, and Ruby threads are only threads Ruby creates. There is no acceptable way to cheat this and have a Rust/Tokio thread masquerade as a Ruby thread for purposes of a simple GVL-held invocation. So the Temporal Ruby Runtime, that is meant to be global and often lazily created, has this in its initialize:

Thread.new do
  @core_runtime.run_command_loop
end

This runs the blocking Rust code:

pub fn run_command_loop(&self) {
  enter_sync!(self.handle);
  loop {
    let cmd = without_gvl(
      || self.async_command_rx.recv(),
      || {
        if let Err(err) = self.handle.async_command_tx.send(AsyncCommand::Shutdown) {
          log_error!("Unable to send shutdown command: {}", err);
        }
      },
    );
    match cmd {
      Ok(AsyncCommand::RunCallback(callback)) => {
        if let Err(err) = callback() {
          log_error!("Unexpected error inside async Ruby callback: {}", err);
        }
      }
      Ok(AsyncCommand::Shutdown) => return,
      Err(err) => {
        // Should never happen, but we exit the loop if it does
        log_error!("Unexpected error receiving runtime command: {}", err);
        return;
      }
    }
  }
}

What’s happening here is we are waiting on a Tokio channel receiver under the Ruby rb_thread_call_without_gvl C function (abstracted as without_gvl). Once we receive a callback to run, we are under the GVL and therefore can execute the Ruby callback. Usually, this callback is a simple Queue#push. We make sure that any callback is incredibly cheap since it is all handled by this single-threaded “reactor” loop. We also provide a way to shut it down either from the outside or as the rb_unblock_function_t function parameter that interrupts the thread.

Durable fiber scheduler#

A Temporal Workflow is actually a custom, deterministic Fiber::Scheduler. Temporal requires Workflow code be deterministic. This is so the SDK can replay the same Workflow code path to get to where it left off when it needs to resume. Traditionally asynchronous code is not deterministic, so while doing something like waiting for two racing concurrent fibers to complete may not be deterministic in most schedulers, it is in the Temporal scheduler.

How Ruby Workflows work is that all events (e.g., Signal received, Timer fired, Activity completed) are applied to the Workflow state and fibers are created in cases where they are needed (e.g. running a Signal handler). If we’re on the first Workflow task, a fiber is created representing the primary Workflow execute method. When a fiber needs to be run/resume, it calls #fiber or #unblock on the scheduler, and we enqueue those fibers on an array. Then our scheduler instance has the following method:

def run_until_all_yielded
  loop do
    # Run all fibers until all yielded
    while (fiber = @ready.shift)
      fiber.resume
    end

    return unless pending_wait_conditions
    # <... code omitted for wait_condition handling ...>
  end
end

This is your basic event loop. It pops fibers off @ready array and calls resume on them, which may in turn add more fibers on @ready via #fiber or #unblock. When there are no more fibers on @ready, it means they are all waiting on external stimulus from Temporal Server (e.g., Activity completion). We invoke this when we receive a Workflow task, and when it is done, we collect the commands that occurred during invocation and send them to the server (e.g., schedule Timer).

And since our Fiber::Scheduler implements #kernel_sleep and #timeout_after, technically sleep and Timeout.timeout are automatically made durable. However, we have disabled them by default, see “Implicitly Used Sync Constructs” under “Lessons Learned” later to understand why.

Additional async constructs#

Ruby’s standard library only offers basic Ruby fiber constructs and is missing some high-level forms users need. In this section we will discuss three — cancellation, future, and wait condition.

Temporal needs to be able to cancel things. Unlike some languages, Ruby does not offer a native concept of cancellation that is potentially hierarchical/linked, can be shielded against, can be waited on, and generally can interrupt any async call. To support cancellation, we developed Temporalio::Cancellation which is basically a “cancellation token” like you may see in .NET. It is general purpose and is used for Activities and Workflows in Temporal. All asynchronous Workflow calls (e.g. Workflow.sleep) accept a cancellation and default to the workflow-level one. For calls like sleep, a cancel will raise and tell the server to cancel the timer. For others like execute_activity, cancel will depend on the cancellation type, but defaults to raising and sending cancel to the Activity. Cancellation supports adding callbacks, waiting for cancel, shielding, and more.

Fibers are a low-level abstraction for running code concurrently, but they are hard to use at a high level and lack high-level primitives. Temporal::Workflow::Future was created to provide high-level concurrent code control. With a future, it is relatively easy to wait for multiple concurrent things to complete like so:

Temporalio::Workflow::Future.all_of(
  Temporalio::Workflow::Future.new { Temporalio::Workflow.execute_child_workflow(SomeChild1, 'my-param1') },
  Temporalio::Workflow::Future.new { Temporalio::Workflow.execute_child_workflow(SomeChild2, 'my-param2') },
  Temporalio::Workflow::Future.new { Temporalio::Workflow.execute_child_workflow(SomeChild3, 'my-param3') }
).wait

See the future docs for details on how exceptions are handled and which other utilities are available to assist concurrent execution in a Workflow.

Finally, and arguably most importantly, there is Temporalio::Workflow.wait_condition. This is an extremely powerful primitive to wait on a block to be truthy before returning its value. It is very normal to use this to literally wait on an attribute to be set in a Ruby Workflow. wait_condition is a feature only made possible by Temporal via our control over the durable fiber scheduler and our ability to re-evaluate wait conditions in the event loop. For this reason, the block/condition given to wait_condition must have no side effects since it is invoked frequently. The wait_condition primitive is so powerful that it is the backing primitive leveraged by cancellation waiting, future waiting, Activity/child waiting, etc.

Illegal call tracing#

For Workflow code to be deterministic, certain calls are disallowed. Doing things like asking for the current time, using threads, using system/secure random, etc. are non-deterministic and therefore illegal in Workflows. In TypeScript and Python SDKs, there is a sandbox that helps prevent use of these calls (or replaces them). In Go, there is static analysis tooling to help catch invalid calls made from a Workflow. Finally, in .NET, we use an event listener that is notified if the async code leaves the Workflow thread which doesn’t catch all illegal calls, but at least thread ones.

In Ruby, we are using TracePoint. This lets us check every call only on the Workflow thread, leaving other non-Workflow threads alone. The set of Workflow calls considered illegal is configurable via the illegal_workflow_calls parameter on the Worker. The default is Worker.default_illegal_workflow_calls which includes all obviously illegal classes/calls in the standard library. Since this is for each method call, it can also detect when one of these calls is used transitively (e.g., multiple calls deep).

In some cases, simply checking fully-qualified method name is not enough. For instance, Time.new('2000-12-31 23:59:59.5') is deterministic whereas Time.new with no parameters is not. For this specifically, we use the TracePoint binding to access the parameters to tell whether it’s safe or not.

Lessons learned during development#

During development of the SDK, we altered some of our original plans based on feedback from users of the alpha and beta releases. A few of these lessons learned are described in detail below.

Implicitly used sync constructs#

Originally, the Ruby SDK supported standard library use of sleep, Timeout.timeout, Queue, Mutex, etc. in Workflows because they are fiber-aware and therefore are automatically made durable by the fiber scheduler. So this was a very reasonable Workflow:

class WaitForSignalOrTimeout < Temporalio::Workflow::Definition
  def execute
    # Wait 5 minutes for something to process or fail
    to_process = Timeout.timeout(300) do
      Temporalio::Workflow.wait_condition { @to_process }
    end
    # Process it
    Temporalio::Workflow.execute_activity(Process, to_process, start_to_close_timeout: 30)
  end

  workflow_signal
  def process(to_process)
    @to_process = to_process
  end
end

Note the Timeout.timeout in there. This worked well for many users for many months. But then one user reported a rare case where a Workflow would get hung inexplicably. It was very hard to replicate, but the user was eventually able to replicate it with a Workflow that used logging and lots of loops and timers in a certain way to get the Workflow to hang a very small percentage of the time. Given this replication, we began to dig in. The first step was to increase logging, however increasing logging actually made the problem go away. Ugh.

At this point it was clear it was a rare race and adding logging would actually make the race less likely to appear. With some more non-logger logging we finally traced it down. In the Ruby logger, a Mutex is used on write. A Ruby mutex does not tell the fiber scheduler that it is blocked on synchronize unless it actually is blocked by another thread/fiber. So most of the time, the synchronize block for the Ruby logger was never considered blocking and never communicates with the fiber scheduler.

We implicitly made Mutexes durable with Temporal. Therefore, we assume the only reason a synchronize would ever be blocked/unblocked would be due to Workflow/Event stimulus. But in this case the mutex is a local-process mutex, and therefore it is unblocked by another thread that is completely unrelated to the Workflow at hand. So the fiber scheduler was told about the synchronize blocking, then we considered all coroutines yielded, and then removed ourselves as fiber scheduler so another Workflow task could use the thread. The unblock of synchronize came a split second later after the fiber scheduler was removed, so the Workflow never knew it was unblocked. And adding logging made Workflow tasks run just long enough to not give up their thread and fiber scheduler as quickly, so that’s why it was hard to detect with added logging.

We have seen other issues in other languages where implicitly treating built-in constructs as durable can be surprisingly unsafe. For instance, in Python we found asyncio.wait and asyncio.as_completed both use non-deterministic sets. In .NET, we found Task.WhenAny in rare cases used the thread pool instead of our task scheduler. Over the years we have come to recognize that magically making standard library constructs durable can cause problems mostly because they were not authored with durable execution in mind.

After discussion, we decided to forbid sleep, Timeout.timeout, Queue, Mutex, Logger, etc., and we have provided Workflow-safe alternatives in the Temporalio::Workflow module. The Workflow-safe Queue and Mutex classes actually just delegate to the standard library ones but surround each call with Unsafe.durable_scheduler_disabled to disable the durable scheduler. That gets us full queue/mutex support, but still requires Workflow authors to explicitly use them instead of accidentally and implicitly using standard library forms by calling Logger or similar.

IO wait in fiber schedulers#

When we first developed the durable fiber scheduler, we intentionally raised NotImplementedError from Fiber::Scheduler#io_wait. After all, Workflows do not allow IO because IO is non-deterministic. However, for certain behaviors it can make sense to violate determinism. This is especially true for telemetry such as tracing, metrics, logs, failure tracking services, etc., which doesn’t need to be recorded in history but still does a side-effecting action. Another common use case is for debuggers like Pry or RubyMine that use IO to communicate with separate processes.

Ruby does not have a default scheduler implementation you can fall back to. The documented Fiber::Scheduler class doesn’t even really exist. Instead, with some testing and the help of AI, we made a non-durable/blocking form that just reuses IO.select. But users must opt-in to it via a block passed to Temporalio::Workflow::Unsafe.io_enabled. Use of this and any other illegal blocking construct in a Workflow can make the task more likely to hit task timeout which defaults to 10s.

Conversion and runtime type hinting#

In most Temporal SDKs, Temporal leverages types to tell the converter what to deserialize payloads into. This occurs for things like a Workflow receiving parameters or a client receiving a Workflow result. For serialization, simply serializing the objects given is acceptable, but for deserialization, the converter often needs to know the desired type.

Statically typed runtimes like Go, Java, and .NET can simply use reflection for this. Even dynamically typed languages like Python provide runtime-accessible type hints that can be used. For TypeScript users, they are used to defining type-check-time-only interfaces to represent JSON objects, so it is reasonable to have untyped JSON objects at runtime.

By default in Ruby, the JSON module is used for (de)serialization. This means that if there is a JSON object payload in Temporal, it is deserialized as a Hash. This can be surprising to users that prefer to work with certain class objects. Ruby does not have a clear winner in the JSON/object-mapping arena. Options like ActiveModel and Oj are popular, but as a framework that is meant for general use, picking a winner and imposing dependencies with transitive version constraints would be harmful. The standard library JSON module does offer a JSON additions which are effectively just putting the fully-qualified class name into the JSON object, but this is a less-popular approach and is not language-agnostic. Still, Temporal does support JSON additions natively since it imposes no dependencies on users.

Temporal allows custom converters to be written for those not wanting default standard library JSON behavior. Users of object-mapping libraries stated that writing a custom converter was no problem, but they need to allow users to provide the desired type at the declaration site. We designed a concept called “hints” for this. Now, everywhere conversion may occur, hints can be provided. For instance, here is a Workflow that accepts hints:

class TransferFunds < Temporalio::Workflow::Definition
  workflow_arg_hint MyModels::TransferDetails
  workflow_result_hint MyModels::TransferComplete

  def execute(transfer_details)
    # Require approval if certain amount
    if transfer_details.amount > 5000
      Temporalio::Workflow.wait_condition { @approved }
    end

    # Run Activity to do transfer
    Temporalio::Workflow.execute_activity(DoTransfer, transfer_details, start_to_close_timeout: 300)
    MyModels::TransferComplete.new(id: Temporalio::Workflow.info.workflow_id)
  end

  workflow_update arg_hints: [MyModels::ApprovalDetails]
  def approve_transfer(approval_details)
    # Log and run Activity to apply the approval
    Temporalio::Workflow.logger.info("Approval requested by #{approval_details.approver}")
    Temporalio::Workflow.execute_activity(ApplyApproval, approval_details, start_to_close_timeout: 300)
    @approved = true
    nil
  end
end

This sets hints for the Workflow parameter, workflow result, and update parameter. The hint is passed to the converter at (de)serialize time. The default Temporal converter does nothing with hints. Custom converters can leverage this information to know the data type to convert the JSON into (and/or validate it’s the correct type when converting to JSON). Not only are the hints provided on the Worker side at Workflow parameter/result (de)serialization time, but the client side use of running a Workflow and getting a result will also use these hints. The hints don’t even need to be classes, they can be any object the converter may need.

Summary#

Temporal Ruby is now Generally Available and provides a great way to write durable software in a developer-friendly language like Ruby. In addition to the Temporal overview, we covered advanced implementation details and challenges encountered during development.

To get started with Ruby, see the Ruby SDK repository, Ruby getting started, and Ruby quick start guide. Feel free to join us on the #ruby-sdk channel on Slack or ask a question with the ruby-sdk tag on our Forums.