Google’s recent release of Gemini 2.5 Flash, Gemini 2.5 Pro, and the powerful Veo 2 video generation model has supercharged the AI community. These models allow developers to build sophisticated, agentic systems that can reason, plan, and create in new and exciting ways.

However, moving from a mind-blowing demo to a reliable, production-ready agentic system introduces a host of engineering challenges. These systems are inherently complex, long-running, and distributed. They are susceptible to transient failures, from network hiccups and API rate limits to server crashes. How do you ensure a multi-step, hour-long process doesn't fail because of a momentary glitch?

This is where Temporal comes in. As a Durable Execution platform, Temporal provides the resiliency and reliability necessary to orchestrate these complex AI workflows.

In this blog, we'll explore the common failure points of agentic systems and demonstrate how Temporal's architecture offers a powerful solution. We will illustrate this using an open-source video generation system that uses Gemini 2.5 Flash to script scenes, Veo 2 to generate video clips, and Google Cloud Storage to stage the final video.

Video 1. An AI-generated video using the prompt: "Mermaids, dolphins, and octopuses performing for a circus performance."

The Achilles' heel of agentic systems in production

Generative AI agents are powerful, but the workflows they execute are often long-running and stateful, making them fragile in a production environment. Let's break down the key challenges.

  1. Stateful, long-running processes: A typical agentic system consists of a sequence of generative tasks. For a video generation system, the sequence may include prompt analysis, script generation, video generation, and post-processing. This entire process could take minutes or even hours. If the server running this logic crashes mid-way, how do you resume from where you left off without losing all the intermediate work?
  2. Transient failures: Behind the curtain, agentic systems are a form of distributed system, and the reality of distributed systems is that things fail. How will the system handle a temporary network blip, a 429 error from a rate limiter, or a brief service interruption in a call to the Gemini or Veo API?
  3. Costly re-runs: Generative AI tasks are resource-intensive and expensive. If a video generation workflow fails during the final step, how can you avoid re-running the entire job from scratch?
  4. Visibility and debugging: Failures in a multi-step agentic system can be hard to trace. When a workflow does fail, how can you get a clear, auditable history of every step to identify the root cause?

Temporal: The foundation for your durable agents

Temporal is designed to solve precisely these challenges. By providing an observable Durable Execution platform, Temporal allows you to focus on building powerful AI features, not on the “plumbing” for failure handling.

At the heart of Temporal is the Workflow, a durable and resumable function. You write your multi-step AI process as a single piece of code, and Temporal ensures it executes to completion, even in the face of transient failures. Your Workflow's state is automatically persisted, so if a server crashes, your agent can resume from its last known good state. This ensures your hour-long video generation Workflow runs to completion exactly once, avoiding costly reruns.

Interactions with external services, like calling the Gemini or Veo APIs, are performed within Activities. Temporal manages the execution of these Activities with built-in, configurable retries, timeouts, and error handling. That transient network blip? Temporal's default automatic retry policy, which uses exponential backoff with jitter, handles it for you.
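
As a sketch of what that configuration looks like in the Python SDK (the Activity function here is illustrative, not part of the sample project), a custom RetryPolicy can be attached to any Activity invocation:

from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class ExampleWorkflow:
    @workflow.run
    async def run(self, prompt: str) -> str:
        # Call a hypothetical flaky Activity with explicit retry and
        # timeout settings instead of the SDK defaults.
        return await workflow.execute_activity(
            call_gemini_api,  # illustrative Activity function
            prompt,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),  # first retry after 1s
                backoff_coefficient=2.0,                # exponential backoff
                maximum_interval=timedelta(minutes=1),  # cap delay between retries
                maximum_attempts=10,                    # then give up
                non_retryable_error_types=["InvalidPromptError"],  # fail fast on these
            ),
        )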

For debugging, Temporal provides a detailed event history for every Workflow execution via its Web UI. You get a complete, auditable log of every Activity, including its inputs, outputs, and any errors, making it trivial to pinpoint the exact point of failure.
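
The same history is also available programmatically. A rough sketch with the Python SDK (the Workflow ID and server address are illustrative):

from temporalio.client import Client

async def dump_history() -> None:
    client = await Client.connect("localhost:7233")  # assumes a local Temporal service
    handle = client.get_workflow_handle("video-generation-demo")  # illustrative ID
    # Iterate over every event Temporal recorded for this execution.
    async for event in handle.fetch_history_events():
        print(event.event_id, event.event_type)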

Most importantly, a Temporal Workflow is just code, so you can use popular tooling like Pydantic and the Google Cloud SDK directly. There's no need to learn a new DSL or wrangle complex YAML configurations; your Workflow is defined in the languages you already know.
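
For example, recent versions of the Python SDK ship a Pydantic data converter, so Pydantic models like the Scene class shown later in this post can cross the Workflow/Activity boundary with no custom serialization code (a minimal sketch; the server address is an assumption):

from temporalio.client import Client
from temporalio.contrib.pydantic import pydantic_data_converter

async def connect() -> Client:
    # The Pydantic data converter serializes Pydantic models passed to and
    # from Workflows and Activities without any custom glue code.
    return await Client.connect(
        "localhost:7233",
        data_converter=pydantic_data_converter,
    )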

Figure 1. A visual representation of a Temporal Workflow Execution in the Temporal Cloud Web UI.

Architecture of a resilient video generation system

Let's look at a high-level architecture of the open-source video generation system built with Temporal and Google's AI services.

Figure 2. A flow chart of the Activities in the sample video generation system.

  1. User prompt: The process begins when a user passes a video description prompt to start the video generation Workflow.
  2. Scene generation with Gemini: An Activity calls the Gemini 2.5 Flash API. Leveraging Gemini's structured output capabilities, the user's prompt is transformed into a set of distinct scenes with descriptions and camera directions.
  3. Parallel video generation with Veo 2: The Workflow then executes multiple Activities in parallel. For each scene, Gemini 2.5 Flash generates a prompt optimized for video generation, and Veo 2 then uses that prompt to generate a video clip for the scene.
  4. Final compilation and completion: Once all the parallel video generations are complete, the final step is to combine the clips into a single video file. The finalized video is uploaded to Google Cloud Storage and its URI is returned to the user.

A look at the code: Temporal and Gemini in action

Let's examine some code snippets from the video generation system to see how this is built.

The main logic resides in the VideoGenerationWorkflow Workflow Definition below. It outlines the exact steps of our business logic.

import asyncio
from datetime import timedelta

from temporalio import workflow

# Project-specific imports (Activities, schema types, settings) omitted for brevity.

@workflow.defn
class VideoGenerationWorkflow:
    @workflow.run
    async def run(self, arg: VideoGenerationWorkflowInput) -> VideoGenerationWorkflowOutput:
        workflow_start_time = workflow.now()
        gcs_staging_directory = (
            f"videos/{workflow_start_time.strftime('%Y%m%d_%H%M%S')}"
        )

        # Expand user prompt into movie scenes.
        scenes = await workflow.execute_activity_method(
            VideoGenerationActivities.create_scenes,
            CreateScenesInput(prompt=arg.user_prompt),
            start_to_close_timeout=timedelta(seconds=30),
        )

        # For each scene, generate a video in parallel.
        scene_gcs_paths: list[tuple[int, str]] = await asyncio.gather(
            *[self._process_scene(scene, gcs_staging_directory) for scene in scenes]
        )

        # Combine the video scenes into a single video.
        scene_gcs_paths.sort(key=lambda x: x[0])
        full_video_gcs_name: str = await workflow.execute_activity_method(
            VideoGenerationActivities.merge_videos,
            MergeVideosInput(
                gcs_video_paths=[x[1] for x in scene_gcs_paths],
                gcs_staging_directory=gcs_staging_directory,
            ),
            start_to_close_timeout=timedelta(seconds=30),
        )

        return VideoGenerationWorkflowOutput(
            gcs_uri=f"gs://{video_gen_settings.GCS_BUCKET_NAME}/{full_video_gcs_name}",
        )

    async def _process_scene(self, scene: Scene, gcs_staging_directory: str) -> tuple[int, str]:
        vgm_prompt: str = await workflow.execute_activity_method(
            VideoGenerationActivities.generate_vgm_prompt,
            scene,
            start_to_close_timeout=timedelta(seconds=30),
        )
        scene.vgm_prompt = vgm_prompt
        gcs_path = await workflow.execute_activity_method(
            VideoGenerationActivities.generate_video_for_scene,
            GenerateVideoForSceneInput(
                current_scene=scene,
                gcs_staging_directory=gcs_staging_directory,
            ),
            start_to_close_timeout=timedelta(minutes=2),
        )
        return (scene.sequence_number, gcs_path)

Notice how the Workflow code reads like a simple script. It calls the create_scenes Activity, then fans out to run the generate_vgm_prompt and generate_video_for_scene Activities for each scene in parallel using Python's native asyncio.gather. The best part is that Temporal handles all the complexity of retries, state persistence, and parallel execution.
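
To make this concrete end to end, here is a rough sketch of the Worker and client code that could host and start this Workflow. The task queue name and Activity registration list are assumptions based on the snippets in this post, not verbatim code from the sample project:

import asyncio

from temporalio.client import Client
from temporalio.worker import Worker

async def main() -> None:
    client = await Client.connect("localhost:7233")  # assumes a local Temporal service

    activities = VideoGenerationActivities()
    worker = Worker(
        client,
        task_queue="videogen-task-queue",  # illustrative name
        workflows=[VideoGenerationWorkflow],
        activities=[
            activities.create_scenes,
            activities.generate_vgm_prompt,
            activities.generate_video_for_scene,
            activities.merge_videos,
        ],
    )
    async with worker:
        # With the Worker polling, start the Workflow and wait for its result.
        output = await client.execute_workflow(
            VideoGenerationWorkflow.run,
            VideoGenerationWorkflowInput(
                user_prompt="Mermaids, dolphins, and octopuses performing for a circus performance."
            ),
            id="video-generation-demo",
            task_queue="videogen-task-queue",
        )
        print(output.gcs_uri)

asyncio.run(main())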

The create_scenes Activity is where we interact with Gemini 2.5 Flash. The real power here comes from combining Temporal's reliability with Gemini's ability to produce structured output.

# workflows/videogen/schema.py

from pydantic import BaseModel

class Scene(BaseModel):
    sequence_number: int
    description: str
    duration_estimate: int
    camera_angle: str
    lighting: str
    vgm_prompt: str | None

# workflows/videogen/activities.py

import tempfile
from pathlib import Path

from temporalio import activity

# Project-specific imports (GoogleGemini, video_gen_settings, schema types) omitted for brevity.

class VideoGenerationActivities:
    def __init__(self):
        self._google_api_key = video_gen_settings.GOOGLE_API_KEY
        self._llm = GoogleGemini(api_key=self._google_api_key)

    @activity.defn
    async def create_scenes(self, arg: CreateScenesInput) -> list[Scene]:
        llm_prompt = f"""
You are a creative AI agent that transforms user input into cinematic movie scenes. Your task is to take any concept, story, or idea and convert it into a compelling visual narrative with dramatic flair and artistic vision.
.
.
.
Transform the user's input into something unexpectedly cinematic, whether it's mundane or fantastical. Make every second count and every frame visually stunning.

User Input: {arg.prompt}
"""
        response: list[Scene] = await self._llm.generate_content(
            prompt=llm_prompt,
            response_schema=list[Scene],
        )
        return response

By specifying a response_schema that maps to our Pydantic Scene model, we instruct Gemini to return a well-formed JSON array rather than unstructured text. This eliminates brittle parsing logic and makes the entire system more robust. If the Gemini API call fails for any reason (e.g., a network issue), Temporal automatically retries the Activity based on the policy defined in the Workflow, so scene generation eventually succeeds.
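
The GoogleGemini class above is a thin wrapper from the sample project. With Google's google-genai Python SDK, the underlying structured-output call might look roughly like this (the model name and client setup are assumptions):

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=llm_prompt,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        # Requesting a list of Scene objects makes Gemini return a JSON
        # array that the SDK parses into Pydantic instances.
        response_schema=list[Scene],
    ),
)
scenes: list[Scene] = response.parsed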

The generate_video_for_scene Activity is responsible for the heavy lifting: calling the Veo API. This can be a long-running, resource-intensive operation.

# workflows/videogen/activities.py

    @activity.defn
    async def generate_video_for_scene(self, arg: GenerateVideoForSceneInput) -> str:
        """
        Generate a video for a scene and store it in Google Cloud Storage.
        """
        activity.logger.info("Generating a video for the scene. arg=%s", arg)
        vgm_prompt = arg.current_scene.vgm_prompt
        video_name = f"scene_{arg.current_scene.sequence_number}.mp4"
        gcs_destination_path = f"{arg.gcs_staging_directory}/{video_name}"
        # Generate the video in a local temporary directory, then upload it to Google Cloud Storage.
        with tempfile.TemporaryDirectory() as temp_dir:
            output_path = Path(temp_dir) / video_name
            output_path = await self._vgm.generate_video(
                prompt=vgm_prompt,
                output_path=output_path,
            )
            GoogleCloudStorage.upload_file(
                bucket_name=video_gen_settings.GCS_BUCKET_NAME,
                file_path=output_path,
                destination_path=gcs_destination_path,
            )

        return gcs_destination_path

This is where Temporal's durability shines. For even longer rendering tasks, we could instrument Activity Heartbeats to let Temporal know the process is still alive and making progress. If the Worker running this Activity crashes, Temporal will automatically re-run it on another available Worker, resuming from the beginning of the Activity. This combination of timeouts, retries, and heartbeating makes building reliable, long-running AI agents possible.
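
A sketch of what that heartbeating could look like (the render-polling helpers are hypothetical, not the sample project's code):

import asyncio
from datetime import timedelta

from temporalio import activity, workflow

@activity.defn
async def render_long_video(prompt: str) -> str:
    operation = start_render(prompt)  # hypothetical long-running render job
    while not operation.done:
        activity.heartbeat(operation.progress)  # tell Temporal we're still alive
        await asyncio.sleep(10)
        operation = poll_render(operation)  # hypothetical status poll
    return operation.output_path

# In the Workflow, pair the heartbeat with a heartbeat_timeout so Temporal
# detects a stalled Worker quickly and retries the Activity on another one:
await workflow.execute_activity(
    render_long_video,
    vgm_prompt,
    start_to_close_timeout=timedelta(hours=1),
    heartbeat_timeout=timedelta(seconds=30),
)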

Conclusion: Build fearlessly with Temporal and Google

Google's Gemini and Veo models provide immense power to developers. However, harnessing this power in production requires more than just calling an API. It requires a new way of thinking about agentic architecture, one that embraces the reality of distributed systems and their potential for failure.

Temporal provides the missing piece of this puzzle by offloading the complexity of state management, failure recovery, and resilience engineering. This allows developers like you to focus your energy on creating high-value AI features, while Temporal provides the abstractions required for a production-ready agentic system.

Ready to get started?