tokuSolutions
OCR Translation with Temporal Workflows
Japanese toy manual translator demonstrating Temporal workflow patterns. I collect Kamen Rider transformation devices, and all of their documentation is in Japanese. This project automates translation while preserving the layout context that general-purpose translation tools lose. It extracts text from PDFs using Google Document AI, translates it to English, and generates an interactive web viewer.
What You'll Learn
Temporal workflow concepts:
- Parent/child workflow orchestration
- Fan-out/fan-in parallel execution
- Activity retry policies tuned by operation type
- Workflow queries for real-time progress
- Deterministic workflow design
- Activity heartbeats for long operations
Practical application:
- Batch OCR processing with Google Document AI
- Translation API integration
- LLM-powered cleanup with structured validation
- Interactive web viewer with inline editing
How Temporal Orchestrates Translation
Parent workflow (pdf_translation_workflow.py) orchestrates four child workflows:
1. OCRWorkflow - Extract text from PDF
- Product search on Tokullectibles
- Get PDF page count
- Fan-out/fan-in: Parallel OCR across all pages (0-N simultaneously)
2. TranslationWorkflow - Translate extracted text
- Batch translation via Google Translate API
3. SiteGenerationWorkflow - Generate web viewer
- Convert pages to WebP images
- Create translations.json and viewer HTML
- Heartbeats every 5 pages (long-running activity)
4. CleanupWorkflow - Improve translation quality
- Stage 1: ftfy - Fix Unicode/OCR corruption (deterministic)
- Stage 2: Rule-based - Remove noise patterns (deterministic)
- Stage 3: Gemini AI - Context-aware corrections + tagging (non-deterministic)
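The fan-out/fan-in step in OCRWorkflow can be sketched with plain asyncio. This is a minimal illustration, not the project's code: `ocr_page` is a hypothetical stand-in for the per-page Document AI activity, and inside a Temporal Python workflow the same shape would typically wrap activity calls instead of local coroutines.

```python
import asyncio


async def ocr_page(page_index: int) -> dict:
    """Hypothetical stand-in for a per-page OCR activity."""
    await asyncio.sleep(0)  # yield control, simulating an I/O-bound call
    return {"page": page_index, "text": f"text from page {page_index}"}


async def ocr_all_pages(page_count: int) -> list:
    # Fan-out: create one coroutine per page so they can run concurrently.
    pending = [ocr_page(i) for i in range(page_count)]
    # Fan-in: gather waits for all of them and returns results in input order.
    return list(await asyncio.gather(*pending))
```

Because `asyncio.gather` preserves input order, the results line up with page numbers even if individual pages finish out of order.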
Why child workflows?
- Separation in Temporal UI - each phase visible as distinct workflow execution
- Independent lifecycle - each phase has own event history and retry logic
- Query support - parent workflow exposes real-time progress
- Observability - better visibility into which phase is executing or failed
Real-time Progress Tracking
The CLI polls the workflow with Temporal Queries every 500 ms to display live progress:
[1/4] OCR - Extracting text...
  ✓ Complete: 15 blocks from 5 pages
[2/4] Translation - Translating text...
  ✓ Complete: 15 blocks
[3/4] Site Generation - Creating viewer...
  ✓ Complete: 5 pages
[4/4] Cleanup - Improving quality...
  ✓ Fixed 3 encoding issues
  ✓ Removed 2 noise blocks
  ✓ Applied 5 AI corrections
Parent workflow updates WorkflowProgress state with phase tracking and sub-progress from each child workflow.
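A rough sketch of the CLI side of this polling loop follows. The field names on `WorkflowProgress` and the `query` callable are assumptions made for illustration; in the real CLI the callable would be a Temporal query against the parent workflow handle.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WorkflowProgress:
    # Field names are assumed for this sketch, not taken from the project.
    phase: str = "OCR"
    phase_number: int = 1
    total_phases: int = 4
    detail: str = ""
    done: bool = False


def format_progress(p: WorkflowProgress) -> str:
    return f"[{p.phase_number}/{p.total_phases}] {p.phase} - {p.detail}"


async def poll_progress(query, interval: float = 0.5) -> list:
    """Call `query` until it reports done; return each distinct line shown."""
    shown = []
    while True:
        progress = await query()
        line = format_progress(progress)
        if not shown or shown[-1] != line:
            print(line)  # only print when the state actually changed
            shown.append(line)
        if progress.done:
            return shown
        await asyncio.sleep(interval)
```

Printing only when the formatted line changes keeps the terminal quiet between phase transitions even though the query runs twice a second.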
Temporal Best Practices
1. Retry Policies - Three strategies tuned for operation types:
- QUICK_RETRY: Fast operations (file I/O) - 3 attempts, 1-10s backoff
- API_RETRY: External APIs (Document AI, Translation) - 5 attempts, 2-30s backoff
- LLM_RETRY: AI model calls (Gemini) - 3 attempts, 5s-2min backoff
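To make the three policies concrete, the sketch below computes the wait before each retry. It is a stdlib model rather than the project's definitions: `RetrySpec` is a hypothetical name, and the backoff coefficient of 2.0 is an assumption (it is Temporal's default, but the source does not state which value is used).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrySpec:
    max_attempts: int
    initial_interval_s: float
    max_interval_s: float
    backoff_coefficient: float = 2.0  # assumed; not stated in the source

    def delays(self) -> list:
        """Seconds waited before each retry (one fewer than max_attempts)."""
        return [
            min(self.initial_interval_s * self.backoff_coefficient ** n,
                self.max_interval_s)
            for n in range(self.max_attempts - 1)
        ]


QUICK_RETRY = RetrySpec(max_attempts=3, initial_interval_s=1, max_interval_s=10)
API_RETRY = RetrySpec(max_attempts=5, initial_interval_s=2, max_interval_s=30)
LLM_RETRY = RetrySpec(max_attempts=3, initial_interval_s=5, max_interval_s=120)
```

In the Temporal Python SDK these values would likely map onto `RetryPolicy` fields (`initial_interval`, `backoff_coefficient`, `maximum_interval`, `maximum_attempts`). Under the assumed coefficient, the upper bounds of 10s and 2min are never reached within the configured attempt counts.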
2. Activity Separation for Determinism
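Why three cleanup activities instead of one? Temporal workflow code must be deterministic, so each stage runs as its own activity and the non-deterministic LLM call stays isolated from the deterministic stages: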
- ftfy_cleanup_activity - Deterministic Unicode fixes (always same output)
- rule_based_cleanup_activity - Deterministic pattern removal (regex patterns)
- gemini_cleanup_activity - Non-deterministic AI corrections (LLM responses vary)
Benefits: Gemini can fail without breaking workflow, different retry policies per operation type, each stage visible in Temporal UI.
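The "Gemini can fail without breaking workflow" benefit can be sketched as a pipeline whose last stage is optional. Everything here is illustrative: `unicodedata` normalization stands in for ftfy, the noise regex is a made-up rule, and `ai_cleanup` is any callable standing in for the Gemini activity.

```python
import re
import unicodedata
from typing import Callable, List, Tuple


def unicode_cleanup(blocks: List[str]) -> List[str]:
    # Stand-in for the ftfy stage: NFKC normalization is deterministic.
    return [unicodedata.normalize("NFKC", b) for b in blocks]


# Hypothetical noise rule: blocks made only of punctuation, digits, or spaces.
NOISE_PATTERN = re.compile(r"^[\W\d_]*$")


def rule_based_cleanup(blocks: List[str]) -> List[str]:
    return [b for b in blocks if not NOISE_PATTERN.match(b)]


def run_cleanup(
    blocks: List[str],
    ai_cleanup: Callable[[List[str]], List[str]],
) -> Tuple[List[str], bool]:
    """Run all three stages; return (blocks, whether the AI stage applied)."""
    blocks = rule_based_cleanup(unicode_cleanup(blocks))
    try:
        return ai_cleanup(blocks), True
    except Exception:
        # AI stage failed after its retries: keep the deterministic result.
        return blocks, False
```

In a real workflow the `try` would wrap the activity call and catch the failure Temporal raises once the retry policy is exhausted; the fallback logic stays the same.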
3. Heartbeats - Long-running activities send heartbeats to prevent timeouts:
- Site generation: Every 5 pages during image rendering
- Gemini cleanup: Before/after LLM API calls (2min timeout)
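The every-5-pages heartbeat cadence might look like the sketch below. The `heartbeat` parameter is a plain callback so the example runs without a Temporal worker; inside a real activity it would presumably be `activity.heartbeat` from the Temporal Python SDK. The page processing itself is a placeholder.

```python
from typing import Callable, Iterable


def render_pages(
    pages: Iterable[str],
    heartbeat: Callable[[int], None],
    every: int = 5,
) -> int:
    """Process pages, reporting liveness after every `every` pages."""
    count = 0
    for count, page in enumerate(pages, start=1):
        _ = page.upper()  # placeholder for the WebP conversion work
        if count % every == 0:
            heartbeat(count)  # pass the page count as progress detail
    return count
```

Passing the page count as the heartbeat detail means a retried attempt could, in principle, read it back and resume from that page instead of starting over.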
4. Workflow Determinism - Non-deterministic operations moved outside workflows:
- Path operations (stem, name extraction) moved to CLI layer
- Manual name and output directory computed before workflow starts
- Only deterministic data transformations in workflow code
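As a small illustration of this split, the CLI can resolve everything path-related up front and hand the workflow plain strings. The function name, the dictionary keys, and the `output` default directory are assumptions for this sketch.

```python
from pathlib import Path


def build_workflow_input(pdf_path: str, output_root: str = "output") -> dict:
    """Compute path-derived values in the CLI, before the workflow starts."""
    path = Path(pdf_path)
    manual_name = path.stem  # file name without the .pdf extension
    return {
        "pdf_path": str(path),
        "manual_name": manual_name,
        "output_dir": str(Path(output_root) / manual_name),
    }
```

The workflow then receives only serializable strings, so replaying its history never depends on the filesystem or on the platform's path rules.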
5. Type-Safe AI with Pydantic - LLM responses validated before reaching workflow:
from pydantic import BaseModel

class GeminiCleanupResponse(BaseModel):
    remove: list[str]            # Block indices to remove
    corrections: dict[str, str]  # Index -> corrected text
    product_name: str            # Official product name
Gemini returns JSON → Pydantic validates structure → Invalid responses trigger Temporal activity retry → Type safety across entire pipeline.
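Once a response has passed validation, applying it to the text blocks could look roughly like this. The sketch assumes that the string indices in `remove` and `corrections` are stringified positions in the block list; the source does not spell out the index format, and `apply_cleanup` is a hypothetical helper.

```python
from typing import Dict, List


def apply_cleanup(
    blocks: List[str],
    remove: List[str],
    corrections: Dict[str, str],
) -> List[str]:
    """Drop removed blocks and substitute corrected text for the rest."""
    removed = {int(i) for i in remove}  # assumed: indices are "0", "1", ...
    cleaned = []
    for index, text in enumerate(blocks):
        if index in removed:
            continue
        # Fall back to the original text when no correction was returned.
        cleaned.append(corrections.get(str(index), text))
    return cleaned
```

Removal takes precedence over correction here, so a block listed in both fields is dropped.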
Performance & Cost
Typical 20-page manual:
- Time: ~50 seconds with single worker (30s OCR parallel, 5s translation, 10s AI cleanup, 5s site generation)
- Cost: ~$0.43 (Document AI $0.03 + Translation API $0.40 + Gemini free)
Cost estimates (December 2024 pricing):
- Small (5 pages): ~$0.11
- Medium (20 pages): ~$0.43
- Large (50 pages): ~$1.08
Gemini 1.5 Flash has a free tier (15 RPM, 1M TPM, 1,500 RPD), which is sufficient for hobby use.
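The three size estimates are consistent with a simple per-page model. The sketch below derives per-page rates from the 20-page breakdown and assumes cost scales linearly with page count, with Gemini at $0. `Decimal` is used so that half-cent totals round up predictably.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-page rates derived from the 20-page figures (assumes linear scaling).
DOCUMENT_AI_PER_PAGE = Decimal("0.03") / 20  # $0.0015 per page
TRANSLATION_PER_PAGE = Decimal("0.40") / 20  # $0.02 per page


def estimate_cost(pages: int) -> Decimal:
    """Estimated USD cost for a manual, rounded to the nearest cent."""
    total = (DOCUMENT_AI_PER_PAGE + TRANSLATION_PER_PAGE) * pages
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

With these rates, `estimate_cost` gives 0.11, 0.43, and 1.08 for 5, 20, and 50 pages, matching the list above.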
