Synthetic Pages, One Model: The Engineering Behind NVIDIA's Nemotron OCR v2

Optical character recognition has been a solved problem in narrow terms for decades. Scan an English document, run Tesseract, get text. The difficulty emerges at the intersection of language diversity, document layout complexity, and production throughput requirements. Most production OCR systems address this by maintaining separate specialized models per language or script, then stitching them together with language detection preprocessing. NVIDIA’s Nemotron OCR v2 takes a different position: a single unified model, trained entirely on synthetic data, that handles English, Simplified Chinese, Traditional Chinese, Japanese, Korean, and Russian simultaneously, at 34.7 pages per second on an A100.

The number that stands out most is not the accuracy score. It is the throughput gap. PaddleOCR v5, the incumbent for multilingual OCR in production environments, processes 1.2 pages per second on comparable hardware. Nemotron v2 multilingual runs at 34.7, roughly 29 times faster, while achieving better accuracy across the board. Understanding how a model this size gets that fast, and why synthetic data made the multilingual generalization possible, is more interesting than the benchmark table alone.

The Architecture: Shared Backbone, Three Tasks

Nemotron OCR v2 builds on FOTS (Fast Oriented Text Spotting), a design that processes an image once through a shared convolutional backbone and then branches into multiple task heads. The backbone here is RegNetX-8GF. Text detection, recognition, and a relational model that predicts reading order and logical groupings all share that single pass.
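
To make the "one pass, several heads" idea concrete, here is a toy sketch in plain Python. It is not NVIDIA's implementation: the class name, the head functions, and the list-of-floats "feature map" are all hypothetical stand-ins, chosen only to show that the expensive backbone step runs once per image while each head reads the same features.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in: a "feature map" is just a list of floats here.
FeatureMap = List[float]


@dataclass
class SharedBackboneModel:
    backbone: Callable[[List[float]], FeatureMap]
    heads: Dict[str, Callable[[FeatureMap], object]]
    backbone_calls: int = 0

    def forward(self, image: List[float]) -> Dict[str, object]:
        # The expensive step runs exactly once per image.
        self.backbone_calls += 1
        features = self.backbone(image)
        # Every head reads the same features; none re-encodes the image.
        return {name: head(features) for name, head in self.heads.items()}


def toy_backbone(image: List[float]) -> FeatureMap:
    # Placeholder for a real convolutional backbone such as RegNetX-8GF.
    return [2.0 * px for px in image]


model = SharedBackboneModel(
    backbone=toy_backbone,
    heads={
        # Toy heads: real ones would predict boxes, characters, and relations.
        "detection": lambda f: [i for i, v in enumerate(f) if v > 1.0],
        "recognition": lambda f: len(f),
        "relational": lambda f: sorted(range(len(f)), key=lambda i: f[i]),
    },
)

outputs = model.forward([0.2, 0.9, 0.6])
print(sorted(outputs))       # the three task names
print(model.backbone_calls)  # 1
```

The point of the design is visible in the call counter: adding a fourth head would add one more cheap function call, not another pass over the image.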

The multilingual variant weighs in at 84 million parameters. The English-only variant is 54 million. These are not large models by current standards, which matters for inference latency and deployment cost. Vision-language models like Qwen2-VL or GOT-OCR2 can achieve high accuracy on document understanding tasks, but they carry orders of magnitude more parameters and run correspondingly slower. Nemotron v2 is not competing with general-purpose VLMs; it is competing with dedicated OCR pipelines, and the size constraint is a deliberate design choice for that operating environment.

The relational model component is worth pausing on. Most OCR systems return a flat sequence of detected text regions with bounding boxes. Nemotron v2 additionally outputs a hierarchical graph of reading relationships: which text lines belong to the same paragraph, which paragraphs share a column, what the intended reading order is for multi-column layouts. This was inspired by the HierText dataset, which provides human-annotated hierarchical structure for real-world documents. Getting that structure out of an 84M-parameter model without massive annotated real-world training data required thinking carefully about how to generate it synthetically at scale.
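
One way to picture that hierarchical output is as nested containers: lines inside paragraphs, paragraphs inside columns, columns inside a page. The sketch below is an illustration only. The class names and fields are assumptions, not the model's actual output schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Line:
    text: str


@dataclass
class Paragraph:
    lines: List[Line] = field(default_factory=list)


@dataclass
class Column:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Page:
    # Columns are assumed to be stored in intended reading order.
    columns: List[Column] = field(default_factory=list)

    def reading_order(self) -> List[str]:
        # Flatten the hierarchy: finish one column before starting the next.
        return [
            line.text
            for column in self.columns
            for paragraph in column.paragraphs
            for line in paragraph.lines
        ]


page = Page(columns=[
    Column(paragraphs=[
        Paragraph(lines=[Line("Left column, line 1"), Line("Left column, line 2")]),
    ]),
    Column(paragraphs=[
        Paragraph(lines=[Line("Right column, line 1")]),
    ]),
])

for text in page.reading_order():
    print(text)
```

A flat list of boxes cannot express this grouping, which is why downstream consumers of flat OCR output often have to reconstruct paragraphs and columns heuristically.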

Why Synthetic Data Works for OCR Specifically

Synthetic data has a spotty track record in machine learning. The sim-to-real gap often means that models trained on synthetic distributions fail to generalize to real-world noise, lighting variation, and distribution shift. OCR is one of the domains where synthetic data tends to work well, and there are structural reasons for this.

The core input to an OCR model is rendered text. Real documents are themselves rendered text: ink on paper photographed under lighting, or fonts rendered to screen pixels. The signal being extracted is fundamentally digital in origin, even when the final medium is physical. This means that synthetically rendered text, with augmentations for blur, distortion, contrast variation, and paper texture, approximates real scanned documents more faithfully than synthetic street scenes approximate real driving data.
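
As a rough illustration of what such augmentations do, the sketch below applies a contrast change and a 3x3 box blur to a tiny grayscale grid using only the standard library. The function names and the nested-list image format are assumptions made for this example; a real pipeline would use an image library and a much richer set of transforms.

```python
from typing import List

Image = List[List[float]]  # grayscale pixels in [0.0, 1.0]


def adjust_contrast(img: Image, factor: float) -> Image:
    # Scale each pixel's distance from mid-gray, then clip to [0, 1].
    return [
        [min(1.0, max(0.0, 0.5 + factor * (p - 0.5))) for p in row]
        for row in img
    ]


def box_blur(img: Image) -> Image:
    # Replace each pixel with the mean of its 3x3 neighborhood,
    # shrinking the window at the image border.
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neighbors = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(neighbors) / len(neighbors))
        out.append(row)
    return out


# A single bright "ink" pixel on a dark background.
glyph = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]

blurred = box_blur(glyph)
faded = adjust_contrast(glyph, 0.5)
print(blurred[1][1])  # the sharp pixel is now spread over its neighbors
print(faded[1][1])    # the bright pixel is pulled toward mid-gray
```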

For multilingual training, synthetic data has an additional advantage: you can construct balanced corpora across languages by design rather than by collection. Gathering real annotated Japanese document images in the same quantities as English annotated images requires significant effort and licensing negotiation. Rendering Japanese text synthetically from a corpus of Japanese sentences requires only fonts and a rendering pipeline.
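
"Balanced by design" can be sketched in a few lines: when images are rendered rather than collected, the per-language count is simply a parameter. The function name, the tiny corpora, and the plan format below are hypothetical and exist only for illustration.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def balanced_render_plan(
    corpora: Dict[str, List[str]],
    images_per_language: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    # Each language gets the same number of render jobs, regardless of
    # how many sentences its corpus happens to contain.
    rng = random.Random(seed)
    plan = []
    for lang, sentences in corpora.items():
        for _ in range(images_per_language):
            plan.append((lang, rng.choice(sentences)))
    return plan


# Toy corpora of very different sizes (placeholder sentences).
corpora = {
    "en": ["The quick brown fox.", "Hello, world.", "Pages per second."],
    "ja": ["こんにちは。"],
    "ru": ["Привет, мир.", "Добрый день."],
}

plan = balanced_render_plan(corpora, images_per_language=4)
counts = Counter(lang for lang, _ in plan)
print(counts)
```

The same knob can be turned the other way, deliberately oversampling a script that needs more coverage, which is much harder to do when the data has to be collected and annotated.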

The Data Pipeline: mOSCAR, SynthDoG, and Font Diversity

NVIDIA’s synthetic dataset, nvidia/OCR-Synthetic-Multilingual-v1, contains 12.2 million images distributed across six languages. The text content comes from mOSCAR, a multilingual web corpus spanning 163 language subsets. This provides naturalistic text in each target language, avoiding the artificial vocabulary distributions that come from using templated content or translated sentences.

The rendering engine is a modified version of SynthDoG, originally developed for the Donut document understanding model. SynthDoG composites text onto background images at various scales and orientations, applying a range of augmentations. NVIDIA extended it to handle vertical text rendering for CJK scripts, where Japanese and Traditional Chinese documents frequently use top-to-bottom column layouts that a horizontally oriented renderer would get wrong.

Font diversity is one of the underappreciated variables in synthetic OCR training. A model that trains on only a few fonts for a given script will learn to recognize those fonts well and generalize poorly to others. The dataset uses between 165 and 1,258 open-source fonts per language, with the higher counts going to scripts with greater typographic variation. Korean has the largest sample count in the dataset (2.27 million images) and a correspondingly large font library. The annotation pipeline produces word, line, and paragraph-level bounding boxes with four-point quadrilateral annotations, which handle rotated and perspective-distorted text that axis-aligned rectangles would miss.
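
The difference between a four-point quadrilateral and an axis-aligned rectangle is easy to quantify. The sketch below rotates a hypothetical 200x20 text line by 30 degrees and compares the area of the tight quadrilateral with the area of the smallest axis-aligned box that contains it. The function names and dimensions are made up for this example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotated_quad(cx: float, cy: float, width: float, height: float,
                 angle_deg: float) -> List[Point]:
    # Four corners of a width x height box rotated about its center.
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = [
        (-width / 2, -height / 2), (width / 2, -height / 2),
        (width / 2, height / 2), (-width / 2, height / 2),
    ]
    return [(cx + x * cos_a - y * sin_a, cy + x * sin_a + y * cos_a)
            for x, y in corners]


def quad_area(quad: List[Point]) -> float:
    # Shoelace formula for a simple polygon.
    total = 0.0
    for i in range(len(quad)):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % len(quad)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def axis_aligned_area(quad: List[Point]) -> float:
    # Area of the smallest upright rectangle containing the quad.
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


quad = rotated_quad(cx=150, cy=100, width=200, height=20, angle_deg=30)
tight = quad_area(quad)
loose = axis_aligned_area(quad)
print(f"quad area: {tight:.0f}, axis-aligned area: {loose:.0f}")
print(f"axis-aligned box is {loose / tight:.1f}x larger")
```

For this rotated line the upright box covers several times the area of the text itself, so most of what it encloses is background or neighboring lines.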

Layout diversity was also constructed explicitly. The training set includes multi-column documents, tables, slide-style layouts, and mixed-orientation pages. This matters because a detector trained only on single-column left-to-right text will produce incorrect reading order on anything more complex, even if individual word recognition is accurate.
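
A small example shows how reading order breaks on a two-column page even when every line is recognized correctly. The coordinates, the naive sort, and the fixed column split below are all assumptions for illustration. Nemotron v2 predicts reading order with a learned relational model, not with a hand-written rule like this one.

```python
from typing import List, Tuple

# (x, y, text): top-left corner of each detected line, y grows downward.
TextLine = Tuple[int, int, str]

lines: List[TextLine] = [
    (0, 0, "A1"), (100, 0, "B1"),
    (0, 10, "A2"), (100, 10, "B2"),
]


def naive_order(lines: List[TextLine]) -> List[str]:
    # Single-column assumption: top to bottom, then left to right.
    return [text for _, _, text in sorted(lines, key=lambda l: (l[1], l[0]))]


def column_order(lines: List[TextLine], column_split_x: int) -> List[str]:
    # Column-aware: finish the left column before reading the right one.
    ordered = sorted(lines, key=lambda l: (l[0] >= column_split_x, l[1]))
    return [text for _, _, text in ordered]


print(naive_order(lines))       # ['A1', 'B1', 'A2', 'B2'], columns interleaved
print(column_order(lines, 50))  # ['A1', 'A2', 'B1', 'B2'], intended order
```

The hard-coded split at x = 50 only works because the layout is known in advance. Tables, slides, and mixed-orientation pages have no such simple rule, which is the motivation for training on explicitly varied layouts.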

What Nemotron v1 Got Wrong

The gap between Nemotron v1 and v2 on CJK languages is striking and instructive. On the SynthDoG benchmark using normalized edit distance (lower is better), v1 scored 0.923 on Korean and 0.723 on Japanese. Version 2 scores 0.047 and 0.046 respectively. That is not a marginal improvement; it is a qualitative shift in capability.
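
For readers unfamiliar with the metric, here is a minimal sketch of normalized edit distance. It computes Levenshtein distance and divides by the length of the longer string. That normalization is an assumption: the benchmark may normalize differently (for example by reference length), so treat this as an illustration of the idea rather than the exact scoring code.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic programming over prefixes, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    # 0.0 means a perfect match; 1.0 means nothing was right.
    if not prediction and not reference:
        return 0.0
    longest = max(len(prediction), len(reference))
    return levenshtein(prediction, reference) / longest


print(normalized_edit_distance("한국어", "한국어"))  # 0.0
print(normalized_edit_distance("한국어", "한국말"))  # one of three blocks wrong
print(normalized_edit_distance("", "한국어"))        # 1.0
```

Read this way, a score above 0.9 means the output shares almost nothing with the reference, while a score below 0.05 means fewer than one character in twenty needs correcting.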

The v1 model appears to have been primarily English-focused, with other languages added without the depth of synthetic data and font coverage that v2 provides. Korean in particular uses the Hangul script, which is a syllabic block system where characters are composed of individual phonetic components arranged spatially. OCR for Hangul requires learning both the component shapes and their spatial composition rules, which demands training data that represents the full range of Hangul blocks, not just the most common ones. The jump in sample count and font diversity for Korean in v2 is presumably what closed that gap.

This illustrates a general principle: multilingual models fail on underrepresented scripts not because of model capacity, but because of data coverage. The architecture that works for English works for Korean; what changes is whether the training distribution actually spans the target distribution.

Comparison with the Current Landscape

PaddleOCR is the most widely deployed open-source multilingual OCR system. Its architecture uses separate detection and recognition models, and its multilingual coverage is achieved through language-specific recognition models. This makes it accurate for languages it has been explicitly trained on, but slow to deploy across multiple languages simultaneously and sensitive to language detection errors in preprocessing.

GOT-OCR2 and similar VLM-based approaches treat OCR as a generative task: the image is encoded, and the model generates the text content autoregressively. These systems handle complex layouts well and can incorporate context for ambiguous characters, but autoregressive generation is fundamentally slow compared to parallel detection approaches. At 34.7 pages per second, Nemotron v2 is operating in a regime that autoregressive VLMs cannot reach without aggressive batching and hardware investment.

For teams running document processing pipelines at scale, the throughput difference is not academic. Processing a million-page document archive at 1.2 pages per second takes roughly 9.6 days of continuous GPU time. At 34.7 pages per second, it takes about 8 hours.
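
The arithmetic behind those figures is short enough to check directly. The sketch below uses only the two throughput numbers quoted in this article and assumes a single GPU running continuously with no batching overhead or downtime.

```python
def archive_time_seconds(pages: int, pages_per_second: float) -> float:
    # Idealized: constant throughput, one GPU, no I/O stalls.
    return pages / pages_per_second


PAGES = 1_000_000
PADDLE_PPS = 1.2     # PaddleOCR v5 figure quoted above
NEMOTRON_PPS = 34.7  # Nemotron OCR v2 multilingual figure quoted above

paddle_seconds = archive_time_seconds(PAGES, PADDLE_PPS)
nemotron_seconds = archive_time_seconds(PAGES, NEMOTRON_PPS)

print(f"PaddleOCR v5: {paddle_seconds / 86_400:.1f} days")    # about 9.6 days
print(f"Nemotron v2:  {nemotron_seconds / 3_600:.1f} hours")  # about 8.0 hours
print(f"Speedup:      {NEMOTRON_PPS / PADDLE_PPS:.1f}x")      # about 28.9x
```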

Extensibility

NVIDIA’s team notes that the pipeline is designed to extend to new languages with relatively modest effort. The requirements are a text corpus for the target language and a set of open-source fonts covering its script. The rendering pipeline handles the rest. This is a meaningful design property: it means the architecture is not a bespoke solution for these six languages, but a general framework that happens to have been trained on this language set first.
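
To show how small that requirement list is, here is a hypothetical sketch of a per-language specification with a readiness check. The class, its fields, the language code, and the font file name are all invented for this example and do not reflect the configuration format of NVIDIA's actual pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LanguageSpec:
    # Hypothetical description of what a new language needs.
    code: str
    corpus_sentences: List[str]
    font_files: List[str]

    def missing_requirements(self) -> List[str]:
        # Report which of the two stated inputs are still absent.
        missing = []
        if not self.corpus_sentences:
            missing.append("text corpus")
        if not self.font_files:
            missing.append("fonts")
        return missing


# Placeholder values: a corpus sample is present, but no fonts yet.
vietnamese = LanguageSpec(
    code="vi",
    corpus_sentences=["Xin chào thế giới."],
    font_files=[],
)
print(vietnamese.missing_requirements())  # ['fonts']

ready = LanguageSpec(
    code="vi",
    corpus_sentences=["Xin chào thế giới."],
    font_files=["ExampleSans-Regular.ttf"],  # made-up file name
)
print(ready.missing_requirements())  # []
```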

The model and dataset are available under permissive licenses (NVIDIA Open Model License and CC-BY-4.0 respectively), and a live demo runs in a Hugging Face Space. The model weights can be pulled and run locally without NVIDIA API access, which matters for teams with data residency requirements.

The broader takeaway is less about this specific model and more about the feasibility of the approach. A carefully constructed synthetic data pipeline, combined with a shared-backbone architecture designed for inference throughput, can produce a multilingual OCR model that outperforms specialized alternatives on both accuracy and speed. The investment is in the data engineering, not in model scale. That is a pattern worth paying attention to as more teams face multilingual document understanding requirements without the budget for VLM inference at scale.