Synthetic Data Did What Real Data Couldn't: Inside NVIDIA's Nemotron OCR v2

Source: huggingface

Optical character recognition has a label problem. Real-world annotated OCR datasets are expensive to produce, inconsistent in quality, and almost never cover the full diversity of fonts, layouts, and scripts that documents actually contain. For any single language you can get by. For five languages simultaneously, including CJK scripts with tens of thousands of characters, the gap between what you can label and what you need becomes serious.

NVIDIA’s answer with Nemotron OCR v2 is to sidestep the annotation problem almost entirely. The model is trained on 12.2 million synthetic samples across English, Chinese (Simplified and Traditional), Japanese, Korean, and Russian, with only around 680,000 real-world images mixed in. The resulting model runs at 34.7 pages per second on a single A100 GPU while achieving lower normalized edit distance than both generalist and per-language-specialized competitors.

That speed number deserves context. PaddleOCR v5, one of the strongest open-source alternatives, processes 1.2 pages per second under the same conditions. That is roughly a 29x throughput gap, and Nemotron v2 still beats it on accuracy for every language in the synthetic multilingual benchmark discussed later.
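
To make the gap concrete, here is a small back-of-the-envelope calculation. It uses only the two throughput figures quoted above; the one-million-page workload is a hypothetical size chosen for illustration, not a number from the source.

```python
# Throughput figures quoted in the article (pages per second, single A100).
NEMOTRON_PPS = 34.7
PADDLE_PPS = 1.2

# Hypothetical workload size, chosen only for illustration.
PAGES = 1_000_000


def hours_for(pages: int, pages_per_second: float) -> float:
    """Wall-clock hours needed to process `pages` at a steady throughput."""
    return pages / pages_per_second / 3600


speedup = NEMOTRON_PPS / PADDLE_PPS

print(f"speedup: {speedup:.1f}x")
print(f"Nemotron OCR v2: {hours_for(PAGES, NEMOTRON_PPS):.1f} h")
print(f"PaddleOCR v5:    {hours_for(PAGES, PADDLE_PPS):.1f} h")
```

At these rates the same job moves from roughly a working day to well over a week of continuous GPU time, which is the practical meaning of the throughput difference.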

The Architecture: One Backbone, Three Jobs

The model has three components: a text detector built on a RegNetX-8GF backbone, a pre-norm Transformer recognizer, and a compact Transformer encoder that predicts reading order and logical groupings. What makes this efficient is that the convolutional backbone runs once per image, and all three components share its feature maps. You pay the cost of the visual encoding a single time regardless of how many text regions need to be recognized.
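
The control flow behind that sharing can be sketched in a few lines. This is a toy stand-in, not NVIDIA's implementation: the class name, method names, and the placeholder "features" (row means of a tiny grid) are all assumptions made for illustration. The only point is that the expensive encoder is called once and every head reads its output.

```python
from dataclasses import dataclass


@dataclass
class SharedBackbonePipeline:
    """Toy pipeline: one feature pass, reused by three lightweight heads."""

    backbone_calls: int = 0

    def backbone(self, image: list[list[float]]) -> list[float]:
        # Placeholder for the convolutional encoder: one mean value per row.
        self.backbone_calls += 1
        return [sum(row) / len(row) for row in image]

    def detect(self, features: list[float]) -> list[int]:
        # Placeholder detector: a "text region" is any row with non-zero signal.
        return [i for i, f in enumerate(features) if f > 0]

    def recognize(self, features: list[float], region: int) -> str:
        # Placeholder recognizer: reads the shared features, never the image.
        return f"text@{region}:{features[region]:.2f}"

    def relate(self, features: list[float], regions: list[int]) -> list[int]:
        # Placeholder reading order: top to bottom by region index.
        return sorted(regions)

    def run(self, image: list[list[float]]) -> list[str]:
        features = self.backbone(image)  # computed once per image
        regions = self.detect(features)
        texts = {r: self.recognize(features, r) for r in regions}
        return [texts[r] for r in self.relate(features, regions)]


pipeline = SharedBackbonePipeline()
result = pipeline.run([[0.0, 0.0], [1.0, 3.0], [2.0, 2.0]])
print(result, "backbone calls:", pipeline.backbone_calls)
```

However many regions the detector finds, `backbone_calls` stays at one per image; only the cheap per-region work scales with the amount of text.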

The English and multilingual variants differ in ways that reflect the underlying linguistic challenges. The English model uses word-level recognition with an 855-character set and a 3-layer recognizer, clocking in at 54 million parameters. The multilingual model moves to line-level recognition, expands the character set to 14,244 glyphs, deepens the recognizer to 6 layers, and lands at 84 million parameters total.

The shift from word-level to line-level for CJK scripts is not incidental. Chinese and Japanese lack whitespace-delimited word boundaries, and Korean has inconsistent spacing conventions. Any model that tries to segment these scripts into “words” before recognizing them is fighting the structure of the language. Line-level recognition sidesteps the problem by treating the natural unit as a line of text rather than a word.
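
A quick way to see the problem is to apply whitespace tokenization, the implicit assumption behind word-level recognition, to a few sample strings. The sentences below are illustrative examples chosen for this sketch, not data from the model's training set.

```python
# Illustrative sentences; whitespace splitting stands in for "word" segmentation.
samples = {
    "English": "The quick brown fox",
    "Chinese": "今天天气很好",
    "Japanese": "今日は良い天気です",
}

tokens = {lang: text.split() for lang, text in samples.items()}

for lang, toks in tokens.items():
    print(f"{lang}: {len(toks)} token(s) -> {toks}")
```

The English sentence splits into four words, while each of the Chinese and Japanese sentences comes back as a single unbroken token. A word-level recognizer would need a separate segmentation step for those scripts; a line-level one does not.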

The Synthetic Pipeline: Where the Real Work Happened

The training data generation is built on top of SynthDoG, a synthetic document generator originally developed by Clova AI. NVIDIA extended it substantially for this work, and the modifications are where most of the interesting decisions live.

Source text comes from mOSCAR, a multilingual web corpus covering 163 language subsets. Using a real corpus rather than random character strings matters because it preserves realistic vocabulary distribution, sentence length patterns, and character frequency. A model trained on uniformly random characters will not generalize to the statistical patterns of actual documents.
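
The difference between the two sampling strategies shows up even on a toy scale. The sketch below uses a three-sentence stand-in corpus (an assumption for illustration; the real pipeline draws from mOSCAR) and compares how concentrated the character distribution is when lines are sampled from the corpus versus generated uniformly at random from the same alphabet.

```python
import random
from collections import Counter

# Tiny stand-in corpus, invented for this example.
CORPUS = [
    "the model reads the page",
    "the detector finds the lines",
    "the recognizer reads each line",
]
ALPHABET = sorted(set("".join(CORPUS)))


def sample_corpus_line(rng: random.Random) -> str:
    """Draw a real sentence, keeping natural character statistics."""
    return rng.choice(CORPUS)


def sample_random_line(rng: random.Random, length: int) -> str:
    """Draw characters uniformly, ignoring how language actually behaves."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def top_char_share(lines: list[str]) -> float:
    """Fraction of all characters taken up by the single most common one."""
    counts = Counter("".join(lines))
    return counts.most_common(1)[0][1] / sum(counts.values())


rng = random.Random(0)
corpus_lines = [sample_corpus_line(rng) for _ in range(2000)]
random_lines = [sample_random_line(rng, 28) for _ in range(2000)]

print(f"corpus-sampled top-character share:  {top_char_share(corpus_lines):.3f}")
print(f"uniform-random top-character share: {top_char_share(random_lines):.3f}")
```

Corpus-sampled text is dominated by a handful of frequent characters, as real documents are, while the uniform strings spread probability evenly across the alphabet. A recognizer trained only on the latter never sees the skew it will meet in practice.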

The font pool is extensive: between 165 and 1,258 unique fonts per language, drawn from Google Fonts and the Noto family. The Noto fonts are particularly important here since they were designed specifically to cover all Unicode scripts with consistent visual quality, which means they provide reasonable coverage of characters that might otherwise be absent from a training font set.

Layout diversity is handled through seven distinct generation modes: multi-column flowing text, scattered scene-text words, vertical CJK columns, tables with and without borders, table-of-contents layouts with dot leaders, PowerPoint-style slides, and word-processor pages with heading hierarchies. Each mode produces structurally different images that push the detector to generalize across document types rather than overfit to a single format.

The augmentation stack operates at three levels. Text-level augmentations include border effects, drop shadows, extrusion, glyph noise, and stroke opacity variation. Image-level augmentations cover morphological operations like dilation and erosion, median blur, and elastic distortion. Page-level augmentations add contrast and brightness jitter, Gaussian and motion blur, color shifting, shadow overlays, and Gaussian noise on random backgrounds. This layered approach approximates the degradation patterns of real scanned documents without requiring a single real scan.
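
The layering can be sketched as an ordered list of transforms, one representative per level. This is a minimal pure-Python illustration on a tiny grayscale grid, not the SynthDoG code: the function names, parameter ranges, and the choice of one transform per level are all assumptions.

```python
import random

Image = list[list[float]]  # grayscale: 0.0 = background, 1.0 = full ink


def stroke_opacity(img: Image, rng: random.Random) -> Image:
    # Text level: fade the ink by a random factor.
    k = rng.uniform(0.6, 1.0)
    return [[p * k for p in row] for row in img]


def dilate(img: Image, rng: random.Random) -> Image:
    # Image level: 3x3 morphological dilation (max over each neighbourhood).
    h, w = len(img), len(img[0])
    return [
        [
            max(
                img[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def brightness_and_noise(img: Image, rng: random.Random) -> Image:
    # Page level: global brightness shift plus Gaussian noise, clamped to [0, 1].
    shift = rng.uniform(-0.1, 0.1)
    return [
        [min(1.0, max(0.0, p + shift + rng.gauss(0, 0.02))) for p in row]
        for row in img
    ]


# Hypothetical stack: applied in order, text level first, page level last.
STACK = [("text", stroke_opacity), ("image", dilate), ("page", brightness_and_noise)]


def augment(img: Image, seed: int) -> Image:
    rng = random.Random(seed)
    for _level, transform in STACK:
        img = transform(img, rng)
    return img


# A 5x5 "glyph" with a single ink pixel in the centre.
glyph = [[1.0 if (x, y) == (2, 2) else 0.0 for x in range(5)] for y in range(5)]
augmented = augment(glyph, seed=0)
```

Seeding the generator per sample keeps every degraded image reproducible, which matters when the annotations have to stay aligned with the pixels they describe.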

Critically, every generated sample includes pixel-precise bounding box annotations at word, line, and paragraph levels simultaneously, along with 4-point quadrilateral coordinates and parent-child indices linking the hierarchy together. The relational component of the model is trained on these hierarchical annotations, which is what allows Nemotron v2 to output not just character transcriptions but a full reading-order graph across a document.
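
One plausible way to represent such an annotation is a flat list of boxes in which each entry carries its level, its quadrilateral, and the index of its parent. The field names and the corner ordering below are assumptions made for illustration; the released dataset's actual schema may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]
LEVELS = ["paragraph", "line", "word"]  # each level's parent is the one before it


@dataclass
class Box:
    level: str  # "paragraph", "line", or "word"
    text: str
    quad: Tuple[Point, Point, Point, Point]  # assumed clockwise from top-left
    parent: Optional[int]  # index into the same list, None for paragraphs


def children(boxes: List[Box], idx: int) -> List[int]:
    """Indices of the boxes whose parent is `idx`."""
    return [i for i, b in enumerate(boxes) if b.parent == idx]


def hierarchy_is_consistent(boxes: List[Box]) -> bool:
    """Check that words hang off lines and lines hang off paragraphs."""
    for b in boxes:
        if b.level == "paragraph":
            if b.parent is not None:
                return False
            continue
        if b.parent is None:
            return False
        expected_parent_level = LEVELS[LEVELS.index(b.level) - 1]
        if boxes[b.parent].level != expected_parent_level:
            return False
    return True


full = ((0.0, 0.0), (100.0, 0.0), (100.0, 20.0), (0.0, 20.0))
boxes = [
    Box("paragraph", "Hello world", full, None),
    Box("line", "Hello world", full, 0),
    Box("word", "Hello", ((0.0, 0.0), (45.0, 0.0), (45.0, 20.0), (0.0, 20.0)), 1),
    Box("word", "world", ((55.0, 0.0), (100.0, 0.0), (100.0, 20.0), (55.0, 20.0)), 1),
]

print(children(boxes, 1), hierarchy_is_consistent(boxes))
```

Because the generator places every glyph itself, it can emit all three levels and the links between them at no extra labeling cost, which is exactly the supervision a reading-order model needs.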

Benchmark Results and What They Show

On the SynthDoG multilingual benchmark, the numbers are stark. The metric is normalized edit distance (NED), where 0 is a perfect transcription and values near 1 mean almost nothing was read correctly. For Korean, PaddleOCR’s base model scores 0.943, meaning it is effectively failing on that language. Even PaddleOCR’s Korean-specialized variant only reaches 0.133. Nemotron v2 multilingual hits 0.047. For Russian, the base PaddleOCR scores 0.959; Nemotron v2 scores 0.043. For Japanese, the jump is similar: from 0.201 with the best PaddleOCR variant down to 0.046.
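
For readers who want to reproduce this kind of score, here is a minimal sketch of normalized edit distance. It uses the common definition of Levenshtein distance divided by the length of the longer string; whether the benchmark normalizes in exactly this way is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # delete ca
                    curr[j - 1] + 1,  # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, free on a match
                )
            )
        prev = curr
    return prev[-1]


def ned(prediction: str, truth: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means an exact match."""
    if not prediction and not truth:
        return 0.0
    return levenshtein(prediction, truth) / max(len(prediction), len(truth))


print(ned("kitten", "sitting"))
print(ned("", "abc"))
```

Under this definition a score around 0.95 means nearly every character has to be edited to recover the ground truth, while a score around 0.05 corresponds to about one wrong character in twenty.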

On OmniDocBench, which uses real-world documents rather than synthetic test sets, the results shift somewhat. Against EasyOCR the gap remains: lower NED is better, and the English figures are 0.095 for EasyOCR versus 0.048 for Nemotron v2 multilingual, with EasyOCR running at 0.4 pages per second versus 34.7. It is not competitive for production workloads on either axis.

The multilingual model’s performance on mixed-language documents, an NED of 0.142 versus PaddleOCR’s 0.041, might initially look like a regression. But that number reflects a harder task: Nemotron v2 multilingual is a single model handling all five languages simultaneously without language detection. PaddleOCR’s result there uses a per-language pipeline with oracle language labels. Comparing them directly conflates different system designs.

Why Synthetic-First Training Is Underexplored for OCR

The underlying idea here, that a carefully designed synthetic pipeline can match or exceed real-data performance, is not new in computer vision. It has been demonstrated repeatedly in depth estimation, object detection, and semantic segmentation. OCR has been slower to adopt it, partly because text rendering seems simple but actually involves substantial subtlety around font rendering, kerning, ligatures, and script-specific shaping rules.

The SynthDoG lineage has been exploring this space since at least 2022, when the Donut paper from Clova AI showed that transformer-based document understanding models could be trained predominantly on synthetic data. What NVIDIA has done with Nemotron v2 is industrialize that approach for a classical OCR pipeline rather than an end-to-end document understanding model, and push it into the multilingual regime where the combinatorial complexity of scripts, fonts, and layouts is most severe.

The fact that the dataset, nvidia/OCR-Synthetic-Multilingual-v1, is released under CC-BY-4.0 alongside the model itself is worth noting. Most competitive OCR training pipelines rely on proprietary data collection that makes reproduction impossible. Releasing 12.2 million annotated synthetic samples with the generation methodology described means that the community can audit, extend, and improve on this baseline.

Practical Considerations

The model is available on Hugging Face under NVIDIA’s open model license and can be tested in a hosted demo space. The English variant at 54M parameters is viable on more modest hardware than an A100. Note that the throughput benchmarks are measured with a batched A100 pipeline, so single-image latency on a consumer GPU will not match those figures.

For applications that need to handle mixed-script documents, digitize archives with inconsistent scan quality, or process documents at volume without a language pre-classification step, the case for Nemotron v2 is straightforward. For English-only workloads where inference cost matters more than multilingual capability, the English variant competes well on speed while the accuracy numbers are more comparable to established tools.

The architectural decision to output hierarchical structure including reading order graphs is probably the least discussed feature but potentially the most useful one. Raw character transcription solves a narrow version of the OCR problem. For downstream tasks like document search, RAG pipelines, or structured data extraction, knowing which words belong to which lines, which lines belong to which paragraphs, and in what order to traverse them is often more valuable than the transcription alone. Most OCR tools treat structure recovery as a separate post-processing step; Nemotron v2 bakes it into the training objective.
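
To show why that structure is useful downstream, here is a sketch that turns a flat, unordered list of recognized words into paragraph-sized text chunks, the kind of unit a search index or RAG pipeline would ingest. The dictionary layout (`line`, `paragraph`, `order` keys) is a hypothetical format invented for this example, not the model's actual output schema.

```python
# Hypothetical flat output: each word names its line and its reading-order rank.
words = [
    {"text": "world", "line": 0, "order": 1},
    {"text": "Hello", "line": 0, "order": 0},
    {"text": "paragraph", "line": 1, "order": 3},
    {"text": "Second", "line": 1, "order": 2},
]
# Each line names the paragraph it belongs to.
lines = [{"paragraph": 0}, {"paragraph": 1}]


def paragraphs_in_reading_order(words: list[dict], lines: list[dict]) -> list[str]:
    """Group words by paragraph, visiting them in predicted reading order."""
    chunks: dict[int, list[str]] = {}
    for word in sorted(words, key=lambda w: w["order"]):
        paragraph = lines[word["line"]]["paragraph"]
        chunks.setdefault(paragraph, []).append(word["text"])
    # Dicts keep insertion order, so paragraphs come out in reading order too.
    return [" ".join(texts) for texts in chunks.values()]


chunks = paragraphs_in_reading_order(words, lines)
print(chunks)
```

Joining with a space suits English; for Chinese or Japanese lines the words would be concatenated without a separator. Without the order and parent links, the same four words would have to be reassembled from bounding-box geometry alone, which is where multi-column and table layouts usually go wrong.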

The broader lesson from this release is that synthetic data quality is now sophisticated enough that the limiting factor in OCR is not dataset size or script coverage, but pipeline design. NVIDIA’s contribution here is less the model itself and more the demonstration that the synthetic generation problem is largely solved for a significant subset of global scripts, and that the solution can be described, reproduced, and extended by others.
