
34 Pages Per Second: How NVIDIA's Nemotron OCR v2 Trades Accuracy for Throughput

Source: huggingface

Optical character recognition has a benchmark problem. Systems that look competitive on clean, synthetic test sets frequently stumble on the messier documents that actually show up in production: scanned receipts, photographed whiteboards, PDFs that went through a fax machine at some point in their lives. NVIDIA’s Nemotron OCR v2 is worth examining closely precisely because its numbers tell two different stories depending on which benchmark you look at, and understanding why tells you something useful about how to pick an OCR system for a real workload.

The Architecture That Enables the Speed

The headline number is 34.7 pages per second on a single A100 GPU. PaddleOCR v5 achieves 1.2 pages per second on the same hardware. OpenOCR reaches 1.5. That gap is not a marginal win; it fundamentally changes what problems become tractable. If you are processing a document archive of a million pages, the difference between 1.2 and 34.7 pages per second is the difference between roughly nine and a half days of compute and about eight hours.
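The archive arithmetic is easy to check. The sketch below uses only the throughput figures quoted above and assumes a constant rate with no loading or I/O overhead, which real pipelines will not quite achieve; the function name is illustrative.

```python
def processing_time_seconds(pages: int, pages_per_second: float) -> float:
    """Wall-clock seconds to process `pages` at a constant throughput."""
    return pages / pages_per_second


ARCHIVE_PAGES = 1_000_000

# Pages per second on a single A100, as quoted in the text.
THROUGHPUT = {
    "PaddleOCR v5": 1.2,
    "OpenOCR": 1.5,
    "Nemotron OCR v2": 34.7,
}

for name, pps in THROUGHPUT.items():
    seconds = processing_time_seconds(ARCHIVE_PAGES, pps)
    print(f"{name}: {seconds / 3600:.1f} hours ({seconds / 86400:.2f} days)")
```

Swapping in your own archive size and measured throughput gives a first estimate of whether a job fits in an overnight batch window.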

The speed comes from a deliberate architectural choice rooted in FOTS (Fast Oriented Text Spotting), a 2018 architecture that unified text detection and recognition into a single network with a shared convolutional backbone. Most OCR pipelines are two-stage: a detector finds text regions, then a separate recognizer reads each one. Every detected region triggers another forward pass through the recognizer. FOTS avoids this by running the input image once through a shared backbone, producing feature maps that both the detector and recognizer consume. The regions are read from those pre-computed features, not from re-encoding cropped patches.
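A toy model makes the cost difference concrete. This is not NVIDIA's or FOTS's code: `CountingEncoder` is a hypothetical stand-in for an expensive backbone, and the "features" are just scaled pixels. The only point is to count how many times the expensive step runs under each design.

```python
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


class CountingEncoder:
    """Stand-in for an expensive convolutional backbone; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image: Image) -> Image:
        self.calls += 1
        # Toy "feature extraction"; a real backbone would run convolutions
        # and usually return a downsampled feature map.
        return [[2.0 * px for px in row] for row in image]


def crop(grid: Image, box: Box) -> Image:
    top, left, bottom, right = box
    return [row[left:right] for row in grid[top:bottom]]


def two_stage(image: Image, boxes: List[Box], encoder: CountingEncoder) -> List[Image]:
    encoder(image)  # detector pass over the full page
    # Each detected region is cropped from the pixels and encoded again.
    return [encoder(crop(image, box)) for box in boxes]


def shared_backbone(image: Image, boxes: List[Box], encoder: CountingEncoder) -> List[Image]:
    features = encoder(image)  # single pass over the full page
    # Regions are read out of the precomputed features, with no re-encoding.
    return [crop(features, box) for box in boxes]


page = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
boxes = [(0, 0, 2, 4), (3, 1, 5, 7), (6, 0, 8, 8)]

two_stage_encoder = CountingEncoder()
two_stage(page, boxes, two_stage_encoder)

shared_encoder = CountingEncoder()
shared_backbone(page, boxes, shared_encoder)

print(two_stage_encoder.calls, shared_encoder.calls)  # 4 1
```

With three regions the two-stage path encodes four times and the shared path once. On a dense page with hundreds of text regions, the gap grows linearly with the region count.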

Nemotron OCR v2 extends this with a third component: a relational model that predicts reading order and logical groupings like which words belong to which lines, which lines compose paragraphs, and how the document structure flows. This runs on the same feature maps. The result is a 54-million-parameter English model and an 84-million-parameter multilingual model, both processing a page in a single coordinated forward pass rather than a cascade of separate inferences.

The multilingual model supports English, Chinese (Simplified and Traditional), Japanese, Korean, and Russian simultaneously without language detection. A single model handles all five languages and their scripts. That matters operationally because it removes a preprocessing step and avoids the errors that propagate when a language is misidentified before recognition even begins.

Why Synthetic Data Is the Only Viable Path for Multilingual OCR

Building a multilingual OCR dataset from real annotated documents is expensive in a way that compounds with each additional script. You need documents in each language, accurate ground-truth transcriptions, and annotations for bounding boxes at whatever granularity your model requires. For a script like Traditional Chinese, where several thousand distinct characters are in common use and large dictionaries list tens of thousands, the annotation effort per document is substantially higher than for a Latin-script language.

NVIDIA’s approach sidesteps this by generating 12.2 million synthetic training samples using a modified version of SynthDoG, the synthetic document generator from Clova AI. The pipeline renders text from the mOSCAR multilingual corpus, a web-crawled dataset covering 163 language subsets, onto synthetic page layouts using fonts sourced from Google Fonts and the Noto family. The multilingual model’s character set grows from 855 characters in the English-only variant to 14,244 characters, covering the full Unicode ranges needed for CJK scripts.

The dataset breakdown shows where the training emphasis lands:

| Language | Training Samples |
| --- | --- |
| Chinese (Simplified) | 1,914,948 |
| Chinese (Traditional) | 1,772,280 |
| Korean | 1,814,994 |
| Japanese | 1,502,712 |
| Russian | 1,380,404 |
| English | 1,460,304 |

The pipeline is not just rendering text on backgrounds. It generates multi-level bounding box annotations at word, line, and paragraph granularity simultaneously, both as axis-aligned boxes and four-point quads for rotated text. It also generates the reading order graph that the relational model trains on: which words compose lines, which lines compose paragraphs, in what sequence. This structural annotation, inspired by the HierText dataset, is what allows the model to produce structured output rather than an unordered bag of text regions.
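One way to picture these annotations is as a small hierarchy in which each level derives its box from the level below. The sketch below is illustrative only: the class and field names are assumptions, not the dataset's actual schema, and list order stands in for the reading order graph.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def quad_to_box(quad: List[Point]) -> Box:
    """Smallest axis-aligned box that contains a four-point quad."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def union(boxes: List[Box]) -> Box:
    """Smallest axis-aligned box that contains every input box."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


@dataclass
class Word:
    text: str
    quad: List[Point]  # four corners, so rotated text keeps its orientation

    @property
    def box(self) -> Box:
        return quad_to_box(self.quad)


@dataclass
class Line:
    words: List[Word]  # list order is the reading order within the line

    @property
    def box(self) -> Box:
        return union([w.box for w in self.words])

    @property
    def text(self) -> str:
        # Joining with spaces is a Latin-script assumption.
        return " ".join(w.text for w in self.words)


@dataclass
class Paragraph:
    lines: List[Line]  # list order is the reading order within the paragraph

    @property
    def box(self) -> Box:
        return union([ln.box for ln in self.lines])

    @property
    def text(self) -> str:
        return "\n".join(ln.text for ln in self.lines)


line = Line([
    Word("Hello", [(10, 10), (50, 12), (49, 30), (9, 28)]),
    Word("world", [(55, 12), (95, 14), (94, 32), (54, 30)]),
])
paragraph = Paragraph([line])
print(paragraph.text, paragraph.box)  # Hello world (9, 10, 95, 32)
```

Because a synthetic renderer places every glyph itself, it can emit all three levels and both box types for free, which is exactly the annotation that is costly to produce by hand.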

Synthetic generation also makes augmentation tractable at scale. The pipeline applies text-level augmentations like border effects, drop shadows, stroke width variation, and glyph edge noise, then image-level augmentations like elastic distortion, median blur, and motion blur, then page-level augmentations like contrast jitter, color shifting, and shadow overlays. This layered approach simulates the kinds of degradation that real scanned documents exhibit without requiring real documents at all.
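The layering can be sketched as an ordered list of stages, each holding its own operations. Everything here is a simplified stand-in rather than the modified SynthDoG code: the operations work on a tiny grayscale grid in pure Python, the text-level step is applied to the whole image instead of a separate glyph layer, and the per-operation probability `p` is an assumption.

```python
import random
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale values in [0, 1]
Augment = Callable[[Image, random.Random], Image]


def clamp(value: float) -> float:
    return max(0.0, min(1.0, value))


def edge_noise(img: Image, rng: random.Random) -> Image:
    """Text-level stand-in: small per-pixel jitter, like noisy glyph edges."""
    return [[clamp(px + rng.uniform(-0.05, 0.05)) for px in row] for row in img]


def median_blur_1d(img: Image, rng: random.Random) -> Image:
    """Image-level stand-in: a horizontal three-tap median, not a 2-D median blur."""
    out = []
    for row in img:
        n = len(row)
        out.append([
            sorted([row[max(i - 1, 0)], row[i], row[min(i + 1, n - 1)]])[1]
            for i in range(n)
        ])
    return out


def contrast_jitter(img: Image, rng: random.Random) -> Image:
    """Page-level stand-in: scale contrast around mid-gray by a random gain."""
    gain = rng.uniform(0.7, 1.3)
    return [[clamp(0.5 + gain * (px - 0.5)) for px in row] for row in img]


# Order matters: text-level first, then image-level, then page-level.
STAGES: List[Tuple[str, List[Augment]]] = [
    ("text", [edge_noise]),
    ("image", [median_blur_1d]),
    ("page", [contrast_jitter]),
]


def augment(img: Image, rng: random.Random, p: float = 0.5) -> Tuple[Image, List[str]]:
    """Apply each operation with probability p, keeping the stage order fixed."""
    applied = []
    for stage, ops in STAGES:
        for op in ops:
            if rng.random() < p:
                img = op(img, rng)
                applied.append(f"{stage}:{op.__name__}")
    return img, applied


page = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
augmented, applied = augment(page, random.Random(0))
print(applied)
```

Seeding the generator makes each sample reproducible, which helps when a particular augmented page needs to be regenerated for debugging.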

The CJK Problem That Forced an Architecture Change

One specific engineering decision in the multilingual model is worth understanding: the shift from word-level to line-level recognition for CJK languages.

The English-only model operates at word granularity. A word in Latin-script text has clear visual boundaries: spaces separate tokens, and each bounding box around a word contains a predictable number of glyphs. This does not transfer to Chinese or Japanese, where word boundaries are not marked in the writing system. Segmenting Chinese text into words is itself a non-trivial NLP problem that typically requires a separate model or dictionary lookup.

For Korean, the problem is different but related: while Korean does use spaces between words, the spacing is inconsistent in practice, with writers omitting spaces in ways that are grammatically permissible but unpredictable from a visual detection standpoint.

The solution NVIDIA adopted is to shift CJK recognition to line-level granularity. Instead of trying to detect and recognize individual words, the model detects lines and recognizes the full text of each line. This removes the dependency on word segmentation and handles vertical text columns, which appear in traditional Chinese and Japanese layouts, without special-casing. The multilingual recognizer uses 6 transformer layers versus 3 in the English variant, reflecting the larger character set and the increased complexity of reading full lines in high-cardinality scripts.
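The segmentation problem shows up in two lines of Python. The example sentences are made up for illustration; the point is only that whitespace splitting, which yields word units for English, returns a Chinese line as a single unsplit unit.

```python
def whitespace_units(line: str) -> list:
    """Split a line into units the way a word-level pipeline implicitly does."""
    return line.split()


english = "Nemotron reads each word separately"
chinese = "光学字符识别是一个经典问题"  # "OCR is a classic problem", written without spaces

print(len(whitespace_units(english)))  # 5
print(len(whitespace_units(chinese)))  # 1

# A line-level recognizer sidesteps segmentation: its target is simply the
# full character sequence of the line.
line_target = list(chinese)
print(len(line_target))  # 13
```

Recovering the words inside that single Chinese unit would take a separate segmenter, which is the dependency that line-level recognition removes.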

What the Benchmarks Actually Show

This is where the analysis gets more nuanced. On the synthetic SynthDoG benchmark, Nemotron OCR v2 is not just competitive; it is in a different tier entirely:

| Language | PaddleOCR (NED) | Nemotron v2, multilingual (NED) |
| --- | --- | --- |
| English | 0.117 | 0.069 |
| Japanese | 0.201 | 0.046 |
| Korean | 0.943 | 0.047 |
| Russian | 0.959 | 0.043 |
| Chinese (Simplified) | 0.054 | 0.035 |

Normalized edit distance (NED) is a lower-is-better metric; 0.043 on Russian compared to PaddleOCR’s 0.959 is an extraordinary margin. For Korean and Russian, the baseline essentially fails while Nemotron v2 performs well. This is consistent with the data coverage argument: a NED above 0.9 suggests the PaddleOCR model tested was not trained to handle Cyrillic or Hangul at this scale.
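For readers who have not worked with the metric, here is a minimal sketch of one common NED definition: Levenshtein distance divided by the length of the longer string. The exact normalization and text preprocessing used by the benchmarks discussed here are not stated in this article, so treat that choice as an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def ned(prediction: str, truth: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means an exact match."""
    if not prediction and not truth:
        return 0.0
    return levenshtein(prediction, truth) / max(len(prediction), len(truth))


print(ned("kitten", "sitting"))  # 3 edits over 7 characters
print(ned("Привет", "Привет"))   # 0.0
```

Under this definition, a score near 0.95 means almost every character in the output has to be edited to match the ground truth, which is what a recognizer produces when it cannot read the script at all.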

The picture on the real-world OmniDocBench benchmark is different:

| Model | Pages/s | EN NED | ZH NED |
| --- | --- | --- | --- |
| PaddleOCR v5 | 1.2 | 0.027 | 0.037 |
| OpenOCR | 1.5 | 0.024 | 0.033 |
| Nemotron OCR v2 (multilingual) | 34.7 | 0.048 | 0.072 |

On actual scanned documents, PaddleOCR and OpenOCR are meaningfully more accurate. Nemotron OCR v2 has roughly twice their error rate on both English and Chinese when measured against real documents. The training on synthetic data, while comprehensive in scale, produces a model that has not fully closed the gap between rendered text and genuinely degraded real-world input.
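The size of the trade is easier to judge as ratios. This short sketch uses only the OmniDocBench figures quoted above; the dictionary layout and function name are illustrative.

```python
# Figures as quoted in the OmniDocBench comparison above.
OMNIDOCBENCH = {
    "PaddleOCR v5": {"pages_per_s": 1.2, "en_ned": 0.027, "zh_ned": 0.037},
    "OpenOCR": {"pages_per_s": 1.5, "en_ned": 0.024, "zh_ned": 0.033},
    "Nemotron OCR v2": {"pages_per_s": 34.7, "en_ned": 0.048, "zh_ned": 0.072},
}


def relative_to(baseline: str, metric: str, target: str = "Nemotron OCR v2") -> float:
    """How many times larger the target's metric is than the baseline's."""
    return OMNIDOCBENCH[target][metric] / OMNIDOCBENCH[baseline][metric]


for baseline in ("PaddleOCR v5", "OpenOCR"):
    print(
        f"vs {baseline}: "
        f"{relative_to(baseline, 'pages_per_s'):.1f}x throughput, "
        f"{relative_to(baseline, 'en_ned'):.2f}x EN error, "
        f"{relative_to(baseline, 'zh_ned'):.2f}x ZH error"
    )
```

On these numbers, Nemotron OCR v2 is more than twenty times faster than either baseline while making roughly twice as many character-level errors.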

This is a real trade-off, not a measurement artifact. Synthetic data pipelines, however sophisticated, cannot replicate every failure mode that real documents introduce: ink bleed, physical page warping, scanner calibration artifacts, overlapping text from show-through on thin paper. The augmentation strategy approximates these effects but does not replace exposure to the real distribution.

When to Use It

Nemotron OCR v2 makes sense for workloads where throughput is the binding constraint and the documents are reasonably clean. Digitizing a large corpus of born-digital PDFs, processing structured forms, or running OCR as part of a data ingestion pipeline where documents are modern and well-reproduced are all cases where the 34.7 pages per second figure matters more than the OmniDocBench accuracy gap.

For high-accuracy extraction from degraded historical documents, scanned archives, or faxed materials, PaddleOCR and OpenOCR still hold an accuracy advantage on real-world benchmarks. The choice is genuinely a function of your input distribution.

The multilingual capability is the more unambiguously strong result. For Korean and Russian specifically, the improvement over the PaddleOCR baseline on the synthetic benchmark is large enough that the real-world accuracy gap becomes secondary. If you are building a pipeline that needs to handle mixed-language documents, having a single model that covers Cyrillic, Hangul, Chinese, Japanese, and Latin text without language detection is worth the accuracy trade-off against specialists.

The model and a 12.2-million-sample synthetic dataset are available under open licenses (NVIDIA Open Model License for the model, CC-BY-4.0 for the dataset). The dataset alone is a meaningful contribution: a synthetic multilingual OCR corpus at this scale and with this level of structural annotation does not otherwise exist in open form, and it gives researchers a starting point for training or fine-tuning models on specific document types or additional languages beyond the five currently supported.
