The Structure Problem in Document AI, and How Nemotron OCR v2 Approaches It

Most discussions of OCR benchmarks focus on character accuracy: how many glyphs did the model get right, expressed as normalized edit distance or word accuracy rate. That framing made sense when the end goal was a text file. For modern document processing pipelines, the text alone is often the least useful output.

Consider what a retrieval-augmented generation pipeline actually needs from a scanned document. It needs the text, but it also needs to know which text belongs together, in what order to read it, which portions are headers versus body copy, which content is part of a table, and where paragraph breaks occur. A flat bag of bounding boxes with transcribed characters leaves that work to downstream logic, which often gets it wrong on multi-column layouts, tables, and documents with non-trivial structure.

NVIDIA’s Nemotron OCR v2 is the OCR model getting attention right now for its throughput: 34.7 pages per second on a single A100, roughly 28 times faster than PaddleOCR v5 on the same hardware. But the architectural decision I find more interesting is the third component in the model: a relational encoder that outputs a hierarchical reading-order graph alongside the text transcription. That is not a common feature of OCR systems, and it is worth understanding why NVIDIA built it and what it actually costs to get right.

What Flat OCR Output Misses

A standard OCR pipeline returns detected regions, each paired with a text string and a bounding box. The ordering of those regions is typically determined by spatial heuristics: sort by vertical position, then by horizontal position within each row. This works for simple single-column documents but breaks down quickly on anything more complex.

A two-column academic paper will have text regions interleaved between columns. A table will have cells arranged in a grid where the reading order depends on semantics, not pixel coordinates. A slide with a headline and three bullets may arrange those bullets in a layout where naive top-to-bottom sorting produces the wrong sequence. Historical documents sometimes mix vertical and horizontal text on the same page.
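To make the two-column failure concrete, here is a minimal sketch of the spatial heuristic described above. The Region type, the row_tolerance parameter, and the coordinates are all invented for illustration; no particular OCR library works exactly this way.

```python
from dataclasses import dataclass


@dataclass
class Region:
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box


def naive_reading_order(regions, row_tolerance=5.0):
    """Sort top-to-bottom, then left-to-right within each row band."""
    return sorted(regions, key=lambda r: (round(r.y / row_tolerance), r.x))


# A two-column page: the left column reads L1, L2; the right reads R1, R2.
page = [
    Region("L1", x=50, y=100),
    Region("L2", x=50, y=120),
    Region("R1", x=400, y=100),
    Region("R2", x=400, y=120),
]

naive = [r.text for r in naive_reading_order(page)]
print(naive)  # the two columns come out interleaved row by row
```

The heuristic has no notion of a column, so it reads straight across the gutter. The intended order here is L1, L2, R1, R2.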

When this output feeds into a language model via RAG, the retrieval chunks inherit whatever ordering errors the OCR imposed. A passage split incorrectly across columns, or a table concatenated row-by-row into a text string, generates embeddings that do not accurately represent the original content. The retrieval step returns text, but the semantic structure that made it meaningful is gone.

This is not a theoretical problem. Anyone who has tried to build a document Q&A pipeline over a corpus of scanned PDFs with complex layouts has run into it directly. The failure mode is subtle enough that it often presents as the language model being evasive or wrong, when the actual issue is that the context it received was malformed by a broken reading order.

How the Relational Model Works

Nemotron OCR v2 addresses this with a dedicated third component. The architecture follows the FOTS design, which runs a single convolutional backbone, in this case RegNetX-8GF, once per input image. Text detection and recognition both consume features from that single pass. The relational model, a compact Transformer encoder, also runs on those same feature maps.

The relational component predicts hierarchical relationships between detected text regions. Specifically, it outputs parent-child indices linking words to lines, lines to paragraphs, and paragraphs to document sections. Each element in the hierarchy also carries a position in reading order. The output is not a flat sequence but a directed graph that encodes the logical structure of the document alongside the transcribed text.
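One way to picture that output is as a list of nodes, each carrying a parent index and a reading-order position. The sketch below uses a hypothetical schema (the Node fields and level names are my own, not the model's actual output format) and stops at the paragraph level for brevity. It shows how a consumer could rebuild ordered paragraphs from such a graph.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int
    level: str             # "word", "line", or "paragraph" (hypothetical labels)
    parent: Optional[int]  # node_id of the parent; None for top-level nodes
    order: int             # reading-order position among siblings
    text: str = ""         # only word nodes carry text in this sketch


def paragraphs_in_reading_order(nodes):
    """Rebuild paragraph strings by walking parent links in sibling order."""
    children = defaultdict(list)
    for n in nodes:
        children[n.parent].append(n)
    for siblings in children.values():
        siblings.sort(key=lambda n: n.order)

    def render(node):
        if node.level == "word":
            return node.text
        # Words in a line are joined by spaces, lines in a paragraph by newlines.
        sep = " " if node.level == "line" else "\n"
        return sep.join(render(c) for c in children[node.node_id])

    return [render(p) for p in children[None]]


# Nodes deliberately listed out of order; the graph carries the structure.
nodes = [
    Node(0, "paragraph", None, order=1),
    Node(1, "paragraph", None, order=0),
    Node(2, "line", 1, order=0),
    Node(3, "line", 1, order=1),
    Node(4, "line", 0, order=0),
    Node(5, "word", 2, order=1, text="world"),
    Node(6, "word", 2, order=0, text="Hello"),
    Node(7, "word", 3, order=0, text="again"),
    Node(8, "word", 4, order=0, text="Second"),
]

paragraphs = paragraphs_in_reading_order(nodes)
```

Because ordering lives in the graph rather than in pixel coordinates, the consumer never has to guess at columns or table cells.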

Generating training data for this component is what drove the design of the synthetic pipeline. No open dataset offers human-annotated reading order at anything close to the required scale. The HierText dataset, which NVIDIA cites as an inspiration, provides hierarchical annotations for about 11,000 images, but those images are natural scene text rather than document layouts, and the volume is too small to train a robust relational model.

Synthetic generation solves this directly. Every image produced by the pipeline has ground-truth hierarchical annotations because the rendering engine knows exactly which words were placed on which lines, which lines compose each paragraph, and what the intended reading order was. The annotations are perfect by construction. At 12.2 million samples, this provides far more structural training signal than any human-annotated corpus could.
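The "perfect by construction" point is easy to see in a toy layout engine. The sketch below is not NVIDIA's pipeline. It skips rendering entirely and uses a made-up fixed-width font (char_w) and page geometry, but it shows how the labels fall out of the placement loop for free.

```python
def layout_paragraph(words, page_width=400, char_w=8, line_h=20, x0=20, y0=20):
    """Place words left to right with wrapping, recording labels as we go.

    All geometry parameters are illustrative assumptions. A real generator
    would also render the words with fonts, backgrounds, and augmentations.
    """
    annotations = []
    x, y, line_idx = x0, y0, 0
    for order, word in enumerate(words):
        w = len(word) * char_w
        if x + w > page_width and x > x0:
            # The word does not fit: start a new line.
            x, y, line_idx = x0, y + line_h, line_idx + 1
        annotations.append({
            "text": word,
            "bbox": (x, y, x + w, y + line_h),
            "line": line_idx,        # ground-truth word-to-line link
            "reading_order": order,  # ground-truth reading position
        })
        x += w + char_w  # advance past the word plus one space
    return annotations


words = "the quick brown fox jumps over the lazy dog".split()
annotations = layout_paragraph(words, page_width=120)
line_ids = [a["line"] for a in annotations]
```

Nothing is inferred after the fact: the same loop that decides where a word goes also writes down which line it belongs to and where it sits in the reading order.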

The modification to SynthDoG that NVIDIA built includes seven layout templates: flowing multi-column text, scattered scene-text, vertical CJK columns, tables with headers and borders, table-of-contents pages with dot leaders, PowerPoint-style slides, and word-processor documents with heading hierarchies. Each template produces structurally distinct images that push the relational encoder to generalize across document types. A model trained only on single-column text would learn to output trivial reading-order graphs; these templates force it to handle the cases where structure recovery is actually hard.

The CJK Complication

The multilingual extension of the model introduces a structural challenge specific to CJK scripts. The English model operates at word granularity: it detects word-level bounding boxes and transcribes each word individually. The relational model then predicts which words belong to the same line and how lines compose paragraphs.

For Chinese and Japanese, word-level segmentation is not directly available from visual signals. Both scripts lack whitespace between words, and Chinese in particular requires linguistic knowledge to identify word boundaries that have no visual representation. Korean uses spacing but inconsistently enough that word-boundary detection cannot be assumed reliable.

The multilingual variant of Nemotron OCR v2 solves this by shifting the recognition granularity from word-level to line-level for CJK content. The detector identifies line regions rather than word regions, and the recognizer transcribes the full line as a unit. The character set expands from 855 characters in the English model to 14,244 characters, covering the Unicode ranges for CJK unified ideographs, Hangul syllables, and Cyrillic alongside Latin.

This is architecturally reasonable, but it changes what the relational model produces. With line-level output, the explicit word-to-line relationship in the hierarchy disappears; what remains is a character-to-line relationship that is implicit in the transcription rather than encoded in the graph structure. For RAG applications, this means that chunk boundaries in CJK content naturally fall at the line or paragraph level rather than the word level, which is probably the right granularity for most retrieval use cases anyway.

The recognizer deepens from 3 to 6 Transformer layers in the multilingual variant, reflecting the larger character set and the additional complexity of reading full lines in scripts with high character cardinality. The total parameter count grows from 54 million to 84 million. These remain small numbers by current model standards; Qwen2-VL, for comparison, starts at 2 billion parameters for its smallest variant.

The Trade-off That Benchmark Tables Obscure

The OmniDocBench results show Nemotron OCR v2 multilingual at a normalized edit distance of 0.048 for English versus 0.027 for PaddleOCR v5. On Chinese, the figures are 0.072 versus 0.037. Lower is better for this metric, so these are meaningful accuracy gaps in favor of PaddleOCR on real-world documents.
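For readers who have not worked with the metric, here is a minimal sketch of normalized edit distance: Levenshtein distance divided by the length of the longer string. The benchmark's exact normalization and text preprocessing may differ, so treat the normalization below as an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """0.0 is a perfect match; 1.0 means nothing lines up."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))
```

Under this definition, a score of 0.048 corresponds to roughly five character edits per hundred characters, against roughly three for 0.027.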

This trade-off is real and worth stating clearly. Synthetic training data, however extensively augmented, does not fully replicate the noise distribution of genuinely degraded documents: ink bleed from physical printing, page warp from book scanning, fax artifacts, thermal paper fading. PaddleOCR’s accuracy advantage on OmniDocBench likely reflects greater exposure to real-world document noise in its training distribution.

What the benchmark table does not show is the value of the structural output. PaddleOCR returns detected text regions with bounding boxes and a flat transcription. Nemotron v2 returns a hierarchical reading-order graph. For document processing pipelines where downstream applications need to reconstruct document structure, the comparison is not purely on character accuracy but on the total engineering work required to get usable output.

A team using PaddleOCR for a multi-column document pipeline needs to build or integrate a layout analysis component, then reconcile its output with the OCR results, then infer reading order from the combined spatial and structural signals. This is non-trivial work that introduces its own failure modes. Nemotron v2 delivers an approximation of that structure as part of the OCR output, at much higher throughput, with a single model that requires no language detection preprocessing.
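The reconciliation step is worth sketching, because it is where much of the hidden work lives. The code below assumes a separate layout model has already produced column regions in reading order. The function names and box format are hypothetical, and a production version would also need to handle overlapping regions, tables, and rotated text.

```python
def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(region, point):
    x0, y0, x1, y1 = region
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def order_with_layout(ocr_boxes, layout_regions):
    """Assign each OCR box to the first layout region containing its center.

    ocr_boxes: list of (text, (x0, y0, x1, y1)) in arbitrary order.
    layout_regions: list of (x0, y0, x1, y1), assumed already in reading order.
    Returns (ordered_texts, unassigned_texts).
    """
    buckets = [[] for _ in layout_regions]
    unassigned = []
    for text, box in ocr_boxes:
        c = center(box)
        for i, region in enumerate(layout_regions):
            if contains(region, c):
                buckets[i].append((text, box))
                break
        else:
            # No region claimed this box: one of the extra failure modes.
            unassigned.append(text)
    ordered = []
    for bucket in buckets:
        bucket.sort(key=lambda tb: (tb[1][1], tb[1][0]))  # top, then left
        ordered.extend(text for text, _ in bucket)
    return ordered, unassigned


columns = [(0, 0, 300, 800), (310, 0, 600, 800)]
ocr = [
    ("R1", (400, 100, 500, 115)),
    ("L1", (50, 100, 150, 115)),
    ("L2", (50, 120, 150, 135)),
    ("R2", (400, 120, 500, 135)),
    ("footer", (250, 900, 350, 915)),
]
ordered, unassigned = order_with_layout(ocr, columns)
```

Even this toy version has to decide what to do with boxes that fall outside every region, and it inherits every mistake the layout model makes.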

Whether that trade-off makes sense depends on the application. For high-accuracy extraction from degraded historical archives where throughput is not a constraint, PaddleOCR and OpenOCR remain stronger choices. For document ingestion pipelines processing modern born-digital or cleanly scanned documents at volume, the combined throughput and structural output of Nemotron OCR v2 reduces the engineering surface area considerably.

What an Open Synthetic Dataset Actually Enables

The nvidia/OCR-Synthetic-Multilingual-v1 dataset, released under CC-BY-4.0, is worth noting separately from the model. A 12.2-million-sample synthetic OCR corpus with hierarchical annotations at word, line, and paragraph granularity, covering six scripts, does not otherwise exist in open form.

The practical value for researchers is that the dataset provides a starting point for fine-tuning the model on specific document types or additional languages. The synthetic generation methodology is described in enough detail that the pipeline can be reproduced and extended: add a source text corpus for a new language, add fonts, generate samples. The architecture requires no changes.

For teams building document AI systems, this is a more useful form of openness than releasing model weights alone. The weights capture what the current training distribution produces. The dataset and methodology enable modification of that distribution for specific requirements: a legal document processor that needs higher accuracy on case citations, a medical records pipeline that needs to handle handwritten annotations, an archival system that needs to handle a specific historical script.

The model is available at nvidia/nemotron-ocr-v2 with a live demo at the associated Hugging Face Space. The NVIDIA Open Model License governing the weights allows commercial use with attribution, which covers most production deployment scenarios.

The broader point here is that document structure recovery is the part of document AI that has received the least systematic attention relative to character recognition accuracy. Benchmark tables measure how well a model reads individual characters; they do not measure how well it reconstructs the logical organization that makes a document interpretable. Nemotron OCR v2 is the first widely available open OCR model that treats both problems as first-class objectives in the training setup, and that is a more significant contribution than the throughput numbers alone suggest.