
arxiv: 2605.19866 · v1 · pith:LVNLMQBVnew · submitted 2026-05-19 · 💻 cs.CV

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

Pith reviewed 2026-05-20 05:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual document understanding · out-of-distribution generalization · layout priors · vision-language models · RT-DETR · DocTags · document parsing

The pith

Injecting RT-DETR layout detections as DocTags priors into VLM prompts resolves the two-hop bottleneck for out-of-distribution document layouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for document understanding often fail on layouts unlike their training data because they must first localize and classify layout entities before extracting content, and errors in the first step cause the second to collapse into omissions or repetition. The paper pre-resolves the localization step outside the decoder by running a lightweight RT-DETR detector, converting its outputs into the model's native DocTags vocabulary, and placing those tokens in the prompt next to the full page image. This structured prior shares the decoder's generation space and keeps the complete image visible as a fallback when detections are imperfect, unlike methods that crop the page or use plain-text instructions. Experiments on large out-of-distribution benchmarks show substantial gains in parsing metrics together with fewer decoding failures at modest added cost, and attention maps confirm the decoder switches between using the injected tokens for structure and image patches for content.

Core claim

Vision-language models parse documents end-to-end but break down on unseen layouts because of a two-hop bottleneck in which layout classification and localization must succeed before content extraction can; pre-resolving the first hop outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser's native DocTags vocabulary, and injecting them into the prompt alongside the full page image allows the decoder to attend to layout tokens when emitting structure and to image patches when emitting content.

What carries the argument

Structured layout prior: RT-DETR detections serialized into DocTags and injected into the VLM prompt with the full page image, sharing the decoder's generation space while leaving the global image as fallback when detections are noisy.
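
The serialization step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Detection` type, the `serialize_prior` function, the 500-bin location grid, the `<loc_N>` token spelling, the score threshold, and the top-to-bottom sort are all assumptions made for the example.

```python
from dataclasses import dataclass

GRID = 500  # assumed number of location bins; the paper's value is not given here


@dataclass(frozen=True)
class Detection:
    label: str                               # layout class, e.g. "title" or "table"
    box: tuple[float, float, float, float]   # normalized (x0, y0, x1, y1), top-left origin
    score: float                             # detector confidence in [0, 1]


def to_loc(value: float, grid: int = GRID) -> str:
    """Quantize one normalized coordinate into a hypothetical <loc_N> token."""
    clamped = min(max(value, 0.0), 1.0)
    return f"<loc_{round(clamped * grid)}>"


def serialize_prior(detections, min_score: float = 0.5) -> str:
    """Turn detections into a tag string that could be appended to a prompt."""
    kept = [d for d in detections if d.score >= min_score]
    # Assumed ordering: top-to-bottom, then left-to-right.
    kept.sort(key=lambda d: (d.box[1], d.box[0]))
    parts = []
    for d in kept:
        locs = "".join(to_loc(v) for v in d.box)
        parts.append(f"<{d.label}>{locs}</{d.label}>")
    return "".join(parts)


if __name__ == "__main__":
    page = [
        Detection("table", (0.1, 0.5, 0.9, 0.8), 0.90),
        Detection("title", (0.1, 0.05, 0.9, 0.1), 0.95),
        Detection("picture", (0.0, 0.0, 1.0, 1.0), 0.20),  # below threshold, dropped
    ]
    print(serialize_prior(page))
```

The point of emitting tags rather than plain-text hints is that the prior then lives in the same vocabulary the decoder generates, which is the property the paper credits for the gains.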

If this is right

  • Markdown F1 on a 10k-page structural OOD benchmark rises from 0.37 to 0.92.
  • Table TEDS on the Chinese subset of OmniDocBench rises from 0.01 to 0.36.
  • Infinite-loop decoding failures drop across every industrial domain on the 26k-page ViDoRe V3 benchmark.
  • The gains require only 15 percent added wall-clock latency and a median of 74 prompt tokens with no change to the base VLM architecture.
  • The decoder exhibits a bimodal attention shift, attending to injected layout tokens for structure and to image patches for content.
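
The text here does not say how the paper counts an infinite-loop decoding failure. One hypothetical way to flag such outputs is to check whether the tail of the generated token sequence is a short unit repeated many times; the function name, `max_period`, and `min_repeats` below are illustrative assumptions.

```python
def has_repetition_loop(tokens, max_period: int = 32, min_repeats: int = 4) -> bool:
    """Return True if the sequence ends in a unit of length <= max_period
    repeated at least min_repeats times back to back."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # longer periods need even more tokens
        tail = tokens[n - span:]
        unit = tail[:period]
        if all(tail[i] == unit[i % period] for i in range(span)):
            return True
    return False


if __name__ == "__main__":
    healthy = "the quick brown fox jumps over the lazy dog".split()
    stuck = ["<table>"] + ["<row>", "</row>"] * 6
    print(has_repetition_loop(healthy), has_repetition_loop(stuck))
```

A heuristic like this would typically be paired with a maximum-length check, since a decoder that hits the token budget while repeating is the usual symptom.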

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pre-resolution strategy could be tested on other structured VLM outputs such as chart data extraction or form field population.
  • Explicit layout cues may reduce repetition errors in long-form document generation tasks beyond the benchmarks shown.
  • Releasing the weights enables direct comparison against future detectors or alternative serialization schemes on the same OOD sets.

Load-bearing premise

The RT-DETR detector produces layout detections whose noise level remains low enough that the decoder can still recover using the full image fallback.

What would settle it

Measuring no improvement in markdown F1 or table TEDS, or an increase in infinite-loop failures, when the same prior is applied to a new structural OOD benchmark on which RT-DETR detection error rates are high would falsify the central claim.
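
A stress test of this kind could also be approximated without a new benchmark by degrading the detector output on purpose. The sketch below is a hypothetical probe, not something the paper reports: `perturb_boxes`, `jitter`, and `drop_prob` are invented names, and boxes are assumed to be normalized `(x0, y0, x1, y1)` tuples.

```python
import random

Box = tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)


def perturb_boxes(boxes, jitter: float = 0.05, drop_prob: float = 0.2, seed: int = 0):
    """Simulate a noisier detector: randomly drop boxes and shift coordinates."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    noisy = []
    for box in boxes:
        if rng.random() < drop_prob:
            continue  # simulate a missed detection
        shifted = [min(max(v + rng.uniform(-jitter, jitter), 0.0), 1.0) for v in box]
        # Re-sort each axis so the box stays well formed after jitter.
        x0, x1 = sorted((shifted[0], shifted[2]))
        y0, y1 = sorted((shifted[1], shifted[3]))
        noisy.append((x0, y0, x1, y1))
    return noisy


if __name__ == "__main__":
    clean = [(0.1, 0.1, 0.5, 0.2), (0.2, 0.3, 0.9, 0.8)]
    for level in (0.0, 0.05, 0.2):
        print(level, perturb_boxes(clean, jitter=level, drop_prob=level))
```

Sweeping `jitter` and `drop_prob` upward and re-running the parser would show at what noise level the full-image fallback stops compensating.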

Figures

Figures reproduced from arXiv: 2605.19866 by Ahmed Nassar, A. Said Gurbuz, Christoph Auer, Peter El Hachem, Peter W. J. Staar.

Figure 1: Overview of our two-stage pipeline.
  … YOLO-based systems [29] while maintaining the low-latency inference necessary for real-time document processing. The model outputs bounding boxes in a normalized top-left coordinate system. Detected object categories are detailed in Appendix C, …
Figure 2: Stability benchmarking on ViDoRe dataset for infinite loops across 7 different industries. Red bars are the base model, green bars are the finetuned model
Figure 4: Pareto comparison on the English subset of OmniDocBench. The plot compares edit
Figure 5: Linear SVM ROC curve for training-data vs NoveltySet separability.
Figure 6: Attention distribution conditioned on generating a location token. The mass concentrates
Figure 7: Attention distribution conditioned on generating a layout token. The distribution remains
Original abstract

Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when the first hop fails the second collapses into omissions, malformed structure, or autoregressive repetition. We pre-resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser's native DocTags vocabulary, and injecting them into the prompt alongside the full page image. Unlike analyze-then-parse approaches that crop the page, or prior prompt-level priors written in plain text, our prior shares the decoder's generation space and leaves the global image in view as a fallback when detections are noisy. On a 10k-page structural out-of-distribution benchmark, markdown F1 rises from $0.37$ to $0.92$; on the Chinese subset of OmniDocBench, table TEDS rises from $0.01$ to $0.36$; and on the 26k-page ViDoRe V3 benchmark, infinite-loop decoding failures drop across every industrial domain tested. These gains cost $15\%$ wall-clock latency and a median of $74$ prompt tokens, with no architectural change to the base VLM. An attention-level analysis further reveals a bimodal phase shift in which the decoder attends to injected layout tokens when emitting structure and to image patches when emitting content, consistent with the two-hop bottleneck being alleviated. Model weights will be released to support reproducibility.
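
The attention-level claim can be made concrete with a small bookkeeping sketch: for one decoding step, sum the attention weight that falls on the injected layout tokens and on the image patches. The index ranges and the function name are assumptions for illustration; the paper's actual aggregation over heads and layers is not described in this text.

```python
def attention_mass(weights, layout_idx, image_idx) -> dict:
    """Split one decoding step's attention weights into layout, image, and other mass.

    weights    : per-position attention weights (need not be pre-normalized)
    layout_idx : positions of the injected layout-prior tokens (assumed known)
    image_idx  : positions of the image patch tokens (assumed known)
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    layout = sum(weights[i] for i in layout_idx) / total
    image = sum(weights[i] for i in image_idx) / total
    return {"layout": layout, "image": image, "other": 1.0 - layout - image}


if __name__ == "__main__":
    # Toy sequence: positions 0-1 are layout tokens, 2 is an image patch, 3 is text.
    step_emitting_structure = [6, 2, 1, 1]
    step_emitting_content = [1, 1, 6, 2]
    print(attention_mass(step_emitting_structure, range(0, 2), range(2, 3)))
    print(attention_mass(step_emitting_content, range(0, 2), range(2, 3)))
```

Under the paper's account, steps that emit structure tokens would show high `layout` mass and steps that emit content would show high `image` mass.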

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes addressing out-of-distribution failures in vision-language models for visual document understanding by pre-resolving layout entities (Hop 1) via an RT-DETR detector whose outputs are serialized in the model's native DocTags vocabulary and injected into the prompt together with the full page image. This leaves the global image available as a fallback for noisy detections. The approach is evaluated on a 10k-page structural OOD benchmark (markdown F1 from 0.37 to 0.92), the Chinese subset of OmniDocBench (table TEDS from 0.01 to 0.36), and the 26k-page ViDoRe V3 benchmark (reduced infinite-loop failures), at a cost of 15% added wall-clock latency and a median of 74 prompt tokens. An attention analysis is presented showing a bimodal shift consistent with the two-hop hypothesis.
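
For readers unfamiliar with F1-style parsing scores, a bag-of-tokens F1 between predicted and reference markdown can be sketched as below. This is only an illustrative stand-in: the paper's exact markdown F1 definition (tokenization, normalization, alignment) is not given in this text, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Bag-of-tokens F1 over whitespace-separated tokens (multiset overlap)."""
    pred = Counter(predicted.split())
    ref = Counter(reference.split())
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "# Invoice\n\n| item | price |\n| pen | 2 |"
    truncated = "# Invoice"
    print(token_f1(reference, reference), token_f1(truncated, reference))
```

A score of this type penalizes both omissions (low recall) and repetition loops (low precision), the two failure modes the summary attributes to a failed first hop.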

Significance. If the empirical results and mechanistic account hold, the work would be significant for practical document parsing systems. It demonstrates large, consistent gains across three distinct benchmarks using only prompt-level injection and no architectural changes to the base VLM. The attention analysis provides supporting evidence for the claimed two-hop alleviation, and the low overhead (latency and token count) makes the method immediately deployable. Releasing model weights further strengthens reproducibility.

major comments (2)
  1. [Abstract and §4] Abstract and experimental results sections: the central claim that injected DocTags priors alleviate the Hop-1 bottleneck while the full image acts as a reliable fallback when detections are noisy is load-bearing, yet no detector-level metrics (mAP, per-class precision/recall, or failure rate) are reported for the RT-DETR model on the 10k-page structural OOD benchmark or the Chinese OmniDocBench subset. Without these numbers it is impossible to determine whether the observed F1 and TEDS lifts arise primarily from correct priors or from the VLM largely ignoring the priors and falling back to the image.
  2. [§4] §4 (benchmark construction): the 10k-page structural out-of-distribution benchmark is introduced without details on how the OOD splits were constructed, what distribution shifts were deliberately introduced, or how ground-truth annotations were obtained. This information is necessary to evaluate whether the reported gains generalize beyond the specific test distribution used.
minor comments (2)
  1. [Abstract] Abstract: reported metric improvements lack error bars or confidence intervals, making it difficult to assess the statistical reliability of the large lifts (e.g., F1 0.37→0.92).
  2. [§4] The paper does not include an ablation that isolates the contribution of the DocTags prior versus simply adding the detector outputs in a different format or omitting the prior entirely.
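
The confidence intervals requested in the first minor comment could be obtained with a percentile bootstrap over per-page scores. The sketch below assumes per-page scores are available as a list of floats; `bootstrap_ci` and its parameters are illustrative names, and nothing here reflects numbers from the paper.

```python
import random


def bootstrap_ci(scores, n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap confidence interval for the mean of per-page scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)
    n = len(scores)
    # Resample pages with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


if __name__ == "__main__":
    toy_scores = [0.9, 0.95, 0.4, 1.0, 0.85, 0.92, 0.88, 0.97]  # made-up values
    print(bootstrap_ci(toy_scores))
```

Reporting the interval for the baseline and the prior-augmented model separately, or bootstrapping the paired per-page difference, would show whether a lift is distinguishable from sampling noise.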

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and experimental results sections: the central claim that injected DocTags priors alleviate the Hop-1 bottleneck while the full image acts as a reliable fallback when detections are noisy is load-bearing, yet no detector-level metrics (mAP, per-class precision/recall, or failure rate) are reported for the RT-DETR model on the 10k-page structural OOD benchmark or the Chinese OmniDocBench subset. Without these numbers it is impossible to determine whether the observed F1 and TEDS lifts arise primarily from correct priors or from the VLM largely ignoring the priors and falling back to the image.

    Authors: We agree that providing detector-level performance metrics would allow readers to better quantify the reliability of the injected priors and distinguish their contribution from the fallback mechanism. Although the attention analysis in the manuscript demonstrates a bimodal attention shift consistent with selective use of the priors, we will add mAP, precision, recall, and failure rates for the RT-DETR detector evaluated on the 10k-page benchmark and the Chinese OmniDocBench subset in the revised version. This addition will directly address the concern and support the central claim. revision: yes

  2. Referee: [§4] §4 (benchmark construction): the 10k-page structural out-of-distribution benchmark is introduced without details on how the OOD splits were constructed, what distribution shifts were deliberately introduced, or how ground-truth annotations were obtained. This information is necessary to evaluate whether the reported gains generalize beyond the specific test distribution used.

    Authors: We thank the referee for this observation. While some details on the benchmark are provided in §4, we acknowledge that a more explicit description of the OOD split construction, the specific distribution shifts (e.g., novel layout configurations and domain variations), and the ground-truth annotation process would improve clarity and reproducibility. In the revised manuscript, we will expand §4 with these details, including any additional information on how the 10k pages were selected and annotated to ensure the gains are evaluated on truly out-of-distribution data. revision: yes
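
The detector-level precision and recall promised in the first response could be computed along these lines. This is a simplified sketch under stated assumptions: greedy one-to-one matching at a single IoU threshold, class labels must agree, and the tuple layouts are invented for the example. It is not mAP and not the authors' evaluation code.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, iou_threshold: float = 0.5):
    """preds: list of (label, box, score); truths: list of (label, box)."""
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    # Highest-confidence predictions get first pick of ground-truth boxes.
    for label, box, _score in sorted(preds, key=lambda p: -p[2]):
        best, best_iou = None, iou_threshold
        for j, (t_label, t_box) in enumerate(truths):
            if j in matched or t_label != label:
                continue
            overlap = iou(box, t_box)
            if overlap >= best_iou:
                best, best_iou = j, overlap
        if best is not None:
            matched.add(best)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


if __name__ == "__main__":
    truths = [("table", (0, 0, 10, 10)), ("text", (20, 20, 30, 30))]
    preds = [("table", (0, 0, 10, 10), 0.9), ("text", (100, 100, 110, 110), 0.8)]
    print(precision_recall(preds, truths))
```

Reporting these per class on each OOD benchmark, next to the parsing scores, would help separate gains that come from accurate priors from gains that come from the image fallback.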

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivation chain

full rationale

The paper presents an empirical engineering approach: an RT-DETR detector is run to produce layout detections, which are serialized into DocTags and injected into the VLM prompt alongside the full image. Central claims consist of measured performance lifts on external benchmarks (markdown F1 from 0.37 to 0.92 on a 10k-page OOD set, table TEDS from 0.01 to 0.36 on OmniDocBench Chinese subset, reduced infinite-loop failures on ViDoRe V3). No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. The results are therefore self-contained against the reported benchmarks and detector outputs; they do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical effectiveness of an off-the-shelf detector and the assumption that the two-hop bottleneck dominates OOD failures; no new mathematical axioms or free parameters are introduced in the abstract.

axioms (1)
  • Domain assumption: The primary failure mode on unseen layouts is a two-hop bottleneck in which layout classification must succeed before content extraction can proceed.
    Explicitly stated in the first two sentences of the abstract as the attribution for observed breakdowns.

pith-pipeline@v0.9.0 · 5841 in / 1274 out tokens · 46463 ms · 2026-05-20T05:43:54.626761+00:00 · methodology

Reference graph

Works this paper leans on

  1. D. Baviskar, S. Ahirrao, and K. Kotecha. “Multi-Layout Unstructured Invoice Documents Dataset: A Dataset for Template-Free Invoice Processing and Its Evaluation Using AI Approaches”. In: IEEE Access 9 (2021), pp. 101494–101512. DOI: 10.1109/access.2021.3096739
  2. Geewook Kim et al. “OCR-free Document Understanding Transformer”. In: arXiv preprint arXiv:2111.15664 (2022)
  3. Lukas Blecher et al. Nougat: Neural Optical Understanding for Academic Documents. 2023. arXiv: 2308.13418 [cs.LG]. URL: https://arxiv.org/abs/2308.13418
  4. Nikolaos Livathinos et al. Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion. 2025. arXiv: 2501.17887 [cs.AI]. URL: https://arxiv.org/abs/2501.17887
  5. Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts Optical Compression. arXiv: 2510.18234 [cs.CV]. URL: https://arxiv.org/abs/2510.18234
  6. Peng Wang, Shuai Bai, et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. 2024. arXiv: 2409.12191 [cs.CV]. URL: https://arxiv.org/abs/2409.12191
  7. Haoran Wei et al. General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model. arXiv: 2409.01704 [cs.CV]. URL: https://arxiv.org/abs/2409.01704
  8. Ido Cohen et al. Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models. 2026. arXiv: 2412.14133 [cs.CL]. URL: https://arxiv.org/abs/2412.14133
  9. Constantin Venhoff et al. Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval. 2025. arXiv: 2512.03276 [cs.AI]. URL: https://www.arxiv.org/abs/2512.03276
  10. Yian Zhao et al. DETRs Beat YOLOs on Real-time Object Detection. 2024. arXiv: 2304.08069 [cs.CV]. URL: https://arxiv.org/abs/2304.08069
  11. Hao Feng et al. Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting. arXiv: 2505.14059 [cs.CV]. URL: https://arxiv.org/abs/2505.14059
  12. Shuaiqi Duan et al. GLM-OCR Technical Report. 2026. arXiv: 2603.10910 [cs.CL]. URL: https://arxiv.org/abs/2603.10910
  13. Mayank Singh et al. OCR++: A Robust Framework For Information Extraction from Scholarly Articles. 2016. arXiv: 1609.06423 [cs.DL]. URL: https://arxiv.org/abs/1609.06423
  14. Rasha Sinha and Rekha B S. Digitization of Document and Information Extraction using OCR. arXiv: 2506.11156 [cs.CV]. URL: https://arxiv.org/abs/2506.11156
  15. Shruti Rijhwani, Antonios Anastasopoulos, and Graham Neubig. “OCR Post Correction for Endangered Language Texts”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Ed. by Bonnie Webber et al. Online: Association for Computational Linguistics, Nov. 2020, pp. 5931–5942. DOI: 10.18653/v1/2020.emnlp-main.478. UR...
  16. Juan Ramirez-Orta et al. Post-OCR Document Correction with large Ensembles of Character Sequence-to-Sequence Models. 2022. arXiv: 2109.06264 [cs.CL]. URL: https://arxiv.org/abs/2109.06264
  17. Ahmed Nassar et al. SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion. 2025. arXiv: 2503.11576 [cs.CV]. URL: https://arxiv.org/abs/2503.11576
  18. Xu Yiheng et al. “LayoutLM: Pre-training of Text and Layout for Document Image Understanding”. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’20. ACM, Aug. 2020, pp. 1192–1200. DOI: 10.1145/3394486.3403172. URL: http://dx.doi.org/10.1145/3394486.3403172
  19. Yang Xu et al. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. 2022. arXiv: 2012.14740 [cs.CL]. URL: https://arxiv.org/abs/2012.14740
  20. Yupan Huang et al. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. 2022. arXiv: 2204.08387 [cs.AI]. URL: https://arxiv.org/abs/2204.08387
  21. Jean-Baptiste Alayrac et al. Flamingo: a Visual Language Model for Few-Shot Learning. 2022. arXiv: 2204.14198 [cs.CV]. URL: https://arxiv.org/abs/2204.14198
  22. Haotian Liu et al. The Llama 3 Herd of Models. 2024. arXiv: 2407.21783 [cs.AI]. URL: https://arxiv.org/abs/2407.21783
  23. Andrés Marafioti et al. SmolVLM: Redefining small and efficient multimodal models. 2025. arXiv: 2504.05299 [cs.AI]. URL: https://arxiv.org/abs/2504.05299
  24. Zhaoqing Zhu et al. A Simple yet Effective Layout Token in Large Language Models for Document Understanding. 2025. arXiv: 2503.18434 [cs.CV]. URL: https://arxiv.org/abs/2503.18434
  25. Ari Holtzman et al. The Curious Case of Neural Text Degeneration. 2020. arXiv: 1904.09751 [cs.CL]. URL: https://arxiv.org/abs/1904.09751
  26. Michael Tschannen, Alexey Gritsenko, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. 2025. arXiv: 2502.14786 [cs.CV]. URL: https://arxiv.org/abs/2502.14786
  27. Birgit Pfitzmann et al. “DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation”. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD ’22. ACM, Aug. 2022, pp. 3743–3751. DOI: 10.1145/3534678.3539043. URL: http://dx.doi.org/10.1145/3534678.3539043
  28. Nikolaos Livathinos et al. Advanced Layout Analysis Models for Docling. 2025. arXiv: 2509.11720 [cs.CV]. URL: https://arxiv.org/abs/2509.11720
  29. Joseph Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. 2016. arXiv: 1506.02640 [cs.CV]. URL: https://arxiv.org/abs/1506.02640
  30. Maurice Weber et al. WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data. 2023. arXiv: 2312.10188 [cs.LG]. URL: https://arxiv.org/abs/2312.10188
  31. Hugo Laurençon et al. What matters when building vision-language models? 2024. arXiv: 2405.02246 [cs.CV]. URL: https://arxiv.org/abs/2405.02246
  32. Haotian Liu et al. Visual Instruction Tuning. 2023. arXiv: 2304.08485 [cs.CV]. URL: https://arxiv.org/abs/2304.08485
  33. Linke Ouyang et al. OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. 2025. arXiv: 2412.07626 [cs.CV]. URL: https://arxiv.org/abs/2412.07626
  34. António Loison et al. ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios. 2026. arXiv: 2601.08620 [cs.AI]. URL: https://arxiv.org/abs/2601.08620
  35. Ashish Vaswani et al. Attention Is All You Need. 2023. arXiv: 1706.03762 [cs.CL]. URL: https://arxiv.org/abs/1706.03762

Table 4: Training hyperparameters for the proposed two-stage document parser.

  Hyperparameter          Value
  Vision encoder          SigLIP-2
  Language model          Granite-165M
  Multimodal connector    Pixel-shuffle projector / MLP
  Optimizer               AdamW
  Learning-rate sch...