
arxiv: 2605.19309 · v1 · pith:65WMGWVHnew · submitted 2026-05-19 · 💻 cs.CL

How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence

Pith reviewed 2026-05-20 05:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords document layout analysis · robustness evaluation · structural vulnerability · OCR instability · auditing framework · footprint bias · block-level structural loss rate

The pith

Document parsers are more vulnerable to small structurally targeted probes than large area changes, as area size poorly predicts OCR instability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Document layout analysis systems underpin many retrieval and question-answering applications, yet their robustness is usually judged by how much page area a perturbation covers. The paper argues this area-centric view creates a Footprint Bias that misses how changes interact with actual layout blocks. It introduces an output-level auditing approach that measures Block-level Structural Loss Rate instead, showing this tracks perturbation-driven OCR failures far more closely than area does across two parsers and a thousand pages. Exposure descriptors in the framework also distinguish occlusion-driven failures from topology-driven ones. Small probes aimed at structure produce downstream QA and retrieval drops comparable to much larger perturbations.

Core claim

We identify this Footprint Bias and propose a lightweight output-level auditing framework that decouples probe construction, policy-driven targeting, and structure-aware diagnosis. The framework combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze where perturbations interact with layout structure and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, affected area weakly tracks perturbation-induced OCR instability (R^2=0.384/0.110), whereas B-SLR aligns much more closely with it (R^2=0.727/0.916). Exposure descriptors further separate occlusion- and topology-dominant pathways, and small structurally targeted probes cause downstream QA/retrieval degradation comparable to larger-footprint perturbations.

What carries the argument

Block-level Structural Loss Rate (B-SLR), an output-level measure of structural disruption at the block level that better tracks how layout changes drive OCR instability than simple affected area.
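
The page does not reproduce the paper's B-SLR formula, so the following is only an illustrative stand-in for what a block-level, output-level loss could look like. It assumes a hypothetical `Block` record (bounding box plus reading-order index), an IoU matching threshold of 0.5, and a split into "missed" and "reordered" blocks that loosely mirrors the miss/topology decomposition mentioned in the Figure 4 caption. None of these names or thresholds come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block: axis-aligned box plus reading-order index."""
    x0: float
    y0: float
    x1: float
    y1: float
    order: int


def iou(a: Block, b: Block) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def structural_loss(clean: list[Block], perturbed: list[Block],
                    iou_min: float = 0.5) -> dict[str, float]:
    """Assumed definition: share of clean-parse blocks that are missed or reordered."""
    if not clean:
        return {"miss": 0.0, "topo": 0.0, "total": 0.0}
    missed = reordered = 0
    for ref in clean:
        # Best-overlapping block in the parse of the perturbed page, if any.
        best = max(perturbed, key=lambda p: iou(ref, p), default=None)
        if best is None or iou(ref, best) < iou_min:
            missed += 1        # block disappeared or was badly mislocalized
        elif best.order != ref.order:
            reordered += 1     # block survived but its reading order changed
    n = len(clean)
    return {"miss": missed / n, "topo": reordered / n, "total": (missed + reordered) / n}
```

Under this reading, the score depends on how many blocks a probe disturbs rather than on how many pixels it covers, which is the contrast with affected area that the paper draws.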

If this is right

  • B-SLR provides a tighter link to actual OCR instability than area-based metrics across tested parsers.
  • Granularity-aware exposure descriptors distinguish occlusion pathways from topology pathways.
  • Structurally targeted small probes degrade QA and retrieval performance at rates similar to large-footprint changes.
  • Robustness evaluation of document intelligence systems should move from footprint stress tests to structure-aware audits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be applied to additional downstream tasks such as summarization or information extraction to test consistency of the structural signal.
  • System builders might integrate B-SLR-style checks into continuous evaluation pipelines for document parsers to catch layout-specific weaknesses early.
  • The separation of pathways opens the possibility of targeted defenses that address occlusion versus topology failures differently.

Load-bearing premise

The chosen perturbations and 1,000-page test set are representative of real-world structural vulnerabilities, and B-SLR plus pathway attribution capture the relevant failure modes without post-hoc selection or unstated modeling assumptions.

What would settle it

A new test set or perturbation family in which affected area correlates with OCR instability at least as strongly as B-SLR does would falsify the claim that structure-aware auditing is required.
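
The falsification check described above amounts to comparing two R² values computed against the same instability measure. A minimal sketch, assuming one value per perturbation configuration and a simple least-squares line (for which R² equals the squared Pearson correlation); the function names and the `margin` parameter are illustrative, and the paper's exact regression setup is not stated on this page:

```python
def r_squared(x: list[float], y: list[float]) -> float:
    """R^2 of the least-squares line y ~ a + b*x (the squared Pearson r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series explains nothing
    return (sxy * sxy) / (sxx * syy)


def area_matches_structure(area: list[float], structure: list[float],
                           instability: list[float], margin: float = 0.0) -> bool:
    """True if affected area predicts instability at least as well as the structural score."""
    return r_squared(area, instability) + margin >= r_squared(structure, instability)
```

A `True` result on a new test set or perturbation family would be the kind of evidence that counts against the structure-aware claim; a point comparison like this says nothing about sampling uncertainty.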

Figures

Figures reproduced from arXiv: 2605.19309 by Keze Wang, Yihao Wang, Yue Chen, Ziyi Tang.

Figure 1. Footprint Bias in DLA robustness evaluation: (a) a large-area perturbation may cause limited error, while (b) a small structural probe can trigger greater parsing failure. Despite progress on clean benchmarks and realistic evaluation settings, DLA robustness is still commonly assessed through aggregate degradation under corruption. Existing protocols often parameterize perturbation severity by global cor…
Figure 2. Overview of the proposed tripartite vulnerability auditing framework, linking controlled perturbation…
Figure 3. Tests whether affected area is a reliable severity proxy for perturbation-induced OCR instability. It is not: TOR explains CER only weakly on MinerU (R²=0.384) and almost not at all on PP-StructureV3 (R²=0.110). Even within the matched-TOR region, configurations with comparable footprint exhibit a CER spread of roughly 2.7×, showing that footprint alone cannot deter…
Figure 4. Phase 1 pathway decomposition. Bars decompose B-SLR into SLRmiss and SLRtopo; higher bars indicate greater structural loss, and configuration identifiers are decoded in Appendix C.1.
Figure 5. Representative visual examples of the probe families used in the controlled perturbation space.
Figure 6. Phase 2 pathway composition of mean per-image structural loss for each policy (…
Figure 7. Abridged prompt templates for the prompt-based policy variants. All prompts emit the same output…
Original abstract

Document Layout Analysis (DLA) pipelines provide structured page representations for retrieval-augmented generation, long-document question answering, and other document intelligence systems, yet their robustness evaluation remains largely area-centric. We identify this Footprint Bias and propose a lightweight output-level auditing framework that decouples probe construction, policy-driven targeting, and structure-aware diagnosis. The framework combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze where perturbations interact with layout structure and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, affected area weakly tracks perturbation-induced OCR instability (R^2=0.384/0.110), whereas B-SLR aligns much more closely with it (R^2=0.727/0.916). Exposure descriptors further separate occlusion- and topology-dominant pathways, and small structurally targeted probes cause downstream QA/retrieval degradation comparable to larger-footprint perturbations. These results shift DLA robustness evaluation from footprint-based stress testing toward structure-aware vulnerability auditing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies 'Footprint Bias' in current DLA robustness evaluation, which relies on affected area as the primary metric. It introduces a lightweight auditing framework using Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze how perturbations interact with layout structure. On MinerU and PP-StructureV3 across 1,000 pages, the work reports that affected area weakly correlates with perturbation-induced OCR instability (R²=0.384/0.110) while B-SLR correlates substantially better (R²=0.727/0.916); exposure descriptors separate occlusion- and topology-dominant failure pathways, and small targeted probes produce downstream QA/retrieval degradation comparable to larger perturbations.

Significance. If the central correlations and pathway attributions hold under broader testing, the work provides a concrete shift from area-centric to structure-aware robustness auditing for document intelligence systems. The quantitative R² comparisons offer measurable support for preferring B-SLR over footprint metrics, and the downstream task results highlight practical implications for RAG and QA pipelines. The framework's decoupling of probe construction from diagnosis is a useful methodological contribution.

major comments (3)
  1. §4 (Experimental Results): The reported R² values (0.727/0.916 for B-SLR vs. 0.384/0.110 for affected area) are presented without error bars, bootstrap confidence intervals, or statistical tests for the difference in correlations. This is load-bearing for the central claim that B-SLR 'aligns much more closely,' as the practical superiority cannot be assessed without quantifying uncertainty or significance.
  2. §3.2 (Dataset and Perturbation Generation): No information is provided on corpus provenance, sampling strategy, document-type stratification (e.g., scientific papers vs. forms vs. tables), or the precise procedure for generating occlusion and topology perturbations. This directly affects the generalizability of the headline result that B-SLR superiority and pathway separation reflect structural vulnerabilities rather than properties of the chosen 1,000-page distribution.
  3. §4.3 (Downstream Evaluation): The claim that small structurally targeted probes cause 'comparable' QA/retrieval degradation to larger-footprint perturbations lacks quantitative details on effect sizes, variance across runs, or controls that isolate structural targeting from raw perturbation size. This comparison is central to arguing for structure-aware auditing over footprint-based testing.
minor comments (2)
  1. Abstract: The term 'granularity-aware exposure descriptors' is introduced without a one-sentence definition, reducing accessibility for readers outside the immediate subfield.
  2. §3.1 (Framework Definition), Notation: B-SLR is defined as a new metric but its exact formula (e.g., how block-level losses are aggregated and normalized) should be stated explicitly in the main text rather than deferred entirely to an appendix.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the statistical rigor, transparency, and quantitative support in our work. We address each major comment below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: The reported R² values (0.727/0.916 for B-SLR vs. 0.384/0.110 for affected area) are presented without error bars, bootstrap confidence intervals, or statistical tests for the difference in correlations. This is load-bearing for the central claim that B-SLR 'aligns much more closely,' as the practical superiority cannot be assessed without quantifying uncertainty or significance.

    Authors: We agree that uncertainty quantification and significance testing are necessary to substantiate the superiority of B-SLR. In the revised manuscript, we will add bootstrap confidence intervals (via 1,000 resamples) for all reported R² values and apply Steiger's Z-test to assess whether the correlation differences are statistically significant. These results will appear in §4 with corresponding discussion of practical implications. revision: yes

  2. Referee: No information is provided on corpus provenance, sampling strategy, document-type stratification (e.g., scientific papers vs. forms vs. tables), or the precise procedure for generating occlusion and topology perturbations. This directly affects the generalizability of the headline result that B-SLR superiority and pathway separation reflect structural vulnerabilities rather than properties of the chosen 1,000-page distribution.

    Authors: We acknowledge that greater detail on the corpus and perturbation procedures is required for assessing generalizability. The revision will expand §3.2 to specify: corpus sources and provenance; stratified sampling by document category (scientific papers, forms, tables, etc.) and complexity metrics; and the exact generation procedures, including masking ratios for occlusion and structural edit rules for topology perturbations. revision: yes

  3. Referee: The claim that small structurally targeted probes cause 'comparable' QA/retrieval degradation to larger-footprint perturbations lacks quantitative details on effect sizes, variance across runs, or controls that isolate structural targeting from raw perturbation size. This comparison is central to arguing for structure-aware auditing over footprint-based testing.

    Authors: We agree that the downstream results need explicit quantitative backing and controls. In the revised §4.3 we will report effect sizes (e.g., absolute and relative drops in QA F1 and retrieval nDCG), standard deviations across repeated runs, and ablation controls that hold perturbation area constant while varying structural targeting. This will isolate the contribution of layout-aware probes. revision: yes
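
The uncertainty analysis promised in the first response could be sketched roughly as follows. This is a paired percentile bootstrap over perturbation configurations with the 1,000 resamples the authors mention; it stands in for, and does not implement, Steiger's Z-test. The function names, the 95% level, and the fixed seed are assumptions made for illustration.

```python
import random


def r_squared(x: list[float], y: list[float]) -> float:
    """R^2 of the least-squares line y ~ a + b*x (the squared Pearson r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate resample: treat as no explained variance
    return (sxy * sxy) / (sxx * syy)


def bootstrap_r2_difference(area: list[float], structure: list[float],
                            instability: list[float], n_resamples: int = 1000,
                            alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile CI for R^2(structure) - R^2(area), resampling rows jointly."""
    rng = random.Random(seed)
    n = len(instability)
    diffs = []
    for _ in range(n_resamples):
        # Resample configurations with replacement, keeping the three series aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        a = [area[i] for i in idx]
        s = [structure[i] for i in idx]
        y = [instability[i] for i in idx]
        diffs.append(r_squared(s, y) - r_squared(a, y))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

An interval that excludes zero would support the claim that the structural score tracks instability more closely than area on the sampled configurations; resampling rows jointly keeps the two correlations paired, which matters because both share the same outcome variable.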

Circularity Check

0 steps flagged

No significant circularity; empirical correlations are independent measurements

Full rationale

The paper defines B-SLR and exposure descriptors directly from parsed output structure, then reports empirical R² correlations between these metrics and perturbation-induced OCR instability on a fixed 1,000-page test set, contrasting them with the affected-area baseline. These R² values are computed post-experiment from observed data and do not reduce by construction to any fitted parameter or self-referential definition within the same dataset. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the abstract or summary to support the core claims. The derivation chain—metric definition, perturbation application, outcome measurement, and correlation reporting—is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The paper introduces new evaluation concepts and metrics without listing explicit free parameters; it relies on domain assumptions about perturbation realism and metric validity.

axioms (1)
  • domain assumption: Perturbations can be constructed to target layout structure while keeping overall footprint small.
    Invoked to demonstrate that small probes produce comparable downstream degradation.
invented entities (2)
  • Footprint Bias (no independent evidence)
    purpose: Label for the area-centric bias in existing DLA robustness evaluation.
    Newly named concept used to motivate the framework.
  • Block-level Structural Loss Rate (B-SLR) (no independent evidence)
    purpose: Output-level metric for structural vulnerability.
    Core new measurement proposed and validated via R^2 comparison.

pith-pipeline@v0.9.0 · 5716 in / 1476 out tokens · 51782 ms · 2026-05-20T05:58:46.136664+00:00 · methodology

