pith. machine review for the scientific record.

arxiv: 2604.11042 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords: document layout detection · annotation harmonization · vision-language models · object detection · representation learning · dataset inconsistency · bounding box alignment

The pith

A vision-language model reconciles conflicting category labels and bounding-box rules across layout datasets before training, lifting detection F-score from 0.860 to 0.883 and producing more separable embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that mixing document layout datasets without reconciling their incompatible category definitions and spatial rules measurably hurts a detector's performance. It introduces an agentic workflow in which a vision-language model first aligns the taxonomies and adjusts box granularity to produce consistent training examples. After this harmonization step, a pretrained RT-DETRv2 model shows gains on detection F-score, table structure metrics, and spatial overlap, while its post-decoder embeddings become more compact and separable. The work matters because real annotation collections rarely share identical standards, so naive data combination often distorts the learned feature space rather than improving it.

Core claim

Annotation inconsistency between datasets distorts the learned feature space in object detectors; reconciling both semantic mappings and spatial granularity with a vision-language agent before fine-tuning restores compact, separable representations and lifts end-to-end metrics such as F-score and TEDS.
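
TEDS here is the tree-edit-distance-based similarity used to score table structure. The excerpt does not say which TEDS variant the paper reports, so the sketch below is the simplified structure-only form computed with the apted package; the canonical metric of Zhong et al. additionally compares cell text, which this omits.

    from apted import APTED
    from apted.helpers import Tree

    def tree_size(node):
        # Count nodes in an apted helper tree.
        return 1 + sum(tree_size(child) for child in node.children)

    def structural_teds(pred_bracket, gt_bracket):
        # TEDS = 1 - TED / max(|T_pred|, |T_gt|), on bracket-notation trees.
        pred, gt = Tree.from_text(pred_bracket), Tree.from_text(gt_bracket)
        ted = APTED(pred, gt).compute_edit_distance()
        return 1.0 - ted / max(tree_size(pred), tree_size(gt))

    # A prediction that drops one cell from a 2x2 table scores ~0.857.
    print(structural_teds("{table{row{cell}{cell}}{row{cell}}}",
                          "{table{row{cell}{cell}}{row{cell}{cell}}}"))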

What carries the argument

The agentic label harmonization workflow that employs a vision-language model to reconcile category semantics and bounding-box granularity across heterogeneous annotation taxonomies.
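
The excerpt describes the workflow only at this level of abstraction, so the following is a speculative sketch of one harmonization pass; call_vlm, the prompt wording, the target taxonomy, and the reply schema are hypothetical stand-ins, not the authors' implementation.

    import json

    TARGET_TAXONOMY = ["Text", "Title", "Table", "Picture", "List-Item", "Form"]  # illustrative

    def call_vlm(prompt, image_crop):
        # Hypothetical client for any vision-language API; returns a JSON string.
        raise NotImplementedError

    def harmonize_annotation(image, ann, source_taxonomy):
        # Ask the VLM to (a) map the source label into the target taxonomy and
        # (b) decide whether the box granularity should change.
        crop = image.crop(tuple(ann["bbox"]))  # e.g. a PIL.Image region
        prompt = (
            f"This region is labeled '{ann['category']}' under taxonomy "
            f"{source_taxonomy}. Map it to one of {TARGET_TAXONOMY} and say "
            "whether the box should be kept, split, or merged with neighbors. "
            'Reply as JSON: {"category": ..., "box_action": ..., "confidence": ...}'
        )
        reply = json.loads(call_vlm(prompt, crop))
        return {**ann, "category": reply["category"],
                "box_action": reply["box_action"],
                "confidence": reply.get("confidence", 1.0)}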

If this is right

  • Naive mixing of inconsistently annotated datasets degrades a pretrained detector on document layout tasks.
  • Harmonized training yields more compact and separable post-decoder embeddings (a measurement sketch follows this list).
  • Gains appear across detection F-score, table TEDS, and mean bounding-box overlap when taxonomies share only partial overlap.
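
The second bullet maps onto standard measurements (Figures 5 and 6 report per-class silhouette scores and neighborhood purity). A minimal sketch, assuming the post-decoder embeddings and class labels have been exported as arrays; the function name and the choice of k are illustrative.

    import numpy as np
    from sklearn.metrics import silhouette_samples, silhouette_score
    from sklearn.neighbors import NearestNeighbors

    def embedding_separability(X, y, k=10):
        # Global silhouette, per-class silhouette, and k-NN neighborhood purity
        # (fraction of each point's k nearest neighbors sharing its label).
        overall = silhouette_score(X, y)
        per_sample = silhouette_samples(X, y)
        per_class = {c: per_sample[y == c].mean() for c in np.unique(y)}
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
        idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]  # drop self-match
        purity = (y[idx] == y[:, None]).mean()
        return overall, per_class, purity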

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same reconciliation step could be applied to other detection domains where annotation conventions differ, such as scene text or medical imaging.
  • If the vision-language model hallucinates mappings on certain rare categories, performance on those classes might degrade rather than improve.
  • Human review of a small sample of the harmonized labels could serve as a low-cost way to verify or correct the model's outputs before full-scale training; a routing sketch follows this list.
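
The last two bullets point at the same cheap guardrail: route rare classes and low-confidence mappings to a human queue rather than trusting the VLM uniformly. A sketch; the thresholds and the confidence field are assumptions, not anything the paper specifies.

    from collections import Counter

    def select_for_human_review(harmonized, min_class_count=50, min_confidence=0.8):
        # Flag harmonized annotations for audit: rare classes (where one bad
        # mapping moves the class metric) or low VLM self-reported confidence.
        counts = Counter(a["category"] for a in harmonized)
        trusted, needs_review = [], []
        for a in harmonized:
            rare = counts[a["category"]] < min_class_count
            unsure = a.get("confidence", 1.0) < min_confidence
            (needs_review if rare or unsure else trusted).append(a)
        return trusted, needs_review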

Load-bearing premise

The vision-language model can accurately reconcile category mappings and bounding-box granularity without introducing new systematic errors that offset the reported gains.

What would settle it

Running the same harmonization pipeline on a new pair of datasets where the vision-language model visibly mis-maps categories or over- or under-adjusts boxes, then checking whether the downstream F-score and embedding metrics still improve.
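
A cheap proxy for that experiment is to inject controlled corruption into otherwise-accepted harmonized labels and measure how fast the downstream metrics erode; the uniform-remapping corruption model below is an assumption, not the paper's protocol.

    import random

    def corrupt_mappings(annotations, categories, fraction=0.1, seed=0):
        # Randomly remap a fraction of harmonized labels to simulate VLM
        # mis-mappings; retrain and re-evaluate at several fractions to see
        # where the reported gains disappear.
        rng = random.Random(seed)
        corrupted = [dict(a) for a in annotations]
        for a in rng.sample(corrupted, k=int(fraction * len(corrupted))):
            a["category"] = rng.choice([c for c in categories if c != a["category"]])
        return corrupted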

Figures

Figures reproduced from arXiv: 2604.11042 by Crag Wolfe, Renyu Li, Vladimir Kirilenko, Yao You.

Figure 1. Annotation discrepancies: Tables and Forms.
Figure 2. Annotation discrepancies: Paragraphs and List-Items.
Figure 3. Post-decoder embedding space under the three training regimes. Compared with the […]
Figure 4. Pairwise UMAP projections for commonly confused layout categories. Harmonized […]
Figure 5. Per-class silhouette scores for post-decoder representations. Harmonized training […]
Figure 6. Neighborhood purity for post-decoder embeddings using […]
Original abstract

Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16 and 10 category taxonomies share only 8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.
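
The abstract treats mean bounding-box overlap as a spatial-consistency score where lower is better, which suggests pairwise overlap among each page's predicted boxes; since the excerpt doesn't define it precisely, this sketch assumes mean pairwise IoU.

    import itertools

    def iou(a, b):
        # Intersection-over-union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    def mean_box_overlap(boxes):
        # Mean pairwise IoU among one page's predicted boxes; 0.0 means the
        # predicted layout regions are disjoint, which layout detection wants.
        pairs = list(itertools.combinations(boxes, 2))
        return sum(iou(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0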

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that an agentic VLM-based workflow can harmonize conflicting category semantics and bounding-box definitions across heterogeneous document-layout datasets (16- and 10-class taxonomies sharing only 8 direct mappings), thereby preventing degradation from naïve multi-dataset fine-tuning of RT-DETRv2 and yielding measurable gains on SCORE-Bench (detection F-score 0.860→0.883, table TEDS 0.750→0.814, mean box overlap 0.043→0.016) plus more compact post-decoder embeddings.

Significance. If the harmonized labels are verifiably more consistent and accurate than the originals, the method offers a practical route to exploit larger combined corpora without annotation-induced feature distortion, with direct relevance to any detection task that merges sources having non-identical spatial or semantic conventions.

major comments (2)
  1. [§4] §4 (Experiments): the reported metric improvements are presented as direct before/after comparisons without human-expert agreement rates, inter-annotator consistency on harmonized samples, or an ablation isolating VLM mapping/box-adjustment errors; this leaves open the possibility that gains reflect VLM inductive biases rather than resolved inconsistency.
  2. [§3] §3 (Method): no quantitative controls (e.g., prompt-variation ablation, hallucination rate on held-out samples, or statistical significance of the 0.023/0.064/0.027 deltas) are supplied for the agentic workflow, so the central claim that harmonization restores representation structure rests on unverified assumptions about VLM fidelity.
minor comments (1)
  1. [Abstract] Abstract: the escaped quote in “naïve” is a minor typesetting artifact that should be cleaned for final copy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to incorporating additional validation analyses in the revised manuscript to strengthen the empirical support for the harmonization workflow.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): the reported metric improvements are presented as direct before/after comparisons without human-expert agreement rates, inter-annotator consistency on harmonized samples, or an ablation isolating VLM mapping/box-adjustment errors; this leaves open the possibility that gains reflect VLM inductive biases rather than resolved inconsistency.

    Authors: We acknowledge that the experiments section presents metric deltas as direct before/after comparisons and does not yet include human-expert agreement rates, inter-annotator consistency on the harmonized outputs, or a dedicated ablation isolating VLM mapping versus box-adjustment errors. In the revised manuscript we will add a human evaluation on a 100-sample subset of harmonized annotations, reporting Cohen's kappa between VLM outputs and two expert annotators. We will also insert an ablation that runs the full agentic pipeline against a category-mapping-only variant (no box adjustment) to quantify the incremental contribution of each component. While the consistent gains across detection F-score, table TEDS, and embedding compactness already indicate that annotation consistency is the primary driver, these additions will more directly rule out VLM-specific inductive biases (an agreement-measurement sketch follows these responses). revision: yes

  2. Referee: [§3] §3 (Method): no quantitative controls (e.g., prompt-variation ablation, hallucination rate on held-out samples, or statistical significance of the 0.023/0.064/0.027 deltas) are supplied for the agentic workflow, so the central claim that harmonization restores representation structure rests on unverified assumptions about VLM fidelity.

    Authors: We agree that the method section would benefit from explicit quantitative controls on the agentic workflow. In the revision we will add: (i) a prompt-variation ablation testing three distinct prompt phrasings and reporting the resulting variance in category mappings and box adjustments; (ii) a hallucination audit on a 200-sample held-out set with known ground-truth mappings, reporting the fraction of outputs that deviate from the reference; and (iii) statistical significance tests (paired bootstrap with 10,000 resamples) for the three reported deltas. These controls will directly substantiate the fidelity assumptions underlying the claim that harmonization restores representation structure (a bootstrap sketch also follows these responses). revision: yes
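
The agreement audit promised in the first response is a one-liner per annotator pair with scikit-learn; the labels below are toy stand-ins for the 100-region subset, not data from the paper.

    from sklearn.metrics import cohen_kappa_score

    # Toy stand-ins: category labels for the same harmonized regions as
    # judged by the VLM and by two independent expert annotators.
    vlm_labels = ["Table", "Text", "Text", "Form", "Table", "List-Item"]
    expert_a   = ["Table", "Text", "Text", "Form", "Text", "List-Item"]
    expert_b   = ["Table", "Text", "Title", "Form", "Table", "List-Item"]

    print("kappa(VLM, A):", cohen_kappa_score(vlm_labels, expert_a))
    print("kappa(VLM, B):", cohen_kappa_score(vlm_labels, expert_b))
    print("kappa(A, B):", cohen_kappa_score(expert_a, expert_b))  # human ceiling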
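
The paired bootstrap from the second response is likewise a few lines of NumPy: resample documents with replacement and count how often the harmonized-minus-baseline delta fails to stay positive. The per-document scores here are invented solely to show the call shape.

    import numpy as np

    def paired_bootstrap_p(scores_base, scores_harm, n_resamples=10_000, seed=0):
        # One-sided paired bootstrap: p is the fraction of resamples in which
        # the mean harmonized-minus-baseline delta is <= 0.
        rng = np.random.default_rng(seed)
        deltas = np.asarray(scores_harm) - np.asarray(scores_base)
        idx = rng.integers(0, len(deltas), size=(n_resamples, len(deltas)))
        return float((deltas[idx].mean(axis=1) <= 0).mean())

    # Invented per-document F-scores with a +0.023 mean shift.
    base = np.random.default_rng(1).normal(0.860, 0.03, size=200)
    harm = base + np.random.default_rng(2).normal(0.023, 0.02, size=200)
    print("p =", paired_bootstrap_p(base, harm))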

Circularity Check

0 steps flagged

No circularity: empirical before/after comparisons on held-out metrics

full rationale

The paper's central claim rests on an empirical workflow: apply a VLM-based harmonization procedure to reconcile category mappings and bounding-box definitions across two datasets, then fine-tune an RT-DETRv2 detector and measure concrete deltas on SCORE-Bench (F-score, TEDS, mean overlap) plus embedding separability. No equations, fitted parameters, or self-referential definitions appear in the provided text; the reported gains (e.g., F-score 0.860→0.883) are direct experimental outcomes rather than quantities forced by construction from the harmonization inputs. No self-citations are invoked as load-bearing uniqueness theorems, and the evaluation uses standard detection and structure metrics that are independent of the method itself. This is a standard empirical ablation and therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that the VLM performs accurate harmonization; no free parameters, invented entities, or additional axioms are introduced beyond standard supervised fine-tuning.

axioms (1)
  • domain assumption Vision-language models can reliably interpret and reconcile semantic category mappings and bounding-box granularity in document images without introducing systematic bias.
    The entire harmonization workflow depends on this capability of the VLM.

pith-pipeline@v0.9.0 · 5527 in / 1165 out tokens · 50872 ms · 2026-05-10T16:23:44.947772+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 38 canonical work pages · 4 internal anchors

  1. [1]

    MSeg: A Composite Dataset for Multi-domain Semantic Segmentation,

    J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun, “MSeg: A Composite Dataset for Multi-domain Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [Online]. Available: https://arxiv.org/abs/2112.13762

  2. [2]

    Dynamic Supervisor for Cross-dataset Object Detection,

    Z. Chen et al., “Dynamic Supervisor for Cross-dataset Object Detection,” Neurocomputing, 2022, [Online]. Available: https://arxiv.org/abs/2204.00183

  3. [3]

    Transferring Labels to Solve Annotation Mismatches Across Object Detection Datasets,

    Y.-H. Liao, D. Acuna, R. Mahmood, J. Lucas, V. Prabhu, and S. Fidler, “Transferring Labels to Solve Annotation Mismatches Across Object Detection Datasets,” in International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=ChHx5ORqF0

  4. [4]

    Overcoming Catastrophic Forgetting in Neural Networks,

    J. Kirkpatrick et al., “Overcoming Catastrophic Forgetting in Neural Networks,” Proceedings of the National Academy of Sciences (PNAS), vol. 114, no. 13, pp. 3521–3526, 2017, [Online]. Available: https://arxiv.org/abs/1612.00796

  5. [5]

    Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation,

    T. Feng, M. Wang, and H. Yuan, “Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Online]. Available: https://arxiv.org/abs/2204.02136

  6. [6]

    PubLayNet: Largest Dataset Ever for Document Layout Analysis,

    X. Zhong, J. Tang, and A. J. Yepes, “PubLayNet: Largest Dataset Ever for Document Layout Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), 2019. [Online]. Available: https://arxiv.org/abs/1908.07836

  7. [7]

    DocBank: A Benchmark Dataset for Document Layout Analysis,

    M. Li et al., “DocBank: A Benchmark Dataset for Document Layout Analysis,” in International Conference on Computational Linguistics (COLING), 2020. [Online]. Available: https://arxiv.org/abs/2006.01038

  8. [8]

    DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation,

    B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. W. J. Staar, “DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3743–3751. doi: 10.1145/3534678.3539043

  9. [9]

    Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015. [Online]. Available: https://arxiv.org/abs/1506.01497

  10. [10]

    You only look once: Unified, real-time object detection

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [Online]. Available: https://arxiv.org/abs/1506.02640

  11. [11]

    YOLOX: Exceeding YOLO Series in 2021

    Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO Series in 2021,” arXiv preprint arXiv:2107.08430, 2021, [Online]. Available: https://arxiv.org/abs/2107.08430

  12. [12]

    YOLOv8

    G. Jocher, A. Chaurasia, and J. Qiu, “YOLOv8.” [Online]. Available: https://github.com/ultralytics/ultralytics

  13. [13]

    End-to-End Object Detection with Transformers

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2005.12872

  14. [14]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” in International Conference on Learning Representations (ICLR), 2021. [Online]. Available: https://arxiv.org/abs/2010.04159

  15. [15]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection,

    H. Zhang et al., “DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://arxiv.org/abs/2203.03605

  17. [17]

    DETRs Beat YOLOs on Real-time Object Detection

    Y. Zhao et al., “DETRs Beat YOLOs on Real-Time Object Detection,” 2023.

  18. [18]

    RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer

    W. Lv, Y. Zhao, Q. Chang, K. Huang, G. Wang, and Y. Liu, “RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer,” arXiv preprint arXiv:2407.17140, 2024, [Online]. Available: https://arxiv.org/abs/2407.17140

  19. [19]

    VGT: Vision Grid Transformer for Document Layout Analysis,

    C. Da, C. Luo, Q. Zheng, and C. Yao, “VGT: Vision Grid Transformer for Document Layout Analysis,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [Online]. Available: https://arxiv.org/abs/2308.14978

  21. [21]

    DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

    Z. Zhao, H. Kang, B. Wang, and C. He, “DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception,” arXiv preprint arXiv:2410.12628, 2024, [Online]. Available: https://arxiv.org/abs/2410.12628

  22. [22]

    Advanced Layout Analysis Models for Docling,

    N. Livathinos, C. Auer, A. Nassar, et al., “Advanced Layout Analysis Models for Docling,” arXiv preprint arXiv:2509.11720, 2025, [Online]. Available: https://arxiv.org/abs/2509.11720

  23. [23]

    A Survey on Transfer Learning,

    S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010, [Online]. Available: https://ieeexplore.ieee.org/document/5288526

  24. [24]

    Deep Domain Adaptive Object Detection: A Survey,

    W. Li, F. Li, Y. Luo, P. Wang, and J. Sun, “Deep Domain Adaptive Object Detection: A Survey,” in IEEE Symposium Series on Computational Intelligence (SSCI), 2020. [Online]. Available: https://arxiv.org/abs/2002.06797

  25. [25]

    Domain Adaptive Faster R-CNN for Object Detection in the Wild,

    Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain Adaptive Faster R-CNN for Object Detection in the Wild,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [Online]. Available: https://arxiv.org/abs/1803.03243

  26. [26]

    Strong-Weak Distribution Alignment for Adaptive Object Detection,

    K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Strong-Weak Distribution Alignment for Adaptive Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://arxiv.org/abs/1812.04798

  27. [27]

    Every Pixel Matters: Center-Aware Feature Alignment for Domain Adaptive Object Detector,

    C.-C. Hsu, Y.-H. Tsai, Y.-Y. Lin, and M.-H. Yang, “Every Pixel Matters: Center-Aware Feature Alignment for Domain Adaptive Object Detector,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2008.08574

  28. [28]

    Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection,

    G. Zhao, G. Li, R. Xu, and L. Lin, “Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2009.08119

  29. [29]

    Towards Universal Object Detection by Domain Attention,

    X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, “Towards Universal Object Detection by Domain Attention,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://arxiv.org/abs/1904.04402

  30. [30]

    Detecting twenty-thousand classes using image-level supervision

    X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting Twenty-Thousand Classes Using Image-Level Supervision,” in European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://arxiv.org/abs/2201.02605

  31. [31]

    Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets,

    M. Kennerley, A. I. Aviles-Rivero, C.-B. Schönlieb, and R. T. Tan, “Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets,” arXiv preprint arXiv:2506.04737, 2025, [Online]. Available: https://arxiv.org/abs/2506.04737

  32. [32]

    Grounded language-image pre-training

    L. H. Li et al., “Grounded Language-Image Pre-Training,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Online]. Available: https://arxiv.org/abs/2112.03857

  33. [33]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    S. Liu et al., “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection,” arXiv preprint arXiv:2303.05499, 2023, [Online]. Available: https://arxiv.org/abs/2303.05499

  34. [34]

    LSDA: Large Scale Detection Through Adaptation,

    J. Hoffman et al., “LSDA: Large Scale Detection Through Adaptation,” in Advances in Neural Information Processing Systems (NeurIPS), 2014. [Online]. Available: https://arxiv.org/abs/1407.5035

  35. [35]

    Bridging the gap between object and image-level representations for open-vocabulary detection, 2022a

    H. Rasheed, M. Maaz, M. U. Khattak, S. H. Khan, and F. S. Khan, “Bridging the Gap between Object and Image-Level Representations for Open-Vocabulary Detection,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [Online]. Available: https://arxiv.org/abs/2207.03482

  36. [36]

    SFDLA: Source-Free Document Layout Analysis,

    S. Tewes, Y. Chen, O. Moured, J. Zhang, and R. Stiefelhagen, “SFDLA: Source-Free Document Layout Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), 2025. [Online]. Available: https://arxiv.org/abs/2503.18742

  37. [37]

    Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis,

    J. Wang, K. Hu, Z. Zhong, L. Sun, and Q. Huo, “Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis,” Pattern Recognition, 2024, [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0031320324005879

  38. [38]

    LayoutLM: Pre-training of text and layout for document image understanding

    Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou, “LayoutLM: Pre-training of Text and Layout for Document Image Understanding,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020. [Online]. Available: https://arxiv.org/abs/1912.13318

  39. [39]

    DocFormer: End-to-End Transformer for Document Understanding

    S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha, “DocFormer: End-to-End Transformer for Document Understanding,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2021. [Online]. Available: https://arxiv.org/abs/2106.11539

  40. [40]

    DiT: Self-Supervised Pre-Training for Document Image Transformer

    J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang, and F. Wei, “DiT: Self-Supervised Pre-Training for Document Image Transformer,” in ACM International Conference on Multimedia (ACM MM), 2022. [Online]. Available: https://arxiv.org/abs/2203.02378

  41. [41]

    OCR-Free Document Understanding Transformer,

    G. Kim et al., “OCR-Free Document Understanding Transformer,” in European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://arxiv.org/abs/2111.15664

  42. [42]

    LayoutLMv3: Pre-Training for Document AI with Unified Text and Image Masking,

    Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei, “LayoutLMv3: Pre-Training for Document AI with Unified Text and Image Masking,” in ACM International Conference on Multimedia (ACM MM), 2022. [Online]. Available: https://arxiv.org/abs/2204.08387

  43. [43]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford et al., “Learning Transferable Visual Models from Natural Language Supervision,” in International Conference on Machine Learning (ICML), 2021. [Online]. Available: https://arxiv.org/abs/2103.00020

  44. [44]

    mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

    J. Ye et al., “mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding,” arXiv preprint arXiv:2307.02499, 2023, [Online]. Available: https://arxiv.org/abs/2307.02499

  45. [45]

    Vary: Scaling Up the Vision Vocabulary for Large Vision-Language Models

    H. Wei et al., “Vary: Scaling Up the Vision Vocabulary for Large Vision-Language Models,” arXiv preprint arXiv:2312.06109, 2023, [Online]. Available: https://arxiv.org/abs/2312.06109

  46. [46]

    SCORE: A Semantic Evaluation Framework for Generative Document Parsing,

    R. Li, A. Jimeno Yepes, Y. You, K. Pluciński, M. Operlejn, and C. Wolfe, “SCORE: A Semantic Evaluation Framework for Generative Document Parsing,” arXiv preprint arXiv:2509.19345, 2025, [Online]. Available: https://arxiv.org/abs/2509.19345