pith. machine review for the scientific record.

arxiv: 2604.11042 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords: document layout detection · annotation harmonization · vision-language models · object detection · representation learning · dataset inconsistency · bounding box alignment

The pith

A vision-language model reconciles conflicting category labels and bounding-box rules across layout datasets before training, lifting detection F-score from 0.860 to 0.883 and producing more separable embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that mixing document layout datasets without reconciling their incompatible category definitions and spatial rules measurably hurts a detector's performance. It introduces an agentic workflow in which a vision-language model first aligns the taxonomies and adjusts box granularity to produce consistent training examples. After this harmonization step, a pretrained RT-DETRv2 model shows gains on detection F-score, table structure metrics, and spatial overlap, while its post-decoder embeddings become more compact and separable. The work matters because real annotation collections rarely share identical standards, so naive data combination often distorts the learned feature space rather than improving it.

Core claim

Annotation inconsistency between datasets distorts the learned feature space in object detectors; reconciling both semantic mappings and spatial granularity with a vision-language agent before fine-tuning restores compact, separable representations and lifts end-to-end metrics such as F-score and TEDS.
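
TEDS here is the tree-edit-distance-based similarity used to score table structure. The excerpt does not say which TEDS variant the paper reports, so the sketch below is the simplified structure-only form computed with the apted package; the canonical metric of Zhong et al. additionally compares cell text, which this omits.

    from apted import APTED
    from apted.helpers import Tree

    def tree_size(node):
        # Count nodes in an apted helper tree.
        return 1 + sum(tree_size(child) for child in node.children)

    def structural_teds(pred_bracket, gt_bracket):
        # TEDS = 1 - TED / max(|T_pred|, |T_gt|), on bracket-notation trees.
        pred, gt = Tree.from_text(pred_bracket), Tree.from_text(gt_bracket)
        ted = APTED(pred, gt).compute_edit_distance()
        return 1.0 - ted / max(tree_size(pred), tree_size(gt))

    # A prediction that drops one cell from a 2x2 table scores ~0.857.
    print(structural_teds("{table{row{cell}{cell}}{row{cell}}}",
                          "{table{row{cell}{cell}}{row{cell}{cell}}}"))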

What carries the argument

The agentic label harmonization workflow that employs a vision-language model to reconcile category semantics and bounding-box granularity across heterogeneous annotation taxonomies.
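
The excerpt describes the workflow only at this level of abstraction, so the following is a speculative sketch of one harmonization pass; call_vlm, the prompt wording, the target taxonomy, and the reply schema are hypothetical stand-ins, not the authors' implementation.

    import json

    TARGET_TAXONOMY = ["Text", "Title", "Table", "Picture", "List-Item", "Form"]  # illustrative

    def call_vlm(prompt, image_crop):
        # Hypothetical client for any vision-language API; returns a JSON string.
        raise NotImplementedError

    def harmonize_annotation(image, ann, source_taxonomy):
        # Ask the VLM to (a) map the source label into the target taxonomy and
        # (b) decide whether the box granularity should change.
        crop = image.crop(tuple(ann["bbox"]))  # e.g. a PIL.Image region
        prompt = (
            f"This region is labeled '{ann['category']}' under taxonomy "
            f"{source_taxonomy}. Map it to one of {TARGET_TAXONOMY} and say "
            "whether the box should be kept, split, or merged with neighbors. "
            'Reply as JSON: {"category": ..., "box_action": ..., "confidence": ...}'
        )
        reply = json.loads(call_vlm(prompt, crop))
        return {**ann, "category": reply["category"],
                "box_action": reply["box_action"],
                "confidence": reply.get("confidence", 1.0)}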

If this is right

  • Naive mixing of inconsistently annotated datasets degrades a pretrained detector on document layout tasks.
  • Harmonized training yields more compact and separable post-decoder embeddings (a measurement sketch follows this list).
  • Gains appear across detection F-score, table TEDS, and mean bounding-box overlap when taxonomies share only partial overlap.
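
The second bullet maps onto standard measurements (Figures 5 and 6 report per-class silhouette scores and neighborhood purity). A minimal sketch, assuming the post-decoder embeddings and class labels have been exported as arrays; the function name and the choice of k are illustrative.

    import numpy as np
    from sklearn.metrics import silhouette_samples, silhouette_score
    from sklearn.neighbors import NearestNeighbors

    def embedding_separability(X, y, k=10):
        # Global silhouette, per-class silhouette, and k-NN neighborhood purity
        # (fraction of each point's k nearest neighbors sharing its label).
        overall = silhouette_score(X, y)
        per_sample = silhouette_samples(X, y)
        per_class = {c: per_sample[y == c].mean() for c in np.unique(y)}
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
        idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]  # drop self-match
        purity = (y[idx] == y[:, None]).mean()
        return overall, per_class, purity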

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same reconciliation step could be applied to other detection domains where annotation conventions differ, such as scene text or medical imaging.
  • If the vision-language model hallucinates mappings on certain rare categories, performance on those classes might degrade rather than improve.
  • Human review of a small sample of the harmonized labels could serve as a low-cost way to verify or correct the model's outputs before full-scale training; a routing sketch follows this list.
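
The last two bullets point at the same cheap guardrail: route rare classes and low-confidence mappings to a human queue rather than trusting the VLM uniformly. A sketch; the thresholds and the confidence field are assumptions, not anything the paper specifies.

    from collections import Counter

    def select_for_human_review(harmonized, min_class_count=50, min_confidence=0.8):
        # Flag harmonized annotations for audit: rare classes (where one bad
        # mapping moves the class metric) or low VLM self-reported confidence.
        counts = Counter(a["category"] for a in harmonized)
        trusted, needs_review = [], []
        for a in harmonized:
            rare = counts[a["category"]] < min_class_count
            unsure = a.get("confidence", 1.0) < min_confidence
            (needs_review if rare or unsure else trusted).append(a)
        return trusted, needs_review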

Load-bearing premise

The vision-language model can accurately reconcile category mappings and bounding-box granularity without introducing new systematic errors that offset the reported gains.

What would settle it

Running the same harmonization pipeline on a new pair of datasets where the vision-language model visibly mis-maps categories or over- or under-adjusts boxes, then checking whether the downstream F-score and embedding metrics still improve.
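
A cheap proxy for that experiment is to inject controlled corruption into otherwise-accepted harmonized labels and measure how fast the downstream metrics erode; the uniform-remapping corruption model below is an assumption, not the paper's protocol.

    import random

    def corrupt_mappings(annotations, categories, fraction=0.1, seed=0):
        # Randomly remap a fraction of harmonized labels to simulate VLM
        # mis-mappings; retrain and re-evaluate at several fractions to see
        # where the reported gains disappear.
        rng = random.Random(seed)
        corrupted = [dict(a) for a in annotations]
        for a in rng.sample(corrupted, k=int(fraction * len(corrupted))):
            a["category"] = rng.choice([c for c in categories if c != a["category"]])
        return corrupted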

Figures

Figures reproduced from arXiv: 2604.11042 by Crag Wolfe, Renyu Li, Vladimir Kirilenko, Yao You.

Figure 1. Annotation discrepancies: Tables and Forms.
Figure 2. Annotation discrepancies: Paragraphs and List-Items.
Figure 3. Post-decoder embedding space under the three training regimes. Compared with the […]
Figure 4. Pairwise UMAP projections for commonly confused layout categories. Harmonized […]
Figure 5. Per-class silhouette scores for post-decoder representations. Harmonized training […]
Figure 6. Neighborhood purity for post-decoder embeddings using […]
Original abstract

Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16 and 10 category taxonomies share only 8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.
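
The abstract treats mean bounding-box overlap as a spatial-consistency score where lower is better, which suggests pairwise overlap among each page's predicted boxes; since the excerpt doesn't define it precisely, this sketch assumes mean pairwise IoU.

    import itertools

    def iou(a, b):
        # Intersection-over-union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    def mean_box_overlap(boxes):
        # Mean pairwise IoU among one page's predicted boxes; 0.0 means the
        # predicted layout regions are disjoint, which layout detection wants.
        pairs = list(itertools.combinations(boxes, 2))
        return sum(iou(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0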

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that an agentic VLM-based workflow can harmonize conflicting category semantics and bounding-box definitions across heterogeneous document-layout datasets (16- and 10-class taxonomies sharing only 8 direct mappings), thereby preventing degradation from naïve multi-dataset fine-tuning of RT-DETRv2 and yielding measurable gains on SCORE-Bench (detection F-score 0.860→0.883, table TEDS 0.750→0.814, mean box overlap 0.043→0.016) plus more compact post-decoder embeddings.

Significance. If the harmonized labels are verifiably more consistent and accurate than the originals, the method offers a practical route to exploit larger combined corpora without annotation-induced feature distortion, with direct relevance to any detection task that merges sources having non-identical spatial or semantic conventions.

major comments (2)
  1. [§4] §4 (Experiments): the reported metric improvements are presented as direct before/after comparisons without human-expert agreement rates, inter-annotator consistency on harmonized samples, or an ablation isolating VLM mapping/box-adjustment errors; this leaves open the possibility that gains reflect VLM inductive biases rather than resolved inconsistency.
  2. [§3] §3 (Method): no quantitative controls (e.g., prompt-variation ablation, hallucination rate on held-out samples, or statistical significance of the 0.023/0.064/0.027 deltas) are supplied for the agentic workflow, so the central claim that harmonization restores representation structure rests on unverified assumptions about VLM fidelity.
minor comments (1)
  1. [Abstract] Abstract: the escaped quote in “naïve” is a minor typesetting artifact that should be cleaned for final copy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to incorporating additional validation analyses in the revised manuscript to strengthen the empirical support for the harmonization workflow.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): the reported metric improvements are presented as direct before/after comparisons without human-expert agreement rates, inter-annotator consistency on harmonized samples, or an ablation isolating VLM mapping/box-adjustment errors; this leaves open the possibility that gains reflect VLM inductive biases rather than resolved inconsistency.

    Authors: We acknowledge that the experiments section presents metric deltas as direct before/after comparisons and does not yet include human-expert agreement rates, inter-annotator consistency on the harmonized outputs, or a dedicated ablation isolating VLM mapping versus box-adjustment errors. In the revised manuscript we will add a human evaluation on a 100-sample subset of harmonized annotations, reporting Cohen's kappa between VLM outputs and two expert annotators. We will also insert an ablation that runs the full agentic pipeline against a category-mapping-only variant (no box adjustment) to quantify the incremental contribution of each component. While the consistent gains across detection F-score, table TEDS, and embedding compactness already indicate that annotation consistency is the primary driver, these additions will more directly rule out VLM-specific inductive biases (an agreement-measurement sketch follows these responses). revision: yes

  2. Referee: [§3] §3 (Method): no quantitative controls (e.g., prompt-variation ablation, hallucination rate on held-out samples, or statistical significance of the 0.023/0.064/0.027 deltas) are supplied for the agentic workflow, so the central claim that harmonization restores representation structure rests on unverified assumptions about VLM fidelity.

    Authors: We agree that the method section would benefit from explicit quantitative controls on the agentic workflow. In the revision we will add: (i) a prompt-variation ablation testing three distinct prompt phrasings and reporting the resulting variance in category mappings and box adjustments; (ii) a hallucination audit on a 200-sample held-out set with known ground-truth mappings, reporting the fraction of outputs that deviate from the reference; and (iii) statistical significance tests (paired bootstrap with 10,000 resamples) for the three reported deltas. These controls will directly substantiate the fidelity assumptions underlying the claim that harmonization restores representation structure (a bootstrap sketch also follows these responses). revision: yes
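
The agreement audit promised in the first response is a one-liner per annotator pair with scikit-learn; the labels below are toy stand-ins for the 100-region subset, not data from the paper.

    from sklearn.metrics import cohen_kappa_score

    # Toy stand-ins: category labels for the same harmonized regions as
    # judged by the VLM and by two independent expert annotators.
    vlm_labels = ["Table", "Text", "Text", "Form", "Table", "List-Item"]
    expert_a   = ["Table", "Text", "Text", "Form", "Text", "List-Item"]
    expert_b   = ["Table", "Text", "Title", "Form", "Table", "List-Item"]

    print("kappa(VLM, A):", cohen_kappa_score(vlm_labels, expert_a))
    print("kappa(VLM, B):", cohen_kappa_score(vlm_labels, expert_b))
    print("kappa(A, B):", cohen_kappa_score(expert_a, expert_b))  # human ceiling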
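
The paired bootstrap from the second response is likewise a few lines of NumPy: resample documents with replacement and count how often the harmonized-minus-baseline delta fails to stay positive. The per-document scores here are invented solely to show the call shape.

    import numpy as np

    def paired_bootstrap_p(scores_base, scores_harm, n_resamples=10_000, seed=0):
        # One-sided paired bootstrap: p is the fraction of resamples in which
        # the mean harmonized-minus-baseline delta is <= 0.
        rng = np.random.default_rng(seed)
        deltas = np.asarray(scores_harm) - np.asarray(scores_base)
        idx = rng.integers(0, len(deltas), size=(n_resamples, len(deltas)))
        return float((deltas[idx].mean(axis=1) <= 0).mean())

    # Invented per-document F-scores with a +0.023 mean shift.
    base = np.random.default_rng(1).normal(0.860, 0.03, size=200)
    harm = base + np.random.default_rng(2).normal(0.023, 0.02, size=200)
    print("p =", paired_bootstrap_p(base, harm))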

Circularity Check

0 steps flagged

No circularity: empirical before/after comparisons on held-out metrics

full rationale

The paper's central claim rests on an empirical workflow: apply a VLM-based harmonization procedure to reconcile category mappings and bounding-box definitions across two datasets, then fine-tune an RT-DETRv2 detector and measure concrete deltas on SCORE-Bench (F-score, TEDS, mean overlap) plus embedding separability. No equations, fitted parameters, or self-referential definitions appear in the provided text; the reported gains (e.g., F-score 0.860→0.883) are direct experimental outcomes rather than quantities forced by construction from the harmonization inputs. No self-citations are invoked as load-bearing uniqueness theorems, and the evaluation uses standard detection and structure metrics that are independent of the method itself. This is a standard empirical ablation and therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that the VLM performs accurate harmonization; no free parameters, invented entities, or additional axioms are introduced beyond standard supervised fine-tuning.

axioms (1)
  • domain assumption Vision-language models can reliably interpret and reconcile semantic category mappings and bounding-box granularity in document images without introducing systematic bias.
    The entire harmonization workflow depends on this capability of the VLM.

pith-pipeline@v0.9.0 · 5527 in / 1165 out tokens · 50872 ms · 2026-05-10T16:23:44.947772+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 38 canonical work pages · 4 internal anchors

  1. [1]

    MSeg: A Composite Dataset for Multi-domain Semantic Segmentation,

    J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun, “MSeg: A Composite Dataset for Multi-domain Semantic Segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. [Online]. Available: https://arxiv.org/abs/2112.13762

  2. [2]

    Dynamic Supervisor for Cross-dataset Object Detection,

    Z. Chen et al., “Dynamic Supervisor for Cross-dataset Object Detection,” Neurocomputing, 2022, [Online]. Available: https://arxiv.org/abs/2204.00183

  3. [3]

    Transferring Labels to Solve Annotation Mismatches Across Object Detection Datasets,

    Y.-H. Liao, D. Acuna, R. Mahmood, J. Lucas, V. Prabhu, and S. Fidler, “Transferring Labels to Solve Annotation Mismatches Across Object Detection Datasets,” in International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=ChHx5ORqF0

  4. [4]

    Overcoming Catastrophic Forgetting in Neural Networks,

    J. Kirkpatrick et al., “Overcoming Catastrophic Forgetting in Neural Networks,” Proceedings of the National Academy of Sciences (PNAS), vol. 114, no. 13, pp. 3521–3526, 2017, [Online]. Available: https://arxiv.org/abs/1612.00796

  5. [5]

    Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation,

    T. Feng, M. Wang, and H. Yuan, “Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Online]. Available: https://arxiv.org/abs/2204.02136

  6. [6]

    PubLayNet: Largest Dataset Ever for Document Layout Analysis,

    X. Zhong, J. Tang, and A. J. Yepes, “PubLayNet: Largest Dataset Ever for Document Layout Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), 2019. [Online]. Available: https://arxiv.org/abs/1908.07836

  7. [7]

    DocBank: A Benchmark Dataset for Document Layout Analysis,

    M. Li et al., “DocBank: A Benchmark Dataset for Document Layout Analysis,” in International Conference on Computational Linguistics (COLING), 2020. [Online]. Available: https://arxiv.org/abs/2006.01038

  8. [8]

    DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation,

    B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. W. J. Staar, “DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3743–3751. doi: 10.1145/3534678.3539043

  9. [9]

    Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015. [Online]. Available: https://arxiv.org/abs/1506.01497

  10. [10]

    You only look once: Unified, real-time object detection

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [Online]. Available: https://arxiv.org/abs/1506.02640

  11. [11]

    YOLOX: Exceeding YOLO Series in 2021

    Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO Series in 2021,” arXiv preprint arXiv:2107.08430, 2021, [Online]. Available: https://arxiv.org/abs/2107.08430

  12. [12]

    YOLOv8

    G. Jocher, A. Chaurasia, and J. Qiu, “YOLOv8.” [Online]. Available: https://github.com/ultralytics/ultralytics

  13. [13]

    End-to-End Object Detection with Transformers

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2005.12872

  14. [14]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” in International Conference on Learning Representations (ICLR), 2021. [Online]. Available: https://arxiv.org/abs/2010.04159

  15. [15]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection,

    H. Zhang et al., “DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://arxiv.org/abs/2203.03605

  17. [17]

    DETRs Beat YOLOs on Real-time Object Detection

    Y. Zhao et al., “DETRs Beat YOLOs on Real-Time Object Detection,” 2023.

  18. [18]

    RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer

    W. Lv, Y. Zhao, Q. Chang, K. Huang, G. Wang, and Y. Liu, “RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer,” arXiv preprint arXiv:2407.17140, 2024, [Online]. Available: https://arxiv.org/abs/2407.17140

  19. [19]

    VGT: Vision Grid Transformer for Document Layout Analysis,

    C. Da, C. Luo, Q. Zheng, and C. Yao, “VGT: Vision Grid Transformer for Document Layout Analysis,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [Online]. Available: https://arxiv.org/abs/2308.14978

  21. [21]

    DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

    Z. Zhao, H. Kang, B. Wang, and C. He, “DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception,” arXiv preprint arXiv:2410.12628, 2024, [Online]. Available: https://arxiv.org/abs/2410.12628

  22. [22]

    Advanced Layout Analysis Models for Docling,

    N. Livathinos, C. Auer, A. Nassar, et al., “Advanced Layout Analysis Models for Docling,” arXiv preprint arXiv:2509.11720, 2025, [Online]. Available: https://arxiv.org/abs/2509.11720

  23. [23]

    A Survey on Transfer Learning,

    S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010, [Online]. Available: https://ieeexplore.ieee.org/document/5288526

  24. [24]

    Deep Domain Adaptive Object Detection: A Survey,

    W. Li, F. Li, Y. Luo, P. Wang, and J. Sun, “Deep Domain Adaptive Object Detection: A Survey,” in IEEE Symposium Series on Computational Intelligence (SSCI), 2020. [Online]. Available: https://arxiv.org/abs/2002.06797

  25. [25]

    Domain Adaptive Faster R-CNN for Object Detection in the Wild,

    Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain Adaptive Faster R-CNN for Object Detection in the Wild,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [Online]. Available: https://arxiv.org/abs/1803.03243

  26. [26]

    Strong-Weak Distribution Alignment for Adaptive Object Detection,

    K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Strong-Weak Distribution Alignment for Adaptive Object Detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://arxiv.org/abs/1812.04798

  27. [27]

    Every Pixel Matters: Center-Aware Feature Alignment for Domain Adaptive Object Detector,

    C.-C. Hsu, Y.-H. Tsai, Y.-Y. Lin, and M.-H. Yang, “Every Pixel Matters: Center-Aware Feature Alignment for Domain Adaptive Object Detector,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2008.08574

  28. [28]

    Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection,

    G. Zhao, G. Li, R. Xu, and L. Lin, “Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection,” in European Conference on Computer Vision (ECCV), 2020. [Online]. Available: https://arxiv.org/abs/2009.08119

  29. [29]

    Towards Universal Object Detection by Domain Attention,

    X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, “Towards Universal Object Detection by Domain Attention,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Available: https://arxiv.org/abs/1904.04402

  30. [30]

    Detecting twenty-thousand classes using image-level supervision

    X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting Twenty-Thousand Classes Using Image-Level Supervision,” in European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://arxiv.org/abs/2201.02605

  31. [31]

    Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets,

    M. Kennerley, A. I. Aviles-Rivero, C.-B. Schönlieb, and R. T. Tan, “Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets,” arXiv preprint arXiv:2506.04737, 2025, [Online]. Available: https://arxiv.org/abs/2506.04737

  32. [32]

    Grounded language-image pre-training

    L. H. Li et al., “Grounded Language-Image Pre-Training,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Online]. Available: https://arxiv.org/abs/2112.03857

  33. [33]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    S. Liu et al., “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection,” arXiv preprint arXiv:2303.05499, 2023, [Online]. Available: https://arxiv.org/abs/2303.05499

  34. [34]

    LSDA: Large Scale Detection Through Adaptation,

    J. Hoffman et al., “LSDA: Large Scale Detection Through Adaptation,” in Advances in Neural Information Processing Systems (NeurIPS), 2014. [Online]. Available: https://arxiv.org/abs/1407.5035

  35. [35]

    Bridging the gap between object and image-level representations for open-vocabulary detection, 2022a

    H. Rasheed, M. Maaz, M. U. Khattak, S. H. Khan, and F. S. Khan, “Bridging the Gap between Object and Image-Level Representations for Open-Vocabulary Detection,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [Online]. Available: https://arxiv.org/abs/2207.03482

  36. [36]

    SFDLA: Source-Free Document Layout Analysis,

    S. Tewes, Y. Chen, O. Moured, J. Zhang, and R. Stiefelhagen, “SFDLA: Source-Free Document Layout Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), 2025. [Online]. Available: https://arxiv.org/abs/2503.18742

  37. [37]

    Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis,

    J. Wang, K. Hu, Z. Zhong, L. Sun, and Q. Huo, “Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis,” Pattern Recognition, 2024, [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0031320324005879

  38. [38]

    LayoutLM: Pre-training of text and layout for document image understanding

    Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou, “LayoutLM: Pre-training of Text and Layout for Document Image Understanding,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020. [Online]. Available: https://arxiv.org/abs/1912.13318

  39. [39]

    DocFormer: End-to-End Transformer for Document Understanding

    S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha, “DocFormer: End-to-End Transformer for Document Understanding,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2021. [Online]. Available: https://arxiv.org/abs/2106.11539

  40. [40]

    DiT: Self-Supervised Pre-Training for Document Image Transformer

    J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang, and F. Wei, “DiT: Self-Supervised Pre-Training for Document Image Transformer,” in ACM International Conference on Multimedia (ACM MM), 2022. [Online]. Available: https://arxiv.org/abs/2203.02378

  41. [41]

    OCR-Free Document Understanding Transformer,

    G. Kim et al., “OCR-Free Document Understanding Transformer,” in European Conference on Computer Vision (ECCV), 2022. [Online]. Available: https://arxiv.org/abs/2111.15664

  42. [42]

    LayoutLMv3: Pre-Training for Document AI with Unified Text and Image Masking,

    Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei, “LayoutLMv3: Pre-Training for Document AI with Unified Text and Image Masking,” in ACM International Conference on Multimedia (ACM MM), 2022. [Online]. Available: https://arxiv.org/abs/2204.08387

  43. [43]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford et al., “Learning Transferable Visual Models from Natural Language Supervision,” in International Conference on Machine Learning (ICML), 2021. [Online]. Available: https://arxiv.org/abs/2103.00020

  44. [44]

    mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

    J. Ye et al., “mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding,” arXiv preprint arXiv:2307.02499, 2023, [Online]. Available: https://arxiv.org/abs/2307.02499

  45. [45]

    Vary: Scaling Up the Vision Vocabulary for Large Vision-Language Models

    H. Wei et al., “Vary: Scaling Up the Vision Vocabulary for Large Vision-Language Models,” arXiv preprint arXiv:2312.06109, 2023, [Online]. Available: https://arxiv.org/abs/2312.06109

  46. [46]

    SCORE: A Semantic Evaluation Framework for Generative Document Parsing,

    R. Li, A. Jimeno Yepes, Y. You, K. Pluciński, M. Operlejn, and C. Wolfe, “SCORE: A Semantic Evaluation Framework for Generative Document Parsing,” arXiv preprint arXiv:2509.19345, 2025, [Online]. Available: https://arxiv.org/abs/2509.19345