pith. machine review for the scientific record.

arxiv: 2603.23885 · v3 · submitted 2026-03-25 · 💻 cs.CV

Recognition: no theorem link

Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords: document parsing · multimodal large language models · scene synthesis · end-to-end parsing · synthetic data generation · structure-aware training · document understanding · real-world robustness

The pith

Composing layout templates with diverse document elements and applying structure-focused training lets 1B-parameter MLLMs parse real-world documents more accurately and stably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end document parsing with multimodal large language models often produces repetitive or inconsistent outputs because large-scale full-page training data is scarce and existing strategies ignore structural constraints. The paper introduces a paired data-training approach that first generates synthetic full-page examples by assembling layout templates with rich document content and then trains the model with progressive stages plus explicit structure-token adjustments. This combination is evaluated on a new benchmark of casually captured real documents, yielding higher accuracy across both clean scanned inputs and noisy real-world photos when embedded in a 1B-parameter model.

Core claim

Realistic Scene Synthesis constructs large-scale structurally diverse full-page supervision by composing layout templates with rich document elements, while the Document-Aware Training Recipe applies progressive learning and structure-token optimization to raise structural fidelity and decoding stability, enabling superior accuracy and robustness on both scanned and casually captured documents inside a 1B-parameter MLLM.
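
To make "structure-token optimization" concrete, here is a minimal, editorial sketch of one plausible reading: a token-level cross-entropy that upweights tokens carrying document structure (table tags, heading markers) so the decoder is penalized more for breaking output format. The token ids, the 2x weight, and the loss form are assumptions for illustration, not the paper's released recipe.

```python
# Editorial sketch: weighted cross-entropy emphasizing structure tokens.
# Token ids and the weight below are hypothetical placeholders.
import torch
import torch.nn.functional as F

STRUCTURE_TOKEN_IDS = (101, 102, 103)  # hypothetical ids for <table>, </table>, <tr>, ...
STRUCTURE_WEIGHT = 2.0                 # assumed extra emphasis on structure tokens

def structure_weighted_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab) decoder outputs; targets: (batch, seq) token ids."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    weights = torch.ones_like(per_token)
    for tok in STRUCTURE_TOKEN_IDS:
        weights = torch.where(targets == tok, weights * STRUCTURE_WEIGHT, weights)
    return (weights * per_token).mean()

# Toy usage on random tensors
loss = structure_weighted_loss(torch.randn(2, 8, 500), torch.randint(0, 500, (2, 8)))
```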

What carries the argument

Realistic Scene Synthesis that assembles layout templates with varied document elements to produce full-page end-to-end training data, combined with Document-Aware Training that uses progressive learning stages and structure-token optimization to enforce output consistency.
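
A minimal sketch of the data-side idea under the assumptions that element repositories are keyed by type and templates carry boxes with a reading order; the Region fields, repository contents, and the render/augment hand-off are hypothetical illustrations, not the authors' pipeline.

```python
# Editorial sketch of template composition: sample atomic elements from
# type-keyed repositories and place them into a layout template with a
# reading order, emitting page-level supervision.
import random
from dataclasses import dataclass

@dataclass
class Region:
    kind: str            # e.g. "title", "paragraph", "table", "formula"
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    reading_order: int

def compose_page(template, repositories, rng=random):
    """template: list[Region]; repositories: dict mapping kind -> list of contents."""
    annotation = []
    for region in sorted(template, key=lambda r: r.reading_order):
        pool = repositories.get(region.kind, [])
        if not pool:
            continue  # skip element types the repository does not cover
        annotation.append({
            "type": region.kind,
            "bbox": region.bbox,
            "order": region.reading_order,
            "content": rng.choice(pool),
        })
    # Downstream (not shown): render the annotation to an image, then apply
    # capture-aware augmentation to simulate a casual photo of the page.
    return annotation

# Toy usage with a two-region template
template = [Region("title", (50, 40, 550, 90), 0),
            Region("paragraph", (50, 110, 550, 700), 1)]
repos = {"title": ["Quarterly Report"], "paragraph": ["Lorem ipsum dolor sit amet ..."]}
print(compose_page(template, repos))
```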

If this is right

  • The resulting model produces fewer hallucinations and more structurally consistent outputs on full-page inputs.
  • Performance improves simultaneously on clean scanned documents and on noisy real-world photographs.
  • A single 1B-parameter MLLM becomes competitive for both digital and casually captured document parsing tasks.
  • Releasing the synthesis pipeline and Wild-OmniDocBench benchmark supplies new resources for further end-to-end training research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same synthesis method could be extended to generate training pairs for other structured-output multimodal tasks such as table extraction or form understanding.
  • If the distribution match holds, similar template-composition techniques might reduce the need for manual annotation in other document-related vision-language benchmarks.
  • Progressive structure-token training may transfer to non-document domains where output format consistency is critical, such as code generation from images.

Load-bearing premise

Composing layout templates with document elements will generate synthetic data whose noise and variability distribution closely matches that of real-world casually captured documents.
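
One way to probe this premise directly is a distributional comparison between synthetic and real page layouts, for instance a 1-D Wasserstein distance over bounding-box statistics (a fuller check would also cover noise and perceptual metrics). The sketch below is editorial: loading annotations from the synthesis pipeline and from Wild-OmniDocBench is left out, and only scipy.stats.wasserstein_distance is a real dependency.

```python
# Editorial sketch: Wasserstein distances over element-box centers and areas
# between a synthetic layout sample and a real-capture layout sample.
import numpy as np
from scipy.stats import wasserstein_distance

def layout_shift(synthetic_boxes, real_boxes):
    """Each input is an (N, 4) array of normalized (x0, y0, x1, y1) boxes."""
    def stats(boxes):
        cx = (boxes[:, 0] + boxes[:, 2]) / 2
        cy = (boxes[:, 1] + boxes[:, 3]) / 2
        area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return cx, cy, area
    s_cx, s_cy, s_area = stats(synthetic_boxes)
    r_cx, r_cy, r_area = stats(real_boxes)
    return {
        "center_x": wasserstein_distance(s_cx, r_cx),
        "center_y": wasserstein_distance(s_cy, r_cy),
        "area": wasserstein_distance(s_area, r_area),
    }

# Toy check with random boxes (sorted rows guarantee x1 >= x0 and y1 >= y0)
rng = np.random.default_rng(0)
fake = np.sort(rng.uniform(0, 1, size=(200, 4)), axis=1)
real = np.sort(rng.uniform(0, 1, size=(200, 4)), axis=1)
print(layout_shift(fake, real))
```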

What would settle it

Training the same 1B-parameter MLLM on existing public document datasets instead of the new synthetic data and observing equal or better accuracy on the Wild-OmniDocBench real-world test set would falsify the central effectiveness claim.

Figures

Figures reproduced from arXiv: 2603.23885 by Can Ma, Chengquan Zhang, Gangyan Zeng, Gengluo Li, Han Hu, Huawen Shen, Liang Wu, Pengyuan Lyu, Xingyu Wan, Yu Zhou.

Figure 1: Overall Performance and Degradation from Om…

Figure 2: Scanned/Digital and Real-World Capture. On scanned/digital pages, both modular and E2E parsers decode correctly. Under real-world capture, modular cascades accumulate layout-analysis errors that propagate to element parsing (extra/missing regions), while generic end-to-end models exhibit repetitive outputs.

Figure 3: Overview of Realistic Scene Synthesis. Left: repositories of atomic elements and layout templates with reading order. Right: a synthesis pipeline that composes sampled elements into templates under spatial/structural constraints to produce page-level annotations, followed by capture-aware augmentation to simulate real-world images.

Figure 4: Wild-OmniDocBench Construction. We convert scanned pages into real-world-captured images by (i) printing, deforming, and photographing under varied lighting, and (ii) displaying on screens and re-shooting to induce moiré and reflections.
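
A toy rendition of the capture-aware augmentation shown in Figures 3 and 4: perspective warp for an off-axis handheld shot, a horizontal brightness gradient for uneven lighting, and Gaussian sensor noise applied to a rendered page. Parameter ranges and the OpenCV-based implementation are illustrative assumptions, not the paper's settings.

```python
# Editorial sketch of capture-aware augmentation on a rendered synthetic page.
import numpy as np
import cv2

def capture_aware_augment(page, rng):
    """page: HxWx3 uint8 image of a rendered page; rng: np.random.Generator."""
    h, w = page.shape[:2]
    jitter = 0.05 * min(h, w)
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-jitter, jitter, size=(4, 2)).astype(np.float32)
    warped = cv2.warpPerspective(page, cv2.getPerspectiveTransform(src, dst), (w, h),
                                 borderValue=(128, 128, 128))
    gradient = np.linspace(rng.uniform(0.7, 1.0), rng.uniform(1.0, 1.3), w)
    lit = warped.astype(np.float32) * gradient[None, :, None]
    noisy = lit + rng.normal(0, 6.0, size=lit.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Usage (hypothetical file name):
# augmented = capture_aware_augment(cv2.imread("synthetic_page.png"), np.random.default_rng(0))
```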
read the original abstract

Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions - primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a data-training co-design framework for robust end-to-end document parsing using MLLMs. It proposes Realistic Scene Synthesis to generate large-scale, structurally diverse full-page supervision data by composing layout templates with rich document elements, paired with a Document-Aware Training Recipe using progressive learning and structure-token optimization. The work also introduces the Wild-OmniDocBench benchmark derived from real-world captured documents and reports that integration into a 1B-parameter MLLM yields superior accuracy and robustness on both scanned/digital and casually captured real-world scenarios, with all models, pipelines, and benchmarks to be publicly released.

Significance. If the empirical claims are substantiated, the framework could meaningfully advance real-world document parsing by mitigating data scarcity and structural inconsistencies in MLLM outputs, offering a scalable alternative to cascaded pipelines that fail under non-standard capture conditions. The public release of synthesis code, training recipes, and the new benchmark would provide reusable resources for the community.

major comments (2)
  1. [Abstract] The headline claim of 'superior accuracy and robustness' across scanned/digital and real-world scenarios is presented without any quantitative metrics, baselines, error bars, or statistical details; this claim is load-bearing for the central contribution, yet the abstract supplies no evidence to support the stated improvements.
  2. [§3, Realistic Scene Synthesis] The procedure for composing layout templates is described as producing data whose variability and noise match real-world captured documents, yet no quantitative distributional comparison (e.g., Wasserstein distance on element positions/sizes, noise histogram overlap, or perceptual metrics) is reported between the generated images and the Wild-OmniDocBench real captures; this gap directly affects the validity of the robustness claim on casually photographed documents.
minor comments (1)
  1. [Abstract] The statement that 'all models, data synthesis pipelines, and benchmarks will be publicly released' lacks any timeline, repository link, or licensing details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our claims where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The headline claim of 'superior accuracy and robustness' across scanned/digital and real-world scenarios is presented without any quantitative metrics, baselines, error bars, or statistical details; this claim is load-bearing for the central contribution, yet the abstract supplies no evidence to support the stated improvements.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the headline claim. In the revised version, we will update the abstract to include key metrics such as the absolute accuracy improvements (e.g., +X% on Wild-OmniDocBench and +Y% on digital benchmarks) relative to strong baselines, along with a brief note on the evaluation protocol and error analysis detailed in the main body and tables. revision: yes

  2. Referee: [§3, Realistic Scene Synthesis] The procedure for composing layout templates is described as producing data whose variability and noise match real-world captured documents, yet no quantitative distributional comparison (e.g., Wasserstein distance on element positions/sizes, noise histogram overlap, or perceptual metrics) is reported between the generated images and the Wild-OmniDocBench real captures; this gap directly affects the validity of the robustness claim on casually photographed documents.

    Authors: We acknowledge that a direct quantitative distributional comparison would further substantiate the synthesis procedure. While the manuscript currently validates the approach via downstream performance on the real-world Wild-OmniDocBench, we will add in the revision a quantitative analysis in §3, including Wasserstein distances on element position/size distributions, noise histogram overlaps, and perceptual metrics (e.g., LPIPS or FID) between the synthetic data and real captures from the benchmark. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical synthesis and benchmark evaluation

full rationale

The manuscript describes a data-training co-design: Realistic Scene Synthesis via template composition with document elements, plus a Document-Aware Training Recipe using progressive learning and structure-token optimization. No equations, parameter fittings, or derivations appear. Claims rest on training an external 1B-parameter MLLM and reporting accuracy on scanned/digital data plus the newly introduced Wild-OmniDocBench (real captures). No self-citation is invoked to justify uniqueness or to close a definitional loop; the distributional match between synthetic and real data is asserted but not reduced to a fitted quantity or self-referential definition. The result is therefore self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard assumptions in multimodal LLM training and synthetic data generation; no explicit free parameters, axioms, or invented entities are introduced beyond the new benchmark and strategies described.

pith-pipeline@v0.9.0 · 5538 in / 1118 out tokens · 55796 ms · 2026-05-15T01:11:21.425765+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
