pith. machine review for the scientific record.

arxiv: 2603.23885 · v3 · submitted 2026-03-25 · 💻 cs.CV

Recognition: no theorem link

Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords: document parsing · multimodal large language models · scene synthesis · end-to-end parsing · synthetic data generation · structure-aware training · document understanding · real-world robustness

The pith

Composing layout templates with diverse document elements and applying structure-focused training lets 1B-parameter MLLMs parse real-world documents more accurately and stably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end document parsing with multimodal large language models often produces repetitive or inconsistent outputs because large-scale full-page training data is scarce and existing strategies ignore structural constraints. The paper introduces a paired data-training approach that first generates synthetic full-page examples by assembling layout templates with rich document content and then trains the model with progressive stages plus explicit structure-token adjustments. This combination is evaluated on a new benchmark of casually captured real documents, yielding higher accuracy across both clean scanned inputs and noisy real-world photos when embedded in a 1B-parameter model.

Core claim

Realistic Scene Synthesis constructs large-scale structurally diverse full-page supervision by composing layout templates with rich document elements, while the Document-Aware Training Recipe applies progressive learning and structure-token optimization to raise structural fidelity and decoding stability, enabling superior accuracy and robustness on both scanned and casually captured documents inside a 1B-parameter MLLM.
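
To make "structure-token optimization" concrete, here is a minimal, editorial sketch of one plausible reading: a token-level cross-entropy that upweights tokens carrying document structure (table tags, heading markers) so the decoder is penalized more for breaking output format. The token ids, the 2x weight, and the loss form are assumptions for illustration, not the paper's released recipe.

```python
# Editorial sketch: weighted cross-entropy emphasizing structure tokens.
# Token ids and the weight below are hypothetical placeholders.
import torch
import torch.nn.functional as F

STRUCTURE_TOKEN_IDS = (101, 102, 103)  # hypothetical ids for <table>, </table>, <tr>, ...
STRUCTURE_WEIGHT = 2.0                 # assumed extra emphasis on structure tokens

def structure_weighted_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab) decoder outputs; targets: (batch, seq) token ids."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    weights = torch.ones_like(per_token)
    for tok in STRUCTURE_TOKEN_IDS:
        weights = torch.where(targets == tok, weights * STRUCTURE_WEIGHT, weights)
    return (weights * per_token).mean()

# Toy usage on random tensors
loss = structure_weighted_loss(torch.randn(2, 8, 500), torch.randint(0, 500, (2, 8)))
```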

What carries the argument

Realistic Scene Synthesis that assembles layout templates with varied document elements to produce full-page end-to-end training data, combined with Document-Aware Training that uses progressive learning stages and structure-token optimization to enforce output consistency.
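
A minimal sketch of the data-side idea under the assumptions that element repositories are keyed by type and templates carry boxes with a reading order; the Region fields, repository contents, and the render/augment hand-off are hypothetical illustrations, not the authors' pipeline.

```python
# Editorial sketch of template composition: sample atomic elements from
# type-keyed repositories and place them into a layout template with a
# reading order, emitting page-level supervision.
import random
from dataclasses import dataclass

@dataclass
class Region:
    kind: str            # e.g. "title", "paragraph", "table", "formula"
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    reading_order: int

def compose_page(template, repositories, rng=random):
    """template: list[Region]; repositories: dict mapping kind -> list of contents."""
    annotation = []
    for region in sorted(template, key=lambda r: r.reading_order):
        pool = repositories.get(region.kind, [])
        if not pool:
            continue  # skip element types the repository does not cover
        annotation.append({
            "type": region.kind,
            "bbox": region.bbox,
            "order": region.reading_order,
            "content": rng.choice(pool),
        })
    # Downstream (not shown): render the annotation to an image, then apply
    # capture-aware augmentation to simulate a casual photo of the page.
    return annotation

# Toy usage with a two-region template
template = [Region("title", (50, 40, 550, 90), 0),
            Region("paragraph", (50, 110, 550, 700), 1)]
repos = {"title": ["Quarterly Report"], "paragraph": ["Lorem ipsum dolor sit amet ..."]}
print(compose_page(template, repos))
```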

If this is right

  • The resulting model produces fewer hallucinations and more structurally consistent outputs on full-page inputs.
  • Performance improves simultaneously on clean scanned documents and on noisy real-world photographs.
  • A single 1B-parameter MLLM becomes competitive for both digital and casually captured document parsing tasks.
  • Releasing the synthesis pipeline and Wild-OmniDocBench benchmark supplies new resources for further end-to-end training research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same synthesis method could be extended to generate training pairs for other structured-output multimodal tasks such as table extraction or form understanding.
  • If the distribution match holds, similar template-composition techniques might reduce the need for manual annotation in other document-related vision-language benchmarks.
  • Progressive structure-token training may transfer to non-document domains where output format consistency is critical, such as code generation from images.

Load-bearing premise

Composing layout templates with document elements will generate synthetic data whose noise and variability distribution closely matches that of real-world casually captured documents.
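
One way to probe this premise directly is a distributional comparison between synthetic and real page layouts, for instance a 1-D Wasserstein distance over bounding-box statistics (a fuller check would also cover noise and perceptual metrics). The sketch below is editorial: loading annotations from the synthesis pipeline and from Wild-OmniDocBench is left out, and only scipy.stats.wasserstein_distance is a real dependency.

```python
# Editorial sketch: Wasserstein distances over element-box centers and areas
# between a synthetic layout sample and a real-capture layout sample.
import numpy as np
from scipy.stats import wasserstein_distance

def layout_shift(synthetic_boxes, real_boxes):
    """Each input is an (N, 4) array of normalized (x0, y0, x1, y1) boxes."""
    def stats(boxes):
        cx = (boxes[:, 0] + boxes[:, 2]) / 2
        cy = (boxes[:, 1] + boxes[:, 3]) / 2
        area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return cx, cy, area
    s_cx, s_cy, s_area = stats(synthetic_boxes)
    r_cx, r_cy, r_area = stats(real_boxes)
    return {
        "center_x": wasserstein_distance(s_cx, r_cx),
        "center_y": wasserstein_distance(s_cy, r_cy),
        "area": wasserstein_distance(s_area, r_area),
    }

# Toy check with random boxes (sorted rows guarantee x1 >= x0 and y1 >= y0)
rng = np.random.default_rng(0)
fake = np.sort(rng.uniform(0, 1, size=(200, 4)), axis=1)
real = np.sort(rng.uniform(0, 1, size=(200, 4)), axis=1)
print(layout_shift(fake, real))
```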

What would settle it

Training the same 1B-parameter MLLM on existing public document datasets instead of the new synthetic data and observing equal or better accuracy on the Wild-OmniDocBench real-world test set would falsify the central effectiveness claim.

Figures

Figures reproduced from arXiv: 2603.23885 by Can Ma, Chengquan Zhang, Gangyan Zeng, Gengluo Li, Han Hu, Huawen Shen, Liang Wu, Pengyuan Lyu, Xingyu Wan, Yu Zhou.

Figure 1: Overall Performance and Degradation from Om…

Figure 2: Scanned/Digital and Real-World Capture. On scanned/digital pages, both modular and E2E parsers decode correctly. Under real-world capture, modular cascades accumulate layout-analysis errors that propagate to element parsing (extra/missing regions), while generic end-to-end models exhibit repetitive outputs.

Figure 3: Overview of Realistic Scene Synthesis. Left: repositories of atomic elements and layout templates with reading order. Right: a synthesis pipeline that composes sampled elements into templates under spatial/structural constraints to produce page-level annotations, followed by capture-aware augmentation to simulate real-world images.

Figure 4: Wild-OmniDocBench Construction. We convert scanned pages into real-world-captured images by (i) printing, deforming, and photographing under varied lighting, and (ii) displaying on screens and re-shooting to induce moiré and reflections.
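
A toy rendition of the capture-aware augmentation shown in Figures 3 and 4: perspective warp for an off-axis handheld shot, a horizontal brightness gradient for uneven lighting, and Gaussian sensor noise applied to a rendered page. Parameter ranges and the OpenCV-based implementation are illustrative assumptions, not the paper's settings.

```python
# Editorial sketch of capture-aware augmentation on a rendered synthetic page.
import numpy as np
import cv2

def capture_aware_augment(page, rng):
    """page: HxWx3 uint8 image of a rendered page; rng: np.random.Generator."""
    h, w = page.shape[:2]
    jitter = 0.05 * min(h, w)
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-jitter, jitter, size=(4, 2)).astype(np.float32)
    warped = cv2.warpPerspective(page, cv2.getPerspectiveTransform(src, dst), (w, h),
                                 borderValue=(128, 128, 128))
    gradient = np.linspace(rng.uniform(0.7, 1.0), rng.uniform(1.0, 1.3), w)
    lit = warped.astype(np.float32) * gradient[None, :, None]
    noisy = lit + rng.normal(0, 6.0, size=lit.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Usage (hypothetical file name):
# augmented = capture_aware_augment(cv2.imread("synthetic_page.png"), np.random.default_rng(0))
```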
read the original abstract

Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions - primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a data-training co-design framework for robust end-to-end document parsing using MLLMs. It proposes Realistic Scene Synthesis to generate large-scale, structurally diverse full-page supervision data by composing layout templates with rich document elements, paired with a Document-Aware Training Recipe using progressive learning and structure-token optimization. The work also introduces the Wild-OmniDocBench benchmark derived from real-world captured documents and reports that integration into a 1B-parameter MLLM yields superior accuracy and robustness on both scanned/digital and casually captured real-world scenarios, with all models, pipelines, and benchmarks to be publicly released.

Significance. If the empirical claims are substantiated, the framework could meaningfully advance real-world document parsing by mitigating data scarcity and structural inconsistencies in MLLM outputs, offering a scalable alternative to cascaded pipelines that fail under non-standard capture conditions. The public release of synthesis code, training recipes, and the new benchmark would provide reusable resources for the community.

major comments (2)
  1. [Abstract] The headline claim of 'superior accuracy and robustness' across scanned/digital and real-world scenarios is presented without any quantitative metrics, baselines, error bars, or statistical details; this claim is load-bearing for the central contribution, yet the abstract supplies no evidence to support the stated improvements.
  2. [§3, Realistic Scene Synthesis] The procedure for composing layout templates is described as producing data whose variability and noise match real-world captured documents, yet no quantitative distributional comparison (e.g., Wasserstein distance on element positions/sizes, noise histogram overlap, or perceptual metrics) is reported between the generated images and the Wild-OmniDocBench real captures; this gap directly affects the validity of the robustness claim on casually photographed documents.
minor comments (1)
  1. [Abstract] The statement that 'all models, data synthesis pipelines, and benchmarks will be publicly released' lacks any timeline, repository link, or licensing details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of our claims where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The headline claim of 'superior accuracy and robustness' across scanned/digital and real-world scenarios is presented without any quantitative metrics, baselines, error bars, or statistical details; this claim is load-bearing for the central contribution, yet the abstract supplies no evidence to support the stated improvements.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the headline claim. In the revised version, we will update the abstract to include key metrics such as the absolute accuracy improvements (e.g., +X% on Wild-OmniDocBench and +Y% on digital benchmarks) relative to strong baselines, along with a brief note on the evaluation protocol and error analysis detailed in the main body and tables. revision: yes

  2. Referee: [§3, Realistic Scene Synthesis] The procedure for composing layout templates is described as producing data whose variability and noise match real-world captured documents, yet no quantitative distributional comparison (e.g., Wasserstein distance on element positions/sizes, noise histogram overlap, or perceptual metrics) is reported between the generated images and the Wild-OmniDocBench real captures; this gap directly affects the validity of the robustness claim on casually photographed documents.

    Authors: We acknowledge that a direct quantitative distributional comparison would further substantiate the synthesis procedure. While the manuscript currently validates the approach via downstream performance on the real-world Wild-OmniDocBench, we will add in the revision a quantitative analysis in §3, including Wasserstein distances on element position/size distributions, noise histogram overlaps, and perceptual metrics (e.g., LPIPS or FID) between the synthetic data and real captures from the benchmark. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical synthesis and benchmark evaluation

full rationale

The manuscript describes a data-training co-design: Realistic Scene Synthesis via template composition with document elements, plus a Document-Aware Training Recipe using progressive learning and structure-token optimization. No equations, parameter fittings, or derivations appear. Claims rest on training an external 1B-parameter MLLM and reporting accuracy on scanned/digital data plus the newly introduced Wild-OmniDocBench (real captures). No self-citation is invoked to justify uniqueness or to close a definitional loop; the distributional match between synthetic and real data is asserted but not reduced to a fitted quantity or self-referential definition. The result is therefore self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard assumptions in multimodal LLM training and synthetic data generation; no explicit free parameters, axioms, or invented entities are introduced beyond the new benchmark and strategies described.

pith-pipeline@v0.9.0 · 5538 in / 1118 out tokens · 55796 ms · 2026-05-15T01:11:21.425765+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
