
arxiv: 2605.22100 · v1 · pith:G42JQESYnew · submitted 2026-05-21 · 💻 cs.AI

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

Pith reviewed 2026-05-22 05:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords document parsing · multi-page documents · benchmark · semantic continuity · hierarchical structure · reading order · visual content

The pith

A benchmark for multi-page documents shows current models still struggle with semantic continuity, visual content, and hierarchy recovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MPDocBench-Parse to fill gaps left by prior benchmarks that focus on single pages, specific tasks, or text-only settings without fine-grained checks for cross-page integration. It supplies 433 manually annotated documents spanning 3,246 pages and 15 types in English and Chinese, plus an evaluation protocol that measures text and table merging, figure extraction, reading order, and heading hierarchy at the full-document level. Experiments indicate that existing models manage basic text extraction adequately yet fall short on maintaining semantic flow across pages, preserving visual elements, and reconstructing logical document structure. This matters because document parsing supplies the structured data that powers downstream information systems, so a benchmark closer to real-world conditions can steer development toward parsers that work reliably on practical multi-page inputs.

Core claim

Existing models perform well on basic text extraction but suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery when tested on the new MPDocBench-Parse benchmark, which enables document-level end-to-end evaluation across diverse multi-page layouts and languages.

What carries the argument

The MPDocBench-Parse benchmark itself, built from 433 manually annotated multi-page documents and a protocol that jointly scores content fidelity and logical structure elements such as truncated text merging, figure extraction, reading order, and heading hierarchy.
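
To make "content fidelity" concrete, here is a minimal Python sketch of a normalized edit-distance score, a common choice for text-level comparison (the paper's reference list includes Levenshtein's edit-distance work [54]). The exact metric MPDocBench-Parse uses is not given on this page, so the function names, the normalization, and the example strings below are assumptions for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(predicted: str, reference: str) -> float:
    """Hypothetical score in [0, 1]; 1.0 means an exact match."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, reference) / longest


# Invented example: one sentence that was split across a page break.
reference = "The model is trained end-to-end."
merged = "The model is trained end-to-end."      # parser rejoined the halves
unmerged = "The model is trained\n\nend-to-end."  # parser left the break in

print(edit_similarity(merged, reference))
print(edit_similarity(unmerged, reference))
```

Scoring the whole document as one string, rather than page by page, is what lets a leftover page break show up as a penalty at all.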

If this is right

  • Improvements in document parsing must target cross-page information integration rather than isolated page processing.
  • Parsers will need explicit mechanisms to recover and maintain heading hierarchies and logical reading order.
  • Evaluation protocols should incorporate checks for visual content preservation and accurate merging of split tables or text blocks.
  • Benchmarks covering both English and Chinese with varied layout styles provide a more complete test of practical utility.
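
As a rough illustration of what "recovering a heading hierarchy" can mean in code, the sketch below rebuilds a tree from a flat, document-order list of (level, title) pairs using a stack, then compares two trees by their parent-child edges. This is a generic technique, not the paper's protocol: `HeadingNode`, `build_heading_tree`, `edge_f1`, and the sample headings are all hypothetical, and the edge comparison assumes heading titles are unique.

```python
from dataclasses import dataclass, field


@dataclass
class HeadingNode:
    level: int
    title: str
    children: list["HeadingNode"] = field(default_factory=list)


def build_heading_tree(headings):
    """Build a tree from (level, title) pairs listed in reading order."""
    root = HeadingNode(level=0, title="<root>")
    stack = [root]  # path from the root to the most recent heading
    for level, title in headings:
        if level < 1:
            raise ValueError("heading levels must start at 1")
        # Pop until the top of the stack is a strictly shallower heading.
        while stack[-1].level >= level:
            stack.pop()
        node = HeadingNode(level, title)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def parent_child_edges(root):
    """Collect (parent title, child title) pairs; assumes unique titles."""
    edges = set()
    pending = [root]
    while pending:
        node = pending.pop()
        for child in node.children:
            edges.add((node.title, child.title))
            pending.append(child)
    return edges


def edge_f1(predicted, reference):
    """Hypothetical hierarchy score: F1 over parent-child edges."""
    pred_edges = parent_child_edges(build_heading_tree(predicted))
    ref_edges = parent_child_edges(build_heading_tree(reference))
    matched = len(pred_edges & ref_edges)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_edges)
    recall = matched / len(ref_edges)
    return 2 * precision * recall / (precision + recall)


# Invented example: the prediction promotes "Scope" to a top-level heading.
reference = [(1, "Introduction"), (2, "Motivation"), (2, "Scope"),
             (1, "Method"), (2, "Data")]
predicted = [(1, "Introduction"), (2, "Motivation"), (1, "Scope"),
             (1, "Method"), (2, "Data")]
score = edge_f1(predicted, reference)
```

Because the stack carries state across the whole heading sequence, the same routine works whether consecutive headings sit on one page or many, which is the property a page-by-page parser lacks.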

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams building parsers could adopt the benchmark to isolate and fix specific weaknesses in multi-page handling before deployment.
  • The same approach of full-document metrics might apply to other domains that rely on long structured inputs such as contracts or technical reports.
  • Future work could explore whether models trained with explicit multi-page objectives close the gaps identified here.

Load-bearing premise

The 433 chosen documents and the fine-grained metrics for continuity and hierarchy are representative of the parsing difficulties that arise in actual multi-page applications.

What would settle it

A test in which a model scores highly on every metric including semantic continuity and heading recovery when run on the full set of 433 documents, or in which results on a substantially larger and more varied collection of multi-page documents contradict the reported limitations.

Figures

Figures reproduced from arXiv: 2605.22100 by Bangbang Zhou, Feiyu Gao, Hangdi Xing, Hongtao Xie, Jianjun Xu, Jieping Ye, Ming Yan, Qi Zheng, Shuai Bai, Yifan Chen, Zhibo Yang.

Figure 1: Overview of MPDocBench-Parse. Compared with existing benchmarks, MPDocBench…
Figure 2: Dataset statistics of MPDocBench-Parse. Pipeline-based methods [34, 35, 36, 37] decompose parsing into subtasks such as layout detection, OCR, element recognition, and reading order reconstruction. They are efficient and controllable, but their separately designed parsing modules often suffer from error accumulation. General VLMs [38, 39, 40, 41, 42] offer a simple end-to-end interface by directly generati…
Figure 3: MPDocBench-Parse’s construction pipeline. The left and middle panels illustrate the…
Original abstract

Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for realistic scenarios. Existing benchmarks either focus on specific tasks or assess only single-page, text-centric settings, making them insufficient for practical multi-page parsing. Moreover, they lack fine-grained evaluation of semantic continuity, hierarchical structure recovery, and visual content preservation. To address these gaps, we propose MPDocBench-Parse, a benchmark for multi-page document parsing in real-world applications. It contains 433 manually annotated documents with 3,246 pages, covering 15 document types in English and Chinese, with diverse layout styles, and supports document-level end-to-end evaluation. We further design a comprehensive protocol for content fidelity and logical structure, covering text, table, and formula recognition, truncated text and table merging, figure extraction, reading order, and heading hierarchy recovery. Experiments show that, while existing models perform well on basic text extraction, they still suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery. MPDocBench-Parse provides a unified foundation for advancing document parsing toward more realistic scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MPDocBench-Parse, a benchmark for multi-page document parsing consisting of 433 manually annotated documents (3,246 pages) across 15 types in English and Chinese. It defines a comprehensive evaluation protocol covering text/table/formula recognition, truncated element merging, figure extraction, reading order, and heading hierarchy recovery. Experiments on existing models show strong performance on basic text extraction but clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery, positioning the benchmark as a step toward more realistic document-level evaluation.

Significance. If the benchmark construction and metrics are validated, this work could meaningfully advance document parsing research by moving beyond single-page and text-centric benchmarks to address practical multi-page challenges. The fine-grained protocol and identification of specific model weaknesses in semantic and structural recovery provide concrete directions for improvement. The new dataset and end-to-end evaluation framework represent a useful contribution to the field if representativeness and annotation reliability are demonstrated.

major comments (3)
  1. [Section 3] Section 3 (Dataset Construction): The manuscript provides no details on annotation guidelines, inter-annotator agreement scores, or resolution of disagreements for subjective elements such as reading order and heading hierarchy. This is load-bearing for the central claim, as the reported limitations in hierarchical structure recovery depend directly on the reliability of these ground-truth labels.
  2. [Section 3.1] Section 3.1 (Document Selection): No selection criteria, sampling strategy, or coverage analysis is reported for the 15 document types and layouts, including whether edge cases like cross-page references or complex nested tables are represented. Without this, the performance gaps may reflect sampling artifacts rather than general model shortcomings in semantic continuity and visual parsing.
  3. [Section 4] Section 4 (Evaluation Protocol): The fine-grained metrics for semantic continuity integration and visual content preservation lack explicit mathematical definitions, formulas, or statistical significance tests for the observed limitations. This undermines the strength of the experimental findings that models suffer clear limitations in these areas.
minor comments (2)
  1. [Section 2] The related work discussion would benefit from citing additional recent multi-page document understanding papers to better situate the new benchmark.
  2. [Table 1] Table 1 summarizing document types and page counts could include a column for language distribution to clarify the English/Chinese balance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback, which highlights important areas for improving the clarity and rigor of our manuscript. We address each major comment below and commit to revisions that strengthen the presentation of our benchmark without altering its core contributions.

  1. Referee: [Section 3] Section 3 (Dataset Construction): The manuscript provides no details on annotation guidelines, inter-annotator agreement scores, or resolution of disagreements for subjective elements such as reading order and heading hierarchy. This is load-bearing for the central claim, as the reported limitations in hierarchical structure recovery depend directly on the reliability of these ground-truth labels.

    Authors: We agree that explicit documentation of the annotation process is essential to substantiate the reliability of the ground-truth labels, particularly for subjective tasks. In the revised manuscript, we will expand Section 3 with a dedicated subsection detailing the annotation guidelines provided to annotators, the multi-stage review process used to resolve disagreements, and quantitative inter-annotator agreement scores (e.g., Cohen’s kappa for pairwise comparisons and Fleiss’ kappa for multi-annotator tasks) computed specifically on reading order and heading hierarchy. These additions will directly support the validity of our findings on hierarchical structure recovery.
    revision: yes

  2. Referee: [Section 3.1] Section 3.1 (Document Selection): No selection criteria, sampling strategy, or coverage analysis is reported for the 15 document types and layouts, including whether edge cases like cross-page references or complex nested tables are represented. Without this, the performance gaps may reflect sampling artifacts rather than general model shortcomings in semantic continuity and visual parsing.

    Authors: We concur that transparent reporting of document selection is necessary to demonstrate the benchmark’s representativeness and to rule out sampling artifacts. We will revise Section 3.1 to include the explicit selection criteria, the stratified sampling strategy employed across the 15 document types and languages, and a coverage analysis that quantifies the presence of challenging edge cases such as cross-page references, multi-column layouts, and complex nested tables. This will provide stronger evidence that the identified model limitations reflect general challenges in multi-page parsing.
    revision: yes

  3. Referee: [Section 4] Section 4 (Evaluation Protocol): The fine-grained metrics for semantic continuity integration and visual content preservation lack explicit mathematical definitions, formulas, or statistical significance tests for the observed limitations. This undermines the strength of the experimental findings that models suffer clear limitations in these areas.

    Authors: We acknowledge that formal definitions and statistical validation are required for the fine-grained metrics. In the revised Section 4, we will provide explicit mathematical formulations for the semantic continuity integration and visual content preservation metrics, including the precise aggregation rules across pages. We will also add statistical significance testing (e.g., paired Wilcoxon signed-rank tests with reported p-values) for the performance differences highlighted in the experiments. These changes will make the evaluation protocol fully reproducible and strengthen the empirical claims.
    revision: yes
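
For readers unfamiliar with the paired Wilcoxon signed-rank test named in the third response, the following is a small standard-library Python sketch of the exact two-sided version. It is not taken from the paper: the function name and the per-document scores are invented, zero differences are dropped, tied absolute differences share an average rank, and the null distribution is enumerated by brute force, which only suits small samples.

```python
from itertools import product


def wilcoxon_signed_rank_exact(scores_a, scores_b):
    """Return (W_plus, two-sided exact p-value) for paired scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    if n > 20:
        raise ValueError("exact enumeration is too slow; use an approximation")

    # Rank the absolute differences, giving ties their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2  # mean of W_plus under the null hypothesis
    observed = abs(w_plus - center)

    # Under the null, every sign pattern is equally likely: enumerate all 2**n.
    extreme = 0
    for signs in product((False, True), repeat=n):
        w = sum(r for r, positive in zip(ranks, signs) if positive)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Invented per-document scores (0-100) for two hypothetical parsers.
parser_a = [82, 75, 90, 68, 77, 85, 79, 88]
parser_b = [78, 70, 91, 60, 72, 80, 73, 81]
w_plus, p_value = wilcoxon_signed_rank_exact(parser_a, parser_b)
print(w_plus, p_value)  # 35.0 0.015625
```

The test pairs scores by document, so it asks whether one parser is consistently better on the same inputs, which is the comparison a benchmark table implicitly makes.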

Circularity Check

0 steps flagged

New benchmark with independent annotations; no derivation reduces to fitted inputs or self-citations


The paper introduces MPDocBench-Parse as a new dataset of 433 manually annotated multi-page documents plus a custom evaluation protocol for semantic continuity, hierarchy, and visual content. No equations, fitted parameters, or predictions are defined; reported limitations are direct empirical observations on this fresh data rather than quantities derived from prior author work. The central claims rest on the new annotations and metrics, which constitute independent input rather than a self-referential loop. This is a standard benchmark-creation paper: its content is self-contained, and the models it evaluates are external.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that manual annotation of 433 documents produces ground truth that faithfully captures semantic continuity and logical structure; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: Manual annotation by the authors produces reliable ground truth for text merging, reading order, and heading hierarchy across 15 document types.
    The benchmark's value depends on the quality of these annotations, which is stated but not quantified in the abstract.
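
One way such annotation quality is commonly quantified is Cohen's kappa, which the simulated rebuttal above also names. The sketch below computes it for two annotators in plain Python; the function name and the example heading-level labels are invented for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Invented example: heading levels assigned to ten headings by two annotators.
annotator_a = [1, 2, 2, 3, 1, 2, 3, 3, 2, 1]
annotator_b = [1, 2, 3, 3, 1, 2, 3, 2, 2, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.697
```

Here the annotators agree on 8 of 10 headings (0.80 raw agreement), but kappa is lower because some of that agreement would be expected by chance given how often each level is used.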

pith-pipeline@v0.9.0 · 5770 in / 1318 out tokens · 37307 ms · 2026-05-22T05:55:04.116865+00:00 · methodology



Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 11 internal anchors

  [1] Boyang Zhang, Sebastián G Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, and Simon Suo. Parsebench: A document parsing benchmark for ai agents. arXiv preprint arXiv:2604.08538, 2026

  [2] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: largest dataset ever for document layout analysis. In 2019 International conference on document analysis and recognition, pages 1015–1022, 2019

  [3] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. Tablebank: Table benchmark for image-based table detection and recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1918–1925, 2020

  [4] Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M Rush. Image-to-markup generation with coarse-to-fine attention. In International Conference on Machine Learning, pages 980–989. PMLR, 2017

  [5] Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. Layoutreader: Pre-training of text and layout for reading order detection. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 4735–4744, 2021

  [6] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny. Icdar 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition, pages 1...

  [7] Jiefeng Ma, Jun Du, Pengfei Hu, Zhenrong Zhang, Jianshu Zhang, Huihui Zhu, and Cong Liu. Hrdoc: Dataset and baseline method toward hierarchical reconstruction of document structures. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1870–1877, 2023

  [8] Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter Staar. Doclaynet: A large human-annotated dataset for document-layout segmentation. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 3743–3751, 2022

  [9] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. In European conference on computer vision, pages 564–580, 2020

  [10] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. Complicated table structure recognition. arXiv preprint arXiv:1908.04729, 2019

  [11] Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Focus anywhere for fine-grained multi-page document understanding. arXiv:2405.14295, 2024

  [12] Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models. arXiv preprint arXiv:2502.18443, 2025

  [13] Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24838–24848, 2025

  [14] Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv:2409.01704, 2024

  [15] Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. arXiv:2308.13418, 2024

  [16] Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, Shanshan Jiang, Bin Dong, and Le Sun. Readoc: A unified benchmark for realistic document structured extraction. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21889–21905, 2025

  [17] Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, et al. Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21744–21754, 2025

  [18] Yongkun Du, Pinxuan Chen, Xuye Ying, and Zhineng Chen. Docptbench: Benchmarking end-to-end photographed document parsing and translation. arXiv preprint arXiv:2511.18434, 2025

  [19] Zhang Li, Zhibo Lin, Qiang Liu, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiajun Song, Jiarui Zhang, Xiang Bai, and Yuliang Liu. Mdpbench: A benchmark for multilingual document parsing in real-world scenarios. arXiv preprint arXiv:2603.28130, 2026

  [20] Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, and Minggang Wu. Logics-parsing technical report. arXiv preprint arXiv:2509.19760, 2025

  [21] Shu Wang, Yingli Zhou, and Yixiang Fang. Bookrag: A hierarchical structure-aware index-based approach for retrieval-augmented generation on complex documents. arXiv preprint arXiv:2512.03413, 2025

  [22] Shih-Ying Yeh, Yueh-Feng Ku, Ko-Wei Huang, and Buu-Khang Tu. Kohakurag: A simple rag framework with hierarchical document indexing. arXiv preprint arXiv:2603.07612, 2026

  [23] Zhanli Li, Huiwen Tian, Lvzhou Luo, Yixuan Cao, and Ping Luo. Deepread: Document structure-aware reasoning to enhance agentic search. arXiv preprint arXiv:2602.05014, 2026

  [24] Joongmin Shin, Chanjun Park, Jeongbae Park, Jaehyung Seo, and Heui-Seok Lim. Multidocfusion: Hierarchical and multimodal chunking pipeline for enhanced rag on long industrial documents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20996–21015, 2025

  [25] Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. Unimernet: A universal network for real-world mathematical expression recognition, 2024

  [26] Hangdi Xing, Changxu Cheng, Feiyu Gao, Zirui Shao, Zhi Yu, Jiajun Bu, Qi Zheng, and Cong Yao. Dochienet: A large and diverse dataset for document hierarchy parsing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1129–1142, 2024

  [27] Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, et al. Firered-ocr technical report. arXiv preprint arXiv:2603.01840, 2026

  [28] Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr 2: Visual causal flow. arXiv preprint arXiv:2601.20552, 2026

  [29] Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, et al. Qianfan-ocr: A unified end-to-end model for document intelligence. arXiv preprint arXiv:2603.13398, 2026

  [30] Chandra OCR 2. Chandra ocr 2. https://github.com/datalab-to/chandra, 2025

  [31] Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186, 2025

  [32] Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, et al. Glm-ocr technical report. arXiv preprint arXiv:2603.10910, 2026

  [33] Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, et al. Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025

  [34] Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024

  [35] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025

  [36] Nikos Livathinos, Christoph Auer, Maxim Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for ai-driven document conversion. In AAAI Conference on Artificial Intelligence, 2025

  [37] Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. Layoutparser: A unified toolkit for deep learning based document image analysis. In International Conference on Document Analysis and Recognition, pages 131–146. Springer, 2021

  [38] Google DeepMind. Gemini 3.1. https://deepmind.google/models/gemini/pro/, 2026

  [39] OpenAI. Chatgpt. https://chat.openai.com, 2025

  [40] Anthropic. Claude. https://www.anthropic.com/claude, 2025

  [41] TongYi. Tongyi. https://www.aliyun.com/product/tongyi, 2025

  [42] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  [43] Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao, Dingkang Yang, Yongkun Du, Xuecheng Wu, Jingqun Tang, Yang Liu, et al. Dolphin-v2: Universal document parsing via scalable anchor prompting. arXiv preprint arXiv:2602.05384, 2026

  [44] Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. Paddleocr-vl: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025

  [45] Hiuyi Cheng, Peirong Zhang, Sihang Wu, Jiaxin Zhang, Qiyuan Zhu, Zecheng Xie, Jing Li, Kai Ding, and Lianwen Jin. M6doc: A large-scale multi-format, multi-type, multi-layout, multi-language, multi-annotation category dataset for modern document layout analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 151...

  [46] Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. Vision grid transformer for document layout analysis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19462–19472, 2023

  [47] Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong, and Nancy Xin Ru Wang. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 697–706, 2021

  [48] Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Baoyu Hou, et al. Logics-parsing-omni technical report. arXiv preprint arXiv:2603.09677, 2026

  [49] Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010, 2024

  [50] Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, et al. Longdocurl: a comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1135–...

  51. [51]

    Unidoc-bench: A unified benchmark for document-centric multimodal rag.arXiv preprint arXiv:2510.03663, 2025

    Xiangyu Peng, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, and Chien-Sheng Wu. Unidoc-bench: A unified benchmark for document-centric multimodal rag.arXiv preprint arXiv:2510.03663, 2025

  52. [52]

    Docbench: A benchmark for evaluating llm-based document reading systems

    Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, and Dong Yu. Docbench: A benchmark for evaluating llm-based document reading systems. InProceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 359–373, 2025

  53. [53]

    PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

    Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. Paddleocr-vl-1.5: Towards a multi-task 0.9B vlm for robust in-the-wild document parsing. arXiv preprint arXiv:2601.21957, 2026

  54. [54]

    Binary codes capable of correcting deletions, insertions, and reversals

    Vladimir I Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710. Soviet Union, 1966

  55. [55]

    Gemini 3 pro

    Google DeepMind. Gemini 3 pro. https://blog.google/innovation-and-ai/technology/developers-tools/gemini-3-pro-vision, 2025

  56. [56]

    Dianjin-ocr-r1: Enhancing ocr capabilities via a reasoning-and-tool interleaved vision-language model

    Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, and Chi Zhang. Dianjin-ocr-r1: Enhancing ocr capabilities via a reasoning-and-tool interleaved vision-language model. arXiv preprint arXiv:2508.13238, 2025

  57. [57]

    Cdm: A reliable metric for fair and accurate formula recognition evaluation

    Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, and Conghui He. Cdm: A reliable metric for fair and accurate formula recognition evaluation. arXiv preprint arXiv:2409.03643, 5(6), 2024

  58. [58]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. In European conference on computer vision, pages 564–580. Springer, 2020

  59. [59]

    Youtu-parsing: Perception, structuring and recognition via high-parallelism decoding

    Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li, Huang Chen, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, et al. Youtu-parsing: Perception, structuring and recognition via high-parallelism decoding. arXiv preprint arXiv:2601.20430, 2026

  60. [60]

    Multimodal ocr: Parse anything from documents

    Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Jiyu Qiu, Qi Fu, et al. Multimodal ocr: Parse anything from documents. arXiv preprint arXiv:2603.13032, 2026

  61. [61]

    Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. dots.ocr: Multilingual document layout parsing in a single vision-language model. arXiv preprint arXiv:2512.02498, 2025

  62. [62]

    Ocrverse: Towards holistic ocr in end-to-end vision-language models

    Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, and Zhixiong Zeng. Ocrverse: Towards holistic ocr in end-to-end vision-language models. arXiv preprint arXiv:2601.21639, 2026

  63. [63]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  64. [64]

    Why do mllms struggle with spatial understanding? a systematic analysis from data to architecture

    Wanyue Zhang, Yibin Huang, Yangbin Xu, JingJing Huang, Helu Zhi, Shuo Ren, Wang Xu, and Jiajun Zhang. Why do mllms struggle with spatial understanding? a systematic analysis from data to architecture. arXiv preprint arXiv:2509.02359, 2025

  65. [65]

    Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

    Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722, 2025

  66. [66]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  67. [67]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  68. [68]

    LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

    Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025

Appendix Overview

A More Details of MPDocBench
A.1 Dataset Statistics
A.2 A...