pith. machine review for the scientific record.

arxiv: 2603.24326 · v2 · submitted 2026-03-25 · 💻 cs.CV · cs.AI · cs.IR

Recognition: unknown

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.IR
keywords document parsing · vision-language models · coarse-to-fine processing · vision token reduction · efficient OCR · page-level parsing · element recognition

The pith

PaddleOCR-VL filters redundant image regions with a lightweight module so a 0.9B vision-language model can parse documents at state-of-the-art accuracy using far fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-resolution document images cause quadratic growth in vision tokens for vision-language models, driving up compute costs. The authors argue that documents contain large redundant areas such as backgrounds that can be safely ignored. They introduce a lightweight Valid Region Focus Module that predicts which tokens carry semantic value and passes only those to a compact 0.9B model for final recognition. Experiments show the resulting system matches or exceeds larger models on both full-page parsing and element recognition while cutting token count and inference time.
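
For intuition on the scaling problem, here is a back-of-the-envelope sketch in Python (the 14 px patch size and the resolutions are illustrative assumptions, not values from the paper):

    # Illustrative numbers only: patch size and resolutions are assumptions
    # chosen for intuition, not the paper's configuration.
    PATCH = 14  # pixels per patch side, a common ViT choice

    def vision_tokens(side_px: int, patch: int = PATCH) -> int:
        """Patch tokens a plain ViT-style encoder emits for a square image."""
        return (side_px // patch) ** 2

    for side in (448, 896, 1792):
        n = vision_tokens(side)
        # Self-attention touches every token pair, so cost scales with n**2:
        # doubling the side length quadruples n and ~16x the attention work.
        print(f"{side}x{side} px -> {n:6d} tokens, ~{n * n:,} attention pairs")

Doubling the page side length quadruples the token count and multiplies pairwise attention work by roughly sixteen; filtering redundant regions attacks exactly this term.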

Core claim

PaddleOCR-VL uses a coarse-to-fine pipeline in which the Valid Region Focus Module first identifies valid vision tokens via localization and contextual prediction, then routes only those tokens to a trained 0.9B vision-language model that performs detailed document recognition without processing the entire high-resolution image.

What carries the argument

The Valid Region Focus Module (VRFM), a lightweight network that selects semantically relevant vision tokens from document images before they reach the recognition model.
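
As a minimal sketch of this coarse-to-fine routing (every name and the thresholding rule here are hypothetical; the actual VRFM predicts validity from localization and contextual relationships rather than applying a bare score cutoff):

    import numpy as np

    def vrfm_select(patch_tokens: np.ndarray, validity: np.ndarray, thresh: float = 0.5):
        """Coarse stage (hypothetical interface): keep only tokens whose
        predicted validity clears a threshold; `validity` stands in for
        the module's localization and context predictions."""
        keep = validity >= thresh
        return patch_tokens[keep], np.flatnonzero(keep)

    def parse_page(patch_tokens, validity, recognizer):
        """Fine stage: the compact recognizer sees only surviving tokens
        and their grid positions, never the full high-resolution image."""
        valid_tokens, positions = vrfm_select(patch_tokens, validity)
        return recognizer(valid_tokens, positions)

The load-bearing detail is the interface: the recognizer's sequence length is set by the valid region, not by page resolution, which is where the token and latency savings come from.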

If this is right

  • Page-level parsing and element-level recognition both reach state-of-the-art scores.
  • The system uses substantially fewer vision tokens and parameters than competing vision-language models.
  • Inference speed improves while maintaining or exceeding accuracy of larger models.
  • Targeted filtering of visual input proves effective for fine-grained document tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same filtering step could be applied to other high-resolution vision tasks to reduce token budgets without retraining the main model.
  • Small domain-specific models guided by coarse selection may close the gap with much larger general-purpose models in structured visual domains.
  • Further gains are possible if the Valid Region Focus Module is jointly trained end-to-end with the recognition model rather than used as a fixed preprocessor.

Load-bearing premise

Redundant background regions in documents can be identified reliably enough that the small model still recovers full accuracy from the remaining tokens.

What would settle it

Accuracy on a held-out document set drops sharply when the Valid Region Focus Module is forced to discard regions that contain critical text or layout elements.
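
A hedged sketch of that ablation, with `parse`, `score`, and `drop_critical` as placeholders rather than the paper's evaluation code:

    def forced_drop_ablation(pages, parse, score, drop_critical):
        """Compare normal parsing against runs where the coarse stage is
        forced to discard annotated critical regions (text, tables, layout
        elements). A sharp average drop would confirm the premise that the
        retained tokens carry the accuracy; a mild drop would suggest the
        fine model tolerates imperfect filtering."""
        deltas = []
        for page in pages:
            baseline = score(parse(page), page.reference)
            ablated = score(parse(drop_critical(page)), page.reference)
            deltas.append(baseline - ablated)
        return sum(deltas) / len(deltas)  # mean accuracy lost to forced drops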

Figures

Figures reproduced from arXiv: 2603.24326 by Changda Zhou, Cheng Cui, Dianhai Yu, Hongen Liu, Jiaxuan Liu, Jing Zhang, Jun Zhang, Manhui Lin, Suyin Liang, Tingquan Gao, Ting Sun, Xing Wei, Xueqing Wang, Yanjun Ma, Yi Liu, Yubo Zhang, Yue Zhang, Zelun Zhang.

Figure 1: PaddleOCR-VL achieves state-of-the-art performance with the fewest vision tokens and parameters on OmniDocBench v1.5.

Figure 2: Architectural comparison of end-to-end VLM and our method. Among various types of document images, the valid area accounts […]

Figure 3: The overview of our proposed PaddleOCR-VL, which consists of two components. The first component is VRFM, which accu[…]

Figure 4: Architecture of the Valid Region Focus Module (VRFM).
Original abstract

Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PaddleOCR-VL, a coarse-to-fine document parsing architecture consisting of a lightweight Valid Region Focus Module (VRFM) that identifies semantically relevant image regions via localization and contextual prediction, followed by a compact 0.9B vision-language model (PaddleOCR-VL-0.9B) that performs detailed recognition on the filtered token set. It claims state-of-the-art results on page-level parsing and element-level recognition tasks, with substantial gains in inference speed and reductions in vision tokens and parameters relative to existing solutions and larger VLMs.

Significance. If the central performance claims hold under rigorous verification, the work offers a practical advance in efficient high-resolution document understanding by mitigating quadratic token scaling. The public release of code and models at https://github.com/PaddlePaddle/PaddleOCR further supports reproducibility and potential adoption in real-world OCR pipelines.

major comments (2)
  1. [Valid Region Focus Module and Experiments] The headline SOTA and efficiency claims rest on the untested assumption that VRFM correctly retains all semantically critical tokens (small text, tables, signatures, low-contrast elements) while discarding only redundant background. The manuscript reports only aggregate page- and element-level metrics; it does not quantify recall on critical sub-regions or present failure cases on dense or low-contrast layouts where filtering could cause irrecoverable accuracy drops. A concrete recall measurement is sketched after the minor comments below.
  2. [Abstract and Results] The abstract asserts SOTA performance and strong competitiveness against top-tier VLMs without providing any quantitative metrics, baseline tables, or ablation results. This omission makes it impossible to assess the magnitude of improvement or rule out post-hoc tuning, and the full experimental section must supply these details for the central claim to be verifiable.
minor comments (2)
  1. [Method] Notation for the VRFM output (filtered token set) and its integration with the 0.9B model could be clarified with an explicit diagram or pseudocode to improve reproducibility.
  2. [Discussion] The paper should include a limitations section discussing scenarios where VRFM filtering might degrade performance (e.g., highly cluttered documents).
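
The recall measurement major comment 1 asks for could be instrumented along these lines (a sketch under assumptions: the flat kept-token mask, annotated critical boxes, and the 14 px patch grid are hypothetical inputs, not artifacts of the paper):

    def critical_token_recall(kept_mask, critical_boxes, grid_w, patch=14):
        """Fraction of patch tokens overlapping annotated critical regions
        (small text, tables, signatures, low-contrast elements) that the
        coarse stage retained; `kept_mask` is flat over the patch grid."""
        total = kept = 0
        for (x0, y0, x1, y1) in critical_boxes:  # pixel coordinates
            for gy in range(y0 // patch, -(-y1 // patch)):  # ceil division
                for gx in range(x0 // patch, -(-x1 // patch)):
                    total += 1
                    kept += bool(kept_mask[gy * grid_w + gx])
        return kept / total if total else 1.0

Recall well below 1.0 on small-text or low-contrast boxes would localize exactly the failure mode the referee flags.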

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Valid Region Focus Module and Experiments] The headline SOTA and efficiency claims rest on the untested assumption that VRFM correctly retains all semantically critical tokens (small text, tables, signatures, low-contrast elements) while discarding only redundant background. The manuscript reports only aggregate page- and element-level metrics; it does not quantify recall on critical sub-regions or present failure cases on dense or low-contrast layouts where filtering could cause irrecoverable accuracy drops.

    Authors: We agree that granular validation of VRFM is valuable. While the reported aggregate metrics on page- and element-level tasks already demonstrate strong end-to-end performance, we will add in the revision: quantitative recall measurements for critical sub-regions (small text, tables, signatures, low-contrast elements) and an analysis of failure cases on dense or low-contrast layouts. These additions will directly address the concern about potential irrecoverable accuracy drops. revision: yes

  2. Referee: [Abstract and Results] The abstract asserts SOTA performance and strong competitiveness against top-tier VLMs without providing any quantitative metrics, baseline tables, or ablation results. This omission makes it impossible to assess the magnitude of improvement or rule out post-hoc tuning, and the full experimental section must supply these details for the central claim to be verifiable.

    Authors: We acknowledge that the abstract currently lacks specific numbers. In the revised manuscript we will update the abstract to include key quantitative results (accuracy gains, token reduction, parameter efficiency) together with explicit references to the baseline tables and ablations already present in the experimental section. This will make the magnitude of improvement directly verifiable from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical architecture derivation

Full rationale

The paper presents PaddleOCR-VL as an empirical coarse-to-fine architecture: a lightweight VRFM identifies valid tokens via localization and context prediction, after which a separately trained 0.9B VLM performs recognition on the filtered input. No equations, fitted-parameter predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Performance claims rest on external benchmarks and experiments rather than any internal reduction to the method's own inputs. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the new VRFM module and the 0.9B model; no free parameters are explicitly fitted in the abstract description, but the module itself is an invented component whose effectiveness is asserted empirically.

axioms (1)
  • domain assumption · High-resolution document images contain substantial redundant visual regions, such as background, that can be safely suppressed.
    Stated directly in the abstract as the motivation for the coarse-to-fine design.
invented entities (2)
  • Valid Region Focus Module (VRFM) · no independent evidence
    purpose: Lightweight module that identifies valid vision tokens via localization and contextual relationship prediction.
    New component introduced to guide the subsequent recognition stage.
  • PaddleOCR-VL-0.9B · no independent evidence
    purpose: Compact 0.9B vision-language model that performs detailed recognition on VRFM-selected regions.
    New model size and training regime proposed for the fine stage.

pith-pipeline@v0.9.0 · 5604 in / 1402 out tokens · 47271 ms · 2026-05-15T00:22:15.072025+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 28 canonical work pages · 9 internal anchors

  [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.

  [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  [4] Baidu-ERNIE-Team. ERNIE 4.5 technical report, 2025.

  [5] Leilani Battle, Peitong Duan, Zachery Miranda, Dana Mukusheva, Remco Chang, and Michael Stonebraker. Beagle: Automated extraction and interpretation of visualizations from the web. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–8, 2018.

  [6] chatdoc-com. OCRFlux. https://github.com/chatdoc-com/OCRFlux, 2025. Accessed 2025-09-25.

  [7] Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. OneChart: Purify the chart structural extraction via one auxiliary token. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 147–155, 2024.

  [8] Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, and Chi Zhang. DianJin-OCR-R1: Enhancing OCR capabilities via a reasoning-and-tool interleaved vision-language model. arXiv preprint arXiv:2508.13238, 2025.

  [9] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025.

  [10] Kenny Davila, Bhargava Urala Kota, Srirangaraj Setlur, Venu Govindaraju, Christopher Tensmeyer, Sumit Shekhar, and Ritwick Chaudhry. ICDAR 2019 competition on harvesting raw tables from infographics (CHART-Infographics). In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1594–1599. IEEE, 2019.

  [11] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.

  [12] Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, and Yu-Gang Jiang. UniRec-0.1B: Unified text and formula recognition with 0.1B parameters. arXiv preprint arXiv:2512.21095, 2025.

  [13] Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. arXiv preprint arXiv:2505.14059, 2025.

  [14] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024.

  [15] Philippe Gervais, Anastasiia Fadeeva, and Andrii Maksai. MathWriting: A dataset for handwritten mathematical expression recognition. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5459–5469, 2025.

  [16] Google DeepMind. Gemini 2.5. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025.

  [17] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

  [18] Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen, and Xuguang Lan. Relation DETR: Exploring explicit position relation prior for object detection. In European Conference on Computer Vision, pages 89–105. Springer, 2024.

  [19] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2018.

  [20] Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. Chart-to-Text: A large-scale benchmark for chart summarization. arXiv preprint arXiv:2203.06486, 2022.

  [21] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, 1966.

  [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  [23] Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025.

  [24] Cheng-Lin Liu, Fei Yin, Da-Han Wang, and Qiu-Feng Wang. CASIA online and offline Chinese handwriting databases. In 2011 International Conference on Document Analysis and Recognition, pages 37–41. IEEE, 2011.

  [25] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022.

  [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  [27] Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, et al. POINTS-Reader: Distillation-free adaptation of vision-language models for document conversion. arXiv preprint arXiv:2509.01215, 2025.

  [28] Lukas Blecher. pix2tex - LaTeX OCR. https://github.com/lukas-blecher/LaTeX-OCR, 2022. Accessed 2025-06-23.

  [29] Junyu Luo, Zekun Li, Jinpeng Wang, and Chin-Yew Lin. ChartOCR: Data extraction from charts images via a deep hybrid framework. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1917–1925, 2021.

  [30] Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, and Peter Staar. Optimized table tokenization for table structure recognition. In International Conference on Document Analysis and Recognition, pages 37–50. Springer, 2023.

  [31] Mahshad Mahdavi, Richard Zanibbi, Harold Mouchere, Christian Viard-Gaudin, and Utpal Garain. ICDAR 2019 CROHME+TFD: Competition on recognition of handwritten mathematical expressions and typeset formula detection. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1533–1538. IEEE, 2019.

  [32] Souvik Mandal, Ashish Talewar, Paras Ahuja, and Prathamesh Juvatkar. Nanonets-OCR-s: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging, 2025.

  [33] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.

  [34] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761, 2023.

  [35] Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.

  [36] Mistral AI Team. Mistral-OCR. https://mistral.ai/news/mistral-ocr, 2025.

  [37] Harold Mouchere, Christian Viard-Gaudin, Richard Zanibbi, and Utpal Garain. ICFHR 2014 competition on recognition of on-line handwritten mathematical expressions (CROHME 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 791–796. IEEE, 2014.

  [38] Harold Mouchère, Christian Viard-Gaudin, Richard Zanibbi, and Utpal Garain. ICFHR2016 CROHME: Competition on recognition of online handwritten mathematical expressions. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 607–612. IEEE, 2016.

  [39] Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186, 2025.

  [40] opendatalab. MinerU2.0-2505-0.9B. https://huggingface.co/opendatalab/MinerU2.0-2505-0.9B, 2025.

  [41] Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025.

  [42] PaddlePaddle Authors. ERNIEKit. https://github.com/PaddlePaddle/ERNIE, 2025.

  [43] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

  [44] Vik Paruchuri. Marker. https://github.com/datalab-to/marker, 2025. Accessed 2025-09-25.

  [45] Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443, 2025.

  [46] rednote-hilab. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025.

  [47] Ting Sun, Cheng Cui, Yuning Du, and Yi Liu. PP-DocLayout: A unified document layout detection model to accelerate large-scale data construction. arXiv preprint arXiv:2503.17213, 2025.

  [48] Benny J Tang, Angie Boggust, and Arvind Satyanarayan. VisText: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356, 2023.

  [49] Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai Keye-VL technical report. arXiv preprint arXiv:2507.01949, 2025.

  [50] Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. UniMERNet: A universal network for real-world mathematical expression recognition. arXiv preprint arXiv:2404.15254, 2024.

  [51] Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024.

  [52] Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Botian Shi, Bo Zhang, and Conghui He. Image over text: Transforming formula recognition evaluation with character detection matching. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19681–19690, 2025.

  [53] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  [54] Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General OCR theory: Towards OCR-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024.

  [55] Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

  [56] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  [57] Ye Yuan, Xiao Liu, Wondimu Dikubab, Hui Liu, Zhilong Ji, Zhongqin Wu, and Xiang Bai. Syntax-aware network for handwritten mathematical expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4553–4562, 2022.

  [58] Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, and Lianwen Jin. DocKylin: A large multimodal model for visual document understanding with efficient visual slimming. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9923–9932, 2025.

  [59] Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning. arXiv preprint arXiv:2404.16635, 2024.

  [60] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in Neural Information Processing Systems, 31, 2018.

  [61] Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16965–16974, 2024.

  [62] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation, 2020.

  [63] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.