pith. machine review for the scientific record.

arxiv: 2603.24326 · v2 · submitted 2026-03-25 · 💻 cs.CV · cs.AI · cs.IR

Recognition: unknown

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:22 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.IR
keywords document parsing · vision-language models · coarse-to-fine processing · vision token reduction · efficient OCR · page-level parsing · element recognition

The pith

PaddleOCR-VL filters redundant image regions with a lightweight module so a 0.9B vision-language model can parse documents at state-of-the-art accuracy using far fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-resolution document images cause quadratic growth in vision tokens for vision-language models, driving up compute costs. The authors argue that documents contain large redundant areas such as backgrounds that can be safely ignored. They introduce a lightweight Valid Region Focus Module that predicts which tokens carry semantic value and passes only those to a compact 0.9B model for final recognition. Experiments show the resulting system matches or exceeds larger models on both full-page parsing and element recognition while cutting token count and inference time.
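
For intuition on the scaling problem, here is a back-of-the-envelope sketch in Python (the 14 px patch size and the resolutions are illustrative assumptions, not values from the paper):

    # Illustrative numbers only: patch size and resolutions are assumptions
    # chosen for intuition, not the paper's configuration.
    PATCH = 14  # pixels per patch side, a common ViT choice

    def vision_tokens(side_px: int, patch: int = PATCH) -> int:
        """Patch tokens a plain ViT-style encoder emits for a square image."""
        return (side_px // patch) ** 2

    for side in (448, 896, 1792):
        n = vision_tokens(side)
        # Self-attention touches every token pair, so cost scales with n**2:
        # doubling the side length quadruples n and ~16x the attention work.
        print(f"{side}x{side} px -> {n:6d} tokens, ~{n * n:,} attention pairs")

Doubling the page side length quadruples the token count and multiplies pairwise attention work by roughly sixteen; filtering redundant regions attacks exactly this term.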

Core claim

PaddleOCR-VL uses a coarse-to-fine pipeline in which the Valid Region Focus Module first identifies valid vision tokens via localization and contextual prediction, then routes only those tokens to a trained 0.9B vision-language model that performs detailed document recognition without processing the entire high-resolution image.

What carries the argument

The Valid Region Focus Module (VRFM), a lightweight network that selects semantically relevant vision tokens from document images before they reach the recognition model.
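
As a minimal sketch of this coarse-to-fine routing (every name and the thresholding rule here are hypothetical; the actual VRFM predicts validity from localization and contextual relationships rather than applying a bare score cutoff):

    import numpy as np

    def vrfm_select(patch_tokens: np.ndarray, validity: np.ndarray, thresh: float = 0.5):
        """Coarse stage (hypothetical interface): keep only tokens whose
        predicted validity clears a threshold; `validity` stands in for
        the module's localization and context predictions."""
        keep = validity >= thresh
        return patch_tokens[keep], np.flatnonzero(keep)

    def parse_page(patch_tokens, validity, recognizer):
        """Fine stage: the compact recognizer sees only surviving tokens
        and their grid positions, never the full high-resolution image."""
        valid_tokens, positions = vrfm_select(patch_tokens, validity)
        return recognizer(valid_tokens, positions)

The load-bearing detail is the interface: the recognizer's sequence length is set by the valid region, not by page resolution, which is where the token and latency savings come from.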

If this is right

  • Page-level parsing and element-level recognition both reach state-of-the-art scores.
  • The system uses substantially fewer vision tokens and parameters than competing vision-language models.
  • Inference speed improves while maintaining or exceeding accuracy of larger models.
  • Targeted filtering of visual input proves effective for fine-grained document tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same filtering step could be applied to other high-resolution vision tasks to reduce token budgets without retraining the main model.
  • Small domain-specific models guided by coarse selection may close the gap with much larger general-purpose models in structured visual domains.
  • Further gains are possible if the Valid Region Focus Module is jointly trained end-to-end with the recognition model rather than used as a fixed preprocessor.

Load-bearing premise

Redundant background regions in documents can be identified reliably enough that the small model still recovers full accuracy from the remaining tokens.

What would settle it

Accuracy on a held-out document set drops sharply when the Valid Region Focus Module is forced to discard regions that contain critical text or layout elements.
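
A hedged sketch of that ablation, with `parse`, `score`, and `drop_critical` as placeholders rather than the paper's evaluation code:

    def forced_drop_ablation(pages, parse, score, drop_critical):
        """Compare normal parsing against runs where the coarse stage is
        forced to discard annotated critical regions (text, tables, layout
        elements). A sharp average drop would confirm the premise that the
        retained tokens carry the accuracy; a mild drop would suggest the
        fine model tolerates imperfect filtering."""
        deltas = []
        for page in pages:
            baseline = score(parse(page), page.reference)
            ablated = score(parse(drop_critical(page)), page.reference)
            deltas.append(baseline - ablated)
        return sum(deltas) / len(deltas)  # mean accuracy lost to forced drops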

Figures

Figures reproduced from arXiv: 2603.24326 by Changda Zhou, Cheng Cui, Dianhai Yu, Hongen Liu, Jiaxuan Liu, Jing Zhang, Jun Zhang, Manhui Lin, Suyin Liang, Tingquan Gao, Ting Sun, Xing Wei, Xueqing Wang, Yanjun Ma, Yi Liu, Yubo Zhang, Yue Zhang, Zelun Zhang.

Figure 1: PaddleOCR-VL achieves state-of-the-art performance with the fewest vision tokens and parameters on OmniDocBench v1.5.

Figure 2: Architectural comparison of end-to-end VLM and our method. Among various types of document images, the valid area accounts […]

Figure 3: The overview of our proposed PaddleOCR-VL, which consists of two components. The first component is VRFM, which accu[…]

Figure 4: Architecture of the Valid Region Focus Module (VRFM).
Original abstract

Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PaddleOCR-VL, a coarse-to-fine document parsing architecture consisting of a lightweight Valid Region Focus Module (VRFM) that identifies semantically relevant image regions via localization and contextual prediction, followed by a compact 0.9B vision-language model (PaddleOCR-VL-0.9B) that performs detailed recognition on the filtered token set. It claims state-of-the-art results on page-level parsing and element-level recognition tasks, with substantial gains in inference speed and reductions in vision tokens and parameters relative to existing solutions and larger VLMs.

Significance. If the central performance claims hold under rigorous verification, the work offers a practical advance in efficient high-resolution document understanding by mitigating quadratic token scaling. The public release of code and models at https://github.com/PaddlePaddle/PaddleOCR further supports reproducibility and potential adoption in real-world OCR pipelines.

major comments (2)
  1. [Valid Region Focus Module and Experiments] The headline SOTA and efficiency claims rest on the untested assumption that VRFM correctly retains all semantically critical tokens (small text, tables, signatures, low-contrast elements) while discarding only redundant background. The manuscript reports only aggregate page- and element-level metrics; it does not quantify recall on critical sub-regions or present failure cases on dense or low-contrast layouts where filtering could cause irrecoverable accuracy drops. A concrete recall measurement is sketched after the minor comments below.
  2. [Abstract and Results] The abstract asserts SOTA performance and strong competitiveness against top-tier VLMs without providing any quantitative metrics, baseline tables, or ablation results. This omission makes it impossible to assess the magnitude of improvement or rule out post-hoc tuning, and the full experimental section must supply these details for the central claim to be verifiable.
minor comments (2)
  1. [Method] Notation for the VRFM output (filtered token set) and its integration with the 0.9B model could be clarified with an explicit diagram or pseudocode to improve reproducibility.
  2. [Discussion] The paper should include a limitations section discussing scenarios where VRFM filtering might degrade performance (e.g., highly cluttered documents).
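
The recall measurement major comment 1 asks for could be instrumented along these lines (a sketch under assumptions: the flat kept-token mask, annotated critical boxes, and the 14 px patch grid are hypothetical inputs, not artifacts of the paper):

    def critical_token_recall(kept_mask, critical_boxes, grid_w, patch=14):
        """Fraction of patch tokens overlapping annotated critical regions
        (small text, tables, signatures, low-contrast elements) that the
        coarse stage retained; `kept_mask` is flat over the patch grid."""
        total = kept = 0
        for (x0, y0, x1, y1) in critical_boxes:  # pixel coordinates
            for gy in range(y0 // patch, -(-y1 // patch)):  # ceil division
                for gx in range(x0 // patch, -(-x1 // patch)):
                    total += 1
                    kept += bool(kept_mask[gy * grid_w + gx])
        return kept / total if total else 1.0

Recall well below 1.0 on small-text or low-contrast boxes would localize exactly the failure mode the referee flags.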

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Valid Region Focus Module and Experiments] The headline SOTA and efficiency claims rest on the untested assumption that VRFM correctly retains all semantically critical tokens (small text, tables, signatures, low-contrast elements) while discarding only redundant background. The manuscript reports only aggregate page- and element-level metrics; it does not quantify recall on critical sub-regions or present failure cases on dense or low-contrast layouts where filtering could cause irrecoverable accuracy drops.

    Authors: We agree that granular validation of VRFM is valuable. While the reported aggregate metrics on page- and element-level tasks already demonstrate strong end-to-end performance, we will add in the revision: quantitative recall measurements for critical sub-regions (small text, tables, signatures, low-contrast elements) and an analysis of failure cases on dense or low-contrast layouts. These additions will directly address the concern about potential irrecoverable accuracy drops. revision: yes

  2. Referee: [Abstract and Results] The abstract asserts SOTA performance and strong competitiveness against top-tier VLMs without providing any quantitative metrics, baseline tables, or ablation results. This omission makes it impossible to assess the magnitude of improvement or rule out post-hoc tuning, and the full experimental section must supply these details for the central claim to be verifiable.

    Authors: We acknowledge that the abstract currently lacks specific numbers. In the revised manuscript we will update the abstract to include key quantitative results (accuracy gains, token reduction, parameter efficiency) together with explicit references to the baseline tables and ablations already present in the experimental section. This will make the magnitude of improvement directly verifiable from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical architecture derivation

Full rationale

The paper presents PaddleOCR-VL as an empirical coarse-to-fine architecture: a lightweight VRFM identifies valid tokens via localization and context prediction, after which a separately trained 0.9B VLM performs recognition on the filtered input. No equations, fitted-parameter predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Performance claims rest on external benchmarks and experiments rather than any internal reduction to the method's own inputs. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the new VRFM module and the 0.9B model; no free parameters are explicitly fitted in the abstract description, but the module itself is an invented component whose effectiveness is asserted empirically.

axioms (1)
  • domain assumption · High-resolution document images contain substantial redundant visual regions, such as background, that can be safely suppressed.
    Stated directly in the abstract as the motivation for the coarse-to-fine design.
invented entities (2)
  • Valid Region Focus Module (VRFM) · no independent evidence
    purpose: Lightweight module that identifies valid vision tokens via localization and contextual relationship prediction.
    New component introduced to guide the subsequent recognition stage.
  • PaddleOCR-VL-0.9B · no independent evidence
    purpose: Compact 0.9B vision-language model that performs detailed recognition on VRFM-selected regions.
    New model size and training regime proposed for the fine stage.

pith-pipeline@v0.9.0 · 5604 in / 1402 out tokens · 47271 ms · 2026-05-15T00:22:15.072025+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 28 canonical work pages · 9 internal anchors

  [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.

  [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  [4] Baidu-ERNIE-Team. ERNIE 4.5 technical report, 2025.

  [5] Leilani Battle, Peitong Duan, Zachery Miranda, Dana Mukusheva, Remco Chang, and Michael Stonebraker. Beagle: Automated extraction and interpretation of visualizations from the web. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–8, 2018.

  [6] chatdoc-com. OCRFlux. https://github.com/chatdoc-com/OCRFlux, 2025. Accessed 2025-09-25.

  [7] Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. OneChart: Purify the chart structural extraction via one auxiliary token. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 147–155, 2024.

  [8] Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, and Chi Zhang. DianJin-OCR-R1: Enhancing OCR capabilities via a reasoning-and-tool interleaved vision-language model. arXiv preprint arXiv:2508.13238, 2025.

  [9] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025.

  [10] Kenny Davila, Bhargava Urala Kota, Srirangaraj Setlur, Venu Govindaraju, Christopher Tensmeyer, Sumit Shekhar, and Ritwick Chaudhry. ICDAR 2019 competition on harvesting raw tables from infographics (CHART-Infographics). In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1594–1599. IEEE, 2019.

  [11] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.

  [12] Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, and Yu-Gang Jiang. UniRec-0.1B: Unified text and formula recognition with 0.1B parameters. arXiv preprint arXiv:2512.21095, 2025.

  [13] Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. arXiv preprint arXiv:2505.14059, 2025.

  [14] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024.

  [15] Philippe Gervais, Anastasiia Fadeeva, and Andrii Maksai. MathWriting: A dataset for handwritten mathematical expression recognition. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5459–5469, 2025.

  [16] Google DeepMind. Gemini 2.5. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025.

  [17] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

  [18] Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen, and Xuguang Lan. Relation DETR: Exploring explicit position relation prior for object detection. In European Conference on Computer Vision, pages 89–105. Springer, 2024.

  [19] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2018.

  [20] Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. Chart-to-Text: A large-scale benchmark for chart summarization. arXiv preprint arXiv:2203.06486, 2022.

  [21] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, 1966.

  [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  [23] Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint arXiv:2506.05218, 2025.

  [24] Cheng-Lin Liu, Fei Yin, Da-Han Wang, and Qiu-Feng Wang. CASIA online and offline Chinese handwriting databases. In 2011 International Conference on Document Analysis and Recognition, pages 37–41. IEEE, 2011.

  [25] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022.

  [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  [27] Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, et al. POINTS-Reader: Distillation-free adaptation of vision-language models for document conversion. arXiv preprint arXiv:2509.01215, 2025.

  [28] Lukas Blecher. pix2tex - LaTeX OCR. https://github.com/lukas-blecher/LaTeX-OCR, 2022. Accessed 2025-06-23.

  [29] Junyu Luo, Zekun Li, Jinpeng Wang, and Chin-Yew Lin. ChartOCR: Data extraction from charts images via a deep hybrid framework. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1917–1925, 2021.

  [30] Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, and Peter Staar. Optimized table tokenization for table structure recognition. In International Conference on Document Analysis and Recognition, pages 37–50. Springer, 2023.

  [31] Mahshad Mahdavi, Richard Zanibbi, Harold Mouchere, Christian Viard-Gaudin, and Utpal Garain. ICDAR 2019 CROHME+TFD: Competition on recognition of handwritten mathematical expressions and typeset formula detection. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1533–1538. IEEE, 2019.

  [32] Souvik Mandal, Ashish Talewar, Paras Ahuja, and Prathamesh Juvatkar. Nanonets-OCR-s: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging, 2025.

  [33] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.

  [34] Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, and Shafiq Joty. UniChart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761, 2023.

  [35] Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.

  [36] Mistral AI Team. Mistral-OCR. https://mistral.ai/news/mistral-ocr, 2025.

  [37] Harold Mouchere, Christian Viard-Gaudin, Richard Zanibbi, and Utpal Garain. ICFHR 2014 competition on recognition of on-line handwritten mathematical expressions (CROHME 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 791–796. IEEE, 2014.

  [38] Harold Mouchère, Christian Viard-Gaudin, Richard Zanibbi, and Utpal Garain. ICFHR2016 CROHME: Competition on recognition of online handwritten mathematical expressions. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 607–612. IEEE, 2016.

  [39] Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186, 2025.

  [40] opendatalab. MinerU2.0-2505-0.9B. https://huggingface.co/opendatalab/MinerU2.0-2505-0.9B, 2025.

  [41] Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24838–24848, 2025.

  [42] PaddlePaddle Authors. ERNIEKit. https://github.com/PaddlePaddle/ERNIE, 2025.

  [43] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

  [44] Vik Paruchuri. Marker. https://github.com/datalab-to/marker, 2025. Accessed 2025-09-25.

  [45] Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443, 2025.

  [46] rednote-hilab. dots.ocr: Multilingual document layout parsing in a single vision-language model, 2025.

  [47] Ting Sun, Cheng Cui, Yuning Du, and Yi Liu. PP-DocLayout: A unified document layout detection model to accelerate large-scale data construction. arXiv preprint arXiv:2503.17213, 2025.

  [48] Benny J Tang, Angie Boggust, and Arvind Satyanarayan. VisText: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356, 2023.

  [49] Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai Keye-VL technical report. arXiv preprint arXiv:2507.01949, 2025.

  [50] Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. UniMERNet: A universal network for real-world mathematical expression recognition. arXiv preprint arXiv:2404.15254, 2024.

  [51] Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024.

  [52] Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Botian Shi, Bo Zhang, and Conghui He. Image over text: Transforming formula recognition evaluation with character detection matching. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19681–19690, 2025.

  [53] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  [54] Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General OCR theory: Towards OCR-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024.

  [55] Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

  [56] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  [57] Ye Yuan, Xiao Liu, Wondimu Dikubab, Hui Liu, Zhilong Ji, Zhongqin Wu, and Xiang Bai. Syntax-aware network for handwritten mathematical expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4553–4562, 2022.

  [58] Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, and Lianwen Jin. DocKylin: A large multimodal model for visual document understanding with efficient visual slimming. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9923–9932, 2025.

  [59] Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang. TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning. arXiv preprint arXiv:2404.16635, 2024.

  [60] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in Neural Information Processing Systems, 31, 2018.

  [61] Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16965–16974, 2024.

  [62] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation, 2020.

  [63] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.