pith. machine review for the scientific record.

arxiv: 2604.04771 · v2 · submitted 2026-04-06 · 💻 cs.CV · cs.CL


MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale


Pith reviewed 2026-05-10 18:53 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords document parsing · data engineering · training data · OmniDocBench · hard samples · progressive training · annotation verification · vision-language models

The pith

MinerU2.5-Pro shows that data engineering alone can push a 1.2B document parser past all larger models to 95.69 on OmniDocBench v1.6.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that document parsing performance is limited more by gaps in training data than by model architecture or scale, since models of diverse architectures and sizes fail on the same hard samples. It demonstrates this by keeping the original 1.2B-parameter MinerU2.5 model fixed while building a Data Engine that expands the dataset to 65.5 million samples through diversity- and difficulty-aware selection, uses cross-model agreement to generate reliable labels, and applies iterative render-then-verify refinement to difficult cases. A three-stage training process then exploits these data tiers in sequence. A sympathetic reader would care because the result suggests future progress can come from cheaper, more accessible data work instead of ever-larger models.

Core claim

State-of-the-art document parsing models of many architectures and sizes show highly consistent failure patterns on the same hard samples, which indicates that the bottleneck is shared deficiencies in training data rather than architectural differences. MinerU2.5-Pro keeps its 1.2B-parameter architecture unchanged and advances performance purely through data engineering: Diversity-and-Difficulty-Aware Sampling scales the data from under 10M to 65.5M samples while mitigating distribution shift; Cross-Model Consistency Verification generates reliable annotations from model consensus; and the Judge-and-Refine pipeline corrects hard-sample labels via render-then-verify iteration. Combined with three-stage progressive training of large-scale pre-training, hard-sample fine-tuning, and GRPO alignment, this yields 95.69 on OmniDocBench v1.6, a 2.71-point gain over the same-architecture baseline.
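
To ground the verification step, here is a minimal sketch of what consensus scoring could look like, assuming model outputs are compared with a normalized string-similarity measure (the paper cites Levenshtein edit distance; the dependency-free difflib ratio stands in for it here). The function names and the threshold `tau` are illustrative assumptions, not the authors' implementation.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairwise_agreement(outputs: list[str]) -> float:
    # Mean pairwise similarity across heterogeneous model outputs;
    # a stand-in for the edit-distance consensus the paper describes.
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def verify(outputs: list[str], tau: float = 0.95):
    # Hypothetical decision rule: high consensus -> keep the most central
    # output as a pseudo-label; low consensus -> flag as a hard sample for
    # Judge-and-Refine. tau is an assumed threshold, not from the paper.
    score = pairwise_agreement(outputs)
    if score >= tau:
        label = max(outputs, key=lambda o: sum(
            SequenceMatcher(None, o, other).ratio() for other in outputs))
        return "easy", label, score
    return "hard", None, score
```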

What carries the argument

The Data Engine, built around Diversity-and-Difficulty-Aware Sampling, Cross-Model Consistency Verification, and the Judge-and-Refine pipeline, together with a three-stage progressive training strategy of large-scale pre-training, hard-sample fine-tuning, and GRPO alignment.
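
As a sketch of how the render-then-verify loop might be wired, assuming rendering, judging, and refining are three separable steps; the `render`, `judge`, and `refine` callables below are hypothetical interfaces, not the authors' code.

```python
def judge_and_refine(page_image, annotation, render, judge, refine,
                     max_rounds: int = 3):
    # Hypothetical render-then-verify loop in the spirit of Judge-and-Refine.
    # render: structured annotation (e.g., table/formula markup) -> image
    # judge:  (source image, rendered image) -> (ok, feedback)
    # refine: (annotation, feedback) -> corrected annotation
    for _ in range(max_rounds):
        ok, feedback = judge(page_image, render(annotation))
        if ok:
            return annotation, True   # accepted as a verified hard-sample label
        annotation = refine(annotation, feedback)
    return annotation, False          # unresolved; would fall to manual review
```

The fixed round budget mirrors the iterative correction the abstract describes; where the real pipeline stops iterating is not specified in the material above.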

If this is right

  • Data engineering and staged training can deliver larger gains than increasing model size in document parsing.
  • A revised OmniDocBench v1.6 with corrected element-matching biases and a dedicated Hard subset gives a more reliable evaluation of progress on difficult cases.
  • The same data engine and progressive training approach can be applied to other models without changing their architectures.
  • Consistent failure patterns across models imply that data improvements transfer across different architectures and scales.
  • Performance above 95 on the benchmark becomes achievable without models exceeding 1.2B parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If data deficiencies explain most failures across models, then shared public datasets of verified hard samples could accelerate progress for the whole field.
  • The method suggests that computational budgets might shift from training ever-larger models toward curation and verification of training data.
  • Similar cross-model verification and refinement steps could improve training data quality in other vision-language tasks that exhibit overlapping error patterns.
  • Future benchmarks may need to prioritize hard-sample coverage and annotation accuracy to drive further data-centric advances.

Load-bearing premise

The Cross-Model Consistency Verification and Judge-and-Refine pipeline produce unbiased, high-accuracy annotations for hard samples without introducing systematic selection biases or errors that affect the reported benchmark gains.

What would settle it

Re-training the original baseline model on the new 65.5M-sample dataset with the three-stage strategy and failing to reproduce the 2.71-point gain on OmniDocBench v1.6 would undercut the claim; so would finding that the set of consistently hard samples shifts once the new training is applied.

Figures

Figures reproduced from arXiv: 2604.04771 by Bangrui Xu, Bin Wang, Bowen Zhou, Chao Xu, Conghui He, Dahua Lin, Dongsheng Ma, Fan Wu, Hejun Dong, Huaping Zhong, Jiang Wu, Jiantao Qiu, Jiayong Shi, Jie Yang, Jing Yu, Junbo Niu, Jutao Xiao, Kai Chen, Lijun Wu, Linke Ouyang, Liqun Wei, Mengzhang Cai, Pengyu Liao, Qianqian Wu, Qintong Zhang, Shasha Wang, Tao Chu, Tianyao He, Weijia Li, Weijun Zeng, Wei Li, Wentao Zhang, Wenzheng Zhang, Xiaomeng Zhao, Xuanhe Zhou, Yuan Qu, Yuefeng Sun, Yu Qiao, Zhenjiang Jin, Zhenxiang Li, Zhiyuan Zhao, Zhongying Tu, Ziyang Miao.

Figure 1: Performance comparison on OmniDocBench v1.6, which comprises Base (standard samples) …
Figure 2: Overview of the Data Engine pipeline. The system co-optimizes three dimensions: Coverage, …
Figure 3: The DDAS pipeline operates at two granularity levels.
Figure 4: Examples of element-matching bias in OmniDocBench v1.5. Semantically correct predictions …
Figure 5: Layout Detection examples. The model localizes content regions with bounding boxes …
Figure 6: Text Recognition examples across Chinese, English, and mixed-language text regions.
Figure 7: Formula Recognition examples including single-line display formulas and complex multi-line …
Figure 8: Table Recognition examples showing OTSL token output and the corresponding rendered …
Figure 9: Image-aware parsing examples. The model classifies image regions into fine-grained subtypes …
Figure 10: Illustration of truncated paragraph merging. In multi-column and complex layouts, Layout …
Figure 11: Cross-page table merging example. The model performs semantic understanding at the …
Figure 12: In-table image detection and recognition. Embedded images within table cells are masked …
Figure 13: Qualitative comparison on rotated table recognition. MinerU2.5-Pro correctly recovers the …
Figure 14: Qualitative comparison on tables with long merged cells. MinerU2.5-Pro preserves the span …
Figure 15: Qualitative comparison on complex matrix recognition. MinerU2.5-Pro accurately captures …
Figure 16: Qualitative comparison on multi-line formula recognition. The row-by-row analysis of …
Figure 17: Qualitative comparison on image-aware chart parsing (Part 1). MinerU2.5-Pro extracts …
Figure 18: Qualitative comparison on image-aware chart parsing (Part 2). Additional chart types …
Original abstract

Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architectural differences. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art purely through data engineering and training strategy design while retaining the 1.2B-parameter architecture of MinerU2.5 unchanged. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while mitigating distribution shift; Cross-Model Consistency Verification leverages output consensus among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy--large-scale pre-training, hard sample fine-tuning, and GRPO alignment--sequentially exploits these data at different quality tiers. On the evaluation front, we rectify element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including those based on models with over 200x more parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that document parsing performance is limited by training data deficiencies rather than architecture, as evidenced by consistent failure patterns across diverse models. It introduces MinerU2.5-Pro, which retains the 1.2B-parameter MinerU2.5 architecture but uses a co-designed Data Engine (Diversity-and-Difficulty-Aware Sampling to reach 65.5M samples, Cross-Model Consistency Verification for difficulty and annotations, and Judge-and-Refine for hard-sample correction) plus a three-stage training strategy (pre-training, hard-sample fine-tuning, GRPO alignment) to achieve 95.69 on the authors' rectified OmniDocBench v1.6, a 2.71-point gain over the identical-architecture baseline and surpassing models with >200x parameters.

Significance. If the gains are attributable solely to the data strategies, this is a notable demonstration that systematic data engineering can outperform architectural scaling in document parsing, where models share failure modes on hard samples. The observation of cross-model consistency on difficult examples and the scale of the 65.5M-sample dataset with progressive training tiers are concrete strengths that could shift research emphasis toward data curation pipelines.

major comments (3)
  1. [OmniDocBench v1.6 protocol] OmniDocBench v1.6 protocol (abstract and evaluation description): the rectification of element-matching biases from v1.5 and creation of the Hard subset must be shown to be independent of the Cross-Model Consistency Verification pipeline, because both the training annotations for hard samples and the benchmark changes rely on model consensus; any shared error patterns would directly inflate the reported 2.71-point gain on the Hard subset.
  2. [Data Engine] Data Engine description (abstract): no ablation results isolate the contribution of Diversity-and-Difficulty-Aware Sampling, Cross-Model Consistency Verification, or Judge-and-Refine to the 2.71-point improvement; without these controls it remains unclear whether the gains stem from the claimed data strategies or from unstated factors such as training schedule changes.
  3. [Three-stage training strategy] Three-stage training strategy (abstract): the paper provides no error analysis, human audit, or cross-validation of the Judge-and-Refine annotations against an independent source, leaving open the possibility that correlated model biases on hard samples are reinforced in both the 65.5M training set and the test protocol.
minor comments (2)
  1. [Data Engine] The abstract states expansion 'from under 10M to 65.5M samples' but does not specify the exact sources, diversity metrics, or difficulty thresholds used in sampling; these implementation details should be added for reproducibility (a hedged sketch of what such a sampler might look like follows this list).
  2. [Training strategy] GRPO alignment hyperparameters are listed as free parameters in the approach but no values or selection procedure are provided; this should be clarified in the training section.
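
As flagged in minor comment 1, the sampling thresholds are unspecified; the sketch below shows one plausible shape a diversity-and-difficulty-aware sampler could take, bucketing by a diversity key and upweighting difficult samples within each bucket. Every name and constant here (`per_bucket_cap`, `hard_boost`, the use of cluster ids as buckets) is an illustrative assumption, not the paper's procedure.

```python
import random
from collections import defaultdict

def ddas_sample(pool, n_target, diversity_key, difficulty,
                per_bucket_cap=0.05, hard_boost=2.0):
    # Hypothetical diversity-and-difficulty-aware sampler.
    # diversity_key(s) -> bucket id (e.g., a layout-cluster id)
    # difficulty(s)    -> score in [0, 1] (e.g., 1 - cross-model consensus)
    buckets = defaultdict(list)
    for s in pool:
        buckets[diversity_key(s)].append(s)

    cap = max(1, int(n_target * per_bucket_cap))  # no single bucket dominates
    selected = []
    for members in buckets.values():
        weights = [1.0 + hard_boost * difficulty(s) for s in members]
        k = min(cap, len(members))
        # with-replacement draw for brevity; a real pipeline would dedupe
        selected.extend(random.choices(members, weights=weights, k=k))
    random.shuffle(selected)
    return selected[:n_target]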

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, agreeing that further clarifications and analyses will improve the manuscript. We will incorporate the suggested revisions in the next version.

Point-by-point responses
  1. Referee: [OmniDocBench v1.6 protocol] OmniDocBench v1.6 protocol (abstract and evaluation description): the rectification of element-matching biases from v1.5 and creation of the Hard subset must be shown to be independent of the Cross-Model Consistency Verification pipeline, because both the training annotations for hard samples and the benchmark changes rely on model consensus; any shared error patterns would directly inflate the reported 2.71-point gain on the Hard subset.

    Authors: We acknowledge the need to explicitly demonstrate independence to avoid any perception of circularity. The OmniDocBench v1.6 rectification was based on statistical analysis of element-matching discrepancies in v1.5 outputs combined with manual review of sampled cases, using a distinct collection of models and evaluation scripts separate from the Cross-Model Consistency Verification models employed for training data curation. The Hard subset threshold was derived from aggregate failure rates across a wide range of publicly available models not involved in our Data Engine. In the revised manuscript we will add a dedicated subsection in the evaluation protocol description that details the exact models, scripts, and manual audit procedures used for benchmark updates, along with a comparison showing that performance gains remain consistent when evaluated against the original v1.5 protocol on the same test samples. revision: yes
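
A minimal sketch of the threshold derivation the authors describe, assuming per-sample metric scores from public models are available; both cutoffs below are placeholders, not values from the paper or the rebuttal.

```python
def hard_subset(scores, fail_below=0.6, min_fail_frac=0.8):
    # scores: dict mapping sample_id -> list of per-model metric values
    # from public models *not* used in the Data Engine. A sample joins
    # the Hard subset when most models fall below a failure cutoff.
    subset = []
    for sample_id, per_model in scores.items():
        fails = sum(1 for v in per_model if v < fail_below)
        if per_model and fails / len(per_model) >= min_fail_frac:
            subset.append(sample_id)
    return subset
```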

  2. Referee: [Data Engine] Data Engine description (abstract): no ablation results isolate the contribution of Diversity-and-Difficulty-Aware Sampling, Cross-Model Consistency Verification, or Judge-and-Refine to the 2.71-point improvement; without these controls it remains unclear whether the gains stem from the claimed data strategies or from unstated factors such as training schedule changes.

    Authors: We agree that component-wise ablations are necessary to isolate the contributions of each Data Engine module. Although the current manuscript emphasizes end-to-end results, we will add a new ablation study subsection. This will report three controlled experiments that reuse the identical three-stage training schedule and hyper-parameters: (i) baseline with only Diversity-and-Difficulty-Aware Sampling, (ii) addition of Cross-Model Consistency Verification, and (iii) full pipeline including Judge-and-Refine. Incremental accuracy deltas on OmniDocBench v1.6 will be presented to quantify the marginal benefit of each strategy while holding training schedule constant. revision: yes
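
The promised ablation reduces to reporting incremental deltas over a fixed schedule; a trivial bookkeeping sketch follows, with the configuration order paraphrased from the response and scores left as placeholders to be filled by the new experiments, not numbers from the paper.

```python
def report_deltas(results):
    # results: list of (config_name, omnidocbench_v16_score) in the
    # cumulative order (i) DDAS only, (ii) + Cross-Model Consistency
    # Verification, (iii) + Judge-and-Refine.
    prev = None
    for name, score in results:
        delta = "" if prev is None else f"  (Δ {score - prev:+.2f})"
        print(f"{name:40s} {score:6.2f}{delta}")
        prev = score
```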

  3. Referee: [Three-stage training strategy] Three-stage training strategy (abstract): the paper provides no error analysis, human audit, or cross-validation of the Judge-and-Refine annotations against an independent source, leaving open the possibility that correlated model biases on hard samples are reinforced in both the 65.5M training set and the test protocol.

    Authors: This concern about potential bias reinforcement is well-taken. We will expand the training strategy section with a human audit of 2,000 randomly sampled Judge-and-Refine outputs, reporting inter-annotator agreement (Cohen's kappa) and error-type breakdown. We will also add cross-validation results comparing the refined annotations against an independent held-out model ensemble and a small manually annotated reference set. The revised text will include quantitative error analysis demonstrating the reduction in annotation inconsistencies for hard samples and will discuss the use of heterogeneous model ensembles within Cross-Model Consistency Verification as a safeguard against correlated biases. revision: yes
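
The inter-annotator agreement figure the authors commit to is standard; for reference, a self-contained two-rater Cohen's kappa over categorical audit verdicts (the verdict labels are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # kappa = (p_o - p_e) / (1 - p_e), with p_o the observed agreement
    # and p_e the agreement expected from each rater's label marginals.
    # Inputs: equal-length lists of verdicts, e.g. "correct" /
    # "minor error" / "major error" per audited Judge-and-Refine output.
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```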

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

Full rationale

The paper attributes its 2.71-point gain on OmniDocBench v1.6 solely to data curation (Diversity-and-Difficulty-Aware Sampling, Cross-Model Consistency Verification, Judge-and-Refine) and a three-stage training schedule applied to an unchanged 1.2B architecture. These steps are described as independent engineering choices whose outputs are evaluated against an explicitly rectified external benchmark protocol; no equations, fitted parameters, or self-definitions reduce the reported scores to the inputs by construction. Benchmark rectification and hard-sample selection are presented as separate processes from the training pipeline, with no load-bearing self-citation chain or ansatz that forces the result. The comparison to the same-architecture baseline remains falsifiable on the shared protocol.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about the reliability of model consensus for annotation and the bias-free nature of iterative refinement; specific implementation thresholds in sampling and verification are not detailed.

free parameters (2)
  • Diversity-and-difficulty sampling thresholds and weights
    Choices that expand the dataset from under 10M to 65.5M samples while mitigating distribution shift.
  • GRPO alignment hyperparameters
    Parameters controlling the final alignment stage after pre-training and fine-tuning.
axioms (2)
  • domain assumption: Output consensus among heterogeneous models reliably indicates sample difficulty and produces accurate annotations
    Invoked in Cross-Model Consistency Verification to assess and label hard samples.
  • domain assumption: Render-then-verify iterative correction improves annotation quality for hard samples without introducing new systematic errors
    Basis for the Judge-and-Refine pipeline applied to difficult cases.

pith-pipeline@v0.9.0 · 5764 in / 1470 out tokens · 40570 ms · 2026-05-10T18:53:37.675626+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

    cs.CV 2026-05 conditional novelty 8.0

    PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
