pith. machine review for the scientific record.

arxiv: 2409.18839 · v1 · submitted 2024-09-27 · 💻 cs.CV

Recognition: 2 theorem links

MinerU: An Open-Source Solution for Precise Document Content Extraction

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords document content extraction, PDF parsing, OCR, layout detection, formula recognition, open source, computer vision

The pith

MinerU combines PDF-Extract-Kit models with custom rules to deliver high-precision, open-source document content extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MinerU as an open-source system for extracting text, layout, formulas, and other content from PDFs and similar documents. It builds on existing models for OCR, layout detection, and formula recognition while adding preprocessing and postprocessing rules to handle variations in document styles. Experiments show that this combination produces more consistent and accurate results than prior open-source tools across different document types. If the approach holds, it makes reliable document parsing available without proprietary software. This matters for any workflow that turns scanned or digital documents into usable structured data.
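The pipeline shape described above, model stages wrapped by rule stages, can be sketched in a few lines. Every function and data shape below is an illustrative placeholder, not MinerU's actual API:

```python
# Hypothetical sketch of the extraction pipeline the paper describes:
# model stages (layout detection, OCR, formula recognition) wrapped by
# rule-based preprocessing and postprocessing. All names are invented.

def preprocess(page):
    """Rule stage: normalize the page before the models see it."""
    page = dict(page)
    page["text"] = page["text"].strip()
    return page

def detect_layout(page):
    """Model-stage stand-in: tag the page content with a region type."""
    return [{"type": "text", "content": page["text"]}]

def recognize(region):
    """Model-stage stand-in for OCR / formula recognition per region."""
    return region["content"]

def postprocess(blocks):
    """Rule stage: drop empty blocks, merge the rest in reading order."""
    return "\n".join(b for b in blocks if b)

def extract(pages):
    out = []
    for page in pages:
        page = preprocess(page)
        regions = detect_layout(page)
        out.append(postprocess([recognize(r) for r in regions]))
    return out

print(extract([{"text": "  Hello world  "}]))  # → ['Hello world']
```

The point of the sketch is the division of labor: the models produce raw regions, while the rules before and after them absorb the document-to-document variation the paper emphasizes.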

Core claim

MinerU leverages the PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction.

What carries the argument

PDF-Extract-Kit models combined with finely-tuned preprocessing and postprocessing rules that correct and refine raw model outputs for final content accuracy.
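One flavor of "rules that correct and refine raw model outputs" can be made concrete with an invented example: merging adjacent text blocks that a layout model split mid-sentence, as happens at column or page breaks. This is a hypothetical rule, not one taken from MinerU's code:

```python
# Illustrative postprocessing rule of the kind described above: merge a
# block into its predecessor when the predecessor ends mid-sentence
# (no terminal punctuation). Hypothetical, not MinerU's actual rule set.

TERMINALS = (".", "!", "?", ":")

def merge_split_blocks(blocks):
    merged = []
    for block in blocks:
        if merged and not merged[-1].rstrip().endswith(TERMINALS):
            # Predecessor ends mid-sentence: glue this block onto it.
            merged[-1] = merged[-1].rstrip() + " " + block.lstrip()
        else:
            merged.append(block)
    return merged

raw = ["MinerU extracts text, layout,", "and formulas.", "It is open source."]
print(merge_split_blocks(raw))
# → ['MinerU extracts text, layout, and formulas.', 'It is open source.']
```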

If this is right

  • Document analysis pipelines can obtain more consistent text and layout data without switching tools per document type.
  • Downstream tasks such as information retrieval and data mining from PDFs become more reliable.
  • The open-source release allows direct inspection and modification of the extraction pipeline.
  • Users gain a single tool that maintains performance across reports, papers, forms, and mixed-content pages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integration into larger AI document-understanding systems could reduce the need for separate post-processing stages.
  • The same rule-based refinement approach might extend to additional file formats beyond PDF.
  • Community contributions could test and improve generalization on languages or layouts absent from the original experiments.
  • Direct head-to-head comparisons with commercial extraction services on identical test sets would clarify practical trade-offs.

Load-bearing premise

The preprocessing and postprocessing rules will continue to work on documents that differ from the tested collection.

What would settle it

Measure extraction accuracy on a new, independently collected set of PDFs with varied layouts, languages, and content types; if accuracy falls substantially below the levels reported in the paper, the claim does not hold.
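The test proposed above amounts to a regression check: run the tool on an independently collected set and compare mean accuracy against the reported level. A minimal sketch, where the scores, reference level, and tolerance are all made-up placeholders rather than figures from the paper:

```python
# Sketch of the proposed test: extraction accuracy on an independently
# collected PDF set, compared against a reported reference level.
# All numbers below are hypothetical placeholders.

def holds_up(new_scores, reported_level, tolerance=0.05):
    """True if mean accuracy on the new set stays within `tolerance`
    of the reported level; otherwise the claim does not hold."""
    mean = sum(new_scores) / len(new_scores)
    return mean >= reported_level - tolerance

# Hypothetical per-document accuracies on a fresh, varied PDF set.
scores = [0.91, 0.88, 0.93, 0.86]
print(holds_up(scores, reported_level=0.92))  # → True (mean 0.895)
```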

read the original abstract

Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents MinerU, an open-source pipeline for precise document content extraction that combines PDF-Extract-Kit models with custom preprocessing and postprocessing rules. It claims that this combination delivers consistently high performance across diverse document types and improves extraction quality and consistency over existing open-source solutions. The project repository is made publicly available.

Significance. If the performance claims hold under rigorous evaluation, MinerU would provide a practical, reproducible engineering contribution to document analysis in computer vision by offering an accessible pipeline that addresses variability in document layouts and content. The open-source release enables direct community testing and extension.

major comments (1)
  1. [§4 Experiments] The section asserts that 'experimental results demonstrate that MinerU consistently achieves high performance' but supplies no quantitative metrics (e.g., precision/recall/F1 per task or per document type), no baseline comparisons, and no description of the test collection's size or composition. This information is load-bearing for the central claim of superiority and consistency.
minor comments (2)
  1. [Abstract] The abstract and introduction could explicitly list the document categories (e.g., academic papers, forms, tables) on which the pipeline was evaluated.
  2. Add a reproducibility statement specifying the exact commit hash or release version of both MinerU and PDF-Extract-Kit used for the reported results.
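The per-type breakdown the referee requests can be computed from counts of true positives, false positives, and false negatives grouped by document type. A stdlib-only sketch with invented counts, not results from the paper:

```python
# Per-document-type precision/recall/F1 of the kind the referee asks for.
# The (doc_type, tp, fp, fn) records below are invented placeholders.
from collections import defaultdict

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, guarding zero divides."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

records = [("paper", 90, 5, 10), ("paper", 80, 10, 5), ("form", 70, 20, 15)]

# Aggregate counts per document type before computing metrics.
totals = defaultdict(lambda: [0, 0, 0])
for doc_type, tp, fp, fn in records:
    totals[doc_type][0] += tp
    totals[doc_type][1] += fp
    totals[doc_type][2] += fn

for doc_type, (tp, fp, fn) in totals.items():
    p, r, f1 = prf1(tp, fp, fn)
    print(f"{doc_type}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Reporting this table per task (text, layout, formulas, tables) and per document type is exactly the shape of evidence the major comment says is missing.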

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We have addressed the concern regarding the lack of quantitative details in the Experiments section.

read point-by-point responses
  1. Referee: [§4 Experiments] The section asserts that 'experimental results demonstrate that MinerU consistently achieves high performance' but supplies no quantitative metrics (e.g., precision/recall/F1 per task or per document type), no baseline comparisons, and no description of the test collection's size or composition. This information is load-bearing for the central claim of superiority and consistency.

    Authors: We agree with the referee that the current version of §4 does not provide the requested quantitative metrics, baseline comparisons, or details on the test collection. In the revised manuscript we will expand the Experiments section to report precision, recall, and F1 scores broken down by task and document type, include comparisons against relevant open-source baselines, and describe the test collection (size, composition, and selection criteria). These additions will directly support the claims of consistent high performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an engineering pipeline (PDF-Extract-Kit models plus custom preprocessing/postprocessing rules) whose performance is asserted through experiments on document extraction tasks. No derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs are present. The work is self-contained as a release of an open-source tool, with results directly testable externally rather than forced by internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering paper. No free parameters are introduced in the abstract, no new axioms beyond standard computer-vision assumptions, and no invented physical or mathematical entities.

pith-pipeline@v0.9.0 · 5482 in / 1016 out tokens · 78327 ms · 2026-05-16T03:55:23.006745+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FollowTable: A Benchmark for Instruction-Following Table Retrieval

    cs.IR 2026-05 unverdicted novelty 8.0

    FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond ...

  2. PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

    cs.AI 2026-05 unverdicted novelty 7.0

    PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

  3. HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray

    cs.LG 2026-05 accept novelty 7.0

    HUGO-CS is a 4,383-experiment cold-spray dataset extracted from literature via a new hybrid LLM-manual framework that is 30 times larger than prior collections and released with code.

  4. REGREACT: Self-Correcting Multi-Agent Pipelines for Structured Regulatory Information Extraction

    cs.MA 2026-04 unverdicted novelty 7.0

    RegReAct deploys self-correcting multi-agent pipelines across seven stages to extract hierarchical compliance criteria from regulatory texts, outperforming single-pass GPT-4o on EU Taxonomy documents.

  5. ParseBench: A Document Parsing Benchmark for AI Agents

    cs.CV 2026-04 accept novelty 7.0

    ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.

  6. EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.

  7. MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

  8. From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

    cs.IR 2026-03 unverdicted novelty 7.0

    Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.

  9. El Agente Quntur: A research collaborator agent for quantum chemistry

    physics.chem-ph 2026-02 unverdicted novelty 7.0

    El Agente Quntur is a new multi-agent system that uses reasoning over literature and software documentation to autonomously handle the full workflow of quantum chemistry experiments in ORCA.

  10. DocAtlas: Multilingual Document Understanding Across 80+ Languages

    cs.CL 2026-05 unverdicted novelty 6.0

    DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

  11. ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

    cs.CL 2026-05 unverdicted novelty 6.0

    ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

  12. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  13. BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

    cs.AI 2026-04 unverdicted novelty 6.0

    BioMiner introduces a multi-modal extraction system and BioVista benchmark that achieves F1 0.32 on bioactivity triplets and demonstrates utility in scaling datasets and improving QSAR models.

  14. Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

    cs.CL 2026-04 unverdicted novelty 6.0

    Doc-V* proposes a coarse-to-fine interactive visual reasoning agent for multi-page document VQA that aggregates evidence selectively via semantic retrieval and targeted fetching, outperforming baselines by up to 47.9%...

  15. QoS-QoE Translation with Large Language Model

    cs.MM 2026-04 unverdicted novelty 6.0

    A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.

  16. Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

    cs.CV 2026-03 conditional novelty 6.0

    PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.

  17. Thinking with Drafting: Optical Decompression via Logical Reconstruction

    cs.CL 2026-02 unverdicted novelty 6.0

    Thinking with Drafting reconceptualizes visual reasoning as optical decompression by forcing models to draft mental models into executable DSL code for deterministic self-verification on the VisAlg benchmark.

  18. DeepSeek-OCR: Contexts Optical Compression

    cs.CV 2025-10 unverdicted novelty 6.0

    DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

  19. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.

  20. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...

  21. Multi-Dimensional Knowledge Profiling with Large-Scale Literature Database and Hierarchical Retrieval

    cs.CV 2026-01 unverdicted novelty 4.0

    Large-scale profiling of recent AI literature shows growth in safety, multimodal reasoning, and agent studies alongside stabilization in neural machine translation and graph methods.

  22. PaddleOCR 3.0 Technical Report

    cs.CV 2025-07 unverdicted novelty 4.0

    PaddleOCR 3.0 releases compact open-source models for OCR, document structure parsing, and information extraction that rival billion-parameter VLMs.

  23. Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain

    cs.CV 2026-01 unverdicted novelty 3.0

    A 7B-parameter domain-specific image captioning model for ICT, trained in three stages on synthesized and annotated data, outperforms 32B-parameter general models on BLEU and expert accuracy metrics.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 22 Pith papers · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023

  3. [3]

    pix2tex - latex ocr

    Lukas Blecher. pix2tex - latex ocr. https://github.com/lukas-blecher/LaTeX-OCR

  4. [4]

    Nougat: Neural Optical Understanding for Academic Documents

    Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023

  5. [5]

    Language Models are Few-Shot Learners

    Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  6. [6]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024

  7. [7]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  8. [8]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

  9. [9]

    mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

    Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, et al. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. arXiv preprint arXiv:2403.12895, 2024

  10. [10]

    mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding

    Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl2: High-resolution compressing for ocr-free multi-page document understanding. arXiv preprint arXiv:2409.03420, 2024

  11. [11]

    Layoutlmv3: Pre-training for document ai with unified text and image masking

    Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022

  12. [12]

    Mistral 7B

    AQ Jiang, A Sablayrolles, A Mensch, C Bamford, DS Chaplot, D de las Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b (2023). arXiv preprint arXiv:2310.06825, 2023

  13. [13]

    Ocr-free document understanding transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022

  14. [14]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  15. [15]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023

  16. [16]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In European conference on computer vision, pages 280–296. Springer, 2022

  17. [17]

    Focus anywhere for fine-grained multi-page document understanding

    Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Focus anywhere for fine-grained multi-page document understanding. arXiv preprint arXiv:2405.14295, 2024

  18. [18]

    Multilingual denoising pre-training for neural machine translation

    Y Liu. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210, 2020

  19. [19]

    On the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023

  20. [20]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  21. [21]

    Kosmos-2.5: A multimodal literate model

    Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, et al. Kosmos-2.5: A multimodal literate model. arXiv preprint arXiv:2309.11419, 2023

  22. [22]

    Mathpix. Mathpix. https://mathpix.com/. Accessed: 2024-8-15

  23. [23]

    OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2023

  24. [24]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  25. [25]

    Vik Paruchuri. Texify. https://github.com/VikParuchuri/texify, 2023. Accessed: 2024-2-29

  26. [26]

    In-context retrieval-augmented language models

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023

  27. [27]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  28. [28]

    An overview of the tesseract ocr engine

    Ray Smith. An overview of the tesseract ocr engine. In Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, pages 629–633. IEEE, 2007

  29. [29]

    Internlm: A multilingual language model with progressively enhanced capabilities, 2023

    InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023

  30. [30]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  31. [31]

    Yolov10: Real-time end-to-end object detection

    Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458, 2024

  32. [32]

    UniMERNet: A universal network for real-world mathematical expression recognition

    Bin Wang, Zhuangcheng Gu, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. UniMERNet: A universal network for real-world mathematical expression recognition. arXiv preprint arXiv:2404.15254, 2024

  33. [33]

    Cdm: A reliable metric for fair and accurate formula recognition evaluation

    Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, and Conghui He. Cdm: A reliable metric for fair and accurate formula recognition evaluation. arXiv preprint arXiv:2409.03643, 2024

  34. [34]

    Vary: Scaling up the vision vocabulary for large vision-language models

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109, 2023

  35. [35]

    Small language model meets with reinforced vision vocabulary

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Small language model meets with reinforced vision vocabulary. arXiv preprint arXiv:2401.12503, 2024

  36. [36]

    General OCR theory: Towards OCR-2.0 via a unified end-to-end model

    Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024

  37. [37]

    Docgenome: An open large-scale scientific document benchmark for training and testing multi-modal large language models

    Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, et al. Docgenome: An open large-scale scientific document benchmark for training and testing multi-modal large language models. arXiv preprint arXiv:2406.11633, 2024

  38. [38]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  39. [39]

    Docxchain: A powerful open-source toolchain for document parsing and beyond

    Cong Yao. Docxchain: A powerful open-source toolchain for document parsing and beyond. arXiv preprint arXiv:2310.12430, 2023

  40. [40]

    PingAn-VCGroup's solution for ICDAR 2021 competition on scientific literature parsing task B: table recognition to HTML

    Jiaquan Ye, Xianbiao Qi, Yelin He, Yihao Chen, Dengyi Gu, Peng Gao, and Rong Xiao. PingAn-VCGroup's solution for ICDAR 2021 competition on scientific literature parsing task B: table recognition to HTML. arXiv preprint arXiv:2105.01848, 2021

  41. [41]

    InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024

  42. [42]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. In European conference on computer vision, pages 564–580. Springer, 2020