pith. machine review for the scientific record.

arxiv: 2409.18839 · v1 · submitted 2024-09-27 · 💻 cs.CV

Recognition: 2 theorem links

MinerU: An Open-Source Solution for Precise Document Content Extraction

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords document content extraction, PDF parsing, OCR, layout detection, formula recognition, open source, computer vision

The pith

MinerU combines PDF-Extract-Kit models with custom rules to deliver high-precision, open-source document content extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MinerU as an open-source system for extracting text, layout, formulas, and other content from PDFs and similar documents. It builds on existing models for OCR, layout detection, and formula recognition while adding preprocessing and postprocessing rules to handle variations in document styles. Experiments show that this combination produces more consistent and accurate results than prior open-source tools across different document types. If the approach holds, it makes reliable document parsing available without proprietary software. This matters for any workflow that turns scanned or digital documents into usable structured data.
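The pipeline shape described above, model stages wrapped by rule stages, can be sketched in a few lines. Every function and data shape below is an illustrative placeholder, not MinerU's actual API:

```python
# Hypothetical sketch of the extraction pipeline the paper describes:
# model stages (layout detection, OCR, formula recognition) wrapped by
# rule-based preprocessing and postprocessing. All names are invented.

def preprocess(page):
    """Rule stage: normalize the page before the models see it."""
    page = dict(page)
    page["text"] = page["text"].strip()
    return page

def detect_layout(page):
    """Model-stage stand-in: tag the page content with a region type."""
    return [{"type": "text", "content": page["text"]}]

def recognize(region):
    """Model-stage stand-in for OCR / formula recognition per region."""
    return region["content"]

def postprocess(blocks):
    """Rule stage: drop empty blocks, merge the rest in reading order."""
    return "\n".join(b for b in blocks if b)

def extract(pages):
    out = []
    for page in pages:
        page = preprocess(page)
        regions = detect_layout(page)
        out.append(postprocess([recognize(r) for r in regions]))
    return out

print(extract([{"text": "  Hello world  "}]))  # → ['Hello world']
```

The point of the sketch is the division of labor: the models produce raw regions, while the rules before and after them absorb the document-to-document variation the paper emphasizes.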

Core claim

MinerU leverages the PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction.

What carries the argument

PDF-Extract-Kit models combined with finely-tuned preprocessing and postprocessing rules that correct and refine raw model outputs for final content accuracy.
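One flavor of "rules that correct and refine raw model outputs" can be made concrete with an invented example: merging adjacent text blocks that a layout model split mid-sentence, as happens at column or page breaks. This is a hypothetical rule, not one taken from MinerU's code:

```python
# Illustrative postprocessing rule of the kind described above: merge a
# block into its predecessor when the predecessor ends mid-sentence
# (no terminal punctuation). Hypothetical, not MinerU's actual rule set.

TERMINALS = (".", "!", "?", ":")

def merge_split_blocks(blocks):
    merged = []
    for block in blocks:
        if merged and not merged[-1].rstrip().endswith(TERMINALS):
            # Predecessor ends mid-sentence: glue this block onto it.
            merged[-1] = merged[-1].rstrip() + " " + block.lstrip()
        else:
            merged.append(block)
    return merged

raw = ["MinerU extracts text, layout,", "and formulas.", "It is open source."]
print(merge_split_blocks(raw))
# → ['MinerU extracts text, layout, and formulas.', 'It is open source.']
```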

If this is right

  • Document analysis pipelines can obtain more consistent text and layout data without switching tools per document type.
  • Downstream tasks such as information retrieval and data mining from PDFs become more reliable.
  • The open-source release allows direct inspection and modification of the extraction pipeline.
  • Users gain a single tool that maintains performance across reports, papers, forms, and mixed-content pages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integration into larger AI document-understanding systems could reduce the need for separate post-processing stages.
  • The same rule-based refinement approach might extend to additional file formats beyond PDF.
  • Community contributions could test and improve generalization on languages or layouts absent from the original experiments.
  • Direct head-to-head comparisons with commercial extraction services on identical test sets would clarify practical trade-offs.

Load-bearing premise

The preprocessing and postprocessing rules will continue to work on documents that differ from the tested collection.

What would settle it

Measure extraction accuracy on a new, independently collected set of PDFs with varied layouts, languages, and content types; if accuracy falls substantially below the levels reported in the paper, the claim does not hold.
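The test proposed above amounts to a regression check: run the tool on an independently collected set and compare mean accuracy against the reported level. A minimal sketch, where the scores, reference level, and tolerance are all made-up placeholders rather than figures from the paper:

```python
# Sketch of the proposed test: extraction accuracy on an independently
# collected PDF set, compared against a reported reference level.
# All numbers below are hypothetical placeholders.

def holds_up(new_scores, reported_level, tolerance=0.05):
    """True if mean accuracy on the new set stays within `tolerance`
    of the reported level; otherwise the claim does not hold."""
    mean = sum(new_scores) / len(new_scores)
    return mean >= reported_level - tolerance

# Hypothetical per-document accuracies on a fresh, varied PDF set.
scores = [0.91, 0.88, 0.93, 0.86]
print(holds_up(scores, reported_level=0.92))  # → True (mean 0.895)
```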

read the original abstract

Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents MinerU, an open-source pipeline for precise document content extraction that combines PDF-Extract-Kit models with custom preprocessing and postprocessing rules. It claims that this combination delivers consistently high performance across diverse document types and improves extraction quality and consistency over existing open-source solutions. The project repository is made publicly available.

Significance. If the performance claims hold under rigorous evaluation, MinerU would provide a practical, reproducible engineering contribution to document analysis in computer vision by offering an accessible pipeline that addresses variability in document layouts and content. The open-source release enables direct community testing and extension.

major comments (1)
  1. [§4 Experiments] The section asserts that 'experimental results demonstrate that MinerU consistently achieves high performance' but supplies no quantitative metrics (e.g., precision/recall/F1 per task or per document type), no baseline comparisons, and no description of the test collection's size or composition. This information is load-bearing for the central claim of superiority and consistency.
minor comments (2)
  1. [Abstract] The abstract and introduction could explicitly list the document categories (e.g., academic papers, forms, tables) on which the pipeline was evaluated.
  2. Add a reproducibility statement specifying the exact commit hash or release version of both MinerU and PDF-Extract-Kit used for the reported results.
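The per-type breakdown the referee requests can be computed from counts of true positives, false positives, and false negatives grouped by document type. A stdlib-only sketch with invented counts, not results from the paper:

```python
# Per-document-type precision/recall/F1 of the kind the referee asks for.
# The (doc_type, tp, fp, fn) records below are invented placeholders.
from collections import defaultdict

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, guarding zero divides."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

records = [("paper", 90, 5, 10), ("paper", 80, 10, 5), ("form", 70, 20, 15)]

# Aggregate counts per document type before computing metrics.
totals = defaultdict(lambda: [0, 0, 0])
for doc_type, tp, fp, fn in records:
    totals[doc_type][0] += tp
    totals[doc_type][1] += fp
    totals[doc_type][2] += fn

for doc_type, (tp, fp, fn) in totals.items():
    p, r, f1 = prf1(tp, fp, fn)
    print(f"{doc_type}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Reporting this table per task (text, layout, formulas, tables) and per document type is exactly the shape of evidence the major comment says is missing.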

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We have addressed the concern regarding the lack of quantitative details in the Experiments section.

read point-by-point responses
  1. Referee: [§4 Experiments] The section asserts that 'experimental results demonstrate that MinerU consistently achieves high performance' but supplies no quantitative metrics (e.g., precision/recall/F1 per task or per document type), no baseline comparisons, and no description of the test collection's size or composition. This information is load-bearing for the central claim of superiority and consistency.

    Authors: We agree with the referee that the current version of §4 does not provide the requested quantitative metrics, baseline comparisons, or details on the test collection. In the revised manuscript we will expand the Experiments section to report precision, recall, and F1 scores broken down by task and document type, include comparisons against relevant open-source baselines, and describe the test collection (size, composition, and selection criteria). These additions will directly support the claims of consistent high performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an engineering pipeline (PDF-Extract-Kit models plus custom preprocessing/postprocessing rules) whose performance is asserted through experiments on document extraction tasks. No derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs are present. The work is self-contained as a release of an open-source tool, with results directly testable externally rather than forced by internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering paper. No free parameters are introduced in the abstract, no new axioms beyond standard computer-vision assumptions, and no invented physical or mathematical entities.

pith-pipeline@v0.9.0 · 5482 in / 1016 out tokens · 78327 ms · 2026-05-16T03:55:23.006745+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FollowTable: A Benchmark for Instruction-Following Table Retrieval

    cs.IR 2026-05 unverdicted novelty 8.0

    FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond ...

  2. PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

    cs.AI 2026-05 unverdicted novelty 7.0

    PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

  3. HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray

    cs.LG 2026-05 accept novelty 7.0

    HUGO-CS is a 4,383-experiment cold-spray dataset extracted from literature via a new hybrid LLM-manual framework that is 30 times larger than prior collections and released with code.

  4. REGREACT: Self-Correcting Multi-Agent Pipelines for Structured Regulatory Information Extraction

    cs.MA 2026-04 unverdicted novelty 7.0

    RegReAct deploys self-correcting multi-agent pipelines across seven stages to extract hierarchical compliance criteria from regulatory texts, outperforming single-pass GPT-4o on EU Taxonomy documents.

  5. ParseBench: A Document Parsing Benchmark for AI Agents

    cs.CV 2026-04 accept novelty 7.0

    ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.

  6. EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.

  7. MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.

  8. From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

    cs.IR 2026-03 unverdicted novelty 7.0

    Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.

  9. El Agente Quntur: A research collaborator agent for quantum chemistry

    physics.chem-ph 2026-02 unverdicted novelty 7.0

    El Agente Quntur is a new multi-agent system that uses reasoning over literature and software documentation to autonomously handle the full workflow of quantum chemistry experiments in ORCA.

  10. DocAtlas: Multilingual Document Understanding Across 80+ Languages

    cs.CL 2026-05 unverdicted novelty 6.0

    DocAtlas creates multilingual document datasets across 82 languages and shows DPO with rendered ground truth improves model accuracy by 1.7-1.9% without degrading base-language performance, unlike supervised fine-tuning.

  11. ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

    cs.CL 2026-05 unverdicted novelty 6.0

    ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

  12. FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    FT-RAG introduces a fine-grained graph-based retrieval framework for tables plus a new 9870-pair benchmark, reporting 23.5% and 59.2% gains in table- and cell-level hit rates and 62.2% higher exact-value recall over b...

  13. BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

    cs.AI 2026-04 unverdicted novelty 6.0

    BioMiner introduces a multi-modal extraction system and BioVista benchmark that achieves F1 0.32 on bioactivity triplets and demonstrates utility in scaling datasets and improving QSAR models.

  14. Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

    cs.CL 2026-04 unverdicted novelty 6.0

    Doc-V* proposes a coarse-to-fine interactive visual reasoning agent for multi-page document VQA that aggregates evidence selectively via semantic retrieval and targeted fetching, outperforming baselines by up to 47.9%...

  15. QoS-QoE Translation with Large Language Model

    cs.MM 2026-04 unverdicted novelty 6.0

    A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.

  16. Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

    cs.CV 2026-03 conditional novelty 6.0

    PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.

  17. Thinking with Drafting: Optical Decompression via Logical Reconstruction

    cs.CL 2026-02 unverdicted novelty 6.0

    Thinking with Drafting reconceptualizes visual reasoning as optical decompression by forcing models to draft mental models into executable DSL code for deterministic self-verification on the VisAlg benchmark.

  18. DeepSeek-OCR: Contexts Optical Compression

    cs.CV 2025-10 unverdicted novelty 6.0

    DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

  19. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.

  20. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...

  21. Multi-Dimensional Knowledge Profiling with Large-Scale Literature Database and Hierarchical Retrieval

    cs.CV 2026-01 unverdicted novelty 4.0

    Large-scale profiling of recent AI literature shows growth in safety, multimodal reasoning, and agent studies alongside stabilization in neural machine translation and graph methods.

  22. PaddleOCR 3.0 Technical Report

    cs.CV 2025-07 unverdicted novelty 4.0

    PaddleOCR 3.0 releases compact open-source models for OCR, document structure parsing, and information extraction that rival billion-parameter VLMs.

  23. Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain

    cs.CV 2026-01 unverdicted novelty 3.0

    A 7B-parameter domain-specific image captioning model for ICT, trained in three stages on synthesized and annotated data, outperforms 32B-parameter general models on BLEU and expert accuracy metrics.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 22 Pith papers · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023

  3. [3]

    pix2tex - latex ocr

    Lukas Blecher. pix2tex - latex ocr. https://github.com/lukas-blecher/LaTeX-OCR

  4. [4]

    Nougat: Neural Optical Understanding for Academic Documents

    Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023

  5. [5]

    Language Models are Few-Shot Learners

    Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  6. [6]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024

  7. [7]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  8. [8]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

  9. [9]

    mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

    Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, et al. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. arXiv preprint arXiv:2403.12895, 2024

  10. [10]

    mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding

    Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl2: High-resolution compressing for ocr-free multi-page document understanding. arXiv preprint arXiv:2409.03420, 2024

  11. [11]

    Layoutlmv3: Pre-training for document ai with unified text and image masking

    Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022

  12. [12]

    Mistral 7B

    AQ Jiang, A Sablayrolles, A Mensch, C Bamford, DS Chaplot, D de las Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b (2023). arXiv preprint arXiv:2310.06825, 2023

  13. [13]

    Ocr-free document understanding transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022

  14. [14]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  15. [15]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023

  16. [16]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In European conference on computer vision, pages 280–296. Springer, 2022

  17. [17]

    Focus anywhere for fine-grained multi-page document understanding

    Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Focus anywhere for fine-grained multi-page document understanding. arXiv preprint arXiv:2405.14295, 2024

  18. [18]

    Multilingual denoising pre-training for neural machine translation

    Y Liu. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210, 2020

  19. [19]

    On the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023

  20. [20]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  21. [21]

    Kosmos-2.5: A multimodal literate model

    Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, et al. Kosmos-2.5: A multimodal literate model. arXiv preprint arXiv:2309.11419, 2023

  22. [22]

    Mathpix. Mathpix. https://mathpix.com/. Accessed: 2024-8-15

  23. [23]

    OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2023

  24. [24]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  25. [25]

    Vik Paruchuri. Texify. https://github.com/VikParuchuri/texify, 2023. Accessed: 2024-2-29

  26. [26]

    In-context retrieval-augmented language models

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023

  27. [27]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  28. [28]

    An overview of the tesseract ocr engine

    Ray Smith. An overview of the tesseract ocr engine. In Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, pages 629–633. IEEE, 2007

  29. [29]

    Internlm: A multilingual language model with progressively enhanced capabilities, 2023

    InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023

  30. [30]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  31. [31]

    Yolov10: Real-time end-to-end object detection

    Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458, 2024

  32. [32]

    UniMERNet: A universal network for real-world mathematical expression recognition

    Bin Wang, Zhuangcheng Gu, Chao Xu, Bo Zhang, Botian Shi, and Conghui He. UniMERNet: A universal network for real-world mathematical expression recognition. arXiv preprint arXiv:2404.15254, 2024

  33. [33]

    Cdm: A reliable metric for fair and accurate formula recognition evaluation

    Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, and Conghui He. Cdm: A reliable metric for fair and accurate formula recognition evaluation. arXiv preprint arXiv:2409.03643, 2024

  34. [34]

    Vary: Scaling up the vision vocabulary for large vision-language models

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109, 2023

  35. [35]

    Small language model meets with reinforced vision vocabulary

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Small language model meets with reinforced vision vocabulary. arXiv preprint arXiv:2401.12503, 2024

  36. [36]

    General OCR theory: Towards OCR-2.0 via a unified end-to-end model

    Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024

  37. [37]

    Docgenome: An open large-scale scientific document benchmark for training and testing multi-modal large language models

    Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, et al. Docgenome: An open large-scale scientific document benchmark for training and testing multi-modal large language models. arXiv preprint arXiv:2406.11633, 2024

  38. [38]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  39. [39]

    Docxchain: A powerful open-source toolchain for document parsing and beyond

    Cong Yao. Docxchain: A powerful open-source toolchain for document parsing and beyond. arXiv preprint arXiv:2310.12430, 2023

  40. [40]

    PingAn-VCGroup's solution for ICDAR 2021 competition on scientific literature parsing task B: table recognition to HTML

    Jiaquan Ye, Xianbiao Qi, Yelin He, Yihao Chen, Dengyi Gu, Peng Gao, and Rong Xiao. PingAn-VCGroup's solution for ICDAR 2021 competition on scientific literature parsing task B: table recognition to HTML. arXiv preprint arXiv:2105.01848, 2021

  41. [41]

    InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, et al. Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024

  42. [42]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. In European conference on computer vision, pages 564–580. Springer, 2020