Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

Janis Keuper; Pius Horn

arXiv: 2512.09874 · v2 · submitted 2025-12-10 · cs.CV · cs.AI · cs.IR

Pith reviewed 2026-05-16 23:05 UTC · model grok-4.3

keywords: PDF parsing · mathematical formula extraction · LLM evaluation · document benchmarking · semantic equivalence · synthetic data · scientific literature processing

The pith

A benchmarking framework with synthetic PDFs and LLM semantic judgment reveals large performance gaps among PDF parsers on mathematical formula extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a controlled test for how accurately document parsers extract mathematical formulas from PDFs. It generates synthetic PDFs from LaTeX sources that supply exact ground truth and then applies an LLM to decide whether each parsed formula conveys the same mathematical meaning as the original. A study with human raters confirms that this LLM judge tracks human semantic assessments at r=0.78, far above the r=0.34 obtained by simple character matching. When the method is run on twenty-plus parsers across one hundred documents containing more than two thousand formulas, clear differences in accuracy appear. The results give concrete guidance for choosing parsers when building scientific databases or training models on research papers.

Core claim

We introduce a benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. For evaluation, we apply LLM-as-a-judge to assess semantic equivalence of parsed formulas, capturing mathematical meaning beyond surface-level notation differences. We validate this approach through a human study (250 formula pairs, 750 ratings from 30 evaluators), showing a Pearson correlation of r=0.78 with human judgment, compared to r=0.34 for character-level matching. Evaluating 20+ contemporary PDF parsers across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities…
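The validation step in the claim above boils down to correlating an automated score with averaged human ratings. With 750 ratings over 250 pairs, each pair receives three ratings on average. The sketch below shows that computation on toy numbers; the rating scale, the score scale, and the names `pearson_r` and `mean_human_rating` are illustrative assumptions, not taken from the paper's code.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def mean_human_rating(ratings_per_pair):
    """Collapse the ratings collected for each formula pair into one mean."""
    return [mean(ratings) for ratings in ratings_per_pair]


# Toy data (hypothetical): three human ratings per formula pair and one
# automated metric score per pair.
human_ratings = [[5, 4, 5], [1, 2, 1], [3, 3, 4], [2, 1, 2]]
metric_scores = [9.0, 2.0, 6.0, 3.0]

r = pearson_r(metric_scores, mean_human_rating(human_ratings))
print(f"Pearson r = {r:.3f}")
```

Running the same function once per metric (LLM judge, character-level matching, text similarity) against the same human means would give directly comparable coefficients.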

What carries the argument

LLM-as-a-judge semantic equivalence scoring paired with a two-stage fuzzy matching pipeline that aligns parser outputs to LaTeX ground truth despite notation and format differences.
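One plausible reading of the second, fuzzy stage is a guard that checks whether a formula string proposed in stage one really occurs, at least approximately, in the parser's output. The sketch below follows that reading using only the standard library. The sliding-window strategy, the 0.85 threshold, and the function names are assumptions for illustration; the paper's actual pipeline may work differently.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Drop all whitespace so spacing differences do not affect matching."""
    return "".join(text.split())


def best_window_ratio(candidate, parsed_text):
    """Best similarity between the candidate and any same-length window of the text."""
    cand, body = normalize(candidate), normalize(parsed_text)
    if not cand or not body:
        return 0.0
    if cand in body:
        return 1.0
    n = len(cand)
    best = 0.0
    for start in range(max(1, len(body) - n + 1)):
        window = body[start:start + n]
        ratio = SequenceMatcher(None, cand, window, autojunk=False).ratio()
        best = max(best, ratio)
    return best


def validate_matches(candidates, parsed_text, threshold=0.85):
    """Keep stage-one candidates that are (approximately) present in the parser output.

    `candidates` maps a ground-truth formula id to the string proposed for it.
    Rejected candidates are mapped to None.
    """
    return {
        gt_id: cand if best_window_ratio(cand, parsed_text) >= threshold else None
        for gt_id, cand in candidates.items()
    }


parsed_text = r"Energy is $E = mc^2$ and the area is $A=\pi r^2$."
candidates = {"f1": "E=mc^2", "f2": r"\int_0^1 x\,dx"}  # f2 is not in the text
print(validate_matches(candidates, parsed_text))
```

A check of this kind would reject strings that an extraction step invented rather than copied, which matters because only text the parser actually produced should be scored.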

If this is right

Practitioners can use the reported rankings to pick parsers that preserve mathematical content more reliably for downstream scientific applications.
Semantic evaluation allows fair comparison even when parsers emit formulas in different notations or with minor rendering variations.
Controlled synthetic documents make it possible to isolate how layout complexity or formula density affects extraction accuracy.
Higher-quality formula extraction improves the training data available for large language models that process scientific literature.
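The point about differing notations can be made concrete with a small experiment. The sketch below uses a plain normalized edit-distance similarity as a stand-in for surface-level comparison (it is not the CDM metric the paper evaluates) and shows that it can rank a mathematically different pair above a mathematically equivalent one. The example formulas are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def char_similarity(a, b):
    """Similarity in [0, 1]: one minus the length-normalized edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Same meaning, different notation.
equivalent_pair = (r"\frac{a}{b}", "a/b")
# Nearly identical strings, different meaning (exponent 2 vs. 3).
different_pair = (r"x^{2}+y", r"x^{3}+y")

print("equivalent:", round(char_similarity(*equivalent_pair), 3))
print("different: ", round(char_similarity(*different_pair), 3))
```

A surface metric scores the second pair as the closer match, which is the failure mode a semantic judge is meant to avoid.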

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthetic-generation and LLM-judge approach could be extended to benchmark extraction of tables, figures, or citations under controlled conditions.
Running the benchmark on real scanned papers would reveal whether the observed parser gaps persist outside the synthetic setting.
Because the LLM judge correlates well with humans, it could scale evaluation for other document-parsing tasks where human labeling is expensive.

Load-bearing premise

Synthetically generated PDFs with controlled layouts and formulas adequately represent the parsing difficulties present in real-world academic PDFs.
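To see why synthetic generation yields exact ground truth, consider a minimal sketch that assembles a LaTeX source from sampled formulas and records each formula as it is inserted. The formula pool, filler sentences, document template, and the function name `build_document` are all hypothetical; the paper's generator and its layout randomization are not reproduced here.

```python
import random

# Hypothetical inputs: a tiny formula pool and filler sentences.
FORMULAS = [r"E = mc^2", r"a^2 + b^2 = c^2", r"\int_0^1 x\,dx = \tfrac{1}{2}"]
SENTENCES = ["Consider the following relation.", "We then obtain the identity below."]


def build_document(formulas, sentences, n_formulas, seed):
    """Return (latex_source, ground_truth) for one synthetic document.

    The ground truth is recorded while the document is assembled, so it is
    exact by construction. A fixed seed makes the document reproducible.
    """
    rng = random.Random(seed)
    chosen = rng.sample(formulas, n_formulas)
    body, ground_truth = [], []
    for index, formula in enumerate(chosen):
        body.append(rng.choice(sentences))
        body.append("\\begin{equation}\n" + formula + "\n\\end{equation}")
        ground_truth.append({"id": index, "latex": formula})
    source = (
        "\\documentclass{article}\n"
        "\\usepackage{amsmath}\n"
        "\\begin{document}\n"
        + "\n\n".join(body)
        + "\n\\end{document}\n"
    )
    return source, ground_truth


tex, truth = build_document(FORMULAS, SENTENCES, n_formulas=2, seed=0)
print(tex)
print(truth)
```

Compiling `tex` with a LaTeX engine would produce the PDF handed to the parsers, while `truth` stays aside as the reference for scoring.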

What would settle it

Applying the same 20+ parsers and LLM judge to a set of real multi-column academic PDFs and checking whether the performance ordering and human correlation remain the same.
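Whether the performance ordering "remains the same" can be quantified with a rank correlation between the two settings. The sketch below computes Spearman's rho with the classic no-ties formula; the parser names and scores are invented placeholders, not results from the paper.

```python
def ranks(scores):
    """Map each name to its rank (1 = best), assuming no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation over the parsers present in both settings.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is exact only
    when there are no ties.
    """
    common = sorted(set(scores_a) & set(scores_b))
    n = len(common)
    if n < 2:
        raise ValueError("need at least two parsers in common")
    rank_a = ranks({name: scores_a[name] for name in common})
    rank_b = ranks({name: scores_b[name] for name in common})
    d_squared = sum((rank_a[name] - rank_b[name]) ** 2 for name in common)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Placeholder scores (hypothetical parsers, hypothetical numbers).
synthetic = {"parser_a": 0.9, "parser_b": 0.7, "parser_c": 0.4, "parser_d": 0.2}
real_world = {"parser_a": 0.8, "parser_b": 0.5, "parser_c": 0.6, "parser_d": 0.1}

print(spearman_rho(synthetic, real_world))
```

A rho near 1 would indicate that the synthetic ranking carries over to real documents; a low or negative value would undercut the load-bearing premise.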

Figures

Figures reproduced from arXiv: 2512.09874 by Janis Keuper, Pius Horn.

**Figure 1.** Overview of the three main components of the benchmarking framework. The formula dataset component extracts and processes mathematical formulas from Wikipedia to create the wikipedia-latex-formulas-319k collection. The benchmark dataset component generates synthetic PDFs with precise ground truth by randomly combining sampled formulas from this dataset with text segments and inline formulas using randomly…

**Figure 2.** Correlation of Automated Metrics with Human Evaluations

Original abstract

Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. For evaluation, we apply LLM-as-a-judge to assess semantic equivalence of parsed formulas, capturing mathematical meaning beyond surface-level notation differences. We validate this approach through a human study (250 formula pairs, 750 ratings from 30 evaluators), showing a Pearson correlation of r=0.78 with human judgment, compared to r=0.34 for character-level matching (CDM) and r~0 for text similarity. Our robust two-stage matching pipeline combining LLM-based extraction with fuzzy validation reliably aligns parsed formulas with ground truth despite format inconsistencies across parsers. Evaluating 20+ contemporary PDF parsers across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities, providing actionable guidance for practitioners selecting parsers for downstream applications. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench and https://github.com/phorn1/formula-metric-study
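As a rough illustration of what an LLM-as-a-judge call for semantic equivalence might look like, the sketch below builds a prompt, delegates to a caller-supplied model function, and parses an integer score from the reply. The prompt wording, the 0 to 10 scale, and every function name are assumptions made for this example. No real model API is called; `stub_llm` stands in for one.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt; the paper's actual judge prompt is not shown here.
PROMPT_TEMPLATE = (
    "You are comparing two mathematical formulas written in LaTeX.\n"
    "Ground truth: {ground_truth}\n"
    "Parsed output: {parsed}\n"
    "Rate from 0 to 10 how well the parsed output preserves the mathematical "
    "meaning of the ground truth. Reply with the number only."
)


def build_prompt(ground_truth: str, parsed: str) -> str:
    """Fill the template with one ground-truth/parsed formula pair."""
    return PROMPT_TEMPLATE.format(ground_truth=ground_truth, parsed=parsed)


def parse_score(reply: str, lowest: int = 0, highest: int = 10) -> Optional[int]:
    """Extract the first integer from the reply; None if missing or out of range."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None
    value = int(match.group())
    return value if lowest <= value <= highest else None


def judge_formula(ground_truth: str, parsed: str,
                  llm: Callable[[str], str]) -> Optional[int]:
    """Score one formula pair using any callable that maps a prompt to a reply."""
    return parse_score(llm(build_prompt(ground_truth, parsed)))


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "Score: 8"


print(judge_formula(r"x^2", r"x^{2}", stub_llm))
```

Passing the model as a callable keeps the scoring logic testable offline and makes it easy to swap judges when checking how sensitive the results are to the choice of model.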

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Synthetic PDFs plus an LLM judge give a practical benchmark for formula extraction, but the real-world transfer question is still open.

The main takeaway is that the paper supplies a controlled benchmark for pulling mathematical formulas out of PDFs. They generate 100 synthetic documents from LaTeX with exact ground truth on more than 2000 formulas, then evaluate over 20 parsers using an LLM-as-a-judge for semantic equivalence rather than string matching. A human study on 250 pairs with 750 ratings shows the LLM metric reaches r=0.78 correlation with people, well above character-level matching at r=0.34. They also release code, data, and a two-stage matching pipeline that handles format differences across parsers. That combination of precise ground truth and validated semantic scoring is the concrete advance here, and the public resources make it straightforward to check or extend.

Referee Report

2 major / 2 minor

Summary. The paper introduces a benchmarking framework for PDF parsers focused on mathematical formula extraction. It generates 100 synthetic PDFs with controlled layouts and over 2,000 formulas using precise LaTeX ground truth, proposes an LLM-as-a-judge metric for semantic equivalence, validates this metric via a human study (250 formula pairs, 750 ratings yielding Pearson r=0.78 vs. r=0.34 for character-level matching), and evaluates 20+ parsers to reveal performance disparities. Code and data are released publicly.

Significance. If the synthetic benchmark generalizes, the work supplies practitioners with concrete guidance on parser selection for formula extraction tasks and introduces a semantically-aware evaluation method superior to string matching. Key strengths include the direct human validation study supporting the LLM judge and the public release of code, data, and the two-stage matching pipeline, which supports reproducibility.

major comments (2)

[§3 (Synthetic PDF Generation)] The synthetic PDF generation process (described at high level in the abstract and §3) provides no concrete details on incorporation of multi-column layouts, font variations, or rendering artifacts typical of real academic PDFs. This assumption is load-bearing for the central claim of actionable performance disparities in the evaluation of 20+ parsers, as the reported gaps may not transfer if these complexities are underrepresented.
[Evaluation section] Parser selection criteria are not specified (evaluation section), leaving unclear whether the 20+ tools form a representative sample or are biased toward particular architectures; this affects the reliability of the disparity findings.

minor comments (2)

Figure captions for parser output examples could include explicit annotations of matched vs. mismatched formulas to improve readability.
[Methods] The two-stage matching pipeline is described clearly but would benefit from a pseudocode listing or explicit parameter values for the fuzzy validation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

Referee: [§3 (Synthetic PDF Generation)] The synthetic PDF generation process (described at high level in the abstract and §3) provides no concrete details on incorporation of multi-column layouts, font variations, or rendering artifacts typical of real academic PDFs. This assumption is load-bearing for the central claim of actionable performance disparities in the evaluation of 20+ parsers, as the reported gaps may not transfer if these complexities are underrepresented.

Authors: We agree that the current description of the synthetic PDF generation in §3 is high-level and would benefit from additional concrete implementation details to better justify transferability. In the revised manuscript, we will expand §3 to specify the exact mechanisms used: multi-column layouts generated via the LaTeX multicol package with controlled column counts and widths; font variations implemented through selection of standard academic typefaces (e.g., Computer Modern, Times, and sans-serif variants) with randomized sizes and styles; and rendering artifacts simulated by applying controlled PDF compression, noise injection, and anti-aliasing effects during compilation. These additions will directly address the concern and support the reliability of the observed performance disparities.

revision: yes

Referee: [Evaluation section] Parser selection criteria are not specified (evaluation section), leaving unclear whether the 20+ tools form a representative sample or are biased toward particular architectures; this affects the reliability of the disparity findings.

Authors: We acknowledge that the evaluation section does not explicitly state the parser selection criteria, which is necessary to evaluate potential bias. The 20+ parsers were selected to represent a broad cross-section of contemporary tools, prioritizing those with high community adoption (measured by GitHub stars and citations), support for mathematical content extraction, and architectural diversity (including rule-based, OCR-dependent, and neural network-based parsers). In the revised manuscript, we will add a new subsection in the evaluation section that explicitly lists the selection criteria, the full list of evaluated parsers with their categories, and a brief rationale for inclusion to demonstrate representativeness.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark grounded in independent human validation and public releases

The paper's central results derive from direct evaluation of 20+ parsers on 100 synthetic documents containing 2000+ formulas, using an LLM-as-a-judge metric that is separately validated against 750 independent human ratings (Pearson r=0.78 vs. r=0.34 for character matching). No equations, parameter fits, or derivations are present that reduce by construction to the paper's own inputs. The synthetic generation process and two-stage matching pipeline are described as controllable and robust but are not claimed to be derived from the evaluation outcomes themselves. Public code and data releases further allow external reproduction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing for the performance disparity claims. This is a standard empirical benchmarking study whose claims rest on observable outputs rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the assumption that synthetic PDFs capture representative parsing challenges and that LLM judgments align with human semantic understanding of formulas.

axioms (1)

domain assumption: Synthetically generated PDFs with controlled layouts and formulas adequately represent the parsing difficulties present in real-world academic PDFs.
Invoked to justify generalization of benchmark results beyond the synthetic test set.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Evaluating 20+ contemporary PDF parsers across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

arXiv preprint (2025)

Adhikari, N.S., Agarwal, S.: A comparative study of pdf pa rsing tools across diverse document categories. arXiv preprint (2025)

work page 2025

[2] [2]

In: Proceedi ngs of the 10th IAPR International Workshop on Document Analysis Systems (DAS)

Aguilar, F.D., Hirata, N.S.: ExpressMatch: A system for c reating ground-truthed datasets of online mathematical expressions. In: Proceedi ngs of the 10th IAPR International Workshop on Document Analysis Systems (DAS) . pp. 155–159 (2012)

work page 2012

[3] [3]

In: Proceedings of the In ternational Conference on Frontiers in Handwriting Recognition (ICFHR)

Alvaro, F., Sánchez, J.A., Benedi, J.M.: Unbiased evalua tion of handwritten math- ematical expression recognition. In: Proceedings of the In ternational Conference on Frontiers in Handwriting Recognition (ICFHR). pp. 181–186 (2012)

work page 2012

[4] [4]

In: Proceedings of the 10 th International Confer- ence on Document Analysis and Recognition (ICDAR)

Awal, A.M.A.M., Mouchère, H., Viard-Gaudin, C.: Towards handwritten mathe- matical expressions recognition. In: Proceedings of the 10 th International Confer- ence on Document Analysis and Recognition (ICDAR). pp. 1046 –1050 (2009)

work page 2009

[5] [5]

In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL)

Bast, H., Korzen, C.: A benchmark and evaluation for text e xtraction from pdf. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 1–10 (2017)

work page 2017

[6] [6]

In: Proceed- ings of the 6th International Workshop on Document Analysis Systems (DAS)

Chao, H., Fan, J.: Layout and content extraction for PDF do cuments. In: Proceed- ings of the 6th International Workshop on Document Analysis Systems (DAS). pp. 213–224 (2004)

work page 2004

[7] [7]

, Liu, Y., Yu, D., Ma, Y.: Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model

Cui, C., Sun, T., Liang, S., Gao, T., Zhang, Z., Liu, J., Wan g, X., Zhou, C., Liu, H., Lin, M., Zhang, Y., Zhang, Y., Zheng, H., Zhang, J., Zhang, J. , Liu, Y., Yu, D., Ma, Y.: Paddleocr-vl: Boosting multilingual document parsing via a 0.9b ultra-compact vision-language model. arXiv preprint (2025)

work page 2025

[8] [8]

, Zhang, J., Liu, Y., Yu, D., Ma, Y.: Paddleocr 3.0 technical report

Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., Zhang, Y., Lv, W., Huang, K., Zhang, Y., Zhang, J. , Zhang, J., Liu, Y., Yu, D., Ma, Y.: Paddleocr 3.0 technical report. arXiv pre print (2025)

work page 2025

[9] [9]

In: Proceedings of the 34th Inter national Conference on Machine Learning (ICML)

Deng, Y., Kanervisto, A., Ling, J., Rush, A.M.: Image-to- markup generation with coarse-to-ﬁne attention. In: Proceedings of the 34th Inter national Conference on Machine Learning (ICML). vol. 70, pp. 980–989 (2017)

work page 2017

[10] Gemini Team, Google DeepMind: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint (2025)

[11] Google DeepMind: Gemini 3: Introducing the latest gemini ai model from google. https://blog.google/products/gemini/gemini-3/ (2025), accessed: 2025-12-01

[12] Kumar, A., Wang, L.L.: Uncovering the new accessibility crisis in scholarly PDFs: Publishing model and platform changes contribute to declining scholarly document accessibility in the last decade. In: Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS) (2024)

[13] Labahn, G., Lank, E., MacLean, S., Marzouk, M., Tausky, D.: MathBrush: A system for doing math on pen-based devices. In: Proceedings of the Eighth IAPR International Workshop on Document Analysis Systems (DAS). pp. 599–606 (2008)

[14] Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)

[15] Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., Liu, Y.: Llms-as-judges: A comprehensive survey on llm-based evaluation methods. arXiv preprint (2024)

[16] Li, S., Huang, J., Zhuang, J., Shi, Y., Cai, X., Xu, M., Wang, X., Zhang, L., Ke, G., Cai, H.: Scilitllm: How to adapt llms for scientific literature understanding. In: Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) (2025)

[17] Liu, Y., Yuan, X., Zhang, H., Gao, Z., Zhu, B., Peng, X., Lin, Z., Liu, Q., Jin, L., Bai, X.: Monkeyocr: Document parsing with a structure-recognition-relation triplet paradigm. arXiv preprint (2025)

[18] LlamaIndex: LlamaParse: Genai-native document parsing platform. https://www.llamaindex.ai/llamaparse (2024), accessed: 2025-12-01

[19] Lo, K., Wang, L.L., Neumann, M., Kinney, R., Weld, D.: S2ORC: The semantic scholar open research corpus. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 4969–4983 (2020)

[20] Lopez, P.: Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In: Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL). pp. 473–474 (2009)

[21] Mathpix, Inc.: Mathpix: Document conversion for stem. https://mathpix.com (2025), accessed: 2025-11-28

[22] Mistral AI: Mistral OCR 25.05: Next-generation document understanding model. https://mistral.ai/news/mistral-ocr (2025), accessed: 2025-11-28

[23] Nano Net Technologies Inc.: Nanonets-OCR-s: Image-to-markdown ocr model. Hugging Face Model, https://huggingface.co/nanonets/Nanonets-OCR-s (2025), accessed: 2025-11-28

[24] OmniAI Technology, Inc.: Omni OCR Benchmark. Hugging Face Dataset, https://huggingface.co/datasets/getomni-ai/ocr-benchmark (2025), accessed: 2025-11-17

[25] OpenAI: GPT-5: Openai’s next generation language model. https://openai.com/index/introducing-gpt-5/ (2025), accessed: 2025-12-01

[26] Ouyang, L., Qu, Y., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., Shi, J., Wu, F., Chu, P., Liu, M., Li, Z., Xu, C., Zhang, B., Shi, B., Tu, Z., He, C.: Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

[27] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 311–318 (2002)

[28] Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A.S., Staar, P.: Doclaynet: A large human-annotated dataset for document-layout analysis. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). pp. 3743–3751 (2022)

[29] Poznanski, J., Soldaini, L., Lo, K.: olmocr 2: Unit test rewards for document ocr. arXiv preprint (2025)

[30] PyMuPDF Contributors: PyMuPDF4LLM: Pdf extraction for large language models. GitHub repository, https://github.com/pymupdf/PyMuPDF4LLM (2025), accessed: 2025-12-01

[31] pypdf Contributors: pypdf: A pure-python pdf library. GitHub repository, https://github.com/py-pdf/pypdf (2025), accessed: 2025-12-01

[32] Qwen Team: Qwen3-vl technical report. arXiv preprint (2025)

[33] RedNote HiLab: dots.ocr: Multilingual document layout parsing. GitHub repository, https://github.com/rednote-hilab/dots.ocr (2025), accessed: 2025-11-28

[34] Sain, K., Dasgupta, A., Garain, U.: EMERS: A tree matching-based performance evaluation of mathematical expression recognition systems. International Journal on Document Analysis and Recognition (IJDAR) 14(1), 75–85 (2011)

[35] Siebenschuh, C., Hippe, K., Gokdemir, O., Brace, A., Khan, A.M., Hossain, K., Babuji, Y., Chia, N., Vishwanath, V., Ramanathan, A., et al.: Adaparse: An adaptive parallel pdf parsing and resource scaling engine. In: Proceedings of the 8th Annual Conference on Machine Learning and Systems (MLSys) (2025)

[36] Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., Bogin, B., Chandu, K., Dumas, J., Elazar, Y., Hofmann, V., Jha, A., Kumar, S., Lucy, L., Lyu, X., Lambert, N., Magnusson, I., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Ravichander, A., Richardson, K., Shen, Z., Strubell, E., Subramani, N., Tafjord, O., Wals... In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (2024)

[37] Wang, B., Wu, F., Ouyang, L., Gu, Z., Zhang, R., Xia, R., Shi, B., Zhang, B., He, C.: Image over text: Transforming formula recognition evaluation with character detection matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19681–19690 (2025)

[38] Wang, B., Xu, C., Zhao, X., Ouyang, L., Wu, F., Zhao, Z., Xu, R., Liu, K., Qu, Y., Shang, F., Zhang, B., Wei, L., Sui, Z., Li, W., Shi, B., Qiao, Y., Lin, D., He, C.: Mineru: An open-source solution for precise document content extraction. arXiv preprint (2024)

[39] Wang, Z., Liu, J.C.: Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. International Journal on Document Analysis and Recognition (IJDAR) 24(1), 63–75 (2021)

[40] Wei, H., Kong, L., Chen, J., Zhao, L., Sun, Z., Zhang, J., Peng, C., Shen, Y., Mao, X., Xu, Z., et al.: General ocr theory: Towards ocr-2.0 via a unified end-to-end model. arXiv preprint (2024)

[41] Wei, H., Sun, Y., Li, Y.: Deepseek-ocr: Contexts optical compression. arXiv preprint (2025)

[42] Xie, Y., Mouchère, H., Simistira Liwicki, F., Rakesh, S., Saini, R., Nakagawa, M., Nguyen, C.T., Truong, T.N.: ICDAR 2023 CROHME: Competition on recognition of handwritten mathematical expressions. In: Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR). pp. 553–565 (2023)

[43] Yuan, Y., Liu, X., Dikubab, W., Liu, H., Ji, Z., Wu, Z., Bai, X.: Syntax-aware network for handwritten mathematical expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4543–4552 (2022)

[44] Zhang, Q., Wang, B., Huang, V.S.J., Zhang, J., Wang, Z., Liang, H., He, C., Zhang, W.: Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction. arXiv preprint (2025)

[45] Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: Data, model, and evaluation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 564–580 (2020)

[46] Zhong, X., Tang, J., Yepes, A.J.: Publaynet: Largest dataset ever for document layout analysis. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR). pp. 1015–1022 (2019)