Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs
Pith reviewed 2026-05-16 23:05 UTC · model grok-4.3
The pith
A benchmarking framework with synthetic PDFs and LLM semantic judgment reveals large performance gaps among PDF parsers on mathematical formula extraction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. For evaluation, we apply LLM-as-a-judge to assess semantic equivalence of parsed formulas, capturing mathematical meaning beyond surface-level notation differences. We validate this approach through a human study (250 formula pairs, 750 ratings from 30 evaluators), showing a Pearson correlation of r=0.78 with human judgment, compared to r=0.34 for character-level matching. Evaluating 20+ contemporary PDF parsers across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities.
What carries the argument
LLM-as-a-judge semantic equivalence scoring paired with a two-stage fuzzy matching pipeline that aligns parser outputs to LaTeX ground truth despite notation and format differences.
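A minimal sketch of how a two-stage alignment of this kind might look. The paper's first stage uses an LLM to extract formulas; the regex below is only a stand-in for that step, and the function names, the whitespace normalization, and the 0.6 threshold are illustrative assumptions rather than the authors' implementation.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional


def extract_candidates(parsed_text: str) -> List[str]:
    """Stage 1 stand-in: collect $...$ and $$...$$ spans with a regex.

    The paper performs this step with an LLM; the regex is a placeholder.
    """
    pattern = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)
    return [(m.group(1) or m.group(2)).strip() for m in pattern.finditer(parsed_text)]


def normalize(formula: str) -> str:
    """Drop whitespace so spacing differences do not affect the fuzzy score."""
    return re.sub(r"\s+", "", formula)


def align(ground_truth: List[str], candidates: List[str],
          threshold: float = 0.6) -> List[Optional[str]]:
    """Stage 2: greedy one-to-one fuzzy alignment of candidates to ground truth.

    The threshold is a hypothetical value chosen for illustration.
    """
    scored = []
    for gi, gt in enumerate(ground_truth):
        for ci, cand in enumerate(candidates):
            score = SequenceMatcher(None, normalize(gt), normalize(cand)).ratio()
            if score >= threshold:
                scored.append((score, gi, ci))
    scored.sort(reverse=True)  # best-scoring pairs are assigned first

    matched = {}
    used_candidates = set()
    for score, gi, ci in scored:
        if gi not in matched and ci not in used_candidates:
            matched[gi] = candidates[ci]
            used_candidates.add(ci)
    # None marks a ground-truth formula the parser output did not contain.
    return [matched.get(gi) for gi in range(len(ground_truth))]


ground_truth = ["E=mc^2", "a^2 + b^2 = c^2", r"\int_0^1 x\,dx"]
parsed = "Energy $$E = mc^2$$ and $a^2+b^2=c^2$ end."
aligned = align(ground_truth, extract_candidates(parsed))
```

Each aligned pair would then go to the semantic judge, while unmatched ground-truth formulas can be counted as missed extractions.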
If this is right
- Practitioners can use the reported rankings to pick parsers that preserve mathematical content more reliably for downstream scientific applications.
- Semantic evaluation allows fair comparison even when parsers emit formulas in different notations or with minor rendering variations.
- Controlled synthetic documents make it possible to isolate how layout complexity or formula density affects extraction accuracy.
- Higher-quality formula extraction improves the training data available for large language models that process scientific literature.
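To illustrate why surface-level comparison can mislead when parsers emit different notations, the sketch below computes a normalized edit-distance similarity. This is a generic Levenshtein-based score written for illustration, not the CDM metric the paper compares against, and the example formulas are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Same mathematical meaning, very different strings.
equivalent = char_similarity(r"\frac{a}{b}", "a/b")
# Different meaning, nearly identical strings.
different = char_similarity("x^2", "x^3")
```

Here the semantically different pair scores higher than the equivalent pair, which is the kind of failure a semantic judge is meant to avoid.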
Where Pith is reading between the lines
- The same synthetic-generation and LLM-judge approach could be extended to benchmark extraction of tables, figures, or citations under controlled conditions.
- Running the benchmark on real scanned papers would reveal whether the observed parser gaps persist outside the synthetic setting.
- Because the LLM judge correlates well with humans, it could scale evaluation for other document-parsing tasks where human labeling is expensive.
Load-bearing premise
Synthetically generated PDFs with controlled layouts and formulas adequately represent the parsing difficulties present in real-world academic PDFs.
What would settle it
Applying the same 20+ parsers and LLM judge to a set of real multi-column academic PDFs and checking whether the performance ordering and human correlation remain the same.
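One way to quantify whether the performance ordering stays the same is a rank correlation between the synthetic and real-document rankings. The sketch below uses Spearman's rho for rankings without ties; the choice of statistic and the parser names are assumptions for illustration, not part of the paper.

```python
from typing import Sequence


def spearman_rho(order_a: Sequence[str], order_b: Sequence[str]) -> float:
    """Spearman rank correlation between two orderings of the same items.

    Each sequence lists items from best to worst, with no ties.
    """
    if len(set(order_a)) != len(order_a) or set(order_a) != set(order_b):
        raise ValueError("both orderings must contain the same unique items")
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    rank_b = {name: i for i, name in enumerate(order_b)}
    d_squared = sum((i - rank_b[name]) ** 2 for i, name in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings, best parser first.
synthetic_ranking = ["parser_a", "parser_b", "parser_c", "parser_d"]
real_ranking = ["parser_a", "parser_c", "parser_b", "parser_d"]
rho = spearman_rho(synthetic_ranking, real_ranking)
```

A value near 1 would suggest the synthetic ordering transfers to real PDFs; a value near 0 or below would suggest it does not.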
Original abstract
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. For evaluation, we apply LLM-as-a-judge to assess semantic equivalence of parsed formulas, capturing mathematical meaning beyond surface-level notation differences. We validate this approach through a human study (250 formula pairs, 750 ratings from 30 evaluators), showing a Pearson correlation of r=0.78 with human judgment, compared to r=0.34 for character-level matching (CDM) and r~0 for text similarity. Our robust two-stage matching pipeline combining LLM-based extraction with fuzzy validation reliably aligns parsed formulas with ground truth despite format inconsistencies across parsers. Evaluating 20+ contemporary PDF parsers across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities, providing actionable guidance for practitioners selecting parsers for downstream applications. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench and https://github.com/phorn1/formula-metric-study
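The validation described in the abstract comes down to correlating a metric's scores with human ratings over the same formula pairs. With 750 ratings over 250 pairs, each pair has three ratings on average, so a natural first step is to average them per pair. The sketch below assumes that setup; the data are toy values, not the study's.

```python
from math import sqrt
from typing import Dict, List, Sequence


def mean_rating_per_pair(ratings: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the human ratings collected for each formula pair."""
    return {pair_id: sum(vals) / len(vals) for pair_id, vals in ratings.items()}


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy data: three human ratings per pair and one metric score per pair.
human = {"pair1": [5, 4, 5], "pair2": [2, 1, 2], "pair3": [3, 4, 3]}
metric = {"pair1": 0.9, "pair2": 0.2, "pair3": 0.6}

means = mean_rating_per_pair(human)
pair_ids = sorted(means)
r = pearson_r([means[p] for p in pair_ids], [metric[p] for p in pair_ids])
```

Running the same computation once per metric (LLM judge, character-level matching, text similarity) would give the side-by-side r values of the kind the abstract reports.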
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a benchmarking framework for PDF parsers focused on mathematical formula extraction. It generates 100 synthetic PDFs with controlled layouts and over 2,000 formulas using precise LaTeX ground truth, proposes an LLM-as-a-judge metric for semantic equivalence, validates this metric via a human study (250 formula pairs, 750 ratings yielding Pearson r=0.78 vs. r=0.34 for character-level matching), and evaluates 20+ parsers to reveal performance disparities. Code and data are released publicly.
Significance. If the synthetic benchmark generalizes, the work supplies practitioners with concrete guidance on parser selection for formula extraction tasks and introduces a semantically-aware evaluation method superior to string matching. Key strengths include the direct human validation study supporting the LLM judge and the public release of code, data, and the two-stage matching pipeline, which supports reproducibility.
major comments (2)
- [§3 (Synthetic PDF Generation)] The synthetic PDF generation process (described at high level in the abstract and §3) provides no concrete details on incorporation of multi-column layouts, font variations, or rendering artifacts typical of real academic PDFs. This assumption is load-bearing for the central claim of actionable performance disparities in the evaluation of 20+ parsers, as the reported gaps may not transfer if these complexities are underrepresented.
- [Evaluation section] Parser selection criteria are not specified (evaluation section), leaving unclear whether the 20+ tools form a representative sample or are biased toward particular architectures; this affects the reliability of the disparity findings.
minor comments (2)
- Figure captions for parser output examples could include explicit annotations of matched vs. mismatched formulas to improve readability.
- [Methods] The two-stage matching pipeline is described clearly but would benefit from a pseudocode listing or explicit parameter values for the fuzzy validation step.
Simulated Authors' Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: [§3 (Synthetic PDF Generation)] The synthetic PDF generation process (described at high level in the abstract and §3) provides no concrete details on incorporation of multi-column layouts, font variations, or rendering artifacts typical of real academic PDFs. This assumption is load-bearing for the central claim of actionable performance disparities in the evaluation of 20+ parsers, as the reported gaps may not transfer if these complexities are underrepresented.
  Authors: We agree that the current description of the synthetic PDF generation in §3 is high-level and would benefit from additional concrete implementation details to better justify transferability. In the revised manuscript, we will expand §3 to specify the exact mechanisms used: multi-column layouts generated via the LaTeX multicol package with controlled column counts and widths; font variations implemented through selection of standard academic typefaces (e.g., Computer Modern, Times, and sans-serif variants) with randomized sizes and styles; and rendering artifacts simulated by applying controlled PDF compression, noise injection, and anti-aliasing effects during compilation. These additions will directly address the concern and support the reliability of the observed performance disparities.
  Revision: yes
- Referee: [Evaluation section] Parser selection criteria are not specified (evaluation section), leaving unclear whether the 20+ tools form a representative sample or are biased toward particular architectures; this affects the reliability of the disparity findings.
  Authors: We acknowledge that the evaluation section does not explicitly state the parser selection criteria, which is necessary to evaluate potential bias. The 20+ parsers were selected to represent a broad cross-section of contemporary tools, prioritizing those with high community adoption (measured by GitHub stars and citations), support for mathematical content extraction, and architectural diversity (including rule-based, OCR-dependent, and neural network-based parsers). In the revised manuscript, we will add a new subsection in the evaluation section that explicitly lists the selection criteria, the full list of evaluated parsers with their categories, and a brief rationale for inclusion to demonstrate representativeness.
  Revision: yes
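The first response mentions multi-column layouts via the LaTeX multicol package and a choice of typefaces. A rough sketch of how such a generator could emit LaTeX source together with its ground-truth formula list is shown below. The formula pool, the font mapping, and the function name are hypothetical, and rendering artifacts such as compression or noise are not modeled here.

```python
import random
from typing import List, Tuple

# Hypothetical pool of formulas; a real generator would draw from a larger set.
FORMULA_POOL = [
    r"E = mc^2",
    r"\int_0^1 x^2 \, dx = \frac{1}{3}",
    r"\sum_{k=1}^{n} k = \frac{n(n+1)}{2}",
    r"e^{i\pi} + 1 = 0",
]

# Assumed mapping from a font label to the preamble line that selects it.
FONT_PREAMBLE = {
    "computer-modern": "",  # LaTeX default, nothing to load
    "times": r"\usepackage{mathptmx}",
    "sans": r"\renewcommand{\familydefault}{\sfdefault}",
}


def build_document(n_formulas: int, columns: int, font: str,
                   seed: int) -> Tuple[str, List[str]]:
    """Return (LaTeX source, ground-truth formulas in document order)."""
    rng = random.Random(seed)  # seeded so every document is reproducible
    formulas = [rng.choice(FORMULA_POOL) for _ in range(n_formulas)]

    lines = [r"\documentclass{article}", r"\usepackage{amsmath}",
             r"\usepackage{multicol}"]
    if FONT_PREAMBLE[font]:
        lines.append(FONT_PREAMBLE[font])
    lines.append(r"\begin{document}")
    if columns > 1:
        lines.append(r"\begin{multicols}{%d}" % columns)
    for formula in formulas:
        lines.append("Filler sentence preceding a displayed formula.")
        lines.append(r"\[" + formula + r"\]")
    if columns > 1:
        lines.append(r"\end{multicols}")
    lines.append(r"\end{document}")
    return "\n".join(lines), formulas


tex_source, ground_truth = build_document(n_formulas=5, columns=2,
                                          font="times", seed=0)
```

Because the formulas are inserted programmatically, the ground truth is exact by construction, which is what lets layout and font vary independently of the content being scored.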
Circularity Check
No circularity: empirical benchmark grounded in independent human validation and public releases
The paper's central results derive from direct evaluation of 20+ parsers on 100 synthetic documents containing 2,000+ formulas, using an LLM-as-a-judge metric that is separately validated against 750 independent human ratings (Pearson r=0.78 vs. r=0.34 for character matching). No equations, parameter fits, or derivations are present that reduce by construction to the paper's own inputs. The synthetic generation process and two-stage matching pipeline are described as controllable and robust but are not claimed to be derived from the evaluation outcomes themselves. Public code and data releases further allow external reproduction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing for the performance disparity claims. This is a standard empirical benchmarking study whose claims rest on observable outputs rather than self-referential definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Synthetically generated PDFs with controlled layouts and formulas adequately represent the parsing difficulties present in real-world academic PDFs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Evaluating 20+ contemporary PDF parsers across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.