arxiv: 2605.26712 · v1 · pith:Q6RN5JHRnew · submitted 2026-05-26 · 💻 cs.CV

METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition

Pith reviewed 2026-06-29 18:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords automatic text recognition · multilingual benchmark · vision large language models · document diversity · ATR evaluation · real-world documents · evolving benchmark · multilingual ATR

The pith

METATR is a multilingual evolving benchmark for evaluating automatic text recognition on diverse real-world documents in 29 languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents METATR as a new benchmark to evaluate ATR systems including vision LLMs on a wide variety of documents. Existing evaluations often use modern English printed texts, limiting their usefulness for practical applications. METATR draws from public collections to cover 29 languages, different scripts and layouts. It standardizes prompting, normalization, and provides a dynamic framework for reproducible and extensible evaluation. Results from testing multiple models show proprietary systems are most consistent but performance varies significantly across scripts and layouts.

Core claim

METATR (v1.0) introduces a dataset from various public collections covering 29 languages with multiple scripts and layouts, along with a standardized prompting and normalization methodology and a dynamic evaluation framework intended to produce reproducible results while remaining extensible over time, allowing for meaningful model comparison and selection in real-world conditions.

What carries the argument

The METATR benchmark dataset combined with its standardized prompting, normalization methodology, and dynamic evaluation framework for multilingual ATR assessment.
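
To make the role of normalization concrete, here is a minimal sketch of how a standardized normalization step and a character error rate (CER) could fit together. It is an illustration only: the summary above does not say which metric or normalization rules METATR uses, so the `normalize` rules (Unicode NFC plus whitespace collapsing) and the choice of CER are assumptions, and all function names are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: Unicode NFC, then collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate after applying the same normalization to both sides."""
    ref = normalize(reference)
    hyp = normalize(prediction)
    if not ref:
        # Assumed convention: an empty reference scores 0 only if the prediction is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Applying one normalization function to both reference and prediction is what keeps scores comparable across models: differences in whitespace or in composed versus decomposed accents would otherwise count as recognition errors.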

If this is right

  • Practitioners can select ATR models based on performance for specific languages, scripts, or layouts.
  • Progress in the field can be tracked as new models and document types are added to the evolving benchmark.
  • Variability in model performance across different document types is quantified for better understanding of limitations.
  • Both open-source and closed-source models can be compared under the same standardized conditions.
  • Computational efficiency is reported alongside accuracy to inform deployment decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work might integrate METATR with other benchmarks to create even broader evaluations.
  • The emphasis on real-world conditions could encourage development of more robust vLLMs for handwritten and varied layout documents.
  • Language-level performance reporting might help prioritize improvements in underrepresented scripts.
  • Extending the benchmark to include more evolving elements like user-submitted documents could enhance its relevance.

Load-bearing premise

Documents selected from various public collections adequately represent the complexity and diversity of real-world documents across 29 languages, scripts, and layouts.

What would settle it

If adding new documents to the benchmark changes the relative rankings of models in ways that do not match real-world application outcomes, the representativeness of the selection would be questioned.
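
One way to make this test operational is to compare the model ranking produced by the benchmark with the ranking observed in a target application, using a rank correlation. The sketch below computes Kendall's tau (the tau-a variant, in which tied pairs count as neither concordant nor discordant) in plain Python. The model names and scores in the usage note are invented for illustration and are not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a: dict, scores_b: dict) -> float:
    """Kendall's tau-a between two score tables keyed by model name.

    Only models present in both tables are compared. A value of 1.0 means
    identical ordering and -1.0 means fully reversed ordering.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models to compare rankings")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        product = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 is a tie in at least one table; it adds to neither count
    total_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / total_pairs
```

For example, with hypothetical error rates `{"A": 0.1, "B": 0.2, "C": 0.3}` on the benchmark and `{"A": 0.1, "B": 0.3, "C": 0.2}` in deployment, two of the three model pairs agree and one disagrees, giving a tau of 1/3. A tau that stays low as documents are added would be evidence against the representativeness of the selection.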

Figures

Figures reproduced from arXiv: 2605.26712 by Christopher Kermorvant, Mélodie Boillet, Solène Tarride.

Figure 1: Distribution of languages and example images in the benchmark dataset.
Original abstract

Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces METATR (v1.0), a multilingual evolving benchmark for Automatic Text Recognition (ATR) that selects documents from public collections to cover 29 languages with multiple scripts and layouts, aiming to maximize diversity and real-world representativeness. It defines a standardized prompting and normalization methodology plus a dynamic evaluation framework, evaluates a range of open- and closed-source vLLM and other ATR systems, and reports results showing proprietary models achieve the most consistent performance while substantial variability remains across scripts and layouts.

Significance. If the document selection and coverage can be shown to adequately capture real-world multilingual ATR complexity, METATR would supply a practitioner-oriented, extensible framework for model comparison and selection that addresses the English-centric bias of prior benchmarks.

major comments (2)
  1. [Abstract] The central claim that the benchmark 'maximizes diversity' and 'represent[s] the complexity of real-world documents' across 29 languages, scripts, and layouts rests on selection from 'various public collections'. The manuscript supplies no explicit sampling rules, stratification criteria, diversity metrics (e.g., script entropy, layout-type coverage, degradation distribution), or comparison against any reference corpus of real-world documents. This unverified premise directly undermines the multidimensional framework claim.
  2. [Evaluation] Evaluation section (implied by results reporting): the reported performance variability across scripts and layouts is presented without accompanying error analysis, per-language document counts, or robustness statistics that would allow readers to assess whether observed differences reflect genuine benchmark properties rather than sampling artifacts.
minor comments (1)
  1. [Abstract] The abstract states the benchmark is 'evolving' and 'dynamic' but does not specify the versioning or update mechanism that would enable reproducible tracking of progress over time.
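
The diversity metrics named in the first major comment are straightforward to compute once each document carries a label. As a rough sketch, script entropy could be measured as the Shannon entropy of the script labels, normalized by its maximum so that collections with different numbers of scripts can be compared. The label values in the usage note are hypothetical, and the paper itself reports no such metric.

```python
import math
from collections import Counter


def shannon_entropy(labels) -> float:
    """Shannon entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(labels) -> float:
    """Entropy divided by its maximum, log2(k) for k distinct labels.

    Returns a value in [0, 1]; 1 means perfectly balanced coverage.
    """
    k = len(set(labels))
    if k <= 1:
        return 0.0
    return shannon_entropy(labels) / math.log2(k)
```

A collection labeled `["Latin", "Latin", "Arabic", "Arabic"]` has an entropy of 1 bit and a normalized entropy of 1, while a collection dominated by a single script scores close to 0. The same functions apply unchanged to layout-type or degradation labels.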

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and note planned revisions where appropriate.

  1. Referee: [Abstract] The central claim that the benchmark 'maximizes diversity' and 'represent[s] the complexity of real-world documents' across 29 languages, scripts, and layouts rests on selection from 'various public collections'. The manuscript supplies no explicit sampling rules, stratification criteria, diversity metrics (e.g., script entropy, layout-type coverage, degradation distribution), or comparison against any reference corpus of real-world documents. This unverified premise directly undermines the multidimensional framework claim.

    Authors: We agree that the manuscript does not supply explicit sampling rules, stratification criteria, or quantitative diversity metrics, nor does it compare the collection against a reference corpus. Document selection was performed by drawing from multiple public collections to achieve coverage of 29 languages, varied scripts, and layouts while favoring real-world documents; however, this process was not formalized with the metrics mentioned. We will revise the abstract and add a dedicated section describing the curation approach, provide per-language document counts, and explicitly discuss the limitations of the current selection procedure with respect to verifiable diversity maximization.
    revision: yes

  2. Referee: [Evaluation] Evaluation section (implied by results reporting): the reported performance variability across scripts and layouts is presented without accompanying error analysis, per-language document counts, or robustness statistics that would allow readers to assess whether observed differences reflect genuine benchmark properties rather than sampling artifacts.

    Authors: The manuscript reports aggregate and language-level results together with some robustness observations for handwritten documents, but we concur that the absence of per-language document counts, detailed error analysis, and additional robustness statistics limits the ability to interpret the reported variability. We will expand the evaluation section to include a table of document counts per language, basic error breakdowns for representative scripts and layouts, and further statistical summaries of robustness.
    revision: yes
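
To show what the requested robustness statistics might look like, here is a small sketch of a percentile bootstrap confidence interval for the mean error rate of one language. It is one possible approach rather than anything the authors commit to: the per-document error rates are assumed to be available as a list of floats, and the function name, resample count, and seed are illustrative choices.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Resamples the per-document scores with replacement and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled means.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_index], means[hi_index]
```

Reporting such an interval next to each per-language score, together with the document count, would let readers see when two models differ by less than the sampling noise. Languages with only a handful of documents will show wide intervals, which is exactly the sampling-artifact concern the referee raises.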

Circularity Check

0 steps flagged

No circularity: benchmark paper contains no derivations or fitted predictions.

METATR introduces a dataset and evaluation protocol with no equations, parameter fits, or predictions. The text asserts diversity maximization via selection from public collections but supplies no quantitative derivation, sampling formula, or self-referential reduction that could qualify as circular under the enumerated patterns. No self-citation chains, ansatzes, or uniqueness theorems appear in the provided sections. The central claim rests on qualitative dataset construction rather than any input-to-output equivalence by construction, making a score of 0 the appropriate finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contributes a benchmark dataset and evaluation protocol rather than any theoretical derivation; no free parameters, mathematical axioms, or invented entities are invoked.

