METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition

Christopher Kermorvant; Mélodie Boillet; Solène Tarride

arxiv: 2605.26712 · v1 · pith:Q6RN5JHRnew · submitted 2026-05-26 · 💻 cs.CV

Pith reviewed 2026-06-29 18:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords: automatic text recognition, multilingual benchmark, vision large language models, document diversity, ATR evaluation, real-world documents, evolving benchmark, multilingual ATR

The pith

METATR is a multilingual evolving benchmark for evaluating automatic text recognition on diverse real-world documents in 29 languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents METATR as a new benchmark to evaluate ATR systems including vision LLMs on a wide variety of documents. Existing evaluations often use modern English printed texts, limiting their usefulness for practical applications. METATR draws from public collections to cover 29 languages, different scripts and layouts. It standardizes prompting, normalization, and provides a dynamic framework for reproducible and extensible evaluation. Results from testing multiple models show proprietary systems are most consistent but performance varies significantly across scripts and layouts.

Core claim

METATR (v1.0) introduces a dataset from various public collections covering 29 languages with multiple scripts and layouts, along with a standardized prompting and normalization methodology and a dynamic evaluation framework intended to produce reproducible results while remaining extensible over time, allowing for meaningful model comparison and selection in real-world conditions.

What carries the argument

The METATR benchmark dataset combined with its standardized prompting, normalization methodology, and dynamic evaluation framework for multilingual ATR assessment.
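The summary above does not spell out the normalization rules or the scoring metric. As a minimal sketch, assuming character error rate (CER), a metric commonly used for ATR, and a hypothetical normalization (Unicode NFC plus whitespace collapsing), scoring one prediction against a reference could look like this. The names `normalize`, `edit_distance`, and `cer` are illustrative and are not taken from METATR.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: NFC composition, then collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        # Assumed convention for empty references; other conventions exist.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Applying the same normalization to both the reference and the model output is what keeps scores comparable across systems that differ only in spacing or in how they encode accented characters.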

If this is right

Practitioners can select ATR models based on performance for specific languages, scripts, or layouts.
Progress in the field can be tracked as new models and document types are added to the evolving benchmark.
Variability in model performance across different document types is quantified for better understanding of limitations.
Both open-source and closed-source models can be compared under the same standardized conditions.
Computational efficiency is reported alongside accuracy to inform deployment decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future work might integrate METATR with other benchmarks to create even broader evaluations.
The emphasis on real-world conditions could encourage development of more robust vLLMs for handwritten and varied layout documents.
Language-level performance reporting might help prioritize improvements in underrepresented scripts.
Extending the benchmark to include more evolving elements like user-submitted documents could enhance its relevance.

Load-bearing premise

Documents selected from various public collections adequately represent the complexity and diversity of real-world documents across 29 languages, scripts, and layouts.

What would settle it

If adding new documents to the benchmark changes the relative rankings of models in ways that do not match real-world application outcomes, the representativeness of the selection would be questioned.

Figures

Figures reproduced from arXiv: 2605.26712 by Christopher Kermorvant, Mélodie Boillet, Solène Tarride.

**Figure 1.** Distribution of languages and example images in the benchmark dataset.

… even the best models often have problems with historical documents. The authors found that more than a third of predictions from a small open-weight model had major hallucinations, and over 40% had reading-order errors on multi-column pages. LMMs vs. specialized systems: a fragmented picture. Direct comparisons between LMMs and special…

Original abstract

Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

METATR adds a 29-language ATR benchmark with a standardized evaluation protocol, but its diversity claims lack quantitative backing.

The core takeaway is that this paper ships a new multilingual benchmark for automatic text recognition covering 29 languages plus an evolving evaluation setup. That addresses a real gap where most ATR datasets stay English-centric and modern-printed.

It does a few things cleanly. The authors pull documents from public collections, define consistent prompting and normalization rules, and run a range of open and closed models with breakdowns by language, script, layout, handwriting robustness, and compute cost. Reporting results across those axes gives practitioners something concrete to use when choosing a model for non-English or degraded material.

The soft spot sits in the dataset construction. The abstract and stress-test both note that selection was done "to maximize diversity," yet no sampling rules, stratification criteria, diversity metrics (script entropy, layout coverage, degradation stats), or comparison to a reference corpus appear. Without those, the representativeness claim stays untested. The rest of the work does not depend on that claim being ironclad, but it does limit how far the "real-world conditions" framing can be taken.
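One of the diversity metrics named above, script entropy, can be illustrated with a short sketch. It assumes each document carries a single script label and uses Shannon entropy over the label distribution; the function names and the sample labels in the usage note are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def normalized_entropy(labels):
    """Entropy divided by its maximum, log2(number of distinct labels)."""
    k = len(set(labels))
    if k <= 1:
        return 0.0
    return shannon_entropy(labels) / math.log2(k)
```

For example, `normalized_entropy(["Latin"] * 90 + ["Arabic"] * 10)` is well below 1, signalling a collection dominated by one script, while a perfectly balanced collection scores 1. A figure like this would let readers compare the benchmark's balance against a reference corpus.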

This is for ATR researchers and practitioners who need to evaluate models beyond English printed text. Anyone building or selecting systems for historical or multilingual documents will find the evaluation framework and results useful even if they treat the diversity assertion with caution.

It deserves peer review. The benchmark itself is new and the evaluation protocol is reproducible enough to be worth referee time, though reviewers will likely press on the missing quantitative validation for sampling.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces METATR (v1.0), a multilingual evolving benchmark for Automatic Text Recognition (ATR) that selects documents from public collections to cover 29 languages with multiple scripts and layouts, aiming to maximize diversity and real-world representativeness. It defines a standardized prompting and normalization methodology plus a dynamic evaluation framework, evaluates a range of open- and closed-source vLLM and other ATR systems, and reports results showing proprietary models achieve the most consistent performance while substantial variability remains across scripts and layouts.

Significance. If the document selection and coverage can be shown to adequately capture real-world multilingual ATR complexity, METATR would supply a practitioner-oriented, extensible framework for model comparison and selection that addresses the English-centric bias of prior benchmarks.

major comments (2)

[Abstract] The central claim that the benchmark 'maximizes diversity' and 'represent[s] the complexity of real-world documents' across 29 languages/scripts/layouts rests on selection from 'various public collections' but supplies no explicit sampling rules, stratification criteria, diversity metrics (e.g., script entropy, layout-type coverage, degradation distribution), or comparison against any reference corpus of real-world documents; this unverified premise directly undermines the multidimensional framework claim.
[Evaluation] (Section implied by the results reporting.) The reported performance variability across scripts and layouts is presented without accompanying error analysis, per-language document counts, or robustness statistics that would allow readers to assess whether observed differences reflect genuine benchmark properties rather than sampling artifacts.
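The concern about sampling artifacts in the second comment can be made concrete with a percentile bootstrap over per-document scores. The sketch below is one common way to attach an interval to a per-language mean; it is not a procedure described in the paper, and the function name, parameters, and sample scores are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample documents with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical per-document error rates for one language.
example_scores = [0.02, 0.05, 0.10, 0.30, 0.08, 0.12]
low, high = bootstrap_ci(example_scores)
```

When a language is represented by only a handful of documents, the interval tends to be wide, which is why per-language document counts matter for interpreting differences between models.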

minor comments (1)

[Abstract] The abstract states the benchmark is 'evolving' and 'dynamic' but does not specify the versioning or update mechanism that would enable reproducible tracking of progress over time.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and note planned revisions where appropriate.

Referee: [Abstract] The central claim that the benchmark 'maximizes diversity' and 'represent[s] the complexity of real-world documents' across 29 languages/scripts/layouts rests on selection from 'various public collections' but supplies no explicit sampling rules, stratification criteria, diversity metrics (e.g., script entropy, layout-type coverage, degradation distribution), or comparison against any reference corpus of real-world documents; this unverified premise directly undermines the multidimensional framework claim.

Authors: We agree that the manuscript does not supply explicit sampling rules, stratification criteria, or quantitative diversity metrics, nor does it compare the collection against a reference corpus. Document selection was performed by drawing from multiple public collections to achieve coverage of 29 languages, varied scripts, and layouts while favoring real-world documents; however, this process was not formalized with the metrics mentioned. We will revise the abstract and add a dedicated section describing the curation approach, provide per-language document counts, and explicitly discuss the limitations of the current selection procedure with respect to verifiable diversity maximization. (Revision: yes)

Referee: [Evaluation] (Section implied by the results reporting.) The reported performance variability across scripts and layouts is presented without accompanying error analysis, per-language document counts, or robustness statistics that would allow readers to assess whether observed differences reflect genuine benchmark properties rather than sampling artifacts.

Authors: The manuscript reports aggregate and language-level results together with some robustness observations for handwritten documents, but we concur that the absence of per-language document counts, detailed error analysis, and additional robustness statistics limits the ability to interpret the reported variability. We will expand the evaluation section to include a table of document counts per language, basic error breakdowns for representative scripts and layouts, and further statistical summaries of robustness. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: benchmark paper contains no derivations or fitted predictions.

METATR introduces a dataset and evaluation protocol with no equations, parameter fits, or predictions. The text asserts diversity maximization via selection from public collections but supplies no quantitative derivation, sampling formula, or self-referential reduction that could qualify as circular under the enumerated patterns. No self-citation chains, ansatzes, or uniqueness theorems appear in the provided sections. The central claim rests on qualitative dataset construction rather than any input-to-output equivalence by construction, making a score of 0 the appropriate finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contributes a benchmark dataset and evaluation protocol rather than any theoretical derivation; no free parameters, mathematical axioms, or invented entities are invoked.

