Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

Aditya Pratap Singh

arxiv: 2606.29213 · v1 · pith:HJYWEOO3new · submitted 2026-06-28 · 💻 cs.CL · cs.CV

Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

Aditya Pratap Singh This is my paper

Pith reviewed 2026-06-30 07:53 UTC · model grok-4.3

classification 💻 cs.CL cs.CV

keywords DevanagariOCRVision-Language ModelsBenchmarkIndic scriptsSynthetic vs realPost-correctionHindi

0 comments

The pith

Real Devanagari scans cause nine of ten OCR systems to collapse, with scores spanning a 76-point range that synthetic tests hide.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates ten OCR systems, from classical engines to frontier multimodal models, on Devanagari Hindi text. All ten perform similarly on clean synthetic renders, clustering between chrF++ 91 and 98. On real printed scans the same systems diverge sharply, with most dropping far below their synthetic scores and the group covering a 76-point spread. This matters because it shows that synthetic benchmarks do not reveal how these systems will behave on actual documents. The work also introduces an error taxonomy and tests a byte-level post-corrector that improves one engine on its own errors but does not transfer.

Core claim

On real printed Devanagari scans nine of the ten systems collapse relative to synthetic conditions, EasyOCR falling from chrF++ 93.6 to 58.3 and the field spreading across a 76-point range, while synthetic renders keep all systems clustered within 91-98. Specialized OCR-VLMs show rare catastrophic repetition failures that distort mean scores, English OCR strength does not predict Indic results, and a ByT5 post-corrector raises performance on its training distribution but fails to generalize across engines.

What carries the argument

A benchmark of ten OCR systems across four synthetic degradation conditions and 300 real printed scans, measured by chrF++ and an error taxonomy that separates surface errors from structural ones such as conjuncts and matras.

If this is right

Specialized OCR-VLMs remain fragile under degradation even when their median scores look strong.
English OCR performance is a poor predictor of results on Devanagari.
Byte-level post-correction yields gains only on the engine whose errors it was trained on.
Synthetic text fails to separate model capabilities on Devanagari and therefore cannot substitute for real scans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluations of OCR for non-Latin scripts should rely primarily on real scanned documents rather than synthetic renders.
Open models such as Qwen3-VL-8B can match or approach some closed frontier systems on this task while running on modest hardware.
Future benchmarks would benefit from including handwritten or low-quality historical Devanagari material to test additional failure modes.

Load-bearing premise

The four synthetic degradation conditions and the 300 real printed scans are representative of the challenges faced by OCR systems on Devanagari documents in practice.

What would settle it

A larger test set of real Devanagari documents drawn from varied sources and print conditions in which all ten systems maintain chrF++ scores within a narrow band near their synthetic levels would falsify the claim that synthetic renders overstate quality.

Figures

Figures reproduced from arXiv: 2606.29213 by Aditya Pratap Singh.

**Figure 2.** Figure 2: Under blur, DeepSeek-OCR has the best median CER (1 [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Error composition by category (clean). EasyOCR errors are dominated by numerals and [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: chrF++ on synthetic clean renders versus real printed scans, ten systems sorted by real [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: A ByT5 post-corrector trained on EasyOCR’s own error distribution improves EasyOCR [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts is largely uncharacterised. We benchmark ten systems on Devanagari (Hindi): classical EasyOCR; open VLMs (Qwen2.5-VL-3B, Qwen3-VL-8B, olmOCR-7B); specialised OCR-VLMs (DeepSeek-OCR, Unlimited-OCR); and frontier closed models (Gemini 2.5 Flash, Claude Opus 4.7, GPT-5.5, Mistral OCR), across four synthetic degradation conditions and 300 real printed scans. We report four findings. First, on clean rendered text all ten cluster within chrF++ 91 to 98, so synthetic text does not separate them. Second, under degradation the specialised OCR-VLMs are the most fragile: DeepSeek-OCR suffers rare but catastrophic repetition failures (outputs up to 71 the reference length) that wreck its corpus mean even though its median is the best of any system, which is why we report median and catastrophic-rate instead of the mean. Third, on real scans nine of the ten systems collapse (EasyOCR falls from chrF++ 93.6 to 58.3) and the field spreads across a 76-point range, so synthetic renders badly overstate Devanagari quality. Fourth, strong English OCR does not predict Indic OCR: GPT-5.5 drops to chrF++ 58.5 (tying classical EasyOCR) and olmOCR-7B, the model behind olmOCR-Bench, falls to 40.5, while the open Qwen3-VL-8B (75.2, runnable on a single 24 GB GPU) beats GPT-5.5 and approaches Mistral; Gemini and Claude lead at 86.3 and 82.2. An error taxonomy separates surface errors (numerals, punctuation) from structural ones (conjuncts, matras, nukta), and a byte-level (ByT5) post-corrector improves a cheap engine on its own error distribution (chrF++ +1.2 to +1.5) but does not transfer across engines. We release the benchmark, code, and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Synthetic Devanagari benchmarks look good but real printed scans expose big drops and a wide spread across models, though the 300-scan set needs clearer justification.

read the letter

The main takeaway is that clean synthetic Devanagari text keeps all ten systems clustered at high chrF++ scores while the 300 real scans cause nine of them to collapse, with EasyOCR falling to 58.3 and a 76-point range across the field. This suggests synthetic renders overstate practical quality for this script.

The paper tests classical EasyOCR, several open VLMs, specialized OCR-VLMs, and closed frontier models on four degradation conditions plus the real scans. It reports that specialized models suffer rare repetition failures that ruin means, so median and catastrophic-rate metrics are used instead. An error taxonomy splits surface issues like numerals from structural ones like conjuncts and matras. A ByT5 post-corrector lifts performance on its own error distribution but shows no transfer to other engines. The release of the benchmark, code, and models is a clear plus.

What stands out is the direct real-versus-synthetic comparison for Devanagari and the observation that strong English OCR does not predict results here. Qwen3-VL-8B beats some closed models while olmOCR-7B underperforms. The switch to median metrics is a straightforward response to the observed failure mode.

The soft spot is the real-scan collection. The abstract supplies no selection criteria, sources, document types, or quality statistics, so the claim that synthetic data badly overstates quality depends on whether these 300 scans are representative. If the full paper lacks those details or diversity measures, the central inference rests on unverified ground. The degradation implementations could also use more explicit description.

This work is for researchers evaluating or improving OCR on Indic scripts. A reader who needs concrete numbers on model fragility and released resources will find it useful. It deserves a serious referee because the empirical results are new for this script and the open release supports follow-up work.

I would send it to peer review with a request for more on scan selection and degradation methods.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks ten OCR systems (classical EasyOCR, open VLMs like Qwen2.5-VL/Qwen3-VL/olmOCR, specialized OCR-VLMs like DeepSeek-OCR/Unlimited-OCR, and frontier models like Gemini/Claude/GPT/Mistral) on Devanagari using four synthetic degradation conditions and 300 real printed scans. Key findings: all systems cluster at high chrF++ (91-98) on clean synthetic text; specialized VLMs show fragility with rare catastrophic repetition failures (prompting median and catastrophic-rate metrics); nine of ten systems collapse on real scans (e.g., EasyOCR from 93.6 to 58.3, 76-point spread); English OCR performance does not predict Indic results; an error taxonomy distinguishes surface vs. structural errors; and a ByT5 post-corrector yields small gains on matched error distributions but does not transfer. The benchmark, code, and models are released.

Significance. If the real scans are representative, the results establish that synthetic benchmarks substantially overstate OCR quality on Devanagari and that current systems (including frontier models) have major gaps on Indic scripts, with the resource release aiding reproducibility. The switch to median/catastrophic-rate metrics is a sound response to observed failure modes.

major comments (2)

[Dataset / real scans subsection] The section describing the 300 real printed scans provides no selection criteria, sources, document types, resolution/noise statistics, or diversity measures. This is load-bearing for the central claim (abstract and finding 3) that synthetic renders 'badly overstate Devanagari quality,' as the large drop (e.g., EasyOCR 93.6 o 58.3) and 76-point spread could reflect a non-representative subset rather than a general property of real Devanagari material.
[§3 / synthetic degradation conditions] The implementation of the four synthetic degradation conditions lacks specific parameters or algorithms (e.g., exact noise models, blur kernels, or distortion levels). This undermines the comparison to real scans and the inference that synthetic conditions fail to capture practical challenges.

minor comments (2)

[Abstract] Abstract: 'outputs up to 71 the reference length' is missing 'times'.
[Post-correction study] Clarify whether the post-correction experiments include cross-validation or hold-out sets to support the claim of non-transfer across engines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive comments on the need for greater transparency. We address each major comment below.

read point-by-point responses

Referee: [Dataset / real scans subsection] The section describing the 300 real printed scans provides no selection criteria, sources, document types, resolution/noise statistics, or diversity measures. This is load-bearing for the central claim (abstract and finding 3) that synthetic renders 'badly overstate Devanagari quality,' as the large drop (e.g., EasyOCR 93.6 to 58.3) and 76-point spread could reflect a non-representative subset rather than a general property of real Devanagari material.

Authors: We agree that additional details on the real scans are required to support the central claims. In the revised manuscript we will expand the relevant subsection to specify the sources (public-domain printed Hindi books and periodicals from digitized archives), document types, selection criteria (random sampling across available high-resolution scans), and any available statistics on resolution, noise, and diversity (e.g., number of unique documents and fonts). We will also note any limitations in representativeness. revision: yes
Referee: [§3 / synthetic degradation conditions] The implementation of the four synthetic degradation conditions lacks specific parameters or algorithms (e.g., exact noise models, blur kernels, or distortion levels). This undermines the comparison to real scans and the inference that synthetic conditions fail to capture practical challenges.

Authors: We acknowledge the need for precise implementation details. We will revise §3 to include the exact algorithms and parameters for each degradation condition (noise models, blur kernels with specific sizes and sigmas, distortion levels, etc.) together with references to the released code repository to ensure full reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or self-referential reductions

full rationale

The paper is a benchmark study reporting measured performance of ten OCR systems on synthetic degradations and 300 real scans. It states four findings based on direct chrF++ comparisons (e.g., clean synthetic 91-98, real scans collapse to 58.3 for EasyOCR with 76-point spread). No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described structure. The representativeness concern raised by the skeptic is an external-validity issue, not a circular reduction of any claimed derivation to its own inputs. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking paper with no mathematical derivations; central claims rest on the unstated assumption that chrF++ is the right metric for Devanagari and that the chosen test sets capture real-world distribution.

pith-pipeline@v0.9.1-grok · 5988 in / 1115 out tokens · 37468 ms · 2026-06-30T07:53:33.716925+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 4 canonical work pages · 4 internal anchors

[1]

H. Wei, Y. Sun, Y. Li. DeepSeek-OCR: Contexts Optical Compression. arXiv:2510.18234, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Unlimited OCR Works

Y. Yin et al. Unlimited OCR Works. arXiv:2606.23050, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Qwen2.5-VL Technical Report

S. Bai et al. Qwen2.5-VL Technical Report. arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

H. Wei et al. General OCR Theory: Towards OCR-2.0 via a Unified End-to-End Model. arXiv:2409.01704, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Ouyang et al

L. Ouyang et al. OmniDocBench. CVPR, 2025. 8 Figure 5: A ByT5 post-corrector trained on EasyOCR’s own error distribution improves EasyOCR chrF++ in every condition. 9

2025

[1] [1]

H. Wei, Y. Sun, Y. Li. DeepSeek-OCR: Contexts Optical Compression. arXiv:2510.18234, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Unlimited OCR Works

Y. Yin et al. Unlimited OCR Works. arXiv:2606.23050, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Qwen2.5-VL Technical Report

S. Bai et al. Qwen2.5-VL Technical Report. arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

H. Wei et al. General OCR Theory: Towards OCR-2.0 via a Unified End-to-End Model. arXiv:2409.01704, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Ouyang et al

L. Ouyang et al. OmniDocBench. CVPR, 2025. 8 Figure 5: A ByT5 post-corrector trained on EasyOCR’s own error distribution improves EasyOCR chrF++ in every condition. 9

2025