Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

Antonia Karamolegkou; Benoît Sagot; Nicolas Angleraud; Thibault Clérice

arXiv: 2605.27750 · v1 · pith:R6IC4VPSnew · submitted 2026-05-26 · 💻 cs.CL · cs.AI · cs.CV · cs.DL


Pith reviewed 2026-06-29 17:49 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.DL

keywords vision-language models · OCR · Ancient Greek · visual grounding · language priors · optical character recognition · historical documents


The pith

VLMs used for OCR on Ancient Greek often output fluent text that follows language priors and ignores the actual image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that vision-language models applied to optical character recognition on low-resource Ancient Greek critical editions frequently generate plausible Greek text that the visual input does not support. The proposed explanation is that the models lean on language-model priors rather than on the image pixels, whereas traditional OCR engines produce local noise when they err. The authors test this by applying character-level image perturbations and by measuring how much each model's token predictions shift between decoding with the image and decoding without it. The result is that fluent output does not guarantee visual grounding, and that the degree of image dependence varies by model type.

Core claim

Under controlled character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful to the altered image; token-level analysis further reveals that an OCR-specialist VLM produces fluent lexical substitutions with little dependence on the image, whereas general-purpose VLMs stay conditioned on visual input even when their output is wrong.

What carries the argument

Controlled image perturbations together with token-level grounding measures that compare conditional decoding distributions against image-free decoding distributions.
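
To make the second ingredient concrete, here is a minimal Python sketch of one way a token-level measure of this kind could be computed. It assumes the measure is the difference in log-probability of the emitted token with and without the image; this summary does not say which statistic the authors actually use, and the function names (`log_softmax`, `image_gain`) and the toy logits are illustrative, not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def image_gain(logits_with_image, logits_without_image, token_id):
    """Assumed definition: log p(token | prefix, image) - log p(token | prefix)."""
    lp_img = log_softmax(logits_with_image)[token_id]
    lp_txt = log_softmax(logits_without_image)[token_id]
    return lp_img - lp_txt

# Toy next-token logits over a 3-token vocabulary (hypothetical values).
with_image = [2.0, 0.0, 0.0]      # the image pushes probability toward token 0
without_image = [0.0, 0.0, 0.0]   # image-free pass is uniform

gain = image_gain(with_image, without_image, token_id=0)
print(f"image gain for token 0: {gain:.3f}")
```

Under this assumed definition, a gain near zero means the image barely changed the probability of the emitted token, which is what one would expect for a prior-driven token; a large positive gain means the token depended on the image.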

If this is right

OCR-specialist VLMs can emit fluent but image-ungrounded lexical substitutions.
General-purpose VLMs continue to condition on the visual input even on incorrect outputs.
Decode-time interventions do not reliably increase visual grounding.
Post-generation language-model correction can improve final text quality without fixing the underlying grounding problem.
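
This summary does not say which decode-time interventions the authors tried. As an illustration of what such an intervention can look like, the Python sketch below applies a generic contrastive adjustment that boosts tokens favored by the image-conditioned pass over the image-free pass. The function names, the `alpha` weight, and the toy logits are assumptions made for illustration only.

```python
def contrastive_logits(logits_with_image, logits_without_image, alpha=1.0):
    """Amplify what the image adds: (1 + alpha) * with_image - alpha * without_image."""
    return [
        (1.0 + alpha) * a - alpha * b
        for a, b in zip(logits_with_image, logits_without_image)
    ]

def argmax(values):
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits: token 1 is favored by the language prior,
# token 0 is the one the image actually supports.
with_image = [1.0, 1.2, 0.0]
without_image = [0.0, 1.5, 0.0]

plain_choice = argmax(with_image)
adjusted_choice = argmax(contrastive_logits(with_image, without_image, alpha=1.0))
print(plain_choice, adjusted_choice)
```

In this toy case the adjustment flips the greedy choice toward the image-supported token. The paper's finding is that interventions at this stage do not reliably have that effect on real models.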

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation of historical-document OCR should include explicit checks for visual grounding rather than fluency alone.
Hybrid pipelines that run traditional OCR first and then apply VLM correction only where needed may reduce ungrounded fluent errors.
The same perturbation-plus-distribution-shift test could be applied to other low-resource scripts to measure prior reliance.

Load-bearing premise

The controlled image perturbations and the conditional-versus-image-free decoding measures accurately reflect how much the model actually uses the image during ordinary decoding.

What would settle it

The claim of language-prior dominance would be falsified if, under the same character-level perturbations, VLM outputs stayed as close to the perturbed ground truth as traditional OCR outputs do.
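
A test of this kind can be sketched in Python by scoring each system's output against both the perturbed text (what the image shows) and the original text (what a language prior would favor). The helper names and the one-word example are hypothetical, and character error rate is computed here as plain Levenshtein distance over the reference length, which may differ from the paper's exact normalization.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate of hypothesis against reference."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

def reversion_report(output, original, perturbed):
    """Compare an output with what the image shows and with the unperturbed text."""
    to_perturbed = cer(output, perturbed)
    to_original = cer(output, original)
    return {
        "cer_vs_perturbed": to_perturbed,
        "cer_vs_original": to_original,
        "reverts_to_prior": to_original < to_perturbed,
    }

original = "λόγος"    # text before perturbation
perturbed = "λόγας"   # text actually rendered in the image (one character changed)

vlm_like = reversion_report("λόγος", original, perturbed)  # "repairs" the word
ocr_like = reversion_report("λόγας", original, perturbed)  # copies the image
print(vlm_like)
print(ocr_like)
```

If VLM outputs scored as low against the perturbed text as the traditional engine's outputs, the divergence the paper reports would not be there.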

Figures

Figures reproduced from arXiv: 2605.27750 by Antonia Karamolegkou, Benoît Sagot, Nicolas Angleraud, Thibault Clérice.

**Figure 2.** Per-seed CER for the five VLMs (dots = the five sampling seeds, bar = sampling mean, diamond = greedy).

**Figure 3.** Composition of word-level errors per model on the real Greek test set, as the share of each model’s errors...

**Figure 4.** Rendered examples from the CLLG synthetic test set used in RQ2. Ancient Greek prose is typeset with...

**Figure 5.** Perturbation conditions illustrated on one passage.

**Figure 6.** Distribution of token-level image gain by to...

Original abstract

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that VLMs produce fluent but ungrounded OCR output on Ancient Greek editions more often than traditional engines do, with model-specific differences in image reliance, though the perturbation method risks confounding noise sensitivity with grounding.


The core observation is that VLMs doing OCR on low-resource Ancient Greek critical editions often output fluent text that diverges from perturbed images, while traditional OCR stays closer to the visual changes. The authors add token-level comparisons between conditional and image-free decoding to separate the models: an OCR-specialist model shows little image dependence on its errors, while general-purpose VLMs stay more tied to the image even when wrong. Decode-time interventions do not fix grounding reliably, but post-OCR LM correction helps some systems.

What stands out is the extension to historical documents plus the split between model types and the use of perturbations for grounding checks. That moves beyond aggregate accuracy and gives a concrete way to look at when output is visually supported.

The soft spot is the perturbation design. Character-level changes could simply disrupt the visual encoder differently than they affect traditional OCR, so the divergence gap might reflect general brittleness to local noise rather than absence of grounding in normal decoding. The abstract mentions no controls like random non-semantic perturbations of matched strength or matched-encoder baselines, which leaves room for that alternative. If the full results include those or error bars on the token measures, it would tighten the claim.

This is useful for people evaluating VLMs on historical OCR or building grounding-aware tests. It is worth sending to peer review because the empirical comparison and model split are concrete enough to discuss, even if the interpretation of the measures needs more checks.

Referee Report

2 major / 2 minor

Summary. The manuscript examines visual grounding failures in vision-language models (VLMs) applied to OCR on low-resource Ancient Greek critical editions. It claims that VLMs often produce fluent but visually unsupported text by relying on language priors, in contrast to traditional OCR engines that generate more local noise. The authors introduce controlled character-level image perturbations and token-level grounding metrics based on conditional versus image-free decoding distributions. Under these perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains more faithful; however, the degree of prior reliance is model-specific (fluent errors with little image conditioning in OCR-specialist models versus continued visual conditioning in general-purpose VLMs even when incorrect). Decode-time interventions do not reliably restore grounding, while post-OCR LM correction helps only by repairing output after generation. The work extends prior evidence to historical documents and a broader model set, arguing that fluent output is not necessarily visually grounded.

Significance. If the central empirical findings hold, the paper makes a useful contribution by extending observations of language-prior reliance in VLMs to a challenging low-resource historical domain and by distinguishing behavior across model types. The perturbation-plus-distribution-comparison approach attempts to move beyond aggregate accuracy toward interpretability-driven evaluation, which is a positive direction. The focus on Ancient Greek editions is well-chosen given the domain's sensitivity to fluent substitutions. No machine-checked proofs or parameter-free derivations are present, but the reproducible experimental setup on public editions would be a strength if code and data are released.

major comments (2)

[Perturbation Analysis] Perturbation Analysis section: the central claim that character-level perturbations plus conditional-vs-image-free token distributions isolate visual grounding (rather than general sensitivity to local noise) is load-bearing for the model-specific prior-reliance conclusions. The manuscript does not appear to include controls such as random non-semantic perturbations of matched magnitude or comparisons against models sharing the same visual encoder; without these, the divergence gap could reflect differences in receptive-field size or training rather than absence of grounding per se.
[Intervention Experiments] Results on decode-time interventions: the statement that such interventions 'fail to reliably restore grounding' is central to the practical takeaway, yet the quantitative effect sizes and failure modes (e.g., which tokens are affected) are not reported with sufficient detail to assess whether the interventions were applied at the appropriate decoding stage or with appropriate strength.
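
The control requested in the first major comment can be sketched as follows in Python, assuming perturbations are applied to the text before it is rendered to an image (the rendering step is not shown). The confusable-character map is a small hypothetical example and not the authors' substitution table; the random control changes the same number of characters so that the two conditions are matched in magnitude.

```python
import random

# Hypothetical visually confusable pairs; not taken from the paper.
CONFUSABLE = {"ο": "σ", "σ": "ο", "ν": "υ", "υ": "ν"}
ALPHABET = "αβγδεζηθικλμνξοπρστυφχψω"

def perturb_confusable(text, rate, rng):
    """Swap confusable characters with probability `rate`; return text and change count."""
    out, changed = [], 0
    for ch in text:
        if ch in CONFUSABLE and rng.random() < rate:
            out.append(CONFUSABLE[ch])
            changed += 1
        else:
            out.append(ch)
    return "".join(out), changed

def perturb_random(text, n_changes, rng):
    """Control condition: replace `n_changes` letters with arbitrary different letters."""
    positions = [i for i, ch in enumerate(text) if ch in ALPHABET]
    chosen = rng.sample(positions, min(n_changes, len(positions)))
    chars = list(text)
    for i in chosen:
        chars[i] = rng.choice([c for c in ALPHABET if c != chars[i]])
    return "".join(chars)

rng = random.Random(0)
text = "ανθρωπος νους"
confused, n = perturb_confusable(text, rate=1.0, rng=rng)
control = perturb_random(text, n_changes=n, rng=rng)
print(confused, n)
print(control)
```

If a model diverges from the perturbed text equally under both conditions, the gap is more plausibly general sensitivity to local noise than a grounding failure specific to plausible substitutions.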

minor comments (2)

[Abstract] Abstract: the claim that 'traditional OCR remains comparatively faithful' would benefit from a brief quantitative anchor (e.g., character error rate delta) even in the abstract.
[Methods] Notation: the token-level grounding measure is described as comparing 'conditional versus image-free decoding distributions'; a short equation or pseudocode in the methods would clarify whether this is a KL divergence, probability ratio, or other statistic.
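
For reference, two of the candidate statistics named in the last comment can be written as below. The notation is assumed here and is not taken from the manuscript: $I$ is the page image, $y_t$ the token emitted at step $t$, $y_{<t}$ the previously decoded tokens, $V$ the vocabulary, and $p_\theta$ the model's next-token distribution.

```latex
% (a) Log-probability ratio for the emitted token
\[
  g_t = \log p_\theta(y_t \mid y_{<t}, I) - \log p_\theta(y_t \mid y_{<t})
\]

% (b) KL divergence between the full next-token distributions
\[
  D_t = \sum_{v \in V} p_\theta(v \mid y_{<t}, I)\,
        \log \frac{p_\theta(v \mid y_{<t}, I)}{p_\theta(v \mid y_{<t})}
\]
```

Statistic (a) is signed and specific to the token that was actually emitted; (b) is non-negative and measures how much the image reshapes the whole distribution, so the two can disagree at a given position.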

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and note planned revisions to strengthen the manuscript.


Referee: [Perturbation Analysis] Perturbation Analysis section: the central claim that character-level perturbations plus conditional-vs-image-free token distributions isolate visual grounding (rather than general sensitivity to local noise) is load-bearing for the model-specific prior-reliance conclusions. The manuscript does not appear to include controls such as random non-semantic perturbations of matched magnitude or comparisons against models sharing the same visual encoder; without these, the divergence gap could reflect differences in receptive-field size or training rather than absence of grounding per se.

Authors: We appreciate the referee's identification of this potential confound. Our character-level perturbations target visually similar substitutions typical of Greek OCR errors, and the contrast with traditional OCR (which remains locally faithful) provides evidence that the observed divergence is not solely general noise sensitivity. To further address the concern, we will add a control experiment using random non-semantic perturbations of matched magnitude. Direct comparisons with models sharing an identical visual encoder are constrained by the set of available open-weight VLMs; we will explicitly discuss this architectural limitation and its implications in the revised text.

revision: partial

Referee: [Intervention Experiments] Results on decode-time interventions: the statement that such interventions 'fail to reliably restore grounding' is central to the practical takeaway, yet the quantitative effect sizes and failure modes (e.g., which tokens are affected) are not reported with sufficient detail to assess whether the interventions were applied at the appropriate decoding stage or with appropriate strength.

Authors: We agree that additional quantitative detail is needed to evaluate the interventions. In the revised manuscript we will report effect sizes on the grounding metrics, token-level breakdowns of affected positions, and explicit information on the decoding stages and intervention strengths applied.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained


The paper is an empirical study that compares VLMs against traditional OCR baselines on perturbed Ancient Greek edition images. It defines token-level grounding measures operationally as differences between conditional and image-free decoding distributions, then reports observed divergences under character-level perturbations. No equations, fitted parameters, or predictions reduce to their own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central observations rely on external baselines and direct experimental outputs rather than any self-referential chain, so the derivation chain contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the perturbation and grounding measures introduced in the paper, which are not detailed in the abstract. No free parameters or invented entities are mentioned.

axioms (1)

Domain assumption: Controlled image perturbations and token-level measures based on conditional versus image-free decoding distributions accurately reflect visual grounding during normal OCR decoding.
This assumption underpins the analysis of prior reliance and is invoked to interpret model behavior on perturbed inputs.

pith-pipeline@v0.9.1-grok · 5770 in / 1243 out tokens · 32994 ms · 2026-06-29T17:49:06.747756+00:00 · methodology

