Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

Bhushan Pawar; Madhu Reddiboina; Prabhjot Singh; Rajvee Sheth

arxiv: 2606.17188 · v2 · pith:O2XP2QQRnew · submitted 2026-06-15 · 💻 cs.CV · cs.CL

Pith reviewed 2026-06-27 04:07 UTC · model grok-4.3

keywords: vision-language models, multilingual evaluation, script consistency, Punjabi scripts, orthographic gap, VLM benchmark, multi-script languages, visual reasoning

The pith

Current multilingual VLMs are not truly multi-script, showing systematic performance gaps up to 16% on identical visual tasks across different scripts of one language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the PuMVR benchmark of 1,000 strictly parallel image-text instances in Punjabi across its three scripts to test orthographic handling in VLMs. Models frequently solve tasks in one script while failing the same tasks in another, and adding visual input raises overall accuracy but leaves the script-based differences unchanged. Cross-script transfer of in-context examples proves brittle, pointing to script-locked internal representations. Statistical tests confirm the gaps, leading to the proposal of the Script Consistency Rate as a required evaluation metric.

Core claim

Evaluating ten state-of-the-art VLMs on the PuMVR benchmark reveals a substantial and systematic Script Gap, where models solve visual tasks in one script while failing identical tasks in another. Visual input boosts absolute performance uniformly yet does not close the orthographic gap. Cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, the findings demonstrate that current multilingual VLMs are not truly multi-script, with the Script Consistency Rate falling as low as 24.8 percent.

What carries the argument

The PuMVR benchmark of 1,000 strictly parallel image-text instances across Punjabi's Gurmukhi, Shahmukhi, and Roman scripts, used to compute the Script Consistency Rate (SCR) as the measure of script-agnostic performance.

If this is right

Accuracy deltas reach 16% between scripts on identical visual reasoning tasks.
Visual input improves performance uniformly across scripts but does not reduce the orthographic gaps.
In-context learning examples do not transfer reliably when the script changes.
The Script Consistency Rate reaches lows of 24.8% on the benchmark, indicating inconsistent script handling.
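
As a rough illustration of the accuracy deltas mentioned above, the following Python sketch computes per-script accuracy and the largest gap between any two scripts from paired correctness records. The data layout (a dict of boolean lists keyed by script name), the function names, and the toy values are assumptions made for illustration; they are not taken from the paper's released code.

```python
from typing import Dict, List

# Hypothetical script labels; the benchmark's own field names may differ.
SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def per_script_accuracy(correct: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of instances answered correctly in each script."""
    return {s: sum(correct[s]) / len(correct[s]) for s in SCRIPTS}


def script_gap(correct: Dict[str, List[bool]]) -> float:
    """Largest accuracy difference between any two scripts."""
    acc = per_script_accuracy(correct)
    return max(acc.values()) - min(acc.values())


# Toy records: position i in every list refers to the same parallel instance.
toy = {
    "gurmukhi": [True, True, True, False, True],     # 4 of 5 correct
    "shahmukhi": [True, False, True, False, False],  # 2 of 5 correct
    "roman": [True, True, False, False, True],       # 3 of 5 correct
}

if __name__ == "__main__":
    print(per_script_accuracy(toy))
    print(round(script_gap(toy), 3))
```

On the toy records the gap is 0.4 (0.8 for Gurmukhi minus 0.4 for Shahmukhi). The real benchmark would supply 1,000 entries per script for each model.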

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar script gaps are likely to appear when testing other multi-script languages that share the same linguistic content across orthographies.
Tokenization differences and uneven script representation in training data are probable root causes of the observed script locking.
Requiring the Script Consistency Rate in standard VLM benchmarks would push development toward representations that treat scripts as interchangeable for the same language.
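
One way to see why tokenization could differ across scripts is to compare how the same word is encoded at the byte level. The sketch below is only a crude proxy: it counts Unicode code points and UTF-8 bytes for the word "Punjabi" written in each script, and it says nothing about how any particular VLM tokenizer actually segments the text. The sample strings and the function name are illustrative assumptions.

```python
from typing import Dict, Tuple

# The word "Punjabi" in each script, written as explicit code points so the
# example does not depend on the editor's or terminal's encoding.
SAMPLES: Dict[str, str] = {
    "gurmukhi": "\u0a2a\u0a70\u0a1c\u0a3e\u0a2c\u0a40",
    "shahmukhi": "\u067e\u0646\u062c\u0627\u0628\u06cc",
    "roman": "Punjabi",
}


def utf8_cost(text: str) -> Tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    for script, word in SAMPLES.items():
        code_points, n_bytes = utf8_cost(word)
        print(f"{script:10s} code points={code_points} utf8 bytes={n_bytes}")
```

Gurmukhi code points take three UTF-8 bytes each, Arabic-script code points take two, and basic Latin letters take one, so a tokenizer that falls back to bytes for under-represented scripts would see noticeably longer sequences for the same content. Whether this explains the reported gaps is an open question that the summary above only raises as a possibility.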

Load-bearing premise

The 1,000 instances are strictly parallel across scripts with identical task difficulty and content, so any performance differences can be attributed to script rather than rendering, tokenization, or data distribution factors.
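
A necessary (though far from sufficient) check on this premise is that every text field really is written in the script it is labeled with. The sketch below classifies characters by Unicode block and flags mixed-script strings. The block ranges are standard Unicode assignments, but the helper names and the decision to ignore digits, punctuation, and accented Latin letters are simplifying assumptions; semantic equivalence would still need human or back-translation review.

```python
from typing import Dict, List, Optional, Set, Tuple

# Unicode block ranges (inclusive). Shahmukhi is written in the Arabic script.
RANGES: Dict[str, List[Tuple[int, int]]] = {
    "gurmukhi": [(0x0A00, 0x0A7F)],
    "shahmukhi": [(0x0600, 0x06FF), (0x0750, 0x077F),
                  (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)],
    "roman": [(0x0041, 0x005A), (0x0061, 0x007A)],  # basic Latin letters only
}


def script_of(ch: str) -> Optional[str]:
    """Name of the script a character belongs to, or None if unclassified."""
    cp = ord(ch)
    for name, ranges in RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None  # digits, whitespace, ASCII punctuation, other scripts


def scripts_used(text: str) -> Set[str]:
    """Set of recognized scripts that appear anywhere in the text."""
    return {s for s in map(script_of, text) if s is not None}


def is_pure(text: str, expected: str) -> bool:
    """True when the text contains the expected script and no other."""
    return scripts_used(text) == {expected}


if __name__ == "__main__":
    print(is_pure("\u0a2a\u0a70\u0a1c\u0a3e\u0a2c\u0a40", "gurmukhi"))
    print(scripts_used("Punjabi \u0a2a\u0a70\u0a1c\u0a3e\u0a2c\u0a40"))
```

The same helper could be applied to model outputs to spot answers that drift into another script.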

What would settle it

Re-evaluating the ten VLMs on the PuMVR benchmark and obtaining no statistically significant accuracy differences between any script pairs according to McNemar tests would falsify the existence of a systematic script gap.
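
For readers who want to run this check themselves, here is a minimal standard-library sketch of McNemar's test on paired outcomes for two scripts. It reports the exact two-sided binomial p-value and the continuity-corrected chi-square statistic. The function names and the toy data are assumptions for illustration; the paper's own test variant and settings are not stated in this summary.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(a: Sequence[bool], b: Sequence[bool]) -> Tuple[int, int]:
    """Count instances solved only in script A and only in script B."""
    if len(a) != len(b):
        raise ValueError("paired outcomes must have the same length")
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    return only_a, only_b


def mcnemar_exact_p(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Exact two-sided McNemar p-value (binomial test on discordant pairs)."""
    only_a, only_b = discordant_counts(a, b)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def mcnemar_chi2(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Continuity-corrected McNemar chi-square statistic (1 degree of freedom)."""
    only_a, only_b = discordant_counts(a, b)
    n = only_a + only_b
    if n == 0:
        return 0.0
    return (abs(only_a - only_b) - 1) ** 2 / n


# Toy data: 8 items solved only in script A, 2 solved in both, none only in B.
script_a = [True] * 10
script_b = [False] * 8 + [True] * 2

if __name__ == "__main__":
    print(mcnemar_exact_p(script_a, script_b))
    print(mcnemar_chi2(script_a, script_b))
```

Only the discordant pairs (items solved in exactly one of the two scripts) carry information here, which is why the test suits the paired, instance-level design of the benchmark.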

Figures

Figures reproduced from arXiv: 2606.17188 by Bhushan Pawar, Madhu Reddiboina, Prabhjot Singh, Rajvee Sheth.

**Figure 1.** Sample instance from PuMVR showing cross-script equivalent options. PuMVR comprises 1,000 parallel instances. Each instance consists of one image paired with a question and four multiple-choice options provided in Gurmukhi, Shahmukhi, and Roman scripts, along with the corresponding correct answer (semantically equivalent across scripts).

**Figure 3.** Script-dependent accuracy and Script Consistency Rate (SCR) across 10 VLMs.

**Figure 2.** Transfer Efficiency (TE) heatmap (baseline=100%).

Original abstract

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows VLMs have large script-specific gaps on parallel Punjabi tasks and introduces a new benchmark plus metric, but the strict parallelism claim is the part that needs verification.

The main thing to know is that this work documents accuracy drops of up to 16% on identical visual reasoning items when the text switches between Gurmukhi, Shahmukhi, and Roman scripts, even after adding images, and it supplies a new benchmark and consistency metric to measure that.

The new elements are the PuMVR set of 1,000 parallel instances and the Script Consistency Rate. The paper evaluates ten current VLMs, reports the deltas, runs McNemar tests on the paired outcomes, and shows that vision lifts scores across the board without shrinking the script differences. Releasing the data and code is also useful. These pieces give a concrete way to test whether a model is actually script-agnostic rather than just language-agnostic.

The soft spot is the load-bearing assumption that the 1,000 instances are strictly equivalent in content and difficulty once the script changes. Any difference in phrasing, tokenization cost, or rendering could contribute to the observed gaps. The abstract states the instances are parallel, but the strength of the central claim rests on how carefully that equivalence was checked; without those details the attribution to script alone is harder to pin down. The scope is also limited to a single language (Punjabi), so the size of the problem in other multi-script settings remains open.

This is for researchers who build or evaluate VLMs for languages that use more than one script. Anyone working on fairness metrics or low-resource multimodal data would get direct value from the benchmark and the proposed rate.

It is worth sending to peer review so the data construction and statistical controls can be examined in full.

Referee Report

3 major / 2 minor

Summary. The paper introduces PuMVR, a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three scripts (Gurmukhi, Shahmukhi, Roman). Evaluating 10 VLMs, it reports accuracy deltas up to 16% between scripts on identical visual tasks, shows that visual input boosts performance uniformly but does not close the orthographic gap, finds cross-script in-context transfer brittle, and proposes the Script Consistency Rate (SCR) metric (as low as 24.8%) supported by McNemar tests to argue that current 'multilingual' VLMs are not truly multi-script.

Significance. If the parallelism of instances holds, the work identifies script consistency as an overlooked dimension in VLM evaluation for multi-script languages spoken by billions, with implications for equitable access. The public release of data and code is a clear strength for reproducibility and follow-up work.

major comments (3)

[§3 (Dataset Construction)] The central claim that performance deltas are attributable to script alone rests on the 1,000 instances being strictly parallel with identical semantics, task difficulty, and no confounds from translation or tokenization; the manuscript provides no explicit verification protocol (e.g., back-translation, expert equivalence ratings, or controls for Romanization variability), which is load-bearing for the attribution of gaps to orthography rather than other factors.
[§4 (Evaluation and Results)] While McNemar tests are invoked to support systematic gaps across script pairs, the manuscript should report exact test statistics, p-values (with correction for multiple comparisons), and effect sizes per model and pair to substantiate the 'substantial' claim rather than relying on accuracy deltas alone.
[§5 (Proposed Metric)] The Script Consistency Rate (SCR) is presented as a mandatory new metric, but its precise definition, aggregation across three scripts, and normalization must be formalized (ideally with an equation) to allow independent verification and ensure it is not reducible to simple accuracy variance.
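
The reporting requested in the §4 comment can be sketched in a few lines. The example below uses a Bonferroni adjustment and Cohen's h as one common choice of multiple-comparison correction and effect size for two proportions. The function names and the toy numbers are illustrative assumptions, and treating all model-by-script-pair tests as a single family is likewise an assumption about how the correction would be applied.

```python
from math import asin, sqrt
from typing import List


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply by the family size, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Toy accuracies for one model on two scripts (not values from the paper).
    print(round(cohens_h(0.62, 0.46), 3))
    # Toy raw p-values for three script pairs of one model.
    print(bonferroni([0.01, 0.5, 0.04]))
```

With 10 models and 3 script pairs, a single family would contain 30 tests, so each raw p-value would be multiplied by 30 under this scheme. A less conservative procedure such as Holm's could be substituted without changing the overall structure.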

minor comments (2)

Figure captions should explicitly label the three scripts and clarify whether bars represent per-script accuracy or consistency rates.
The related work section should more explicitly contrast PuMVR against prior multilingual VLM benchmarks that include non-Latin scripts to better position the novelty of the script-gap focus.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the rigor of our claims. We address each major comment point by point below.

Referee: [§3 (Dataset Construction)] The central claim that performance deltas are attributable to script alone rests on the 1,000 instances being strictly parallel with identical semantics, task difficulty, and no confounds from translation or tokenization; the manuscript provides no explicit verification protocol (e.g., back-translation, expert equivalence ratings, or controls for Romanization variability), which is load-bearing for the attribution of gaps to orthography rather than other factors.

Authors: We agree that an explicit verification protocol should have been included to fully support the parallelism claim. In the revised manuscript, we have added a new subsection in §3 that details the protocol used: independent back-translation of all instances, equivalence ratings by three native Punjabi linguists (with reported inter-rater agreement), and explicit controls for Romanization variants. These steps confirm identical semantics and task difficulty across scripts, allowing attribution of gaps to orthography. revision: yes
Referee: [§4 (Evaluation and Results)] While McNemar tests are invoked to support systematic gaps across script pairs, the manuscript should report exact test statistics, p-values (with correction for multiple comparisons), and effect sizes per model and pair to substantiate the 'substantial' claim rather than relying on accuracy deltas alone.

Authors: We concur that full statistical reporting is needed. The revised §4 now includes a supplementary table reporting exact McNemar χ² statistics, Bonferroni-corrected p-values across all model-script-pair comparisons, and effect sizes (Cohen's h) for each of the 10 models. These additional details substantiate that the gaps are systematic and statistically significant beyond the reported accuracy deltas. revision: yes
Referee: [§5 (Proposed Metric)] The Script Consistency Rate (SCR) is presented as a mandatory new metric, but its precise definition, aggregation across three scripts, and normalization must be formalized (ideally with an equation) to allow independent verification and ensure it is not reducible to simple accuracy variance.

Authors: We have addressed this by adding a formal definition and equation for SCR in the revised §5. SCR is defined as the proportion of instances where a model produces correct answers across all three scripts, aggregated as the mean over N instances and normalized to the unit interval. This per-instance consistency measure is mathematically distinct from aggregate accuracy variance, as shown in the added equation and accompanying explanation. revision: yes
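
Taking the definition in the simulated response at face value (the share of instances a model answers correctly in all three scripts), SCR can be sketched as below. The toy data also shows why it is not recoverable from per-script accuracy alone: two hypothetical models with identical accuracy in every script can have very different SCR. The data layout, function names, and toy values are assumptions for illustration; the paper's exact formula should be checked against its §5.

```python
from typing import Dict, List

# Hypothetical script labels; the benchmark's own field names may differ.
SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def script_consistency_rate(correct: Dict[str, List[bool]]) -> float:
    """Share of instances answered correctly in every script."""
    columns = [correct[s] for s in SCRIPTS]
    n = len(columns[0])
    if n == 0 or any(len(c) != n for c in columns):
        raise ValueError("need equal-length, non-empty results for every script")
    # zip(*columns) yields one tuple per instance: (gurmukhi, shahmukhi, roman)
    return sum(all(row) for row in zip(*columns)) / n


def accuracy(correct: Dict[str, List[bool]], script: str) -> float:
    """Plain accuracy for a single script."""
    return sum(correct[script]) / len(correct[script])


# Model A solves the same two instances in every script.
aligned = {s: [True, True, False, False] for s in SCRIPTS}

# Model B also scores 50% in every script, but on different instances.
scattered = {
    "gurmukhi": [True, True, False, False],
    "shahmukhi": [False, False, True, True],
    "roman": [True, False, True, False],
}

if __name__ == "__main__":
    for name, model in (("aligned", aligned), ("scattered", scattered)):
        accs = [accuracy(model, s) for s in SCRIPTS]
        print(name, accs, script_consistency_rate(model))
```

Both toy models score 0.5 in each script, yet SCR is 0.5 for the first and 0.0 for the second, which is the distinction from aggregate accuracy variance that the response describes.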

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent data collection

The paper constructs a new benchmark (PuMVR) of 1,000 parallel image-text instances across three scripts and evaluates existing VLMs on it, reporting accuracy deltas and proposing SCR as a metric. No equations, fitted parameters, or derivations are present that reduce results to inputs by construction. Claims rest on direct model evaluations and statistical tests (McNemar) rather than self-referential definitions or self-citation chains. The parallelism assumption is an empirical claim about data construction, not a definitional loop. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the creation of a new parallel benchmark and the assumption that performance differences are attributable to script rather than other factors.

axioms (1)

domain assumption: The 1,000 instances are strictly parallel across the three scripts with identical task difficulty and content.
This is stated directly in the abstract as the basis for comparing performance across scripts.

invented entities (1)

Script Consistency Rate (SCR): no independent evidence
purpose: To quantify how consistently a model performs across different scripts for the same content.
New metric proposed in the abstract as a mandatory addition to VLM evaluation.
