Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR

Bhushan Pawar; Madhu Reddiboina; Prabhjot Singh

arXiv: 2606.20770 · v1 · pith:COGJ6NDAnew · submitted 2026-06-18 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-26 17:30 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: orthographic bias · multilingual VLMs · Punjabi scripts · visual reasoning · script consistency rate · PuMVR benchmark · multimodal evaluation · cross-script transfer

The pith

Vision-language models show accuracy gaps of up to 16 percent on identical visual tasks, depending only on which script is used to write the same Punjabi text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates PuMVR, a set of 375 image-based reasoning tasks written in Punjabi's three scripts, to test whether VLMs treat the same language as equivalent across writing systems. Evaluation of ten models shows frequent cases where a model solves a puzzle in one script but fails the identical puzzle in another. Adding images raises overall scores yet leaves the relative differences between scripts unchanged. Reasoning steps and chain-of-thought paths also shift with script choice alone. The work introduces Script Consistency Rate as a measure to expose this orthographic fracture in multilingual evaluation.

Core claim

PuMVR reveals a substantial Script Gap in state-of-the-art VLMs, where accuracy deltas reach 16 percent across Punjabi's three scripts on culturally grounded visual reasoning tasks, and Script Consistency Rates drop to 24.8 percent. Visual input boosts absolute scores yet fails to close the relative gap between scripts, with reasoning patterns showing limited cross-script transferability and Chain-of-Thought pathways diverging by script alone.

What carries the argument

The PuMVR benchmark of 375 tasks across Gurmukhi, Shahmukhi, and Roman scripts, which quantifies orthographic bias via Script Consistency Rate.

If this is right

Models solve a visual puzzle in one script yet fail the identical puzzle in another.
Chain-of-Thought reasoning paths change when only the script of the question is altered.
Visual input improves performance numbers but leaves relative script differences intact.
Current multilingual benchmarks miss script variation inside one language.
Script Consistency Rate should become a standard check for script-agnostic model assessment.
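
The abstract and this summary report Script Consistency Rate values but do not spell out the formula, so the following is only a sketch under stated assumptions. It reads SCR as the share of items a model answers correctly in every script, and offers a looser variant that counts any item whose outcome (right or wrong) is the same in all three scripts. The function names, the `require_correct` flag, and the per-item dictionary layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Script labels follow the paper; the data layout below is an assumption.
SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def script_consistency_rate(
    results: List[Dict[str, bool]], require_correct: bool = True
) -> float:
    """Fraction of items with a consistent outcome across all scripts.

    results[i][script] is True when the model answered item i correctly
    in that script. With require_correct=True an item counts only if it
    is correct in every script (assumed reading of SCR); with False it
    also counts items that are wrong in every script.
    """
    if not results:
        raise ValueError("results must not be empty")
    consistent = 0
    for item in results:
        outcomes = [item[s] for s in SCRIPTS]
        if require_correct:
            consistent += all(outcomes)
        else:
            consistent += len(set(outcomes)) == 1
    return consistent / len(results)


def per_script_accuracy(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Plain accuracy per script over the same item list."""
    if not results:
        raise ValueError("results must not be empty")
    return {s: sum(item[s] for item in results) / len(results) for s in SCRIPTS}
```

Reporting both numbers side by side matters because a model can have similar per-script accuracy while succeeding on different items in each script, which only an item-level measure exposes.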

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same script-dependent gaps could appear in other languages that use multiple scripts, such as Serbian or Kurdish.
Training pipelines might need explicit script-variation data to reduce the observed fracture.
Extending the benchmark to additional multi-script languages would test how general the bias is.
Evaluation suites could add script-swapping tests as a routine check for fairness.

Load-bearing premise

The 375 tasks keep identical difficulty, cultural grounding, and visual content across the three scripts, so that any difference can be attributed to orthography alone.

What would settle it

Re-run the evaluations after confirming or adjusting the tasks so that content, difficulty, and visuals are verified equivalent across scripts; persistent gaps would support the bias claim while vanishing gaps would undermine it.

Figures

Figures reproduced from arXiv: 2606.20770 by Bhushan Pawar, Madhu Reddiboina, Prabhjot Singh.

**Figure 1.** Sample instance from PuMVR showing cross-script equivalent options.

Extracted context (§4, Experimental Setup): Our experimental framework systematically interrogates the Script-Reality Gap through five complementary investigations, each designed to expose a distinct facet of orthographic bias in Vision-Language Models. We evaluate 10 state-of-the-art VLMs spanning proprietary frontier systems and open-weights alternatives ( …

**Figure 2.** Script-dependent accuracy and Script Consistency Rate (SCR) across 10 VLMs.

Extracted context: The Script Disparity Index (SDI) measures the maximum performance spread, $\mathrm{SDI}_I = \max_{Q \in S} \mathrm{Acc}(I, Q) - \min_{Q \in S} \mathrm{Acc}(I, Q)$ (Eq. 7). High SDI indicates script-brittle reasoning: models cannot abstract away from orthographic surface forms. §4.5, Experiment 5 (Script-Induced Reasoning Divergence). Objective: Investigate whether script variatio…

**Figure 3.** Transfer Efficiency (TE) heatmap (baseline = 100%). Values above 100 indicate that cross-script examples outperform monoscript ones; values below 100 indicate reasoning loss.

Extracted context: … in high-resource scripts but require visual evidence when orthographic priors weaken. Llama-3.2-11B-Vision's substantially low Shahmukhi VG (7.5%) indicates script-specific visual grounding failure: even with images, it cannot integrate Perso-Arabic …

**Figure 4.** Script Disparity Index (SDI).
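
The two figure-level metrics can be illustrated with a short sketch. The Script Disparity Index follows the max-minus-min spread quoted from the paper's Eq. 7. The Transfer Efficiency formula is not given in the extracted text, so the ratio used here (cross-script accuracy as a percentage of the monoscript baseline) is an assumption that merely matches the "baseline = 100%" reading of the Figure 3 caption. Function names and the example numbers in the usage note are made up for illustration.

```python
from typing import Dict


def script_disparity_index(acc_by_script: Dict[str, float]) -> float:
    """Max-minus-min accuracy spread over scripts for a fixed image set.

    Mirrors the form quoted from the paper's Eq. 7:
    SDI = max_Q Acc(I, Q) - min_Q Acc(I, Q).
    """
    if not acc_by_script:
        raise ValueError("need at least one script accuracy")
    values = acc_by_script.values()
    return max(values) - min(values)


def transfer_efficiency(cross_script_acc: float, monoscript_acc: float) -> float:
    """Cross-script accuracy as a percentage of the monoscript baseline.

    ASSUMPTION: the paper's exact TE definition is not in the extracted
    text. This ratio only reproduces the caption's convention that 100
    means parity, above 100 a gain, and below 100 a loss.
    """
    if monoscript_acc <= 0:
        raise ValueError("monoscript accuracy must be positive")
    return 100.0 * cross_script_acc / monoscript_acc
```

For instance, hypothetical accuracies of 0.62, 0.46, and 0.55 for the three scripts give an SDI of about 0.16, and a cross-script accuracy of 0.45 against a monoscript 0.50 gives a TE of about 90.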

Original abstract

Current Vision-Language Models (VLMs) are celebrated for their multilingual capabilities, yet they operate under a flawed assumption: that one language corresponds to a single writing system. This overlooks billions of users of multi-script languages like Punjabi, Serbian, Hindi-Urdu, Kurdish, among many others, for whom a model's capability may be fractured by orthographic bias. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), the first benchmark designed to quantify script-dependent bias through 375 culturally grounded image-reasoning tasks across Punjabi's three active scripts (Gurmukhi, Shahmukhi, Roman). Evaluating 10 state-of-the-art VLMs, we expose a substantial Script Gap: models frequently solve visual puzzles in one script while failing identical tasks in another, with accuracy deltas reaching 16% and Script Consistency Rates (SCR) as low as 24.8%. Crucially, visual input boosts absolute performance but does not close this gap, the relative bias persists. Our analysis suggests reasoning patterns show limited cross-script transferability, and Chain-of-Thought pathways diverge based on script alone. We propose SCR as a core metric for script-agnostic evaluation, challenging current multilingual assessment paradigms and providing a framework for equitable AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PuMVR gives a concrete benchmark for script bias in one language, but the claim that the gaps are purely orthographic rests on task equivalence that the paper does not demonstrate.

The paper's main contribution is PuMVR, a set of 375 image-reasoning tasks in Punjabi across Gurmukhi, Shahmukhi, and Roman scripts, plus the SCR metric to track consistency. It evaluates 10 VLMs and reports accuracy deltas up to 16% and SCR as low as 24.8%, with the observation that adding images raises overall scores but leaves the relative script gap intact.

This is new in the sense that prior multilingual VLM work has not isolated script variation inside a single language this way. The focus on Punjabi is reasonable given its real multi-script user base, and the suggestion that reasoning patterns do not transfer cleanly across scripts is worth testing.

The soft spot is exactly the one the stress-test flags. The headline numbers only isolate orthographic bias if the tasks stay identical in difficulty, lexical choice, and cultural fit once rewritten in each script. The abstract gives no sign of human equivalence ratings, back-translation checks, or difficulty calibration, and the provided text does not describe any such controls. Without them the observed gaps could partly reflect adaptation artifacts rather than script alone.

No error bars, statistical tests, or prompt-format controls are mentioned either, which keeps the quantitative claims preliminary. The work is still worth a referee's time because the topic matters and the benchmark itself could be refined. Readers working on multilingual evaluation or low-resource script handling would get the most out of it once the equivalence question is addressed.

I would send it to review rather than desk-reject, but the methods section needs to show the controls before the central claim can be taken at face value.

Referee Report

2 major / 2 minor

Summary. The paper introduces the PuMVR benchmark consisting of 375 culturally grounded image-reasoning tasks rendered in Punjabi's three scripts (Gurmukhi, Shahmukhi, Roman) to measure orthographic bias in 10 state-of-the-art VLMs. It reports accuracy gaps reaching 16% across scripts, Script Consistency Rates (SCR) as low as 24.8%, and finds that adding visual input improves absolute performance but does not eliminate relative script-dependent differences; the work proposes SCR as a new evaluation metric.

Significance. If the tasks prove equivalent in difficulty and content, the results would establish a concrete, measurable limitation in current multilingual VLMs for multi-script languages spoken by hundreds of millions of users. The creation of a dedicated Punjabi multimodal benchmark and the SCR metric constitute a useful empirical contribution to fairness and robustness evaluation in vision-language models.

major comments (2)

[Abstract / §3] Abstract and §3 (Benchmark Construction): The central claim that observed accuracy deltas (up to 16%) and low SCR values are attributable solely to orthographic bias requires that the 375 tasks maintain identical semantic content, difficulty, and cultural grounding across the three scripts. The manuscript supplies no details on equivalence verification (human ratings, back-translation checks, lexical-frequency calibration, or difficulty pre-testing), which is load-bearing for isolating script as the causal factor.
[§4 / §5] §4 (Experimental Setup) and §5 (Results): No information is provided on statistical testing, error bars, or controls for prompt-formatting differences across scripts. Without these, it is unclear whether the reported gaps exceed what would be expected from sampling variability or minor surface-form variations.

minor comments (2)

[Results tables] Table 1 or equivalent: clarify how the 375 tasks were partitioned across the three scripts and whether each image is paired with exactly one question per script.
[§2] Notation: define SCR explicitly on first use and state whether it is computed per model or aggregated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which highlight important aspects of benchmark validity and statistical rigor. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract / §3] Abstract and §3 (Benchmark Construction): The central claim that observed accuracy deltas (up to 16%) and low SCR values are attributable solely to orthographic bias requires that the 375 tasks maintain identical semantic content, difficulty, and cultural grounding across the three scripts. The manuscript supplies no details on equivalence verification (human ratings, back-translation checks, lexical-frequency calibration, or difficulty pre-testing), which is load-bearing for isolating script as the causal factor.

Authors: We agree that explicit equivalence verification is necessary to causally attribute differences to orthography. The submitted manuscript omitted these procedural details in favor of focusing on benchmark release and empirical results. In the revision we will expand §3 with: native-speaker semantic-equivalence ratings (5-point scale, inter-annotator agreement reported), back-translation checks to English, lexical-frequency calibration against available Punjabi corpora, and pilot difficulty pre-testing. These additions will directly support the isolation of script as the variable.

revision: yes

Referee: [§4 / §5] §4 (Experimental Setup) and §5 (Results): No information is provided on statistical testing, error bars, or controls for prompt-formatting differences across scripts. Without these, it is unclear whether the reported gaps exceed what would be expected from sampling variability or minor surface-form variations.

Authors: We accept that the current version lacks statistical controls and error quantification. The revision will add: (i) bootstrap-derived 95% confidence intervals on all accuracy figures, (ii) McNemar's test for paired script-wise comparisons on identical tasks, and (iii) the exact prompt templates (identical structure, only script changed) in the appendix to demonstrate formatting control. These changes will clarify that observed gaps exceed sampling variability.

revision: yes
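
The two statistical checks named in the rebuttal can be sketched with the standard library alone. This is a generic illustration, not the authors' analysis code: it uses the exact (binomial) two-sided form of McNemar's test on paired per-item correctness for two scripts, and a percentile bootstrap for the accuracy interval. The function names, the resample count, and the fixed seed are assumptions chosen for the example.

```python
import math
import random
from typing import Sequence, Tuple


def mcnemar_exact(
    correct_a: Sequence[bool], correct_b: Sequence[bool]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-item outcomes.

    Returns (b, c, p) where b counts items correct only in script A,
    c counts items correct only in script B, and p is the two-sided
    binomial p-value under the null that discordant pairs split 50/50.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired sequences must have equal length")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a gap
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def bootstrap_accuracy_ci(
    correct: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy (items resampled with replacement)."""
    n = len(correct)
    if n == 0:
        raise ValueError("correct must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

McNemar's test suits this setting because each item is answered in both scripts, so only the discordant items (right in one script, wrong in the other) carry information about a script gap.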

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivations or self-referential reductions

The paper introduces the PuMVR benchmark of 375 image-reasoning tasks and reports direct accuracy measurements, deltas, and SCR values from evaluating 10 VLMs across three scripts. No equations, fitted parameters, predictions, or uniqueness theorems appear; claims rest on observed model outputs rather than any chain that reduces to inputs by construction. The central assumption of task equivalence across scripts is an experimental design choice, not a circular derivation. This is a standard empirical measurement study whose results are self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified premise that tasks are content-identical across scripts and that model inputs differ only in orthography; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The 375 tasks are equivalent in difficulty and meaning across the three scripts. This is required to attribute performance differences to script rather than task content.
