Script collapse in multilingual ASR: A reference-free metric and 100-pair benchmark

Hanif Rahman

arxiv: 2604.08786 · v2 · submitted 2026-04-09 · 💻 cs.SD · eess.AS

Pith reviewed 2026-05-10 16:40 UTC · model grok-4.3

keywords: script collapse · automatic speech recognition · multilingual ASR · reference-free evaluation · Script Fidelity Rate · writing systems · Whisper · prompting

The pith

Multilingual automatic speech recognition models often output fluent text in the wrong writing system, a systematic failure that standard word error rate metrics overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Script Fidelity Rate (SFR) as the share of output characters that fall inside the Unicode block for the target language's script, a quantity computable from the hypothesis alone. It applies the metric to one hundred model-language pairs, formed from ten languages and ten models on the FLEURS test sets, and documents that twenty-one pairs fall below ten percent fidelity. Twenty of those failures occur with Whisper variants and one with Gemma 4 E2B on Urdu. In a ten-language Gemma 4 probe, script-aware prompting raises mean fidelity from 71.2 percent to 97.7 percent and improves downstream translation scores for the affected languages. The work matters because WER-centred ASR evaluation does not check the writing system of the output, so models can produce fluent but unusable transcriptions for languages that rely on non-Latin scripts.

Core claim

Script collapse is the production of fluent ASR output whose characters lie outside the Unicode block of the intended script. The author introduces Script Fidelity Rate (SFR), the fraction of hypothesis characters belonging to the target script block, and measures it across one hundred model-language pairs. The paper reports collapse in twenty-one pairs, identifies four recurring patterns (Latin phonetic substitution, Arabic substitution for Somali, Devanagari substitution for Bengali and Malayalam, and unique-script Latin collapse for Georgian), and shows that, in a Gemma 4 probe, script-aware prompting raises mean fidelity from 71.2 percent to 97.7 percent while recovering 5.9 chrF on subsequent NLLB translation for the six languages whose baseline fidelity is below 90 percent.

What carries the argument

Script Fidelity Rate (SFR), the fraction of characters in the ASR hypothesis that belong to the Unicode script block of the target language.
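A minimal sketch of the SFR definition above, assuming each target script is described by a list of code point ranges. The `SCRIPT_RANGES` table, the function name `script_fidelity_rate`, the decision to ignore whitespace, and the zero score for empty output are illustrative assumptions, not the paper's implementation.

```python
# Illustrative code point ranges for the main Unicode block of a few scripts.
# Assumption: one block per script; the paper's exact block lists are not shown here.
SCRIPT_RANGES = {
    "Arabic": [(0x0600, 0x06FF)],
    "Devanagari": [(0x0900, 0x097F)],
    "Bengali": [(0x0980, 0x09FF)],
    "Malayalam": [(0x0D00, 0x0D7F)],
    "Georgian": [(0x10A0, 0x10FF)],
}


def script_fidelity_rate(hypothesis: str, script: str) -> float:
    """Fraction of non-whitespace characters inside the target script's ranges."""
    ranges = SCRIPT_RANGES[script]
    chars = [ch for ch in hypothesis if not ch.isspace()]
    if not chars:
        return 0.0  # assumption: an empty hypothesis scores zero
    in_script = sum(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in chars)
    return in_script / len(chars)


print(script_fidelity_rate("გამარჯობა", "Georgian"))  # 1.0 (all Georgian letters)
print(script_fidelity_rate("gamarjoba", "Georgian"))  # 0.0 (Latin transliteration)
```

No reference transcription appears anywhere in the computation, which is what makes the metric reference-free.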

If this is right

Twenty of the twenty-one collapsed pairs involve Whisper models of various sizes.
Script-aware prompting lifts mean SFR from 71.2 percent to 97.7 percent across ten languages.
Prompting restores Urdu from 6.5 percent to 97.0 percent SFR and improves downstream NLLB translation by 5.9 chrF for the six languages whose baseline SFR is below 90 percent.
Four distinct collapse patterns appear: Latin phonetic substitution, Arabic substitution for Somali, Devanagari substitution for Bengali and Malayalam, and Latin substitution for Georgian.
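One way to see which pattern a collapsed output follows is to tally characters by script. The sketch below is a heuristic that reads the script from the first word of each character's Unicode name via the standard `unicodedata` module; the helper names `script_histogram` and `dominant_script` are hypothetical, and the paper does not say it labels patterns this way.

```python
import unicodedata
from collections import Counter


def script_histogram(text: str) -> Counter:
    """Count letters and combining marks by the first word of their Unicode name."""
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue  # skip digits, punctuation, whitespace, symbols
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


def dominant_script(text: str) -> str:
    """Most frequent script label, or "NONE" if no letters are present."""
    hist = script_histogram(text)
    return hist.most_common(1)[0][0] if hist else "NONE"


print(dominant_script("salaam dunya"))  # LATIN
print(dominant_script("नमस्ते"))         # DEVANAGARI
```

A Bengali or Malayalam hypothesis whose dominant label comes back as `DEVANAGARI`, or a Georgian one that comes back as `LATIN`, would match the substitution patterns listed above.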

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Reference-free script checks could be added to existing ASR evaluation pipelines to catch failures that WER misses.
The same collapse patterns may appear in other multilingual generation tasks that produce text for non-Latin languages.
Developers may need to test low-resource languages with script-specific prompts during model release.

Load-bearing premise

Membership in a Unicode script block is a sufficient and unambiguous proxy for whether the generated text is actually in the intended writing system for that language.

What would settle it

Manual inspection of a random sample of outputs with SFR below 10 percent to count how many are genuinely written in an entirely different script versus mixed or misclassified blocks.

Figures

Figures reproduced from arXiv: 2604.08786 by Hanif Rahman.

**Figure 2.** WER (%) vs SFR (%) for all 100 model–language pairs on FLEURS test sets,

**Figure 3.** Georgian SFR (%) and WER (%) by model. SFR bars are coloured by collapse

Original abstract

Word error rate (WER) is the dominant metric for automatic speech recognition, yet it cannot detect a systematic failure mode: models that produce fluent output in the wrong writing system. We define Script Fidelity Rate (SFR), the fraction of hypothesis characters in the target script block, computable without reference transcriptions, and report a systematic measurement of script collapse across ten languages spanning six writing systems and ten models (seven Whisper sizes, MMS-1B, SeamlessM4T-v2, and Gemma 4 E2B) on FLEURS test sets. Across 100 evaluated model-language pairs, 21 (21%; 95% Wilson CI: 14-30%) exhibit script collapse (SFR less than 10%): 20 involve Whisper and one involves Gemma 4 E2B on Urdu under a generic transcription prompt. In a ten-language Gemma 4 probe, script-aware prompting raises mean SFR from 71.2% to 97.7%, fixes Urdu collapse (6.5% to 97.0%), and recovers 5.9 chrF on downstream NLLB translation for the six languages whose baseline SFR is below 90%. We identify four collapse patterns: Latin phonetic substitution, Arabic substitution for Somali, Devanagari substitution for Bengali/Malayalam, and unique-script Latin collapse for Georgian.
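The Wilson interval quoted in the abstract can be reproduced from the counts alone. The sketch below uses the standard Wilson score formula with z = 1.96; the function name `wilson_interval` is an illustrative choice, and the paper may have used a library routine instead.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(21, 100)
print(f"{low:.1%} to {high:.1%}")
```

For 21 collapsed pairs out of 100 the bounds round to 14% and 30%, matching the reported interval.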

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a clean reference-free metric for script collapse in ASR and shows it's common in Whisper on FLEURS, with prompting as a partial fix.

The key takeaway is that models can produce fluent transcriptions in the wrong script, and standard WER misses it. The paper defines Script Fidelity Rate as the fraction of hypothesis characters falling in the target Unicode block, then measures it across ten languages, six scripts, and ten models on public FLEURS sets. Twenty-one of the 100 pairs drop below 10% SFR: twenty Whisper cases plus one Gemma-Urdu case. In a Gemma 4 probe, script-aware prompts lift mean SFR from 71.2% to 97.7% and recover 5.9 chrF on downstream translation for the six languages whose baseline SFR is below 90%. The paper also lists four collapse patterns observed through manual inspection.

Referee Report

2 major / 2 minor

Summary. The paper introduces Script Fidelity Rate (SFR), a reference-free metric defined as the fraction of hypothesis characters falling within a language's target Unicode script block, to detect script collapse in multilingual ASR where models produce fluent output in the wrong writing system. It benchmarks SFR across 100 model-language pairs (10 languages spanning 6 scripts, 10 models including Whisper variants, MMS-1B, SeamlessM4T-v2, and Gemma 4 E2B) on FLEURS test sets, finding 21 pairs (21%; 95% Wilson CI 14-30%) with SFR <10% (20 Whisper, 1 Gemma on Urdu), identifies four collapse patterns via inspection, and shows that script-aware prompting raises mean SFR from 71.2% to 97.7% for Gemma while recovering 5.9 chrF in downstream NLLB translation for low-SFR languages.

Significance. If the SFR metric and threshold are reliable, the work identifies a previously under-measured failure mode in current ASR systems that standard WER misses, supplies a simple reproducible benchmark on public data with confidence intervals, and demonstrates a low-cost mitigation via prompting that also benefits downstream tasks. The empirical scale (100 pairs) and concrete rates provide a useful baseline for future model evaluation.

major comments (2)

[SFR definition and results] Definition of SFR: the metric treats membership in a single assigned Unicode script block as a sufficient proxy for correct orthography, but provides no semantic, contextual, or language-specific validation. This assumption is load-bearing for the headline 21/100 collapse count, yet the paper does not report human-judgment alignment or explicit handling for overlapping blocks (Urdu/Arabic, Somali/Arabic), digits/punctuation, or mixed-script hypotheses; the four patterns noted from inspection are post-hoc and do not retroactively validate the quantitative threshold.
[Results and discussion] Results section (100-pair benchmark): the reported collapse rate and Wilson intervals rest on the unvalidated SFR <10% cutoff; without a cross-check against human ratings of writing-system correctness on even a modest subset of the 100 pairs, it is unclear whether the 21% figure over- or under-counts true collapse, especially for the single non-Whisper case (Gemma on Urdu).

minor comments (2)

[Abstract] Abstract and methods: no implementation details are given for edge cases such as mixed-script output, normalization artifacts, or language-specific script ambiguities (e.g., Georgian unique script).
[Results] The paper could usefully add a small table or appendix showing SFR sensitivity to the 10% threshold (e.g., rates at 5%, 15%) to demonstrate robustness.
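The sensitivity check suggested in the last comment amounts to recounting collapsed pairs at several cutoffs. The sketch below shows the shape of that computation; the pair names and SFR values are made up for illustration and are not results from the paper.

```python
def collapse_count(sfr_by_pair: dict[str, float], threshold: float) -> int:
    """Number of model-language pairs whose SFR is strictly below the threshold."""
    return sum(1 for sfr in sfr_by_pair.values() if sfr < threshold)


# Hypothetical SFR values as fractions; NOT results from the paper.
example = {
    "model_a/ka": 0.02,
    "model_a/ur": 0.07,
    "model_b/ur": 0.12,
    "model_b/bn": 0.55,
    "model_c/ml": 0.98,
}

for threshold in (0.05, 0.10, 0.15):
    n = collapse_count(example, threshold)
    print(f"SFR < {threshold:.0%}: {n}/{len(example)} pairs")
```

If the count barely moves between 5% and 15% on the real data, the headline rate does not depend much on the exact cutoff.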

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting important considerations regarding the SFR metric and its empirical validation. We address each major comment below and outline targeted revisions to the manuscript.

Referee: Definition of SFR: the metric treats membership in a single assigned Unicode script block as a sufficient proxy for correct orthography, but provides no semantic, contextual, or language-specific validation. This assumption is load-bearing for the headline 21/100 collapse count, yet the paper does not report human-judgment alignment or explicit handling for overlapping blocks (Urdu/Arabic, Somali/Arabic), digits/punctuation, or mixed-script hypotheses; the four patterns noted from inspection are post-hoc and do not retroactively validate the quantitative threshold.

Authors: SFR is intentionally defined as a lightweight, reference-free proxy that flags writing-system mismatches via Unicode block membership, which directly captures the script collapse phenomenon without requiring references or semantic analysis. We acknowledge that the manuscript does not include human-judgment alignment, which is a limitation of the current version. For script overlaps, Urdu is standardly written in the Arabic script block, while Somali uses Latin in the FLEURS data; we will add explicit text clarifying these assignments. We will revise the methods section to state that digits, punctuation, and other non-letter characters are excluded from the character count when computing SFR, and that mixed-script hypotheses are scored strictly by the proportion of characters falling inside the target block. The four patterns are presented as post-inspection characterizations to illustrate observed failure modes rather than as quantitative validation. We will add a limitations subsection discussing the proxy assumptions and their scope.

revision: partial

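The exclusion rule described in this response can be sketched with Unicode general categories. In the example below, only letters (category `L*`) and combining marks (`M*`) are counted; keeping marks is an assumption made here, since the response only says non-letter characters are excluded. The function name `sfr_letters_only` and the single-range argument are likewise illustrative.

```python
import unicodedata


def sfr_letters_only(hypothesis: str, lo: int, hi: int) -> float:
    """SFR over letters and combining marks; digits, punctuation, spaces are dropped."""
    kept = [ch for ch in hypothesis if unicodedata.category(ch)[0] in ("L", "M")]
    if not kept:
        return 0.0  # assumption: nothing scorable counts as zero
    return sum(lo <= ord(ch) <= hi for ch in kept) / len(kept)


ARABIC = (0x0600, 0x06FF)  # main Arabic block, used for Urdu in this sketch
print(sfr_letters_only("سلام 2024!", *ARABIC))  # digits and "!" are ignored
print(sfr_letters_only("سلام salam", *ARABIC))  # mixed script scores proportionally
```

The first call scores 1.0 because only the four Arabic letters are counted; the second scores 4/9, which is the strict proportional treatment of mixed-script output described above.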
Referee: Results section (100-pair benchmark): the reported collapse rate and Wilson intervals rest on the unvalidated SFR <10% cutoff; without a cross-check against human ratings of writing-system correctness on even a modest subset of the 100 pairs, it is unclear whether the 21% figure over- or under-counts true collapse, especially for the single non-Whisper case (Gemma on Urdu).

Authors: The <10% threshold is a heuristic chosen to identify near-total script failure (i.e., >90% of characters outside the target block). The Wilson intervals correctly quantify sampling uncertainty around the observed 21/100 rate. We agree that direct human validation would increase confidence in the headline figure. In the revised manuscript we will report a human evaluation on a random subset of 20 model-language pairs (including the Gemma-Urdu case), in which annotators judge whether each hypothesis is written in the expected script; we will then compare these judgments to the SFR <10% classification to assess agreement and potential over- or under-counting.

revision: yes
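The comparison promised in this response could be summarised with raw agreement and Cohen's kappa. The rebuttal does not name a statistic, so kappa is a suggestion made here, and the labels below are invented purely to show the calculation.

```python
def agreement(sfr_collapsed: list[bool], human_wrong_script: list[bool]) -> dict[str, float]:
    """Raw agreement and Cohen's kappa between two binary labelings."""
    n = len(sfr_collapsed)
    if n == 0 or n != len(human_wrong_script):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(sfr_collapsed, human_wrong_script)) / n
    p_a = sum(sfr_collapsed) / n
    p_b = sum(human_wrong_script) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}


# Hypothetical labels for five pairs; NOT data from the paper.
sfr_flags = [True, True, False, False, False]      # SFR < 10%
human_flags = [True, False, False, False, False]   # annotator says wrong script
print(agreement(sfr_flags, human_flags))
```

Pairs where the two labelings disagree are the ones that would reveal over- or under-counting by the threshold.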

Circularity Check

0 steps flagged

No circularity: direct empirical definition and measurement of a reference-free metric

The paper defines Script Fidelity Rate (SFR) explicitly as the fraction of hypothesis characters whose Unicode script block matches the pre-assigned target block for each language. It then computes this quantity on the 100 model-language pairs from FLEURS test sets and reports the observed fraction below 10%. No parameters are fitted to data, no predictions are derived from subsets, no self-citations support the central measurement, and no equations reduce the reported rates to the inputs by construction. The result is a straightforward counting procedure on external data; concerns about the Unicode proxy's validity are separate from circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that Unicode script-block membership is a valid proxy for correct script usage and that FLEURS plus the chosen models are representative of real multilingual ASR behavior.

axioms (1)

domain assumption: Unicode script blocks can be used to reliably determine whether text is in the target script for a given language.
Directly defines how SFR is computed as the fraction of characters inside the target block.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1] Record ASR hypotheses and the intended target language.

[2] Compute utterance-level SFR using the target language’s Unicode block specification.

[3] Alert when corpus-level mean SFR drops below a deployment threshold, for example < 0.8.

[4] Inspect low-SFR examples with WER, CER, LID, or human review before making product decisions.
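The four steps listed above can be sketched as a small monitoring routine. Everything named here (`utterance_sfr`, `monitor`, the single Georgian range, and reusing 0.8 as the per-utterance inspection cutoff) is an illustrative assumption; only the 0.8 corpus-level example threshold comes from the text.

```python
from statistics import mean

GEORGIAN = (0x10A0, 0x10FF)  # illustrative target block
ALERT_THRESHOLD = 0.8        # example deployment threshold from step 3


def utterance_sfr(hypothesis: str, block: tuple[int, int]) -> float:
    """Fraction of non-whitespace characters inside the target block."""
    chars = [ch for ch in hypothesis if not ch.isspace()]
    if not chars:
        return 0.0  # assumption: empty output scores zero
    return sum(block[0] <= ord(ch) <= block[1] for ch in chars) / len(chars)


def monitor(hypotheses: list[str], block: tuple[int, int]) -> dict:
    scores = [utterance_sfr(h, block) for h in hypotheses]  # step 2
    corpus_mean = mean(scores)
    return {
        "mean_sfr": corpus_mean,
        "alert": corpus_mean < ALERT_THRESHOLD,  # step 3
        # step 4: hand these to WER, CER, LID, or human review
        "to_inspect": [h for h, s in zip(hypotheses, scores) if s < ALERT_THRESHOLD],
    }


# step 1: recorded hypotheses for a Georgian target
report = monitor(["გამარჯობა", "gamarjoba"], GEORGIAN)
print(report)
```

With one correct and one Latin-script hypothesis, the corpus mean is 0.5, the alert fires, and only the Latin string is queued for inspection.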