Script collapse in multilingual ASR: A reference-free metric and 100-pair benchmark
Pith reviewed 2026-05-10 16:40 UTC · model grok-4.3
The pith
Multilingual automatic speech recognition models often output fluent text in the wrong writing system, a systematic failure that standard word error rate metrics overlook.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Script collapse is the production of fluent ASR output whose characters lie outside the Unicode block of the intended script. The authors introduce Script Fidelity Rate, the fraction of hypothesis characters belonging to the target script block, and measure it across one hundred model-language pairs. They report collapse in twenty-one pairs, identify four recurring patterns (Latin phonetic substitution, Arabic substitution for Somali, Devanagari substitution for Bengali and Malayalam, and unique-script Latin collapse for Georgian), and show that targeted prompting raises mean fidelity from 71.2 percent to 97.7 percent while recovering 5.9 chrF on subsequent translation for languages whose baseline fidelity is below 90 percent.
What carries the argument
Script Fidelity Rate (SFR), the fraction of characters in the ASR hypothesis that belong to the Unicode script block of the target language.
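A minimal sketch of how SFR could be computed, under stated assumptions. The code-point ranges in `SCRIPT_RANGES` are hand-picked main blocks chosen for illustration (Python's standard library does not expose the Unicode Script property), and counting only letters and combining marks follows the handling the simulated rebuttal proposes rather than anything confirmed about the paper's implementation. The function names are hypothetical.

```python
import unicodedata

# Assumed code-point ranges (main blocks only; extended and
# supplementary blocks are left out of this sketch).
SCRIPT_RANGES = {
    "Arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],
    "Devanagari": [(0x0900, 0x097F)],
    "Bengali": [(0x0980, 0x09FF)],
    "Malayalam": [(0x0D00, 0x0D7F)],
    "Georgian": [(0x10A0, 0x10FF)],
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}


def in_script(ch: str, script: str) -> bool:
    """True if the character's code point falls in one of the script's ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SCRIPT_RANGES[script])


def script_fidelity_rate(hypothesis: str, script: str) -> float:
    """Fraction of counted characters that belong to the target script."""
    # Assumption: only letters (L*) and combining marks (M*) are counted,
    # so digits, punctuation, and whitespace do not affect the score.
    counted = [ch for ch in hypothesis if unicodedata.category(ch)[0] in ("L", "M")]
    if not counted:
        return 0.0  # assumption: an empty hypothesis scores zero
    return sum(in_script(ch, script) for ch in counted) / len(counted)
```

Under these assumptions, an Urdu hypothesis written in Latin letters scores 0.0 against the Arabic target, while the same words in Arabic script score 1.0. No reference transcription is involved at any point.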
If this is right
- Twenty of the twenty-one collapsed pairs involve Whisper models of various sizes.
- Script-aware prompting lifts mean SFR from 71.2 percent to 97.7 percent across ten languages.
- Prompting restores Urdu from 6.5 percent to 97.0 percent SFR and improves downstream NLLB translation by 5.9 chrF for the six languages whose baseline SFR is below 90 percent.
- Four distinct collapse patterns appear: Latin phonetic substitution, Arabic substitution for Somali, Devanagari substitution for Bengali and Malayalam, and Latin substitution for Georgian.
Where Pith is reading between the lines
- Reference-free script checks could be added to existing ASR evaluation pipelines to catch failures that WER misses.
- The same collapse patterns may appear in other multilingual generation tasks that produce text for non-Latin languages.
- Developers may need to test low-resource languages with script-specific prompts during model release.
Load-bearing premise
Membership in a Unicode script block is a sufficient and unambiguous proxy for whether the generated text is actually in the intended writing system for that language.
What would settle it
Manual inspection of a random sample of outputs with SFR below 10 percent to count how many are genuinely written in an entirely different script versus mixed or misclassified blocks.
Original abstract
Word error rate (WER) is the dominant metric for automatic speech recognition, yet it cannot detect a systematic failure mode: models that produce fluent output in the wrong writing system. We define Script Fidelity Rate (SFR), the fraction of hypothesis characters in the target script block, computable without reference transcriptions, and report a systematic measurement of script collapse across ten languages spanning six writing systems and ten models (seven Whisper sizes, MMS-1B, SeamlessM4T-v2, and Gemma 4 E2B) on FLEURS test sets. Across 100 evaluated model-language pairs, 21 (21%; 95% Wilson CI: 14-30%) exhibit script collapse (SFR less than 10%): 20 involve Whisper and one involves Gemma 4 E2B on Urdu under a generic transcription prompt. In a ten-language Gemma 4 probe, script-aware prompting raises mean SFR from 71.2% to 97.7%, fixes Urdu collapse (6.5% to 97.0%), and recovers 5.9 chrF on downstream NLLB translation for the six languages whose baseline SFR is below 90%. We identify four collapse patterns: Latin phonetic substitution, Arabic substitution for Somali, Devanagari substitution for Bengali/Malayalam, and unique-script Latin collapse for Georgian.
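The Wilson interval quoted in the abstract can be checked from the two reported counts alone. The sketch below uses the standard Wilson score formula with z = 1.96; the function name is hypothetical, and the choice of z is an assumption about how the authors computed a 95% interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# 21 collapsed pairs out of 100 evaluated pairs, as reported in the abstract.
low, high = wilson_interval(21, 100)
print(f"{low:.1%} to {high:.1%}")
```

Rounded to whole percentages, the bounds agree with the 14-30% range stated in the abstract.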
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Script Fidelity Rate (SFR), a reference-free metric defined as the fraction of hypothesis characters falling within a language's target Unicode script block, to detect script collapse in multilingual ASR where models produce fluent output in the wrong writing system. It benchmarks SFR across 100 model-language pairs (10 languages spanning 6 scripts, 10 models including Whisper variants, MMS-1B, SeamlessM4T-v2, and Gemma 4 E2B) on FLEURS test sets, finding 21 pairs (21%; 95% Wilson CI 14-30%) with SFR <10% (20 Whisper, 1 Gemma on Urdu), identifies four collapse patterns via inspection, and shows that script-aware prompting raises mean SFR from 71.2% to 97.7% for Gemma while recovering 5.9 chrF in downstream NLLB translation for low-SFR languages.
Significance. If the SFR metric and threshold are reliable, the work identifies a previously under-measured failure mode in current ASR systems that standard WER misses, supplies a simple reproducible benchmark on public data with confidence intervals, and demonstrates a low-cost mitigation via prompting that also benefits downstream tasks. The empirical scale (100 pairs) and concrete rates provide a useful baseline for future model evaluation.
major comments (2)
- [SFR definition and results] Definition of SFR: the metric treats membership in a single assigned Unicode script block as a sufficient proxy for correct orthography, but provides no semantic, contextual, or language-specific validation. This assumption is load-bearing for the headline 21/100 collapse count, yet the paper does not report human-judgment alignment or explicit handling for overlapping blocks (Urdu/Arabic, Somali/Arabic), digits/punctuation, or mixed-script hypotheses; the four patterns noted from inspection are post-hoc and do not retroactively validate the quantitative threshold.
- [Results and discussion] Results section (100-pair benchmark): the reported collapse rate and Wilson intervals rest on the unvalidated SFR <10% cutoff; without a cross-check against human ratings of writing-system correctness on even a modest subset of the 100 pairs, it is unclear whether the 21% figure over- or under-counts true collapse, especially for the single non-Whisper case (Gemma on Urdu).
minor comments (2)
- [Abstract] Abstract and methods: no implementation details are given for edge cases such as mixed-script output, normalization artifacts, or language-specific script ambiguities (e.g., Georgian unique script).
- [Results] The paper could usefully add a small table or appendix showing SFR sensitivity to the 10% threshold (e.g., rates at 5%, 15%) to demonstrate robustness.
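The threshold sensitivity check suggested in the last minor comment could be sketched roughly as follows. The `collapse_rate` helper is hypothetical, and the SFR values in `example_sfr` are placeholders invented for illustration, not results from the paper.

```python
def collapse_rate(sfr_by_pair: dict, threshold: float) -> float:
    """Share of model-language pairs whose SFR falls below the threshold."""
    return sum(sfr < threshold for sfr in sfr_by_pair.values()) / len(sfr_by_pair)


# Placeholder values only (NOT taken from the paper).
example_sfr = {
    ("model-a", "lang-1"): 0.02,
    ("model-a", "lang-2"): 0.07,
    ("model-b", "lang-1"): 0.12,
    ("model-b", "lang-2"): 0.95,
}

for threshold in (0.05, 0.10, 0.15):
    print(f"SFR < {threshold:.2f}: {collapse_rate(example_sfr, threshold):.0%} of pairs")
```

If the reported rate barely moves between 5% and 15%, the 10% cutoff is probably not doing much work; a large swing would suggest many pairs sit near the boundary.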
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting important considerations regarding the SFR metric and its empirical validation. We address each major comment below and outline targeted revisions to the manuscript.
- Referee: Definition of SFR: the metric treats membership in a single assigned Unicode script block as a sufficient proxy for correct orthography, but provides no semantic, contextual, or language-specific validation. This assumption is load-bearing for the headline 21/100 collapse count, yet the paper does not report human-judgment alignment or explicit handling for overlapping blocks (Urdu/Arabic, Somali/Arabic), digits/punctuation, or mixed-script hypotheses; the four patterns noted from inspection are post-hoc and do not retroactively validate the quantitative threshold.
Authors: SFR is intentionally defined as a lightweight, reference-free proxy that flags writing-system mismatches via Unicode block membership, which directly captures the script collapse phenomenon without requiring references or semantic analysis. We acknowledge that the manuscript does not include human-judgment alignment, which is a limitation of the current version. For script overlaps, Urdu is standardly written in the Arabic script block, while Somali uses Latin in the FLEURS data; we will add explicit text clarifying these assignments. We will revise the methods section to state that digits, punctuation, and other non-letter characters are excluded from the character count when computing SFR, and that mixed-script hypotheses are scored strictly by the proportion of characters falling inside the target block. The four patterns are presented as post-inspection characterizations to illustrate observed failure modes rather than as quantitative validation. We will add a limitations subsection discussing the proxy assumptions and their scope.
Revision: partial
- Referee: Results section (100-pair benchmark): the reported collapse rate and Wilson intervals rest on the unvalidated SFR <10% cutoff; without a cross-check against human ratings of writing-system correctness on even a modest subset of the 100 pairs, it is unclear whether the 21% figure over- or under-counts true collapse, especially for the single non-Whisper case (Gemma on Urdu).
Authors: The <10% threshold is a heuristic chosen to identify near-total script failure (i.e., >90% of characters outside the target block). The Wilson intervals correctly quantify sampling uncertainty around the observed 21/100 rate. We agree that direct human validation would increase confidence in the headline figure. In the revised manuscript we will report a human evaluation on a random subset of 20 model-language pairs (including the Gemma-Urdu case), in which annotators judge whether each hypothesis is written in the expected script; we will then compare these judgments to the SFR <10% classification to assess agreement and potential over- or under-counting.
Revision: yes
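The comparison between human judgments and the SFR <10% classification described in the rebuttal could be summarized along these lines. Using raw agreement plus Cohen's kappa is an assumption made here for illustration; the rebuttal does not say which agreement statistic the authors intend to report, and the function name is hypothetical.

```python
def agreement_and_kappa(sfr_flags: list, human_flags: list) -> tuple:
    """Raw agreement and Cohen's kappa for two binary 'collapsed' labelings."""
    n = len(sfr_flags)
    observed = sum(a == b for a, b in zip(sfr_flags, human_flags)) / n
    p_sfr = sum(sfr_flags) / n      # share flagged as collapsed by SFR < 10%
    p_human = sum(human_flags) / n  # share flagged as collapsed by annotators
    expected = p_sfr * p_human + (1 - p_sfr) * (1 - p_human)
    # Kappa is undefined when chance agreement is 1; return 1.0 in that case.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


# Toy labels (True = judged collapsed); invented for illustration only.
observed, kappa = agreement_and_kappa(
    [True, True, False, False],
    [True, False, False, False],
)
print(observed, kappa)
```

Cases where SFR flags collapse but annotators do not would indicate over-counting; the reverse would indicate under-counting.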
Circularity Check
No circularity: direct empirical definition and measurement of a reference-free metric
The paper defines Script Fidelity Rate (SFR) explicitly as the fraction of hypothesis characters whose Unicode script block matches the pre-assigned target block for each language. It then computes this quantity on the 100 model-language pairs from FLEURS test sets and reports the observed fraction below 10%. No parameters are fitted to data, no predictions are derived from subsets, no self-citations support the central measurement, and no equations reduce the reported rates to the inputs by construction. The result is a straightforward counting procedure on external data; concerns about the Unicode proxy's validity are separate from circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Unicode script blocks can be used to reliably determine if text is in the target script for a given language.
Reference graph
Works this paper leans on
- [1] Record ASR hypotheses and the intended target language.
- [2] Compute utterance-level SFR using the target language's Unicode block specification.
- [3] Alert when corpus-level mean SFR drops below a deployment threshold, for example < 0.8.
- [4] Inspect low-SFR examples with WER, CER, LID, or human review before making product decisions.
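The four steps listed above could be wired together roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the Arabic range in the usage example is a single hand-picked block, only letters and combining marks are counted, and reusing the 0.8 corpus threshold to flag individual utterances for review is a simplification made here.

```python
import unicodedata
from statistics import mean


def utterance_sfr(text: str, ranges: list) -> float:
    """Share of letters and marks whose code points fall in the target ranges."""
    counted = [c for c in text if unicodedata.category(c)[0] in ("L", "M")]
    if not counted:
        return 0.0  # assumption: empty output counts as zero fidelity
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in ranges) for c in counted)
    return hits / len(counted)


def check_corpus(hypotheses: list, ranges: list, threshold: float = 0.8) -> dict:
    """Compute per-utterance SFR, the corpus mean, and a review queue."""
    scores = [utterance_sfr(h, ranges) for h in hypotheses]
    corpus_mean = mean(scores)
    return {
        "mean_sfr": corpus_mean,
        "alert": corpus_mean < threshold,
        # Low-SFR utterances to inspect with WER, CER, LID, or human review.
        "to_review": [h for h, s in zip(hypotheses, scores) if s < threshold],
    }


# Usage with an assumed Arabic-block range and toy hypotheses.
ARABIC = [(0x0600, 0x06FF)]
report = check_corpus(["اردو", "urdu", "اردو"], ARABIC)
print(report["alert"], report["to_review"])
```

The alert only says that something looks off at the script level; as step [4] notes, the flagged examples still need a second look before any product decision.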