PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

Hanif Rahman

arxiv: 2605.26978 · v1 · pith:MLWH5JRPnew · submitted 2026-05-26 · 💻 cs.CL · cs.SD

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

Hanif Rahman This is my paper

Pith reviewed 2026-06-29 18:41 UTC · model grok-4.3

classification 💻 cs.CL cs.SD

keywords Pashto TTSlow-resource languagestext-to-speech evaluationscript fidelityautomated screeningASR WERlanguage identificationbenchmark

0 comments

The pith

The INSV framework separates intelligibility, script fidelity, and language identification for automated Pashto TTS screening beyond single WER.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces INSV, a reporting framework that breaks TTS evaluation into intelligibility, naturalness, script fidelity, and verification to address failures of single ASR round-trip WER on low-resource non-Latin-script languages. It focuses on the automated INSV-A subset using synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. A system may produce no audio, speak a neighbouring language, or preserve target script only in transcripts, and these cases need separation. PashtoTTS-Bench applies this to evaluate several systems on FLEURS and Common Voice prompts, reporting distinct scores per metric and system.

Core claim

We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts.

What carries the argument

INSV-A, the automated screening subset of the INSV framework that combines synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification via independent models.

If this is right

Different systems exhibit distinct failure modes that separate metrics can detect, such as low WER but zero Pashto LID labels.
Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control.
WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech.
The benchmark release supplies provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding new systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same INSV-A checks could be rerun on later model versions to track whether improvements in one metric affect others.
Extending the benchmark to additional non-Latin low-resource languages would test whether the framework generalizes beyond Pashto.
The separation of metrics makes it possible to prioritize development effort on the weakest dimension for each system.

Load-bearing premise

The independent omniASR_CTC_300M_v2, Whisper Large V3, MMS-LID-4017, and SpeechBrain VoxLingua107 models provide reliable signals for Pashto-specific intelligibility, script fidelity, and language identification without systematic bias on synthetic audio.

What would settle it

A set of human native listener ratings on the same audio that reverses the automated WER and LID rankings or shows high mismatch with the Script Fidelity Rate.

Figures

Figures reproduced from arXiv: 2605.26978 by Hanif Rahman.

**Figure 1.** Figure 1: ASR round-trip WER by system and source text set. Error bars show 95% bootstrap confidence intervals. The dashed reference line marks the natural-speech pashto-asr-v3 WER baseline (34.6% on FLEURS computed this study; 32.5% on CV24 from Rahman 14). The omniASR CV24 gap between Latifa and GulNawaz (47.7 % vs. 39.5 %) is wider than the pashto-asr-v3 gap (34.4 % vs. 32.7 %), which suggests the independent rec… view at source ↗

**Figure 2.** Figure 2: Failure map from transcript SFR and language verification. The vertical reference line shows a 90 % Pashto-ID rate, but model disagreement is reported as unresolved rather than overridden by one model. cases are realistic in practice and SFR = 1.0 represents an empirical outcome of this run rather than a design flaw in the metric. Whisper Large V3 returns 0.0 % Pashto labels on author-manually checked Pas… view at source ↗

**Figure 3.** Figure 3: Utterance-level WER for Pashto-specific grapheme groups on FLEURS. Bars report WER for utterances containing at least one character in the class. This is a screening view, not phone-level error annotation. Brain are doing something meaningful. The actionable conclusion is that multimodel LID is required because no single available automatic model covers Pashto reliably. Transcript SFR also requires cautio… view at source ↗

**Figure 4.** Figure 4: System-by-failure matrix for the automated run. Unmeasured cells are kept separate from confirmed pass/fail cells [PITH_FULL_IMAGE:figures/full_fig_p027_4.png] view at source ↗

read the original abstract

Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PashtoTTS-Bench supplies a dated automated screening template and concrete scores for Pashto TTS but the numbers rest on unvalidated ASR and LID models for that language.

read the letter

PashtoTTS-Bench introduces INSV-A, an automated reporting setup that runs synthesis completion, ASR WER/CER, transcript script fidelity, and audio language ID on Pashto TTS outputs. It applies this to Edge GulNawaz, Edge Latifa, OmniVoice variants, and an Urdu control using FLEURS and filtered Common Voice prompts, then releases the per-sentence scores, logs, and scripts.

The release itself is the useful part. Practitioners who need a quick, reproducible way to screen new Pashto systems now have a dated baseline and the code to extend it. The framework separates obvious failure modes without requiring native listeners in this version, which is a reasonable practical step for low-resource work.

The soft spot is exactly the one flagged in the stress test. The paper relies on omniASR_CTC_300M_v2, Whisper Large V3, MMS-LID-4017, and SpeechBrain VoxLingua107 without any Pashto-specific calibration or error analysis on synthetic audio. Whisper returning 0 % Pashto labels on the evaluated audio already shows the LID component is not behaving as intended. No ablation on ASR choice or checks for systematic bias on non-Latin synthetic speech appears in the text. That leaves the reported WER ordering dependent on an unchecked assumption.

The paper is for engineers and researchers who build or compare TTS for Pashto or similar languages and want an off-the-shelf automated filter. It is narrow but the benchmark and data release are new. It deserves peer review because the contribution is the concrete template and shared results rather than any novel model or derivation.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the INSV-A automated screening subset of the INSV framework for TTS evaluation in low-resource non-Latin-script languages. It instantiates PashtoTTS-Bench on five systems (Edge GulNawaz, Edge Latifa, OmniVoice clone/auto, Urdu control) using 400 prompts, reporting synthesis completion, WER/CER from omniASR_CTC_300M_v2 (e.g., 24.1%/27.4% for OmniVoice auto on FLEURS/CV24), transcript Script Fidelity Rate, and LID outcomes from Whisper Large V3 (0% Pashto), MMS-LID-4017, and SpeechBrain VoxLingua107, while releasing per-sentence scores, metadata, failure logs, and addition scripts.

Significance. If the chosen ASR and LID models yield low-bias signals on Pashto synthetic audio, INSV-A supplies a practical, reproducible automated benchmark that separates synthesis failure modes without requiring native listeners; this is valuable for low-resource TTS research. The explicit release of scripts, per-sentence data, and provider metadata strengthens reproducibility and extensibility.

major comments (3)

[Abstract] Abstract (results reporting paragraph): The WER ordering and LID separation are presented as direct evidence supporting INSV-A utility, yet no Pashto-specific validation, calibration on real Pashto speech, or error analysis of omniASR_CTC_300M_v2, Whisper Large V3, MMS-LID-4017, or SpeechBrain VoxLingua107 on synthetic audio is reported. This assumption is load-bearing for interpreting all numerical claims, as the models were likely trained with minimal Pashto data and Whisper already returns 0% Pashto labels on the evaluated audio.
[Abstract] Abstract (results reporting paragraph): No error bars, confidence intervals, or statistical tests accompany the WER/CER figures (e.g., 24.1%, 32.8%, 35.6%, 45.4%), so the reported differences between systems cannot be assessed for reliability or robustness to prompt sampling variation.
[Abstract] Abstract: The prompt filtering procedure for the 200 Common Voice 24 items and any ablation of ASR model choice are not described, leaving the robustness of the INSV-A scores and the cross-system comparisons dependent on unstated selection criteria.

minor comments (2)

[Abstract] Abstract: The exact definition and computation of 'transcript Script Fidelity Rate' is not illustrated with an example or formula, which would clarify how it contributes to case separation in INSV-A.
The manuscript would benefit from a summary table listing each INSV-A component, the model(s) used, and the intended failure mode it detects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report recommending major revision. We address each major comment point by point below.

read point-by-point responses

Referee: [Abstract] Abstract (results reporting paragraph): The WER ordering and LID separation are presented as direct evidence supporting INSV-A utility, yet no Pashto-specific validation, calibration on real Pashto speech, or error analysis of omniASR_CTC_300M_v2, Whisper Large V3, MMS-LID-4017, or SpeechBrain VoxLingua107 on synthetic audio is reported. This assumption is load-bearing for interpreting all numerical claims, as the models were likely trained with minimal Pashto data and Whisper already returns 0% Pashto labels on the evaluated audio.

Authors: We agree this is a substantive limitation: the manuscript provides no dedicated Pashto validation set or error analysis for the proxy ASR/LID models on synthetic audio. INSV-A is explicitly positioned as an automated screening framework that uses off-the-shelf models to separate failure modes (no audio, wrong language, script mismatch) without native listeners; the 0 % Whisper Pashto label is reported precisely to surface one such limitation rather than to claim model reliability. The per-sentence scores and failure logs are released to enable exactly the calibration the referee requests. We will add a limitations subsection that states the low-resource training data issue for all four models and notes that native validation remains future work. This is a partial revision. revision: partial
Referee: [Abstract] Abstract (results reporting paragraph): No error bars, confidence intervals, or statistical tests accompany the WER/CER figures (e.g., 24.1%, 32.8%, 35.6%, 45.4%), so the reported differences between systems cannot be assessed for reliability or robustness to prompt sampling variation.

Authors: We accept the point. The current release reports point estimates only. In the revised manuscript we will add bootstrap confidence intervals (1,000 resamples) for all WER/CER values and a brief statement on sampling robustness. revision: yes
Referee: [Abstract] Abstract: The prompt filtering procedure for the 200 Common Voice 24 items and any ablation of ASR model choice are not described, leaving the robustness of the INSV-A scores and the cross-system comparisons dependent on unstated selection criteria.

Authors: The filtering criteria (utterance length, automatic language verification, and removal of low-quality or code-switched items) appear in the methods section of the full manuscript, but the abstract does not repeat them. We will move a concise description of the Common Voice filtering procedure into the abstract and add a short paragraph on ASR model selection rationale. No ablation of ASR choice was performed; we will state this explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on external models

full rationale

The paper defines INSV-A as a reporting framework and instantiates it via straightforward application of independent external models (omniASR_CTC_300M_v2, Whisper Large V3, MMS-LID-4017, SpeechBrain VoxLingua107) to compute WER/CER, script fidelity rate, and LID labels on TTS outputs. No equations, fitted parameters, predictions, or derivations are present. No self-citations are invoked as load-bearing premises. All reported numbers are direct outputs from the cited off-the-shelf models; the framework itself is a taxonomy of measurement categories rather than a derived result. This matches the default case of self-contained empirical benchmarking against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework relies on the assumption that the chosen external ASR and LID models are sufficiently accurate for Pashto synthetic audio; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Selected ASR and LID models (omniASR_CTC_300M_v2, Whisper Large V3, MMS-LID-4017, SpeechBrain VoxLingua107) produce reliable Pashto-specific signals on synthetic audio.
Invoked when interpreting WER, script fidelity, and LID results as meaningful for the target language.

pith-pipeline@v0.9.1-grok · 5874 in / 1204 out tokens · 25986 ms · 2026-06-29T18:41:38.735770+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 15 canonical work pages · 2 internal anchors

[1]

Automatic speech recognition for under-resourced languages: A survey

Besacier, L., Barnard, E., Karpov, A., Schultz, T., 2014. Automatic speech recognition for under-resourced languages: A survey. Speech Communi- cation 56, 85–100. doi: 10.1016/j.specom.2013.07.008 . foundational low-resource ASR survey; most directly analogous prior survey to this pa- per

work page doi:10.1016/j.specom.2013.07.008 2014
[2]

FLEURS: Few-shot learning evaluation of universal representations of speech, in: 2022 IEEE Spoken Language Tech- nology Workshop (SLT), pp

Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., Riesa, J., Rivera, C., Bapna, A., 2023. FLEURS: Few-shot learning evaluation of universal representations of speech, in: 2022 IEEE Spoken Language Tech- nology Workshop (SLT), pp. 798–805. URL: https://doi.org/10.1109/ SLT54892.2023.10023141, doi:10.1109/SLT54892.2023.10023141

work page doi:10.1109/slt54892.2023.10023141 2023
[3]

The zero resource speech challenge 2021: Spoken language modelling, in: Interspeech 2021, pp

Dunbar, E., Bernard, M., Hamilakis, N., Nguyen, T.A., de Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Dupoux, E., 2021. The zero resource speech challenge 2021: Spoken language modelling, in: Interspeech 2021, pp. 1574–

2021
[4]

URL: https://doi.org/10.21437/Interspeech.2021-1755 , doi:10.21437/Interspeech.2021-1755

work page doi:10.21437/interspeech.2021-1755 2021
[5]

The zero resource speech challenge 2017, in: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp

Dunbar, E., Cao, X.N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., Dupoux, E., 2017. The zero resource speech challenge 2017, in: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 323–330. URL: https://doi.org/10.1109/ASRU.2017.82 68953, doi:10.1109/ASRU.2017.8268953. 32

work page doi:10.1109/asru.2017.82 2017
[6]

VoiceMOS challenge 2022, in: Interspeech 2022, pp

Huang, W.C., Cooper, E., Tsao, Y., Wang, H.M., Toda, T., Yamagishi, J., 2022. VoiceMOS challenge 2022, in: Interspeech 2022, pp. 4536–4540. URL: http s://doi.org/10.21437/Interspeech.2022-970, doi:10.21437/Inter speech.2022-970

work page doi:10.21437/interspeech.2022-970 2022
[7]

Recommendation P.800: Methods for subjective determination of transmission quality

ITU-T, 1996. Recommendation P.800: Methods for subjective determination of transmission quality. URL: https://www.itu.int/rec/T-REC-P.800 -199608-I. international Telecommunication Union, in force

1996
[8]

Recommendation P.808: Subjective evaluation of speech qual- ity with a crowdsourcing approach

ITU-T, 2021. Recommendation P.808: Subjective evaluation of speech qual- ity with a crowdsourcing approach. URL: https://www.itu.int/itu-t /recommendations/rec.aspx?rec=P.808 . international Telecommuni- cation Union, in force

2021
[9]

Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages,

Keren, G., Kozhevnikov, A., Meng, Y., Ropers, C., Setzler, M., Wang, S., Adebara, I., Auli, M., Balioglu, C., Chan, K., Cheng, C., Chuang, J., Droof, C., Duppenthaler, M., Duquenne, P.A., Erben, A., Gao, C., Gonzalez, G.M., Lyu, K., Miglani, S., Pratap, V., Sadagopan, K.R., Saleem, S., Turkatenko, A., Ventayol-Boada, A., Yong, Z.X., Chung, Y.A., Maillard,...

work page arXiv 2025
[10]

Multilingual spoken words corpus

Mazumder, M., Chitlangia, S., Banbury, C., Kang, Y., Ciro, J., Achorn, K., Galvez, D., Sabini, M., Mattson, P., Kanter, D., Diamos, G., Warden, P., Meyer, J., Janapa Reddi, V., 2021. Multilingual spoken words corpus. Pro- ceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. URL: https://datasets-benchmarks-proceedings .n...

2021
[11]

Orthography and phonemes in Pashto (Afghan)

Penzl, H., 1954. Orthography and phonemes in Pashto (Afghan). Journal of the American Oriental Society 74, 74–81. URL: https://doi.org/10.2 307/596206, doi:10.2307/596206

work page doi:10.2307/596206 1954
[12]

The blizzard challenge 2014, in: The Blizzard Challenge 2014, pp

Prahallad, K., Vadapalli, A., Kesiraju, S., Murthy, H.A., Lata, S., Nagarajan, T., Prasanna, M., Patil, H., Sao, A.K., King, S., Black, A.W., Tokuda, K., 2014. The blizzard challenge 2014, in: The Blizzard Challenge 2014, pp. 1–14. URL: https://doi.org/10.21437/Blizzard.2014-1 , doi:10.21437/Blizz ard.2014-1. 33

work page doi:10.21437/blizzard.2014-1 2014
[13]

Scaling speech technology to 1,000+ languages

Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.N., Conneau, A., Auli, M., 2024. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research 25, 1–52. URL: https: //jmlr.org/papers/v25/23-1318.html . includes the MMS-LID-...

2024
[14]

Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.,
[15]

Robust speech recognition via large-scale weak supervision, in: Pro- ceedings of the 40th International Conference on Machine Learning, PMLR. pp. 28492–28518. URL: https://proceedings.mlr.press/v202/radfo rd23a.html
[16]

Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation

Rahman, H., 2026a. Benchmarking multilingual speech models on Pashto: Zero-shot ASR, script failure, and cross-domain evaluation. arXiv preprint arXiv:2604.04598. URL: https://arxiv.org/abs/2604.04598 . 10 mod- els zero-shot + 5 fine-tuned cross-domain; script failure audit; phoneme- class WER; companion benchmark to this survey

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Script collapse in multilingual ASR: A reference-free metric and 100-pair benchmark

Rahman, H., 2026b. Script collapse in multilingual ASR: A reference- free metric and 100-pair benchmark. arXiv preprint arXiv:2604.08786. URL: https://arxiv.org/abs/2604.08786 . script Fidelity Rate metric; 6 languages x 9 models on FLEURS; package at github.com/the- mechanism/script-fidelity-metric

work page internal anchor Pith review Pith/arXiv arXiv
[18]

ihanif/pashto-asr-v3: W2V-BERT 2.0 fine-tuned for Pashto ASR

Rahman, H.u., 2024. ihanif/pashto-asr-v3: W2V-BERT 2.0 fine-tuned for Pashto ASR. https://huggingface.co/ihanif/pashto-asr-v3 . Fine- tuned from facebook/w2v-bert-2.0 (600M parameters; CTC decoder); trained on FLEURS Pashto training split; published WER 13.96%; model weights publicly available on HuggingFace Hub

2024
[19]

SpeechBrain: A general- purpose speech toolkit

Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lu- gosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.C., Yeh, S.L., Fu, S.W., Liao, C.F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., De Mori, R., Bengio, Y., 2021. SpeechBrain: A general- purpose speech toolkit. URL: https://arxiv.org/abs/2106.04624 , a...

work page arXiv 2021
[20]

Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., Saruwatari, H.,
[21]

4521–4525

UTMOS: Utokyo-sarulab system for voicemos challenge 2022, in: In- 34 terspeech 2022, pp. 4521–4525. URL: https://doi.org/10.21437/Int erspeech.2022-439, doi:10.21437/Interspeech.2022-439

work page doi:10.21437/int 2022
[22]

Mean opinion score (mos) re- visited: Methods and applications, limitations and alternatives

Streijl, R.C., Winkler, S., Hands, D.S., 2016. Mean opinion score (mos) re- visited: Methods and applications, limitations and alternatives. Multimedia Systems 22, 213–227. URL: https://doi.org/10.1007/s00530-014-0 446-1, doi:10.1007/s00530-014-0446-1

work page doi:10.1007/s00530-014-0 2016
[23]

Confidence intervals for ASR-based TTS evaluation, in: Interspeech 2021, pp

Taylor, J., Richmond, K., 2021. Confidence intervals for ASR-based TTS evaluation, in: Interspeech 2021, pp. 2791–2795. URL: https://doi.org/ 10.21437/Interspeech.2021-2203 , doi:10.21437/Interspeech.202 1-2203

work page doi:10.21437/interspeech.2021-2203 2021
[24]

A Reference Grammar of Pashto

Tegey, H., Robson, B., 1996. A Reference Grammar of Pashto. Center for Applied Linguistics, Washington, D.C. Foundational Pashto grammar; phoneme inventory; morphology

1996
[25]

VoxLingua107: a dataset for spoken language recognition, in: Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp

Valk, J., Alumäe, T., 2021. VoxLingua107: a dataset for spoken language recognition, in: Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 652–658. doi: 10.1109/SLT48900.2021.9383459 . 107-language YouTube-scraped corpus; XLS-R uses this as one of five pre- training data sources. Supplementary material §A: MOS rater instructions Engli...

work page doi:10.1109/slt48900.2021.9383459 2021
[26]

You may replay once if unsure

Listen to the recording once before rating. You may replay once if unsure
[27]

Rate the recording using the 1–5 scale
[28]

Answer whether the clip is Pashto speech: yes, no, or unsure
[29]

Do not go back and change previous ratings

Work through the recordings in order. Do not go back and change previous ratings
[30]

If the recording is silent or in a language other than Pashto, rate it 1. Pashto version ॴధ ୑୆ൌ৭ ߓߵﺍߓߵّ۬ ܔިﺉ.݁۬ ﺩ࠯ࡇﻩ ﺩ ل؇݁أٷ؇ﺩ ﻭﺭܔިﺉ.݁۬ﺩﺭ ৆ྵ ޗٴ٭ ޗٴ٭ ଫଃ༚ ଫଃ༚ ﺩﯼؗ ेू ๲ཇ ﻭﺭ ቕሹ ࢻࣖܳިﺉ.݁۬ﺩﺭ ﻭٙ؇ 36 Rater demographics The published summary must report rater count, L1 status, dialect back- ground, age range (when consented), gender breakdown (when consented), col-...
[31]

Add a provider implementation with four methods: synthesize, supports_pashto, get_metadata, get_provider_name
[32]

Run synthesis and automated metrics on both prompt sets
[33]

Run multi-model LID (MMS-LID, SpeechBrain; Whisper as diagnostic only); mark disagreement as unresolved until native adjudication is collected
[34]

For MOS evaluation, use scripts/export\_mos\_survey.py and the rater instructions in Supplementary §A

Record per-sentence CSVs, provider metadata, run logs, and redistribution terms for generated audio. For MOS evaluation, use scripts/export\_mos\_survey.py and the rater instructions in Supplementary §A. Report rater counts, dialect metadata, and Krippendorff’s α alongside MOS means. 37 Audio integrity Commercial audio is not redistributed when provider t...

2026

[1] [1]

Automatic speech recognition for under-resourced languages: A survey

Besacier, L., Barnard, E., Karpov, A., Schultz, T., 2014. Automatic speech recognition for under-resourced languages: A survey. Speech Communi- cation 56, 85–100. doi: 10.1016/j.specom.2013.07.008 . foundational low-resource ASR survey; most directly analogous prior survey to this pa- per

work page doi:10.1016/j.specom.2013.07.008 2014

[2] [2]

FLEURS: Few-shot learning evaluation of universal representations of speech, in: 2022 IEEE Spoken Language Tech- nology Workshop (SLT), pp

Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., Riesa, J., Rivera, C., Bapna, A., 2023. FLEURS: Few-shot learning evaluation of universal representations of speech, in: 2022 IEEE Spoken Language Tech- nology Workshop (SLT), pp. 798–805. URL: https://doi.org/10.1109/ SLT54892.2023.10023141, doi:10.1109/SLT54892.2023.10023141

work page doi:10.1109/slt54892.2023.10023141 2023

[3] [3]

The zero resource speech challenge 2021: Spoken language modelling, in: Interspeech 2021, pp

Dunbar, E., Bernard, M., Hamilakis, N., Nguyen, T.A., de Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Dupoux, E., 2021. The zero resource speech challenge 2021: Spoken language modelling, in: Interspeech 2021, pp. 1574–

2021

[4] [4]

URL: https://doi.org/10.21437/Interspeech.2021-1755 , doi:10.21437/Interspeech.2021-1755

work page doi:10.21437/interspeech.2021-1755 2021

[5] [5]

The zero resource speech challenge 2017, in: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp

Dunbar, E., Cao, X.N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., Dupoux, E., 2017. The zero resource speech challenge 2017, in: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 323–330. URL: https://doi.org/10.1109/ASRU.2017.82 68953, doi:10.1109/ASRU.2017.8268953. 32

work page doi:10.1109/asru.2017.82 2017

[6] [6]

VoiceMOS challenge 2022, in: Interspeech 2022, pp

Huang, W.C., Cooper, E., Tsao, Y., Wang, H.M., Toda, T., Yamagishi, J., 2022. VoiceMOS challenge 2022, in: Interspeech 2022, pp. 4536–4540. URL: http s://doi.org/10.21437/Interspeech.2022-970, doi:10.21437/Inter speech.2022-970

work page doi:10.21437/interspeech.2022-970 2022

[7] [7]

Recommendation P.800: Methods for subjective determination of transmission quality

ITU-T, 1996. Recommendation P.800: Methods for subjective determination of transmission quality. URL: https://www.itu.int/rec/T-REC-P.800 -199608-I. international Telecommunication Union, in force

1996

[8] [8]

Recommendation P.808: Subjective evaluation of speech qual- ity with a crowdsourcing approach

ITU-T, 2021. Recommendation P.808: Subjective evaluation of speech qual- ity with a crowdsourcing approach. URL: https://www.itu.int/itu-t /recommendations/rec.aspx?rec=P.808 . international Telecommuni- cation Union, in force

2021

[9] [9]

Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages,

Keren, G., Kozhevnikov, A., Meng, Y., Ropers, C., Setzler, M., Wang, S., Adebara, I., Auli, M., Balioglu, C., Chan, K., Cheng, C., Chuang, J., Droof, C., Duppenthaler, M., Duquenne, P.A., Erben, A., Gao, C., Gonzalez, G.M., Lyu, K., Miglani, S., Pratap, V., Sadagopan, K.R., Saleem, S., Turkatenko, A., Ventayol-Boada, A., Yong, Z.X., Chung, Y.A., Maillard,...

work page arXiv 2025

[10] [10]

Multilingual spoken words corpus

Mazumder, M., Chitlangia, S., Banbury, C., Kang, Y., Ciro, J., Achorn, K., Galvez, D., Sabini, M., Mattson, P., Kanter, D., Diamos, G., Warden, P., Meyer, J., Janapa Reddi, V., 2021. Multilingual spoken words corpus. Pro- ceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. URL: https://datasets-benchmarks-proceedings .n...

2021

[11] [11]

Orthography and phonemes in Pashto (Afghan)

Penzl, H., 1954. Orthography and phonemes in Pashto (Afghan). Journal of the American Oriental Society 74, 74–81. URL: https://doi.org/10.2 307/596206, doi:10.2307/596206

work page doi:10.2307/596206 1954

[12] [12]

The blizzard challenge 2014, in: The Blizzard Challenge 2014, pp

Prahallad, K., Vadapalli, A., Kesiraju, S., Murthy, H.A., Lata, S., Nagarajan, T., Prasanna, M., Patil, H., Sao, A.K., King, S., Black, A.W., Tokuda, K., 2014. The blizzard challenge 2014, in: The Blizzard Challenge 2014, pp. 1–14. URL: https://doi.org/10.21437/Blizzard.2014-1 , doi:10.21437/Blizz ard.2014-1. 33

work page doi:10.21437/blizzard.2014-1 2014

[13] [13]

Scaling speech technology to 1,000+ languages

Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.N., Conneau, A., Auli, M., 2024. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research 25, 1–52. URL: https: //jmlr.org/papers/v25/23-1318.html . includes the MMS-LID-...

2024

[14] [14]

Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.,

[15] [15]

Robust speech recognition via large-scale weak supervision, in: Pro- ceedings of the 40th International Conference on Machine Learning, PMLR. pp. 28492–28518. URL: https://proceedings.mlr.press/v202/radfo rd23a.html

[16] [16]

Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation

Rahman, H., 2026a. Benchmarking multilingual speech models on Pashto: Zero-shot ASR, script failure, and cross-domain evaluation. arXiv preprint arXiv:2604.04598. URL: https://arxiv.org/abs/2604.04598 . 10 mod- els zero-shot + 5 fine-tuned cross-domain; script failure audit; phoneme- class WER; companion benchmark to this survey

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Script collapse in multilingual ASR: A reference-free metric and 100-pair benchmark

Rahman, H., 2026b. Script collapse in multilingual ASR: A reference- free metric and 100-pair benchmark. arXiv preprint arXiv:2604.08786. URL: https://arxiv.org/abs/2604.08786 . script Fidelity Rate metric; 6 languages x 9 models on FLEURS; package at github.com/the- mechanism/script-fidelity-metric

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

ihanif/pashto-asr-v3: W2V-BERT 2.0 fine-tuned for Pashto ASR

Rahman, H.u., 2024. ihanif/pashto-asr-v3: W2V-BERT 2.0 fine-tuned for Pashto ASR. https://huggingface.co/ihanif/pashto-asr-v3 . Fine- tuned from facebook/w2v-bert-2.0 (600M parameters; CTC decoder); trained on FLEURS Pashto training split; published WER 13.96%; model weights publicly available on HuggingFace Hub

2024

[19] [19]

SpeechBrain: A general- purpose speech toolkit

Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lu- gosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.C., Yeh, S.L., Fu, S.W., Liao, C.F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., De Mori, R., Bengio, Y., 2021. SpeechBrain: A general- purpose speech toolkit. URL: https://arxiv.org/abs/2106.04624 , a...

work page arXiv 2021

[20] [20]

Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., Saruwatari, H.,

[21] [21]

4521–4525

UTMOS: Utokyo-sarulab system for voicemos challenge 2022, in: In- 34 terspeech 2022, pp. 4521–4525. URL: https://doi.org/10.21437/Int erspeech.2022-439, doi:10.21437/Interspeech.2022-439

work page doi:10.21437/int 2022

[22] [22]

Mean opinion score (mos) re- visited: Methods and applications, limitations and alternatives

Streijl, R.C., Winkler, S., Hands, D.S., 2016. Mean opinion score (mos) re- visited: Methods and applications, limitations and alternatives. Multimedia Systems 22, 213–227. URL: https://doi.org/10.1007/s00530-014-0 446-1, doi:10.1007/s00530-014-0446-1

work page doi:10.1007/s00530-014-0 2016

[23] [23]

Confidence intervals for ASR-based TTS evaluation, in: Interspeech 2021, pp

Taylor, J., Richmond, K., 2021. Confidence intervals for ASR-based TTS evaluation, in: Interspeech 2021, pp. 2791–2795. URL: https://doi.org/ 10.21437/Interspeech.2021-2203 , doi:10.21437/Interspeech.202 1-2203

work page doi:10.21437/interspeech.2021-2203 2021

[24] [24]

A Reference Grammar of Pashto

Tegey, H., Robson, B., 1996. A Reference Grammar of Pashto. Center for Applied Linguistics, Washington, D.C. Foundational Pashto grammar; phoneme inventory; morphology

1996

[25] [25]

VoxLingua107: a dataset for spoken language recognition, in: Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp

Valk, J., Alumäe, T., 2021. VoxLingua107: a dataset for spoken language recognition, in: Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 652–658. doi: 10.1109/SLT48900.2021.9383459 . 107-language YouTube-scraped corpus; XLS-R uses this as one of five pre- training data sources. Supplementary material §A: MOS rater instructions Engli...

work page doi:10.1109/slt48900.2021.9383459 2021

[26] [26]

You may replay once if unsure

Listen to the recording once before rating. You may replay once if unsure

[27] [27]

Rate the recording using the 1–5 scale

[28] [28]

Answer whether the clip is Pashto speech: yes, no, or unsure

[29] [29]

Do not go back and change previous ratings

Work through the recordings in order. Do not go back and change previous ratings

[30] [30]

If the recording is silent or in a language other than Pashto, rate it 1. Pashto version ॴధ ୑୆ൌ৭ ߓߵﺍߓߵّ۬ ܔިﺉ.݁۬ ﺩ࠯ࡇﻩ ﺩ ل؇݁أٷ؇ﺩ ﻭﺭܔިﺉ.݁۬ﺩﺭ ৆ྵ ޗٴ٭ ޗٴ٭ ଫଃ༚ ଫଃ༚ ﺩﯼؗ ेू ๲ཇ ﻭﺭ ቕሹ ࢻࣖܳިﺉ.݁۬ﺩﺭ ﻭٙ؇ 36 Rater demographics The published summary must report rater count, L1 status, dialect back- ground, age range (when consented), gender breakdown (when consented), col-...

[31] [31]

Add a provider implementation with four methods: synthesize, supports_pashto, get_metadata, get_provider_name

[32] [32]

Run synthesis and automated metrics on both prompt sets

[33] [33]

Run multi-model LID (MMS-LID, SpeechBrain; Whisper as diagnostic only); mark disagreement as unresolved until native adjudication is collected

[34] [34]

For MOS evaluation, use scripts/export\_mos\_survey.py and the rater instructions in Supplementary §A

Record per-sentence CSVs, provider metadata, run logs, and redistribution terms for generated audio. For MOS evaluation, use scripts/export\_mos\_survey.py and the rater instructions in Supplementary §A. Report rater counts, dialect metadata, and Krippendorff’s α alongside MOS means. 37 Audio integrity Commercial audio is not redistributed when provider t...

2026