BlasBench: An Open Benchmark for Irish Speech Recognition
Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3
The pith
BlasBench supplies an Irish-aware normalizer and scoring harness that make ASR comparison for the language reliable and expose a large generalization gap between fine-tuned and multilingual models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BlasBench demonstrates that an Irish-aware normalizer preserving fadas, lenition, and eclipsis is required for valid ASR evaluation. With it in place, all Whisper variants exceed 100 percent WER through insertion-driven hallucination; Microsoft Azure reaches 22.2 percent WER on Common Voice and 57.5 percent on FLEURS; the best open model reaches 30.65 percent and 39.09 percent respectively; and fine-tuned systems degrade far more than multilingual ones when moving between the two corpora.
What carries the argument
BlasBench, an open evaluation harness built around a standalone Irish-aware normaliser that preserves fadas, lenition, and eclipsis and supplies reproducible scoring with released per-utterance predictions.
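The released normaliser itself is not reproduced in this review. As a hedged illustration of the behaviour described (lowercasing that keeps fadas, and punctuation stripping that leaves lenition spellings such as "bhfuil" and hyphenated prefixes such as "t-éan" intact), a minimal sketch might look like:

```python
import re
import unicodedata

def normalize_ga(text: str) -> str:
    """Illustrative Irish-aware normalisation (not BlasBench's code):
    lowercase while preserving fadas (á, é, í, ó, ú), and strip only
    punctuation, so lenition (e.g. 'bhuail') and eclipsis/prothesis
    markers (e.g. 'bhfuil', 't-éan', 'n-úll') survive scoring."""
    text = unicodedata.normalize("NFC", text)  # compose 'é' rather than 'e' + combining acute
    text = text.lower()                        # str.lower() keeps the fada
    # Keep letters, digits, whitespace, and the hyphen used by mutation
    # prefixes; replace everything else with a space.
    text = re.sub(r"[^\w\s\-]", " ", text)
    return " ".join(text.split())

print(normalize_ga("Tá an t-éan sa ghairdín."))  # -> tá an t-éan sa ghairdín
```

A real normaliser would also need number expansion and dialect-sensitive rules; this sketch only shows why naive ASCII-folding normalisers destroy the orthographic distinctions the paper argues are load-bearing.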
If this is right
- Single-dataset leaderboards for Irish ASR become unreliable and must be replaced by multi-corpus evaluation.
- Fine-tuning on one Irish resource produces brittle models that fail on new domains or recording conditions.
- Massively multilingual pre-training confers measurable robustness advantages for Irish that single-language fine-tuning does not.
- Hallucination-driven insertions in Whisper models render their WER scores uninterpretable without an Irish-aware normalizer.
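The last bullet follows from the arithmetic of WER: insertions add to the numerator while the denominator stays fixed at the reference length, so scores above 100 percent are possible. A minimal self-contained illustration (not the paper's scoring harness):

```python
def wer(ref_words, hyp_words):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed as word-level Levenshtein distance over the reference length.
    Because insertions are charged against a fixed denominator, a
    hallucinating model that emits extra words can exceed 100%."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]),
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[n][m] / n

ref = "tá sé go maith".split()
hyp = "tá tá sé sé go go maith maith anois".split()  # repetition-style hallucination
print(f"{wer(ref, hyp):.0%}")  # -> 125%
```

Five inserted words against a four-word reference already yield 125% WER, which is why the review treats raw Whisper scores as uninterpretable without normalisation and error analysis.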
Where Pith is reading between the lines
- Similar orthography-preserving normalizers may be needed for other languages with diacritics or mutation rules before their ASR benchmarks can be trusted.
- The generalization gap observed here suggests that low-resource language evaluation should routinely include at least two independent test sets drawn from different sources.
- Releasing both the normalizer code and the raw predictions allows future researchers to test new models or normalizer variants without re-collecting data.
Load-bearing premise
The custom normalizer fully captures all relevant Irish orthographic rules and the two chosen corpora are representative enough to support claims about generalization.
What would settle it
A manual audit of the normalizer output on held-out Irish text that reveals systematic errors in handling eclipsis or lenition, or a third Irish speech corpus on which the reported 33-43 point degradation for fine-tuned models disappears.
read the original abstract
Existing multilingual benchmarks include Irish among dozens of languages but apply no Irish-aware text normalisation, making reliable and reproducible ASR comparison impossible. We introduce BlasBench, an open evaluation harness that provides a standalone Irish-aware normaliser preserving fadas, lenition, and eclipsis; a reproducible scoring harness; and per-utterance predictions released for all evaluated runs. We pilot this by benchmarking 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. All Whisper variants exceed 100% WER through insertion-driven hallucination. Microsoft Azure reaches 22.2% WER on Common Voice and 57.5% on FLEURS; the best open model, Omnilingual ASR 7B, reaches 30.65% and 39.09% respectively. Models fine-tuned on Common Voice degrade 33-43 points moving to FLEURS, while massively multilingual models degrade only 7-10 points - a generalisation gap that single-dataset evaluation misses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BlasBench, an open evaluation harness for Irish ASR that supplies a standalone Irish-aware text normalizer preserving fadas, lenition, and eclipsis, together with a reproducible scoring pipeline and released per-utterance predictions. It benchmarks 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE, reporting concrete WERs (Azure 22.2% / 57.5%, best open model Omnilingual ASR 7B at 30.65% / 39.09%) and a generalization gap in which fine-tuned models degrade 33–43 points while massively multilingual models degrade only 7–10 points; all Whisper variants exceed 100% WER via insertion-driven hallucination.
Significance. If the normalizer is shown to be correct and complete, BlasBench would supply a much-needed reproducible resource for Irish ASR that current multilingual benchmarks lack. The public release of the harness, normalizer code, and per-utterance predictions is a clear strength that directly supports the reproducibility claim. The reported generalization gap supplies a concrete, falsifiable observation that single-dataset evaluation misses and that future low-resource ASR work can test.
major comments (2)
- [§3] §3 (Normalizer): the manuscript presents the custom normalizer as the load-bearing component that makes all WER numbers reliable and reproducible, yet supplies neither an exhaustive rule list, concrete edge-case examples (dialectal mutations, compound-word handling, punctuation interactions), nor any quantitative validation (e.g., agreement with native-speaker gold normalizations or inter-annotator metrics). Without this, the central claim that observed WER differences reflect model behavior rather than normalization artifacts cannot be evaluated.
- [§4–5] §4–5 (Results): the assertion that Whisper variants exceed 100% WER “through insertion-driven hallucination” is stated without error analysis, utterance-level examples, or breakdown of insertion versus substitution rates. This detail is required to substantiate the architectural-family comparison and to allow readers to judge whether the failure mode is systematic or dataset-specific.
minor comments (2)
- [Abstract] The abstract states “four architecture families” but does not enumerate them; the methods section should list the families explicitly for immediate readability.
- [Results table] Table 1 (or equivalent results table) reports point WER estimates without confidence intervals or statistical tests; adding these would strengthen the generalization-gap claim.
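On the second minor comment: utterance-level bootstrap resampling in the style of Bisani and Ney [3] would supply such intervals directly from the released per-utterance predictions. A hedged sketch, assuming per-utterance error counts and reference lengths are available (names and data here are illustrative):

```python
import random

def bootstrap_wer_ci(errors, ref_lens, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for corpus-level WER.
    errors[i]  -- word-level edit-distance error count for utterance i
    ref_lens[i] -- reference length of utterance i
    Corpus WER is sum(errors) / sum(ref_lens); we resample utterances
    with replacement and take the empirical alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(errors)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(errors[i] for i in idx) / sum(ref_lens[i] for i in idx))
    stats.sort()
    low = stats[int(alpha / 2 * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Illustrative usage on made-up per-utterance counts:
low, high = bootstrap_wer_ci([3, 1, 0, 5, 2], [12, 8, 9, 11, 10], n_boot=1000)
```

Resampling whole utterances (rather than words) preserves within-utterance error correlation, which is the standard caveat for ASR significance testing.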
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the specific revisions we will incorporate to strengthen the presentation of the normalizer and the error analysis.
read point-by-point responses
-
Referee: [§3] §3 (Normalizer): the manuscript presents the custom normalizer as the load-bearing component that makes all WER numbers reliable and reproducible, yet supplies neither an exhaustive rule list, concrete edge-case examples (dialectal mutations, compound-word handling, punctuation interactions), nor any quantitative validation (e.g., agreement with native-speaker gold normalizations or inter-annotator metrics). Without this, the central claim that observed WER differences reflect model behavior rather than normalization artifacts cannot be evaluated.
Authors: We agree that the current description of the normalizer in §3 is insufficient for full reproducibility and independent verification. In the revised manuscript we will expand this section to provide an exhaustive enumerated list of all normalization rules, multiple concrete examples addressing dialectal mutations, compound-word handling, and punctuation interactions, and quantitative validation results including agreement metrics with native-speaker gold normalizations and inter-annotator agreement scores. These additions will allow readers to confirm that the reported WER differences arise from model behavior rather than normalization artifacts. revision: yes
-
Referee: [§4–5] §4–5 (Results): the assertion that Whisper variants exceed 100% WER “through insertion-driven hallucination” is stated without error analysis, utterance-level examples, or breakdown of insertion versus substitution rates. This detail is required to substantiate the architectural-family comparison and to allow readers to judge whether the failure mode is systematic or dataset-specific.
Authors: We acknowledge that the manuscript currently states the >100% WER observation for Whisper variants without supporting error analysis. In the revised version we will add a dedicated error-analysis subsection (or appendix) that reports per-model insertion, substitution, and deletion rates, supplies representative utterance-level examples of the insertion-driven hallucinations, and discusses whether the pattern appears systematic across both Common Voice and FLEURS. This will strengthen the architectural-family comparison and enable readers to assess the generality of the failure mode. revision: yes
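The insertion/substitution/deletion rates promised here fall out of a backtrace over the same dynamic-programming table used for WER. A hedged sketch of one way to compute them (illustrative, not the authors' released code; ties are broken substitution-first):

```python
def edit_ops(ref, hyp):
    """Count (substitutions, deletions, insertions) from one optimal
    word-level Levenshtein alignment of hypothesis against reference."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:  # walk back through one optimal alignment
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins
```

A hallucinating system would show the insertion count dominating this triple, which is exactly the evidence the referee asks for to separate insertion-driven failure from ordinary misrecognition.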
Circularity Check
No significant circularity; empirical benchmark with independent components
full rationale
The paper presents an open evaluation harness, a custom normalizer, and empirical WER results on public datasets (Common Voice ga-IE, FLEURS ga-IE). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The normalizer is introduced as a standalone contribution rather than derived from prior results by the same authors. All reported numbers are direct measurements, not reductions to inputs by construction. This matches the default expectation for non-circular empirical benchmark papers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: word error rate remains a valid primary metric once text is properly normalized for Irish orthography.
Reference graph
Works this paper leans on
-
[1]
WER We Stand: Benchmarking Urdu ASR Models
Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, and Awais Athar. WER We Stand: Benchmarking Urdu ASR Models. In Proc. COLING, pages 5952--5961, 2025. arXiv:2409.11252
-
[2]
Claude Opus 4.6 (Models overview)
Anthropic. Claude Opus 4.6 (Models overview). https://docs.anthropic.com/en/docs/about-claude/models, 2026
-
[3]
Bootstrap estimates for confidence intervals in ASR performance evaluation
Maximilian Bisani and Hermann Ney. Bootstrap estimates for confidence intervals in ASR performance evaluation. In Proc. ICASSP, 2004
-
[4]
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A Massively-Multilingual Speech Corpus. In Proc. LREC, 2020. arXiv:1912.06670
-
[5]
How I Built ASR for Endangered Languages with a Spoken Dictionary
Christopher Bartley and Anton Ragni. How I Built ASR for Endangered Languages with a Spoken Dictionary. arXiv:2510.04832, 2025
-
[6]
gaBERT -- an Irish Language Model
James Barry, Joachim Wagner, Lauren Cassidy, Alan Cowap, Teresa Lynn, Abigail Walsh, Mícheál J. Ó Meachair, and Jennifer Foster. gaBERT -- an Irish Language Model. In Proc. LREC, pages 4774--4788, 2022. arXiv:2107.12930
-
[7]
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. In Proc. SLT, 2022. arXiv:2205.12446
-
[8]
XTREME-S: Evaluating Cross-lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, and Melvin Johnson. XTREME-S: Evaluating Cross-lingual Speech Representations. In Proc. Interspeech, 2022
-
[9]
S. Faste. Wav2Vec 2.0 for Irish ASR: A Multilingual Approach to Under-Resourced Languages. MSc thesis, University of Groningen, Campus Fryslân, 2022. https://campus-fryslan.studenttheses.ub.rug.nl/234/
-
[10]
Larry Gillick and Stephen J. Cox. Some Statistical Issues in the Comparison of Speech Recognition Algorithms. In Proc. ICASSP, pages 532--535, 1989
-
[11]
Development and Evaluation of Speech Recognition for the Welsh Language
Dewi Jones. Development and Evaluation of Speech Recognition for the Welsh Language. In Proc. CLTW, 2022
-
[12]
autoresearch: AI agents running research on single-GPU nanochat training automatically
Andrej Karpathy. autoresearch: AI agents running research on single-GPU nanochat training automatically. https://github.com/karpathy/autoresearch, 2026
-
[13]
Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sadagopan, Safiyyah Salee...
-
[14]
Ondřej Klejch, William Lamb, and Peter Bell. A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic. In Proc. Interspeech, 2025. arXiv:2506.04915
-
[15]
Automatic Speech Recognition for Irish: the ABAIR-ÉIST System
Liam Lonergan, Mengjie Qian, Harald Berthelsen, Andy Murphy, Christoph Wendler, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Automatic Speech Recognition for Irish: the ABAIR-ÉIST System. In Proc. CLTW, pages 47--51, 2022
-
[16]
Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish. In Proc. Interspeech, pages 4865--4869, 2022. doi:10.21437/Interspeech.2022-838
-
[17]
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Towards Dialect-inclusive Recognition in a Low-resource Language: Are Balanced Corpora the Answer? In Proc. Interspeech, pages 5082--5086, 2023. arXiv:2307.07295
-
[18]
Towards Spoken Dialect Identification of Irish
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Towards Spoken Dialect Identification of Irish. In Proc. SIGUL, pages 63--67, 2023. arXiv:2307.07436
-
[19]
Low-resource speech recognition and dialect identification of Irish in a multi-task framework
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Low-resource speech recognition and dialect identification of Irish in a multi-task framework. In Proc. Odyssey, pages 67--73, 2024. arXiv:2405.01293
-
[20]
Fotheidil: an Automatic Transcription System for the Irish Language
Liam Lonergan, Ibon Saratxaga, John Sloan, Oscar Maharg Bravo, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Fotheidil: an Automatic Transcription System for the Irish Language. In Proc. CLTW, pages 35--45, 2025. arXiv:2501.00509
-
[21]
Kavya Manohar, Leena G. Pillai, and Elizabeth Sherly. What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations. In Proc. EMNLP, 2024. arXiv:2409.02449
-
[22]
FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, and Michiel Bacchiani. FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks. In Proc. Interspeech, 2024. arXiv:2408.06227
-
[23]
Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation
Yasmin Moslem. Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation. In Proc. IWSLT, 2024. arXiv:2406.17363
-
[24]
Findings of the IWSLT 2024 Evaluation Campaign
Ibrahim Said Ahmad et al. Findings of the IWSLT 2024 Evaluation Campaign. In Proc. IWSLT, 2024. arXiv:2411.05088
-
[25]
Yifan Peng, Yui Sudo, Muhammad Shakeel, and Shinji Watanabe. OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification. In Proc. ACL, 2024. arXiv:2402.12654
-
[26]
Scaling Speech Technology to 1,000+ Languages
Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling Speech Technology to 1,000+ Languages. arXiv:2305.13516, 2023
-
[27]
Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, and Boris Ginsburg. Less is More: Accurate Speech Recognition & Translation without Web-Scale Data. In Proc. Interspeech, 2024. arXiv:2406.19674
-
[28]
Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, and Mark J. F. Gales. Learn and Don't Forget: Adding a New Language to ASR Foundation Models. In Proc. Interspeech, 2024. arXiv:2407.06800
-
[29]
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust Speech Recognition via Large-Scale Weak Supervision. In Proc. ICML, pages 28492--28518, 2023. arXiv:2212.04356
-
[30]
Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, and Shinji Watanabe. ML-SUPERB: Multilingual Speech Universal PERformance Benchmark. In Proc. Interspeech, 2023. arXiv:2305.10615
-
[31]
Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, and Sanchit Gandhi. Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation. arXiv:2510.06961, 2025
-
[32]
Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Franço...