BlasBench: An Open Benchmark for Irish Speech Recognition
Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3
The pith
BlasBench supplies an Irish-aware normalizer and scoring harness that make ASR comparison for the language reliable and expose a large generalization gap between fine-tuned and multilingual models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BlasBench demonstrates that an Irish-aware normalizer preserving fadas, lenition, and eclipsis is required for valid ASR evaluation. With it in place, all Whisper variants exceed 100 percent WER through insertion-driven hallucination; Microsoft Azure reaches 22.2 percent WER on Common Voice and 57.5 percent on FLEURS; the best open model reaches 30.65 percent and 39.09 percent respectively; and fine-tuned systems degrade far more than multilingual ones when moving between the two corpora.
What carries the argument
BlasBench, an open evaluation harness built around a standalone Irish-aware normaliser that preserves fadas, lenition, and eclipsis and supplies reproducible scoring with released per-utterance predictions.
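The released normaliser itself is not reproduced in this review. As a hedged illustration of the behaviour described (lowercasing that keeps fadas, and punctuation stripping that leaves lenition spellings such as "bhfuil" and hyphenated prefixes such as "t-éan" intact), a minimal sketch might look like:

```python
import re
import unicodedata

def normalize_ga(text: str) -> str:
    """Illustrative Irish-aware normalisation (not BlasBench's code):
    lowercase while preserving fadas (á, é, í, ó, ú), and strip only
    punctuation, so lenition (e.g. 'bhuail') and eclipsis/prothesis
    markers (e.g. 'bhfuil', 't-éan', 'n-úll') survive scoring."""
    text = unicodedata.normalize("NFC", text)  # compose 'é' rather than 'e' + combining acute
    text = text.lower()                        # str.lower() keeps the fada
    # Keep letters, digits, whitespace, and the hyphen used by mutation
    # prefixes; replace everything else with a space.
    text = re.sub(r"[^\w\s\-]", " ", text)
    return " ".join(text.split())

print(normalize_ga("Tá an t-éan sa ghairdín."))  # -> tá an t-éan sa ghairdín
```

A real normaliser would also need number expansion and dialect-sensitive rules; this sketch only shows why naive ASCII-folding normalisers destroy the orthographic distinctions the paper argues are load-bearing.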
If this is right
- Single-dataset leaderboards for Irish ASR become unreliable and must be replaced by multi-corpus evaluation.
- Fine-tuning on one Irish resource produces brittle models that fail on new domains or recording conditions.
- Massively multilingual pre-training confers measurable robustness advantages for Irish that single-language fine-tuning does not.
- Hallucination-driven insertions in Whisper models render their WER scores uninterpretable without an Irish-aware normalizer.
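The last bullet follows from the arithmetic of WER: insertions add to the numerator while the denominator stays fixed at the reference length, so scores above 100 percent are possible. A minimal self-contained illustration (not the paper's scoring harness):

```python
def wer(ref_words, hyp_words):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed as word-level Levenshtein distance over the reference length.
    Because insertions are charged against a fixed denominator, a
    hallucinating model that emits extra words can exceed 100%."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]),
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[n][m] / n

ref = "tá sé go maith".split()
hyp = "tá tá sé sé go go maith maith anois".split()  # repetition-style hallucination
print(f"{wer(ref, hyp):.0%}")  # -> 125%
```

Five inserted words against a four-word reference already yield 125% WER, which is why the review treats raw Whisper scores as uninterpretable without normalisation and error analysis.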
Where Pith is reading between the lines
- Similar orthography-preserving normalizers may be needed for other languages with diacritics or mutation rules before their ASR benchmarks can be trusted.
- The generalization gap observed here suggests that low-resource language evaluation should routinely include at least two independent test sets drawn from different sources.
- Releasing both the normalizer code and the raw predictions allows future researchers to test new models or normalizer variants without re-collecting data.
Load-bearing premise
The custom normalizer fully captures all relevant Irish orthographic rules and the two chosen corpora are representative enough to support claims about generalization.
What would settle it
A manual audit of the normalizer output on held-out Irish text that reveals systematic errors in handling eclipsis or lenition, or a third Irish speech corpus on which the reported 33-43 point degradation for fine-tuned models disappears.
read the original abstract
Existing multilingual benchmarks include Irish among dozens of languages but apply no Irish-aware text normalisation, making reliable and reproducible ASR comparison impossible. We introduce BlasBench, an open evaluation harness that provides a standalone Irish-aware normaliser preserving fadas, lenition, and eclipsis; a reproducible scoring harness; and per-utterance predictions released for all evaluated runs. We pilot this by benchmarking 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. All Whisper variants exceed 100% WER through insertion-driven hallucination. Microsoft Azure reaches 22.2% WER on Common Voice and 57.5% on FLEURS; the best open model, Omnilingual ASR 7B, reaches 30.65% and 39.09% respectively. Models fine-tuned on Common Voice degrade 33-43 points moving to FLEURS, while massively multilingual models degrade only 7-10 points - a generalisation gap that single-dataset evaluation misses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BlasBench, an open evaluation harness for Irish ASR that supplies a standalone Irish-aware text normalizer preserving fadas, lenition, and eclipsis, together with a reproducible scoring pipeline and released per-utterance predictions. It benchmarks 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE, reporting concrete WERs (Azure 22.2% / 57.5%, best open model Omnilingual ASR 7B at 30.65% / 39.09%) and a generalization gap in which fine-tuned models degrade 33–43 points while massively multilingual models degrade only 7–10 points; all Whisper variants exceed 100% WER via insertion-driven hallucination.
Significance. If the normalizer is shown to be correct and complete, BlasBench would supply a much-needed reproducible resource for Irish ASR that current multilingual benchmarks lack. The public release of the harness, normalizer code, and per-utterance predictions is a clear strength that directly supports the reproducibility claim. The reported generalization gap supplies a concrete, falsifiable observation that single-dataset evaluation misses and that future low-resource ASR work can test.
major comments (2)
- [§3] §3 (Normalizer): the manuscript presents the custom normalizer as the load-bearing component that makes all WER numbers reliable and reproducible, yet supplies neither an exhaustive rule list, concrete edge-case examples (dialectal mutations, compound-word handling, punctuation interactions), nor any quantitative validation (e.g., agreement with native-speaker gold normalizations or inter-annotator metrics). Without this, the central claim that observed WER differences reflect model behavior rather than normalization artifacts cannot be evaluated.
- [§4–5] §4–5 (Results): the assertion that Whisper variants exceed 100% WER “through insertion-driven hallucination” is stated without error analysis, utterance-level examples, or breakdown of insertion versus substitution rates. This detail is required to substantiate the architectural-family comparison and to allow readers to judge whether the failure mode is systematic or dataset-specific.
minor comments (2)
- [Abstract] The abstract states “four architecture families” but does not enumerate them; the methods section should list the families explicitly for immediate readability.
- [Results table] Table 1 (or equivalent results table) reports point WER estimates without confidence intervals or statistical tests; adding these would strengthen the generalization-gap claim.
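On the second minor comment: utterance-level bootstrap resampling in the style of Bisani and Ney [3] would supply such intervals directly from the released per-utterance predictions. A hedged sketch, assuming per-utterance error counts and reference lengths are available (names and data here are illustrative):

```python
import random

def bootstrap_wer_ci(errors, ref_lens, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for corpus-level WER.
    errors[i]  -- word-level edit-distance error count for utterance i
    ref_lens[i] -- reference length of utterance i
    Corpus WER is sum(errors) / sum(ref_lens); we resample utterances
    with replacement and take the empirical alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(errors)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(errors[i] for i in idx) / sum(ref_lens[i] for i in idx))
    stats.sort()
    low = stats[int(alpha / 2 * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Illustrative usage on made-up per-utterance counts:
low, high = bootstrap_wer_ci([3, 1, 0, 5, 2], [12, 8, 9, 11, 10], n_boot=1000)
```

Resampling whole utterances (rather than words) preserves within-utterance error correlation, which is the standard caveat for ASR significance testing.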
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the specific revisions we will incorporate to strengthen the presentation of the normalizer and the error analysis.
read point-by-point responses
-
Referee: [§3] §3 (Normalizer): the manuscript presents the custom normalizer as the load-bearing component that makes all WER numbers reliable and reproducible, yet supplies neither an exhaustive rule list, concrete edge-case examples (dialectal mutations, compound-word handling, punctuation interactions), nor any quantitative validation (e.g., agreement with native-speaker gold normalizations or inter-annotator metrics). Without this, the central claim that observed WER differences reflect model behavior rather than normalization artifacts cannot be evaluated.
Authors: We agree that the current description of the normalizer in §3 is insufficient for full reproducibility and independent verification. In the revised manuscript we will expand this section to provide an exhaustive enumerated list of all normalization rules, multiple concrete examples addressing dialectal mutations, compound-word handling, and punctuation interactions, and quantitative validation results including agreement metrics with native-speaker gold normalizations and inter-annotator agreement scores. These additions will allow readers to confirm that the reported WER differences arise from model behavior rather than normalization artifacts. revision: yes
-
Referee: [§4–5] §4–5 (Results): the assertion that Whisper variants exceed 100% WER “through insertion-driven hallucination” is stated without error analysis, utterance-level examples, or breakdown of insertion versus substitution rates. This detail is required to substantiate the architectural-family comparison and to allow readers to judge whether the failure mode is systematic or dataset-specific.
Authors: We acknowledge that the manuscript currently states the >100% WER observation for Whisper variants without supporting error analysis. In the revised version we will add a dedicated error-analysis subsection (or appendix) that reports per-model insertion, substitution, and deletion rates, supplies representative utterance-level examples of the insertion-driven hallucinations, and discusses whether the pattern appears systematic across both Common Voice and FLEURS. This will strengthen the architectural-family comparison and enable readers to assess the generality of the failure mode. revision: yes
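The insertion/substitution/deletion rates promised here fall out of a backtrace over the same dynamic-programming table used for WER. A hedged sketch of one way to compute them (illustrative, not the authors' released code; ties are broken substitution-first):

```python
def edit_ops(ref, hyp):
    """Count (substitutions, deletions, insertions) from one optimal
    word-level Levenshtein alignment of hypothesis against reference."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:  # walk back through one optimal alignment
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins
```

A hallucinating system would show the insertion count dominating this triple, which is exactly the evidence the referee asks for to separate insertion-driven failure from ordinary misrecognition.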
Circularity Check
No significant circularity; empirical benchmark with independent components
full rationale
The paper presents an open evaluation harness, a custom normalizer, and empirical WER results on public datasets (Common Voice ga-IE, FLEURS ga-IE). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The normalizer is introduced as a standalone contribution rather than derived from prior results by the same authors. All reported numbers are direct measurements, not reductions to inputs by construction. This matches the default expectation for non-circular empirical benchmark papers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: word error rate remains a valid primary metric once text is properly normalized for Irish orthography.
Reference graph
Works this paper leans on
-
[1]
WER We Stand: Benchmarking Urdu ASR Models
Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, and Awais Athar. WER We Stand: Benchmarking Urdu ASR Models. In Proc. COLING, pages 5952--5961, 2025. arXiv:2409.11252
-
[2]
Claude Opus 4.6 (Models overview)
Anthropic. Claude Opus 4.6 (Models overview). https://docs.anthropic.com/en/docs/about-claude/models, 2026
-
[3]
Bootstrap estimates for confidence intervals in ASR performance evaluation
Maximilian Bisani and Hermann Ney. Bootstrap estimates for confidence intervals in ASR performance evaluation. In Proc. ICASSP, 2004
-
[4]
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A Massively-Multilingual Speech Corpus. In Proc. LREC, 2020. arXiv:1912.06670
-
[5]
How I Built ASR for Endangered Languages with a Spoken Dictionary
Christopher Bartley and Anton Ragni. How I Built ASR for Endangered Languages with a Spoken Dictionary. arXiv:2510.04832, 2025
-
[6]
gaBERT -- an Irish Language Model
James Barry, Joachim Wagner, Lauren Cassidy, Alan Cowap, Teresa Lynn, Abigail Walsh, Mícheál J. Ó Meachair, and Jennifer Foster. gaBERT -- an Irish Language Model. In Proc. LREC, pages 4774--4788, 2022. arXiv:2107.12930
-
[7]
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. In Proc. SLT, 2022. arXiv:2205.12446
-
[8]
XTREME-S: Evaluating Cross-lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, and Melvin Johnson. XTREME-S: Evaluating Cross-lingual Speech Representations. In Proc. Interspeech, 2022
-
[9]
S. Faste. Wav2Vec 2.0 for Irish ASR: A Multilingual Approach to Under-Resourced Languages. MSc thesis, University of Groningen, Campus Fryslân, 2022. https://campus-fryslan.studenttheses.ub.rug.nl/234/
-
[10]
Larry Gillick and Stephen J. Cox. Some Statistical Issues in the Comparison of Speech Recognition Algorithms. In Proc. ICASSP, pages 532--535, 1989
-
[11]
Development and Evaluation of Speech Recognition for the Welsh Language
Dewi Jones. Development and Evaluation of Speech Recognition for the Welsh Language. In Proc. CLTW, 2022
-
[12]
autoresearch: AI agents running research on single-GPU nanochat training automatically
Andrej Karpathy. autoresearch: AI agents running research on single-GPU nanochat training automatically. https://github.com/karpathy/autoresearch, 2026
-
[13]
Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sadagopan, Safiyyah Salee...
-
[14]
Ondřej Klejch, William Lamb, and Peter Bell. A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic. In Proc. Interspeech, 2025. arXiv:2506.04915
-
[15]
Automatic Speech Recognition for Irish: the ABAIR-ÉIST System
Liam Lonergan, Mengjie Qian, Harald Berthelsen, Andy Murphy, Christoph Wendler, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Automatic Speech Recognition for Irish: the ABAIR-ÉIST System. In Proc. CLTW, pages 47--51, 2022
-
[16]
Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish. In Proc. Interspeech, pages 4865--4869, 2022. doi:10.21437/Interspeech.2022-838
-
[17]
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Towards Dialect-inclusive Recognition in a Low-resource Language: Are Balanced Corpora the Answer? In Proc. Interspeech, pages 5082--5086, 2023. arXiv:2307.07295
-
[18]
Towards Spoken Dialect Identification of Irish
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Towards Spoken Dialect Identification of Irish. In Proc. SIGUL, pages 63--67, 2023. arXiv:2307.07436
-
[19]
Low-resource speech recognition and dialect identification of Irish in a multi-task framework
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Low-resource speech recognition and dialect identification of Irish in a multi-task framework. In Proc. Odyssey, pages 67--73, 2024. arXiv:2405.01293
-
[20]
Fotheidil: an Automatic Transcription System for the Irish Language
Liam Lonergan, Ibon Saratxaga, John Sloan, Oscar Maharg Bravo, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. Fotheidil: an Automatic Transcription System for the Irish Language. In Proc. CLTW, pages 35--45, 2025. arXiv:2501.00509
-
[21]
Kavya Manohar, Leena G. Pillai, and Elizabeth Sherly. What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations. In Proc. EMNLP, 2024. arXiv:2409.02449
-
[22]
FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, and Michiel Bacchiani. FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks. In Proc. Interspeech, 2024. arXiv:2408.06227
-
[23]
Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation
Yasmin Moslem. Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation. In Proc. IWSLT, 2024. arXiv:2406.17363
-
[24]
Findings of the IWSLT 2024 Evaluation Campaign
Ibrahim Said Ahmad et al. Findings of the IWSLT 2024 Evaluation Campaign. In Proc. IWSLT, 2024. arXiv:2411.05088
-
[25]
Yifan Peng, Yui Sudo, Muhammad Shakeel, and Shinji Watanabe. OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification. In Proc. ACL, 2024. arXiv:2402.12654
-
[26]
Scaling Speech Technology to 1,000+ Languages
Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Scaling Speech Technology to 1,000+ Languages. arXiv:2305.13516, 2023
-
[27]
Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, and Boris Ginsburg. Less is More: Accurate Speech Recognition & Translation without Web-Scale Data. In Proc. Interspeech, 2024. arXiv:2406.19674
-
[28]
Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, and Mark J. F. Gales. Learn and Don't Forget: Adding a New Language to ASR Foundation Models. In Proc. Interspeech, 2024. arXiv:2407.06800
-
[29]
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust Speech Recognition via Large-Scale Weak Supervision. In Proc. ICML, pages 28492--28518, 2023. arXiv:2212.04356
-
[30]
Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, and Shinji Watanabe. ML-SUPERB: Multilingual Speech Universal PERformance Benchmark. In Proc. Interspeech, 2023. arXiv:2305.10615
-
[31]
Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, and Sanchit Gandhi. Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation. arXiv:2510.06961, 2025
-
[32]
Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Franço...