arxiv: 2606.02375 · v1 · pith:OOOAQ2DEnew · submitted 2026-06-01 · 💻 cs.CL · cs.CY · cs.HC

WAXAL-NET: Finetuned Edge ASR Across 19 African Languages

Pith reviewed 2026-06-28 14:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.CY · cs.HC
keywords automatic speech recognition · African languages · fine-tuning · edge models · word error rate · WAXAL corpus · low-resource ASR · spontaneous speech

The pith

Fine-tuned compact ASR models achieve 38% WER on 19 African languages, beating 65% for much larger zero-shot baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates whether small domain-specialized automatic speech recognition models can outperform large multilingual foundation models on conversational speech from the WAXAL corpus across 19 African languages. Fine-tuned edge models reach a macro-averaged word error rate of 38.0 percent compared with 64.9 percent for the strongest zero-shot baseline, using models three to forty times smaller. This outcome indicates that adaptation to the target domain and languages outweighs model scale for spontaneous African speech. The work also examines cross-domain performance, provides a native-speaker error taxonomy, notes limitations of WER for syllabary scripts, and releases models, scripts, and data.

Core claim

Fine-tuned edge models achieve a macro-averaged WER of 38.0% compared to 64.9% for the best zero-shot baseline, a 26.9 percentage-point reduction using models 3-40× smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution.
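
To make the headline arithmetic concrete, the sketch below computes word error rate from word-level edit distance, macro-averages it over languages, and separates the 26.9 percentage-point gap from the relative reduction it implies. The function names (`edit_distance`, `wer`, `macro_wer`) and the two-language toy transcripts are illustrative assumptions, not the paper's released evaluation scripts. Only the 38.0% and 64.9% figures come from the paper, and the paper may normalise text differently before scoring.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(pairs: List[Tuple[str, str]]) -> float:
    """WER in percent over (reference, hypothesis) pairs for one language."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return 100.0 * errors / words


def macro_wer(per_language: Dict[str, List[Tuple[str, str]]]) -> float:
    """Unweighted mean of per-language WER (each language counts equally)."""
    return sum(wer(p) for p in per_language.values()) / len(per_language)


# Toy transcripts (made up for illustration, not WAXAL data).
toy = {
    "lang_a": [("a b c d", "a b x d")],                      # 1 error / 4 words
    "lang_b": [("e f", "e g"),
               ("h i j k l m n o", "h i j k l m n o")],      # 1 error / 10 words
}
toy_macro = macro_wer(toy)

# Headline figures as reported in the paper.
REPORTED_FINE_TUNED = 38.0
REPORTED_ZERO_SHOT = 64.9
absolute_reduction = REPORTED_ZERO_SHOT - REPORTED_FINE_TUNED           # percentage points
relative_reduction = 100.0 * absolute_reduction / REPORTED_ZERO_SHOT    # percent

print(f"toy macro WER: {toy_macro:.1f}%")
print(f"absolute reduction: {absolute_reduction:.1f} pp")
print(f"relative reduction: {relative_reduction:.1f}%")
```

Macro-averaging weights each language equally regardless of test-set size, so the toy macro score (17.5%) differs from the pooled score over all 14 words (about 14.3%). The relative reduction of roughly 41% is derived here from the two reported figures and is not a number the paper states.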

What carries the argument

Fine-tuning of compact edge ASR models on the WAXAL corpus for spontaneous conversational speech in 19 African languages.

If this is right

  • Fine-tuned models recover usable performance on out-of-distribution speech.
  • Zero-shot models regain an advantage when the test domain matches their pretraining distribution.
  • CTC and autoregressive architectures behave differently across language families according to the native-speaker error taxonomy.
  • WER alone misrepresents performance for syllabary-script languages, where CER/WER ratios show higher character-level accuracy.
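
The last point above can be illustrated with a small sketch: when each misrecognised word differs from its reference by a single character, WER charges the whole word while CER charges one character. The placeholder strings, the `error_rate` helper, and the choice to count spaces as characters are assumptions made for illustration; the summary does not specify how the paper computes its CER/WER ratios.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance as a percentage of reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Placeholder strings (not real transcripts): two of three words
# each differ from the reference by one character.
reference = "abcdef ghijkl mnopqr"
hypothesis = "abcdeX ghijkl mnopqZ"

word_error = error_rate(reference.split(), hypothesis.split())  # tokens are words
char_error = error_rate(list(reference), list(hypothesis))      # tokens are characters, spaces included
ratio = char_error / word_error

print(f"WER {word_error:.1f}%  CER {char_error:.1f}%  CER/WER {ratio:.2f}")
```

In this toy case WER is about 66.7% while CER is 10.0%, a CER/WER ratio of 0.15. A low ratio of this kind signals that most word errors are near misses at the character level, which a WER-only headline would hide.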

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The released model weights and scripts enable direct replication and incremental improvement by other researchers working on the same languages.
  • The observed advantage of domain specialization may extend to other low-resource conversational speech settings beyond the 19 languages tested.
  • Edge deployment of the smaller fine-tuned models becomes more practical in regions with limited compute infrastructure.

Load-bearing premise

The WAXAL corpus and its test splits accurately represent spontaneous conversational speech across the 19 languages, and the zero-shot baselines represent the strongest possible performance without any fine-tuning on WAXAL data.

What would settle it

A new zero-shot model evaluated on the WAXAL test splits without exposure to any WAXAL training data achieving a macro-averaged WER below 38%.

Figures

Figures reproduced from arXiv: 2606.02375 by Akebert Arefaine, Athanase Bahizire, Bolarinwa Gbotemi, Candace Hunzwi, Cleophas Kadima, Emmanuel Aaron, Emmilly Namuganga, Hondi Prisca Birindwa, Idris Muhammed, Innocent Elendu Anyaele, John Uzodinma, Jonathan Enoch Simenya, Kausar Moshood, Martin Koome, Matewos Tegete Endaylalu, Mikel K. Ngueajio, Nicholaus Ladislaus, Oluwademilade Williams, Onitsiky Ranaivoson, Oreoluwa Babatunde, Pericles Adjovi, Peter Ifeoluwa Adeyemo, Prasenjit Mitra, Ramsey Njema, Sunday Ajayi, Toluwani Aremu, Ukachi Agnes Eze-Mbey, Victor Tolulope Olufemi, Wanchi Lucia Yen, Wongel Dawit Daniel, Yacoba Oduro-Yeboah.

Figure 1: Benchmarking pipeline. WAXAL is used with its original train/test split. After corpus cleaning and test-set filtering, fine-tuned edge models are trained and evaluated alongside zero-shot baselines. Results are assessed quantitatively and qualitatively through a native-speaker linguistic audit across 19 languages.
Figure 2: Word Error Rate (%) across 19 WAXAL languages for five of the six evaluated models: zero-shot baselines Omnilingual-1B and MMS-1B (light bars) vs. fine-tuned edge models MMS-300M, Whisper Small, and Whisper Tiny (dark bars). Whisper Large-v3 is omitted as it natively supports only 4 of the 19 languages; its results are reported in …
Figure 4: CTC vs. Autoregressive: Fine-Tuned CER Comparison. MMS-300M (CTC) achieves lower CER than Whisper Small (AR) in 17 of 19 languages, consistent with CTC’s tendency toward phonetic precision, though Whisper’s AR prior provides advantages for morphologically complex languages. … Whisper Small (AR) on Character Error Rate, consistent with CTC’s tendency toward acoustic precision (K et al., 2025) (see …
Original abstract

We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of $38.0\%$ compared to $64.9\%$ for the best zero-shot baseline, a $26.9$ percentage-point reduction using models $3-40\times$ smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript evaluates whether compact, domain-specialized ASR models can outperform large multilingual foundation models on spontaneous conversational speech across 19 African languages in the WAXAL corpus. It reports that fine-tuned edge models achieve a macro-averaged WER of 38.0% versus 64.9% for the best zero-shot baseline (a 26.9-point reduction) while using models 3-40× smaller, concluding that domain specialization dominates scale. Additional results cover cross-domain OOD performance, a native-speaker error taxonomy distinguishing CTC and autoregressive behavior, and the observation that WER misrepresents accuracy for syllabary-script languages (via CER/WER ratios). The work releases all model weights, fine-tuning/evaluation scripts, and a cleaned WAXAL subset.

Significance. If the empirical comparisons hold, the results provide concrete evidence that targeted fine-tuning on domain-specific data can substantially outperform scale-driven zero-shot approaches for low-resource African languages. The explicit release of weights, scripts, and data subset, together with the distributed native-speaker audit, directly supports independent verification and future work, strengthening the contribution beyond typical empirical ASR papers.

minor comments (3)
  1. [Abstract] The headline WER figures would be more immediately interpretable if the abstract briefly noted the model families or parameter counts of the fine-tuned edge models alongside the 3-40× size claim.
  2. The cross-domain evaluation section would benefit from an explicit statement of how the OOD test domains were selected relative to the pretraining distributions of the zero-shot baselines.
  3. The table or figure presenting the CER/WER ratios for syllabary-script languages should include the number of languages and utterances involved, so that readers can gauge the scope of the misrepresentation claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation to accept. The review accurately captures the core claims, empirical results, and release artifacts.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports direct empirical measurements of WER from fine-tuning compact ASR models on the WAXAL corpus and evaluating on held-out splits, compared against zero-shot baselines. No equations, derivations, or predictions are present that reduce reported gains to quantities defined by fitted parameters or self-citations within the paper. The central claims rest on experimental results and released artifacts enabling external verification, with no load-bearing steps that collapse by construction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; no explicit free parameters, invented entities, or ad-hoc axioms are described beyond standard ASR evaluation assumptions.

axioms (1)
  • domain assumption: WER is an appropriate primary metric for comparing ASR systems on the WAXAL corpus.
    The paper uses macro-averaged WER as the headline result while separately noting its limitations for syllabary-script languages.

Reference graph

Works this paper leans on

26 extracted references · 8 canonical work pages

  1. [1] Robust Speech Recognition via Large-Scale Weak Supervision. Proceedings of the 40th International Conference on Machine Learning.

  2. [2] Scaling Speech Technology to 1,000+ Languages. Journal of Machine Learning Research.

  3. [3] Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages. 2025.

  4. [4] Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur. FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech.

  5. [5] XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. Proceedings of Interspeech.

  6. [6] Common Voice: A Massively-Multilingual Speech Corpus. Proceedings of the Twelfth Language Resources and Evaluation Conference.

  7. [7] AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR. Transactions of the Association for Computational Linguistics.

  8. [8] Vaessen, Nik. 2022.

  9. [9] Hallucination Benchmark for Speech Foundation Models. 2025.

  10. [10] Andrew Cameron Morris and Viktoria Maier and Phil Green. doi:10.21437/Interspeech.2004-668

  11. [11] Barański, Mateusz and Jasiński, Jan and Bartolewska, Julitta and Kacprzak, Stanisław and Witkowski, Marcin and Kowalczyk, Konrad. Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio.

  12. [12] Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models. Findings of the Association for Computational Linguistics: ACL 2025. doi:10.18653/v1/2025.findings-acl.1190

  13. [13] K, Thennal D and James, Jesin and Gopinath, Deepa Padmini and K, Muhammed Ashraf. Advocating Character Error Rate for Multilingual ASR Evaluation. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.277

  14. [14] Imam, Sukairaj Hafiz and Bello, Muhammad Yahuza and Umar, Hadiza Ali and Belay, Tadesse Destaw and Abdulmumin, Idris and Yimam, Seid Muhie and Muhammad, Shamsuddeen Hassan. Full Fine-Tuning vs. Parameter-Efficient Adaptation for Low-Resource African ASR: A Controlled Study with Whisper-Small. Proceedings of the 7th Workshop on African Natural Language ...

  15. [15] Exploration of … EURASIP Journal on Audio, Speech, and Music Processing. 2024. doi:10.1186/s13636-024-00349-3

  16. [16] Pillai, Leena G and Manohar, Kavya and Raju, Basil and Sherly, Elizabeth. ACM Trans. Asian Low-Resour. Lang. Inf. Process. May 2026. doi:10.1145/3813800

  17. [17] Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages. 2026.

  18. [18] Experiment Tracking with Weights and Biases.

  19. [19] Tests for rank correlation coefficients, I. Biometrika. 1957.

  20. [20] Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya. Proceedings of the 40th International Conference on Machine Learning.

  21. [21] Hey ASR system! Why aren’t you more inclusive? Automatic speech recognition systems’ bias and proposed bias mitigation techniques. A literature review. International conference on human-computer interaction. 2022.

  22. [22] Pratap, Vineel and Tjandra, Andros and Shi, Bowen and Tomasello, Paden and Babu, Arun and Kundu, Sayani and Elkahky, Ali and Ni, Zhaoheng and Vyas, Apoorv and Fazel-Zarandi, Maryam and Baevski, Alexei and Adi, Yossi and Zhang, Xiaohui and Hsu, Wei-Ning and Conneau, Alexis and Auli, Michael. Journal of Machine Learning Research.

  23. [23] Lutgen, Anne-Marie and Plum, Alistair and Purschke, Christoph and Plank, Barbara. Neural Text Normalization for Luxembourgish Using Real-Life Variation Data. Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects. 2025.

  24. [24] Diack, Abdoulaye and Nelson, Perry and Agbesi, Kwaku and Nakalembe, Angela and MohamedKhair, MohamedElfatih and Dube, Vusumuzi and Siyavora, Tavonga and Venugopalan, Subhashini and Hickey, Jason and Okonkwo, Uche and Bapna, Abhishek and Wiafe, Isaac and Helegah, Raynard Dodzi and Atsakpo, Elikem Doe and Nutrokpor, Charles and Winful, Fiifi Baffoe Payin an... arXiv preprint arXiv:2602.02734.

  25. [25] Afrispeech-200: Pan-african accented speech dataset for clinical and general domain asr. Transactions of the Association for Computational Linguistics. 2023.

  26. [26] Chris Emezue and … 2025. doi:10.21437/Interspeech.2025-1104