Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation
Pith reviewed 2026-05-19 22:01 UTC · model grok-4.3
The pith
Knowledge distillation from monolingual models followed by self-improvement on pseudo labels improves multilingual ASR for Nigerian languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present Sometin Beta Pass Notin (SBPN), a multilingual ASR model trained via student-teacher knowledge distillation conditioned on robust language-specific N-gram language models, followed by iterative self-improvement using pseudo-labelled data. This yields an average relative Word Error Rate reduction of 29% over monolingual baselines and outperforms state-of-the-art multilingual models on benchmarks like Common Voice and Fleurs. The framework covers Yoruba, Hausa, Igbo, Nigerian Pidgin, and Nigerian English, released in Base (120M) and Large (600M) parameter versions as open foundation models.
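To make the headline metric concrete, here is a minimal sketch of corpus-level WER and relative WER reduction. The function names and the numbers in the usage note are illustrative assumptions, not values or code from the paper. The abstract also does not say whether "on average" means the mean of per-language relative reductions or the relative reduction of a mean WER, and the two can differ.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer
```

With made-up numbers, a baseline WER of 0.40 and a new WER of 0.284 give `relative_wer_reduction(0.40, 0.284)` of about 0.29, which is the size of effect the paper reports on average.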
What carries the argument
Two-stage distillation process consisting of student-teacher knowledge distillation from monolingual models conditioned on N-gram language models, followed by iterative self-improvement using pseudo-labelled data.
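One common way to formalise a two-stage recipe of this kind is sketched below. It is an assumption, not the paper's stated objective: the abstract does not say whether the teachers supply hard transcripts or soft posteriors, nor how the N-gram LM is combined with the teacher. The symbols $p_{T_\ell}$ (teacher for language $\ell$), $p_{\mathrm{LM}_\ell}$ (N-gram LM), $p_S$ (student), the fusion weight $\lambda$, and the datasets $\mathcal{D}_\ell$ and $\mathcal{U}$ are introduced here for illustration only.

```latex
% Stage 1 (assumed form): each monolingual teacher decodes with N-gram LM
% fusion, and the multilingual student is trained on the resulting transcripts.
\hat{y}(x) = \arg\max_{y} \; \log p_{T_\ell}(y \mid x) + \lambda \log p_{\mathrm{LM}_\ell}(y)

\mathcal{L}_{\mathrm{KD}}(\theta) = - \sum_{\ell} \sum_{x \in \mathcal{D}_\ell}
    \log p_{S}\big(\hat{y}(x) \mid x; \theta\big)

% Stage 2 (assumed form): the student at round k relabels unlabelled audio U
% and is retrained on its own pseudo-labels.
\tilde{y}_k(x) = \arg\max_{y} \; \log p_{S}(y \mid x; \theta_k), \qquad
\theta_{k+1} = \arg\min_{\theta} \; - \sum_{x \in \mathcal{U}}
    \log p_{S}\big(\tilde{y}_k(x) \mid x; \theta\big)
```

Written this way, the dependence of stage 2 on the quality of $\tilde{y}_k$ is explicit: any systematic error in the pseudo-labels becomes a training target in the next round.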
If this is right
- SBPN models achieve better accuracy than previous monolingual and multilingual systems on Nigerian language benchmarks.
- The approach narrows the performance gap for low-resource languages that pose challenges such as tonal diacritics and code-switching.
- Open release of SBPN-Base and SBPN-Large enables further research into phonetic and cultural aspects of the region.
- Performance gains hold across major benchmarks including Common Voice and Fleurs.
Where Pith is reading between the lines
- This method might generalize to other low-resource language families facing similar data scarcity and orthographic issues.
- Future work could test whether combining this method with larger unlabeled audio corpora further reduces the need for manual transcription.
- Voice applications in healthcare or education in Nigeria could benefit from these improved recognition rates.
Load-bearing premise
The iterative self-improvement works only if the pseudo-labels from the first model are accurate enough to avoid compounding errors in later rounds.
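A common safeguard against compounding errors is to keep only pseudo-labels the model is confident about. The sketch below is a hypothetical filter, not the paper's procedure: the thresholds, the `token_log_probs` field, and both function names are assumptions, and the paper as summarized here does not describe its filtering at all.

```python
import math


def keep_pseudo_label(token_log_probs,
                      min_mean_log_prob=math.log(0.9),
                      min_token_prob=0.5):
    """Accept a hypothesis only if it is confident on average AND has no
    single very uncertain token (both thresholds are illustrative)."""
    if not token_log_probs:
        return False
    mean_lp = sum(token_log_probs) / len(token_log_probs)
    worst_lp = min(token_log_probs)
    return mean_lp >= min_mean_log_prob and worst_lp >= math.log(min_token_prob)


def filter_pseudo_labels(candidates, **thresholds):
    """candidates: list of dicts with 'text' and 'token_log_probs' keys."""
    return [c for c in candidates
            if keep_pseudo_label(c["token_log_probs"], **thresholds)]
```

The per-token floor matters for tonal languages: a single low-confidence token can be the one that carries the wrong diacritic, even when the utterance-level average looks fine.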
What would settle it
If running the second stage on the pseudo-labeled data raised word error rate on a held-out evaluation set, relative to stopping after the first stage, the claimed benefit of iterative refinement would be falsified.
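The check described here amounts to a stopping rule on held-out WER. A minimal sketch, assuming one held-out WER measurement after stage 1 and one after each self-improvement round (the function name and the numbers in the usage note are hypothetical):

```python
def select_round(stage1_wer, round_wers):
    """Return (best_round, best_wer), where round 0 means 'stop after stage 1'.

    Scanning stops at the first round whose held-out WER fails to improve,
    so a later lucky round cannot hide an earlier regression.
    """
    best_round, best_wer = 0, stage1_wer
    for k, round_wer in enumerate(round_wers, start=1):
        if round_wer >= best_wer:
            break
        best_round, best_wer = k, round_wer
    return best_round, best_wer
```

For example, `select_round(0.30, [0.28, 0.27, 0.29])` returns `(2, 0.27)`, while `select_round(0.30, [0.31, 0.25])` returns `(0, 0.30)`: in the second case the first refinement round already hurts, which is the outcome that would count against the iterative stage.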
Original abstract
Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present unique modelling hurdles, including acute data scarcity, inconsistent orthography, tonal diacritics, diverse accents, frequent code-switching, and localized named entities. To address these challenges, we developed a multilingual ASR framework utilizing a two-stage distillation process. First, we employ student-teacher knowledge distillation from existing monolingual models, conditioned on robust language-specific N-gram language models. Second, we perform iterative self-improvement using pseudo-labelled data to further refine accuracy. Our method significantly bridges the performance gap, achieving on average a relative Word Error Rate (WER) reduction of 29% over monolingual baselines. Our models also outperform state-of-the-art multilingual models across major benchmarks, including Common Voice and Fleurs. We introduce Sometin Beta Pass Notin (SBPN), a foundational multilingual ASR model covering Yorùbá, Hausa, Igbo, Nigerian Pidgin, and Nigerian English. SBPN is released in two sizes: SBPN-Base (120 M parameters) and SBPN-Large (600 M parameters). By releasing these as open foundation models, we aim to provide ASR resources for further research into the rich phonetic and cultural landscape of the region.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SBPN, a multilingual ASR framework for five Nigerian languages (Yorùbá, Hausa, Igbo, Nigerian Pidgin, Nigerian English) that employs a two-stage knowledge distillation process. The first stage distills from monolingual teacher models conditioned on language-specific N-gram LMs; the second stage performs iterative self-improvement on pseudo-labeled data. The central empirical claims are an average 29% relative WER reduction versus monolingual baselines and outperformance of existing SOTA multilingual models on Common Voice and Fleurs, with open release of 120 M and 600 M parameter models.
Significance. If the reported gains are robustly validated, the work would supply useful open foundation models for ASR in low-resource languages that exhibit tonal diacritics, code-switching, and orthographic inconsistency. The public release of both model sizes would enable downstream research on Nigerian-language speech technology.
major comments (2)
- [Experimental Evaluation] Experimental section: the headline 29 % relative WER reduction and the claim of outperforming SOTA multilingual models on Common Voice and Fleurs are presented without dataset sizes, exact baseline implementations, statistical significance tests, or error bars. These omissions make it impossible to determine whether the central performance claims are supported by the data or sensitive to experimental choices.
- [Method Description] Method section (iterative self-improvement stage): the performance delta is stated to depend on refinement using pseudo-labels generated by the first-stage model. No description of pseudo-label filtering, confidence thresholding, or an ablation that isolates the contribution of the iterative stage is provided. In the presence of tonal and code-switched data this assumption is load-bearing for the reported gains.
minor comments (2)
- [Abstract] The abstract introduces the acronym SBPN without explaining its etymology or intended meaning.
- [References] Ensure that all benchmark citations (Common Voice, Fleurs) are accompanied by complete references in the bibliography.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have reviewed the comments carefully and agree that additional experimental details and methodological clarifications will strengthen the manuscript. We address each major comment below and will incorporate the necessary revisions.
- Referee: [Experimental Evaluation] Experimental section: the headline 29 % relative WER reduction and the claim of outperforming SOTA multilingual models on Common Voice and Fleurs are presented without dataset sizes, exact baseline implementations, statistical significance tests, or error bars. These omissions make it impossible to determine whether the central performance claims are supported by the data or sensitive to experimental choices.
  Authors: We agree that the experimental section requires more detail to substantiate the central claims. In the revised manuscript, we will add the training and evaluation dataset sizes for each of the five languages, precise specifications of the monolingual teacher models (including architectures, pre-training data, and N-gram LM integration), and the exact configurations of the SOTA multilingual baselines evaluated on Common Voice and Fleurs. We will also report error bars from multiple random seeds and include statistical significance testing (e.g., paired bootstrap or t-tests) for the reported 29% relative WER reduction and for the comparisons against multilingual baselines.
  Revision: yes
- Referee: [Method Description] Method section (iterative self-improvement stage): the performance delta is stated to depend on refinement using pseudo-labels generated by the first-stage model. No description of pseudo-label filtering, confidence thresholding, or an ablation that isolates the contribution of the iterative stage is provided. In the presence of tonal and code-switched data this assumption is load-bearing for the reported gains.
  Authors: We acknowledge that the iterative self-improvement stage needs a more explicit description, especially given the tonal and code-switching characteristics of the languages. In the revised version, we will expand the method section to detail the pseudo-label generation process, including confidence thresholding (based on token-level probabilities and alignment scores) and any filtering steps to mitigate errors from tonal diacritics or code-switched segments. We will also add an ablation study comparing first-stage distillation performance against the full two-stage model to isolate the iterative stage's contribution.
  Revision: yes
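The paired bootstrap named in the first response can be sketched as follows. This is a generic illustration, not the authors' evaluation code: it assumes per-utterance word-edit counts are already available for two systems scored on the same test utterances, and the function and parameter names are hypothetical.

```python
import random


def paired_bootstrap(errors_a, errors_b, ref_lengths,
                     n_resamples=10_000, seed=0):
    """Fraction of bootstrap resamples in which system B has strictly lower
    corpus-level WER than system A.

    errors_a, errors_b: per-utterance word-edit counts for each system.
    ref_lengths: per-utterance reference word counts (assumed positive).
    """
    rng = random.Random(seed)
    n = len(ref_lengths)
    wins_b = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement; both systems share the sample.
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        if wer_b < wer_a:
            wins_b += 1
    return wins_b / n_resamples
```

A result close to 1.0 suggests that B's advantage does not hinge on a handful of utterances. Resampling utterances jointly for both systems is what makes the test paired, and it is usually more sensitive than comparing two independent confidence intervals.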
Circularity Check
No circularity: empirical performance gains on external benchmarks
The paper presents an empirical two-stage knowledge distillation pipeline for multilingual ASR, with results consisting of measured WER reductions (29% relative) against monolingual baselines and SOTA models on independent benchmarks (Common Voice, Fleurs). No equations, first-principles derivations, or predictions are claimed that reduce by construction to fitted parameters or self-referential inputs inside the paper. The iterative pseudo-labeling stage is a standard semi-supervised technique whose validity is assessed via external test sets rather than being tautological; success depends on data and training outcomes, not definitional equivalence. No self-citations, uniqueness theorems, or ansatzes serve as load-bearing steps that collapse the central claim into its own assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- model size (120M / 600M)
axioms (1)
- domain assumption: Monolingual ASR teachers provide a useful supervisory signal for the multilingual student.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "two-stage distillation process. First, we employ student-teacher knowledge distillation from existing monolingual models, conditioned on robust language-specific N-gram language models. Second, we perform iterative self improvement using pseudo-labelled data"
- IndisputableMonolith/Foundation/RealityFromDistinction, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "SBPN improves transcription accuracy on several Nigerian languages relative to existing monolingual baselines"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] doi: 10.21437/Interspeech.2023-466. Josh Meyer, David Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack, Julian Weber, Salomon KABONGO KABENAMUALU, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Chinenye Emezue, Jonathan Mukiibi, Salomey Osei, Apelete AGBOLO, Victor Akinode, Bernard Opoku, Olanrewaju Samuel, Jesujoba Alabi, and Sha...
- [2] European Language Resources Association. Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Benjamin Rosman, Thamar Solorio, and Monojit Choudhury. The Zeno’s paradox of ‘low-resource’ languages. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17753–177...
- [3] doi: 10.1109/TASLP.2023.3306709.

Appendix A. List of Pidgin English variants. Phrases and words on the left were replaced with phrases and words on the right during training.
- beforebefore → bifor bifor
- chewing gum → chingum
- everybody’s → evribodi’s
- gavernments → governments
- gobernment → government
- gouverment → government
- gouvernment → government
- govaenment → government
- govenment → government
- goverment → government
- goverments → governments
- governmen → government
- governmet → government
- governmint → government
- pickin’s → pikins
- tomorrow’s → tomoro’s
- waiting dey → wetin dey
- wan welcome to → wan welcome
- we de for → wey dey for
- yu → you

Appendix B. List of Pidgin words with their homophones considered for replacements using the N-gram Pidgin language model.
- becoming / become hin
- convex / con vex / com vex
- fellow / folo / follow
- hear / here / hia / ear
- tory / tori / touring / thory
- way / wey / we / whey
- wear / wia / were / where
- yo / you

Appendix C. Table showing the hyper-parameters selected for each model variant of SBPN.
Hyperparameter                    SBPN-Base   SBPN-Large
Base learning rate                1e-4        3e-4
Self improvement learning rate    1e-5        1e-5
Number of layers                  17          24
No. of pred. RNN layers           1           2
Encoder feature dimension         512         1024
No. of attention heads            8           8
Weight d...
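The Appendix A and Appendix B lists suggest two text-normalisation steps: a fixed variant-to-canonical mapping, and a choice among homophones guided by an N-gram language model. The sketch below illustrates both under loud assumptions: the add-one-smoothed bigram model, the greedy left-to-right replacement, and every name in the code are hypothetical stand-ins, since the paper's actual LM order, smoothing, and replacement procedure are not given here. Only a few pairs from the appendices are copied in.

```python
import math
import re
from collections import Counter

# A handful of Appendix A pairs (variant -> canonical form).
VARIANTS = {"chewing gum": "chingum", "goverment": "government",
            "waiting dey": "wetin dey", "yu": "you"}
# Two Appendix B homophone groups.
HOMOPHONES = [{"way", "wey", "we", "whey"}, {"hear", "here", "hia", "ear"}]


def normalise(text, variants=VARIANTS):
    """Apply whole-word replacements, longest source phrase first."""
    for src in sorted(variants, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(src)}\b", variants[src], text)
    return text


class BigramLM:
    """Toy bigram LM with add-one smoothing (an assumed stand-in)."""

    def __init__(self, sentences):
        self.context, self.bigrams, vocab = Counter(), Counter(), set()
        for s in sentences:
            toks = ["<s>"] + s.split()
            vocab.update(toks)
            self.context.update(toks[:-1])       # tokens that have a successor
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, sentence):
        toks = ["<s>"] + sentence.split()
        return sum(math.log((self.bigrams[(a, b)] + 1)
                            / (self.context[a] + self.vocab_size))
                   for a, b in zip(toks, toks[1:]))


def pick_homophones(sentence, lm, groups=HOMOPHONES):
    """Greedily replace each homophone with the group member that gives the
    whole sentence the highest LM score."""
    toks = sentence.split()
    for i, tok in enumerate(toks):
        for group in groups:
            if tok in group:
                toks[i] = max(
                    sorted(group),  # sorted for deterministic tie-breaking
                    key=lambda w: lm.log_prob(
                        " ".join(toks[:i] + [w] + toks[i + 1:])))
    return " ".join(toks)
```

Given a model built from a few in-domain sentences, for instance `BigramLM(["di man wey dey here", "di woman wey dey sing", "I no hear am"])`, calling `pick_homophones("di man way dey here", lm)` swaps "way" for "wey" because the bigrams around "wey" were seen in training. A production system would use a properly smoothed higher-order model trained on far more text.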