Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training
Pith reviewed 2026-05-25 02:01 UTC · model grok-4.3
The pith
Semi-supervised training on 17.55 hours of untranscribed Somali speech cuts word error rate by 7.74 percent relative to a supervised baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using factorised time-delay neural networks and three successive semi-supervised passes, the addition of 17.55 hours of automatically transcribed Somali speech, filtered by decoder confidence, yields acoustic models that achieve a 7.74 percent relative word-error-rate reduction and language models whose perplexity drops 6.55 percent compared with a baseline trained on the 1.57-hour seed corpus alone.
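Both headline figures are relative reductions, so they depend on the baseline value as well as on the absolute change. A minimal sketch of that arithmetic follows; the function name and the numeric inputs are illustrative assumptions, and the absolute WER values are not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction of `improved` relative to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical values for illustration only; NOT the paper's measurements.
example_baseline_wer = 50.0
example_new_wer = 46.0
print(relative_reduction(example_baseline_wer, example_new_wer))  # 8.0
```

The same function applies to perplexity, since the 6.55 percent figure is also stated relative to the baseline.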
What carries the argument
A semi-supervised training loop that decodes unlabelled audio, thresholds the output by decoder confidence, and retrains both the TDNN-F acoustic model and the language model on the filtered transcripts.
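The loop described above can be sketched as follows. This is a toy outline under stated assumptions, not the authors' Kaldi recipe: `Hypothesis`, `semi_supervised_passes`, `toy_train` and `toy_decode` are hypothetical names, and pooling the seed data with the kept transcripts at every pass is an assumption about how the data are combined.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class Hypothesis:
    utt_id: str
    text: str
    confidence: float  # assumed to be a per-utterance score in [0, 1]


Labelled = Tuple[str, str]  # (utterance id, transcript)


def semi_supervised_passes(
    seed: Sequence[Labelled],
    unlabelled: Sequence[str],
    train: Callable[[List[Labelled]], Any],
    decode: Callable[[Any, Sequence[str]], List[Hypothesis]],
    threshold: float,
    n_passes: int = 3,
) -> Tuple[Any, List[List[Hypothesis]]]:
    """Decode, keep confident hypotheses, retrain; repeat n_passes times."""
    model = train(list(seed))  # supervised baseline on the seed corpus
    kept_per_pass: List[List[Hypothesis]] = []
    for _ in range(n_passes):
        hyps = decode(model, unlabelled)
        kept = [h for h in hyps if h.confidence >= threshold]
        kept_per_pass.append(kept)
        # Assumption: seed data is pooled with the filtered transcripts.
        model = train(list(seed) + [(h.utt_id, h.text) for h in kept])
    return model, kept_per_pass


# Toy stand-ins so the sketch runs; a real system would call an ASR toolkit.
def toy_train(data: List[Labelled]) -> int:
    return len(data)  # the "model" is just the training-set size


def toy_decode(model: int, utt_ids: Sequence[str]) -> List[Hypothesis]:
    return [
        Hypothesis(u, "placeholder", 0.9 if i % 2 == 0 else 0.1)
        for i, u in enumerate(utt_ids)
    ]


final_model, kept_per_pass = semi_supervised_passes(
    seed=[("seed1", "placeholder")],
    unlabelled=["u1", "u2", "u3"],
    train=toy_train,
    decode=toy_decode,
    threshold=0.5,
)
```

Keeping the filtered set from each pass makes it possible to pick the best-performing pass afterwards, for example to reuse its transcripts as language-model text.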
If this is right
- The same three-pass recipe can be applied whenever a modest seed of transcribed speech exists for any low-resource language.
- Language-model augmentation from the filtered transcripts is responsible for part of the overall gain.
- The method directly supports downstream keyword-spotting systems needed for real-time humanitarian monitoring.
- Further passes beyond three would be expected to produce diminishing but still positive returns until the pool of untranscribed data is exhausted.
Where Pith is reading between the lines
- Languages with similar phonetic inventories to Somali may see comparable relative gains from the identical pipeline.
- If decoder confidence is a poor proxy for transcription accuracy, an external quality estimator could replace or augment the threshold.
- The approach could be combined with self-training on even larger unlabelled corpora without any additional manual annotation.
Load-bearing premise
The automatic transcripts that survive the decoder-confidence filter are accurate enough on average that adding them improves rather than harms the acoustic and language models.
What would settle it
A controlled experiment in which the same 17.55 hours are added after random or low-confidence filtering and the resulting word error rate is higher than the 1.57-hour baseline.
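One way to set up such a control is to draw a random subset whose total duration matches the confidence-filtered subset, so that data volume is held roughly constant. The sketch below is an assumption-laden illustration: the tuple layout, function names and toy utterances are hypothetical and do not come from the paper.

```python
import random
from typing import List, Sequence, Tuple

Utt = Tuple[str, float, float]  # (utt_id, duration in seconds, confidence)


def select_by_confidence(utts: Sequence[Utt], threshold: float) -> List[Utt]:
    """Keep utterances whose confidence meets the threshold."""
    return [u for u in utts if u[2] >= threshold]


def select_random_matched(
    utts: Sequence[Utt], target_seconds: float, seed: int = 0
) -> List[Utt]:
    """Randomly pick utterances until their duration reaches target_seconds."""
    rng = random.Random(seed)
    shuffled = list(utts)
    rng.shuffle(shuffled)
    picked: List[Utt] = []
    total = 0.0
    for u in shuffled:
        if total >= target_seconds:
            break
        picked.append(u)
        total += u[1]
    return picked


# Toy pool: ten 10-second utterances with confidences 0.0, 0.1, ..., 0.9.
pool = [(f"u{i}", 10.0, i / 10) for i in range(10)]
filtered = select_by_confidence(pool, threshold=0.5)
target = sum(u[1] for u in filtered)
control = select_random_matched(pool, target)
```

Training one model on `filtered` and another on `control`, then comparing both against the seed-only baseline, would separate the effect of label quality from the effect of volume.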
Original abstract
We present improvements in automatic speech recognition (ASR) for Somali, a currently extremely under-resourced language. This forms part of a continuing United Nations (UN) effort to employ ASR-based keyword spotting systems to support humanitarian relief programmes in rural Africa. Using just 1.57 hours of annotated speech data as a seed corpus, we increase the pool of training data by applying semi-supervised training to 17.55 hours of untranscribed speech. We make use of factorised time-delay neural networks (TDNN-F) for acoustic modelling, since these have recently been shown to be effective in resource-scarce situations. Three semi-supervised training passes were performed, where the decoded output from each pass was used for acoustic model training in the subsequent pass. The automatic transcriptions from the best performing pass were used for language model augmentation. To ensure the quality of automatic transcriptions, decoder confidence is used as a threshold. The acoustic and language models obtained from the semi-supervised approach show significant improvement in terms of WER and perplexity compared to the baseline. Incorporating the automatically generated transcriptions yields a 6.55% improvement in language model perplexity. The use of 17.55 hours of Somali acoustic data in semi-supervised training shows an improvement of 7.74% relative over the baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that starting from a 1.57-hour seed of transcribed Somali speech, three iterative semi-supervised passes using TDNN-F acoustic models on 17.55 hours of untranscribed data (filtered by decoder confidence) plus augmentation of the language model with the best-pass transcripts yields a 7.74% relative WER reduction on held-out test data and a 6.55% reduction in LM perplexity relative to a baseline trained only on the seed.
Significance. If the reported gains prove robust, the result would be useful for extremely low-resource ASR, particularly in humanitarian keyword-spotting applications. The work gives explicit credit to the effectiveness of TDNN-F models in data-scarce regimes and demonstrates a practical three-pass iterative procedure with a simple confidence filter.
major comments (1)
- [Experiments / Results] The central 7.74% relative WER claim rests on adding 17.55 h of confidence-filtered automatic transcripts, yet the manuscript reports neither WER nor phone error rate on the retained pseudo-labels themselves nor an ablation that adds an equal volume of unfiltered or randomly sampled data. Without these controls it remains possible that the observed gain is an artifact of increased training volume rather than label quality.
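Measuring WER on the retained pseudo-labels would require a manually transcribed sample of them. Given such a sample, the metric itself is a word-level edit distance divided by the reference length; the sketch below is a generic implementation with a hypothetical function name and toy strings, not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy strings: one substitution (b -> x) and one deletion (d) over 4 words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```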
minor comments (2)
- [Abstract and §4] The statements that the improvements are “significant” are not accompanied by any statistical significance test or confidence interval on the WER difference.
- [§3.2] The exact data partitions (how the 17.55 h were selected from the larger untranscribed pool, train/dev/test splits) are described only at a high level; a table listing hours per subset would improve reproducibility.
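For the missing confidence interval, one common option is a paired bootstrap over test utterances. The sketch below is a generic illustration, assuming per-utterance error counts and reference lengths are available for both systems; the function name and toy numbers are hypothetical. With only about 10 minutes of test speech, as the excerpt of Section 3.1 indicates, such an interval may turn out to be wide.

```python
import random
from typing import Sequence, Tuple


def bootstrap_wer_diff(
    errors_a: Sequence[int],
    errors_b: Sequence[int],
    ref_lens: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for WER(A) - WER(B), resampling utterances in pairs."""
    n = len(ref_lens)
    if n == 0 or len(errors_a) != n or len(errors_b) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(length <= 0 for length in ref_lens):
        raise ValueError("reference lengths must be positive")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # The same resampled utterances are scored for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lens[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        diffs.append(wer_a - wer_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data: system A makes 3 errors and system B 2 errors per 10-word utterance.
low, high = bootstrap_wer_diff([3] * 20, [2] * 20, [10] * 20)
```

If the interval excludes zero, the difference between the two systems is unlikely to be explained by the choice of test utterances alone.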
Simulated Author's Rebuttal
We thank the referee for the detailed review and the constructive comment regarding our experimental controls. We address the major comment below.
- Referee: [Experiments / Results] The central 7.74% relative WER claim rests on adding 17.55 h of confidence-filtered automatic transcripts, yet the manuscript reports neither WER nor phone error rate on the retained pseudo-labels themselves nor an ablation that adds an equal volume of unfiltered or randomly sampled data. Without these controls it remains possible that the observed gain is an artifact of increased training volume rather than label quality.
Authors: We agree that the manuscript does not report WER or phone error rate on the retained pseudo-labels, nor does it include an ablation comparing the confidence-filtered data against an equal volume of unfiltered or randomly sampled transcripts. This is a valid observation, and such controls would strengthen the claim that gains arise from label quality rather than data volume alone. The work emphasizes the practical iterative procedure with confidence thresholding in an extremely low-resource setting, where the three passes yield progressive improvements. In the revised manuscript we will add a discussion paragraph in the Experiments section explicitly acknowledging this limitation and noting that the observed gains are consistent with the design of the confidence filter.
revision: yes
Circularity Check
No circularity; purely empirical results on held-out data
The paper reports measured WER and perplexity improvements from adding 17.55 h of confidence-filtered automatic transcripts to a 1.57 h seed set for TDNN-F training and LM augmentation. All claims are direct experimental outcomes on held-out test data; no equations, fitted parameters renamed as predictions, self-citations, or derivations are present that reduce to the inputs by construction. The method is iterative semi-supervised training, but the reported 7.74% relative gain is an external measurement, not a self-referential quantity.
Reference graph
Works this paper leans on
- [1] Introduction. In countries with a well established internet infrastructure, social media has become an accepted platform for sharing opinions and concerns [1–3]. Surveys conducted by the United Nations (UN) in places lacking sufficient internet infrastructure indicate that this function is fulfilled by radio phone-in shows [4–6]. Therefore, to support ...
- [2] Radio browsing system. Figure 1 [8] shows the components of the radio browsing system. The preprocessed audio stream is passed to the ASR system which generates lattices which are subsequently searched for predefined keywords. Human analysts further process the data which aid in humanitarian decision making and situational awareness. This system is curr...
- [3] Acoustic and text data. 3.1. Manually transcribed acoustic data. The Somali acoustic training and test data used in our experiments is described in Table 1. This small dataset of speech captured from broadcast Somali radio phone-in programmes, contains only 1.57 hours of transcribed speech that is available for training and 10 minutes for testing. Table 1...
- [4] Semi-supervised training. It has been shown that semi-supervised training can improve ASR performance in an under-resourced scenario [14, 15]. As we only have less than two hours of transcribed Somali acoustic data, increasing the pool of in-domain data by semi-supervised training was an attractive option. To test this, we used a recently-acquired corpus c...
- [5] Language modelling. All language models were built using the SRILM toolkit [19]. The vocabulary of the ASR system was drawn from the pool of T1, T2 and T3 texts in Table 3 by retaining all word types occurring at least four times. The resulting vocabulary consisted of 41.7k word types. The language model used in [9] was used as the baseline (LMbase). This ...
- [6] Acoustic modelling. The Kaldi speech recognition toolkit was used for all ASR experiments [20]. All experiments were performed using a PC with an 8-core Intel i7 CPU, 32GB of RAM and a 12GB NVIDIA Tesla GPU. In our previous work, we found multilingual training to improve ASR performance substantially [9]. Table 4: Perplexities of the evaluated language m...
- [7] Results and discussion. The ASR performance is reported in Table 5 in terms of the word error rate (WER) for the various training approaches. In comparison with our previous ASR system [9], the improvement afforded by TDNN-F is clear (rows 1 and 2). Even though TDNN-F uses only half the number of parameters as CNN-TDNN-BLSTM, it is able to offer better ...
- [8] Conclusion. We have presented our initial efforts to increase the pool of Somali acoustic and language model data in a semi-supervised manner in an effort to improve automatic speech recognition for Somali. A training corpus of only 1.57 hours of in-domain segmented and transcribed Somali radio broadcast speech data was available. A further 17.55 hours o...
- [9] Acknowledgements. We thank the NVIDIA corporation for the donation of GPU equipment used for this research. We also gratefully acknowledge the support of Telkom South Africa.
- [10] S. Vosoughi and D. Roy, “A human-machine collaborative system for identifying rumors on Twitter,” in Proc. ICDMW, 2015
- [11] K. Wegrzyn-Wolska, L. Bougueroua, and G. Dziczkowski, “Social media analysis for e-health and medical purposes,” in Proc. CASoN, 2011
- [12] P. Burnap, G. Colombo, and J. Scourfield, “Machine classification and analysis of suicide related communication on Twitter,” in Proc. ACM-HT, 2015
- [13] G. P. P. Series, “Analyzing attitudes towards contraception and teenage pregnancy using social data,” Global Pulse Project Series, no. 8, 2014
- [14] ——, “Mining citizen feedback data for enhanced local government decision-making,” Global Pulse Project Series, no. 16, 2015
- [15] ——, “Understanding immunisation awareness and sentiment through social and mainstream media,” Global Pulse Project Series, no. 19, 2015
- [16] R. Menon, A. Saeb, H. Cameron, W. Kibira, J. Quinn, and T. Niesler, “Radio-browsing for developmental monitoring in Uganda,” in Proc. ICASSP, 2017
- [17] A. Saeb, R. Menon, H. Cameron, W. Kibira, J. Quinn, and T. Niesler, “Very low resource radio browsing for agile developmental and humanitarian monitoring,” in Proc. Interspeech, 2017
- [18] R. Menon, A. Biswas, A. Saeb, J. Quinn, and T. Niesler, “Automatic speech recognition for humanitarian applications in Somali,” in Proc. SLTU, 2018
- [19] A. Biswas, F. de Wet, E. van der Westhuizen, E. Yılmaz, and T. Niesler, “Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech,” in Proc. Interspeech, 2018
- [20] A. Ghoshal, P. Swietojanski, and S. Renals, “Multilingual training of deep neural networks,” in Proc. ICASSP, 2013, pp. 7319–7323
- [21] N. Addillahi, N. Pascal, and B. Jean-Francois, “Automatic transcription of Somali language,” in Proc. Interspeech, 2006
- [22] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Proc. Interspeech, 2018, pp. 3743–3747
- [23] E. Yılmaz, M. McLaren, H. van den Heuvel, and D. A. van Leeuwen, “Semi-supervised acoustic model training for speech with code-switching,” Speech Communication, vol. 105, pp. 12–22, 2018
- [24] U. Nallasamy, F. Metze, and T. Schultz, “Semi-supervised learning for speech recognition in the context of accent adaptation,” in Symposium on Machine Learning in Speech and Language Processing, 2012, pp. 13–17
- [25] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in Proc. ICASSP, 2013, pp. 6704–6708
- [26] H. Kamper, F. de Wet, T. Hain, and T. Niesler, “Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system,” Computer Speech and Language, vol. 28, no. 6, pp. 1255–1268, 2014
- [27] D. Goldhahn, T. Eckart, and U. Quasthoff, “Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages.” in Proc. LREC, vol. 29, 2012, pp. 31–43
- [28] A. Stolcke, “SRILM-an extensible language modeling toolkit,” in Proc. ICSLP, 2002
- [29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011
- [30] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. Interspeech, 2015
- [31] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.” in Proc. Interspeech, 2016, pp. 2751–2755