
arXiv: 1907.03064 · v1 · pith:G4XZH3ZSnew · submitted 2019-07-06 · cs.CL · cs.LG · eess.AS

Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training

Pith reviewed 2026-05-25 02:01 UTC · model grok-4.3

classification cs.CL · cs.LG · eess.AS
keywords Somali · speech recognition · semi-supervised learning · low-resource ASR · TDNN-F · language model augmentation · keyword spotting

The pith

Semi-supervised training on 17.55 hours of untranscribed Somali speech cuts word error rate by 7.74 percent relative to a supervised baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that starting from only 1.57 hours of manually transcribed Somali speech, three rounds of semi-supervised training on an additional 17.55 hours of unlabelled audio produce better acoustic and language models. Decoder scores are used to filter the automatic transcripts before they are added to the training sets. The resulting models lower word error rate on held-out test data and reduce language-model perplexity by 6.55 percent. These gains matter because Somali remains extremely data-scarce and the work supports keyword-spotting tools for humanitarian relief operations.

Core claim

Using factorised time-delay neural networks and three successive semi-supervised passes, the addition of 17.55 hours of automatically transcribed Somali speech, filtered by decoder confidence, yields acoustic models that achieve a 7.74 percent relative word-error-rate reduction and language models whose perplexity drops 6.55 percent compared with a baseline trained on the 1.57-hour seed corpus alone.
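
The two percentages can be read against the usual definition of a relative reduction. The block below is a reading aid that assumes the paper follows that convention for both figures; the subscripts "base" and "semi" are labels introduced here, and the absolute WER and perplexity values are not given on this page.

```latex
\[
\frac{\mathrm{WER}_{\mathrm{base}} - \mathrm{WER}_{\mathrm{semi}}}{\mathrm{WER}_{\mathrm{base}}} \approx 0.0774,
\qquad
\frac{\mathrm{PPL}_{\mathrm{base}} - \mathrm{PPL}_{\mathrm{semi}}}{\mathrm{PPL}_{\mathrm{base}}} \approx 0.0655
\]
```

Under this reading, a relative reduction of 7.74 percent is smaller in absolute WER points than the headline number might suggest, since it is scaled by the baseline error rate.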

What carries the argument

A semi-supervised training loop that decodes unlabelled audio, thresholds the output by decoder confidence, and retrains both the TDNN-F acoustic model and the language model on the filtered transcripts.
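
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Kaldi recipe: the `train` and `decode` callables, the `Hypothesis` record, the confidence scale, and the choice to always keep the seed data in the training set are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Hypothesis:
    utt_id: str
    text: str
    confidence: float  # decoder confidence; a [0, 1] scale is assumed here


# (utterance id, transcript) pairs used as training data
Labelled = List[Tuple[str, str]]


def filter_by_confidence(hyps: Sequence[Hypothesis], threshold: float) -> Labelled:
    """Keep only automatic transcripts whose confidence meets the threshold."""
    return [(h.utt_id, h.text) for h in hyps if h.confidence >= threshold]


def semi_supervised_training(
    seed: Labelled,
    unlabelled_ids: Sequence[str],
    train: Callable[[Labelled], object],
    decode: Callable[[object, Sequence[str]], List[Hypothesis]],
    threshold: float,
    n_passes: int = 3,
) -> Tuple[object, Labelled]:
    """Run n_passes rounds of decode -> filter -> retrain.

    `train` and `decode` are hypothetical stand-ins for acoustic-model
    training and lattice decoding.
    """
    model = train(seed)  # supervised baseline trained on the seed corpus only
    pseudo: Labelled = []
    for _ in range(n_passes):
        hyps = decode(model, unlabelled_ids)           # transcribe unlabelled audio
        pseudo = filter_by_confidence(hyps, threshold)  # drop low-confidence output
        model = train(seed + pseudo)                    # retrain on seed + pseudo-labels
    return model, pseudo
```

The pseudo-labels returned from the final pass could then be appended to the language-model training text; the abstract states that the paper used the transcripts from the best-performing pass for that purpose.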

If this is right

  • The same three-pass recipe can be applied whenever a modest seed of transcribed speech exists for any low-resource language.
  • Language-model augmentation from the filtered transcripts is responsible for part of the overall gain.
  • The method directly supports downstream keyword-spotting systems needed for real-time humanitarian monitoring.
  • Further passes beyond three would be expected to produce diminishing but still positive returns until the pool of untranscribed data is exhausted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Languages with similar phonetic inventories to Somali may see comparable relative gains from the identical pipeline.
  • If decoder confidence is a poor proxy for transcription accuracy, an external quality estimator could replace or augment the threshold.
  • The approach could be combined with self-training on even larger unlabelled corpora without any additional manual annotation.

Load-bearing premise

The automatic transcripts that survive the decoder-confidence filter are accurate enough on average that adding them improves rather than harms the acoustic and language models.
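
One way to probe this premise is to score a manually checked sample of the retained automatic transcripts. The sketch below computes word error rate with a standard word-level edit distance; the function names are introduced here, and whitespace tokenisation is an assumption that may not match the paper's scoring setup.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words
```

Applied to pseudo-labels, `references` would be the manual corrections and `hypotheses` the automatic transcripts that passed the filter.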

What would settle it

A controlled experiment in which the same 17.55 hours are added after random or low-confidence filtering and the resulting word error rate is higher than the 1.57-hour baseline.
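
A control of this kind needs a comparison set that matches the filtered set in volume, so that data quantity and label quality are not confounded. The following sketch shows one possible way to draw such a set; the `Utterance` tuple layout and both function names are hypothetical, and the actual retraining step is left out.

```python
import random
from typing import List, Sequence, Tuple

# (utterance id, duration in seconds, decoder confidence); layout assumed here
Utterance = Tuple[str, float, float]


def select_by_confidence(utts: Sequence[Utterance], threshold: float) -> List[Utterance]:
    """Utterances that pass the confidence filter."""
    return [u for u in utts if u[2] >= threshold]


def select_random_matched(
    utts: Sequence[Utterance], target_seconds: float, seed: int = 0
) -> List[Utterance]:
    """Randomly chosen utterances, added until the target duration is reached."""
    rng = random.Random(seed)
    shuffled = list(utts)
    rng.shuffle(shuffled)
    chosen: List[Utterance] = []
    total = 0.0
    for u in shuffled:
        if total >= target_seconds:
            break
        chosen.append(u)
        total += u[1]
    return chosen
```

Training one model on each selection and comparing both against the seed-only baseline would separate the effect of the filter from the effect of simply having more audio.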

Figures

Figures reproduced from arXiv: 1907.03064 by Astik Biswas, Ewald van der Westhuizen, Raghav Menon, Thomas Niesler.

Figure 1. Components of the radio browsing system [8]. The preprocessed audio stream is passed to the ASR system, which generates lattices that are subsequently searched for predefined keywords. Human analysts further process the data, which aids humanitarian decision making and situational awareness. This system is currently successfully deployed by the UN in Uganda.
Figure 2. Semi-supervised training framework for Somali ASR. A marker in the figure (lost in extraction) represents untranscribed speech being fed to the transcriber.
Figure 3. Semi-supervised acoustic and language modelling for Somali ASR. (Body text captured with the caption: "... However, LM3, which was optimised on the validation set, showed an improvement of 1.86% relative to the baseline.")
Original abstract

We present improvements in automatic speech recognition (ASR) for Somali, a currently extremely under-resourced language. This forms part of a continuing United Nations (UN) effort to employ ASR-based keyword spotting systems to support humanitarian relief programmes in rural Africa. Using just 1.57 hours of annotated speech data as a seed corpus, we increase the pool of training data by applying semi-supervised training to 17.55 hours of untranscribed speech. We make use of factorised time-delay neural networks (TDNN-F) for acoustic modelling, since these have recently been shown to be effective in resource-scarce situations. Three semi-supervised training passes were performed, where the decoded output from each pass was used for acoustic model training in the subsequent pass. The automatic transcriptions from the best performing pass were used for language model augmentation. To ensure the quality of automatic transcriptions, decoder confidence is used as a threshold. The acoustic and language models obtained from the semi-supervised approach show significant improvement in terms of WER and perplexity compared to the baseline. Incorporating the automatically generated transcriptions yields a 6.55% improvement in language model perplexity. The use of 17.55 hours of Somali acoustic data in semi-supervised training shows an improvement of 7.74% relative over the baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that starting from a 1.57-hour seed of transcribed Somali speech, three iterative semi-supervised passes using TDNN-F acoustic models on 17.55 hours of untranscribed data (filtered by decoder confidence) plus augmentation of the language model with the best-pass transcripts yields a 7.74% relative WER reduction on held-out test data and a 6.55% reduction in LM perplexity relative to a baseline trained only on the seed.

Significance. If the reported gains prove robust, the result would be useful for extremely low-resource ASR, particularly in humanitarian keyword-spotting applications. The work gives explicit credit to the effectiveness of TDNN-F models in data-scarce regimes and demonstrates a practical three-pass iterative procedure with a simple confidence filter.

major comments (1)
  1. [Experiments / Results] The central 7.74% relative WER claim rests on adding 17.55 h of confidence-filtered automatic transcripts, yet the manuscript reports neither WER nor phone error rate on the retained pseudo-labels themselves nor an ablation that adds an equal volume of unfiltered or randomly sampled data. Without these controls it remains possible that the observed gain is an artifact of increased training volume rather than label quality.
minor comments (2)
  1. [Abstract and §4] The statements that the improvements are “significant” are not accompanied by any statistical significance test or confidence interval on the WER difference.
  2. [§3.2] The exact data partitions (how the 17.55 h were selected from the larger untranscribed pool, and the train/dev/test splits) are described only at a high level; a table listing hours per subset would improve reproducibility.
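
The first minor comment asks for a confidence interval on the WER difference. A paired bootstrap over test utterances is one common way to obtain it; the sketch below is an illustration under assumptions, not a procedure taken from the paper. The per-utterance tuple layout, the function names and the percentile method are choices made here, and every utterance is assumed to contain at least one reference word.

```python
import random
from typing import List, Sequence, Tuple

# (reference word count, baseline errors, new-system errors) per test utterance
UttStats = Tuple[int, int, int]


def wer_difference(stats: Sequence[UttStats]) -> float:
    """Baseline WER minus new-system WER; positive means the new system is better."""
    words = sum(s[0] for s in stats)
    return (sum(s[1] for s in stats) - sum(s[2] for s in stats)) / words


def bootstrap_ci(
    stats: Sequence[UttStats],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the WER difference."""
    rng = random.Random(seed)
    pool = list(stats)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping both systems paired.
        sample = [rng.choice(pool) for _ in pool]
        diffs.append(wer_difference(sample))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the improvement is unlikely to be a resampling artefact of the test set. With only 10 minutes of test audio, as stated in the data excerpt in the reference graph, the interval may be wide.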

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the constructive comment regarding our experimental controls. We address the major comment below.

Point-by-point responses
  1. Referee: [Experiments / Results] The central 7.74% relative WER claim rests on adding 17.55 h of confidence-filtered automatic transcripts, yet the manuscript reports neither WER nor phone error rate on the retained pseudo-labels themselves nor an ablation that adds an equal volume of unfiltered or randomly sampled data. Without these controls it remains possible that the observed gain is an artifact of increased training volume rather than label quality.

    Authors: We agree that the manuscript does not report WER or phone error rate on the retained pseudo-labels, nor does it include an ablation comparing the confidence-filtered data against an equal volume of unfiltered or randomly sampled transcripts. This is a valid observation, and such controls would strengthen the claim that gains arise from label quality rather than data volume alone. The work emphasizes the practical iterative procedure with confidence thresholding in an extremely low-resource setting, where the three passes yield progressive improvements. In the revised manuscript we will add a discussion paragraph in the Experiments section explicitly acknowledging this limitation and noting that the observed gains are consistent with the design of the confidence filter.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical results on held-out data

Full rationale

The paper reports measured WER and perplexity improvements from adding 17.55 h of confidence-filtered automatic transcripts to a 1.57 h seed set for TDNN-F training and LM augmentation. All claims are direct experimental outcomes on held-out test data; no equations, fitted parameters renamed as predictions, self-citations, or derivations are present that reduce to the inputs by construction. The method is iterative semi-supervised training, but the reported 7.74% relative gain is an external measurement, not a self-referential quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no model equations, free parameters, or invented entities are specified. Standard ASR assumptions such as the validity of WER as a metric and the usefulness of TDNN-F are implicit but not detailed.



Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    Introduction. In countries with a well established internet infrastructure, social media has become an accepted platform for sharing opinions and concerns [1–3]. Surveys conducted by the United Nations (UN) in places lacking sufficient internet infrastructure indicate that this function is fulfilled by radio phone-in shows [4–6]. Therefore, to support ...

  2. [2]

    Radio browsing system. Figure 1 [8] shows the components of the radio browsing system. The preprocessed audio stream is passed to the ASR system which generates lattices which are subsequently searched for predefined keywords. Human analysts further process the data which aid in humanitarian decision making and situational awareness. This system is curr...

  3. [3]

    Acoustic and text data. 3.1. Manually transcribed acoustic data. The Somali acoustic training and test data used in our experiments is described in Table 1. This small dataset of speech captured from broadcast Somali radio phone-in programmes, contains only 1.57 hours of transcribed speech that is available for training and 10 minutes for testing. Table 1...

  4. [4]

    Semi-supervised training. It has been shown that semi-supervised training can improve ASR performance in an under-resourced scenario [14, 15]. As we only have less than two hours of transcribed Somali acoustic data, increasing the pool of in-domain data by semi-supervised training was an attractive option. To test this, we used a recently-acquired corpus c...

  5. [5]

    Language modelling. All language models were built using the SRILM toolkit [19]. The vocabulary of the ASR system was drawn from the pool of T1, T2 and T3 texts in Table 3 by retaining all word types occurring at least four times. The resulting vocabulary consisted of 41.7k word types. The language model used in [9] was used as the baseline (LMbase). This ...

  6. [6]

    Acoustic modelling. The Kaldi speech recognition toolkit was used for all ASR experiments [20]. All experiments were performed using a PC with an 8-core Intel i7 CPU, 32GB of RAM and a 12GB NVIDIA Tesla GPU. In our previous work, we found multilingual training to improve ASR performance substantially [9]. Table 4: Perplexities of the evaluated language m...

  7. [7]

    Results and discussion. The ASR performance is reported in Table 5 in terms of the word error rate (WER) for the various training approaches. In comparison with our previous ASR system [9], the improvement afforded by TDNN-F is clear (rows 1 and 2). Even though TDNN-F uses only half the number of parameters as CNN-TDNN-BLSTM, it is able to offer better ...

  8. [8]

    Conclusion. We have presented our initial efforts to increase the pool of Somali acoustic and language model data in a semi-supervised manner in an effort to improve automatic speech recognition for Somali. A training corpus of only 1.57 hours of in-domain segmented and transcribed Somali radio broadcast speech data was available. A further 17.55 hours o...

  9. [9]

    Acknowledgements. We thank the NVIDIA corporation for the donation of GPU equipment used for this research. We also gratefully acknowledge the support of Telkom South Africa.

  10. [10]

    S. Vosoughi and D. Roy, “A human-machine collaborative system for identifying rumors on Twitter,” in Proc. ICDMW, 2015.

  11. [11]

    K. Wegrzyn-Wolska, L. Bougueroua, and G. Dziczkowski, “Social media analysis for e-health and medical purposes,” in Proc. CASoN, 2011.

  12. [12]

    P. Burnap, G. Colombo, and J. Scourfield, “Machine classification and analysis of suicide related communication on Twitter,” in Proc. ACM-HT, 2015.

  13. [13]

    G. P. P. Series, “Analyzing attitudes towards contraception and teenage pregnancy using social data,” Global Pulse Project Series, no. 8, 2014.

  14. [14]

    ——, “Mining citizen feedback data for enhanced local government decision-making,” Global Pulse Project Series, no. 16, 2015.

  15. [15]

    ——, “Understanding immunisation awareness and sentiment through social and mainstream media,” Global Pulse Project Series, no. 19, 2015.

  16. [16]

    R. Menon, A. Saeb, H. Cameron, W. Kibira, J. Quinn, and T. Niesler, “Radio-browsing for developmental monitoring in Uganda,” in Proc. ICASSP, 2017.

  17. [17]

    A. Saeb, R. Menon, H. Cameron, W. Kibira, J. Quinn, and T. Niesler, “Very low resource radio browsing for agile developmental and humanitarian monitoring,” in Proc. Interspeech, 2017.

  18. [18]

    R. Menon, A. Biswas, A. Saeb, J. Quinn, and T. Niesler, “Automatic speech recognition for humanitarian applications in Somali,” in Proc. SLTU, 2018.

  19. [19]

    A. Biswas, F. de Wet, E. van der Westhuizen, E. Yılmaz, and T. Niesler, “Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech,” in Proc. Interspeech, 2018.

  20. [20]

    A. Ghoshal, P. Swietojanski, and S. Renals, “Multilingual training of deep neural networks,” in Proc. ICASSP, 2013, pp. 7319–7323.

  21. [21]

    N. Addillahi, N. Pascal, and B. Jean-Francois, “Automatic transcription of Somali language,” in Proc. Interspeech, 2006.

  22. [22]

    D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Proc. Interspeech, 2018, pp. 3743–3747.

  23. [23]

    E. Yılmaz, M. McLaren, H. van den Heuvel, and D. A. van Leeuwen, “Semi-supervised acoustic model training for speech with code-switching,” Speech Communication, vol. 105, pp. 12–22, 2018.

  24. [24]

    U. Nallasamy, F. Metze, and T. Schultz, “Semi-supervised learning for speech recognition in the context of accent adaptation,” in Symposium on Machine Learning in Speech and Language Processing, 2012, pp. 13–17.

  25. [25]

    S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in Proc. ICASSP, 2013, pp. 6704–6708.

  26. [26]

    H. Kamper, F. de Wet, T. Hain, and T. Niesler, “Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system,” Computer Speech and Language, vol. 28, no. 6, pp. 1255–1268, 2014.

  27. [27]

    D. Goldhahn, T. Eckart, and U. Quasthoff, “Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages,” in Proc. LREC, vol. 29, 2012, pp. 31–43.

  28. [28]

    A. Stolcke, “SRILM-an extensible language modeling toolkit,” in Proc. ICSLP, 2002.

  29. [29]

    D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.

  30. [30]

    T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. Interspeech, 2015.

  31. [31]

    D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI,” in Proc. Interspeech, 2016, pp. 2751–2755.
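
The language-modelling excerpt in the reference graph states that the ASR vocabulary was drawn from the pooled texts by retaining all word types occurring at least four times. A minimal sketch of that counting rule follows; the function name and whitespace tokenisation are assumptions made here, and the paper's own text normalisation is not described on this page.

```python
from collections import Counter
from typing import Iterable, List


def build_vocabulary(texts: Iterable[str], min_count: int = 4) -> List[str]:
    """Return the sorted word types that occur at least min_count times."""
    counts = Counter(word for line in texts for word in line.split())
    return sorted(w for w, c in counts.items() if c >= min_count)
```

Raising `min_count` shrinks the vocabulary and lowers the chance of admitting recognition errors from automatic transcripts, at the cost of more out-of-vocabulary words.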