Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training

Astik Biswas; Ewald van der Westhuizen; Raghav Menon; Thomas Niesler

arXiv: 1907.03064 · v1 · pith:G4XZH3ZSnew · submitted 2019-07-06 · 💻 cs.CL · cs.LG · eess.AS

Pith reviewed 2026-05-25 02:01 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · eess.AS

keywords Somali · speech recognition · semi-supervised learning · low-resource ASR · TDNN-F · language model augmentation · keyword spotting

The pith

Semi-supervised training on 17.55 hours of untranscribed Somali speech cuts word error rate by 7.74 percent relative to a supervised baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that starting from only 1.57 hours of manually transcribed Somali speech, three rounds of semi-supervised training on an additional 17.55 hours of unlabelled audio produce better acoustic and language models. Decoder scores are used to filter the automatic transcripts before they are added to the training sets. The resulting models lower word error rate on held-out test data and reduce language-model perplexity by 6.55 percent. These gains matter because Somali remains extremely data-scarce and the work supports keyword-spotting tools for humanitarian relief operations.
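To make the headline metric concrete, the following minimal Python sketch computes word error rate by word-level edit distance and expresses a change as a relative reduction. The function names and the sample figures are illustrative assumptions; the paper's absolute WER values are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical WER values (in percent), chosen only to illustrate the arithmetic.
print(relative_reduction(50.0, 46.0))  # -> 8.0
```

A relative figure such as 7.74 percent therefore describes the change as a fraction of the baseline error rate, not a drop in absolute percentage points.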

Core claim

Using factorised time-delay neural networks (TDNN-F) and three successive semi-supervised passes, adding 17.55 hours of automatically transcribed Somali speech, filtered by decoder confidence, yields acoustic models that achieve a 7.74 percent relative word-error-rate reduction and language models whose perplexity drops by 6.55 percent, compared with a baseline trained on the 1.57-hour seed corpus alone.

What carries the argument

A semi-supervised training loop that decodes unlabelled audio, thresholds the output by decoder confidence, and retrains both the TDNN-F acoustic model and the language model on the filtered transcripts.
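The loop described here can be sketched in a few lines of Python. This is a schematic under stated assumptions, not the authors' Kaldi recipe: `train` and `decode` are hypothetical callables supplied by the caller, the `Hypothesis` type is invented for illustration, and the confidence is assumed to be a single per-utterance score in [0, 1].

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Transcript = Tuple[str, str]  # (utterance_id, text)


@dataclass(frozen=True)
class Hypothesis:
    """One automatic transcript with its decoder confidence (hypothetical type)."""
    utterance_id: str
    text: str
    confidence: float


def filter_by_confidence(hyps: Sequence[Hypothesis], threshold: float) -> List[Transcript]:
    """Keep only transcripts whose decoder confidence reaches the threshold."""
    return [(h.utterance_id, h.text) for h in hyps if h.confidence >= threshold]


def semi_supervised_passes(
    seed: Sequence[Transcript],
    untranscribed_ids: Sequence[str],
    train: Callable[[List[Transcript]], Any],
    decode: Callable[[Any, Sequence[str]], List[Hypothesis]],
    threshold: float,
    passes: int = 3,
) -> Any:
    """Train on the seed set, then repeatedly decode, filter and retrain."""
    model = train(list(seed))
    for _ in range(passes):
        hyps = decode(model, untranscribed_ids)      # automatic transcription
        kept = filter_by_confidence(hyps, threshold)  # confidence thresholding
        model = train(list(seed) + kept)              # retrain on seed + filtered output
    return model
```

Passing the training and decoding steps as callables keeps the control flow visible while leaving the toolkit-specific parts unspecified, since the page gives no detail on them beyond TDNN-F acoustic models and a confidence threshold.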

If this is right

The same three-pass recipe can be applied whenever a modest seed of transcribed speech exists for any low-resource language.
Language-model augmentation from the filtered transcripts is responsible for part of the overall gain.
The method directly supports downstream keyword-spotting systems needed for real-time humanitarian monitoring.
Further passes beyond three would be expected to produce diminishing but still positive returns until the pool of untranscribed data is exhausted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Languages with similar phonetic inventories to Somali may see comparable relative gains from the identical pipeline.
If decoder confidence is a poor proxy for transcription accuracy, an external quality estimator could replace or augment the threshold.
The approach could be combined with self-training on even larger unlabelled corpora without any additional manual annotation.

Load-bearing premise

The automatic transcripts that survive the decoder-confidence filter are accurate enough on average that adding them improves rather than harms the acoustic and language models.

What would settle it

A controlled experiment in which the same 17.55 hours are added after random or low-confidence filtering and the resulting word error rate is higher than the 1.57-hour baseline.
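One way to set up such a control is to draw a random subset whose total duration matches the confidence-filtered subset, so that only label quality differs between the two training conditions. The sketch below is an assumption-laden illustration: the `Utterance` record, its fields and the selection helpers are hypothetical and do not come from the paper.

```python
import random
from typing import List, NamedTuple, Sequence


class Utterance(NamedTuple):
    """Hypothetical record for one automatically transcribed utterance."""
    utterance_id: str
    duration_s: float
    confidence: float


def total_duration(utts: Sequence[Utterance]) -> float:
    return sum(u.duration_s for u in utts)


def select_confident(utts: Sequence[Utterance], threshold: float) -> List[Utterance]:
    """Experimental condition: keep utterances at or above the confidence threshold."""
    return [u for u in utts if u.confidence >= threshold]


def select_random_matched(
    utts: Sequence[Utterance], target_duration_s: float, seed: int = 0
) -> List[Utterance]:
    """Control condition: random utterances until the target duration is reached."""
    rng = random.Random(seed)
    shuffled = list(utts)
    rng.shuffle(shuffled)
    chosen: List[Utterance] = []
    total = 0.0
    for u in shuffled:
        if total >= target_duration_s:
            break
        chosen.append(u)
        total += u.duration_s
    return chosen
```

Training one model on each subset and comparing their word error rates against the seed-only baseline would separate the effect of added volume from the effect of the confidence filter.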

Figures

Figures reproduced from arXiv: 1907.03064 by Astik Biswas, Ewald van der Westhuizen, Raghav Menon, Thomas Niesler.

**Figure 1.** Components of the radio browsing system [8]. The preprocessed audio stream is passed to the ASR system which generates lattices which are subsequently searched for predefined keywords. Human analysts further process the data which aid in humanitarian decision making and situational awareness. This system is currently successfully deployed by the UN in Uganda.

**Figure 2.** Semi-supervised training framework for Somali ASR, in which untranscribed speech is fed to the transcriber.

**Figure 3.** Semi-supervised acoustic and language modelling for Somali ASR. Adjoining text from the paper: "However, LM3, which was optimised on the validation set, showed an improvement of 1.86% relative to the baseline."

Original abstract

We present improvements in automatic speech recognition (ASR) for Somali, a currently extremely under-resourced language. This forms part of a continuing United Nations (UN) effort to employ ASR-based keyword spotting systems to support humanitarian relief programmes in rural Africa. Using just 1.57 hours of annotated speech data as a seed corpus, we increase the pool of training data by applying semi-supervised training to 17.55 hours of untranscribed speech. We make use of factorised time-delay neural networks (TDNN-F) for acoustic modelling, since these have recently been shown to be effective in resource-scarce situations. Three semi-supervised training passes were performed, where the decoded output from each pass was used for acoustic model training in the subsequent pass. The automatic transcriptions from the best performing pass were used for language model augmentation. To ensure the quality of automatic transcriptions, decoder confidence is used as a threshold. The acoustic and language models obtained from the semi-supervised approach show significant improvement in terms of WER and perplexity compared to the baseline. Incorporating the automatically generated transcriptions yields a 6.55% improvement in language model perplexity. The use of 17.55 hours of Somali acoustic data in semi-supervised training shows an improvement of 7.74% relative over the baseline.
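For readers unfamiliar with the second metric, perplexity is the exponential of the average negative log-probability a language model assigns to each token of a held-out text. The short Python sketch below shows only that definition; the function name is an assumption, and the input is taken to be natural-log token probabilities from any language model.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token is, on average,
# as uncertain as a uniform choice among four words.
uniform_four = [math.log(0.25)] * 4
print(round(perplexity(uniform_four), 6))  # -> 4.0
```

Lower perplexity means the model finds the held-out text less surprising, so the reported 6.55% improvement corresponds to a decrease in this quantity.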

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Modest 7.74% relative WER gain from iterative semi-supervised TDNN-F training on Somali, but the absence of ablations and pseudo-label checks leaves open whether the gain comes from label quality or just from added volume.

The core result is that three passes of semi-supervised training on 17.55 hours of unlabelled Somali speech, filtered by decoder confidence, improve WER by 7.74% relative to a 1.57-hour seed baseline and lower LM perplexity by 6.55%. The work uses TDNN-F models, which are already known to help in low-resource cases, and applies them in an iterative loop where the best transcripts augment the language model. This is a straightforward extension to Somali for a UN humanitarian keyword-spotting use case, and the abstract reports the numbers cleanly enough to show the setup is practical rather than theoretical.

Referee Report

1 major / 2 minor

Summary. The paper claims that starting from a 1.57-hour seed of transcribed Somali speech, three iterative semi-supervised passes using TDNN-F acoustic models on 17.55 hours of untranscribed data (filtered by decoder confidence) plus augmentation of the language model with the best-pass transcripts yields a 7.74% relative WER reduction on held-out test data and a 6.55% reduction in LM perplexity relative to a baseline trained only on the seed.

Significance. If the reported gains prove robust, the result would be useful for extremely low-resource ASR, particularly in humanitarian keyword-spotting applications. The work gives explicit credit to the effectiveness of TDNN-F models in data-scarce regimes and demonstrates a practical three-pass iterative procedure with a simple confidence filter.

major comments (1)

[Experiments / Results] Experiments / Results section: the central 7.74% relative WER claim rests on adding 17.55 h of confidence-filtered automatic transcripts, yet the manuscript reports neither WER nor phone error rate on the retained pseudo-labels themselves nor an ablation that adds an equal volume of unfiltered or randomly sampled data. Without these controls it remains possible that the observed gain is an artifact of increased training volume rather than label quality.

minor comments (2)

[Abstract and §4] Abstract and §4: the statements that the improvements are “significant” are not accompanied by any statistical significance test or confidence interval on the WER difference.
[§3.2] §3.2: the exact data partitions (how the 17.55 h were selected from the larger untranscribed pool, train/dev/test splits) are described only at a high level; a table listing hours per subset would improve reproducibility.
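The missing significance check could take the form of a paired bootstrap over test utterances. The Python sketch below is one possible approach, not a procedure described in the paper: it assumes per-utterance error counts for two systems and per-utterance reference lengths are available, and all names and defaults are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_wer_difference(
    errors_a: Sequence[int],
    errors_b: Sequence[int],
    ref_lengths: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval for WER_a - WER_b.

    Utterances are resampled with replacement, and the same resample is
    scored for both systems so that the comparison stays paired.
    Reference lengths are assumed to be positive.
    """
    n = len(ref_lengths)
    if n == 0 or len(errors_a) != n or len(errors_b) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        diffs.append(wer_a - wer_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

If the resulting interval excludes zero, the WER difference is unlikely to be a resampling artefact. With a test set of only about ten minutes, the interval would be expected to be wide.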

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the constructive comment regarding our experimental controls. We address the major comment below.

Referee: [Experiments / Results] Experiments / Results section: the central 7.74% relative WER claim rests on adding 17.55 h of confidence-filtered automatic transcripts, yet the manuscript reports neither WER nor phone error rate on the retained pseudo-labels themselves nor an ablation that adds an equal volume of unfiltered or randomly sampled data. Without these controls it remains possible that the observed gain is an artifact of increased training volume rather than label quality.

Authors: We agree that the manuscript does not report WER or phone error rate on the retained pseudo-labels, nor does it include an ablation comparing the confidence-filtered data against an equal volume of unfiltered or randomly sampled transcripts. This is a valid observation, and such controls would strengthen the claim that gains arise from label quality rather than data volume alone. The work emphasizes the practical iterative procedure with confidence thresholding in an extremely low-resource setting, where the three passes yield progressive improvements. In the revised manuscript we will add a discussion paragraph in the Experiments section explicitly acknowledging this limitation and noting that the observed gains are consistent with the design of the confidence filter.

Revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical results on held-out data

The paper reports measured WER and perplexity improvements from adding 17.55 h of confidence-filtered automatic transcripts to a 1.57 h seed set for TDNN-F training and LM augmentation. All claims are direct experimental outcomes on held-out test data; no equations, fitted parameters renamed as predictions, self-citations, or derivations are present that reduce to the inputs by construction. The method is iterative semi-supervised training, but the reported 7.74% relative gain is an external measurement, not a self-referential quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no model equations, free parameters, or invented entities are specified. Standard ASR assumptions such as the validity of WER as a metric and the usefulness of TDNN-F are implicit but not detailed.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

[1]

Surveys conducted by the United Nations (UN) in places lacking sufficient internet infrastructure indicate that this function is fulfilled by radio phone-in shows [4–6]

Introduction. In countries with a well established internet infrastructure, social media has become an accepted platform for sharing opinions and concerns [1–3]. Surveys conducted by the United Nations (UN) in places lacking sufficient internet infrastructure indicate that this function is fulfilled by radio phone-in shows [4–6]. Therefore, to support ...

[2]

Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training

Radio browsing system. Figure 1 [8] shows the components of the radio browsing system. The preprocessed audio stream is passed to the ASR system which generates lattices which are subsequently searched for predefined keywords. Human analysts further process the data which aid in humanitarian decision making and situational awareness. This system is curr...

[3]

Manually transcribed acoustic data. The Somali acoustic training and test data used in our experiments is described in Table 1

Acoustic and text data. 3.1. Manually transcribed acoustic data. The Somali acoustic training and test data used in our experiments is described in Table 1. This small dataset of speech captured from broadcast Somali radio phone-in programmes contains only 1.57 hours of transcribed speech that is available for training and 10 minutes for testing. Table 1...

[4]

As we only have less than two hours of transcribed Somali acoustic data, increasing the pool of in-domain data by semi-supervised training was an attractive option

Semi-supervised training. It has been shown that semi-supervised training can improve ASR performance in an under-resourced scenario [14, 15]. As we only have less than two hours of transcribed Somali acoustic data, increasing the pool of in-domain data by semi-supervised training was an attractive option. To test this, we used a recently-acquired corpus c...

[5]

The vocabulary of the ASR system was drawn from the pool of T1, T2 and T3 texts in Table 3 by retaining all word types occurring at least four times

Language modelling. All language models were built using the SRILM toolkit [19]. The vocabulary of the ASR system was drawn from the pool of T1, T2 and T3 texts in Table 3 by retaining all word types occurring at least four times. The resulting vocabulary consisted of 41.7k word types. The language model used in [9] was used as the baseline (LMbase). This ...

[6]

All experiments were performed using a PC with an 8-core Intel i7 CPU, 32GB of RAM and a 12GB NVIDIA Tesla GPU

Acoustic modelling. The Kaldi speech recognition toolkit was used for all ASR experiments [20]. All experiments were performed using a PC with an 8-core Intel i7 CPU, 32GB of RAM and a 12GB NVIDIA Tesla GPU. In our previous work, we found multilingual training to improve ASR performance substantially [9]. Table 4: Perplexities of the evaluated language m...

[7]

In comparison with our previous ASR system [9], the improvement afforded by TDNN-F is clear (rows 1 and 2)

Results and discussion. The ASR performance is reported in Table 5 in terms of the word error rate (WER) for the various training approaches. In comparison with our previous ASR system [9], the improvement afforded by TDNN-F is clear (rows 1 and 2). Even though TDNN-F uses only half the number of parameters as CNN-TDNN-BLSTM, it is able to offer better ...

[8]

A training corpus of only 1.57 hours of in-domain segmented and transcribed Somali radio broadcast speech data was available

Conclusion. We have presented our initial efforts to increase the pool of Somali acoustic and language model data in a semi-supervised manner in an effort to improve automatic speech recognition for Somali. A training corpus of only 1.57 hours of in-domain segmented and transcribed Somali radio broadcast speech data was available. A further 17.55 hours o...

[9]

We also gratefully acknowledge the support of Telkom South Africa

Acknowledgements. We thank the NVIDIA corporation for the donation of GPU equipment used for this research. We also gratefully acknowledge the support of Telkom South Africa

[10]

A human-machine collaborative system for identifying rumors on Twitter

S. Vosoughi and D. Roy, “A human-machine collaborative system for identifying rumors on Twitter,” in Proc. ICDMW, 2015

[11]

Social media analysis for e-health and medical purposes

K. Wegrzyn-Wolska, L. Bougueroua, and G. Dziczkowski, “Social media analysis for e-health and medical purposes,” in Proc. CASoN, 2011

[12]

Machine classification and analysis of suicide related communication on Twitter

P. Burnap, G. Colombo, and J. Scourfield, “Machine classification and analysis of suicide related communication on Twitter,” in Proc. ACM-HT, 2015

[13]

Analyzing attitudes towards contraception and teenage pregnancy using social data

G. P. P. Series, “Analyzing attitudes towards contraception and teenage pregnancy using social data,” Global Pulse Project Series, no. 8, 2014

[14]

Mining citizen feedback data for enhanced local government decision-making

G. P. P. Series, “Mining citizen feedback data for enhanced local government decision-making,” Global Pulse Project Series, no. 16, 2015

[15]

Understanding immunisation awareness and sentiment through social and mainstream media

G. P. P. Series, “Understanding immunisation awareness and sentiment through social and mainstream media,” Global Pulse Project Series, no. 19, 2015

[16]

Radio-browsing for developmental monitoring in Uganda

R. Menon, A. Saeb, H. Cameron, W. Kibira, J. Quinn, and T. Niesler, “Radio-browsing for developmental monitoring in Uganda,” in Proc. ICASSP, 2017

[17]

Very low resource radio browsing for agile developmental and humanitarian monitoring

A. Saeb, R. Menon, H. Cameron, W. Kibira, J. Quinn, and T. Niesler, “Very low resource radio browsing for agile developmental and humanitarian monitoring,” in Proc. Interspeech, 2017

[18]

Automatic speech recognition for humanitarian applications in Somali

R. Menon, A. Biswas, A. Saeb, J. Quinn, and T. Niesler, “Automatic speech recognition for humanitarian applications in Somali,” in Proc. SLTU, 2018

[19]

Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech

A. Biswas, F. de Wet, E. van der Westhuizen, E. Yılmaz, and T. Niesler, “Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech,” in Proc. Interspeech, 2018

[20]

Multilingual training of deep neural networks

A. Ghoshal, P. Swietojanski, and S. Renals, “Multilingual training of deep neural networks,” in Proc. ICASSP, 2013, pp. 7319–7323

[21]

Automatic transcription of Somali language

N. Addillahi, N. Pascal, and B. Jean-Francois, “Automatic transcription of Somali language,” in Proc. Interspeech, 2006

[22]

Semi-orthogonal low-rank matrix factorization for deep neural networks

D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Proc. Interspeech, 2018, pp. 3743–3747

[23]

Semi-supervised acoustic model training for speech with code-switching

E. Yılmaz, M. McLaren, H. van den Heuvel, and D. A. van Leeuwen, “Semi-supervised acoustic model training for speech with code-switching,” Speech Communication, vol. 105, pp. 12–22, 2018

[24]

Semi-supervised learning for speech recognition in the context of accent adaptation

U. Nallasamy, F. Metze, and T. Schultz, “Semi-supervised learning for speech recognition in the context of accent adaptation,” in Symposium on Machine Learning in Speech and Language Processing, 2012, pp. 13–17

[25]

Deep neural network features and semi-supervised training for low resource speech recognition

S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in Proc. ICASSP, 2013, pp. 6704–6708

[26]

Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system

H. Kamper, F. de Wet, T. Hain, and T. Niesler, “Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system,” Computer Speech and Language, vol. 28, no. 6, pp. 1255–1268, 2014

[27]

Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages

D. Goldhahn, T. Eckart, and U. Quasthoff, “Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages,” in Proc. LREC, vol. 29, 2012, pp. 31–43

[28]

SRILM - an extensible language modeling toolkit

A. Stolcke, “SRILM - an extensible language modeling toolkit,” in Proc. ICSLP, 2002

[29]

The Kaldi speech recognition toolkit

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011

[30]

Audio augmentation for speech recognition

T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. Interspeech, 2015

[31]

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI

D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI,” in Proc. Interspeech, 2016, pp. 2751–2755
