From Speech to Text Corpora: Evaluating ASR-Based Data Acquisition for Low-Resource Fongbe and Hausa
Pith reviewed 2026-06-26 11:29 UTC · model grok-4.3
The pith
Fine-tuning ASR on 12 hours of Fongbe speech cuts benchmark errors by 78 percent while keeping tone marks, and Hausa video processing yields 6,770 segments rated near usable by humans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ASR pipelines can extend text resources for low-resource West African languages. Fine-tuning MMS-300M on 12.3 hours of Fongbe data achieves 9.48 percent WER on the ALFFA benchmark, an improvement that preserves tonal diacritics. Processing 45.49 hours of Hausa video with an existing Whisper-Small model yields 6,770 transcribed segments whose human quality scores average 57.4 out of 100 for Hausa and 36.5 out of 100 for Fongbe, showing Hausa output approaches acceptable quality for corpus construction while Fongbe needs further refinement.
What carries the argument
ASR data-acquisition pipeline that fine-tunes existing models on limited curated speech, transcribes selected video, and measures output quality with both word error rate and human scoring on random samples.
If this is right
- Text corpora for these languages can grow from existing speech sources without starting from full manual transcription.
- Targeted fine-tuning allows models to retain language-specific features such as tonal markings during transcription.
- Video sources supply scalable raw data, but domain balance and compute limits shape which content gets processed.
- Human scoring remains essential to judge whether automatic transcripts meet production standards for corpus building.
Where Pith is reading between the lines
- The same pipeline could be tested on other tonal or diacritic-rich African languages to check how widely the quality patterns hold.
- The produced segments could be fed back as additional training data to improve the ASR models in later rounds.
- Expanding the video catalog beyond the 424 selected items might change the balance between usable and lower-quality output.
Load-bearing premise
The 50 sampled segments per language accurately reflect the quality and domain coverage of the full 6,770-segment corpus produced from the chosen videos.
What would settle it
A complete human review of all 6,770 segments that finds average quality scores below 40 out of 100, or new test data where the fine-tuned Fongbe model drops tonal diacritics at rates close to the original baseline.
Figures
read the original abstract
Low-resource African languages lack text corpora needed for language model training. We investigate whether ASR pipelines can extend text resources for two typologically distinct West African languages: Fongbe (tonal, diacritic-rich) and Hausa (non-tonal). We fine-tune MMS-300M on a curated 12.3-hour Fongbe dataset, achieving 9.48% WER on the ALFFA benchmark - a 78% relative reduction from the prior 44.04% baseline - while preserving tonal diacritics critical to the language. For Hausa, we apply an existing fine-tuned Whisper-Small model. We catalog 1,553 YouTube videos (236 hours) and process a subset of 424 videos (45.49 hours) selected to balance domain diversity with available computational resources, producing 6,770 transcribed segments. Human evaluation on 50 randomly sampled segments per language shows mean quality scores of 57.4/100 for Hausa and 36.5/100 for Fongbe, indicating that while Hausa transcriptions approach acceptable quality for corpus construction, Fongbe transcriptions require post-processing or improved models for production use. We release the curated dataset, fine-tuned model, transcribed corpus, and full video catalog following platform terms and ethical guidelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes an investigation into using automatic speech recognition (ASR) to generate text corpora for the low-resource languages Fongbe and Hausa. For Fongbe, the authors fine-tune the MMS-300M model on a 12.3-hour curated dataset, reporting 9.48% word error rate (WER) on the ALFFA benchmark, representing a 78% relative reduction from the 44.04% baseline, and note preservation of tonal diacritics. For Hausa, they use a fine-tuned Whisper-Small model on 45.49 hours of selected YouTube videos (from a catalog of 1,553 videos totaling 236 hours), resulting in 6,770 transcribed segments. Human evaluation on 50 randomly sampled segments per language yields mean quality scores of 57.4/100 for Hausa and 36.5/100 for Fongbe, leading to the assessment that Hausa transcriptions are approaching usability for corpus construction while Fongbe requires additional post-processing. The authors commit to releasing the curated dataset, fine-tuned model, transcribed corpus, and full video catalog.
Significance. If the reported results and human evaluations hold under scrutiny, this paper makes a meaningful contribution to low-resource language technology by outlining a practical, video-based data acquisition pipeline for African languages. The substantial WER improvement for Fongbe and the scale of the Hausa corpus (6,770 segments) are notable. The commitment to open release of data and models following ethical guidelines strengthens the work's potential impact and reproducibility. This approach could serve as a template for other low-resource settings where video data is abundant but transcribed text is scarce.
major comments (2)
- Abstract: The usability conclusion that Hausa transcriptions approach acceptable quality for corpus construction (57.4/100) while Fongbe requires post-processing (36.5/100) rests on mean quality scores from a 50-segment human evaluation sample drawn from the 6,770-segment corpus. The video subset (424/1,553) was chosen to balance domain diversity against compute, yet no stratification by domain, score variance, or inter-annotator agreement is reported. If transcription quality correlates with domain or video-level factors, the aggregate means do not reliably support the corpus-wide claims.
- Abstract: The 78% relative WER reduction for Fongbe (from 44.04% to 9.48% on ALFFA) is a strong numeric result, but the manuscript must clarify whether the 12.3-hour fine-tuning set is disjoint from the ALFFA test data and provide the exact fine-tuning configuration to allow independent verification of the improvement and the diacritic preservation claim.
minor comments (1)
- The abstract mentions cataloging 1,553 videos but processes only a subset; a methods section should explicitly state the selection algorithm and any domain balancing criteria to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate clarifications and additional details where needed to strengthen the presentation of results.
read point-by-point responses
-
Referee: Abstract: The usability conclusion that Hausa transcriptions approach acceptable quality for corpus construction (57.4/100) while Fongbe requires post-processing (36.5/100) rests on mean quality scores from a 50-segment human evaluation sample drawn from the 6,770-segment corpus. The video subset (424/1,553) was chosen to balance domain diversity against compute, yet no stratification by domain, score variance, or inter-annotator agreement is reported. If transcription quality correlates with domain or video-level factors, the aggregate means do not reliably support the corpus-wide claims.
Authors: We agree that the current reporting of only aggregate mean scores from a random sample of 50 segments per language provides limited support for corpus-wide claims, particularly without details on variance, inter-annotator agreement, or domain stratification. In the revision we will expand the evaluation section to include score variance (or standard deviation), clarify the number of annotators and any agreement metrics if available, discuss the random sampling approach in the context of domain-balanced video selection, and explicitly note limitations of the evaluation. This will better contextualize the usability conclusions for Hausa versus the need for post-processing in Fongbe. revision: yes
-
Referee: Abstract: The 78% relative WER reduction for Fongbe (from 44.04% to 9.48% on ALFFA) is a strong numeric result, but the manuscript must clarify whether the 12.3-hour fine-tuning set is disjoint from the ALFFA test data and provide the exact fine-tuning configuration to allow independent verification of the improvement and the diacritic preservation claim.
Authors: The 12.3-hour curated Fongbe dataset was collected and prepared independently of the ALFFA benchmark and has no overlap with its test set; we will state this explicitly in the revised manuscript. We will also add the exact fine-tuning configuration (hyperparameters, training procedure, data preprocessing steps, and handling of tonal diacritics) to the methods section to support independent verification of the WER improvement and diacritic preservation. revision: yes
Circularity Check
No significant circularity; empirical metrics rest on external benchmark and independent human ratings
full rationale
The reported WER (9.48% on ALFFA) is measured against an external benchmark with a stated prior baseline, and human quality scores (57.4/100 and 36.5/100) are obtained from independent raters on randomly sampled segments. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation chain. Video subset selection for domain/compute balance and the 50-segment sampling are methodological choices whose representativeness is a validity question, not a circular reduction of outputs to inputs. The paper is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Alabi, Shamsuddeen Hassan Muham- mad, Peter Nabende, et al
David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba O. Alabi, Shamsuddeen Hassan Muham- mad, Peter Nabende, et al. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition. InProceedings of the 2023 Conference on Empirical Methods in Nat- ural Lan...
2023
-
[2]
doi: 10.18653/v1/2023.emnlp-main.294. Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. AfroXLMR: Scaling Multilingual Pretraining for African Languages. InProceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), pages 7031–7041, Marseille, France,
-
[3]
Jesujoba O
European Language Resources Association. Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, and Junichi Yamagishi. AfriHuBERT: A Self-Supervised Speech Representation Model for African Languages. InInterspeech 2025, pages 4023–4027,
2025
-
[4]
Alabi and Xuechen Liu and Dietrich Klakow and Junichi Yamagishi , year =
doi: 10.21437/Interspeech.2025-1437. Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A Massively-Multilingual Speech Corpus. Mozilla Foundation.https://commonvoice.mozilla. org/,
-
[5]
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale.arXiv preprint arXiv:2111.09296,
-
[6]
Cheikh M
DOI to be added upon data release. Cheikh M. Bamba Dione, David Ifeoluwa Adelani, Peter Nabende, Jesujoba O. Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, et al. MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages. In Proceedings of the 61st Annual Meeting of the Association for Compu...
2023
-
[7]
DavidM.Eberhard, GaryF.Simons, andCharlesD.Fennig
doi: 10.18653/v1/2023.acl-long.609. DavidM.Eberhard, GaryF.Simons, andCharlesD.Fennig. Fongbe: ALanguageofBenin. Eth- nologue: Languages of the World, 27th edition. SIL International.https://www.ethnologue. com/language/fon/, 2024a. David M. Eberhard, Gary F. Simons, and Charles D. Fennig. Hausa: A Language of Nige- ria. Ethnologue: Languages of the World...
-
[8]
Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: A Case Study of Wolof, Fongbe, Swahili and Amharic
9 Elodie Gauthier, Laurent Besacier, Sylvie Voisin, Michael Melese, and Uriel Pascal Elingui. Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: A Case Study of Wolof, Fongbe, Swahili and Amharic. InProceedings of the 10th Language Resources and Evaluation Conference (LREC 2016), Portorož, Slovenia,
2016
-
[9]
European Lan- guage Resources Association. Sukairaj Hafiz Imam, Tadesse Destaw Belay, Kedir Yassin Husse, Ibrahim Said Ahmad, Idris Abdulmumin, Hadiza Ali Umar, Muhammad Yahuza Bello, Joyce Nakatumba-Nabende, Seid Muhie Yimam, and Shamsuddeen Hassan Muhammad. Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Re...
-
[10]
ELRA and ICCL. D. Fortuné Kponou, Salima Mdhaffar, Fréjus A. A. Laleye, Eugène Cokou Ezin, and Yannick Estève. FFSTC 2: Extending the Fongbe to French Speech Translation Corpus. InProceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 145–152, Vienna, Austria,
2025
-
[11]
doi: 10.18653/ v1/2025.iwslt-1.13
Association for Computational Linguistics. doi: 10.18653/ v1/2025.iwslt-1.13. Fréjus A. A. Laleye, Eugène C. Ezin, and Cina Motamed. Fongbe Speech Recognition: A Pilot Study. In2016 Federated Conference on Computer Science and Information Systems (FedCSIS), pages 235–238. IEEE,
2025
-
[12]
NationalCentreforArtificialIntelligenceandRobotics
doi: 10.15439/2016F172. NationalCentreforArtificialIntelligenceandRobotics. NCAIRHausaASRModel. HuggingFace Model Hub,
-
[13]
Association for Computational Linguistics. doi: 10.18653/v1/ 2021.mrl-1.11. Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tober, Changhan Babu, Sayahna Kunber, Ali Elkahky, Zhaoheng Huang, Alexei Dvorber, Mohamed Gro, et al. Scaling Speech Technology to 1,000+ Languages.arXiv preprint arXiv:2305.13516,
-
[14]
Robust Speech Recognition via Large-Scale Weak Supervision.arXiv preprint arXiv:2212.04356,
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust Speech Recognition via Large-Scale Weak Supervision.arXiv preprint arXiv:2212.04356,
-
[15]
Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Aremu Anuoluwapo, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, and Benjamin Rosman. InkubaLM: A Small Language Model for Low-Resource African Lan- guages.arXiv preprint arXiv:2408.17024,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.