From Speech to Text Corpora: Evaluating ASR-Based Data Acquisition for Low-Resource Fongbe and Hausa

Mahounan Pericles Adjovi; Prasenjit Mitra; Roald Eiselen; Victor Olufemi

arXiv: 2606.22274 · v1 · pith:FPD4LHEXnew · submitted 2026-06-20 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-26 11:29 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords automatic speech recognition · low-resource languages · Fongbe · Hausa · text corpus creation · West African languages · data acquisition · machine learning

The pith

Fine-tuning an ASR model on 12.3 hours of Fongbe speech cuts word error rate on the ALFFA benchmark by 78 percent (relative) while keeping tone marks, and a YouTube transcription pipeline yields 6,770 segments, with Hausa output rated near usable by humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Low-resource African languages lack the text data needed to train language models. The work tests whether automatic speech recognition pipelines can convert existing audio and video into usable text for Fongbe, a tonal language with diacritics, and Hausa. Fine-tuning the MMS-300M model on a 12.3-hour Fongbe set reaches 9.48 percent word error rate on the ALFFA benchmark, down from 44.04 percent, and keeps the tone markers intact. Transcribing 45.49 hours of selected YouTube videos produces 6,770 text segments. On 50 sampled segments per language, human-rated quality averages 57.4 out of 100 for Hausa, suggesting the output is close to ready for corpus use, and 36.5 out of 100 for Fongbe, indicating more post-processing is still required.

Core claim

ASR pipelines can extend text resources for low-resource West African languages. Fine-tuning MMS-300M on 12.3 hours of Fongbe data achieves 9.48 percent WER on the ALFFA benchmark, a 78 percent relative reduction from the prior 44.04 percent baseline that also preserves tonal diacritics. Processing 45.49 hours of selected YouTube video (424 of 1,553 cataloged videos) yields 6,770 transcribed segments, with Hausa transcribed by an existing fine-tuned Whisper-Small model. Human quality scores on 50 sampled segments per language average 57.4 out of 100 for Hausa and 36.5 out of 100 for Fongbe, showing Hausa output approaches acceptable quality for corpus construction while Fongbe needs further refinement.
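The headline figure is a relative reduction, not an absolute one. A minimal sketch of the arithmetic, using the 44.04 percent baseline and 9.48 percent fine-tuned WER reported in the abstract (the function name is illustrative, not from the paper):

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the reduction of `new` relative to `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# WER figures as reported in the abstract, in percent.
baseline_wer = 44.04
finetuned_wer = 9.48

absolute_drop = baseline_wer - finetuned_wer  # percentage points
relative_drop = relative_reduction(baseline_wer, finetuned_wer)

print(f"absolute drop: {absolute_drop:.2f} percentage points")
print(f"relative drop: {relative_drop:.1%}")
```

The absolute drop is about 34.6 percentage points; the relative drop is roughly 78 percent, which matches the figure the abstract quotes.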

What carries the argument

ASR data-acquisition pipeline that fine-tunes existing models on limited curated speech, transcribes selected video, and measures output quality with both word error rate and human scoring on random samples.
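Word error rate, the automatic metric in this pipeline, is conventionally the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below shows that standard definition. It is not the authors' evaluation code, and it assumes whitespace tokenization and no text normalization, neither of which the abstract specifies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real Fongbe or Hausa text.
print(word_error_rate("a b c d", "a x c"))  # one substitution plus one deletion
```

Because the comparison is on exact word strings, a word that loses a tone mark counts as a substitution, so under this definition diacritic errors are already penalized by WER.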

If this is right

Text corpora for these languages can grow from existing speech sources without starting from full manual transcription.
Targeted fine-tuning allows models to retain language-specific features such as tonal markings during transcription.
Video sources supply scalable raw data, but domain balance and compute limits shape which content gets processed.
Human scoring remains essential to judge whether automatic transcripts meet production standards for corpus building.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be tested on other tonal or diacritic-rich African languages to check how widely the quality patterns hold.
The produced segments could be fed back as additional training data to improve the ASR models in later rounds.
Expanding the video catalog beyond the 424 selected items might change the balance between usable and lower-quality output.

Load-bearing premise

The 50 sampled segments per language accurately reflect the quality and domain coverage of the full 6,770-segment corpus produced from the chosen videos.

What would settle it

A complete human review of all 6,770 segments that finds average quality scores below 40 out of 100, or new test data where the fine-tuned Fongbe model drops tonal diacritics at rates close to the original baseline.
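One way to put a number on "drops tonal diacritics" is to compare how many combining marks survive in the model output. The sketch below is a crude proxy under stated assumptions: it assumes tone is written with combining accents that Unicode NFD decomposition exposes, it counts marks without aligning words, and the function names are hypothetical. The paper's own diacritic check is not described in the abstract.

```python
import unicodedata


def count_combining_marks(text: str) -> int:
    """Count nonspacing combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.category(ch) == "Mn")


def diacritic_retention(reference: str, hypothesis: str) -> float:
    """Ratio of combining marks in the hypothesis to those in the reference.

    Values below 1.0 suggest dropped marks; values above 1.0 suggest
    spurious marks. This is a count ratio, not an aligned accuracy.
    """
    ref_marks = count_combining_marks(reference)
    if ref_marks == 0:
        raise ValueError("reference contains no combining marks")
    return count_combining_marks(hypothesis) / ref_marks


# Generic accented placeholders, not real Fongbe text.
print(diacritic_retention("é à ó", "e à o"))
```

A stricter check would align reference and hypothesis words first, since a count ratio cannot tell a retained mark from one placed on the wrong vowel.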

Original abstract

Low-resource African languages lack text corpora needed for language model training. We investigate whether ASR pipelines can extend text resources for two typologically distinct West African languages: Fongbe (tonal, diacritic-rich) and Hausa (non-tonal). We fine-tune MMS-300M on a curated 12.3-hour Fongbe dataset, achieving 9.48% WER on the ALFFA benchmark - a 78% relative reduction from the prior 44.04% baseline - while preserving tonal diacritics critical to the language. For Hausa, we apply an existing fine-tuned Whisper-Small model. We catalog 1,553 YouTube videos (236 hours) and process a subset of 424 videos (45.49 hours) selected to balance domain diversity with available computational resources, producing 6,770 transcribed segments. Human evaluation on 50 randomly sampled segments per language shows mean quality scores of 57.4/100 for Hausa and 36.5/100 for Fongbe, indicating that while Hausa transcriptions approach acceptable quality for corpus construction, Fongbe transcriptions require post-processing or improved models for production use. We release the curated dataset, fine-tuned model, transcribed corpus, and full video catalog following platform terms and ethical guidelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives concrete WER numbers for Fongbe ASR fine-tuning and a YouTube pipeline that yields 6,770 segments for Hausa and Fongbe, but the quality scores come from a 50-segment sample per language with no variance or agreement statistics.

The main point is that fine-tuning MMS-300M on 12.3 hours of Fongbe data drops WER on ALFFA to 9.48% from a 44.04% baseline while keeping tonal diacritics, and that 45.49 hours of YouTube video are processed into 6,770 segments, with mean human quality scores of 57.4 for Hausa and 36.5 for Fongbe. The authors also release the dataset, model, and video catalog.

What the work does well is run the actual fine-tuning, report the exact numbers, and commit to releasing the outputs under ethical terms. That supplies a usable baseline and starting corpus for two languages that have almost nothing, and the relative WER reduction is large enough to be worth checking.

The soft spot is the human evaluation. The 50 randomly sampled segments per language are too few to support claims about the full 6,770-segment corpus, especially since the 424 videos were selected for domain balance and the sample is reported without stratification or variance. If quality varies by domain or video, the means do not reliably indicate how much of the corpus is actually usable. No inter-annotator agreement is mentioned either.
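To see why a 50-segment mean needs a variance alongside it, here is a minimal sketch of a normal-approximation confidence interval for a sample mean, with an optional finite-population correction. The per-segment scores are not published in the abstract, so the input below is a synthetic placeholder, and the function name and the use of 6,770 as the population size are assumptions for illustration only.

```python
import math
import statistics


def mean_with_ci(scores, population_size=None, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% interval.

    Uses the normal approximation; if `population_size` is given, the
    standard error is shrunk by the finite-population correction.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    std_error = statistics.stdev(scores) / math.sqrt(n)
    if population_size is not None:
        std_error *= math.sqrt((population_size - n) / (population_size - 1))
    return mean, mean - z * std_error, mean + z * std_error


# Synthetic placeholder scores, NOT data from the paper.
hypothetical_scores = [50, 60] * 25
mean, lower, upper = mean_with_ci(hypothetical_scores, population_size=6770)
print(f"mean={mean:.1f}, 95% CI=({lower:.1f}, {upper:.1f})")
```

With only 50 of several thousand segments sampled, the finite-population correction barely changes the interval, so its width is driven almost entirely by the score spread, which is the statistic the report does not give.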

This is for groups building low-resource ASR or text resources for West African languages. It deserves peer review because the core results are specific and the planned release makes them verifiable, even if the evaluation needs more detail on sampling and agreement.

Referee Report

2 major / 1 minor

Summary. The manuscript describes an investigation into using automatic speech recognition (ASR) to generate text corpora for the low-resource languages Fongbe and Hausa. For Fongbe, the authors fine-tune the MMS-300M model on a 12.3-hour curated dataset, reporting 9.48% word error rate (WER) on the ALFFA benchmark, representing a 78% relative reduction from the 44.04% baseline, and note preservation of tonal diacritics. For Hausa, they use a fine-tuned Whisper-Small model on 45.49 hours of selected YouTube videos (from a catalog of 1,553 videos totaling 236 hours), resulting in 6,770 transcribed segments. Human evaluation on 50 randomly sampled segments per language yields mean quality scores of 57.4/100 for Hausa and 36.5/100 for Fongbe, leading to the assessment that Hausa transcriptions are approaching usability for corpus construction while Fongbe requires additional post-processing. The authors commit to releasing the curated dataset, fine-tuned model, transcribed corpus, and full video catalog.

Significance. If the reported results and human evaluations hold under scrutiny, this paper makes a meaningful contribution to low-resource language technology by outlining a practical, video-based data acquisition pipeline for African languages. The substantial WER improvement for Fongbe and the scale of the Hausa corpus (6,770 segments) are notable. The commitment to open release of data and models following ethical guidelines strengthens the work's potential impact and reproducibility. This approach could serve as a template for other low-resource settings where video data is abundant but transcribed text is scarce.

major comments (2)

Abstract: The usability conclusion that Hausa transcriptions approach acceptable quality for corpus construction (57.4/100) while Fongbe requires post-processing (36.5/100) rests on mean quality scores from a 50-segment human evaluation sample drawn from the 6,770-segment corpus. The video subset (424/1,553) was chosen to balance domain diversity against compute, yet no stratification by domain, score variance, or inter-annotator agreement is reported. If transcription quality correlates with domain or video-level factors, the aggregate means do not reliably support the corpus-wide claims.
Abstract: The 78% relative WER reduction for Fongbe (from 44.04% to 9.48% on ALFFA) is a strong numeric result, but the manuscript must clarify whether the 12.3-hour fine-tuning set is disjoint from the ALFFA test data and provide the exact fine-tuning configuration to allow independent verification of the improvement and the diacritic preservation claim.
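A first-pass check for the disjointness question in the second comment can be done on transcripts alone. The sketch below flags test utterances whose normalized text also appears in the training set. The function names and the normalization choices (NFC, case folding, whitespace collapsing) are assumptions, and a text match says nothing about shared audio or speakers, which would need a separate check.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode form, case, and whitespace so equal texts compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def overlapping_transcripts(train_texts, test_texts):
    """Return the sorted normalized transcripts present in both sets."""
    train_set = {normalize(t) for t in train_texts}
    test_set = {normalize(t) for t in test_texts}
    return sorted(train_set & test_set)


# Placeholder strings, not real corpus content.
train = ["Hello  World", "foo bar"]
test = ["hello world", "baz"]
print(overlapping_transcripts(train, test))
```

An empty result is necessary but not sufficient evidence of a clean split, since read-speech corpora can share prompts across different recordings.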

minor comments (1)

The abstract mentions cataloging 1,553 videos but processes only a subset; a methods section should explicitly state the selection algorithm and any domain balancing criteria to support reproducibility.
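For reproducibility, a selection or evaluation sample can be made explicit with a seeded, per-domain draw. The sketch below is one possible scheme, not the authors' procedure: the domain labels, field names, and per-stratum quota are hypothetical, since the abstract states only that the subset balances domain diversity against compute.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    sample = {}
    for stratum in sorted(groups):  # sorted so the draw order is deterministic
        members = groups[stratum]
        sample[stratum] = rng.sample(members, min(per_stratum, len(members)))
    return sample


# Hypothetical catalog entries with made-up domain labels.
catalog = [
    {"id": i, "domain": "domain_a" if i % 3 else "domain_b"} for i in range(30)
]
picked = stratified_sample(catalog, key=lambda v: v["domain"], per_stratum=5, seed=42)
print({domain: [v["id"] for v in videos] for domain, videos in picked.items()})
```

Reporting the seed, the strata, and the quota alongside the results would let a reader regenerate the same subset from the released catalog.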

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate clarifications and additional details where needed to strengthen the presentation of results.

Referee: Abstract: The usability conclusion that Hausa transcriptions approach acceptable quality for corpus construction (57.4/100) while Fongbe requires post-processing (36.5/100) rests on mean quality scores from a 50-segment human evaluation sample drawn from the 6,770-segment corpus. The video subset (424/1,553) was chosen to balance domain diversity against compute, yet no stratification by domain, score variance, or inter-annotator agreement is reported. If transcription quality correlates with domain or video-level factors, the aggregate means do not reliably support the corpus-wide claims.

Authors: We agree that the current reporting of only aggregate mean scores from a random sample of 50 segments per language provides limited support for corpus-wide claims, particularly without details on variance, inter-annotator agreement, or domain stratification. In the revision we will expand the evaluation section to include score variance (or standard deviation), clarify the number of annotators and any agreement metrics if available, discuss the random sampling approach in the context of domain-balanced video selection, and explicitly note limitations of the evaluation. This will better contextualize the usability conclusions for Hausa versus the need for post-processing in Fongbe. (Revision: yes.)
Referee: Abstract: The 78% relative WER reduction for Fongbe (from 44.04% to 9.48% on ALFFA) is a strong numeric result, but the manuscript must clarify whether the 12.3-hour fine-tuning set is disjoint from the ALFFA test data and provide the exact fine-tuning configuration to allow independent verification of the improvement and the diacritic preservation claim.

Authors: The 12.3-hour curated Fongbe dataset was collected and prepared independently of the ALFFA benchmark and has no overlap with its test set; we will state this explicitly in the revised manuscript. We will also add the exact fine-tuning configuration (hyperparameters, training procedure, data preprocessing steps, and handling of tonal diacritics) to the methods section to support independent verification of the WER improvement and diacritic preservation. (Revision: yes.)

Circularity Check

0 steps flagged

No significant circularity; empirical metrics rest on external benchmark and independent human ratings

The reported WER (9.48% on ALFFA) is measured against an external benchmark with a stated prior baseline, and human quality scores (57.4/100 and 36.5/100) are obtained from independent raters on randomly sampled segments. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation chain. Video subset selection for domain/compute balance and the 50-segment sampling are methodological choices whose representativeness is a validity question, not a circular reduction of outputs to inputs. The paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard assumptions that fine-tuning ASR models on limited data generalizes and that small human samples reflect corpus quality.

