A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development
Pith reviewed 2026-05-25 00:29 UTC · model grok-4.3
The pith
Hausa has broader text resource diversity than Fongbe while both languages show specific gaps in speech and domain coverage for NLP.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The survey finds that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains, whereas Fongbe has more limited text resources but has been the focus of recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. For parallel corpora, monolingual text, speech datasets, pre-trained models, and evaluation benchmarks, the catalog records size, domain coverage, format, licensing, and accessibility, and it uses these records to make concrete recommendations on the remaining gaps.
What carries the argument
The systematic catalog of resources by size, domain coverage, format, licensing, and accessibility, which establishes the contrast in availability and identifies the priority gaps.
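To make the catalog idea concrete, here is a minimal sketch of how one entry and a per-language domain summary could be represented. The class name `ResourceEntry`, its field names, the helper `domain_coverage`, and the placeholder rows are all assumptions made for illustration; the paper does not publish a schema, and none of the rows describes a real dataset.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceEntry:
    """One catalog row. Field names are illustrative, not the paper's schema."""
    name: str
    language: str        # e.g. "Hausa" or "Fongbe"
    resource_type: str   # "parallel", "monolingual", "speech", "model", "benchmark"
    size: str            # free text, since units differ (sentences, hours, tokens)
    domains: frozenset = field(default_factory=frozenset)
    data_format: str = "unknown"
    license: str = "unknown"
    accessible: bool = False


def domain_coverage(entries):
    """Map each (language, resource_type) pair to the set of domains it covers."""
    coverage = defaultdict(set)
    for entry in entries:
        coverage[(entry.language, entry.resource_type)].update(entry.domains)
    return dict(coverage)


# Placeholder rows, invented purely to exercise the code.
catalog = [
    ResourceEntry("placeholder-A", "Hausa", "monolingual", "n/a",
                  frozenset({"news", "encyclopedic"})),
    ResourceEntry("placeholder-B", "Hausa", "monolingual", "n/a",
                  frozenset({"educational"})),
    ResourceEntry("placeholder-C", "Fongbe", "speech", "n/a",
                  frozenset({"read speech"})),
]

coverage = domain_coverage(catalog)
```

A (language, resource type) pair that is absent from `coverage`, or that maps to a single domain, is the kind of signal the survey reports as a gap.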
If this is right
- NLP developers can target domain-diverse Fongbe text collection as a high-priority next step.
- Dedicated Hausa speech corpora should be created to match the existing text resources.
- Existing Masakhane benchmarks for NER and POS tagging can serve as starting points for further work on both languages.
- Task-specific recommendations guide where to allocate effort in building models and evaluation sets.
Where Pith is reading between the lines
- Closing the identified gaps could enable more robust machine translation or information extraction tools that serve Hausa and Fongbe speakers directly.
- The pattern of one language having stronger text coverage and the other stronger speech data may appear in surveys of additional low-resource languages.
- Public release of the catalog itself could reduce duplication in future data collection projects.
Load-bearing premise
The systematic search of academic repositories, data platforms, and web sources captured a sufficiently complete and up-to-date picture of all publicly available resources without major omissions.
What would settle it
The discovery of a previously unlisted large Fongbe text collection spanning multiple domains or a substantial public Hausa speech corpus that would change which gaps rank as highest priority.
Original abstract
This survey provides a comprehensive catalog of publicly available text and speech resources for two West African languages: Hausa, an Afroasiatic language with approximately 80-100 million speakers, and Fongbe, a Niger-Congo language spoken by approximately 2 million people in Benin. These languages represent contrasting cases on the resource availability spectrum. We address the question: What is the current state of publicly available NLP resources for Hausa and Fongbe, and what gaps remain? Through systematic search of academic repositories, data platforms, and web sources, we catalog parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks. For each resource, we document size, domain coverage, format, licensing, and accessibility. Our findings reveal that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains. Fongbe, while having more limited text resources, has been the focus of recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. We provide task-specific recommendations and identify priority gaps including domain-diverse Fongbe text and dedicated Hausa speech corpora.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys publicly available text and speech resources for Hausa (Afroasiatic, ~80-100M speakers) and Fongbe (Niger-Congo, ~2M speakers) via a systematic search of academic repositories, data platforms, and web sources. It catalogs parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks, documenting size, domain, format, licensing, and accessibility for each. Key findings include broader text diversity for Hausa across news/encyclopedic/educational domains, more limited Fongbe text but recent academic speech collections, representation of both in Masakhane NER/POS benchmarks, and priority gaps in domain-diverse Fongbe text and dedicated Hausa speech corpora.
Significance. If the catalog proves complete and replicable, the survey would provide a useful baseline reference for NLP development in these West African languages, clarifying availability contrasts and directing data collection priorities in low-resource settings.
major comments (1)
- [Methods] Methods section: the description of the 'systematic search' provides no exact search strings, search date(s), inclusion/exclusion criteria, or platform-specific query details. This omission prevents verification of completeness and directly undermines the central claims about relative resource diversity, limitations, and priority gaps between Hausa and Fongbe.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater methodological transparency. We address the single major comment below and will incorporate the requested details in a revised manuscript.
- Referee: [Methods] Methods section: the description of the 'systematic search' provides no exact search strings, search date(s), inclusion/exclusion criteria, or platform-specific query details. This omission prevents verification of completeness and directly undermines the central claims about relative resource diversity, limitations, and priority gaps between Hausa and Fongbe.
Authors: We agree that the Methods section requires additional detail to support replicability and verification of our claims. In the revision we will add: (1) the exact search strings and Boolean combinations used on each platform (Google Scholar, ACL Anthology, arXiv, Hugging Face Datasets, Masakhane repositories, and language-specific web sources); (2) the date ranges of all searches (e.g., searches performed between January and March 2024); (3) explicit inclusion/exclusion criteria (publicly accessible resources with documented size, domain, and licensing; exclusion of private or paywalled data without clear release statements); and (4) platform-specific query examples. These additions will directly enable readers to assess the completeness of the catalog and the robustness of the identified gaps between Hausa and Fongbe resources.
revision: yes
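The inclusion/exclusion criteria named in the response could be applied mechanically, as in the sketch below. This is one possible reading of those criteria, not the authors' procedure: the `Candidate` class, its field names, the `screen` function, and the three sample candidates are hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A resource found during search. All field names are hypothetical."""
    name: str
    publicly_accessible: bool
    size: Optional[str] = None
    domain: Optional[str] = None
    license: Optional[str] = None
    paywalled_or_private: bool = False
    has_release_statement: bool = False


def screen(candidate):
    """Return (included, reason) under one reading of the stated criteria."""
    # Exclusion: private or paywalled data without a clear release statement.
    if candidate.paywalled_or_private and not candidate.has_release_statement:
        return False, "private or paywalled without a clear release statement"
    # Inclusion requires public accessibility.
    if not candidate.publicly_accessible:
        return False, "not publicly accessible"
    # Inclusion also requires documented size, domain, and licensing.
    missing = [f for f in ("size", "domain", "license")
               if getattr(candidate, f) is None]
    if missing:
        return False, "undocumented: " + ", ".join(missing)
    return True, "meets inclusion criteria"


# Invented candidates that exercise each branch; none is a real dataset.
ok = Candidate("placeholder-open", True, size="10k sentences",
               domain="news", license="CC-BY-4.0")
blocked = Candidate("placeholder-paywalled", False, paywalled_or_private=True)
undocumented = Candidate("placeholder-no-license", True,
                         size="5 hours", domain="read speech")

results = {c.name: screen(c) for c in (ok, blocked, undocumented)}
```

Recording the reason alongside each decision gives a reader an audit trail for why a resource was left out of the catalog, which is the verification the referee asks for.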
Circularity Check
No circularity: purely descriptive survey of external resources
The paper is a catalog and gap analysis of publicly hosted text and speech datasets for Hausa and Fongbe. It contains no equations, no fitted parameters, no predictions, no uniqueness theorems, and no derivation chain. All claims reduce to enumeration of external repositories (Masakhane, academic platforms, web sources) whose existence and properties are independently verifiable outside the paper. The search methodology is a standard literature-review step, not a self-referential construction. No patterns from the seven enumerated kinds are present.