arxiv: 2605.22828 · v1 · pith:LSSTBSOVnew · submitted 2026-04-13 · 💻 cs.CL

A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development

Pith reviewed 2026-05-25 00:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords Hausa · Fongbe · NLP resources · text corpora · speech datasets · resource survey · West African languages · language technology gaps

The pith

Hausa has broader text resource diversity than Fongbe while both languages show specific gaps in speech and domain coverage for NLP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper catalogs publicly available text and speech resources for Hausa and Fongbe through a systematic search of repositories and web sources. It documents that Hausa draws from news, encyclopedic, and educational domains with greater variety, whereas Fongbe has fewer text options but benefits from recent academic speech collections. Both languages appear in shared benchmarks for named entity recognition and part-of-speech tagging. The survey supplies task-specific recommendations and flags priority gaps such as domain-diverse Fongbe text and dedicated Hausa speech corpora. A sympathetic reader would care because these details show exactly where new data collection can advance language technology for millions of speakers.

Core claim

The survey finds that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains, while Fongbe has more limited text resources but has been the focus of recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. The catalog records size, domain coverage, format, licensing, and accessibility for parallel corpora, monolingual text, speech datasets, pre-trained models, and evaluation benchmarks, and these records lead to concrete recommendations on the remaining gaps.

What carries the argument

The systematic catalog of resources by size, domain coverage, format, licensing, and accessibility, which establishes the contrast in availability and identifies the priority gaps.
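To make the catalog idea concrete, here is a minimal Python sketch of how one row could be represented and how a domain gap could be read off it. The `ResourceRecord` class, its field names, and the helper functions are assumptions made for illustration; the paper does not publish a schema, and the rows below are placeholders, not entries from the survey.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """One catalog row; field names are illustrative, not the paper's schema."""
    name: str
    language: str        # e.g. "hausa" or "fongbe"
    resource_type: str   # parallel, monolingual, speech, model, benchmark
    size: str            # free text, since units differ (sentences, hours, ...)
    domains: frozenset   # e.g. frozenset({"news"})
    data_format: str
    license: str
    accessible: bool


def domain_coverage(records, resource_type):
    """Map each language to the domains covered by accessible resources of one type."""
    coverage = defaultdict(set)
    for record in records:
        if record.resource_type == resource_type and record.accessible:
            coverage[record.language] |= record.domains
    return dict(coverage)


def missing_domains(records, resource_type, language, wanted):
    """Return the domains in `wanted` that no accessible resource of this type covers."""
    covered = domain_coverage(records, resource_type).get(language, set())
    return set(wanted) - covered


# Placeholder rows for illustration only; they are not entries from the survey.
PLACEHOLDER_RECORDS = [
    ResourceRecord("placeholder-a", "hausa", "monolingual", "unknown",
                   frozenset({"news", "encyclopedic"}), "txt", "unknown", True),
    ResourceRecord("placeholder-b", "hausa", "monolingual", "unknown",
                   frozenset({"educational"}), "jsonl", "unknown", True),
    ResourceRecord("placeholder-c", "fongbe", "monolingual", "unknown",
                   frozenset({"news"}), "txt", "unknown", True),
]
```

Calling `missing_domains(PLACEHOLDER_RECORDS, "monolingual", "fongbe", {"news", "educational"})` on these placeholder rows reports the uncovered domain. A gap ranking of the kind the survey describes could be built on a comparison like this, though the paper does not say how its priorities were actually derived.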

If this is right

  • NLP developers can target domain-diverse Fongbe text collection as a high-priority next step.
  • Dedicated Hausa speech corpora should be created to match the existing text resources.
  • Existing Masakhane benchmarks for NER and POS tagging can serve as starting points for further work on both languages.
  • Task-specific recommendations guide where to allocate effort in building models and evaluation sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Closing the identified gaps could enable more robust machine translation or information extraction tools that serve Hausa and Fongbe speakers directly.
  • The pattern of one language having stronger text coverage and the other stronger speech data may appear in surveys of additional low-resource languages.
  • Public release of the catalog itself could reduce duplication in future data collection projects.

Load-bearing premise

The systematic search of academic repositories, data platforms, and web sources captured a sufficiently complete and up-to-date picture of all publicly available resources without major omissions.

What would settle it

The discovery of a previously unlisted large Fongbe text collection spanning multiple domains or a substantial public Hausa speech corpus that would change which gaps rank as highest priority.

Original abstract

This survey provides a comprehensive catalog of publicly available text and speech resources for two West African languages: Hausa, an Afroasiatic language with approximately 80-100 million speakers, and Fongbe, a Niger-Congo language spoken by approximately 2 million people in Benin. These languages represent contrasting cases on the resource availability spectrum. We address the question: "What is the current state of publicly available NLP resources for Hausa and Fongbe, and what gaps remain?" Through systematic search of academic repositories, data platforms, and web sources, we catalog parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks. For each resource, we document size, domain coverage, format, licensing, and accessibility. Our findings reveal that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains. Fongbe, while having more limited text resources, has been the focus of recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. We provide task-specific recommendations and identify priority gaps including domain-diverse Fongbe text and dedicated Hausa speech corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper surveys publicly available text and speech resources for Hausa (Afroasiatic, ~80-100M speakers) and Fongbe (Niger-Congo, ~2M speakers) via a systematic search of academic repositories, data platforms, and web sources. It catalogs parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks, documenting size, domain, format, licensing, and accessibility for each. Key findings include broader text diversity for Hausa across news/encyclopedic/educational domains, more limited Fongbe text but recent academic speech collections, representation of both in Masakhane NER/POS benchmarks, and priority gaps in domain-diverse Fongbe text and dedicated Hausa speech corpora.

Significance. If the catalog proves complete and replicable, the survey would provide a useful baseline reference for NLP development in these West African languages, clarifying availability contrasts and directing data collection priorities in low-resource settings.

major comments (1)
  1. [Methods] Methods section: the description of the 'systematic search' provides no exact search strings, search date(s), inclusion/exclusion criteria, or platform-specific query details. This omission prevents verification of completeness and directly undermines the central claims about relative resource diversity, limitations, and priority gaps between Hausa and Fongbe.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater methodological transparency. We address the single major comment below and will incorporate the requested details in a revised manuscript.

Point-by-point responses
  1. Referee: [Methods] Methods section: the description of the 'systematic search' provides no exact search strings, search date(s), inclusion/exclusion criteria, or platform-specific query details. This omission prevents verification of completeness and directly undermines the central claims about relative resource diversity, limitations, and priority gaps between Hausa and Fongbe.

    Authors: We agree that the Methods section requires additional detail to support replicability and verification of our claims. In the revision we will add: (1) the exact search strings and Boolean combinations used on each platform (Google Scholar, ACL Anthology, arXiv, Hugging Face Datasets, Masakhane repositories, and language-specific web sources); (2) the date ranges of all searches (e.g., searches performed between January and March 2024); (3) explicit inclusion/exclusion criteria (publicly accessible resources with documented size, domain, and licensing; exclusion of private or paywalled data without clear release statements); and (4) platform-specific query examples. These additions will directly enable readers to assess the completeness of the catalog and the robustness of the identified gaps between Hausa and Fongbe resources. revision: yes
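The additions promised in the rebuttal (Boolean search strings and explicit inclusion/exclusion criteria) could be recorded in executable form so that readers can rerun them. The following Python sketch shows one way to do that. The function names, the dictionary keys, and the reading that non-public data qualifies only when a clear release statement exists are all assumptions; the paper's actual queries and criteria are not given in the material reviewed here.

```python
from itertools import product


def build_queries(language_aliases, resource_terms):
    """Pair every language alias group with every resource term as a Boolean query."""
    queries = []
    for aliases, term in product(language_aliases, resource_terms):
        alias_clause = " OR ".join(f'"{alias}"' for alias in aliases)
        queries.append(f'({alias_clause}) AND "{term}"')
    return queries


def is_included(candidate):
    """Apply one reading of the inclusion/exclusion rule stated in the rebuttal.

    `candidate` is a dict with hypothetical keys: public, size, domain,
    license, release_statement.
    """
    # Inclusion requires documented size, domain, and licensing.
    documented = all(candidate.get(key) for key in ("size", "domain", "license"))
    if candidate.get("public"):
        return documented
    # Assumption: private or paywalled data qualifies only with a clear release statement.
    return documented and bool(candidate.get("release_statement"))


# Hypothetical alias groups and search term, used only to show the query shape.
EXAMPLE_QUERIES = build_queries([("Hausa",), ("Fongbe", "Fon")], ["speech corpus"])
```

Logging the output of `build_queries` together with the search date for each platform would supply the verification trail the referee asks for. The filter makes the exclusion rule explicit enough that a reader can contest an individual decision.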

Circularity Check

0 steps flagged

No circularity: purely descriptive survey of external resources

Full rationale

The paper is a catalog and gap analysis of publicly hosted text and speech datasets for Hausa and Fongbe. It contains no equations, no fitted parameters, no predictions, no uniqueness theorems, and no derivation chain. All claims reduce to enumeration of external repositories (Masakhane, academic platforms, web sources) whose existence and properties are independently verifiable outside the paper. The search methodology is a standard literature-review step, not a self-referential construction. No patterns from the seven enumerated kinds are present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no mathematical modeling, fitted parameters, or theoretical derivations; the contribution rests on the completeness of the external search rather than any internal axioms or invented constructs.
