A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development

Mahounan Pericles Adjovi; Prasenjit Mitra; Roald Eiselen; Victor Olufemi

arxiv: 2605.22828 · v1 · pith:LSSTBSOVnew · submitted 2026-04-13 · 💻 cs.CL


Pith reviewed 2026-05-25 00:29 UTC · model grok-4.3


keywords: Hausa, Fongbe, NLP resources, text corpora, speech datasets, resource survey, West African languages, language technology gaps


The pith

Hausa has broader text resource diversity than Fongbe while both languages show specific gaps in speech and domain coverage for NLP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper catalogs publicly available text and speech resources for Hausa and Fongbe through a systematic search of repositories and web sources. It documents that Hausa draws from news, encyclopedic, and educational domains with greater variety, whereas Fongbe has fewer text options but benefits from recent academic speech collections. Both languages appear in shared benchmarks for named entity recognition and part-of-speech tagging. The survey supplies task-specific recommendations and flags priority gaps such as domain-diverse Fongbe text and dedicated Hausa speech corpora. A sympathetic reader would care because these details show exactly where new data collection can advance language technology for millions of speakers.

Core claim

The survey finds that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains while Fongbe has more limited text resources but recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. The catalog records size, domain coverage, format, licensing, and accessibility for parallel corpora, monolingual text, speech datasets, pre-trained models, and evaluation benchmarks, leading to concrete recommendations on remaining gaps.

What carries the argument

A systematic catalog of resources, recorded by size, domain coverage, format, licensing, and accessibility, establishes the contrast in availability and identifies the priority gaps.
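To make the catalog idea concrete, the following is a minimal Python sketch of how one catalog row might be represented and how domain coverage per language could be tallied. The class name, the field names, and every row in the placeholder list are assumptions made for illustration; they are not the paper's schema or its data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceRecord:
    """One catalog row. Field names are illustrative, not the paper's schema."""
    name: str
    language: str            # e.g. "Hausa" or "Fongbe"
    modality: str            # "text" or "speech"
    domains: tuple           # e.g. ("news",)
    size: Optional[str]      # free-text size, None if undocumented
    fmt: Optional[str]       # file format, None if undocumented
    license: Optional[str]   # license name, None if undocumented
    accessible: bool         # True if publicly downloadable


def domains_by_language(records, modality):
    """Collect the set of domains covered per language for one modality."""
    coverage = defaultdict(set)
    for rec in records:
        if rec.modality == modality:
            coverage[rec.language].update(rec.domains)
    return dict(coverage)


# Placeholder rows invented for illustration only (NOT entries from the survey).
PLACEHOLDER_CATALOG = [
    ResourceRecord("placeholder-a", "Hausa", "text", ("news",),
                   "n/a", "txt", "CC-BY-4.0", True),
    ResourceRecord("placeholder-b", "Hausa", "text", ("encyclopedic", "educational"),
                   None, "jsonl", None, True),
    ResourceRecord("placeholder-c", "Fongbe", "text", ("dialogue",),
                   "n/a", "csv", "CC-BY-4.0", True),
    ResourceRecord("placeholder-d", "Fongbe", "speech", ("read speech",),
                   None, "wav", None, False),
]

coverage = domains_by_language(PLACEHOLDER_CATALOG, "text")
for language, domains in sorted(coverage.items()):
    print(language, sorted(domains))
```

Keeping undocumented fields as `None` rather than guessing a value makes gaps in the metadata visible, which is the kind of contrast the survey's comparison depends on.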

If this is right

NLP developers can target domain-diverse Fongbe text collection as a high-priority next step.
Dedicated Hausa speech corpora should be created to match the existing text resources.
Existing Masakhane benchmarks for NER and POS tagging can serve as starting points for further work on both languages.
Task-specific recommendations guide where to allocate effort in building models and evaluation sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Closing the identified gaps could enable more robust machine translation or information extraction tools that serve Hausa and Fongbe speakers directly.
The pattern of one language having broader text coverage while the other attracts more recent speech data collection may appear in surveys of additional low-resource languages.
Public release of the catalog itself could reduce duplication in future data collection projects.

Load-bearing premise

The systematic search of academic repositories, data platforms, and web sources captured a sufficiently complete and up-to-date picture of all publicly available resources without major omissions.

What would settle it

The discovery of a previously unlisted large Fongbe text collection spanning multiple domains or a substantial public Hausa speech corpus that would change which gaps rank as highest priority.

Original abstract

This survey provides a comprehensive catalog of publicly available text and speech resources for two West African languages: Hausa, an Afroasiatic language with approximately 80-100 million speakers, and Fongbe, a Niger-Congo language spoken by approximately 2 million people in Benin. These languages represent contrasting cases on the resource availability spectrum. We address the question: What is the current state of publicly available NLP resources for Hausa and Fongbe, and what gaps remain? Through systematic search of academic repositories, data platforms, and web sources, we catalog parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks. For each resource, we document size, domain coverage, format, licensing, and accessibility. Our findings reveal that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains. Fongbe, while having more limited text resources, has been the focus of recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. We provide task-specific recommendations and identify priority gaps including domain-diverse Fongbe text and dedicated Hausa speech corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward resource catalog for Hausa and Fongbe that contrasts their availability but rests on an undocumented search.


The paper pulls together publicly available text and speech resources for Hausa and Fongbe and notes the differences between them. Hausa has more text variety across news, encyclopedic, and educational sources while Fongbe has thinner text coverage but some recent academic speech collections. Both show up in Masakhane benchmarks for NER and POS tagging, and the authors flag needs like domain-diverse Fongbe text and dedicated Hausa speech data. They also list sizes, domains, licenses, and access details for the items they found.

That side-by-side view and the task recommendations are the practical parts. Someone starting work on these languages gets a single starting point instead of hunting across sites. The soft spot is the search itself. The abstract describes a systematic look at academic repositories, data platforms, and web sources, but without search strings, dates, inclusion rules, or verification steps, there is no way to check whether the catalog is complete or whether the reported gaps would hold if a few more resources turned up. That is the part the main claims depend on.

This kind of survey is aimed at people doing NLP on West African or other low-resource languages who need an organized overview before planning new data collection. It does not introduce new methods or results. I would send it for peer review so the authors can add the missing method details and make the inventory verifiable.

Referee Report

1 major / 0 minor

Summary. The paper surveys publicly available text and speech resources for Hausa (Afroasiatic, ~80-100M speakers) and Fongbe (Niger-Congo, ~2M speakers) via a systematic search of academic repositories, data platforms, and web sources. It catalogs parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks, documenting size, domain, format, licensing, and accessibility for each. Key findings include broader text diversity for Hausa across news/encyclopedic/educational domains, more limited Fongbe text but recent academic speech collections, representation of both in Masakhane NER/POS benchmarks, and priority gaps in domain-diverse Fongbe text and dedicated Hausa speech corpora.

Significance. If the catalog proves complete and replicable, the survey would provide a useful baseline reference for NLP development in these West African languages, clarifying availability contrasts and directing data collection priorities in low-resource settings.

major comments (1)

[Methods] Methods section: the description of the 'systematic search' provides no exact search strings, search date(s), inclusion/exclusion criteria, or platform-specific query details. This omission prevents verification of completeness and directly undermines the central claims about relative resource diversity, limitations, and priority gaps between Hausa and Fongbe.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater methodological transparency. We address the single major comment below and will incorporate the requested details in a revised manuscript.


Referee: [Methods] Methods section: the description of the 'systematic search' provides no exact search strings, search date(s), inclusion/exclusion criteria, or platform-specific query details. This omission prevents verification of completeness and directly undermines the central claims about relative resource diversity, limitations, and priority gaps between Hausa and Fongbe.

Authors: We agree that the Methods section requires additional detail to support replicability and verification of our claims. In the revision we will add: (1) the exact search strings and Boolean combinations used on each platform (Google Scholar, ACL Anthology, arXiv, Hugging Face Datasets, Masakhane repositories, and language-specific web sources); (2) the date ranges of all searches (e.g., searches performed between January and March 2024); (3) explicit inclusion/exclusion criteria (publicly accessible resources with documented size, domain, and licensing; exclusion of private or paywalled data without clear release statements); and (4) platform-specific query examples. These additions will directly enable readers to assess the completeness of the catalog and the robustness of the identified gaps between Hausa and Fongbe resources.

Revision: yes
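The protocol details promised in this response could be recorded in a form like the sketch below: a log entry per executed query, plus a screening function that applies the stated inclusion and exclusion criteria. This is only one possible reading of those criteria. The class names, field names, the example query string, and the placeholder date are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SearchLogEntry:
    """One executed query, with enough detail for a reader to rerun it."""
    platform: str
    query: str
    run_on: date


@dataclass(frozen=True)
class Candidate:
    """A resource surfaced by a search, before screening."""
    name: str
    publicly_accessible: bool
    size: Optional[str] = None
    domain: Optional[str] = None
    license: Optional[str] = None
    restricted: bool = False             # private or paywalled
    has_release_statement: bool = False  # a clear statement that data is released


def screen(candidate):
    """Return (included, reason) under one reading of the stated criteria."""
    if candidate.restricted and not candidate.has_release_statement:
        return False, "private or paywalled without a clear release statement"
    if not candidate.publicly_accessible:
        return False, "not publicly accessible"
    missing = [field for field in ("size", "domain", "license")
               if getattr(candidate, field) is None]
    if missing:
        return False, "undocumented: " + ", ".join(missing)
    return True, "meets all criteria"


# Hypothetical log entry: the query text and date are placeholders.
log = [SearchLogEntry("ACL Anthology", '"Fongbe" AND ("corpus" OR "dataset")',
                      date(2000, 1, 1))]

# Hypothetical candidates, invented to exercise each branch of screen().
candidates = [
    Candidate("placeholder-open", True, "n/a", "news", "CC-BY-4.0"),
    Candidate("placeholder-no-license", True, "n/a", "news", None),
    Candidate("placeholder-paywalled", False, "n/a", "news", "proprietary",
              restricted=True),
]

for cand in candidates:
    included, reason = screen(cand)
    print(cand.name, included, reason)
```

Returning a reason alongside the decision means excluded resources can be listed in an appendix, which would let readers judge whether a borderline exclusion affects the reported gaps.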

Circularity Check

0 steps flagged

No circularity: purely descriptive survey of external resources


The paper is a catalog and gap analysis of publicly hosted text and speech datasets for Hausa and Fongbe. It contains no equations, no fitted parameters, no predictions, no uniqueness theorems, and no derivation chain. All claims reduce to enumeration of external repositories (Masakhane, academic platforms, web sources) whose existence and properties are independently verifiable outside the paper. The search methodology is a standard literature-review step, not a self-referential construction. No patterns from the seven enumerated kinds are present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no mathematical modeling, fitted parameters, or theoretical derivations; the contribution rests on the completeness of the external search rather than any internal axioms or invented constructs.


[1] [1]

MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition,

D. I. Adelani, G. Neubig, S. Ruder, S. Rijhwani, M. Beukman, C. Palen-Michel, C. Lignos, J. O. Alabi, S. H. Muhammad, P. Nabende, C. M. B. Dione, A. Bukula, R. Mabuya, B. F. P. Dossou, B. Sibanda, H. Buzaaba, J. Mukiibi, G. Kalipe, D. Mbaye, A. Taylor, F. Kabore, C. C. Emezue, A. Aremu, P. Ogayo, C. Gitau, E. Munkoh-Buabeng, V. Memdjokam Koagne, A. A. T...

work page 2022

[2] [2]

MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages,

C. M. B. Dione, D. I. Adelani, P. Nabende, J. Alabi, T. Sindane, H. Buzaaba, S. H. Muhammad, C. C. Emezue, P. Ogayo, A. Aremu, C. Gitau, D. Mbaye, J. Mukiibi, B. Sibanda, B. F. P. Dossou, A. Bukula, R. Mabuya, A. A. Tapo, E. Munkoh-Buabeng, V. Memdjokam Koagne, F. O. Kabore, A. Taylor, G. Kalipe, T. Macucwa, V. Marivate, T. Gwadabe, M. T. Elvis, I. ...

work page 2023

[3] [3]

K. Ogueji, Y. Zhu, and J. Lin, “Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages,” in Proc. 1st Workshop Multilingual Representation Learning (MRL), Punta Cana, Dominican Republic, Nov. 2021, pp. 116–126. [Online]. Available: https://aclanthology.org/2021.mrl-1.11/

[4] [4]

L. Laleye, F. Biao, E. Gauthier, and L. Besacier, “FFSTC2: Fongbe–French Speech Translation Corpus,” Hugging Face Datasets, 2025. [Online]. Available: https://huggingface.co/datasets/GbeBenin/FFSTC-2

[5] [5]

A. Tonja, B. F. P. Dossou, D. I. Adelani, C. C. Emezue, and others, “InkubaLM: A Small Language Model for Low-resource African Languages,” arXiv preprint arXiv:2408.17024, 2024. [Online]. Available: https://arxiv.org/abs/2408.17024

[6] [6]

D. Goldhahn, T. Eckart, and U. Quasthoff, “Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages,” in Proc. 8th Int. Conf. Language Resources Evaluation (LREC), Istanbul, Turkey, May 2012, pp. 759–765. [Online]. Available: https://aclanthology.org/L12-1154/

[7] [7]

B. F. P. Dossou and C. C. Emezue, “Fon-French Daily Dialogues Parallel Corpus,” Zenodo, 2021. doi: 10.5281/zenodo.4432712. [Online]. Available: https://zenodo.org/records/4432712

[8] [8]

L. Laleye, “pyFongbe: Fongbe ASR Resources,” GitHub Repository, [Online]. Available: https://github.com/laleye/pyFongbe

[10] [10]

G. Kenneth, “English-Hausa Parallel Corpus,” Kaggle, 2020. [Online]. Available: https://www.kaggle.com/datasets/gigikenneth/englishhausa-corpus

[11] [11]

W. Nekoto, V. Marivate, T. Matsila, T. Fasubaa, T. Kolawole, T. Fagbohun, S. Akinola, S. H. Muhammad, S. Kabongo, S. Osei, F. Sackey, B. F. P. Dossou, and others, “Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages,” in Findings Assoc. Comput. Linguistics (EMNLP), Online, Nov. 2020, pp. 2144– [Online]. Available: https://aclanthology.org/2020.findings-emnlp.195/

[13] [13]

K. Siminyu, G. Kalipe, D. Orlic, J. Abbott, V. Marivate, S. Freshia, P. Sibal, B. Neupane, D. I. Adelani, A. Taylor, J. T. Ali, K. Degila, M. Balogoun, T. I. Diop, D. David, C. Fourati, H. Haddad, and M. Naski, “AI4D – African Language Program,” in Proc. 2nd Workshop on AfricaNLP (AfricaNLP@EACL), Online, Apr. 2021. [Online]. Available: https://arxiv.org/...

[14] [15]

[Online]. Available: https://arxiv.org/abs/2003.11529

[15] [16]

M. A. Hedderich, L. Lange, H. Adel, J. Ströbe, and D. Klakow, “A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios,” in Proc. 2021 Conf. North American Chapter Assoc. Comput. Linguistics (NAACL), Online, Jun. 2021, pp. 2545–2568. [Online]. Available: https://aclanthology.org/2021.naacl-main.201/

[16] [17]

P. Joshi, S. Santy, A. Buber, B. Bali, and M. Choudhury, “The State and Fate of Linguistic Diversity and Inclusion in the NLP World,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics (ACL), Online, Jul. 2020, pp. 6282–6293. [Online]. Available: https://aclanthology.org/2020.acl-main.560/

[17] [18]

Lanfrica Labs, “Lanfrica: Discover African Language Resources,” 2024. [Online]. Available: https://lanfrica.com

[18] [19]

A. M. Naira, I. Benelallam, A. Allak, and K. Gaanoun, “A Review on NLP Approaches for African Languages and Dialects,” in Advances in Science, Technology and Innovation, Springer, Cham, 2024. doi: 10.1007/978-3-031-46849-0_23. [Online]. Available: https://doi.org/10.1007/978-3-031-46849-0_23

[19] [20]

J. Abate and F. Rashid, “A Review of Sentiment Analysis for Afaan Oromo: Current Trends and Future Perspectives,” Natural Language Processing Journal, vol. 6, p. 100051, Mar. 2024. doi: 10.1016/j.nlp.2023.100051. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2949719123000481

[20] [21]

NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, and others, “No Language Left Behind: Scaling Human-Centered Machine Translation,” arXiv preprint arXiv:2207.04672, 2022. [Online]. Available: https://arxiv.org/abs/2207.04672

[21] [22]

D. I. Adelani, J. Alabi, A. Fan, J. Kreutzer, and others, “A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation,” in Proc. 2022 Conf. North American Chapter Assoc. Comput. Linguistics (NAACL), Seattle, WA, USA, Jul. 2022, pp. 3053– [Online]. Available: https://aclanthology.org/2022.naacl-main.223/

[23] [24]

C. C. Emezue and B. F. P. Dossou, “MMTAfrica: Multilingual Machine Translation for African Languages,” in Proc. 6th Conf. Machine Translation (WMT), Online, Nov. 2021, pp. 398–411. [Online]. Available: https://aclanthology.org/2021.wmt-1.48

[24] [25]

B. F. P. Dossou and C. C. Emezue, “Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language,” in Proc. AfricaNLP Workshop, EACL, Online, Apr [Online]. Available: https://arxiv.org/abs/2103.08052

[26] [27]

A. W. Black, “CMU Wilderness Multilingual Speech Dataset,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process. (ICASSP), Brighton, UK, May 2019, pp. 5971–5975. doi: 10.1109/ICASSP.2019.8683536. [Online]. Available: https://ieeexplore.ieee.org/document/8683536

[27] [28]

H. Hammarström, R. Forkel, M. Haspelmath, and S. Bank, “Glottolog 4.8,” Max Planck Institute for Evolutionary Anthropology, Leipzig, 2023. doi: 10.5281/zenodo.8131084. [Online]. Available: https://glottolog.org

[28] [29]

D. M. Eberhard, G. F. Simons, and C. D. Fennig, Eds., Ethnologue: Languages of the World, 26th ed. Dallas, TX: SIL International, 2023. [Online]. Available: https://www.ethnologue.com

[29] [30]

T. Oladipo, A. Adeyemi, T. Ahia, and A. Ajayi, “NaijaWeb: A Large-Scale Nigerian Web Corpus,” Hugging Face Datasets, [Online]. Available: https://huggingface.co/collections/saheedniyi/naijaweb-datasets

[31] [32]

G. Wenzek, M. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave, “CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data,” in Proc. 12th Int. Conf. Language Resources Evaluation (LREC), Marseille, France, May 2020, pp. 4003– [Online]. Available: https://aclanthology.org/2020.lrec-1.494/

[33] [34]

D. Varab and N. Schluter, “MassiveSumm: A Very Large-Scale, Very Multilingual, News Summarisation Dataset,” in Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 10150–10161. doi: 10.18653/v1/2021.emnlp-main.797. [Online]. Available: https://aclanthology.org/2021.emnlp-main.797/

[34] [35]

M. A. Hedderich, D. I. Adelani, D. Zhu, J. Alabi, U. Markus, and D. Klakow, “Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages,” in Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), Online, Nov. 2020, pp. 2580–2591. [Online]. Available: https://aclanthology.org/2020.emnlp-main.204/

[35] [36]

A. Bhattacharjee, T. Hasan, W. Ahmad, K. Yuan, and R. Haque, “CrossSum: Beyond English-Centric Cross-Lingual Abstractive Summarization for 1500+ Language Pairs,” in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics (ACL), Toronto, Canada, Jul. 2023. [Online]. Available: https://aclanthology.org/2023.acl-long.143/

[36] [37]

B. F. P. Dossou and M. Sabry, “AfriVEC: Word Embedding Models for African Languages. Case Study of Fon and Nobiin,” in Proc. AfricaNLP Workshop, EACL, Online, Apr. 2021. [Online]. Available: https://arxiv.org/abs/2103.05132

[37] [38]

E. Gauthier, L. Besacier, S. Voisin, M. Melese, and U. P. Elingui, “Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof,” in Proc. 10th Int. Conf. Language Resources Evaluation (LREC), Portorož, Slovenia, May 2016, pp. 3863–3867. [Online]. Available: https://aclanthology.org/L16-1611/

[38] [39]

CLEAR Global, “Gamayun Language Resources,” Translators Without Borders, 2021. [Online]. Available: https://gamayun.translatorswb.org

[39] [40]

U. B. Umar, “Hausa-English Code-Switched Dataset,” Mendeley Data, vol. 1, 2024. doi: 10.17632/3xjyjsf4sb.1. [Online]. Available: https://data.mendeley.com/datasets/3xjyjsf4sb/1

[40] [41]

T. Hasan, A. Bhattacharjee, W. Ahmad, and others, “XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages,” in Findings Assoc. Comput. Linguistics (ACL-IJCNLP), Online, Aug. 2021, pp. 4693–4703. [Online]. Available: https://aclanthology.org/2021.findings-acl.413/

[41] [42]

S. H. Muhammad, D. I. Adelani, I. Abdulmumin, and others, “NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis,” in Proc. 13th Int. Conf. Language Resources Evaluation (LREC), Marseille, France, Jun. 2022, pp. 590–602. [Online]. Available: https://aclanthology.org/2022.lrec-1.63/

[42] [43]

J. Tiedemann, “Parallel Data, Tools and Interfaces in OPUS,” in Proc. 8th Int. Conf. Language Resources Evaluation (LREC), Istanbul, Turkey, May 2012, pp. 2214–2218. [Online]. Available: https://aclanthology.org/L12-1246/

[43] [44]

I. Abdulmumin, S. R. Dash, M. A. Dawud, S. Parida, S. H. Muhammad, I. S. Ahmad, S. Panda, O. Bojar, B. S. Galadanci, and B. S. Bello, “Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation,” in Proc. 13th Lang. Resources Evaluation Conf. (LREC), Marseille, France, Jun. 2022, pp. 6471–6479. doi: 10.18653/v1/2022.lrec-1.694....

[44] [45]

A. Anastasopoulos, A. Cattelan, Z. Dou, and others, “TICO-19: The Translation Initiative for COVID-19,” in Proc. NLP-COVID19 Workshop, EMNLP, Online, Nov. 2020. [Online]. Available: https://aclanthology.org/2020.nlpcovid19-2.5/

[45] [46]

S. Gehrmann, T. Adewumi, K. Agber, and others, “The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics,” in Proc. GEM Workshop, ACL, Online, Aug. 2021, pp. 96–120. [Online]. Available: https://aclanthology.org/2021.gem-1.10/

[46] [47]

J. Oladipo, D. I. Adelani, A. Ahia, and others, “Better Quality Pre-training Data and T5 Models for African Languages,” in Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), Singapore, Dec [Online]. Available: https://aclanthology.org/2023.emnlp-main.11/

[48] [49]

O. Ogundepo, X. Zhang, S. Sun, K. Duh, and J. Lin, “AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages,” in Proc. 2022 Conf. Empirical Methods Natural Language Process. (EMNLP), Abu Dhabi, UAE, Dec. 2022, pp. 8721–8728. doi: 10.18653/v1/2022.emnlp-main.597. [Online]. Available: https://aclanthology.org/2022.emnlp-main.597/

[49] [50]

S. H. Muhammad, I. Abdulmumin, S. M. Yimam, D. I. Adelani, and others, “AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages,” in Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), Singapore, Dec. 2023. [Online]. Available: https://aclanthology.org/2023.emnlp-main.862/

[50] [51]

I. Shode, D. I. Adelani, J. Peng, and A. Feldman, “NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification,” in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics (ACL), Volume 2: Short Papers, Toronto, Canada, Jul. 2023, pp. 986–998. doi: 10.18653/v1/2023.acl-short.85. [Online]. Available: https://ac...

[51] [52]

S. H. Muhammad, N. Ousidhoum, I. Abdulmumin, J. P. Wahle, T. Ruas, M. Beloucif, C. de Kock, N. Surange, D. Teodorescu, I. S. Ahmad, D. I. Adelani, A. F. Aji, and others, “BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages,” arXiv preprint arXiv:2502.11926, 2025. [Online]. Available: https://arxiv.org/abs/...

[52] [53]

J. Meyer, D. I. Adelani, E. Casanova, A. Öktem, D. Whitenack, J. Weber, S. Kabongo Kabenamualu, E. Salesky, I. Orife, C. Leong, and others, “BibleTTS: A Large, High-Fidelity, Multilingual, and Uniquely African Speech Corpus,” in Proc. Interspeech, Incheon, South Korea, Sep. 2022, pp. 2383–2387. doi: 10.21437/Interspeech.2022-10850. [Online]. Available: h...

[53] [54]

ML Commons, “Multilingual Spoken Words Corpus,” 2022. [Online]. Available: https://mlcommons.org/en/multilingual-spoken-words/

[54] [55]

N. Kim, G. Lee, and others, “BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages,” arXiv preprint arXiv:2406.09948, 2024. [Online]. Available: https://arxiv.org/abs/2406.09948

[55] [56]

T. A. Chang, C. Arnett, A. Eldesokey, and others (335 authors), “Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures,” arXiv preprint arXiv:2510.24081, 2025. [Online]. Available: https://arxiv.org/abs/2510.24081

[56] [57]

D. I. Adelani, J. Ojo, I. A. Azime, J. Y. Zhuang, J. O. Alabi, X. He, M. Ochieng, S. Hooker, S. H. Muhammad, and others, “IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models,” in Proc. 2025 Conf. Nations Americas Chapter ACL (NAACL), Albuquerque, NM, USA, Apr. 2025, pp. 2732– doi: 10.18653/v1/2025.naacl-long.139. [Online]. Available: https://aclanthology.org/2025.naacl-long.139/
