The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Vukosi Marivate

arxiv: 2605.19066 · v1 · pith:F6HIZFQKnew · submitted 2026-05-18 · 💻 cs.CL


Pith reviewed 2026-05-20 10:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords: annotation scarcity paradox · low-resource NLP · evaluation validity · data sovereignty · participatory evaluation · ghost work · multilingual models


The pith

Low-resource NLP model scaling has outpaced the human expertise and infrastructure needed for authentic evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys low-resource NLP evaluation from 2014 onward and identifies a core tension: technical tools for building and benchmarking models have grown quickly, yet the specialized human judgment required to assess complex systems remains limited and unevenly available. This mismatch defines the Annotation Scarcity Paradox. A reader would care because it suggests that many claims of progress in multilingual and low-resource settings rest on evaluations that lack sufficient depth or fairness. The survey reviews earlier optimistic phases, later benchmark-focused approaches, and current generative challenges, then examines responses such as participatory curation and data-efficient methods. It concludes by urging a move toward community-rooted practices that respect data sovereignty.

Core claim

The central claim is the Annotation Scarcity Paradox, defined as the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. Tracing low-resource NLP evaluation across phases of early heuristic optimism, illusions of top-down benchmark scaling, and generative bottlenecks, and examining extractive data pipelines, undercompensated ghost work, and language data flaring, the paper argues that this paradox threatens the epistemic validity of reported progress.

What carries the argument

The Annotation Scarcity Paradox, which captures the mismatch between rapid model and benchmark scaling and the strained, inequitably distributed human expertise required for authentic evaluation.

If this is right

Existing benchmarks in low-resource NLP may overestimate model capabilities because they rely on insufficiently deep or representative human judgments.
Responses such as data augmentation, model-based evaluation, and active learning carry their own equity and validity trade-offs that must be weighed carefully.
Sustainable progress requires shifting from transactional data extraction to relational, community-embedded evaluation practices grounded in epistemic governance and data sovereignty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners could test whether involving local language communities earlier in evaluation design improves alignment between benchmark scores and real-world utility.
The same mismatch between scaling capacity and human evaluation resources may surface in other AI areas that depend on large-scale human feedback.
Long-term field stability may hinge on building distributed infrastructure for annotation rather than continuing to centralize evaluation through global benchmarks.

Load-bearing premise

The sociolinguistic expertise required for authentic evaluation of generative systems is severely strained, inequitably distributed, and structurally marginalised in a manner that directly undermines the validity of existing benchmarks and reported results.

What would settle it

A side-by-side study in which native-speaker experts from low-resource language communities re-evaluate a set of existing benchmarks and produce results that closely match the original reported scores would challenge the claim that the paradox undermines epistemic validity.
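The comparison described above can be sketched in a few lines. This is a minimal illustration, not a protocol from the paper: the function name `compare_scores`, the score dictionaries, and the `tolerance` threshold for "closely match" are all assumptions introduced here.

```python
from math import sqrt


def compare_scores(reported, reevaluated, tolerance=0.05):
    """Compare reported benchmark scores with expert re-evaluation scores.

    Both arguments map a (hypothetical) benchmark name to a score in [0, 1].
    `tolerance` is an assumed threshold for "closely matching"; it is not a
    value taken from the paper.
    """
    shared = sorted(set(reported) & set(reevaluated))
    if len(shared) < 2:
        raise ValueError("need at least two benchmarks present in both inputs")
    xs = [reported[name] for name in shared]
    ys = [reevaluated[name] for name in shared]

    # Mean absolute gap between reported and re-evaluated scores.
    mean_gap = sum(abs(x - y) for x, y in zip(xs, ys)) / len(shared)

    # Pearson correlation, computed by hand to avoid extra dependencies.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    pearson = cov / sqrt(var_x * var_y) if var_x and var_y else float("nan")

    return {
        "mean_abs_gap": mean_gap,
        "pearson": pearson,
        "closely_matches": mean_gap <= tolerance,
    }
```

A small mean gap would count against the paradox's validity claim. A high correlation with a large gap would instead suggest that rankings survive expert re-evaluation while absolute scores do not.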

Figures

Figures reproduced from arXiv: 2605.19066 by Vukosi Marivate.

Original abstract

Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014–present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated "ghost work", and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses (including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning) and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper coins the Annotation Scarcity Paradox to name the mismatch between rapid model scaling and the strained human expertise needed for credible low-resource NLP evaluation.


Colleague, the main point is that this work gives a name to a tension many of us have felt in low-resource NLP: we keep scaling models and benchmarks while the people who can actually evaluate them properly are stretched thin and often sidelined. The paper traces the last decade in three phases (early heuristic optimism, top-down benchmark scaling, and the current generative bottlenecks) and uses that arc to introduce the Annotation Scarcity Paradox as the friction that arises when technical capacity outruns sovereign human infrastructure. That framing pulls together documented issues like extractive pipelines, ghost work, and data flaring into one structural story, which is the clearest new element here.

It also reviews responses such as participatory curation, model-based evaluation, and IRT-based methods, and flags the equity-validity trade-offs in each. That survey is straightforward and gives readers a map of current options.

The argument stays interpretive rather than quantitative, so the claim that the paradox threatens epistemic validity rests on the strength of the historical synthesis and examples rather than new measurements or derivations. From the abstract and stress-test, there are no internal contradictions or ungrounded leaps in the logic itself, but the absence of concrete counts or case studies leaves the central claim open to pushback on how directly the practices undermine reported results.

This is useful for researchers working on multilingual benchmarks, evaluation protocols, or community-driven data work who want a conceptual handle on why progress feels uneven. Readers already thinking about labor, sovereignty, and governance will get the most out of it. I would send it to peer review. The framing is worth community discussion and the survey is organized enough to support revisions that add more specific evidence where needed.

Referee Report

1 major / 2 minor

Summary. The paper presents a critical narrative survey of low-resource NLP evaluation from 2014 to the present, tracing three phases of development (early heuristic optimism, illusions of top-down benchmark scaling, and generative bottlenecks). It defines the Annotation Scarcity Paradox as the structural mismatch between rapid model scaling capacity and limited sovereign human infrastructure for authentic evaluation, supported by analysis of extractive data pipelines, undercompensated ghost work, and language data flaring. The authors argue this paradox undermines epistemic validity of reported progress, survey responses including data augmentation, model-based evaluation, participatory curation, and IRT/active learning methods, and advocate a paradigm shift to relational, community-embedded evaluation grounded in epistemic governance and data sovereignty.

Significance. If the interpretive synthesis holds, the work offers a timely framework for recognizing structural constraints in low-resource NLP evaluation that could encourage more equitable practices and improved validity in multilingual benchmarks. The survey of historical phases and assessment of equity-validity trade-offs in emerging methods provides a useful reference point for researchers addressing annotation challenges, though its impact would be strengthened by tighter linkages to concrete evaluation failures.

major comments (1)

[Abstract and section on the Annotation Scarcity Paradox] The central claim that the paradox 'threatens the epistemic validity of reported progress' is load-bearing but rests on interpretive synthesis of practices like ghost work; the manuscript would benefit from at least one concrete example (with citation) of a low-resource benchmark whose results have been directly questioned due to annotation infrastructure issues, to move beyond narrative to a falsifiable link.

minor comments (2)

[section tracing the three phases] The three-phase historical framing is clear but could note any overlap or counter-examples between phases to avoid implying a strictly linear progression.
[section surveying emerging responses] In the survey of responses, the equity and validity trade-offs for IRT-based and participatory methods are assessed at a high level; adding a short table summarizing the trade-offs would improve clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. The feedback highlights an opportunity to strengthen the empirical grounding of our central argument, and we have revised the paper accordingly while preserving its character as a critical narrative survey.


Referee: [Abstract and section on the Annotation Scarcity Paradox] The central claim that the paradox 'threatens the epistemic validity of reported progress' is load-bearing but rests on interpretive synthesis of practices like ghost work; the manuscript would benefit from at least one concrete example (with citation) of a low-resource benchmark whose results have been directly questioned due to annotation infrastructure issues, to move beyond narrative to a falsifiable link.

Authors: We agree that a concrete, citable example would make the load-bearing claim more directly falsifiable and would help readers connect the structural analysis to specific evaluation outcomes. While the manuscript's strength lies in its synthesis of patterns across extractive pipelines, ghost work, and data flaring, we accept that an illustrative case would improve clarity. In the revised version we have added a concise example in the section defining the Annotation Scarcity Paradox: we now reference documented concerns about annotation quality and inter-annotator agreement in the MasakhaNER benchmark for low-resource African languages, where reliance on non-expert and non-native annotators has been shown in follow-up studies to affect the reliability of reported performance metrics. We have also lightly updated the abstract to signal this addition. This revision keeps the narrative framing intact while addressing the request for a more explicit link.

revision: yes
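For readers unfamiliar with the inter-annotator agreement mentioned in the response, one common statistic is Cohen's kappa for two annotators. The sketch below is illustrative only: the function name and example labels are assumptions, and neither the page nor the abstract says which agreement measure the cited follow-up studies used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical NER-style labels from two annotators.
annotator_1 = ["PER", "ORG", "PER", "O"]
annotator_2 = ["PER", "ORG", "O", "O"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, so it is more informative than simple percent agreement when one label (such as `O`) dominates.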

Circularity Check

0 steps flagged

No significant circularity in conceptual survey


The paper is a critical narrative survey that defines the Annotation Scarcity Paradox through synthesis of external documented practices (extractive pipelines, ghost work, data flaring) across three historical phases. It draws on external literature for support and presents interpretive analysis of equity/validity trade-offs in responses such as participatory curation and IRT methods. No quantitative predictions, fitted parameters, equations, or derivations exist that could reduce to inputs by construction. The central claim relies on observed structural mismatches rather than self-referential definitions or load-bearing self-citations, rendering the argument self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

This is a conceptual survey paper whose central contribution is a new framing rather than a derivation from data or axioms; the ledger therefore contains only the high-level domain assumption used to structure the narrative.

axioms (1)

domain assumption: Low-resource NLP evaluation has evolved through three identifiable phases (early heuristic optimism, illusions of top-down benchmark scaling, and generative bottlenecks) since 2014.
This tripartite periodization organizes the entire critical narrative survey.

invented entities (1)

Annotation Scarcity Paradox (no independent evidence)
purpose: To name and conceptualize the structural mismatch between model scaling capacity and available expert human evaluation infrastructure.
This is the primary novel construct introduced by the paper.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "By examining extractive data pipelines, undercompensated ghost work, and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 3 internal anchors

[1] Jade Z Abbott and Laura Martinus. Towards neural machine translation for African languages. arXiv preprint arXiv:1811.05467, 2018.
[2] Idris Abdulmumin, Sthembiso Mkhwanazi, Mahlatse Mbooi, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Neo Putini, Miehleketo Mathebula, Matimba Shingange, Tajuddeen Gwadabe, and Vukosi Marivate. Correcting FLORES evaluation dataset for four African languages. In Barry Haddow, Tom Kocmi, Philipp Koehn, and Christof Monz, editors, Proceedings of the Nint... (2024)
[3] Seye Abimbola and Madhukar Pai. Will global health survive its decolonisation? The Lancet, 396(10263):1627–1628, 2020.
[4] Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 937–947, 2017.
[5] Ife Adebara. AI and language data flaring in Africa: Addressing the low-resource challenge. Policy Brief No. 216, 2025.
[6] David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen H Muhammad, Peter Nabende, et al. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Proce... (2022)
[7] David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, et al. IrokoBench: A new benchmark for African languages in the age of large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associatio... (2025)
[8] Željko Agić and Ivan Vulić. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, 2019.
[9] Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, et al. MEGA: Multilingual evaluation of generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, 2023.
[10] Jesujoba Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349, 2022.
[11] Jesujoba Alabi, Michael A Hedderich, David Ifeoluwa Adelani, and Dietrich Klakow. Charting the landscape of African NLP: Mapping progress and shaping the road ahead. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27795–27829, 2025.
[12] Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. In Proceedings of the twelfth language resources and evaluation conference, pages 4218–4222, 2020.
[13] Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Abrham Belete Haile, Grigori Sidorov, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, et al. The rise of AfricaNLP: Contributions, contributors, and community impact (2005-2025). arXiv preprint arXiv:2509.25477, 2025.
[14] Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604, 2018.
[15] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021.
[16] Steven Bird. Decolonising speech and language technology. In Proceedings of the 28th international conference on computational linguistics, pages 3504–3519, 2020.
[17] Abeba Birhane and Vinay Uday Prabhu. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546. IEEE, 2021.
[18] Abeba Birhane, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, and Michelle Bao. The values encoded in machine learning research. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pages 173–184, 2022.
[19] Abeba Birhane. Algorithmic colonization of Africa. SCRIPTed, 17:389, 2020.
[20] Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, 2022.
[21] Stephanie Russo Carroll, Ibrahim Garba, Oscar L Figueroa-Rodríguez, Jarita Holbrook, Raymond Lovett, Simeon Materechera, Mark Parsons, Kay Raseroka, Desi Rodriguez-Lonebear, Robyn Rowe, et al. The CARE principles for indigenous data governance. Open Scholarship Press Curated Volumes: Policy, 2023.
[22] Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. An empirical survey of data augmentation for limited data learning in NLP. Transactions of the Association for Computational Linguistics, 11:191–211, 2023.
[23] Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, et al. CulturalBench: A robust, diverse and challenging benchmark for measuring LMs’ cultural knowledge through human-AI red-teaming. In Proceedings of the 63rd Annual Meeting of the Association for Com... (2025)
[24] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8440–8451, 2020.
[25] Johanna Crane. Scrambling for Africa? Universities and global health. The Lancet, 377(9775):1388–1390, 2011.
[26] DataDotOrg. Digitisation of oral data for NLP of low-resource languages: Practical methods and processes for scalable and sustainable ecosystem development. Playbook, DataDotOrg, Washington, D.C., USA, 2026. A playbook for building sustainable African language technology ecosystems.
[27] Febe de Wet, Andiswa Bukula, Willem Karsten, Martin Puttkammer, Erwin Schillack, Rone Wierenga, and Roald Eiselen. Localising the Mozilla Common Voice platform for South Africa’s official languages. Journal of the Digital Humanities Association of Southern Africa (DHASA), 4(01), 2022.
[28] Sylvie Delacroix and Neil D Lawrence. Bottom-up data trusts: Disturbing the ‘one size fits all’ approach to data governance. International data privacy law, 9(4):236–252, 2019.
[29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.
[30] Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahadiran, Simon Mille, Ashish Shrivastava, Samson Tan, et al. NL-Augmenter: A framework for task-sensitive natural language augmentation. Northern European Journal of Language Technology, 9, 2023.
[31] David M. Eberhard, Gary F. Simons, and Charles D. Fennig. Ethnologue: Languages of the world. SIL International, 2025.
[32] Abteen Ebrahimi, Manuel Mager, Adam Wiemerslage, Pavel Denisov, Katharina Kann, et al. AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279–6299, 2022.
[33] Jake Okechukwu Effoduh. Decolonizing the governance of artificial intelligence in Africa: from normative mimicry to epistemic sovereignty. Science and Public Policy, 53(2):245–257, 2026.
[34] Roald Eiselen and Martin J Puttkammer. Developing text resources for ten South African languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3698–3703, 2014.
[35] Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. A survey of data augmentation approaches for NLP. In Findings of the association for computational linguistics: ACL-IJCNLP 2021, pages 968–988, 2021.
[36] Maria J Grant and Andrew Booth. A typology of reviews: an analysis of 14 review types and associated methodologies. Health information & libraries journal, 26(2):91–108, 2009.
[37] Jiatao Gu, Hany Hassan Awadalla, Jacob Devlin, and Victor OK Li. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354, 2018.
[38] Joseph Henrich, Steven J Heine, and Ara Norenzayan. The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3):61–83, 2010.
[39] Daniel Hershcovich, Stella Frank, Heather Lent, Miryam De Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, et al. Challenges and strategies in cross-cultural NLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6... (2022)
[40] Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International conference on machine learning, pages 4411–4421. PMLR, 2020.
[41]

Lessons from archives: Strategies for collecting sociocultural data in machine learning

Eun Seo Jo and Timnit Gebru. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 conference on fairness, accountability, and transparency , pages 306--316, 2020

work page 2020
[42]

The state and fate of linguistic diversity and inclusion in the nlp world

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the nlp world. In Proceedings of the 58th annual meeting of the association for computational linguistics , pages 6282--6293, 2020

work page 2020
[43]

Llms in the loop: Leveraging large language model annotations for active learning in low-resource languages

Nataliia Kholodna, Sahib Julka, Mohammad Khodadadi, Muhammed Nurullah Gumus, and Michael Granitzer. Llms in the loop: Leveraging large language model annotations for active learning in low-resource languages. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages 397--412. Springer, 2024

work page 2024
[44]

Practical Natural Language Processing for Low-Resource Languages

Benjamin Philip King. Practical Natural Language Processing for Low-Resource Languages . PhD thesis, University of Michigan, 2015

work page 2015
[45]

Lessons learned from a citizen science project for natural language processing

Jan-Christoph Klie, Ji-Ung Lee, Kevin Stowe, Gözde Şahin, Nafise Sadat Moosavi, Luke Bates, Dominic Petrak, Richard Eckart De Castilho, and Iryna Gurevych. Lessons learned from a citizen science project for natural language processing. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of t...

[46]

The IIT Bombay English-Hindi parallel corpus

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. The IIT Bombay English-Hindi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

[47]

Learning latent parameters without human response patterns: Item response theory with artificial crowds

John P. Lalor, Hao Wu, and Hong Yu. Learning latent parameters without human response patterns: Item response theory with artificial crowds. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), pages 4674--4684, Hong Kong, China, November 2019. Association for Computational Linguistics

[48]

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James J...

[49]

Challenges of language technologies for the indigenous languages of the Americas

Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, and Ivan Meza-Ruiz. Challenges of language technologies for the indigenous languages of the Americas. In Proceedings of the 27th International Conference on Computational Linguistics, pages 55--69, 2018

[50]

Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas

Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Giménez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, and Katharina Kann. Findings of the AmericasNLP 2021 sh...

[51]

On the challenges and opportunities in generative AI

Laura Manduchi, Clara Meister, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Däubener, Sophie Fellenz, Asja Fischer, Thomas Gärtner, Matthias Kirchler, et al. On the challenges and opportunities in generative AI. arXiv preprint arXiv:2403.00025, 2024

[52]

Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI

Robert Munro Monarch. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Simon and Schuster, 2021

[53]

Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id Ahmad, Meriem Beloucif, Saif M. Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Alipio Jorge, Felermino Dário Mário António Ali, Davis David, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Taj...

[54]

Participatory research for low-resourced machine translation: A case study in African languages

Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, et al. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguis...

[55]

AfroBench: how good are large language models on African languages?

Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, and David Ifeoluwa Adelani. AfroBench: how good are large language models on African languages? In Findings of the Association for Computational Linguistics: ACL 2025, pages 19048--19095, 2025

[56]

Moving toward truly responsible AI development in the global AI market

Chinasa Okolo and Marie Tano. Moving toward truly responsible AI development in the global AI market, 2024. Brookings Institution

[57]

Reforming data regulation to advance AI governance in Africa

Chinasa Okolo. Reforming data regulation to advance AI governance in Africa, 2024

[58]

Addressing inequitable openness in licences for sharing African data and datasets through the Nwulite Obodo Open Data Licence

Chijioke Okorie and Melissa Omino. Addressing inequitable openness in licences for sharing African data and datasets through the Nwulite Obodo Open Data Licence. Law, Tech. & Hum., 7:94, 2025

[59]

It’s the NOODL license--awesome and amazingly geeky!

Chijioke Okorie. It’s the NOODL license--awesome and amazingly geeky! Available at SSRN 5339254, 2025

[60]

African data trusts: new tools towards collective data governance?

Nokuthula Olorunju and Rachel Adams. African data trusts: new tools towards collective data governance? Information & Communications Technology Law, 33(1):85--98, 2024

[61]

Outreach programme to strengthen the AI4D network: final technical report

Davor Orlic. Outreach programme to strengthen the AI4D network: final technical report. Technical report, AI4D Africa, 2021

[62]

PazaBench: A speech and language model benchmark for low-resource African languages

Salomey Osei et al. PazaBench: A speech and language model benchmark for low-resource African languages. Microsoft Research, 2024

[63]

AI by the people, for the people

Billy Perrigo. AI by the people, for the people, July 2023

[64]

tinyBenchmarks: evaluating LLMs with fewer examples

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: evaluating LLMs with fewer examples. In Proceedings of the 41st International Conference on Machine Learning, pages 34303--34326, 2024

[65]

On releasing annotator-level labels and information in datasets

Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz. On releasing annotator-level labels and information in datasets. In Claire Bonial and Nianwen Xue, editors, Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, pages 133--138, Punta Cana, Dominican Republic, November 2...

[66]

The Esethu framework: Reimagining sustainable dataset governance and curation for low-resource languages

Jenalea Rajab, Anuoluwapo Aremu, Everlyn Asiko Chimoto, Dale Dunbar, Graham Morrissey, Fadel Thior, Luandrie Potgieter, Jessica Ojo, Atnafu Lambebo Tonja, Wilhelmina NdapewaOnyothi Nekoto, et al. The Esethu framework: Reimagining sustainable dataset governance and curation for low-resource languages. In Proceedings of the 63rd Annual Meeting of the Associ...

[67]

Evaluation examples are not equally informative: How should that change NLP leaderboards?

Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jain, and Jordan Boyd-Graber. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4489--4504, Online, August 2021. Assoc...

[68]

To augment or not to augment? A comparative study on text augmentation techniques for low-resource NLP

Gözde Gül Şahin. To augment or not to augment? A comparative study on text augmentation techniques for low-resource NLP. Computational Linguistics, 48(1):5--42, 2022

[69]

Everyone wants to do the model work, not the data work: Data cascades in high-stakes AI

Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. Everyone wants to do the model work, not the data work: Data cascades in high-stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1--15, 2021

[70]

AI4D--African language dataset challenge

Kathleen Siminyu, Sackey Freshia, Jade Abbott, and Vukosi Marivate. AI4D--African language dataset challenge. arXiv preprint arXiv:2007.11865, 2020

[71]

AI4D--African language program

Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I Adelani, Amelia Taylor, et al. AI4D--African language program. arXiv preprint arXiv:2104.02516, 2021

[72]

IndicGenBench: A multilingual benchmark to evaluate generation capabilities of LLMs on Indic languages

Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, and Partha Talukdar. IndicGenBench: A multilingual benchmark to evaluate generation capabilities of LLMs on Indic languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11047--11073, 2024

[73]

Aya dataset: An open-access collection for multilingual instruction tuning

Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje F Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O’Mahony, et al. Aya dataset: An open-access collection for multilingual instruction tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon...

[74]

Participation is not a design fix for machine learning

Mona Sloane, Emanuel Moss, Olaitan Awomolo, and Laura Forlano. Participation is not a design fix for machine learning. In Proceedings of the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1--6, 2022

[75]

Literature review as a research methodology: An overview and guidelines

Hannah Snyder. Literature review as a research methodology: An overview and guidelines. Journal of business research, 104:333--339, 2019

[76]

SEA-HELM: Southeast Asian holistic evaluation of language models

Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. SEA-HELM: Southeast Asian holistic evaluation of language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12308--12336, 2025

[77]

Kaitiakitanga Māori data sovereignty licences

Karaitiana Taiuru. Kaitiakitanga Māori data sovereignty licences, 2021

[78]

Omnilingual ASR: Open-source multilingual speech recognition for 1600+ languages

Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sad...

[79]

Introducing the Asian Language Treebank (ALT)

Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Introducing the Asian Language Treebank (ALT). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1574--1578, 2016

[80]

AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Hao Yu, Tianyi Xu, Michael A Hedderich, Wassim Hamidouche, Syed Waqas Zamir, and David Ifeoluwa Adelani. AfriqueLLM: How data mixing and model architecture impact continued pre-training for African languages. arXiv preprint arXiv:2601.06395, 2026



[1]

Towards Neural Machine Translation for African Languages

Jade Z Abbott and Laura Martinus. Towards neural machine translation for African languages. arXiv preprint arXiv:1811.05467, 2018

[2]

Correcting FLORES evaluation dataset for four African languages

Idris Abdulmumin, Sthembiso Mkhwanazi, Mahlatse Mbooi, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Neo Putini, Miehleketo Mathebula, Matimba Shingange, Tajuddeen Gwadabe, and Vukosi Marivate. Correcting FLORES evaluation dataset for four African languages. In Barry Haddow, Tom Kocmi, Philipp Koehn, and Christof Monz, editors, Proceedings of the Nint...

[3]

Will global health survive its decolonisation?

Seye Abimbola and Madhukar Pai. Will global health survive its decolonisation? The Lancet, 396(10263):1627--1628, 2020

[4]

Cross-lingual word embeddings for low-resource language modeling

Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 937--947, 2017

[5]

AI and language data flaring in Africa: Addressing the low-resource challenge

Ife Adebara. AI and language data flaring in Africa: Addressing the low-resource challenge. Policy Brief No. 216, 2025

[6]

MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition

David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen H Muhammad, Peter Nabende, et al. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Proce...

[7]

IrokoBench: A new benchmark for African languages in the age of large language models

David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, et al. IrokoBench: A new benchmark for African languages in the age of large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associatio...

[8]

JW300: A wide-coverage parallel corpus for low-resource languages

Željko Agić and Ivan Vulić. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204--3210, 2019

[9]

MEGA: Multilingual evaluation of generative AI

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, et al. MEGA: Multilingual evaluation of generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232--4267, 2023

[10]

Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning

Jesujoba Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336--4349, 2022

[11]

Charting the landscape of African NLP: Mapping progress and shaping the road ahead

Jesujoba Alabi, Michael A Hedderich, David Ifeoluwa Adelani, and Dietrich Klakow. Charting the landscape of African NLP: Mapping progress and shaping the road ahead. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27795--27829, 2025

[12]

Common Voice: A massively-multilingual speech corpus

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. In Proceedings of the twelfth language resources and evaluation conference, pages 4218--4222, 2020

[13]

The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact, and Bibliometric Analysis

Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Abrham Belete Haile, Grigori Sidorov, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, et al. The rise of AfricaNLP: Contributions, contributors, and community impact (2005-2025). arXiv preprint arXiv:2509.25477, 2025

[14]

Data statements for natural language processing: Toward mitigating system bias and enabling better science

Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587--604, 2018

[15]

On the dangers of stochastic parrots: Can language models be too big?

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610--623, 2021

[16]

Decolonising speech and language technology

Steven Bird. Decolonising speech and language technology. In Proceedings of the 28th international conference on computational linguistics, pages 3504--3519, 2020

[17]

Large image datasets: A pyrrhic win for computer vision?

Abeba Birhane and Vinay Uday Prabhu. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536--1546. IEEE, 2021

[18]

The values encoded in machine learning research

Abeba Birhane, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, and Michelle Bao. The values encoded in machine learning research. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pages 173--184, 2022

[19]

Algorithmic colonization of Africa

Abeba Birhane. Algorithmic colonization of Africa. SCRIPTed, 17:389, 2020

[20]

Systematic inequalities in language technology performance across the world’s languages

Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486--5505, 2022

[21]

The CARE principles for Indigenous data governance

Stephanie Russo Carroll, Ibrahim Garba, Oscar L Figueroa-Rodríguez, Jarita Holbrook, Raymond Lovett, Simeon Materechera, Mark Parsons, Kay Raseroka, Desi Rodriguez-Lonebear, Robyn Rowe, et al. The CARE principles for Indigenous data governance. Open Scholarship Press Curated Volumes: Policy, 2023

[22]

An empirical survey of data augmentation for limited data learning in NLP

Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. An empirical survey of data augmentation for limited data learning in NLP. Transactions of the Association for Computational Linguistics, 11:191--211, 2023

[23]

CulturalBench: A robust, diverse and challenging benchmark for measuring LMs’ cultural knowledge through human-AI red-teaming

Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, et al. CulturalBench: A robust, diverse and challenging benchmark for measuring LMs’ cultural knowledge through human-AI red-teaming. In Proceedings of the 63rd Annual Meeting of the Association for Com...

[24]

Unsupervised cross-lingual representation learning at scale

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8440--8451, 2020

[25]

Scrambling for Africa? Universities and global health

Johanna Crane. Scrambling for Africa? Universities and global health. The Lancet, 377(9775):1388--1390, 2011

[26]

Digitisation of oral data for NLP of low-resource languages: Practical methods and processes for scalable and sustainable ecosystem development

DataDotOrg. Digitisation of oral data for NLP of low-resource languages: Practical methods and processes for scalable and sustainable ecosystem development. Playbook, DataDotOrg, Washington, D.C., USA, 2026. A playbook for building sustainable African language technology ecosystems

[27]

Localising the Mozilla Common Voice platform for South Africa’s official languages

Febe de Wet, Andiswa Bukula, Willem Karsten, Martin Puttkammer, Erwin Schillack, Rone Wierenga, and Roald Eiselen. Localising the Mozilla Common Voice platform for South Africa’s official languages. Journal of the Digital Humanities Association of Southern Africa (DHASA), 4(01), 2022

[28]

Bottom-up data trusts: Disturbing the ‘one size fits all’ approach to data governance

Sylvie Delacroix and Neil D Lawrence. Bottom-up data trusts: Disturbing the ‘one size fits all’ approach to data governance. International data privacy law, 9(4):236--252, 2019

[29]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171--4186, 2019

[30]

NL-Augmenter: A framework for task-sensitive natural language augmentation

Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahadiran, Simon Mille, Ashish Shrivastava, Samson Tan, et al. NL-Augmenter: A framework for task-sensitive natural language augmentation. Northern European Journal of Language Technology, 9, 2023

[31]

Ethnologue: Languages of the world

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. Ethnologue: Languages of the world. SIL International, 2025

[32]

AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages

Abteen Ebrahimi, Manuel Mager, Adam Wiemerslage, Pavel Denisov, Katharina Kann, et al. AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279--6299, 2022

[33]

Decolonizing the governance of artificial intelligence in Africa: from normative mimicry to epistemic sovereignty

Jake Okechukwu Effoduh. Decolonizing the governance of artificial intelligence in Africa: from normative mimicry to epistemic sovereignty. Science and Public Policy, 53(2):245--257, 2026

[34]

Developing text resources for ten South African languages

Roald Eiselen and Martin J Puttkammer. Developing text resources for ten South African languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3698--3703, 2014

[35]

A survey of data augmentation approaches for NLP

Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. A survey of data augmentation approaches for NLP. In Findings of the association for computational linguistics: ACL-IJCNLP 2021, pages 968--988, 2021

[36]

A typology of reviews: an analysis of 14 review types and associated methodologies

Maria J Grant and Andrew Booth. A typology of reviews: an analysis of 14 review types and associated methodologies. Health information & libraries journal, 26(2):91--108, 2009

[37]

Universal neural machine translation for extremely low resource languages

Jiatao Gu, Hany Hassan Awadalla, Jacob Devlin, and Victor OK Li. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344--354, 2018

[38]

The weirdest people in the world?

Joseph Henrich, Steven J Heine, and Ara Norenzayan. The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3):61--83, 2010

[39]

Challenges and strategies in cross-cultural NLP

Daniel Hershcovich, Stella Frank, Heather Lent, Miryam De Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, et al. Challenges and strategies in cross-cultural NLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6...

[40] [40]

XTREME : A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME : A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International conference on machine learning , pages 4411--4421. PMLR, 2020

work page 2020

[41] [41]

Lessons from archives: Strategies for collecting sociocultural data in machine learning

Eun Seo Jo and Timnit Gebru. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 conference on fairness, accountability, and transparency , pages 306--316, 2020

work page 2020

[42] [42]

The state and fate of linguistic diversity and inclusion in the nlp world

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the nlp world. In Proceedings of the 58th annual meeting of the association for computational linguistics , pages 6282--6293, 2020

work page 2020

[43] [43]

Llms in the loop: Leveraging large language model annotations for active learning in low-resource languages

Nataliia Kholodna, Sahib Julka, Mohammad Khodadadi, Muhammed Nurullah Gumus, and Michael Granitzer. Llms in the loop: Leveraging large language model annotations for active learning in low-resource languages. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages 397--412. Springer, 2024

work page 2024

[44] [44]

Practical Natural Language Processing for Low-Resource Languages

Benjamin Philip King. Practical Natural Language Processing for Low-Resource Languages . PhD thesis, University of Michigan, 2015

work page 2015

[45] [45]

Lessons learned from a citizen science project for natural language processing

Jan-Christoph Klie, Ji-Ung Lee, Kevin Stowe, G \"o zde S ahin, Nafise Sadat Moosavi, Luke Bates, Dominic Petrak, Richard Eckart De Castilho, and Iryna Gurevych. Lessons learned from a citizen science project for natural language processing. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of t...

work page 2023

[46] [46]

The IIT B ombay E nglish- H indi parallel corpus

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. The IIT B ombay E nglish- H indi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) , 2018

work page 2018

[47] [47]

Lalor, Hao Wu, and Hong Yu

John P. Lalor, Hao Wu, and Hong Yu. Learning latent parameters without human response patterns: Item response theory with artificial crowds. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP) , pages 4674--4684, Hong Kong, China, November 2019. Association for Computational Linguistics

work page 2019

[48] [48]

Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, B \"o rje F. Karlsson, James J...

work page 2024

[49] [49]

Challenges of language technologies for the indigenous languages of the A mericas

Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, and Ivan Meza-Ruiz. Challenges of language technologies for the indigenous languages of the A mericas. In Proceedings of the 27th International Conference on Computational Linguistics , pages 55--69, 2018

work page 2018

[50] [50]

Findings of the A mericas NLP 2021 shared task on open machine translation for indigenous languages of the A mericas

Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Gim \'e nez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, and Katharina Kann. Findings of the A mericas NLP 2021 sh...

work page 2021

[51] [51]

a ubener, Sophie Fellenz, Asja Fischer, Thomas G \

Laura Manduchi, Clara Meister, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina D \"a ubener, Sophie Fellenz, Asja Fischer, Thomas G \"a rtner, Matthias Kirchler, et al. On the challenges and opportunities in generative ai. arXiv preprint arXiv:2403.00025 , 2024

work page arXiv 2024

[52] [52]

Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI

Robert Munro Monarch. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI . Simon and Schuster, 2021

work page 2021

[53] [53]

Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id Ahmad, Meriem Beloucif, Saif M. Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Alipio Jorge, Felermino D \'a rio M \'a rio Ant \'o nio Ali, Davis David, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Taj...

work page 2023

[54] [54]

Participatory research for low-resourced machine translation: A case study in A frican languages

Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, et al. Participatory research for low-resourced machine translation: A case study in A frican languages. In Findings of the Association for Computational Linguis...

work page 2020

[55] [55]

Afrobench: how good are large language models on african languages? In Findings of the Association for Computational Linguistics: ACL 2025, pages 19048--19095, 2025

Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, and David Ifeoluwa Adelani. Afrobench: how good are large language models on african languages? In Findings of the Association for Computational Linguistics: ACL 2025, pages 19048--19095, 2025

work page 2025

[56] [56]

Moving toward truly responsible AI development in the global AI market, 2024

Chinasa Okolo and Marie Tano. Moving toward truly responsible AI development in the global AI market, 2024. Brookings Institution

work page 2024

[57] [57]

Reforming data regulation to advance AI governance in Africa, 2024

Chinasa Okolo. Reforming data regulation to advance AI governance in Africa, 2024

work page 2024

[58] [58]

Addressing inequitable openness in licences for sharing african data and datasets through the nwulite obodo open data licence

Chijioke Okorie and Melissa Omino. Addressing inequitable openness in licences for sharing african data and datasets through the nwulite obodo open data licence. Law, Tech. & Hum., 7:94, 2025

work page 2025

[59] [59]

It’s the noodl license--awesome and amazingly geeky! Available at SSRN 5339254, 2025

Chijioke Okorie. It’s the noodl license--awesome and amazingly geeky! Available at SSRN 5339254, 2025

work page 2025

[60] [60]

African data trusts: new tools towards collective data governance? Information & Communications Technology Law, 33(1):85--98, 2024

Nokuthula Olorunju and Rachel Adams. African data trusts: new tools towards collective data governance? Information & Communications Technology Law, 33(1):85--98, 2024

work page 2024

[61] [61]

Outreach programme to strengthen the AI4D network: final technical report

Davor Orlic. Outreach programme to strengthen the AI4D network: final technical report. Technical report, AI4D Africa, 2021

work page 2021

[62] [62]

PazaBench: A speech and language model benchmark for low-resource african languages

Salomey Osei et al. PazaBench: A speech and language model benchmark for low-resource african languages. Microsoft Research, 2024

work page 2024

[63] [63]

Ai by the people, for the people, July 2023

Billy Perrigo. Ai by the people, for the people, July 2023

work page 2023

[64] [64]

tinybenchmarks: evaluating llms with fewer examples

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples. In Proceedings of the 41st International Conference on Machine Learning, pages 34303--34326, 2024

work page 2024

[65] [65]

On releasing annotator-level labels and information in datasets

Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz. On releasing annotator-level labels and information in datasets. In Claire Bonial and Nianwen Xue, editors, Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, pages 133--138, Punta Cana, Dominican Republic, November 2...

work page 2021

[66] [66]

The esethu framework: Reimagining sustainable dataset governance and curation for low-resource languages

Jenalea Rajab, Anuoluwapo Aremu, Everlyn Asiko Chimoto, Dale Dunbar, Graham Morrissey, Fadel Thior, Luandrie Potgieter, Jessica Ojo, Atnafu Lambebo Tonja, Wilhelmina Ndapewa Onyothi Nekoto, et al. The esethu framework: Reimagining sustainable dataset governance and curation for low-resource languages. In Proceedings of the 63rd Annual Meeting of the Associ...

work page 2025

[67] [67]

Evaluation examples are not equally informative: How should that change NLP leaderboards?

Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jain, and Jordan Boyd-Graber. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4489--4504, Online, August 2021. Assoc...

work page 2021

[68] [68]

To augment or not to augment? a comparative study on text augmentation techniques for low-resource nlp

Gözde Gül Şahin. To augment or not to augment? a comparative study on text augmentation techniques for low-resource nlp. Computational Linguistics, 48(1):5--42, 2022

work page 2022

[69] [69]

Everyone wants to do the model work, not the data work: Data cascades in high-stakes ai

Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. Everyone wants to do the model work, not the data work: Data cascades in high-stakes ai. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1--15, 2021

work page 2021

[70] [70]

Ai4d--african language dataset challenge

Kathleen Siminyu, Sackey Freshia, Jade Abbott, and Vukosi Marivate. Ai4d--african language dataset challenge. arXiv preprint arXiv:2007.11865, 2020

work page arXiv 2020

[71] [71]

Ai4d--african language program

Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I Adelani, Amelia Taylor, et al. Ai4d--african language program. arXiv preprint arXiv:2104.02516, 2021

work page arXiv 2021

[72] [72]

Indicgenbench: A multilingual benchmark to evaluate generation capabilities of llms on indic languages

Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, and Partha Talukdar. Indicgenbench: A multilingual benchmark to evaluate generation capabilities of llms on indic languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11047--11073, 2024

work page 2024

[73] [73]

Aya dataset: An open-access collection for multilingual instruction tuning

Shivalika Singh, Freddie Vargus, Daniel D’souza, B \"o rje F Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O’Mahony, et al. Aya dataset: An open-access collection for multilingual instruction tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon...

work page 2024

[74] [74]

Participation is not a design fix for machine learning

Mona Sloane, Emanuel Moss, Olaitan Awomolo, and Laura Forlano. Participation is not a design fix for machine learning. In Proceedings of the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1--6, 2022

work page 2022

[75] [75]

Literature review as a research methodology: An overview and guidelines

Hannah Snyder. Literature review as a research methodology: An overview and guidelines. Journal of Business Research, 104:333--339, 2019

work page 2019

[76] [76]

Sea-helm: Southeast asian holistic evaluation of language models

Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. Sea-helm: Southeast asian holistic evaluation of language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12308--12336, 2025

work page 2025

[77] [77]

Kaitiakitanga Māori data sovereignty licences, 2021

Karaitiana Taiuru. Kaitiakitanga Māori data sovereignty licences, 2021

work page 2021

[78] [78]

Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages

Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sad...

work page arXiv 2025

[79] [79]

Introducing the asian language treebank (alt)

Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Introducing the asian language treebank (alt). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1574--1578, 2016

work page 2016

[80] [80]

AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Hao Yu, Tianyi Xu, Michael A Hedderich, Wassim Hamidouche, Syed Waqas Zamir, and David Ifeoluwa Adelani. Afriquellm: How data mixing and model architecture impact continued pre-training for african languages. arXiv preprint arXiv:2601.06395, 2026

work page arXiv 2026