
arxiv: 2605.19066 · v1 · pith:F6HIZFQKnew · submitted 2026-05-18 · 💻 cs.CL

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Pith reviewed 2026-05-20 10:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords annotation scarcity paradox · low-resource NLP · evaluation validity · data sovereignty · participatory evaluation · ghost work · multilingual models

The pith

Low-resource NLP model scaling has outpaced the human expertise and infrastructure needed for authentic evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys low-resource NLP evaluation from 2014 onward and identifies a core tension: technical tools for building and benchmarking models have grown quickly, yet the specialized human judgment required to assess complex systems remains limited and unevenly available. This mismatch defines the Annotation Scarcity Paradox. A reader would care because it suggests that many claims of progress in multilingual and low-resource settings rest on evaluations that lack sufficient depth or fairness. The survey reviews earlier optimistic phases, later benchmark-focused approaches, and current generative challenges, then examines responses such as participatory curation and data-efficient methods. It concludes by urging a move toward community-rooted practices that respect data sovereignty.

Core claim

The central claim is the Annotation Scarcity Paradox, defined as the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. Tracing low-resource NLP evaluation across phases of early heuristic optimism, illusions of top-down benchmark scaling, and generative bottlenecks, and examining extractive data pipelines, undercompensated ghost work, and language data flaring, the paper argues that this paradox threatens the epistemic validity of reported progress.

What carries the argument

The Annotation Scarcity Paradox, which captures the mismatch between rapid model and benchmark scaling and the strained, inequitably distributed human expertise required for authentic evaluation.

If this is right

  • Existing benchmarks in low-resource NLP may overestimate model capabilities because they rely on insufficiently deep or representative human judgments.
  • Responses such as data augmentation, model-based evaluation, and active learning carry their own equity and validity trade-offs that must be weighed carefully.
  • Sustainable progress requires shifting from transactional data extraction to relational, community-embedded evaluation practices grounded in epistemic governance and data sovereignty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could test whether involving local language communities earlier in evaluation design improves alignment between benchmark scores and real-world utility.
  • The same mismatch between scaling capacity and human evaluation resources may surface in other AI areas that depend on large-scale human feedback.
  • Long-term field stability may hinge on building distributed infrastructure for annotation rather than continuing to centralize evaluation through global benchmarks.

Load-bearing premise

The sociolinguistic expertise required for authentic evaluation of generative systems is severely strained, inequitably distributed, and structurally marginalised in a manner that directly undermines the validity of existing benchmarks and reported results.

What would settle it

A side-by-side study in which native-speaker experts from low-resource language communities re-evaluate a set of existing benchmarks and produce results that closely match the original reported scores would challenge the claim that the paradox undermines epistemic validity.
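
The comparison in such a study can be sketched in a few lines. The following is a minimal illustration, not the paper's method: it assumes a labelling task, and the label lists, the `cohen_kappa` and `accuracy` helpers, and all values are hypothetical toy data. It measures how closely an expert re-annotation agrees with the original gold labels (Cohen's kappa) and how much a model's score shifts when it is graded against the expert labels instead.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both sources.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each source's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both sources used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical toy data: original benchmark labels, an expert re-annotation,
# and one model's predictions on the same eight items.
original_gold = ["PER", "LOC", "O", "O", "ORG", "PER", "O", "LOC"]
expert_gold = ["PER", "LOC", "O", "LOC", "ORG", "O", "O", "LOC"]
model_output = ["PER", "LOC", "O", "O", "ORG", "PER", "O", "O"]

kappa = cohen_kappa(original_gold, expert_gold)
score_original = accuracy(model_output, original_gold)
score_expert = accuracy(model_output, expert_gold)

print(f"kappa between original and expert labels: {kappa:.3f}")
print(f"model accuracy vs original labels: {score_original:.3f}")
print(f"model accuracy vs expert labels:   {score_expert:.3f}")
```

On this toy data the model scores 0.875 against the original labels but 0.625 against the expert labels. High agreement and a negligible score shift on a real benchmark would count against the paradox; a large shift would support it.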

Figures

Figures reproduced from arXiv: 2605.19066 by Vukosi Marivate.

Figure 1: The African language AI pipeline bottleneck. Each bar ...

Original abstract

Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014 to present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated "ghost work", and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses, including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning, and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a critical narrative survey of low-resource NLP evaluation from 2014 to the present, tracing three phases of development (early heuristic optimism, illusions of top-down benchmark scaling, and generative bottlenecks). It defines the Annotation Scarcity Paradox as the structural mismatch between rapid model scaling capacity and limited sovereign human infrastructure for authentic evaluation, supported by analysis of extractive data pipelines, undercompensated ghost work, and language data flaring. The author argues that this paradox undermines the epistemic validity of reported progress, surveys responses including data augmentation, model-based evaluation, participatory curation, and IRT/active learning methods, and advocates a paradigm shift to relational, community-embedded evaluation grounded in epistemic governance and data sovereignty.

Significance. If the interpretive synthesis holds, the work offers a timely framework for recognizing structural constraints in low-resource NLP evaluation that could encourage more equitable practices and improved validity in multilingual benchmarks. The survey of historical phases and assessment of equity-validity trade-offs in emerging methods provides a useful reference point for researchers addressing annotation challenges, though its impact would be strengthened by tighter linkages to concrete evaluation failures.

major comments (1)
  1. [Abstract and section on the Annotation Scarcity Paradox] The central claim that the paradox 'threatens the epistemic validity of reported progress' is load-bearing but rests on an interpretive synthesis of practices such as ghost work. The manuscript would benefit from at least one concrete, cited example of a low-resource benchmark whose results have been directly questioned because of annotation infrastructure issues, moving the argument from narrative to a falsifiable link.
minor comments (2)
  1. [section tracing the three phases] The three-phase historical framing is clear but could note any overlap or counter-examples between phases to avoid implying a strictly linear progression.
  2. [section surveying emerging responses] In the survey of responses, the equity and validity trade-offs for IRT-based and participatory methods are assessed at a high level; adding a short table summarizing the trade-offs would improve clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. The feedback highlights an opportunity to strengthen the empirical grounding of our central argument, and we have revised the paper accordingly while preserving its character as a critical narrative survey.

Point-by-point responses
  1. Referee: [Abstract and section on the Annotation Scarcity Paradox] The central claim that the paradox 'threatens the epistemic validity of reported progress' is load-bearing but rests on an interpretive synthesis of practices such as ghost work. The manuscript would benefit from at least one concrete, cited example of a low-resource benchmark whose results have been directly questioned because of annotation infrastructure issues, moving the argument from narrative to a falsifiable link.

    Authors: We agree that a concrete, citable example would make the load-bearing claim more directly falsifiable and would help readers connect the structural analysis to specific evaluation outcomes. While the manuscript's strength lies in its synthesis of patterns across extractive pipelines, ghost work, and data flaring, we accept that an illustrative case would improve clarity. In the revised version we have added a concise example in the section defining the Annotation Scarcity Paradox: we now reference documented concerns about annotation quality and inter-annotator agreement in the MasakhaNER benchmark for low-resource African languages, where reliance on non-expert and non-native annotators has been shown in follow-up studies to affect the reliability of reported performance metrics. We have also lightly updated the abstract to signal this addition. This revision keeps the narrative framing intact while addressing the request for a more explicit link.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in conceptual survey

The paper is a critical narrative survey that defines the Annotation Scarcity Paradox through synthesis of external documented practices (extractive pipelines, ghost work, data flaring) across three historical phases. It draws on external literature for support and presents interpretive analysis of equity/validity trade-offs in responses such as participatory curation and IRT methods. No quantitative predictions, fitted parameters, equations, or derivations exist that could reduce to inputs by construction. The central claim relies on observed structural mismatches rather than self-referential definitions or load-bearing self-citations, rendering the argument self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

This is a conceptual survey paper whose central contribution is a new framing rather than a derivation from data or axioms; the ledger therefore contains only the high-level domain assumption used to structure the narrative.

axioms (1)
  • Domain assumption: Low-resource NLP evaluation has evolved through three identifiable phases (early heuristic optimism, illusions of top-down benchmark scaling, and generative bottlenecks) since 2014.
    This tripartite periodization organizes the entire critical narrative survey.
invented entities (1)
  • Annotation Scarcity Paradox (no independent evidence)
    purpose: To name and conceptualize the structural mismatch between model scaling capacity and available expert human evaluation infrastructure.
    This is the primary novel construct introduced by the paper.

pith-pipeline@v0.9.0 · 5771 in / 1395 out tokens · 44135 ms · 2026-05-20T10:40:25.630659+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 3 internal anchors

  1. [1]

    Towards Neural Machine Translation for African Languages

    Jade Z Abbott and Laura Martinus. Towards neural machine translation for african languages. arXiv preprint arXiv:1811.05467 , 2018

  2. [2]

    Correcting FLORES evaluation dataset for four A frican languages

    Idris Abdulmumin, Sthembiso Mkhwanazi, Mahlatse Mbooi, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Neo Putini, Miehleketo Mathebula, Matimba Shingange, Tajuddeen Gwadabe, and Vukosi Marivate. Correcting FLORES evaluation dataset for four A frican languages. In Barry Haddow, Tom Kocmi, Philipp Koehn, and Christof Monz, editors, Proceedings of the Nint...

  3. [3]

    Will global health survive its decolonisation? The Lancet , 396(10263):1627--1628, 2020

    Seye Abimbola and Madhukar Pai. Will global health survive its decolonisation? The Lancet , 396(10263):1627--1628, 2020

  4. [4]

    Cross-lingual word embeddings for low-resource language modeling

    Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers , pages 937--947, 2017

  5. [5]

    AI and language data flaring in A frica: Addressing the low-resource challenge

    Ife Adebara. AI and language data flaring in A frica: Addressing the low-resource challenge. Policy Brief No. 216 , 2025

  6. [6]

    Masakhaner 2.0: Africa-centric transfer learning for named entity recognition

    David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen H Muhammad, Peter Nabende, et al. Masakhaner 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Proce...

  7. [7]

    Irokobench: A new benchmark for african languages in the age of large language models

    David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, et al. Irokobench: A new benchmark for african languages in the age of large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associatio...

  8. [8]

    JW 300: A wide-coverage parallel corpus for low-resource languages

    Z eljko Agi \'c and Ivan Vuli \'c . JW 300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 3204--3210, 2019

  9. [9]

    Mega: Multilingual evaluation of generative ai

    Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, et al. Mega: Multilingual evaluation of generative ai. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 4232--4267, 2023

  10. [10]

    Adapting pre-trained language models to A frican languages via multilingual adaptive fine-tuning

    Jesujoba Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. Adapting pre-trained language models to A frican languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics , pages 4336--4349, 2022

  11. [11]

    Charting the landscape of african nlp: Mapping progress and shaping the road ahead

    Jesujoba Alabi, Michael A Hedderich, David Ifeoluwa Adelani, and Dietrich Klakow. Charting the landscape of african nlp: Mapping progress and shaping the road ahead. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages 27795--27829, 2025

  12. [12]

    Common voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the twelfth language resources and evaluation conference , pages 4218--4222, 2020

  13. [13]

    The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact, and Bibliometric Analysis

    Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Abrham Belete Haile, Grigori Sidorov, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, et al. The rise of africanlp: Contributions, contributors, and community impact (2005-2025). arXiv preprint arXiv:2509.25477 , 2025

  14. [14]

    Bender and Batya Friedman

    Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics , 6:587--604, 2018

  15. [15]

    On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages 610--623, 2021

    Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages 610--623, 2021

  16. [16]

    Decolonising speech and language technology

    Steven Bird. Decolonising speech and language technology. In Proceedings of the 28th international conference on computational linguistics , pages 3504--3519, 2020

  17. [17]

    Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) , pages 1536--1546

    Abeba Birhane and Vinay Uday Prabhu. Large image datasets: A pyrrhic win for computer vision? In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) , pages 1536--1546. IEEE, 2021

  18. [18]

    The values encoded in machine learning research

    Abeba Birhane, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, and Michelle Bao. The values encoded in machine learning research. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency , pages 173--184, 2022

  19. [19]

    Algorithmic colonization of africa

    Abeba Birhane. Algorithmic colonization of africa. SCRIPTed , 17:389, 2020

  20. [20]

    Systematic inequalities in language technology performance across the world’s languages

    Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 5486--5505, 2022

  21. [21]

    The care principles for indigenous data governance

    Stephanie Russo Carroll, Ibrahim Garba, Oscar L Figueroa-Rodr \' guez, Jarita Holbrook, Raymond Lovett, Simeon Materechera, Mark Parsons, Kay Raseroka, Desi Rodriguez-Lonebear, Robyn Rowe, et al. The care principles for indigenous data governance. Open Scholarship Press Curated Volumes: Policy , 2023

  22. [22]

    An empirical survey of data augmentation for limited data learning in nlp

    Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. An empirical survey of data augmentation for limited data learning in nlp. Transactions of the Association for Computational Linguistics , 11:191--211, 2023

  23. [23]

    Culturalbench: A robust, diverse and challenging benchmark for measuring lms’ cultural knowledge through human-ai red-teaming

    Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, et al. Culturalbench: A robust, diverse and challenging benchmark for measuring lms’ cultural knowledge through human-ai red-teaming. In Proceedings of the 63rd Annual Meeting of the Association for Com...

  24. [24]

    Unsupervised cross-lingual representation learning at scale

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th annual meeting of the association for computational linguistics , pages 8440--8451, 2020

  25. [25]

    Scrambling for africa? universities and global health

    Johanna Crane. Scrambling for africa? universities and global health. The Lancet , 377(9775):1388--1390, 2011

  26. [26]

    Digitisation of oral data for nlp of low-resource languages: Practical methods and processes for scalable and sustainable ecosystem development

    DataDotOrg . Digitisation of oral data for nlp of low-resource languages: Practical methods and processes for scalable and sustainable ecosystem development. Playbook, DataDotOrg, Washington, D.C., USA, 2026. A playbook for building sustainable African language technology ecosystems

  27. [27]

    Localising the mozilla common voice platform for south africa’s official languages

    Febe de Wet, Andiswa Bukula, Willem Karsten, Martin Puttkammer, Erwin Schillack, Rone Wierenga, and Roald Eiselen. Localising the mozilla common voice platform for south africa’s official languages. Journal of the Digital Humanities Association of Southern Africa (DHASA) , 4(01), 2022

  28. [28]

    Bottom-up data trusts: Disturbing the ‘one size fits all’approach to data governance

    Sylvie Delacroix and Neil D Lawrence. Bottom-up data trusts: Disturbing the ‘one size fits all’approach to data governance. International data privacy law , 9(4):236--252, 2019

  29. [29]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages 4171--4186, 2019

  30. [30]

    Nl-augmenter: A framework for task-sensitive natural language augmentation

    Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahadiran, Simon Mille, Ashish Shrivastava, Samson Tan, et al. Nl-augmenter: A framework for task-sensitive natural language augmentation. Northern European Journal of Language Technology , 9, 2023

  31. [31]

    Eberhard, Gary F

    David M. Eberhard, Gary F. Simons, and Charles D. Fennig. Ethnologue : Languages of the world. SIL International, 2025

  32. [32]

    AmericasNLI : Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages

    Abteen Ebrahimi, Manuel Mager, Adam Wiemerslage, Pavel Denisov, Katharina Kann, et al. AmericasNLI : Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 6279--6299, 2022

  33. [33]

    Decolonizing the governance of artificial intelligence in africa: from normative mimicry to epistemic sovereignty

    Jake Okechukwu Effoduh. Decolonizing the governance of artificial intelligence in africa: from normative mimicry to epistemic sovereignty. Science and Public Policy , 53(2):245--257, 2026

  34. [34]

    Developing text resources for ten S outh A frican languages

    Roald Eiselen and Martin J Puttkammer. Developing text resources for ten S outh A frican languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) , pages 3698--3703, 2014

  35. [35]

    A survey of data augmentation approaches for nlp

    Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. A survey of data augmentation approaches for nlp. In Findings of the association for computational linguistics: ACL-IJCNLP 2021 , pages 968--988, 2021

  36. [36]

    A typology of reviews: an analysis of 14 review types and associated methodologies

    Maria J Grant and Andrew Booth. A typology of reviews: an analysis of 14 review types and associated methodologies. Health information & libraries journal , 26(2):91--108, 2009

  37. [37]

    Universal neural machine translation for extremely low resource languages

    Jiatao Gu, Hany Hassan Awadalla, Jacob Devlin, and Victor OK Li. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages 344--354, 2018

  38. [38]

    The weirdest people in the world? Behavioral and Brain Sciences , 33(2-3):61--83, 2010

    Joseph Henrich, Steven J Heine, and Ara Norenzayan. The weirdest people in the world? Behavioral and Brain Sciences , 33(2-3):61--83, 2010

  39. [39]

    Challenges and strategies in cross-cultural nlp

    Daniel Hershcovich, Stella Frank, Heather Lent, Miryam De Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, et al. Challenges and strategies in cross-cultural nlp. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 6...

  40. [40]

    XTREME : A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

    Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME : A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International conference on machine learning , pages 4411--4421. PMLR, 2020

  41. [41]

    Lessons from archives: Strategies for collecting sociocultural data in machine learning

    Eun Seo Jo and Timnit Gebru. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 conference on fairness, accountability, and transparency , pages 306--316, 2020

  42. [42]

    The state and fate of linguistic diversity and inclusion in the nlp world

    Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the nlp world. In Proceedings of the 58th annual meeting of the association for computational linguistics , pages 6282--6293, 2020

  43. [43]

    Llms in the loop: Leveraging large language model annotations for active learning in low-resource languages

    Nataliia Kholodna, Sahib Julka, Mohammad Khodadadi, Muhammed Nurullah Gumus, and Michael Granitzer. Llms in the loop: Leveraging large language model annotations for active learning in low-resource languages. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases , pages 397--412. Springer, 2024

  44. [44]

    Practical Natural Language Processing for Low-Resource Languages

    Benjamin Philip King. Practical Natural Language Processing for Low-Resource Languages . PhD thesis, University of Michigan, 2015

  45. [45]

    Lessons learned from a citizen science project for natural language processing

    Jan-Christoph Klie, Ji-Ung Lee, Kevin Stowe, G \"o zde S ahin, Nafise Sadat Moosavi, Luke Bates, Dominic Petrak, Richard Eckart De Castilho, and Iryna Gurevych. Lessons learned from a citizen science project for natural language processing. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of t...

  46. [46]

    The IIT B ombay E nglish- H indi parallel corpus

    Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. The IIT B ombay E nglish- H indi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) , 2018

  47. [47]

    Lalor, Hao Wu, and Hong Yu

    John P. Lalor, Hao Wu, and Hong Yu. Learning latent parameters without human response patterns: Item response theory with artificial crowds. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP) , pages 4674--4684, Hong Kong, China, November 2019. Association for Computational Linguistics

  48. [48]

    Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P

    Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, B \"o rje F. Karlsson, James J...

  49. [49]

    Challenges of language technologies for the indigenous languages of the A mericas

    Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, and Ivan Meza-Ruiz. Challenges of language technologies for the indigenous languages of the A mericas. In Proceedings of the 27th International Conference on Computational Linguistics , pages 55--69, 2018

  50. [50]

    Findings of the A mericas NLP 2021 shared task on open machine translation for indigenous languages of the A mericas

    Manuel Mager, Arturo Oncevay, Abteen Ebrahimi, John Ortega, Annette Rios, Angela Fan, Ximena Gutierrez-Vasques, Luis Chiruzzo, Gustavo Gim \'e nez-Lugo, Ricardo Ramos, Ivan Vladimir Meza Ruiz, Rolando Coto-Solano, Alexis Palmer, Elisabeth Mager-Hois, Vishrav Chaudhary, Graham Neubig, Ngoc Thang Vu, and Katharina Kann. Findings of the A mericas NLP 2021 sh...

  51. [51]

    a ubener, Sophie Fellenz, Asja Fischer, Thomas G \

    Laura Manduchi, Clara Meister, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina D \"a ubener, Sophie Fellenz, Asja Fischer, Thomas G \"a rtner, Matthias Kirchler, et al. On the challenges and opportunities in generative ai. arXiv preprint arXiv:2403.00025 , 2024

  52. [52]

    Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI

    Robert Munro Monarch. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI . Simon and Schuster, 2021

  53. [53]

    Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id Ahmad, Meriem Beloucif, Saif M. Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Alipio Jorge, Felermino D \'a rio M \'a rio Ant \'o nio Ali, Davis David, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Taj...

  54. [54]

    Participatory research for low-resourced machine translation: A case study in African languages

    Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, et al. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguis...

  55. [55]

    Afrobench: how good are large language models on african languages?

    Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, and David Ifeoluwa Adelani. Afrobench: how good are large language models on african languages? In Findings of the Association for Computational Linguistics: ACL 2025, pages 19048--19095, 2025

  56. [56]

    Moving toward truly responsible AI development in the global AI market

    Chinasa Okolo and Marie Tano. Moving toward truly responsible AI development in the global AI market, 2024. Brookings Institution

  57. [57]

    Reforming data regulation to advance AI governance in Africa

    Chinasa Okolo. Reforming data regulation to advance AI governance in Africa, 2024

  58. [58]

    Addressing inequitable openness in licences for sharing african data and datasets through the nwulite obodo open data licence

    Chijioke Okorie and Melissa Omino. Addressing inequitable openness in licences for sharing african data and datasets through the nwulite obodo open data licence. Law, Tech. & Hum., 7:94, 2025

  59. [59]

    It’s the noodl license--awesome and amazingly geeky!

    Chijioke Okorie. It’s the noodl license--awesome and amazingly geeky! Available at SSRN 5339254, 2025

  60. [60]

    African data trusts: new tools towards collective data governance?

    Nokuthula Olorunju and Rachel Adams. African data trusts: new tools towards collective data governance? Information & Communications Technology Law, 33(1):85--98, 2024

  61. [61]

    Outreach programme to strengthen the AI4D network: final technical report

    Davor Orlic. Outreach programme to strengthen the AI4D network: final technical report. Technical report, AI4D Africa, 2021

  62. [62]

    PazaBench: A speech and language model benchmark for low-resource african languages

    Salomey Osei et al. PazaBench: A speech and language model benchmark for low-resource african languages. Microsoft Research, 2024

  63. [63]

    Ai by the people, for the people

    Billy Perrigo. Ai by the people, for the people, July 2023

  64. [64]

    tinybenchmarks: evaluating llms with fewer examples

    Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples. In Proceedings of the 41st International Conference on Machine Learning, pages 34303--34326, 2024

  65. [65]

    On releasing annotator-level labels and information in datasets

    Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz. On releasing annotator-level labels and information in datasets. In Claire Bonial and Nianwen Xue, editors, Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, pages 133--138, Punta Cana, Dominican Republic, November 2...

  66. [66]

    The esethu framework: Reimagining sustainable dataset governance and curation for low-resource languages

    Jenalea Rajab, Anuoluwapo Aremu, Everlyn Asiko Chimoto, Dale Dunbar, Graham Morrissey, Fadel Thior, Luandrie Potgieter, Jessica Ojo, Atnafu Lambebo Tonja, Wilhelmina Ndapewa Onyothi Nekoto, et al. The esethu framework: Reimagining sustainable dataset governance and curation for low-resource languages. In Proceedings of the 63rd Annual Meeting of the Associ...

  67. [67]

    Evaluation examples are not equally informative: How should that change NLP leaderboards?

    Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jain, and Jordan Boyd-Graber. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4489--4504, Online, August 2021. Assoc...

  68. [68]

    To augment or not to augment? a comparative study on text augmentation techniques for low-resource nlp

    Gözde Gül Şahin. To augment or not to augment? a comparative study on text augmentation techniques for low-resource nlp. Computational Linguistics, 48(1):5--42, 2022

  69. [69]

    Everyone wants to do the model work, not the data work: Data cascades in high-stakes ai

    Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. Everyone wants to do the model work, not the data work: Data cascades in high-stakes ai. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1--15, 2021

  70. [70]

    Ai4d--african language dataset challenge

    Kathleen Siminyu, Sackey Freshia, Jade Abbott, and Vukosi Marivate. Ai4d--african language dataset challenge. arXiv preprint arXiv:2007.11865, 2020

  71. [71]

    Ai4d--african language program

    Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I Adelani, Amelia Taylor, et al. Ai4d--african language program. arXiv preprint arXiv:2104.02516, 2021

  72. [72]

    Indicgenbench: A multilingual benchmark to evaluate generation capabilities of llms on indic languages

    Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, and Partha Talukdar. Indicgenbench: A multilingual benchmark to evaluate generation capabilities of llms on indic languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11047--11073, 2024

  73. [73]

    Aya dataset: An open-access collection for multilingual instruction tuning

    Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje F Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O’Mahony, et al. Aya dataset: An open-access collection for multilingual instruction tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon...

  74. [74]

    Participation is not a design fix for machine learning

    Mona Sloane, Emanuel Moss, Olaitan Awomolo, and Laura Forlano. Participation is not a design fix for machine learning. In Proceedings of the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1--6, 2022

  75. [75]

    Literature review as a research methodology: An overview and guidelines

    Hannah Snyder. Literature review as a research methodology: An overview and guidelines. Journal of business research, 104:333--339, 2019

  76. [76]

    Sea-helm: Southeast asian holistic evaluation of language models

    Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. Sea-helm: Southeast asian holistic evaluation of language models. In Findings of the Association for Computational Linguistics: ACL 2025 , pages 12308--12336, 2025

  77. [77]

    Kaitiakitanga māori data sovereignty licences

    Karaitiana Taiuru. Kaitiakitanga māori data sovereignty licences, 2021

  78. [78]

    Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages

    Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sad...

  79. [79]

    Introducing the asian language treebank (alt)

    Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Introducing the asian language treebank (alt). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1574--1578, 2016

  80. [80]

    AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

    Hao Yu, Tianyi Xu, Michael A Hedderich, Wassim Hamidouche, Syed Waqas Zamir, and David Ifeoluwa Adelani. Afriquellm: How data mixing and model architecture impact continued pre-training for african languages. arXiv preprint arXiv:2601.06395, 2026

Showing first 80 references.