pith. machine review for the scientific record.

arxiv: 2604.12540 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP

Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords data augmentation · low-resource languages · Hausa · Fongbe · named entity recognition · part-of-speech tagging · LLM generation · back-translation

The pith

Data augmentation effectiveness for Hausa and Fongbe depends on task type rather than language or LLM quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates LLM-based generation and back-translation as ways to create extra training data for two low-resource West African languages. It tests these methods on named entity recognition and part-of-speech tagging using established benchmarks. Results show that neither approach reliably helps named entity recognition and can even lower scores, while part-of-speech tagging sees small gains in some cases. The same generated data produces opposite results on the two tasks for Fongbe, which indicates that task structure drives outcomes more than data quality or language differences. Readers working on data-scarce languages would care because the work questions the default use of augmentation as an always-helpful step.

Core claim

We show that augmentation effectiveness depends on task type rather than language or LLM quality alone. For named entity recognition, neither LLM generation nor back-translation improves over the baseline for Hausa or Fongbe, with some reductions in F1. For part-of-speech tagging, LLM augmentation improves Fongbe slightly while back-translation improves Hausa modestly. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe, hurting named entity recognition while helping part-of-speech tagging.
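
One mechanism consistent with this pattern, offered here as an editorial illustration rather than an analysis from the paper: NER is scored on exact entity spans while POS tagging is scored per token, so the same rate of label noise in synthetic data costs disproportionately more on NER. A toy sketch with the seqeval library makes the asymmetry concrete:

```python
# Toy illustration (not from the paper): span-level NER scoring amplifies
# single-token label errors that per-token POS-style accuracy absorbs.
from seqeval.metrics import f1_score

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O",     "O", "B-LOC"]]  # one wrong tag breaks the PER span

token_acc = sum(g == p for g, p in zip(gold[0], pred[0])) / len(gold[0])
print(f"token-level accuracy: {token_acc:.2f}")             # 0.75
print(f"entity-level F1:      {f1_score(gold, pred):.2f}")  # 0.50
```

A single corrupted boundary tag erases a whole entity from the span-level score while costing only one token of accuracy, which is one way identical synthetic data could help a token-scored task and hurt a span-scored one.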

What carries the argument

Cross-task comparison of LLM-generated and back-translated synthetic data on named entity recognition and part-of-speech tagging for Hausa and Fongbe using MasakhaNER 2.0 and MasakhaPOS benchmarks.
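
For orientation, a minimal sketch of the back-translation paradigm under evaluation: round-tripping text through English with NLLB-200 via Hugging Face transformers. The checkpoint size, language codes, and usage below are illustrative assumptions; the paper only names "NLLB-200" and does not publish its exact pipeline here.

```python
# Minimal back-translation sketch, assuming the distilled 600M NLLB checkpoint.
# Round-trips Hausa text through English to produce a paraphrased training copy.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def translate(text: str, src: str, tgt: str) -> str:
    tokenizer.src_lang = src
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt),
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

def back_translate(sentence: str, lang: str = "hau_Latn") -> str:
    # FLORES-200 codes: hau_Latn = Hausa, fon_Latn = Fongbe, eng_Latn = English
    return translate(translate(sentence, lang, "eng_Latn"), "eng_Latn", lang)
```

Note that round-tripped sentences no longer align token-for-token with the originals, so NER and POS labels have to be re-projected onto the new text; imperfect projection is one plausible contributor to the NER losses the paper reports.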

If this is right

  • Augmentation cannot be applied as a default preprocessing step and must instead be tested per task.
  • LLM generation quality does not predict whether synthetic data will improve downstream performance.
  • Named entity recognition for these languages shows no benefit and occasional harm from current augmentation methods.
  • Part-of-speech tagging may see modest gains but requires case-by-case validation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The task-specific pattern could appear in other sequence-labeling tasks or additional low-resource languages.
  • Varying the volume of synthetic data might shift the observed effects and identify useful thresholds.
  • Combining augmentation with other low-resource techniques could change how task structure influences outcomes.

Load-bearing premise

The small observed performance differences reflect real effects of augmentation rather than random experimental variation.

What would settle it

A replication on larger test sets or additional languages that finds consistent improvements from augmentation on both named entity recognition and part-of-speech tagging would falsify the task-dependence claim.

read the original abstract

Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods -- LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) -- for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe -- hurting NER while helping POS -- suggesting task structure governs augmentation outcomes more than synthetic data quality. These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper evaluates LLM-based (Gemini 2.5 Flash) and back-translation (NLLB-200) data augmentation for NER and POS tagging on Hausa and Fongbe using the MasakhaNER 2.0 and MasakhaPOS benchmarks. It reports that neither method improves NER for either language (with LLM augmentation reducing Hausa NER F1 by 0.24% and Fongbe by 1.81%), while POS results are mixed (LLM helps Fongbe by 0.33% accuracy; back-translation helps Hausa by 0.17% but hurts Fongbe by 0.35%). The central claim is that augmentation success is governed by task structure rather than language or synthetic-data quality, since the same LLM data produces opposite effects on Fongbe NER vs. POS.

Significance. If the task-specific pattern holds after proper statistical validation, the work offers practical guidance for low-resource African-language NLP by showing that augmentation is not a universal win and should be tested per task. The empirical focus on two typologically distinct languages and two augmentation paradigms is a strength, as is the direct comparison of LLM generation quality against downstream utility.

major comments (1)
  1. [Results / Abstract] Abstract and results: the reported deltas (0.17%–1.81%) are presented as point estimates only, with no standard deviations across random seeds, no p-values, and no mention of multiple training runs. On the small Masakha datasets, fine-tuning variance routinely exceeds 1% F1/accuracy; without these statistics the opposite effects for Fongbe (NER hurt, POS helped) cannot be distinguished from training noise and therefore do not yet support the task-structure claim.
minor comments (1)
  1. [Abstract] The abstract states concrete percentage changes but does not specify the exact baseline F1/accuracy values or the size of the augmented training sets; adding these numbers would make the magnitude of the effects easier to interpret.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the need for statistical rigor in our empirical claims. The concern about variance on small datasets is valid, and we address it directly below with a commitment to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Results / Abstract] Abstract and results: the reported deltas (0.17%–1.81%) are presented as point estimates only, with no standard deviations across random seeds, no p-values, and no mention of multiple training runs. On the small Masakha datasets, fine-tuning variance routinely exceeds 1% F1/accuracy; without these statistics the opposite effects for Fongbe (NER hurt, POS helped) cannot be distinguished from training noise and therefore do not yet support the task-structure claim.

    Authors: We agree that single-run point estimates are insufficient to support the task-structure interpretation, especially on the modest-sized MasakhaNER 2.0 and MasakhaPOS splits where fine-tuning variance is known to be high. In the revised version we will (i) rerun every baseline and augmentation condition with a minimum of five distinct random seeds, (ii) report mean F1/accuracy together with standard deviations, and (iii) add paired statistical tests (e.g., Wilcoxon signed-rank or bootstrap confidence intervals) to assess whether the observed differences—particularly the opposite Fongbe NER vs. POS outcomes—are statistically distinguishable from noise. The abstract and results sections will be updated to reflect these statistics and any resulting changes to the strength of the claims. revision: yes
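
For concreteness, the check the rebuttal commits to could look like the sketch below: per-seed scores for the baseline and augmented conditions, a paired Wilcoxon signed-rank test, and a bootstrap confidence interval on the mean delta. All numbers are placeholders, not results from the paper.

```python
# Sketch of the seed-variance check promised in the rebuttal. Scores are
# hypothetical placeholders; substitute per-seed F1/accuracy from real runs.
import numpy as np
from scipy.stats import wilcoxon

baseline  = np.array([71.2, 70.4, 72.0, 71.5, 70.9])  # hypothetical per-seed F1
augmented = np.array([70.8, 70.1, 71.3, 71.0, 70.2])  # hypothetical per-seed F1

deltas = augmented - baseline
stat, p = wilcoxon(deltas)  # paired test: is the median delta zero?

# Bootstrap 95% CI on the mean delta; if it straddles zero, the "augmentation
# hurts" reading is not distinguishable from training noise.
rng = np.random.default_rng(0)
boot = [rng.choice(deltas, size=deltas.size, replace=True).mean()
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean delta {deltas.mean():+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}], "
      f"Wilcoxon p = {p:.3f}")
```

With only five seeds the two-sided Wilcoxon test bottoms out at p = 0.0625, a useful reminder of how little resolution small replication budgets buy.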

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations or self-referential reductions

full rationale

The manuscript reports direct experimental results from fine-tuning on MasakhaNER 2.0 and MasakhaPOS benchmarks after applying LLM generation and back-translation augmentation. No equations, fitted parameters, uniqueness theorems, or predictive derivations appear; performance deltas are measured against external public baselines rather than constructed from the inputs themselves. All claims about task-specific effects follow from the observed point estimates on independent test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of supervised NLP evaluation rather than new theoretical constructs.

axioms (1)
  • domain assumption MasakhaNER 2.0 and MasakhaPOS are representative benchmarks for the target languages and tasks.
    The paper uses these benchmarks to measure augmentation effects and generalizes from them.

pith-pipeline@v0.9.0 · 5559 in / 1205 out tokens · 22890 ms · 2026-05-10T15:51:11.078632+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 2 canonical work pages · 1 internal anchor

  [1] David Ifeoluwa Adelani et al. 2022. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition. In EMNLP.
  [2] Ife Adebara et al. 2023. SERENGETI: Massively Multilingual Language Models for Africa. In Findings of ACL.
  [3] Jesujoba Alabi et al. 2022. Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning. In COLING.
  [4] Markus Bayer et al. 2022. A Survey on Data Augmentation for Text Classification. ACM Computing Surveys, 55(7).
  [5] Xiang Dai and Heike Adel. 2020. An Analysis of Simple Data Augmentation for Named Entity Recognition. In COLING.
  [6] Bosheng Ding et al. 2020. DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks. In EMNLP.
  [7] Cheikh M. Bamba Dione et al. 2023. MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages. In ACL.
  [8] Bonaventure F. P. Dossou et al. 2022. AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages. In SustaiNLP Workshop at EMNLP.
  [9] Sergey Edunov et al. 2018. Understanding Back-Translation at Scale. In EMNLP.
  [10] Steven Y. Feng et al. 2021. A Survey of Data Augmentation Approaches for NLP. In Findings of ACL.
  [11] Amr Hendy et al. 2023. How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation. arXiv:2302.09210.
  [12] Pratik Joshi et al. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In ACL.
  [13] Varun Kumar et al. 2020. Data Augmentation using Pre-trained Transformer Models. In ACL Workshop on NLP for Similar Languages, Varieties and Dialects.
  [14] Claire Lefebvre and Anne-Marie Brousseau. 2002. A Grammar of Fongbe. Mouton de Gruyter.
  [15] Shamsuddeen Hassan Muhammad et al. 2023. AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages. In EMNLP.
  [16] Wilhelmina Nekoto et al. 2020. Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages. In Findings of EMNLP.
  [17] Paul Newman. 2000. The Hausa Language: An Encyclopedic Reference Grammar. Yale University Press.
  [18] NLLB Team. 2022. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv:2207.04672.
  [19] Kelechi Ogueji et al. 2021. Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages. In RepL4NLP Workshop.
  [20] Nathaniel Robinson et al. 2023. ChatGPT MT: Competitive for High- (but not Low-) Resource Languages. In WMT.
  [21] Timo Schick and Hinrich Schütze. 2021. Generating Datasets with Pretrained Language Models. In EMNLP.
  [22] Rico Sennrich et al. 2016. Improving Neural Machine Translation Models with Monolingual Data. In ACL.
  [23] Jason Wei and Kai Zou. 2019. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In EMNLP.
  [24] Chenxi Whitehouse et al. 2023. LLM-powered Data Augmentation for Enhanced Cross-lingual Performance. In EMNLP.
  [25] Qizhe Xie et al. 2020. Unsupervised Data Augmentation for Consistency Training. In NeurIPS.