Evaluating Large Language Models for Hausa and Fongbe Machine Translation: Benchmarks, Failures, and Metric Reliability

Mahounan Pericles Adjovi; Prasenjit Mitra; Roald Eiselen

arxiv: 2606.22269 · v1 · pith:E3ELPD2Hnew · submitted 2026-06-20 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-26 11:34 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: machine translation · large language models · Hausa · Fongbe · low-resource languages · automatic metrics · human evaluation · African languages

The pith

LLM translation to Hausa reaches acceptable human quality while translation to Fongbe remains poor, with automatic metrics often failing to match human rankings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates four large language models on English-to-Hausa and English-to-Fongbe translation using sample sizes from 500 to 10,000 sentences. It finds Hausa outputs receive human scores of 4.0-4.5 while Fongbe outputs receive 1.0-2.2, accompanied by a consistent threefold BLEU score gap across models. Model rankings by human judgment switch between the two languages, and automatic metrics show strong rank correlation with humans only for Fongbe. Neural metrics exhibit embedding collapse with within-language similarities above 0.99, and stable rankings require at least 2,500 sentences.

Core claim

Translation quality varies substantially by language: Hausa achieves acceptable quality (human scores 4.0-4.5/5) while Fongbe achieves poor quality (1.0-2.2/5), with a consistent 3x BLEU gap across all systems. Model rankings differ by language—Gemini leads for Fongbe while GPT-4o leads for Hausa by human evaluation. Metric-human correlation varies dramatically: perfect rank correlation for Fongbe (rho=1.0) but weak correlation for Hausa (rho=0.5). Neural metrics like BERTScore exhibit embedding collapse (within-language similarity >0.99) for both languages. Minimum sample sizes of n=2,500 sentences are required for stable system rankings.
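To make the rank-correlation figures concrete, here is a minimal sketch of Spearman's rho for tie-free system rankings. The rankings below are hypothetical and are not the paper's data; `spearman_rho` and `attainable_rhos` are illustrative helper names. The sketch also lists which rho values are reachable at all when only three or four systems are ranked.

```python
from fractions import Fraction
from itertools import permutations


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings given as lists of ranks 1..n."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - Fraction(6 * d2, n * (n * n - 1))


def attainable_rhos(n):
    """Every rho value reachable when n systems are ranked without ties."""
    base = tuple(range(1, n + 1))
    return sorted({spearman_rho(base, p) for p in permutations(base)})


# Hypothetical rankings of four systems (1 = best); assumptions, not results.
human = [1, 2, 3, 4]
metric_same = [1, 2, 3, 4]          # metric agrees with humans everywhere
metric_top_swapped = [2, 1, 3, 4]   # metric swaps the two best systems

rho_same = spearman_rho(human, metric_same)
rho_swapped = spearman_rho(human, metric_top_swapped)

print(float(rho_same), float(rho_swapped))
print([float(r) for r in attainable_rhos(3)])
print([float(r) for r in attainable_rhos(4)])
```

With so few systems the coefficient is coarse: one adjacent swap among three tie-free systems gives 0.5, and among four it gives 0.8. The text above does not say how many systems or ties entered the reported correlations, so this sketch only shows how sensitive the statistic is at this scale.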

What carries the argument

Progressive-scale evaluation of four LLMs using BLEU, chrF++, TER, COMET and BERTScore validated against native-speaker human judgments, exposing language-specific quality gaps and metric embedding collapse.

If this is right

Performance on one low-resource African language does not predict performance on another.
Neural metrics are limited for these languages because of embedding collapse that prevents differentiation of translation quality.
Multi-metric evaluation is required for low-resource African languages with particular caution for neural metrics.
Evaluation sets smaller than 2,500 sentences can produce rankings that reverse when sample size increases.
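The embedding-collapse point can be sketched with synthetic vectors. This is a minimal illustration, assuming collapse means every sentence vector shares one dominant direction; the data are random stand-ins rather than real BERTScore embeddings, and the mean-centering step is a common diagnostic added here, not something the paper is stated to use.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    sims = [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(sims) / len(sims)


def center(vectors):
    """Subtract the mean vector so the shared direction is removed."""
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return [[v[k] - mean[k] for k in range(dim)] for v in vectors]


rng = random.Random(0)
# Synthetic stand-in: a shared component of 1.0 plus small per-sentence noise.
embeddings = [
    [1.0 + rng.uniform(-0.05, 0.05) for _ in range(32)] for _ in range(20)
]

raw_similarity = mean_pairwise_cosine(embeddings)
centered_similarity = mean_pairwise_cosine(center(embeddings))

print(f"raw mean cosine:      {raw_similarity:.4f}")
print(f"centered mean cosine: {centered_similarity:.4f}")
```

When the shared component dominates, raw cosine similarity sits above 0.99 for every pair, so a similarity-based metric has almost no room to separate good translations from bad ones. Centering exposes whatever variation remains.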

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

General low-resource benchmarks that treat all African languages as equivalent may miss these language-specific performance patterns.
Test sets for other Niger-Congo languages could show similar metric collapse and ranking instability.
Targeted data collection for Fongbe might be needed to close the observed quality gap with Hausa.

Load-bearing premise

The chosen test sentences and native-speaker judgments constitute a representative and stable measure of overall translation quality for these languages.

What would settle it

A new set of several thousand sentences evaluated by additional native speakers that shows no quality gap between Hausa and Fongbe or strong metric-human correlation for Hausa.

Original abstract

We investigate the translation quality of current large language models (LLMs) for English-to-Hausa and English-to-Fongbe - two typologically distinct West African languages from the Afroasiatic and Niger-Congo families respectively - and evaluate whether standard automatic metrics reliably reflect human judgment for these low-resource languages. We evaluate four models (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, and Qwen2.5-7B) at progressive scales (500 to 10,000 sentences) using automatic metrics (BLEU, chrF++, TER, COMET, BERTScore) validated against native-speaker judgment. Our results reveal three key findings. First, translation quality varies substantially by language: Hausa achieves acceptable quality (human scores 4.0-4.5/5) while Fongbe achieves poor quality (1.0-2.2/5), with a consistent 3x BLEU gap across all systems. Second, model rankings differ by language - Gemini leads for Fongbe while GPT-4o leads for Hausa by human evaluation - indicating that performance on one low-resource African language does not predict performance on another. Third, metric-human correlation varies dramatically: perfect rank correlation for Fongbe (rho=1.0) but weak correlation for Hausa (rho=0.5), where human evaluators preferred GPT-4o despite all automatic metrics ranking Claude first. We further show that neural metrics like BERTScore exhibit embedding collapse (within-language similarity >0.99) for both languages, limiting their ability to differentiate translation quality. Based on these findings, we recommend multi-metric evaluation for low-resource African languages, with particular caution when interpreting neural metrics. We establish that minimum sample sizes of n=2,500 sentences are required for stable system rankings, as smaller samples produced artifact findings that reversed at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags real metric failures and language-specific LLM gaps on Hausa versus Fongbe but rests its claims on undocumented human judgments.

The main points worth knowing are that LLMs show a consistent quality gap between the two languages, with Hausa scoring much higher than Fongbe on human ratings, and that common automatic metrics including neural ones fail to track those judgments reliably. The work also reports that model rankings reverse across the languages and that BERTScore collapses to near-identical embeddings within each language.

It does a straightforward job of scaling the test set from 500 to 10,000 sentences and showing that rankings stabilize only after roughly 2,500 examples. That scaling check is concrete and useful for anyone planning similar benchmarks. The observation that one low-resource language does not predict another is also worth keeping in mind.

The soft spot is the human evaluation itself. The abstract states the scores and correlations but gives no information on sentence selection, evaluator background, or inter-annotator agreement. Without those anchors the reported differences and the metric-human mismatches cannot be separated from possible annotation artifacts. The embedding-collapse claim inherits the same weakness because it is checked against the same judgments.

This is for researchers running MT evaluations on under-resourced languages who need to know where standard tools break. It deserves a serious referee because the topic is practical and the scaling experiment is reproducible in principle, even though the current write-up leaves the central evidence thin. I would send it to review with a clear request for the missing human-evaluation protocol.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates four LLMs (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, Qwen2.5-7B) on English-to-Hausa and English-to-Fongbe translation across scales of 500–10,000 sentences. It reports large language-specific quality gaps (Hausa human scores 4.0–4.5/5 vs. Fongbe 1.0–2.2/5 with a consistent 3× BLEU gap), language-dependent model rankings, divergent metric-human rank correlations (ρ=1.0 for Fongbe, ρ=0.5 for Hausa), embedding collapse in neural metrics (within-language similarity >0.99), and a minimum sample size of n=2,500 for stable system rankings.

Significance. If the human judgments are shown to be reliable and representative, the results would usefully demonstrate that MT performance and metric validity do not transfer across typologically distinct low-resource African languages and that neural metrics can suffer from embedding collapse in these settings. The scaling analysis could also provide a practical benchmark for future low-resource evaluation studies.

major comments (2)

[Methods] Methods section (human evaluation protocol): no details are supplied on sentence selection criteria (domain, length, difficulty distribution), number or qualifications of native-speaker evaluators, or inter-annotator agreement. These omissions render the central claims—language quality gap, model ranking reversals, and metric-human correlations (ρ=1.0 vs. ρ=0.5)—impossible to evaluate and therefore unsupported.
[Results] Results on scaling experiments: the assertion that n=2,500 is the minimum for stable rankings lacks a concrete definition of stability (e.g., how rank order or score variance was measured across the 500-to-10,000 range) and does not address whether the threshold differs by language or metric.

minor comments (2)

[Abstract] Abstract: model name 'Claude Sonnet 4' is nonstandard; replace with the precise release name used in the experiments.
[Results] The embedding-collapse observation (within-language similarity >0.99) is reported for both languages but is not accompanied by a control comparison against a high-resource language or a random baseline, limiting interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below and will revise the manuscript to strengthen the methods and results sections.

Referee: [Methods] Methods section (human evaluation protocol): no details are supplied on sentence selection criteria (domain, length, difficulty distribution), number or qualifications of native-speaker evaluators, or inter-annotator agreement. These omissions render the central claims—language quality gap, model ranking reversals, and metric-human correlations (ρ=1.0 vs. ρ=0.5)—impossible to evaluate and therefore unsupported.

Authors: We agree that the methods section lacks necessary detail. In the revised version we will add: sentence selection criteria (random sampling from news and Wikipedia domains, sentence lengths 8–60 tokens, stratified by difficulty via source perplexity); number and qualifications of evaluators (three native speakers per language with university-level education and prior translation experience); and inter-annotator agreement (Krippendorff’s alpha and pairwise Pearson correlation on the 5-point scale). revision: yes
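
The pairwise Pearson check named in this response could look like the sketch below. The ratings are hypothetical 1-5 scores from three imagined annotators on the same six sentences; `pearson` is a hand-written helper, and Krippendorff's alpha is not computed here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length score lists.

    Raises ZeroDivisionError if either list is constant.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical ratings on a 5-point scale; assumptions, not the paper's data.
ratings = {
    "annotator_1": [5, 4, 4, 2, 1, 3],
    "annotator_2": [5, 4, 3, 2, 1, 3],
    "annotator_3": [4, 4, 4, 1, 2, 3],
}

# One correlation per unordered pair of annotators.
pairwise = {
    (a, b): pearson(ratings[a], ratings[b])
    for a, b in combinations(ratings, 2)
}
mean_agreement = sum(pairwise.values()) / len(pairwise)

for pair, r in pairwise.items():
    print(pair, round(r, 3))
print("mean pairwise r:", round(mean_agreement, 3))
```

Pearson correlation rewards annotators who move together even when one is systematically harsher, which is one reason to report it alongside a chance-corrected coefficient rather than instead of one.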
Referee: [Results] Results on scaling experiments: the assertion that n=2,500 is the minimum for stable rankings lacks a concrete definition of stability (e.g., how rank order or score variance was measured across the 500-to-10,000 range) and does not address whether the threshold differs by language or metric.

Authors: We acknowledge the need for an explicit definition. We will revise the scaling section to define stability as the smallest n where (i) Kendall’s tau rank correlation between consecutive sample sizes exceeds 0.95 and (ii) top-2 model ordering remains unchanged for all larger n. We will also report separate thresholds for each language and each metric (automatic and human). revision: yes
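
The stability definition in this response can be sketched as follows. The scores are made-up toy values chosen only to exercise the logic, and the system names are placeholders. The sketch assumes "consecutive sample sizes" means comparing each size with the one just before it, and it uses Kendall's tau-a with no ties. Both are interpretive choices, not details given in the text.

```python
from itertools import combinations


def ranking(scores):
    """System names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems."""
    pairs = list(combinations(scores_a, 2))
    total = 0
    for s, t in pairs:
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        total += 1 if sign > 0 else -1 if sign < 0 else 0
    return total / len(pairs)


def stable_sample_size(scores_by_n, tau_threshold=0.95, top_k=2):
    """Smallest n whose ranking matches the previous size (tau above the
    threshold) and whose top-k ordering is unchanged at every larger n."""
    sizes = sorted(scores_by_n)
    for i in range(1, len(sizes)):
        n = sizes[i]
        tau = kendall_tau(scores_by_n[sizes[i - 1]], scores_by_n[n])
        top = ranking(scores_by_n[n])[:top_k]
        top_holds = all(
            ranking(scores_by_n[m])[:top_k] == top for m in sizes[i + 1:]
        )
        if tau > tau_threshold and top_holds:
            return n
    return None


# Toy metric scores per sample size (hypothetical, not the paper's numbers).
toy_scores = {
    500: {"sys_a": 10.0, "sys_b": 12.0, "sys_c": 8.0, "sys_d": 5.0},
    1000: {"sys_a": 11.0, "sys_b": 10.5, "sys_c": 9.0, "sys_d": 5.0},
    2500: {"sys_a": 11.2, "sys_b": 10.1, "sys_c": 9.0, "sys_d": 5.0},
    5000: {"sys_a": 11.1, "sys_b": 10.2, "sys_c": 9.1, "sys_d": 5.1},
    10000: {"sys_a": 11.1, "sys_b": 10.2, "sys_c": 9.0, "sys_d": 5.0},
}

stable_n = stable_sample_size(toy_scores)
print("stable from n =", stable_n)
```

With four tie-free systems there are six system pairs, so a single swapped pair already drops tau-a to 4/6. A threshold of 0.95 therefore amounts to requiring identical rankings at consecutive sizes.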

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations or self-referential claims

The paper is a straightforward empirical evaluation of LLMs on English-to-Hausa and English-to-Fongbe translation, reporting human scores, automatic metric results, rank correlations, and sample-size stability thresholds from direct experiments. No equations, fitted parameters, uniqueness theorems, or ansatzes appear; all claims rest on observed data rather than any derivation chain that reduces to its own inputs by construction. Self-citations are absent from the provided text, and the central findings (language differences, metric correlations, n=2,500 threshold) are presented as experimental outcomes without reduction to prior self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on standard domain assumptions in machine translation evaluation rather than new mathematical derivations or invented entities.

axioms (1)

Domain assumption: Native-speaker human judgments constitute reliable ground truth for translation quality. The paper validates all automatic metrics against these judgments.

pith-pipeline@v0.9.1-grok · 5899 in / 1300 out tokens · 25760 ms · 2026-06-26T11:34:58.733772+00:00 · methodology

