Evaluating Large Language Models for Hausa and Fongbe Machine Translation: Benchmarks, Failures, and Metric Reliability
Pith reviewed 2026-06-26 11:34 UTC · model grok-4.3
The pith
LLM translation to Hausa reaches acceptable human quality while translation to Fongbe remains poor, with automatic metrics often failing to match human rankings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Translation quality varies substantially by language: Hausa achieves acceptable quality (human scores 4.0-4.5/5) while Fongbe achieves poor quality (1.0-2.2/5), with a consistent 3x BLEU gap across all systems. Model rankings differ by language—Gemini leads for Fongbe while GPT-4o leads for Hausa by human evaluation. Metric-human correlation varies dramatically: perfect rank correlation for Fongbe (rho=1.0) but weak correlation for Hausa (rho=0.5). Neural metrics like BERTScore exhibit embedding collapse (within-language similarity >0.99) for both languages. Minimum sample sizes of n=2,500 sentences are required for stable system rankings.
What carries the argument
Progressive-scale evaluation of four LLMs using BLEU, chrF++, TER, COMET and BERTScore validated against native-speaker human judgments, exposing language-specific quality gaps and metric embedding collapse.
If this is right
- Performance on one low-resource African language does not predict performance on another.
- Neural metrics are limited for these languages because of embedding collapse that prevents differentiation of translation quality.
- Multi-metric evaluation is required for low-resource African languages with particular caution for neural metrics.
- Evaluation sets smaller than 2,500 sentences can produce rankings that reverse when sample size increases.
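The last point, rankings that flip at small sample sizes, can be probed with a simple resampling check. The following is a minimal sketch, not the paper's procedure: it assumes per-sentence scores are available for each system, uses synthetic scores for three hypothetical systems, and introduces `rank_systems` and `ranking_agreement` as illustrative helper names. It reports how often a random subset of size n reproduces the full-set ranking.

```python
import random

def rank_systems(scores, indices):
    """Order system names by mean score over the given sentence indices (best first)."""
    means = {name: sum(vals[i] for i in indices) / len(indices)
             for name, vals in scores.items()}
    return sorted(means, key=means.get, reverse=True)

def ranking_agreement(scores, n, trials=200, seed=0):
    """Fraction of random size-n subsets whose ranking equals the full-set ranking."""
    rng = random.Random(seed)
    total = len(next(iter(scores.values())))
    full = rank_systems(scores, range(total))
    hits = 0
    for _ in range(trials):
        subset = rng.sample(range(total), n)  # sample sentence indices without replacement
        if rank_systems(scores, subset) == full:
            hits += 1
    return hits / trials

# Synthetic per-sentence scores (assumption): two systems with close means, one clearly worse.
_rng = random.Random(42)
synthetic = {
    "sys_a": [_rng.gauss(0.52, 0.2) for _ in range(5000)],
    "sys_b": [_rng.gauss(0.50, 0.2) for _ in range(5000)],
    "sys_c": [_rng.gauss(0.40, 0.2) for _ in range(5000)],
}

for n in (100, 500, 2500):
    print(n, ranking_agreement(synthetic, n))
```

When two systems have close mean scores, small subsets will often order them differently from the full set, which is the kind of reversal the bullet describes.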
Where Pith is reading between the lines
- General low-resource benchmarks that treat all African languages as equivalent may miss these language-specific performance patterns.
- Test sets for other Niger-Congo languages could show similar metric collapse and ranking instability.
- Targeted data collection for Fongbe might be needed to close the observed quality gap with Hausa.
Load-bearing premise
The chosen test sentences and native-speaker judgments constitute a representative and stable measure of overall translation quality for these languages.
What would settle it
A new set of several thousand sentences evaluated by additional native speakers that shows no quality gap between Hausa and Fongbe or strong metric-human correlation for Hausa.
Original abstract
We investigate the translation quality of current large language models (LLMs) for English-to-Hausa and English-to-Fongbe - two typologically distinct West African languages from the Afroasiatic and Niger-Congo families respectively - and evaluate whether standard automatic metrics reliably reflect human judgment for these low-resource languages. We evaluate four models (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, and Qwen2.5-7B) at progressive scales (500 to 10,000 sentences) using automatic metrics (BLEU, chrF++, TER, COMET, BERTScore) validated against native-speaker judgment. Our results reveal three key findings. First, translation quality varies substantially by language: Hausa achieves acceptable quality (human scores 4.0-4.5/5) while Fongbe achieves poor quality (1.0-2.2/5), with a consistent 3x BLEU gap across all systems. Second, model rankings differ by language - Gemini leads for Fongbe while GPT-4o leads for Hausa by human evaluation - indicating that performance on one low-resource African language does not predict performance on another. Third, metric-human correlation varies dramatically: perfect rank correlation for Fongbe (rho=1.0) but weak correlation for Hausa (rho=0.5), where human evaluators preferred GPT-4o despite all automatic metrics ranking Claude first. We further show that neural metrics like BERTScore exhibit embedding collapse (within-language similarity >0.99) for both languages, limiting their ability to differentiate translation quality. Based on these findings, we recommend multi-metric evaluation for low-resource African languages, with particular caution when interpreting neural metrics. We establish that minimum sample sizes of n=2,500 sentences are required for stable system rankings, as smaller samples produced artifact findings that reversed at scale.
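When reading the rank correlations quoted in the abstract, it helps to recall how few values Spearman's rho can take when only a handful of systems are ranked. The sketch below uses hypothetical rankings, not the paper's data, and the standard tie-free formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)). For three systems, swapping the top two already gives rho = 0.5.

```python
from itertools import permutations

def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings given as lists of ranks (1 = best)."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks for three systems: metric ranking vs. human ranking.
metric_ranks = [1, 2, 3]
human_same = [1, 2, 3]      # humans agree with the metric
human_swapped = [2, 1, 3]   # humans swap the top two systems

print(spearman_rho(metric_ranks, human_same))     # 1.0
print(spearman_rho(metric_ranks, human_swapped))  # 0.5

# Every value rho can take for three tie-free systems.
possible = sorted({spearman_rho(metric_ranks, list(p))
                   for p in permutations(metric_ranks)})
print(possible)  # [-1.0, -0.5, 0.5, 1.0]
```

For four tie-free systems the same formula moves in steps of 0.2, so 0.5 cannot occur there. A reported rho of 0.5 is therefore consistent with three ranked systems or with tied ranks; the abstract does not say which applies.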
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates four LLMs (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, Qwen2.5-7B) on English-to-Hausa and English-to-Fongbe translation across scales of 500–10,000 sentences. It reports large language-specific quality gaps (Hausa human scores 4.0–4.5/5 vs. Fongbe 1.0–2.2/5 with a consistent 3× BLEU gap), language-dependent model rankings, divergent metric-human rank correlations (ρ=1.0 for Fongbe, ρ=0.5 for Hausa), embedding collapse in neural metrics (within-language similarity >0.99), and a minimum sample size of n=2,500 for stable system rankings.
Significance. If the human judgments are shown to be reliable and representative, the results would usefully demonstrate that MT performance and metric validity do not transfer across typologically distinct low-resource African languages and that neural metrics can suffer from embedding collapse in these settings. The scaling analysis could also provide a practical benchmark for future low-resource evaluation studies.
major comments (2)
- [Methods] Methods section (human evaluation protocol): no details are supplied on sentence selection criteria (domain, length, difficulty distribution), number or qualifications of native-speaker evaluators, or inter-annotator agreement. These omissions render the central claims—language quality gap, model ranking reversals, and metric-human correlations (ρ=1.0 vs. ρ=0.5)—impossible to evaluate and therefore unsupported.
- [Results] Results on scaling experiments: the assertion that n=2,500 is the minimum for stable rankings lacks a concrete definition of stability (e.g., how rank order or score variance was measured across the 500-to-10,000 range) and does not address whether the threshold differs by language or metric.
minor comments (2)
- [Abstract] Abstract: model name 'Claude Sonnet 4' is nonstandard; replace with the precise release name used in the experiments.
- [Results] The embedding-collapse observation (within-language similarity >0.99) is reported for both languages but is not accompanied by a control comparison against a high-resource language or a random baseline, limiting interpretability.
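The collapse statistic and the baseline comparison requested in the last comment can be illustrated with a small experiment. This is a minimal sketch under stated assumptions: embeddings are plain lists of floats, the vectors are synthetic rather than taken from any model, and `cosine` and `mean_pairwise_cosine` are hypothetical helper names. Vectors that share a large common component score close to 1.0 against each other whatever their content, while independent random vectors average near 0.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

rng = random.Random(0)
dim, count = 64, 50

# Random baseline (assumption): independent Gaussian vectors.
baseline = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(count)]

# "Collapsed" set (assumption): one large shared offset plus small per-vector noise.
shared = [20.0] * dim
collapsed = [[s + rng.gauss(0, 1) for s in shared] for _ in range(count)]

print("baseline :", mean_pairwise_cosine(baseline))
print("collapsed:", mean_pairwise_cosine(collapsed))
```

Reporting the same statistic for a high-resource language next to the Hausa and Fongbe values would show whether similarity above 0.99 is specific to these languages or a general property of the embedding model.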
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each major comment below and will revise the manuscript to strengthen the methods and results sections.
Point-by-point responses
- Referee: [Methods] Methods section (human evaluation protocol): no details are supplied on sentence selection criteria (domain, length, difficulty distribution), number or qualifications of native-speaker evaluators, or inter-annotator agreement. These omissions render the central claims (the language quality gap, the model ranking reversals, and the metric-human correlations of ρ=1.0 vs. ρ=0.5) impossible to evaluate and therefore unsupported.
  Authors: We agree that the methods section lacks necessary detail. In the revised version we will add: sentence selection criteria (random sampling from news and Wikipedia domains, sentence lengths 8–60 tokens, stratified by difficulty via source perplexity); number and qualifications of evaluators (three native speakers per language with university-level education and prior translation experience); and inter-annotator agreement (Krippendorff’s alpha and pairwise Pearson correlation on the 5-point scale). Revision: yes.
- Referee: [Results] Results on scaling experiments: the assertion that n=2,500 is the minimum for stable rankings lacks a concrete definition of stability (e.g., how rank order or score variance was measured across the 500-to-10,000 range) and does not address whether the threshold differs by language or metric.
  Authors: We acknowledge the need for an explicit definition. We will revise the scaling section to define stability as the smallest n where (i) Kendall’s tau rank correlation between consecutive sample sizes exceeds 0.95 and (ii) top-2 model ordering remains unchanged for all larger n. We will also report separate thresholds for each language and each metric (automatic and human). Revision: yes.
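The stability criterion proposed in the second response can be sketched in a few lines. This is one possible reading of the definition, not the authors' implementation: `kendall_tau` and `stability_threshold` are hypothetical helpers, the rankings are invented, and condition (i) is assumed to apply to every pair of consecutive sample sizes from n onward.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau for two tie-free orderings of the same items (best first)."""
    pos_b = {name: i for i, name in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):  # x precedes y in order_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def stability_threshold(rankings_by_n, tau_min=0.95):
    """Smallest n from which consecutive rankings agree and the top 2 never change."""
    sizes = sorted(rankings_by_n)
    for i, n in enumerate(sizes[:-1]):  # the largest size has no successor to compare with
        later = sizes[i:]
        taus_ok = all(
            kendall_tau(rankings_by_n[a], rankings_by_n[b]) > tau_min
            for a, b in zip(later, later[1:])
        )
        top2_ok = all(rankings_by_n[m][:2] == rankings_by_n[n][:2] for m in later)
        if taus_ok and top2_ok:
            return n
    return None

# Invented rankings of four hypothetical systems at each sample size.
invented = {
    500:   ["B", "A", "C", "D"],
    1000:  ["A", "B", "D", "C"],
    2500:  ["A", "B", "C", "D"],
    5000:  ["A", "B", "C", "D"],
    10000: ["A", "B", "C", "D"],
}
print(stability_threshold(invented))  # 2500
```

With four systems there are only six pairs, so tau moves in steps of 1/3 and a cutoff of 0.95 amounts to requiring identical rankings. The cutoff only acts as a graded measure when more systems are compared.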
Circularity Check
Empirical benchmarking study with no derivations or self-referential claims
Full rationale
The paper is a straightforward empirical evaluation of LLMs on English-to-Hausa and English-to-Fongbe translation, reporting human scores, automatic metric results, rank correlations, and sample-size stability thresholds from direct experiments. No equations, fitted parameters, uniqueness theorems, or ansatzes appear; all claims rest on observed data rather than any derivation chain that reduces to its own inputs by construction. Self-citations are absent from the provided text, and the central findings (language differences, metric correlations, n=2,500 threshold) are presented as experimental outcomes without reduction to prior self-referential definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: native-speaker human judgments constitute reliable ground truth for translation quality.