Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages
Pith reviewed 2026-05-08 17:23 UTC · model grok-4.3
The pith
No LLM reaches both high accuracy and high consistency in zero-shot translations of 43 Ghanaian languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nsanku evaluates 19 LLMs on English-Ghanaian language pairs and finds that while gemini-2.5-flash leads with an average score of 26.88, no model or language achieves both high performance and high consistency simultaneously, indicating LLMs are not yet reliably usable for Ghanaian language translation at scale.
What carries the argument
The performance-consistency quadrant analysis that places each model-language pair into one of four categories based on average BLEU/chrF score and cross-language consistency.
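The quadrant placement described above can be sketched in a few lines. The paper's exact cutoffs are not reproduced in this review, so median splits over all model-language pairs are assumed for illustration; only the "Leaders" label comes from the source, and the other three quadrant names are hypothetical.

```python
from statistics import median

def classify(pairs):
    """pairs: {name: (avg_score, consistency)}, higher is better for both.

    Returns {name: quadrant_label} using median splits as the (assumed)
    high/low thresholds on each axis.
    """
    score_cut = median(s for s, _ in pairs.values())
    cons_cut = median(c for _, c in pairs.values())
    labels = {}
    for name, (score, cons) in pairs.items():
        if score >= score_cut and cons >= cons_cut:
            labels[name] = "Leaders"        # high performance, high consistency (paper's term)
        elif score >= score_cut:
            labels[name] = "Inconsistent"   # hypothetical label
        elif cons >= cons_cut:
            labels[name] = "Consistent-low" # hypothetical label
        else:
            labels[name] = "Laggards"       # hypothetical label
    return labels
```

The paper's central negative result, in these terms, is that no model-language pair lands in the "Leaders" cell.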
If this is right
- The benchmark can be extended by the community to track future model improvements on Ghanaian languages.
- Proprietary models outperform open-weight ones on average, with gemini-2.5-flash highest overall and kimi-k2-instruct-0905 leading among open models.
- Language variation matters: Siwu reaches the highest per-language average while Nkonya scores lowest.
- Current systems cannot yet support reliable scaling of Ghanaian language translation applications.
Where Pith is reading between the lines
- Consistency shortfalls observed here likely appear in other low-resource African languages as well.
- Testing on non-religious or spoken-language text could expose different model limitations.
- Fine-tuning approaches on local Ghanaian corpora might raise both performance and consistency together.
- The public benchmark allows direct comparison of new models against the reported baselines.
Load-bearing premise
That 300 Bible-derived sentence pairs per language form a representative sample for general translation quality and that BLEU and chrF scores reflect real translation utility without human validation.
What would settle it
A human evaluation study on the same sentences or on non-Bible everyday text that finds high quality and consistency even where automatic scores remain low.
Original abstract
Large language models (LLMs) have demonstrated impressive multilingual capabilities for well-resourced languages, yet their performance on low-resource African languages remains poorly understood and largely unevaluated. This paper presents Nsanku, a systematic benchmark that evaluates the zero-shot machine translation performance of 19 open-weight and proprietary LLMs across 43 Ghanaian languages paired with English. Evaluation sentences were sourced from the YouVersion Bible platform, providing 300 sentence pairs per language. Two complementary automatic metrics are employed: Bilingual Evaluation Understudy (BLEU) and Character n-gram F-Score (chrF), alongside an average accuracy score and a cross-language consistency dimension. Nsanku represents the most comprehensive LLM translation evaluation for Ghanaian languages conducted to date. Results show that gemini-2.5-flash achieves the highest overall average score of 26.88 (BLEU: 24.60, chrF: 29.16), followed by claude-sonnet-4-5 at 24.87 (BLEU: 22.46, chrF: 27.28) and gpt-4.1 at 23.20 (BLEU: 21.15, chrF: 25.24). Among open-weight models, kimi-k2-instruct-0905 leads at an average score of 20.87. A critical finding from the consistency analysis is that no model and no language reached the Leaders quadrant of high performance and high consistency simultaneously, indicating that current LLMs are not yet reliably usable for Ghanaian language translation at scale. Siwu achieved the highest per-language average score at 25.73 while Nkonya scored lowest at 11.65. Nsanku establishes a publicly available, community-extensible evaluation infrastructure for African language NLP research.
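As a concrete reference for the two automatic metrics the abstract names, here is a minimal sentence-level chrF in pure Python (character n-grams up to n = 6, F-beta with beta = 2). This is a simplified sketch, not the benchmark's implementation; the standard sacrebleu version differs in detail (whitespace handling, and the word n-grams that distinguish chrF++).

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, with whitespace stripped first."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Character-level matching is why chrF is often preferred over BLEU for morphologically rich, low-resource languages: partial word matches still earn credit.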
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Nsanku, a benchmark for zero-shot machine translation performance of 19 LLMs (open-weight and proprietary) across 43 Ghanaian languages paired with English. It uses 300 sentence pairs per language sourced from the YouVersion Bible, evaluates with BLEU and chrF plus an average accuracy score and cross-language consistency dimension, reports gemini-2.5-flash as highest overall (avg 26.88), and finds that no model-language pair reaches both high performance and high consistency (Leaders quadrant), concluding that current LLMs are not yet reliably usable for Ghanaian language translation at scale. The benchmark and data are released publicly for community extension.
Significance. If the empirical measurements hold, this constitutes the most comprehensive LLM translation evaluation for Ghanaian languages to date and supplies a publicly available, extensible infrastructure for African-language NLP research. The direct reporting of per-language and per-model scores (e.g., Siwu at 25.73, Nkonya at 11.65) and the quadrant analysis provide concrete, falsifiable baselines that future work can build upon.
major comments (3)
- [§3] §3 (Data and Evaluation Setup): The central claim that 'no model and no language reached the Leaders quadrant' and therefore 'current LLMs are not yet reliably usable for Ghanaian language translation at scale' rests entirely on 300 Bible-derived sentence pairs per language. Bible text is stylistically narrow (formal, repetitive, archaic), so the performance-consistency patterns may not generalize to conversational or technical domains; this domain restriction is load-bearing for the generalization in the abstract and conclusion.
- [Methods] Methods section (consistency computation): The abstract and results invoke an 'average accuracy score' and 'cross-language consistency dimension' to define the quadrants, yet no explicit formula, aggregation method, or threshold for 'high' vs. 'low' is provided. Without this, it is impossible to verify whether the Leaders-quadrant finding is robust or sensitive to the precise definition of consistency.
- [Results] Results and Discussion: No human adequacy or fluency ratings, nor any correlation analysis between BLEU/chrF and human judgments, are reported. In low-resource settings where automatic metrics are known to be unreliable, the absence of human validation weakens the claim that the observed scores reflect actual translation utility.
minor comments (2)
- [Abstract] Abstract: The phrase 'average score' is used without immediate definition; readers must reach the methods to learn it combines BLEU and chrF.
- [Table 1] Table 1 or equivalent (model list): The 19 models and 43 languages should be enumerated with exact names and language codes for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the changes we will make to the manuscript.
Point-by-point responses
- Referee: [§3] §3 (Data and Evaluation Setup): The central claim that 'no model and no language reached the Leaders quadrant' and therefore 'current LLMs are not yet reliably usable for Ghanaian language translation at scale' rests entirely on 300 Bible-derived sentence pairs per language. Bible text is stylistically narrow (formal, repetitive, archaic), so the performance-consistency patterns may not generalize to conversational or technical domains; this domain restriction is load-bearing for the generalization in the abstract and conclusion.
Authors: We acknowledge that Bible-derived text is stylistically narrow and that this choice limits direct generalization to other domains. The YouVersion Bible was selected because it supplies the only large-scale, high-quality, sentence-aligned parallel data available across all 43 languages. In the revision we will expand §3 to state this limitation explicitly, add a dedicated paragraph on domain specificity, and qualify the abstract and conclusion to indicate that the 'not yet reliably usable' claim applies to the evaluated domain while calling for future work on conversational and technical text. revision: partial
- Referee: [Methods] Methods section (consistency computation): The abstract and results invoke an 'average accuracy score' and 'cross-language consistency dimension' to define the quadrants, yet no explicit formula, aggregation method, or threshold for 'high' vs. 'low' is provided. Without this, it is impossible to verify whether the Leaders-quadrant finding is robust or sensitive to the precise definition of consistency.
Authors: We apologize for the omission of explicit definitions. The average accuracy score is the arithmetic mean of BLEU and chrF (both scaled 0–100). Cross-language consistency is the coefficient of variation (standard deviation divided by mean) of the 43 per-language average scores; 'high consistency' is defined as values below the median across all evaluated model–language pairs. We will insert the full formulas, aggregation procedure, and exact thresholds into the Methods section so that the quadrant classification can be reproduced and tested for sensitivity. revision: yes
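The definitions given in this response can be written out directly. The sketch below is illustrative, assuming only what the response states: scores on a 0-100 scale, the average as an arithmetic mean, and consistency as a coefficient of variation over per-language averages.

```python
from statistics import mean, pstdev

def average_score(bleu, chrf):
    """Average accuracy score: arithmetic mean of BLEU and chrF (both 0-100)."""
    return (bleu + chrf) / 2

def cross_language_cv(per_language_averages):
    """Cross-language consistency: coefficient of variation (stdev / mean)
    of the per-language average scores. Lower CV means more consistent;
    per the authors, 'high consistency' is a CV below the median."""
    m = mean(per_language_averages)
    return pstdev(per_language_averages) / m

# Sanity check against the abstract's gemini-2.5-flash figures:
# average_score(24.60, 29.16) -> 26.88
```

The arithmetic-mean definition reproduces the abstract's headline numbers exactly (e.g., BLEU 24.60 and chrF 29.16 average to 26.88), which supports the rebuttal's stated formula.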
- Referee: [Results] Results and Discussion: No human adequacy or fluency ratings, nor any correlation analysis between BLEU/chrF and human judgments, are reported. In low-resource settings where automatic metrics are known to be unreliable, the absence of human validation weakens the claim that the observed scores reflect actual translation utility.
Authors: We agree that human validation would strengthen the utility claims. Conducting native-speaker adequacy and fluency ratings across 43 languages was beyond the resource scope of this benchmark. In the revision we will add a Limitations subsection that (a) cites existing correlation studies between BLEU/chrF and human judgments for African languages and (b) explicitly notes the absence of human ratings in the present work. We cannot add new human data at this stage but will outline plans for such evaluation in future extensions of Nsanku. revision: partial
Circularity Check
Direct empirical measurements with no derivations or self-referential reductions
full rationale
The paper evaluates LLMs via zero-shot translation on 300 Bible sentence pairs per language, computing standard automatic metrics (BLEU, chrF) plus derived averages and consistency scores. All reported results, including the Leaders quadrant finding, are direct outputs of these measurements on the test data with no fitted parameters, equations, or derivations that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to support the central claims; the work is self-contained as an empirical benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sentences from the YouVersion Bible platform provide a representative sample for evaluating machine translation performance in Ghanaian languages.
Reference graph
Works this paper leans on
- [1] Excerpt (Introduction): "Ghana is home to over eighty (80) documented languages spanning several major linguistic families, including Kwa, Gur, Grusi, and Mande [1, 2]. These languages are spoken daily by millions of Ghanaians in markets, homes, churches, schools, and courts, yet they remain almost entirely absent from the tools and technologies that define modern na..."
- [2] Excerpt (Literature Review, 2.1 Evaluation Metrics for Machine Translation): "BLEU, proposed by Papineni et al. [15], measures n-gram precision between candidate and reference translations with a brevity penalty and became the de facto standard for MT evaluation due to its speed and reproducibility. Its limitations are equally well documented: sensitivity to tokenisa..."
- [3] Excerpt (Limitations): "The evaluation corpus is drawn exclusively from YouVersion Bible translations [10]. As demonstrated by Mensah et al. [26] in the context of Akan ASR, models evaluated on scriptural text show marked accuracy degradation when applied to conversational, journalistic, or parliamentary domains. The BLEU and chrF scores reported in this paper reflec..."
- [4] "GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages." Excerpt (Conclusion): "This paper has presented Nsanku, the most comprehensive systematic evaluation of zero-shot LLM translation performance for Ghanaian languages conducted to date. By evaluating nineteen (19) LLMs across forty-three (43) Ghanaian language-English pairs using a four-stage reproducible pipeline and multiple complementary metrics, Nsanku establishes ..."
- [5] P. Azunre et al., "English-Twi Parallel Corpus for Machine Translation," Apr. 2021. https://doi.org/10.48550/arXiv.2103.15625
- [6] D. I. Adelani et al., "MasakhaNEWS: News Topic Classification for African languages," in Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwa...
- [7] N. Robinson, P. Ogayo, D. R. Mortensen, and G. Neubig, "ChatGPT MT: Competitive for High- (but Not Low-) Resource Languages," in Proceedings of the Eighth Conference on Machine Translation, P. Koehn, B. Haddow, T. Kocmi, and C. Monz, Eds., Singapore: Association for Computational Linguistics, Dec. 2023, pp. 392–418. doi: 10.18653/v1/2023.wmt-1.40
- [8] M. Nurminen and M. Koponen, "Machine translation and fair access to information," Translation Spaces, vol. 9, no. 1, pp. 150–169, Aug. 2020, doi: 10.1075/ts.00025.nur
- [9] Y. Ye et al., "LLMs4All: A Review of Large Language Models Across Academic Disciplines," Nov. 2025. [Online]. Available: http://arxiv.org/abs/2509.19580
- [10] D. Ataman, A. Birch, N. Habash, M. Federico, P. Koehn, and K. Cho, "Machine Translation in the Era of Large Language Models: A Survey of Historical and Emerging Problems," Information, vol. 16, no. 9, p. 723, Aug. 2025, doi: 10.3390/info16090723
- [11] P. S. Herrera-Espejel and S. Rach, "The Use of Machine Translation for Outreach and Health Communication in Epidemiology and Public Health: Scoping Review," JMIR Public Health Surveill., vol. 9, p. e50814, Nov. 2023, doi: 10.2196/50814
- [12] K. N. Dew, A. M. Turner, Y. K. Choi, A. Bosold, and K. Kirchhoff, "Development of machine translation technology for assisting health communication: A systematic review," J. Biomed. Inform., vol. 85, pp. 56–67, Sep. 2018, doi: 10.1016/j.jbi.2018.07.018
- [13] "YouVersion," The Bible App. Life.Church. Accessed: May 02, 2026. [Online]. Available: https://www.youversion.com/
- [14] M. Popović, "chrF++: words helping character n-grams," in Proceedings of the Second Conference on Machine Translation, O. Bojar, C. Buck, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, and J. Kreutzer, Eds., Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 612–618. doi: 10.18653/v1/W17-4770
- [15] S. Kumar, P. Jyothi, and P. Bhattacharyya, "Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics," Feb. 2026. [Online]. Available: http://arxiv.org/abs/2602.17425
- [16] N. R. Robinson, P. Ogayo, D. R. Mortensen, and G. Neubig, "ChatGPT MT: Competitive for High- (but not Low-) Resource Languages." [Online]. Available: https://arxiv.org/abs/2309.07423
- [17] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson, "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization," Sep. 2020. [Online]. Available: http://arxiv.org/abs/2003.11080
- [18] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02), Morristown, NJ, USA: Association for Computational Linguistics, 2001, p. 311. doi: 10.3115/1073083.1073135
- [19] M. Freitag et al., "Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust," in Proceedings of the Seventh Conference on Machine Translation (WMT), Stroudsburg, PA, USA: Association for Computational Linguistics, 2022, pp. 46–68. doi: 10.18653/v1/2022.wmt-1.2
- [20] A. Vázquez and M. I. Torres, "Prompt-based Language Generation for Complex Conversational Coaching Tasks across Languages." [Online]. Available: https://hf.rst.im/pere/norwegian-gpt2-social
- [21] S. Shalawati, A. H. Nasution, W. Monika, T. Derin, A. Onan, and Y. Murakami, "Beyond BLEU: GPT-5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation," Digital, vol. 6, no. 1, p. 8, Jan. 2026, doi: 10.3390/digital6010008
- [22] A. Hendy et al., "How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation," Feb. 2023. [Online]. Available: http://arxiv.org/abs/2302.09210
- [23] W. Zhu et al., "Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis," in Findings of the Association for Computational Linguistics: NAACL 2024, Stroudsburg, PA, USA: Association for Computational Linguistics, 2024, pp. 2765–2781. doi: 10.18653/v1/2024.findings-naacl.176
- [24] A. Conneau et al., "Unsupervised Cross-lingual Representation Learning at Scale," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA: Association for Computational Linguistics, 2020, pp. 8440–8451. doi: 10.18653/v1/2020.acl-main.747
- [25] J. O. Alabi, K. Amponsah-Kaakyire, D. I. Adelani, and C. España-Bonet, "Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi," 2020. [Online]. Available: https://github.com/Niger-Volta-LTI/
- [26] W. Nekoto et al., "Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages," in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds., Online: Association for Computational Linguistics, Nov. 2020, pp. 2144–2160. doi: 10.18653/v1/2020.findings-emnlp.195
- [27] P. Azunre et al., "NLP for Ghanaian Languages," Apr. 2021. https://doi.org/10.48550/arXiv.2103.15475
- [28] E. Agyei, X. Zhang, S. Bannerman, A. B. Quaye, S. B. Yussi, and V. K. Agbesi, "Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG)," Discover Computing, vol. 27, no. 1, p. 17, Jul. 2024, doi: 10.1007/s10791-024-09451-8
- [29] M. A. Mensah et al., "Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability," 2025
- [30] S. E. Moore, N. A. Asare, and S. K. Kubiti, "Ndwom: A Multimodal Music Information Retrieval Dataset for Akan Musical Videos," Jan. 22, 2025. doi: 10.21203/rs.3.rs-5876078/v1
- [31] S. E. Moore, A. Asare, and S. K. Kubiti, "Ayoo: A Multilingual Multimodal Music Information Retrieval Dataset for Ghana Music Award Videos," Apr. 22, 2026. doi: 10.21203/rs.3.rs-9475824/v1
- [32] T. B. Brown et al., "Language Models are Few-Shot Learners," Jul. 2020. [Online]. Available: http://arxiv.org/abs/2005.14165
- [33] J. Novikova, C. Anderson, B. Blili-Hamelin, D. Rosati, and S. Majumdar, "Consistency in Language Models: Current Landscape, Challenges, and Future Directions," Jul. 2025. [Online]. Available: http://arxiv.org/abs/2505.00268
- [34] A. Agarwal, H. Meghwani, H. L. Patel, T. Sheng, S. Ravi, and D. Roth, "Aligning LLMs for Multilingual Consistency in Enterprise Applications."
- [35] D. Hendrycks et al., "Measuring Massive Multitask Language Understanding," Jan. 2021. [Online]. Available: http://arxiv.org/abs/2009.03300
- [36] N. Goyal et al., "The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation," Trans. Assoc. Comput. Linguist., vol. 10, pp. 522–538, May 2022, doi: 10.1162/tacl_a_00474