Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages
Pith reviewed 2026-05-08 17:23 UTC · model grok-4.3
The pith
No LLM reaches both high accuracy and high consistency in zero-shot translations of 43 Ghanaian languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nsanku evaluates 19 LLMs on English-Ghanaian language pairs and finds that while gemini-2.5-flash leads with an average score of 26.88, no model or language achieves both high performance and high consistency simultaneously, indicating LLMs are not yet reliably usable for Ghanaian language translation at scale.
What carries the argument
The performance-consistency quadrant analysis that places each model-language pair into one of four categories based on average BLEU/chrF score and cross-language consistency.
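The quadrant placement described above can be sketched in a few lines. The paper's exact cutoffs are not reproduced in this review, so median splits over all model-language pairs are assumed for illustration; only the "Leaders" label comes from the source, and the other three quadrant names are hypothetical.

```python
from statistics import median

def classify(pairs):
    """pairs: {name: (avg_score, consistency)}, higher is better for both.

    Returns {name: quadrant_label} using median splits as the (assumed)
    high/low thresholds on each axis.
    """
    score_cut = median(s for s, _ in pairs.values())
    cons_cut = median(c for _, c in pairs.values())
    labels = {}
    for name, (score, cons) in pairs.items():
        if score >= score_cut and cons >= cons_cut:
            labels[name] = "Leaders"        # high performance, high consistency (paper's term)
        elif score >= score_cut:
            labels[name] = "Inconsistent"   # hypothetical label
        elif cons >= cons_cut:
            labels[name] = "Consistent-low" # hypothetical label
        else:
            labels[name] = "Laggards"       # hypothetical label
    return labels
```

The paper's central negative result, in these terms, is that no model-language pair lands in the "Leaders" cell.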
If this is right
- The benchmark can be extended by the community to track future model improvements on Ghanaian languages.
- Proprietary models outperform open-weight ones on average, with gemini-2.5-flash highest overall and kimi-k2-instruct-0905 leading among open models.
- Language variation matters: Siwu reaches the highest per-language average while Nkonya scores lowest.
- Current systems cannot yet support reliable scaling of Ghanaian language translation applications.
Where Pith is reading between the lines
- Consistency shortfalls observed here likely appear in other low-resource African languages as well.
- Testing on non-religious or spoken-language text could expose different model limitations.
- Fine-tuning approaches on local Ghanaian corpora might raise both performance and consistency together.
- The public benchmark allows direct comparison of new models against the reported baselines.
Load-bearing premise
That 300 Bible-derived sentence pairs per language form a representative sample for general translation quality and that BLEU and chrF scores reflect real translation utility without human validation.
What would settle it
A human evaluation study on the same sentences or on non-Bible everyday text that finds high quality and consistency even where automatic scores remain low.
Original abstract
Large language models (LLMs) have demonstrated impressive multilingual capabilities for well-resourced languages, yet their performance on low-resource African languages remains poorly understood and largely unevaluated. This paper presents Nsanku, a systematic benchmark that evaluates the zero-shot machine translation performance of 19 open-weight and proprietary LLMs across 43 Ghanaian languages paired with English. Evaluation sentences were sourced from the YouVersion Bible platform, providing 300 sentence pairs per language. Two complementary automatic metrics are employed: Bilingual Evaluation Understudy (BLEU) and Character n-gram F-Score (chrF), alongside an average accuracy score and a cross-language consistency dimension. Nsanku represents the most comprehensive LLM translation evaluation for Ghanaian languages conducted to date. Results show that gemini-2.5-flash achieves the highest overall average score of 26.88 (BLEU: 24.60, chrF: 29.16), followed by claude-sonnet-4-5 at 24.87 (BLEU: 22.46, chrF: 27.28) and gpt-4.1 at 23.20 (BLEU: 21.15, chrF: 25.24). Among open-weight models, kimi-k2-instruct-0905 leads at an average score of 20.87. A critical finding from the consistency analysis is that no model and no language reached the Leaders quadrant of high performance and high consistency simultaneously, indicating that current LLMs are not yet reliably usable for Ghanaian language translation at scale. Siwu achieved the highest per-language average score at 25.73 while Nkonya scored lowest at 11.65. Nsanku establishes a publicly available, community-extensible evaluation infrastructure for African language NLP research.
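As a concrete reference for the two automatic metrics the abstract names, here is a minimal sentence-level chrF in pure Python (character n-grams up to n = 6, F-beta with beta = 2). This is a simplified sketch, not the benchmark's implementation; the standard sacrebleu version differs in detail (whitespace handling, and the word n-grams that distinguish chrF++).

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, with whitespace stripped first."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Character-level matching is why chrF is often preferred over BLEU for morphologically rich, low-resource languages: partial word matches still earn credit.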
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Nsanku, a benchmark for zero-shot machine translation performance of 19 LLMs (open-weight and proprietary) across 43 Ghanaian languages paired with English. It uses 300 sentence pairs per language sourced from the YouVersion Bible, evaluates with BLEU and chrF plus an average accuracy score and cross-language consistency dimension, reports gemini-2.5-flash as highest overall (avg 26.88), and finds that no model-language pair reaches both high performance and high consistency (Leaders quadrant), concluding that current LLMs are not yet reliably usable for Ghanaian language translation at scale. The benchmark and data are released publicly for community extension.
Significance. If the empirical measurements hold, this constitutes the most comprehensive LLM translation evaluation for Ghanaian languages to date and supplies a publicly available, extensible infrastructure for African-language NLP research. The direct reporting of per-language and per-model scores (e.g., Siwu at 25.73, Nkonya at 11.65) and the quadrant analysis provide concrete, falsifiable baselines that future work can build upon.
major comments (3)
- [§3] §3 (Data and Evaluation Setup): The central claim that 'no model and no language reached the Leaders quadrant' and therefore 'current LLMs are not yet reliably usable for Ghanaian language translation at scale' rests entirely on 300 Bible-derived sentence pairs per language. Bible text is stylistically narrow (formal, repetitive, archaic), so the performance-consistency patterns may not generalize to conversational or technical domains; this domain restriction is load-bearing for the generalization in the abstract and conclusion.
- [Methods] Methods section (consistency computation): The abstract and results invoke an 'average accuracy score' and 'cross-language consistency dimension' to define the quadrants, yet no explicit formula, aggregation method, or threshold for 'high' vs. 'low' is provided. Without this, it is impossible to verify whether the Leaders-quadrant finding is robust or sensitive to the precise definition of consistency.
- [Results] Results and Discussion: No human adequacy or fluency ratings, nor any correlation analysis between BLEU/chrF and human judgments, are reported. In low-resource settings where automatic metrics are known to be unreliable, the absence of human validation weakens the claim that the observed scores reflect actual translation utility.
minor comments (2)
- [Abstract] Abstract: The phrase 'average score' is used without immediate definition; readers must reach the methods to learn it combines BLEU and chrF.
- [Table 1] Table 1 or equivalent (model list): The 19 models and 43 languages should be enumerated with exact names and language codes for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the changes we will make to the manuscript.
Point-by-point responses
- Referee: [§3] §3 (Data and Evaluation Setup): The central claim that 'no model and no language reached the Leaders quadrant' and therefore 'current LLMs are not yet reliably usable for Ghanaian language translation at scale' rests entirely on 300 Bible-derived sentence pairs per language. Bible text is stylistically narrow (formal, repetitive, archaic), so the performance-consistency patterns may not generalize to conversational or technical domains; this domain restriction is load-bearing for the generalization in the abstract and conclusion.
Authors: We acknowledge that Bible-derived text is stylistically narrow and that this choice limits direct generalization to other domains. The YouVersion Bible was selected because it supplies the only large-scale, high-quality, sentence-aligned parallel data available across all 43 languages. In the revision we will expand §3 to state this limitation explicitly, add a dedicated paragraph on domain specificity, and qualify the abstract and conclusion to indicate that the 'not yet reliably usable' claim applies to the evaluated domain while calling for future work on conversational and technical text. revision: partial
- Referee: [Methods] Methods section (consistency computation): The abstract and results invoke an 'average accuracy score' and 'cross-language consistency dimension' to define the quadrants, yet no explicit formula, aggregation method, or threshold for 'high' vs. 'low' is provided. Without this, it is impossible to verify whether the Leaders-quadrant finding is robust or sensitive to the precise definition of consistency.
Authors: We apologize for the omission of explicit definitions. The average accuracy score is the arithmetic mean of BLEU and chrF (both scaled 0–100). Cross-language consistency is the coefficient of variation (standard deviation divided by mean) of the 43 per-language average scores; 'high consistency' is defined as values below the median across all evaluated model–language pairs. We will insert the full formulas, aggregation procedure, and exact thresholds into the Methods section so that the quadrant classification can be reproduced and tested for sensitivity. revision: yes
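The definitions given in this response can be written out directly. The sketch below is illustrative, assuming only what the response states: scores on a 0-100 scale, the average as an arithmetic mean, and consistency as a coefficient of variation over per-language averages.

```python
from statistics import mean, pstdev

def average_score(bleu, chrf):
    """Average accuracy score: arithmetic mean of BLEU and chrF (both 0-100)."""
    return (bleu + chrf) / 2

def cross_language_cv(per_language_averages):
    """Cross-language consistency: coefficient of variation (stdev / mean)
    of the per-language average scores. Lower CV means more consistent;
    per the authors, 'high consistency' is a CV below the median."""
    m = mean(per_language_averages)
    return pstdev(per_language_averages) / m

# Sanity check against the abstract's gemini-2.5-flash figures:
# average_score(24.60, 29.16) -> 26.88
```

The arithmetic-mean definition reproduces the abstract's headline numbers exactly (e.g., BLEU 24.60 and chrF 29.16 average to 26.88), which supports the rebuttal's stated formula.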
- Referee: [Results] Results and Discussion: No human adequacy or fluency ratings, nor any correlation analysis between BLEU/chrF and human judgments, are reported. In low-resource settings where automatic metrics are known to be unreliable, the absence of human validation weakens the claim that the observed scores reflect actual translation utility.
Authors: We agree that human validation would strengthen the utility claims. Conducting native-speaker adequacy and fluency ratings across 43 languages was beyond the resource scope of this benchmark. In the revision we will add a Limitations subsection that (a) cites existing correlation studies between BLEU/chrF and human judgments for African languages and (b) explicitly notes the absence of human ratings in the present work. We cannot add new human data at this stage but will outline plans for such evaluation in future extensions of Nsanku. revision: partial
Circularity Check
Direct empirical measurements with no derivations or self-referential reductions
full rationale
The paper evaluates LLMs via zero-shot translation on 300 Bible sentence pairs per language, computing standard automatic metrics (BLEU, chrF) plus derived averages and consistency scores. All reported results, including the Leaders quadrant finding, are direct outputs of these measurements on the test data with no fitted parameters, equations, or derivations that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to support the central claims; the work is self-contained as an empirical benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sentences from the YouVersion Bible platform provide a representative sample for evaluating machine translation performance in Ghanaian languages.
Reference graph
Works this paper leans on
- [1] Excerpt (Introduction): "Ghana is home to over eighty (80) documented languages spanning several major linguistic families, including Kwa, Gur, Grusi, and Mande [1, 2]. These languages are spoken daily by millions of Ghanaians in markets, homes, churches, schools, and courts, yet they remain almost entirely absent from the tools and technologies that define modern na..."
- [2] Excerpt (Literature Review, 2.1 Evaluation Metrics for Machine Translation): "BLEU, proposed by Papineni et al. [15], measures n-gram precision between candidate and reference translations with a brevity penalty and became the de facto standard for MT evaluation due to its speed and reproducibility. Its limitations are equally well documented: sensitivity to tokenisa..."
- [3] Excerpt (Limitations): "The evaluation corpus is drawn exclusively from YouVersion Bible translations [10]. As demonstrated by Mensah et al. [26] in the context of Akan ASR, models evaluated on scriptural text show marked accuracy degradation when applied to conversational, journalistic, or parliamentary domains. The BLEU and chrF scores reported in this paper reflec..."
- [4] "GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages." Excerpt (Conclusion): "This paper has presented Nsanku, the most comprehensive systematic evaluation of zero-shot LLM translation performance for Ghanaian languages conducted to date. By evaluating nineteen (19) LLMs across forty-three (43) Ghanaian language-English pairs using a four-stage reproducible pipeline and multiple complementary metrics, Nsanku establishes ..."
- [5] P. Azunre et al., "English-Twi Parallel Corpus for Machine Translation," Apr. 2021. https://doi.org/10.48550/arXiv.2103.15625
- [6] D. I. Adelani et al., "MasakhaNEWS: News Topic Classification for African languages," in Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwa...
- [7] N. Robinson, P. Ogayo, D. R. Mortensen, and G. Neubig, "ChatGPT MT: Competitive for High- (but Not Low-) Resource Languages," in Proceedings of the Eighth Conference on Machine Translation, P. Koehn, B. Haddow, T. Kocmi, and C. Monz, Eds., Singapore: Association for Computational Linguistics, Dec. 2023, pp. 392–418. doi: 10.18653/v1/2023.wmt-1.40
- [8] M. Nurminen and M. Koponen, "Machine translation and fair access to information," Translation Spaces, vol. 9, no. 1, pp. 150–169, Aug. 2020, doi: 10.1075/ts.00025.nur
- [9] Y. Ye et al., "LLMs4All: A Review of Large Language Models Across Academic Disciplines," Nov. 2025. [Online]. Available: http://arxiv.org/abs/2509.19580
- [10] D. Ataman, A. Birch, N. Habash, M. Federico, P. Koehn, and K. Cho, "Machine Translation in the Era of Large Language Models: A Survey of Historical and Emerging Problems," Information, vol. 16, no. 9, p. 723, Aug. 2025, doi: 10.3390/info16090723
- [11] P. S. Herrera-Espejel and S. Rach, "The Use of Machine Translation for Outreach and Health Communication in Epidemiology and Public Health: Scoping Review," JMIR Public Health Surveill., vol. 9, p. e50814, Nov. 2023, doi: 10.2196/50814
- [12] K. N. Dew, A. M. Turner, Y. K. Choi, A. Bosold, and K. Kirchhoff, "Development of machine translation technology for assisting health communication: A systematic review," J. Biomed. Inform., vol. 85, pp. 56–67, Sep. 2018, doi: 10.1016/j.jbi.2018.07.018
- [13] "YouVersion," The Bible App. Life.Church. Accessed: May 02, 2026. [Online]. Available: https://www.youversion.com/
- [14] M. Popović, "chrF++: words helping character n-grams," in Proceedings of the Second Conference on Machine Translation, O. Bojar, C. Buck, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, and J. Kreutzer, Eds., Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 612–618. doi: 10.18653/v1/W17-4770
- [15] S. Kumar, P. Jyothi, and P. Bhattacharyya, "Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics," Feb. 2026. [Online]. Available: http://arxiv.org/abs/2602.17425
- [16] N. R. Robinson, P. Ogayo, D. R. Mortensen, and G. Neubig, "ChatGPT MT: Competitive for High- (but not Low-) Resource Languages." [Online]. Available: https://arxiv.org/abs/2309.07423
- [17] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson, "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization," Sep. 2020. [Online]. Available: http://arxiv.org/abs/2003.11080
- [18] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02), Morristown, NJ, USA: Association for Computational Linguistics, 2001, p. 311. doi: 10.3115/1073083.1073135
- [19] M. Freitag et al., "Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust," in Proceedings of the Seventh Conference on Machine Translation (WMT), Stroudsburg, PA, USA: Association for Computational Linguistics, 2022, pp. 46–68. doi: 10.18653/v1/2022.wmt-1.2
- [20] A. Vázquez and M. I. Torres, "Prompt-based Language Generation for Complex Conversational Coaching Tasks across Languages." [Online]. Available: https://hf.rst.im/pere/norwegian-gpt2-social
- [21] S. Shalawati, A. H. Nasution, W. Monika, T. Derin, A. Onan, and Y. Murakami, "Beyond BLEU: GPT-5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation," Digital, vol. 6, no. 1, p. 8, Jan. 2026, doi: 10.3390/digital6010008
- [22] A. Hendy et al., "How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation," Feb. 2023. [Online]. Available: http://arxiv.org/abs/2302.09210
- [23] W. Zhu et al., "Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis," in Findings of the Association for Computational Linguistics: NAACL 2024, Stroudsburg, PA, USA: Association for Computational Linguistics, 2024, pp. 2765–2781. doi: 10.18653/v1/2024.findings-naacl.176
- [24] A. Conneau et al., "Unsupervised Cross-lingual Representation Learning at Scale," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA: Association for Computational Linguistics, 2020, pp. 8440–8451. doi: 10.18653/v1/2020.acl-main.747
- [25] J. O. Alabi, K. Amponsah-Kaakyire, D. I. Adelani, and C. España-Bonet, "Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi," 2020. [Online]. Available: https://github.com/Niger-Volta-LTI/
- [26] W. Nekoto et al., "Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages," in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds., Online: Association for Computational Linguistics, Nov. 2020, pp. 2144–2160. doi: 10.18653/v1/2020.findings-emnlp.195
- [27] P. Azunre et al., "NLP for Ghanaian Languages," Apr. 2021. https://doi.org/10.48550/arXiv.2103.15475
- [28] E. Agyei, X. Zhang, S. Bannerman, A. B. Quaye, S. B. Yussi, and V. K. Agbesi, "Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG)," Discover Computing, vol. 27, no. 1, p. 17, Jul. 2024, doi: 10.1007/s10791-024-09451-8
- [29] M. A. Mensah et al., "Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability," 2025
- [30] S. E. Moore, N. A. Asare, and S. K. Kubiti, "Ndwom: A Multimodal Music Information Retrieval Dataset for Akan Musical Videos," Jan. 22, 2025. doi: 10.21203/rs.3.rs-5876078/v1
- [31] S. E. Moore, A. Asare, and S. K. Kubiti, "Ayoo: A Multilingual Multimodal Music Information Retrieval Dataset for Ghana Music Award Videos," Apr. 22, 2026. doi: 10.21203/rs.3.rs-9475824/v1
- [32] T. B. Brown et al., "Language Models are Few-Shot Learners," Jul. 2020. [Online]. Available: http://arxiv.org/abs/2005.14165
- [33] J. Novikova, C. Anderson, B. Blili-Hamelin, D. Rosati, and S. Majumdar, "Consistency in Language Models: Current Landscape, Challenges, and Future Directions," Jul. 2025. [Online]. Available: http://arxiv.org/abs/2505.00268
- [34] A. Agarwal, H. Meghwani, H. L. Patel, T. Sheng, S. Ravi, and D. Roth, "Aligning LLMs for Multilingual Consistency in Enterprise Applications."
- [35] D. Hendrycks et al., "Measuring Massive Multitask Language Understanding," Jan. 2021. [Online]. Available: http://arxiv.org/abs/2009.03300
- [36] N. Goyal et al., "The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation," Trans. Assoc. Comput. Linguist., vol. 10, pp. 522–538, May 2022, doi: 10.1162/tacl_a_00474