pith. sign in

arxiv: 2607.02235 · v1 · pith:4MMFIMEBnew · submitted 2026-07-02 · 💻 cs.CL · cs.AI

Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages

Pith reviewed 2026-07-03 14:29 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM-as-a-Judgemultilingual evaluationlow-resource languagesnatural language generationevaluation metricsACL Anthologyhuman validation
0
0 comments X

The pith

LLM-as-a-Judge produces inconsistent outcomes and shows overtrust in multilingual and low-resource language settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys the growing use of LLM-as-a-Judge for evaluating natural language generation tasks beyond English. It identifies only 33 relevant papers from the ACL Anthology that address multilingual or low-resource languages out of 650 mentioning the method. Analysis of these papers shows inconsistent results across studies, frequent overtrust in the LLM outputs without enough human checks, and heavy dependence on a single judge model. The authors end by offering concrete recommendations for more reliable application in these settings.

Core claim

The central claim is that attempts to apply LLM-as-a-Judge to multilingual and low-resource languages reveal inconsistent evaluation outcomes, a pattern of overtrusting the LLM judgments without sufficient human validation, and near-universal reliance on one judge model per study, based on close reading of the 33 papers that focus on these settings.

What carries the argument

The in-depth analysis of the 33 ACL Anthology papers that apply LLM-as-a-Judge to multilingual or low-resource language tasks.

If this is right

  • Evaluation outcomes remain inconsistent when the same tasks are assessed in different studies.
  • Researchers tend to accept LLM judgments without enough human validation in non-English settings.
  • Most work depends on a single judge model rather than multiple or varied ones.
  • Specific recommendations are needed to improve reliability of LLM judges for multilingual and low-resource languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If these patterns persist, automated evaluations could slow reliable progress on multilingual NLP systems.
  • Teams working on low-resource languages may need to create language-specific validation steps rather than reuse English-centric judges.
  • The single-model reliance could be tested by re-running the same evaluations with alternate judge models to measure variance.

Load-bearing premise

The 33 papers found through the search accurately represent current practices without major gaps or selection bias.

What would settle it

A broader search or new studies that show consistent LLM judge results backed by human validation across many low-resource languages would contradict the observed patterns.

Figures

Figures reproduced from arXiv: 2607.02235 by A.Seza Do\u{g}ru\"oz, David Ifeoluwa Adelani, Jakob Prange, Senyu Li, Verena Blaschke, Xixian Liao.

Figure 1
Figure 1. Figure 1: Taxonomy of the 33 papers that use LLM-as-a-Judge in an [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
read the original abstract

LLM-as-a-Judge has become the dominant evaluation paradigm for many natural language generation tasks, due to shortcomings of conventional metrics and high correlations with human judgment, albeit mostly in English. There are now attempts to extend LLM-as-a-Judge to multilingual settings including low-resource languages. However, LLMs have limited proficiency in low-resource languages, and there is often no adequate human validation in these settings. To highlight the scope of the problem and current practices, we explore the use of LLM-as-a-Judge evaluators in ACL Anthology papers focusing on multilingual settings and low-resource languages across a diverse set of tasks. Out of 650 papers mentioning LLM-as-a-judge, only 33 of them focus on low-resource or multilingual settings. Our in-depth analysis of these papers indicates inconsistent evaluation outcomes, a tendency to overtrust LLM judgments in multilingual settings, and the widespread reliance on a single judge model per study. To help the NLP community further, we conclude with recommendations about how to use LLM-as-a-Judge in multilingual and low-resource settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper surveys LLM-as-a-Judge usage in multilingual and low-resource settings via an ACL Anthology search that identifies 650 papers mentioning the term and narrows to 33 focused on such settings. Analysis of these papers reveals inconsistent evaluation outcomes, overtrust in LLM judgments, and reliance on single judge models per study; the work concludes with recommendations for improved practices in these contexts.

Significance. If the 33-paper sample is representative, the work is significant for documenting practical limitations of LLM judges outside English, where model proficiency is low and human validation is often absent. The recommendations could help standardize evaluation in multilingual NLP if supported by transparent methods.

major comments (2)
  1. [Abstract and paper selection description] Abstract and paper selection description: The filtering from 650 to 33 papers is presented without the explicit search query, date bounds, inclusion/exclusion criteria, or synonym checks (e.g., 'LLM evaluator'). This is load-bearing for the central claim that the patterns reflect widespread practices rather than selection bias.
  2. [Analysis section] Analysis section: The report of 'inconsistent evaluation outcomes' and 'overtrust' lacks concrete measures, examples, or tables quantifying disagreement rates, correlation drops, or validation gaps across the 33 papers, undermining assessment of how severe or general these issues are.
minor comments (1)
  1. [Abstract] The abstract could specify the distribution of the 33 papers across tasks or languages to strengthen the 'diverse set of tasks' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to improve transparency and rigor in the manuscript.

read point-by-point responses
  1. Referee: [Abstract and paper selection description] Abstract and paper selection description: The filtering from 650 to 33 papers is presented without the explicit search query, date bounds, inclusion/exclusion criteria, or synonym checks (e.g., 'LLM evaluator'). This is load-bearing for the central claim that the patterns reflect widespread practices rather than selection bias.

    Authors: We agree that the paper selection methodology requires greater transparency to support the claim of reflecting widespread practices. In the revised manuscript, we will add a dedicated 'Paper Selection' subsection detailing the exact search query (primarily 'LLM-as-a-Judge' and close variants), the ACL Anthology search date range, and full inclusion/exclusion criteria (e.g., papers must explicitly use LLM-as-a-Judge for evaluation in multilingual or low-resource settings). We did conduct synonym checks including 'LLM evaluator' and 'LLM judge', which largely overlapped with the primary term and did not yield additional relevant papers beyond the 650; this will be explicitly documented to address selection bias concerns. revision: yes

  2. Referee: [Analysis section] Analysis section: The report of 'inconsistent evaluation outcomes' and 'overtrust' lacks concrete measures, examples, or tables quantifying disagreement rates, correlation drops, or validation gaps across the 33 papers, undermining assessment of how severe or general these issues are.

    Authors: We acknowledge that the analysis section would be strengthened by more quantitative and illustrative support. While our review of the 33 papers identified the patterns through systematic coding (e.g., presence/absence of human validation, number of judge models used, and reported inconsistencies), we will revise this section to include a summary table across all papers quantifying key metrics such as percentage with human validation, reported correlations with human judgments (where available), and specific examples of outcome inconsistencies or overtrust. This will make the severity and generality of the issues more assessable. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive literature survey without derivations or predictions

full rationale

This is a survey paper that searches ACL Anthology for papers mentioning LLM-as-a-Judge, filters to 33 relevant ones, and summarizes observed practices. It contains no equations, no first-principles derivations, no fitted parameters presented as predictions, and no load-bearing self-citations. The central claims rest on direct analysis of the identified external papers rather than any reduction to the authors' own inputs or prior work. Methodological concerns about search completeness are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an observational survey paper with no mathematical content; no free parameters, axioms, or invented entities are introduced or required.

pith-pipeline@v0.9.1-grok · 5741 in / 1034 out tokens · 24638 ms · 2026-07-03T14:29:43.953169+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

151 extracted references · 87 canonical work pages · 8 internal anchors

  1. [1]

    arXiv , volume=

    Socially responsible data for large multilingual language models , author=. arXiv , volume=. 2024 , url=

  2. [4]

    Findings of the WMT 25 Shared Task on Automated Translation Evaluation Systems: Linguistic Diversity is Challenging and References Still Help

    Lavie, Alon and Hanneman, Greg and Agrawal, Sweta and Kanojia, Diptesh and Lo, Chi-Kiu and Zouhar, Vil \'e m and Blain, Frederic and Zerva, Chrysoula and Avramidis, Eleftherios and Deoghare, Sourabh and Sindhujan, Archchana and Wang, Jiayi and Adelani, David Ifeoluwa and Thompson, Brian and Kocmi, Tom and Freitag, Markus and Deutsch, Daniel. Findings of t...

  3. [5]

    arXiv , url=

    Sumuk Shashidhar and Clémentine Fourrier and Alina Lozovskia and Thomas Wolf and Gokhan Tur and Dilek Hakkani-Tür , year=. arXiv , url=

  4. [6]

    Ethical Reasoning and Moral Value Alignment of LLM s Depend on the Language We Prompt Them in

    Agarwal, Utkarsh and Tanmay, Kumar and Khandelwal, Aditi and Choudhury, Monojit. Ethical Reasoning and Moral Value Alignment of LLM s Depend on the Language We Prompt Them in. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

  5. [9]

    arXiv , volume=

    Son, Guijin and Yoon, Dongkeun and Suk, Juyoung and Aula-Blasco, Javier and Aslan, Mano and Kim, Vu Trong and Islam, Shayekh Bin and Prats-Cristi. arXiv , volume=. 2024 , url=

  6. [10]

    Neither Valid nor Reliable?

    Khaoula Chehbouni and Mohammed Haddou and Jackie CK Cheung and Golnoosh Farnadi , booktitle=. Neither Valid nor Reliable?. 2025 , url=

  7. [11]

    2024 , url=

    Haitao Li and Qian Dong and Junjie Chen and Huixue Su and Yujia Zhou and Qingyao Ai and Ziyi Ye and Yiqun Liu , journal=. 2024 , url=

  8. [20]

    Bowman and Shi Feng , booktitle=

    Arjun Panickssery and Samuel R. Bowman and Shi Feng , booktitle=. 2024 , url=

  9. [24]

    Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on T urkish

    Umutlu, Elif Ecem and Cengiz, Ayse Aysu and Sever, Ahmet Kaan and Erdem, Seyma and Aytan, Burak and Tufan, Busra and Topraksoy, Abdullah and Dar c , Esra and Toraman, Cagri. Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on T urkish. Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM ). 2025

  10. [34]

    Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on N epali

    Duwal, Sharad and Prasai, Suraj and Manandhar, Suresh. Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on N epali. Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025). 2025

  11. [36]

    How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLM s?

    Doostmohammadi, Ehsan and Holmstr. How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLM s?. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.367

  12. [37]

    LLM -based NLG Evaluation: Current Status and Challenges

    Gao, Mingqi and Hu, Xinyu and Yin, Xunjian and Ruan, Jie and Pu, Xiao and Wan, Xiaojun. LLM -based NLG Evaluation: Current Status and Challenges. Computational Linguistics. 2025. doi:10.1162/coli_a_00561

  13. [40]

    Hyderabadi Pearls at Multilingual Counterspeech Generation : HALT : Hate Speech Alleviation using Large Language Models and Transformers

    Farhan, Md Shariq. Hyderabadi Pearls at Multilingual Counterspeech Generation : HALT : Hate Speech Alleviation using Large Language Models and Transformers. Proceedings of the First Workshop on Multilingual Counterspeech Generation. 2025

  14. [41]

    CODEOFCONDUCT at Multilingual Counterspeech Generation: A Context-Aware Model for Robust Counterspeech Generation in Low-Resource Languages

    Bennie, Michael and Xiao, Bushi and Liu, Chryseis Xinyi and Zhang, Demi and Meng, Jian and Tripp, Alayo. CODEOFCONDUCT at Multilingual Counterspeech Generation: A Context-Aware Model for Robust Counterspeech Generation in Low-Resource Languages. Proceedings of the First Workshop on Multilingual Counterspeech Generation. 2025

  15. [42]

    NLP @ IIMAS - CLTL at Multilingual Counterspeech Generation: Combating Hate Speech Using Contextualized Knowledge Graph Representations and LLM s

    Preciado M \'a rquez, David Salvador and G \'o mez Adorno, Helena and Markov, Ilia and Baez Santamaria, Selene. NLP @ IIMAS - CLTL at Multilingual Counterspeech Generation: Combating Hate Speech Using Contextualized Knowledge Graph Representations and LLM s. Proceedings of the First Workshop on Multilingual Counterspeech Generation. 2025

  16. [46]

    MQM - APE : Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators

    Lu, Qingyu and Ding, Liang and Zhang, Kanjian and Zhang, Jinxia and Tao, Dacheng. MQM - APE : Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators. Proceedings of the 31st International Conference on Computational Linguistics. 2025

  17. [47]

    How Human-Like Are Word Associations in Generative Models? An Experiment in S lovene

    Brglez, Mojca and Vintar, S pela and Z agar, Ale s. How Human-Like Are Word Associations in Generative Models? An Experiment in S lovene. Proceedings of the Workshop on Cognitive Aspects of the Lexicon @ LREC-COLING 2024. 2024

  18. [53]

    D\'ej\`a Vu: Multilingual

    Julia Kreutzer and Eleftheria Briakou and Sweta Agrawal and Marzieh Fadaee and Tom Kocmi , booktitle=. D\'ej\`a Vu: Multilingual. 2025 , url=

  19. [54]

    2023 , issue_date =

    Gehrmann, Sebastian and Clark, Elizabeth and Sellam, Thibault , title =. 2023 , issue_date =. doi:10.1613/jair.1.13715 , journal =

  20. [55]

    2019 , booktitle =

    Speech Synthesis Evaluation -- State-of-the-Art Assessment and Suggestion for a Novel Research Program , author =. 2019 , booktitle =

  21. [57]

    L3i++ at G en AI Detection Task 1: Can Label-Supervised LL a MA Detect Machine-Generated Text?

    Tran, Hanh Thi Hong and Nam, Nguyen Tien. L3i++ at G en AI Detection Task 1: Can Label-Supervised LL a MA Detect Machine-Generated Text?. Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect). 2025

  22. [62]

    Identifying equivalents of specialized verbs in a bilingual comparable corpus of judgments: A frame-based methodology

    Pimentel, Janine. Identifying equivalents of specialized verbs in a bilingual comparable corpus of judgments: A frame-based methodology. Proceedings of the Eighth International Conference on Language Resources and Evaluation ( LREC '12). 2012

  23. [64]

    X -Guard: Multilingual Guard Agent for Content Moderation

    Upadhayay, Bibek and Behzadan, Vahid. X -Guard: Multilingual Guard Agent for Content Moderation. Proceedings of the The First Workshop on LLM Security (LLMSEC). 2025

  24. [70]

    arXiv , url=

    Instruction-Following Evaluation for Large Language Models , author=. arXiv , url=. 2023 , volume=

  25. [71]

    Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement

    Xu, Wenda and Zhu, Guanglei and Zhao, Xuandong and Pan, Liangming and Li, Lei and Wang, William. Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.826

  26. [72]

    Better Quality Pre-training Data and T5 Models for A frican Languages

    Oladipo, Akintunde and Adeyemi, Mofetoluwa and Ahia, Orevaoghene and Owodunni, Abraham Toluwalase and Ogundepo, Odunayo and Adelani, David Ifeoluwa and Lin, Jimmy. Better Quality Pre-training Data and T5 Models for A frican Languages. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.11

  27. [73]

    Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks

    Li, Qintong and Cui, Leyang and Kong, Lingpeng and Bi, Wei. Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks. Proceedings of the 31st International Conference on Computational Linguistics. 2025

  28. [74]

    Don ' t Trust C hat GPT when your Question is not in E nglish: A Study of Multilingual Abilities and Types of LLM s

    Zhang, Xiang and Li, Senyu and Hauer, Bradley and Shi, Ning and Kondrak, Grzegorz. Don ' t Trust C hat GPT when your Question is not in E nglish: A Study of Multilingual Abilities and Types of LLM s. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.491

  29. [75]

    arXiv , url=

    Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability , author=. arXiv , url=. 2025 , volume=

  30. [77]

    Hale and Adam Mahdi and Elizaveta Semenova and Bertie Vidgen , booktitle=

    Shreyansh Padarha and Scott A. Hale and Adam Mahdi and Elizaveta Semenova and Bertie Vidgen , booktitle=. Evaluating. 2025 , url=

  31. [78]

    Dietz, Laura and Zendel, Oleg and Bailey, Peter and Clarke, Charles L. A. and Cotterill, Ellese and Dalton, Jeff and Hasibi, Faegheh and Sanderson, Mark and Craswell, Nick , title =. Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) , pages =. 2025 , isbn =. doi:10.1145/3731120....

  32. [79]

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric and Zhang, Hao and Gonzalez, Joseph E and Stoica, Ion , booktitle =. Judging

  33. [80]

    A survey on

    Jiawei Gu and Xuhui Jiang and Zhichao Shi and Hexiang Tan and Xuehao Zhai and Chengjin Xu and Wei Li and Yinghan Shen and Shengjie Ma and Honghao Liu and Saizhuo Wang and Kun Zhang and Yuanzhuo Wang and Wen Gao and Lionel Ni and Jian Guo , journal=. A survey on. 2024 , url=

  34. [81]

    The Alternative Annotator Test for LLM -as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLM s

    Calderon, Nitay and Reichart, Roi and Dror, Rotem. The Alternative Annotator Test for LLM -as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLM s. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.782

  35. [82]

    Open Mind , volume =

    Trott, Sean , title =. Open Mind , volume =. 2024 , month =. doi:10.1162/opmi_a_00144 , url =

  36. [83]

    Can Large Language Models Be an Alternative to Human Evaluations?

    Chiang, Cheng-Han and Lee, Hung-yi. Can Large Language Models Be an Alternative to Human Evaluations?. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.870

  37. [84]

    ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks , year=

    Nasution, Arbi Haza and Onan, Aytuğ , journal=. ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks , year=

  38. [85]

    M - MAD : Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation

    Feng, Zhaopeng and Su, Jiayuan and Zheng, Jiamei and Ren, Jiahan and Zhang, Yan and Wu, Jian and Wang, Hongwei and Liu, Zuozhu. M - MAD : Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/202...

  39. [86]

    Large Language Models Are State-of-the-Art Evaluators of Translation Quality

    Kocmi, Tom and Federmann, Christian. Large Language Models Are State-of-the-Art Evaluators of Translation Quality. Proceedings of the 24th Annual Conference of the European Association for Machine Translation. 2023

  40. [88]

    , title =

    Dubois, Yann and Li, Xuechen and Taori, Rohan and Zhang, Tianyi and Gulrajani, Ishaan and Ba, Jimmy and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

  41. [89]

    G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment

    Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang. G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.153

  42. [90]

    Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

    Hada, Rishav and Gumma, Varun and de Wynter, Adrian and Diddee, Harshita and Ahmed, Mohamed and Choudhury, Monojit and Bali, Kalika and Sitaram, Sunayana. Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?. Findings of the Association for Computational Linguistics: EACL 2024. 2024

  43. [91]

    Large Language Models are not Fair Evaluators

    Wang, Peiyi and Li, Lei and Chen, Liang and Cai, Zefan and Zhu, Dawei and Lin, Binghuai and Cao, Yunbo and Kong, Lingpeng and Liu, Qi and Liu, Tianyu and Sui, Zhifang. Large Language Models are not Fair Evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.ac...

  44. [92]

    CoRR , volume=

    Prometheus: Inducing fine-grained evaluation capability in language models , author=. CoRR , volume=

  45. [93]

    CoRR , volume=

    Pandalm: An automatic evaluation benchmark for LLM instruction tuning optimization , author=. CoRR , volume=

  46. [94]

    arXiv , volume=

    Tinystories: How small can language models be and still speak coherent english? , author=. arXiv , volume=. 2023 , url=

  47. [95]

    arXiv , url=

    Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence , author=. arXiv , url=. 2026 , volume=

  48. [96]

    NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning , year=

    Adapting Vision-Language Models for Evaluating World Models , author=. NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning , year=

  49. [97]

    Leveraging Multiple LLM Evaluators for Scalable and Fair Language Model Assessments , year=

    Koc, Vincent and Janjua, Jamshaid Iqbal and Alang, Karan and Peta, Sumeer Basha , booktitle=. Leveraging Multiple LLM Evaluators for Scalable and Fair Language Model Assessments , year=

  50. [98]

    Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study

    Li, Kevin Danis and Fernandez, Adrian M and Schwartz, Rachel and Rios, Natalie and Carlisle, Marvin Nathaniel and Amend, Gregory M and Patel, Hiren V and Breyer, Benjamin N. Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study. J Med Internet Res. 2024. doi:10.2196/56500

  51. [100]

    Large Language Models Discriminate Against Speakers of G erman Dialects

    Bui, Minh Duc and Holtermann, Carolin and Hofmann, Valentin and Lauscher, Anne and von der Wense, Katharina. Large Language Models Discriminate Against Speakers of G erman Dialects. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.415

  52. [102]

    Evaluating the Elementary Multilingual Capabilities of Large Language Models with M ulti Q

    Holtermann, Carolin and R. Evaluating the Elementary Multilingual Capabilities of Large Language Models with M ulti Q. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.265

  53. [104]

    Adaption and Evaluation of Generative Large Language Models for G erman Medical Information Extraction

    Spiegel, S. Adaption and Evaluation of Generative Large Language Models for G erman Medical Information Extraction. Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers. 2025

  54. [105]

    OpenAI and Josh Achiam and Steven Adler and Sandhini Agarwal and Lama Ahmad and Ilge Akkaya and Florencia Leoni Aleman and Diogo Almeida and Janko Altenschmidt and Sam Altman and Shyamal Anadkat and Red Avila and Igor Babuschkin and Suchir Balaji and Valerie Balcom and Paul Baltescu and Haiming Bao and Mohammad Bavarian and Jeff Belgum and Irwan Bello and...

  55. [106]

    Zhang and Han Bao and Hanwei Xu and Haocheng Wang and Haowei Zhang and Honghui Ding and Huajian Xin and Huazuo Gao and Hui Li and Hui Qu and J

    DeepSeek-AI and Aixin Liu and Bei Feng and Bing Xue and Bingxuan Wang and Bochao Wu and Chengda Lu and Chenggang Zhao and Chengqi Deng and Chenyu Zhang and Chong Ruan and Damai Dai and Daya Guo and Dejian Yang and Deli Chen and Dongjie Ji and Erhang Li and Fangyun Lin and Fucong Dai and Fuli Luo and Guangbo Hao and Guanting Chen and Guowei Li and H. Zhang...

  56. [107]

    arXiv , url=

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , author=. arXiv , url=. 2024 , volume=

  57. [108]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Chen, Dongping and Chen, Ruoxi and Zhang, Shilin and Wang, Yaochen and Liu, Yinuo and Zhou, Huichi and Zhang, Qihui and Wan, Yao and Zhou, Pan and Sun, Lichao , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  58. [109]

    2026 , url=

    Ivry, Amir and Watanabe, Shinji , journal=. 2026 , url=

  59. [110]

    Trust but Check:

    Tadesse Destaw Belay and Henok Biadglign Ademtew and Idris Abdulmumin and Sukairaj Hafiz Imam and Abubakar Juma Chilala and Godfred Agyapong and Chinedu Emmanuel Mbonu and Basil Friday Ovu and Catherine Nana Nyaah Essuman and Alfred Malengo Kondoro and Sonia Adhiambo and Daud Abolade and Ponts'o Mpholle and Nicholaus Dismas Ladislaus and Saminu Mohammad A...

  60. [112]

    Understanding and Mitigating Language Confusion in LLM s

    Marchisio, Kelly and Ko, Wei-Yin and Berard, Alexandre and Dehaze, Th \'e o and Ruder, Sebastian. Understanding and Mitigating Language Confusion in LLM s. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.380

  61. [113]

    Science , volume =

    Myra Cheng and Cinoo Lee and Pranav Khadpe and Sunny Yu and Dyllan Han and Dan Jurafsky , title =. Science , volume =. 2026 , doi =

  62. [116]

    How to Create Treebanks without Human Annotators -- An Indigenous Language Grammar Checker for Treebank Construction

    Wiechetek, Linda and Pirinen, Flammie A and Kappfjell, Maja Lisa. How to Create Treebanks without Human Annotators -- An Indigenous Language Grammar Checker for Treebank Construction. Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025). 2025

  63. [119]

    Judging Against the Reference: Uncovering Knowledge-Driven Failures in

    Lee, Dongryeol and Hwang, Yerin and Kang, Taegwan and Lee, Minwoo and Chae, Younhyung and Jung, Kyomin , journal=. Judging Against the Reference: Uncovering Knowledge-Driven Failures in. 2026 , url=

  64. [120]

    2026 , eprint=

    TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages , author=. 2026 , eprint=

  65. [121]

    Seza Do g ru \"o z, Andr \'e Coneglian, and Atul Kr

    David Ifeoluwa Adelani, A. Seza Do g ru \"o z, Andr \'e Coneglian, and Atul Kr. Ojha. 2024. https://doi.org/10.18653/v1/2024.americasnlp-1.5 Comparing LLM prompting with cross-lingual transfer performance on indigenous and low-resource B razilian languages . In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the ...

  66. [123]

    Vaibhav Adlakha, Parishad BehnamGhader, Xing Han Lu, Nicholas Meade, and Siva Reddy. 2024 b . https://doi.org/10.1162/tacl_a_00667 Evaluating correctness and faithfulness of instruction-following models for question answering . Transactions of the Association for Computational Linguistics, 12:681--699

  67. [124]

    Utkarsh Agarwal, Kumar Tanmay, Aditi Khandelwal, and Monojit Choudhury. 2024. https://aclanthology.org/2024.lrec-main.560/ Ethical reasoning and moral value alignment of LLM s depend on the language we prompt them in . In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 202...

  68. [125]

    Victor Akinode, Senyu Li, Wassim Hamidouche, Waqas Zamir, Inbal Becker-Reshef, and David Ifeoluwa Adelani. 2026. https://arxiv.org/abs/2606.01322 Tukabench: A culturally grounded jailbreak benchmark for african languages . Preprint, arXiv:2606.01322

  69. [126]

    Dang Anh, Limor Raviv, and Lukas Galke. 2024. https://doi.org/10.18653/v1/2024.cmcl-1.15 Morphology matters: Probing the cross-linguistic morphological generalization abilities of large language models through a wug test . In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 177--188, Bangkok, Thailand. Association for...

  70. [127]

    Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fern \'a ndez, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. https:...

  71. [128]

    Tadesse Destaw Belay, Henok Biadglign Ademtew, Idris Abdulmumin, Sukairaj Hafiz Imam, Abubakar Juma Chilala, Godfred Agyapong, Chinedu Emmanuel Mbonu, Basil Friday Ovu, Catherine Nana Nyaah Essuman, Alfred Malengo Kondoro, Sonia Adhiambo, Daud Abolade, Ponts'o Mpholle, Nicholaus Dismas Ladislaus, Saminu Mohammad Aliyu, Gali Ahmad Samuel, Fabrice Hakuziman...

  72. [129]

    Costa-Juss \`a

    Samuel Bell, Eduardo S \'a nchez, David Dale, Pontus Stenetorp, Mikel Artetxe, and Marta R. Costa-Juss \`a . 2025. https://doi.org/10.18653/v1/2025.wmt-1.15 Translate, then detect: Leveraging machine translation for cross-lingual toxicity classification . In Proceedings of the Tenth Conference on Machine Translation, pages 253--268, Suzhou, China. Associa...

  73. [130]

    Michael Bennie, Bushi Xiao, Chryseis Xinyi Liu, Demi Zhang, Jian Meng, and Alayo Tripp. 2025. https://aclanthology.org/2025.mcg-1.5/ CODEOFCONDUCT at multilingual counterspeech generation: A context-aware model for robust counterspeech generation in low-resource languages . In Proceedings of the First Workshop on Multilingual Counterspeech Generation, pag...

  74. [131]

    Marcel Bollmann, Nathan Schneider, Arne K \"o hn, and Matt Post. 2023. https://doi.org/10.18653/v1/2023.nlposs-1.10 Two decades of the ACL A nthology: Development, impact, and open challenges . In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 83--94, Singapore. Association for Computational Linguistics

  75. [132]

    Sabri Boughorbel and Majd Hawasly. 2023. https://doi.org/10.18653/v1/2023.arabicnlp-1.11 Analyzing multilingual competency of LLM s in multi-turn instruction following: A case study of A rabic . In Proceedings of ArabicNLP 2023, pages 128--139, Singapore (Hybrid). Association for Computational Linguistics

  76. [133]

    Sabri Boughorbel, Md Rizwan Parvez, and Majd Hawasly. 2024. https://doi.org/10.18653/v1/2024.arabicnlp-1.7 Improving language models trained on translated data with continual pre-training and dictionary learning analysis . In Proceedings of the Second Arabic Natural Language Processing Conference, pages 73--88, Bangkok, Thailand. Association for Computati...

  77. [134]

    Mojca Brglez, S pela Vintar, and Ale s Z agar. 2024. https://aclanthology.org/2024.cogalex-1.5/ How human-like are word associations in generative models? An experiment in S lovene . In Proceedings of the Workshop on Cognitive Aspects of the Lexicon @ LREC-COLING 2024, pages 42--48, Torino, Italia. ELRA and ICCL

  78. [135]

    Khaoula Chehbouni, Mohammed Haddou, Jackie CK Cheung, and Golnoosh Farnadi. 2025. https://openreview.net/forum?id=yqKfMr0yvY Neither valid nor reliable? Investigating the use of LLM s as judges . In The Thirty-Ninth Annual Conference on Neural Information Processing Systems Position Paper Track

  79. [136]

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. 2024 a . https://dl.acm.org/doi/10.5555/3692070.3692324 MLLM-as-a-Judge : assessing multimodal LLM-as-a-Judge with vision-language benchmark . In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

  80. [137]

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024 b . https://doi.org/10.18653/v1/2024.emnlp-main.474 Humans or LLM s as the judge? A study on judgement bias . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301--8327, Miami, Florida, USA. Association for Computational Linguistics

Showing first 80 references.