pith. sign in

arxiv: 2606.07874 · v1 · pith:4CRWGGVFnew · submitted 2026-06-05 · 💻 cs.AI

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

Pith reviewed 2026-06-27 21:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM judgessafety evaluationin-context learningsafety priorsAI alignmentcontextual safety
0
0 comments X

The pith

LLM-judges rarely adjust safety evaluations when context or definitions contradict their internal priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how well LLM-based safety judges respond to new information supplied in their prompts. It supplies task demonstrations, novel context, and alternative safety definitions that sometimes conflict with the models' built-in safety understanding. The central result is that the judges can absorb new facts but seldom revise their safety judgments when the supplied material clashes with their priors. This rigidity matters because LLM-judges are the main practical tool for checking safety at scale, so any failure to follow user-specified criteria can produce evaluations that do not match the intended policy.

Core claim

LLM-judges can learn from new information, yet they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.

What carries the argument

The internal safety priors of the LLM-judge, which determine whether in-context information or redefined safety criteria can override the model's default safety stance.

If this is right

  • Safety evaluations produced by LLM-judges will often reflect the model's fixed priors rather than the criteria explicitly stated in the prompt.
  • Providing demonstrations or novel context has limited steering power when the new material conflicts with those priors.
  • Generalist and safety-specific LLM judges exhibit similar resistance, suggesting the issue is widespread across current judge models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid human-plus-LLM review pipelines may be needed for cases where prompt context and model priors diverge.
  • The finding limits how far in-context learning alone can be relied upon to enforce new safety policies in evaluator systems.
  • Future work could test whether additional training stages that explicitly reward context-following reduce this rigidity.

Load-bearing premise

The tested in-context scenarios, safety definitions, and model set sufficiently represent the behavior of LLM-judges in real deployment for safety evaluation at scale.

What would settle it

A follow-up experiment that presents the same models with real deployment safety cases and shows consistent revision of judgments when the supplied safety definition directly contradicts the model's prior.

Figures

Figures reproduced from arXiv: 2606.07874 by Anissa Alloula, Ava Batchkala, Federico Licini, Seraphina Goldfarb-Tarrant.

Figure 1
Figure 1. Figure 1: We test whether LLMs-judges for safety are [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Context can bridge knowledge gaps in Novel￾Prompts. Bars show F1 scores (mean and std dev of 5 seeds) when judges are only given the user request vs. when they are also given in-context information explaining the request. 3They do cause strong performance decreases for a few judges in Sorry-BENCH, driving the average F1 down, how￾ever this is on a minority of models (Claude-haiku, Tiny-Aya, Command-A/R), b… view at source ↗
Figure 3
Figure 3. Figure 3: Judges are more susceptible to changing their predictions on samples on which they have less prior knowledge, in MultilingualPrompts. Left: bars represent the proportion of samples on which judges change their prediction in response to being given: novel context information, irrelevant context information, task demonstrations, and incorrect demonstrations. Right: the likelihood that judges change their pre… view at source ↗
Figure 4
Figure 4. Figure 4: Judges change their predictions similarly in re￾sponse to context, demonstrations, irrelevant context, and misleading demonstrations. Each dot represents the average ranking of how susceptible a judge is across one type of con￾text experiment in NovelPrompts, MultilingualPrompts, and SORRY-Bench. 5 Can you steer a judge to specific safety policies? Our experiments on susceptiblity show that the like￾lihood… view at source ↗
Figure 5
Figure 5. Figure 5: Most judges cannot adapt to safety definition variants. The first two bars show accuracy relative to the base safety policy when they are given/not given the policy, and the last two show accuracy relative to the two safety definition variants they are given in the prompt. Mean and st dev across all 13 judges and seeds in MultilingualPrompts is shown. We push this evaluation further by giving them 5 For in… view at source ↗
Figure 6
Figure 6. Figure 6: Steerability is much higher when the safety judg￾ing task is masked as a classification task (hashed bars) in MultilingualPrompts.6 Steerability is measured as mean judge prediction flips (relative to the expected number of flips) when given a safety definition variant. 6 Conclusion Safety policy, and thus safety evaluation, widely varies across languages, cultures, and use cases. But most LLM judge evalua… view at source ↗
Figure 7
Figure 7. Figure 7: Mean judge accuracy across 5 seeds (with error bars representing standard deviation) when given different prompt [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Distribution of estimated frequency scores of each prompt in MultilingualPrompts and NovelPrompts. The frequency [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Impact of demonstrations and incorrect demonstrations on judge F1 score in MultilingualPrompts, NovelPrompts [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Context has little effect in MultilingualPrompts. Bars compare F1 scores (mean and std dev across 5 seeds) when [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Context trends are similar in model completions safety evaluation. Bars compare F1 scores (mean and std dev across [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Effect of context and irrelevant context on judge performance. Bars represent mean judge F1 score averaged across [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Judges are more susceptible to changing their predictions on samples on which they have less prior knowledge. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Histograms of mean per-sample flip rate grouped by prompt frequency in NovelPrompts. [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Similar judges change their predictions in response to context, demonstrations, shuffled context, and misleading [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Most judges cannot adapt to safety definition variants, causing their accuracy with respect to the adjusted ground-truth [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Bars indicate judge accuracy and standard deviation on the sports dataset when given various sports safety definition [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Most judges are steerable to the absurd safety definition, even if it means predicting truly unsafe samples as safe. Bars [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Steerability is much higher when the safety judging task (lighter bars) is masked as a classification task (darker bars). [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Judge steerability is positively correlated across tasks. In both MultilingualPrompts and NovelPrompts similar judges are steerable to safety definition a and b. Steerability is measured as changes in prediction (flip rate) relative to when judges are given with the base definition, across 5 seeds. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Judge susceptibility, steerability, and accuracy are not significantly correlated in MultilingualPrompts (top) and [PITH_FULL_IMAGE:figures/full_fig_p029_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Judge performance across safety evaluation tasks is not always correlated. [PITH_FULL_IMAGE:figures/full_fig_p030_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Judge rankings (in terms of accuracy) in the Arabic, French, Japanese, and Korean subsets of MultilingualPrompts are [PITH_FULL_IMAGE:figures/full_fig_p030_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Multilingual knowledge does not predict multilingual safety judging abilities. Each dot represents one judge tested in one language on Global-MMLU and on MultilingualPrompts [PITH_FULL_IMAGE:figures/full_fig_p031_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Left: judge performance vs. model performance. Right: judge performance vs. model performance per language. [PITH_FULL_IMAGE:figures/full_fig_p031_25.png] view at source ↗
read the original abstract

LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in context-information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates two properties of LLMs used as safety judges: their susceptibility to in-context information and their steerability to safety definitions that may conflict with internal priors. Experiments evaluate generalist LLMs and safety-specific judges under task demonstrations, novel in-context information, and changed safety definitions. The central finding is that while LLM-judges can learn from new information, they are broadly unlikely to adjust evaluations when the context or safety definition contradicts their prior.

Significance. If the result holds under more detailed scrutiny, it would highlight a key limitation in deploying LLM-judges for scalable safety evaluation: their rigidity to priors could produce unreliable assessments whenever real-world safety criteria diverge from training distributions. This would have direct implications for AI safety pipelines that rely on automated judging.

major comments (2)
  1. [Abstract] Abstract: The claim that LLM-judges are 'broadly unlikely' to adjust evaluations rests on the tested conditions being representative. The abstract supplies no information on how in-context scenarios and safety definitions were selected to ensure meaningful contradictions with model priors, the number or diversity of generalist and safety-specific models, statistical tests, or controls for prompt variation. Without these details the data-to-claim link cannot be verified.
  2. [Abstract] Abstract: The broad generalization requires evidence that the chosen contradictions were not mild and that the model set is not skewed toward particular training regimes. If either is true, the observed rigidity could be an artifact of the test distribution rather than a general property of LLM-judges.
minor comments (1)
  1. The abstract refers to 'many generalist LLMs and safety-specific judges' without naming them; the main text should list the exact models and versions for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the abstract. We address each major comment below and are prepared to revise the abstract to improve clarity on experimental design while maintaining the integrity of our findings.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that LLM-judges are 'broadly unlikely' to adjust evaluations rests on the tested conditions being representative. The abstract supplies no information on how in-context scenarios and safety definitions were selected to ensure meaningful contradictions with model priors, the number or diversity of generalist and safety-specific models, statistical tests, or controls for prompt variation. Without these details the data-to-claim link cannot be verified.

    Authors: The abstract is space-constrained, but the full manuscript (Section 3) describes the selection of in-context scenarios and safety definitions chosen specifically to create direct contradictions with typical model priors (e.g., redefining harm to permit categories that standard safety training prohibits). We evaluate 8 generalist LLMs spanning multiple providers and 3 safety-specific judges; results are aggregated over 5 prompt variations per condition with statistical significance assessed via paired t-tests (Section 4 and Appendix A). We will revise the abstract to briefly note the model count, the contradictory design of the tests, and the use of multiple prompts. revision: yes

  2. Referee: [Abstract] Abstract: The broad generalization requires evidence that the chosen contradictions were not mild and that the model set is not skewed toward particular training regimes. If either is true, the observed rigidity could be an artifact of the test distribution rather than a general property of LLM-judges.

    Authors: The contradictions were constructed as direct inversions of common safety priors (detailed with examples in Section 3.1), not mild shifts. The model set includes both open-weight and proprietary models from distinct training regimes to reduce skew. While no finite set can cover all possible regimes, the observed pattern holds consistently across the tested distribution. We do not believe additional evidence is required in the current manuscript to support the stated scope of the claim. revision: no

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or fitted predictions

full rationale

The paper conducts an empirical investigation into LLM-judges' behavior under varying in-context information and safety definitions, reporting experimental outcomes on model responses without any mathematical derivations, parameter fitting, or predictive claims that reduce to inputs by construction. No equations, ansatzes, uniqueness theorems, or self-citation chains appear in the provided abstract or description, and the central finding is a direct observation from tests rather than a renamed or fitted result. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study only; no mathematical model, free parameters, axioms, or invented entities are referenced in the abstract.

pith-pipeline@v0.9.1-grok · 5667 in / 976 out tokens · 20993 ms · 2026-06-27T21:37:54.764564+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

101 extracted references · 54 canonical work pages · 5 internal anchors

  1. [1]

    XST est: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    R. XST est: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.301

  2. [2]

    2025 , eprint=

    AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons , author=. 2025 , eprint=

  3. [3]

    2025 , eprint=

    The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs , author=. 2025 , eprint=

  4. [4]

    IVES Conference Series , year =

    Twenty-two shades of grey -- An analysis of alcohol regulations in the Arab world , author =. IVES Conference Series , year =. doi:10.58233/SlRqOlPt , url =

  5. [5]

    Introducing GPT-5.2 , year =

  6. [6]

    GPT-5 mini Model , year =

  7. [7]

    Introducing Claude Sonnet 4.5 , year =

  8. [8]

    Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku , year =

  9. [9]

    Announcing Command A , year =

  10. [10]

    Cohere's Command R Model , year =

  11. [11]

    Qwen/Qwen3-235B-A22B-Instruct-2507 , year =

  12. [12]

    meta-llama/Llama-3.1-70B-Instruct , year =

  13. [13]

    gpt-oss-120b & gpt-oss-20b Model Card , year =

  14. [14]

    Llama 3.1 Nemotron Safety Guard 8B v3 , year =

  15. [15]

    Llama Guard 4 , year =

  16. [16]

    gpt-oss-safeguard Technical Report , year =

  17. [17]

    Is Your LLM Outdated? A Deep Look at Temporal Generalization

    Zhu, Chenghao and Chen, Nuo and Gao, Yufei and Zhang, Yunyi and Tiwari, Prayag and Wang, Benyou. Is Your LLM Outdated? A Deep Look at Temporal Generalization. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1...

  18. [18]

    2026 , eprint=

    AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models , author=. 2026 , eprint=

  19. [19]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

    Souly, Alexandra and Lu, Qingyuan and Bowen, Dillon and Trinh, Tu and Hsieh, Elvis and Pandey, Sana and Abbeel, Pieter and Svegliato, Justin and Emmons, Scott and Watkins, Olivia and Toyer, Sam , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  20. [20]

    Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency , pages =

    Mehta, Manisha and Giunchiglia, Fausto , title =. 2025 , isbn =. doi:10.1145/3715275.3732184 , booktitle =

  21. [21]

    SLANG : New Concept Comprehension of Large Language Models

    Mei, Lingrui and Liu, Shenghua and Wang, Yiwei and Bi, Baolong and Cheng, Xueqi. SLANG : New Concept Comprehension of Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.698

  22. [22]

    Nature , pages=

    Synthesizing scientific literature with retrieval-augmented language models , author=. Nature , pages=. 2026 , publisher=

  23. [23]

    2026 , eprint=

    Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges , author=. 2026 , eprint=

  24. [24]

    Multiculturalism and AI Value Alignment , booktitle =

    Townsend, Beverley A , year =. Multiculturalism and AI Value Alignment , booktitle =. doi:10.1093/9780198945215.003.0178 , url =

  25. [25]

    2021 , url =

    Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , booktitle =. 2021 , url =

  26. [26]

    2023 , eprint=

    Instruction-Following Evaluation for Large Language Models , author=. 2023 , eprint=

  27. [27]

    arXiv preprint arXiv:2504.00698 , year=

    Command a: An enterprise-ready large language model , author=. arXiv preprint arXiv:2504.00698 , year=

  28. [28]

    2025 , url =

    Tan, Sijun and Zhuang, Siyuan and Montgomery, Kyle and Tang, William Yuan and Cuadron, Alejandro and Wang, Chenguang and Popa, Raluca and Stoica, Ion , booktitle =. 2025 , url =

  29. [29]

    GPTS core: Evaluate as You Desire

    Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei. GPTS core: Evaluate as You Desire. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.365

  30. [30]

    The Innovation , year =

    A Survey on LLM-as-a-Judge , author =. The Innovation , year =

  31. [31]

    2025 , eprint=

    FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language , author=. 2025 , eprint=

  32. [32]

    The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

  33. [33]

    Low-Resource Languages Jailbreak GPT-4

    Yong, Zheng-Xin and Menghini, Cristina and Bach, Stephen H. , month = jan, year =. Low-. doi:10.48550/arXiv.2310.02446 , year=

  34. [34]

    Wang, Wenxuan and Tu, Zhaopeng and Chen, Chang and Yuan, Youliang and Huang, Jen-tse and Jiao, Wenxiang and Lyu, Michael , year =. All. Findings of the. doi:10.18653/v1/2024.findings-acl.349 , abstract =

  35. [35]

    Peppin, Aidan and Kreutzer, Julia and Sebag, Alice Schoenauer and Marchisio, Kelly and Ermis, Beyza and Dang, John and Cahyawijaya, Samuel and Singh, Shivalika and Goldfarb-Tarrant, Seraphina and Aryabumi, Viraat and Aakanksha and Ko, Wei-Yin and Üstün, Ahmet and Gallé, Matthias and Fadaee, Marzieh and Hooker, Sara , month = may, year =. The. doi:10.48550...

  36. [36]

    ElNokrashy and Mai Alkhamissi and Mona T

    AlKhamissi, Badr and ElNokrashy, Muhammad and Alkhamissi, Mai and Diab, Mona , year =. Investigating. Proceedings of the 62nd. doi:10.18653/v1/2024.acl-long.671 , language =

  37. [37]

    Mora, David and Aryabumi, Viraat and Ko, Wei-Yin and Hooker, Sara and Kreutzer, Julia and Fadaee, Marzieh , month = oct, year =. The. doi:10.48550/arXiv.2510.19806 , abstract =

  38. [38]

    doi:10.48550/arXiv.2404.05399 , abstract =

    Röttger, Paul and Pernisi, Fabio and Vidgen, Bertie and Hovy, Dirk , month = jan, year =. doi:10.48550/arXiv.2404.05399 , abstract =

  39. [39]

    Aakanksha and Ahmadian, Arash and Ermis, Beyza and Goldfarb-Tarrant, Seraphina and Kreutzer, Julia and Fadaee, Marzieh and Hooker, Sara , month = jul, year =. The. doi:10.48550/arXiv.2406.18682 , abstract =

  40. [40]

    Shen, Lingfeng and Tan, Weiting and Chen, Sihao and Chen, Yunmo and Zhang, Jingyu and Xu, Haoran and Zheng, Boyuan and Koehn, Philipp and Khashabi, Daniel , month = jan, year =. The. doi:10.48550/arXiv.2401.13136 , abstract =

  41. [41]

    doi:10.48550/arXiv.2407.03963 , abstract =

    LLM-jp and Aizawa, Akiko and Aramaki, Eiji and Chen, Bowen and Cheng, Fei and Deguchi, Hiroyuki and Enomoto, Rintaro and Fujii, Kazuki and Fukumoto, Kensuke and Fukushima, Takuya and Han, Namgi and Harada, Yuto and Hashimoto, Chikara and Hiraoka, Tatsuya and Hisada, Shohei and Hosokawa, Sosuke and Jie, Lu and Kamata, Keisuke and Kanazawa, Teruhito and Kan...

  42. [42]

    Üstün, Ahmet and Aryabumi, Viraat and Yong, Zheng-Xin and Ko, Wei-Yin and D'souza, Daniel and Onilude, Gbemileke and Bhandari, Neel and Singh, Shivalika and Ooi, Hui-Lee and Kayid, Amr and Vargus, Freddie and Blunsom, Phil and Longpre, Shayne and Muennighoff, Niklas and Fadaee, Marzieh and Kreutzer, Julia and Hooker, Sara , month = feb, year =. Aya. doi:1...

  43. [43]

    Pozzobon, Luiza and Lewis, Patrick and Hooker, Sara and Ermis, Beyza , month = may, year =. From. doi:10.48550/arXiv.2403.03893 , abstract =

  44. [44]

    Safer or Luckier? LLM s as Safety Evaluators Are Not Robust to Artifacts

    Chen, Hongyu and Goldfarb-Tarrant, Seraphina. Safer or Luckier? LLM s as Safety Evaluators Are Not Robust to Artifacts. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.970

  45. [45]

    URL https: //aclanthology.org/2025.acl-long.919/

    Singh, Shivalika and Romanou, Angelika and Fourrier, Cl \'e mentine and Adelani, David Ifeoluwa and Ngui, Jian Gang and Vila-Suero, Daniel and Limkonchotiwat, Peerat and Marchisio, Kelly and Leong, Wei Qi and Susanto, Yosephine and Ng, Raymond and Longpre, Shayne and Ruder, Sebastian and Ko, Wei-Yin and Bosselut, Antoine and Oh, Alice and Martins, Andre a...

  46. [46]

    Ovalle, Anaelia and Ross, Candace and Ruder, Sebastian and Williams, Adina and Ullrich, Karen and Ibrahim, Mark and Sagun, Levent , year =. Beg to. doi:10.48550/ARXIV.2512.22712 , abstract =

  47. [47]

    Zhao, Weixiang and Hu, Yulin and Deng, Yang and Wu, Tongtong and Zhang, Wenxuan and Guo, Jiahe and Zhang, An and Zhao, Yanyan and Qin, Bing and Chua, Tat-Seng and Liu, Ting , keywords =

  48. [48]

    Validating

    Guerdan, Luke and Barocas, Solon , keywords =. Validating

  49. [49]

    Multilingual !=

    Rystrøm, Jonathan Hvithamar and Kirk, Hannah Rose and Hale, Scott , editor =. Multilingual !=. Proceedings of. 2025 , keywords =

  50. [50]

    2025 , url =

    Xie, Tinghao and Qi, Xiangyu and Zeng, Yi and Huang, Yangsibo and Sehwag, Udari Madhushani and Huang, Kaixuan and He, Luxi and Wei, Boyi and Li, Dacheng and Sheng, Ying and Jia, Ruoxi and Li, Bo and Li, Kai and Chen, Danqi and Henderson, Peter and Mittal, Prateek , booktitle =. 2025 , url =

  51. [51]

    Evaluating the

    Murugadoss, Bhuvanashree and Poelitz, Christian and Drosos, Ian and Le, Vu and McKenna, Nick and Negreanu, Carina Suzana and Parnin, Chris and Sarkar, Advait , keywords =. Evaluating the. Annual AAAI Conference on Artificial Intelligence , file =. 2025 , abstract =

  52. [52]

    Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , month = may, year =. G-. doi:10.48550/arXiv.2303.16634 , abstract =

  53. [53]

    Does Context Matter? C ontextual J udge B ench for Evaluating LLM -based Judges in Contextual Settings

    Xu, Austin and Bansal, Srijan and Ming, Yifei and Yavuz, Semih and Joty, Shafiq. Does Context Matter? C ontextual J udge B ench for Evaluating LLM -based Judges in Contextual Settings. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.470

  54. [54]

    and Zhang, Hao and Gonzalez, Joseph E and Stoica, Ion , booktitle =

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E and Stoica, Ion , booktitle =. 2023 , url =

  55. [55]

    2024 , url =

    Kim, Seungone and Shin, Jamin and Cho, Yejin and Jang, Joel and Longpre, Shayne and Lee, Hwaran and Yun, Sangdoo and Shin, Seongjin and Kim, Sungdong and Thorne, James and Seo, Minjoon , booktitle =. 2024 , url =

  56. [56]

    Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

  57. [57]

    Kreutzer, Julia and Briakou, Eleftheria and Agrawal, Sweta and Fadaee, Marzieh and Tom, Kocmi , month = sep, year =. Déjà. doi:10.48550/arXiv.2504.11829 , abstract =

  58. [58]

    doi:10.48550/arXiv.2410.17578 , abstract =

    Son, Guijin and Yoon, Dongkeun and Suk, Juyoung and Aula-Blasco, Javier and Aslan, Mano and Kim, Vu Trong and Islam, Shayekh Bin and Prats-Cristià, Jaume and Tormo-Bañuelos, Lucía and Kim, Seungone , month = mar, year =. doi:10.48550/arXiv.2410.17578 , abstract =

  59. [59]

    doi:10.48550/arXiv.2508.12733 , abstract =

    Ning, Zhiyuan and Gu, Tianle and Song, Jiaxin and Hong, Shixin and Li, Lingyu and Liu, Huacan and Li, Jie and Wang, Yixu and Lingyu, Meng and Teng, Yan and Wang, Yingchun , month = aug, year =. doi:10.48550/arXiv.2508.12733 , abstract =

  60. [60]

    Bridging

    Salamanca, Alejandro R and Abagyan, Diana and D’souza, Daniel and Khairi, Ammar and Mora, David and Dash, Saurabh and Aryabumi, Viraat and Rajaee, Sara and Mofakhami, Mehrnaz and Sahu, Ananya and Euyang, Thomas and Prince, Brittawnya and Smith, Madeline and Lin, Hangyu and Locatelli, Acyr and Hooker, Sara and Kocmi, Tom and Gomez, Aidan and Zhang, Ivan an...

  61. [61]

    Findings of the WMT 25 Multilingual Instruction Shared Task: Persistent Hurdles in Reasoning, Generation, and Evaluation

    Kocmi, Tom and Agrawal, Sweta and Artemova, Ekaterina and Avramidis, Eleftherios and Briakou, Eleftheria and Chen, Pinzhen and Fadaee, Marzieh and Freitag, Markus and Grundkiewicz, Roman and Hou, Yupeng and Koehn, Philipp and Kreutzer, Julia and Mansour, Saab and Perrella, Stefano and Proietti, Lorenzo and Riley, Parker and Sánchez, Eduardo and Schmidtova...

  62. [62]

    Aakanksha and Ahmadian, Arash and Goldfarb-Tarrant, Seraphina and Ermis, Beyza and Fadaee, Marzieh and Hooker, Sara , month = oct, year =. Mix. doi:10.48550/arXiv.2410.10801 , abstract =

  63. [63]

    Bu, Yuyan and Liu, Xiaohao and Ren, ZhaoXing and Yang, Yaodong and Dai, Juntao , month = feb, year =. Align

  64. [64]

    Context versus

    Du, Kevin and Snæbjarnarson, Vésteinn and Stoehr, Niklas and White, Jennifer and Schein, Aaron and Cotterell, Ryan , editor =. Context versus. Proceedings of the 62nd. 2024 , keywords =. doi:10.18653/v1/2024.acl-long.714 , abstract =

  65. [65]

    Longpre, Shayne and Perisetla, Kartik and Chen, Anthony and Ramesh, Nikhil and DuBois, Chris and Singh, Sameer , editor =. Entity-. Proceedings of the 2021. 2021 , keywords =. doi:10.18653/v1/2021.emnlp-main.565 , abstract =

  66. [66]

    Context-faithful Prompting for Large Language Models

    Zhou, Wenxuan and Zhang, Sheng and Poon, Hoifung and Chen, Muhao , editor =. Context-faithful. Findings of the. 2023 , keywords =. doi:10.18653/v1/2023.findings-emnlp.968 , abstract =

  67. [67]

    Onoe, Yasumasa and Zhang, Michael and Padmanabhan, Shankar and Durrett, Greg and Choi, Eunsol , editor =. Can. Proceedings of the 61st. 2023 , keywords =. doi:10.18653/v1/2023.acl-long.300 , abstract =

  68. [68]

    Yoran, Ori and Wolfson, Tomer and Ram, Ori and Berant, Jonathan , month = may, year =. Making. doi:10.48550/arXiv.2310.01558 , abstract =

  69. [69]

    and Schärli, Nathanael and Zhou, Denny , month = jul, year =

    Shi, Freda and Chen, Xinyun and Misra, Kanishka and Scales, Nathan and Dohan, David and Chi, Ed H. and Schärli, Nathanael and Zhou, Denny , month = jul, year =. Large. Proceedings of the 40th

  70. [70]

    Yang, Minglai and Huang, Ethan and Zhang, Liang and Surdeanu, Mihai and Wang, William and Pan, Liangming , month = sep, year =. How. doi:10.48550/arXiv.2505.18761 , abstract =

  71. [71]

    Overthinking the

    Halawi, Danny and Denain, Jean-Stanislas and Steinhardt, Jacob , month = mar, year =. Overthinking the. doi:10.48550/arXiv.2307.09476 , abstract =

  72. [72]

    Li, Daliang and Rawat, Ankit Singh and Zaheer, Manzil and Wang, Xin and Lukasik, Michal and Veit, Andreas and Yu, Felix and Kumar, Sanjiv , editor =. Large. Findings of the. 2023 , keywords =. doi:10.18653/v1/2023.findings-acl.112 , abstract =

  73. [73]

    Rethinking the

    Min, Sewon and Lyu, Xinxi and Holtzman, Ari and Artetxe, Mikel and Lewis, Mike and Hajishirzi, Hannaneh and Zettlemoyer, Luke , editor =. Rethinking the. Proceedings of the 2022. 2022 , keywords =. doi:10.18653/v1/2022.emnlp-main.759 , abstract =

  74. [74]

    Long, Quanyu and Wu, Yin and Wang, Wenya and Pan, Sinno Jialin , year =. Does

  75. [75]

    2024 , url =

    Kossen, Jannik and Gal, Yarin and Rainforth, Tom , booktitle =. 2024 , url =

  76. [76]

    and Chaudhry, Arslan and Chan, Stephanie C

    Lampinen, Andrew K. and Chaudhry, Arslan and Chan, Stephanie C. Y. and Wild, Cody and Wan, Diane and Ku, Alex and Bornschein, Jörg and Pascanu, Razvan and Shanahan, Murray and McClelland, James L. , month = may, year =. On the generalization of language models from in-context learning and finetuning: a controlled study , shorttitle =. doi:10.48550/arXiv.2...

  77. [77]

    LLM-Safety Evaluations Lack Robustness

    Beyer, Tim and Xhonneux, Sophie and Geisler, Simon and Gidel, Gauthier and Schwinn, Leo and Günnemann, Stephan , month = mar, year =. doi:10.48550/arXiv.2503.02574 , abstract =

  78. [78]

    Schwinn, Leo and Ladenburger, Moritz and Beyer, Tim and Mofakhami, Mehrnaz and Gidel, Gauthier and Günnemann, Stephan , month = mar, year =. A. doi:10.48550/arXiv.2603.06594 , abstract =

  79. [79]

    Systematic

    Wei, Hui and He, Shenghua and Xia, Tian and Liu, Fei and Wong, Andy and Lin, Jingyang and Han, Mei , month = mar, year =. Systematic. doi:10.48550/arXiv.2408.13006 , abstract =

  80. [80]

    I Can't Believe It's Not Better: Challenges in Applied Deep Learning

    Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges , author =. Proceedings on "I Can't Believe It's Not Better: Challenges in Applied Deep Learning" at ICLR 2025 Workshops , pages =. 2025 , editor =

Showing first 80 references.