
arXiv: 2509.23542 · v2 · submitted 2025-09-28 · 💻 cs.CL · cs.AI · cs.LG

On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization

Pith reviewed 2026-05-18 12:50 UTC · model grok-4.3

keywords: LLM-as-a-judge, fine-tuning, future-proofing, backward compatibility, question generalization, distribution shift, DPO, continual learning

The pith

Fine-tuned LLM judges lose effectiveness on responses from newer models but retain it on older ones and degrade on unseen questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes three practical limits on how long fine-tuned LLM judges remain useful: future-proofing against improved generators, backward-compatibility with older generators, and generalization to questions not seen in training. Experiments across two reasoning datasets, three SFT- and DPO-based fine-tuning recipes, and three backbone models show that future-proofing is hard while backward-compatibility is easier, especially after DPO training. Continual learning across response distributions gives more balanced performance than training only on stronger or weaker responses. All tested judges still lose accuracy when moving from seen to unseen questions, indicating that current fine-tuning does not fully solve question generalization.

Core claim

The work trains and tests judges under controlled shifts between older, current, and newer generator responses, and between seen and unseen questions. It reports four findings: future-proofing remains difficult for most models; backward-compatibility is relatively easy and improves further with DPO; continual learning balances adaptation across shifts better than single-distribution training; and performance drops on unseen questions for every model and method examined.

What carries the argument

A unified evaluation framework varies the train and test distributions across older, current, and newer response generators and across seen and unseen questions, isolating future-proofing, backward-compatibility, and question generalization.
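
The grid such a framework implies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the function names (`accuracy`, `evaluate_grid`, `shift_gap`), the example schema, and the toy judge are all hypothetical, and `shift_gap` is a plain accuracy difference rather than the paper's own FutureProof or BackCompatibility definitions, which this page does not reproduce.

```python
from itertools import product


def accuracy(judge, examples):
    """Fraction of examples where the judge's verdict matches the gold label."""
    if not examples:
        raise ValueError("empty evaluation set")
    correct = sum(judge(ex["question"], ex["response"]) == ex["label"] for ex in examples)
    return correct / len(examples)


def evaluate_grid(judges, test_sets):
    """Score every judge on every (test generator, question split) cell.

    judges:    {train_generator: callable(question, response) -> bool}
    test_sets: {(test_generator, question_split): [example, ...]}
    Generators are labels such as "older", "current", "newer";
    question splits are "seen" or "unseen" (hypothetical naming).
    """
    grid = {}
    for train_gen, (test_gen, split) in product(judges, test_sets):
        grid[(train_gen, test_gen, split)] = accuracy(
            judges[train_gen], test_sets[(test_gen, split)]
        )
    return grid


def shift_gap(grid, train_gen, from_gen, to_gen, split="seen"):
    """Accuracy change when one judge moves between response distributions."""
    return grid[(train_gen, to_gen, split)] - grid[(train_gen, from_gen, split)]


# Toy judge (an assumption for illustration): it keys on a surface cue
# that stops being informative once the response style changes.
def toy_judge(question, response):
    return "therefore" in response


test_sets = {
    ("current", "seen"): [
        {"question": "q1", "response": "therefore 4", "label": True},
        {"question": "q2", "response": "guess 5", "label": False},
    ],
    ("newer", "seen"): [
        {"question": "q1", "response": "hence 4", "label": True},
        {"question": "q2", "response": "therefore 5", "label": False},
    ],
}

grid = evaluate_grid({"current": toy_judge}, test_sets)
gap = shift_gap(grid, "current", "current", "newer")
print(grid, gap)
```

A negative gap from "current" to "newer" corresponds to the future-proofing failure described above; the same helper applied from "current" to "older" would probe backward-compatibility, and comparing the "seen" and "unseen" cells would probe question generalization.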

If this is right

  • DPO-trained judges should be preferred when backward-compatibility with older generators matters.
  • Continual learning across multiple response distributions yields more stable performance than training on a single stronger or weaker set.
  • Judges will require periodic retraining whenever generator models advance enough to shift response distributions.
  • Additional techniques beyond standard fine-tuning are needed to reduce degradation on unseen questions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If real generator improvements create larger stylistic shifts than the simulated ones, the shelf life of fine-tuned judges may be even shorter than measured here.
  • The observed question-generalization gap suggests that training data should deliberately maximize question diversity rather than focus only on response quality.
  • Hybrid systems that combine a fine-tuned judge with occasional frontier-model prompting could extend practical shelf life without full retraining.

Load-bearing premise

The simulated shifts between older, current, and newer generator responses on the two chosen reasoning datasets capture the real temporal and question-distribution changes that occur in deployed LLM systems.

What would settle it

A new experiment in which a judge fine-tuned on current responses matches or exceeds its accuracy on future-model responses, or shows no drop when tested on questions absent from training, would contradict the reported pattern.

Figures

Figures reproduced from arXiv: 2509.23542 by Austin Xu, Dilek Hakkani-Tur, Janvijay Singh, Shafiq Joty, Yefan Zhou, Yilun Zhou.

Figure 1: High-level overview of our setup for studying …
Figure 2: (a) Future-proofing measured by FutureProof; negative values show degraded performance on stronger responses. Performance degrades for all models and recipes, indicating poor evaluation of newer, stronger responses. (b) Benefits of re-training on strong responses, measured by RefreshAdvantage. Re-training consistently improves performance, with the largest gains under DPO. Legend labels: SFT, DPO, SFT+DPO.
Figure 3: (a) BackCompatibility of judges trained on strong responses when evaluating older responses; positive values indicate improved performance relative to older-judge baselines. Judges trained on newer responses show good BackCompatibility, with minimal drops or even absolute gains. (b) Despite strong absolute performance, newer judges still face a distribution shift, reflected by CompatibilityShift, with per…
Figure 4: Changes in future-proofing metrics when replacing a weak-response-trained judge (solid) …
Figure 5: Changes in backward-compatibility metrics when replacing a strong-response-trained …
Figure 6: Generalization of judges trained on weak vs. strong responses to seen and unseen ques…
Figure 7: Generator strength on the DeepScaleR dataset, measured using pass@1 with …

Original abstract

The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and fine-tuning. Recently, fine-tuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of fine-tuned judges regarding their real-world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future-proofing and backward-compatibility -- how well judges fine-tuned on responses by today's generator models perform on responses by future models or past models, as well as question generalization -- how well judges generalize to unseen questions at test time. We study these three aspects under a unified framework with varying train and test distributions in two reasoning datasets, three SFT- and DPO-based fine-tuning algorithms, and three different backbone models. Experiments suggest that future-proofing is challenging for most models, while backward-compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models exhibit some degree of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formalizes three practical aspects of fine-tuned LLM judges—future-proofing against newer generator responses, backward-compatibility with older responses, and generalization to unseen questions—and evaluates them in a unified experimental framework across two reasoning datasets, SFT and DPO fine-tuning algorithms, and three backbone models. Key findings are that future-proofing is challenging while backward-compatibility is relatively straightforward (especially under DPO), continual learning yields more balanced adaptation than training on stronger or weaker responses alone, and all models exhibit performance drops on unseen questions.

Significance. If the empirical results hold, the work supplies actionable guidance for deploying judge models amid rapidly evolving generators, highlighting DPO and continual learning as preferable strategies for extending shelf life. The multi-dataset, multi-algorithm, multi-backbone design is a strength that increases robustness of the comparative claims.

major comments (3)
  1. §3 (Distribution Shift Construction): The simulation of older/current/newer response distributions is load-bearing for the central claims on future-proofing difficulty and backward-compatibility ease. The manuscript must specify the exact generators, prompting techniques, or capability proxies used to create these shifts and provide evidence or argumentation that they reproduce key real-world dimensions (new failure modes, stylistic drift, capability jumps) rather than artifacts of the chosen simulation method.
  2. §4.2–4.3 (Results and Statistical Reporting): Claims such as “DPO-trained models consistently improving performance” and “continual learning provides a more balanced adaptation” are presented without reported error bars, number of random seeds, or statistical significance tests across the train/test distribution shifts. This weakens assessment of whether the observed differences are reliable or could be explained by variance in the chosen data splits.
  3. §4.4 (Question Generalization): The reported degradation on unseen questions is central to the practical takeaway that current judges do not fully generalize. The paper should quantify the magnitude of this drop relative to the distribution-shift effects and test whether it persists when question overlap is controlled more stringently (e.g., via explicit train/test question partitioning metrics).
minor comments (2)
  1. Figure legends and axis labels should explicitly name the train and test distribution combinations (older/current/newer) rather than relying on color alone.
  2. Ensure the related-work section cites recent studies on continual learning for LLM alignment and judge robustness to avoid under-claiming novelty.
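
The statistical reporting requested in major comment 2 could look roughly like the sketch below. It uses only the Python standard library; the function names are hypothetical, the accuracy lists are made-up placeholders (not results from the paper), and the exact sign-flip permutation test is one possible choice of paired significance test, not one the manuscript is known to use.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across seeds (the error bar)."""
    return mean(scores), stdev(scores)


def paired_differences(a, b):
    """Per-seed differences a[i] - b[i]; both recipes must share seeds."""
    if len(a) != len(b):
        raise ValueError("paired comparison needs one score per seed for each recipe")
    return [x - y for x, y in zip(a, b)]


def paired_t_statistic(a, b):
    """Paired t statistic with len(a) - 1 degrees of freedom."""
    diffs = paired_differences(a, b)
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all per-seed differences are identical")
    return mean(diffs) / (spread / sqrt(len(diffs)))


def sign_flip_p_value(a, b):
    """Exact two-sided paired permutation test.

    Under the null hypothesis the sign of each per-seed difference is
    arbitrary, so every sign assignment is enumerated and the share with
    a mean at least as extreme as the observed one is returned.
    """
    diffs = paired_differences(a, b)
    observed = abs(mean(diffs))
    assignments = list(product((1, -1), repeat=len(diffs)))
    extreme = sum(
        abs(mean([s * d for s, d in zip(signs, diffs)])) >= observed - 1e-12
        for signs in assignments
    )
    return extreme / len(assignments)


# Placeholder accuracies for two hypothetical recipes over five seeds.
recipe_a = [0.71, 0.73, 0.72, 0.74, 0.70]
recipe_b = [0.68, 0.69, 0.70, 0.69, 0.67]

mean_a, std_a = summarize(recipe_a)
t_stat = paired_t_statistic(recipe_a, recipe_b)
p_value = sign_flip_p_value(recipe_a, recipe_b)
print(f"recipe_a: {mean_a:.3f} ± {std_a:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

With only five seeds the smallest two-sided p-value this test can return is 2/32, which is one concrete reason a reader needs the seed count alongside any claim that a recipe "consistently" improves performance.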

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas to improve the manuscript. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: §3 (Distribution Shift Construction): The simulation of older/current/newer response distributions is load-bearing for the central claims on future-proofing difficulty and backward-compatibility ease. The manuscript must specify the exact generators, prompting techniques, or capability proxies used to create these shifts and provide evidence or argumentation that they reproduce key real-world dimensions (new failure modes, stylistic drift, capability jumps) rather than artifacts of the chosen simulation method.

    Authors: We agree that greater transparency and validation of the distribution shift construction is important for supporting our claims. In the revised manuscript, we will expand Section 3 to provide detailed specifications of the generators, prompting techniques, and any capability proxies used. Additionally, we will include qualitative examples and argumentation demonstrating how these shifts capture aspects such as new failure modes and stylistic drift observed in evolving LLMs, to address concerns about potential artifacts. revision: yes

  2. Referee: §4.2–4.3 (Results and Statistical Reporting): Claims such as “DPO-trained models consistently improving performance” and “continual learning provides a more balanced adaptation” are presented without reported error bars, number of random seeds, or statistical significance tests across the train/test distribution shifts. This weakens assessment of whether the observed differences are reliable or could be explained by variance in the chosen data splits.

    Authors: We acknowledge this limitation in our current statistical reporting. We will revise Sections 4.2 and 4.3 to include error bars based on multiple random seeds, specify the number of seeds used, and report results of statistical significance tests (such as t-tests) for the key comparisons. This will allow readers to better assess the reliability of the observed differences. revision: yes

  3. Referee: §4.4 (Question Generalization): The reported degradation on unseen questions is central to the practical takeaway that current judges do not fully generalize. The paper should quantify the magnitude of this drop relative to the distribution-shift effects and test whether it persists when question overlap is controlled more stringently (e.g., via explicit train/test question partitioning metrics).

    Authors: We appreciate this suggestion to strengthen the analysis of question generalization. In the revision, we will quantify the magnitude of the performance degradation on unseen questions and compare it directly to the effects from response distribution shifts. We will also introduce stricter controls on question overlap, such as using embedding-based similarity metrics to partition questions, and present results under these conditions to confirm the persistence of the degradation. revision: yes
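
The stricter overlap control mentioned in the third response can be illustrated with a small filter that drops test questions too close to any training question. This is a sketch under stated assumptions: the rebuttal mentions embedding-based similarity, whereas the code below swaps in token-set Jaccard similarity so that it runs without a model, and the function names and the 0.5 threshold are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens; a crude stand-in for a question embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two questions."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def strictly_unseen(train_questions, test_questions, max_similarity=0.5):
    """Keep test questions whose nearest training question is below the threshold."""
    kept = []
    for question in test_questions:
        nearest = max((jaccard(question, t) for t in train_questions), default=0.0)
        if nearest < max_similarity:
            kept.append(question)
    return kept


train_questions = [
    "What is 2 + 3?",
    "Find the area of a circle with radius 2.",
]
test_questions = [
    "What is 2 + 5?",                 # near-duplicate of a training question
    "How many primes are below 20?",  # no token overlap with training
]

kept = strictly_unseen(train_questions, test_questions)
print(kept)
```

Re-running the unseen-question evaluation on the filtered set, at several thresholds, would show whether the reported degradation persists once near-duplicates are excluded.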

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation framework

Full rationale

This paper conducts a purely empirical study measuring the performance of fine-tuned LLM judges across controlled shifts in response distributions on two reasoning datasets, using SFT and DPO algorithms with three backbone models. All reported findings on future-proofing, backward-compatibility, and question generalization derive directly from experimental train/test splits and accuracy metrics rather than any mathematical derivations, predictions, or first-principles results. No steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the evaluation framework remains self-contained against the observed data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions from the LLM evaluation literature rather than introducing new free parameters, axioms, or entities.

axioms (1)
  • Domain assumption: The selected reasoning datasets and simulated response-distribution shifts adequately proxy real-world temporal changes in generator models and question distributions.
    This premise underpins the train/test splits used to measure the three shelf-life aspects.

pith-pipeline@v0.9.0 · 5842 in / 1372 out tokens · 43696 ms · 2026-05-18T12:50:27.982485+00:00 · methodology



Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 11 internal anchors

  1. [1]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Francis Christiano, John Schulman, and Dandelion Man \'e . Concrete problems in ai safety. ArXiv, abs/1606.06565, 2016. URL https://api.semanticscholar.org/CorpusID:10242377

  2. [2]

    Axolotl: Post-training for ai models, 2023

    Axolotl maintainers and contributors . Axolotl: Post-training for ai models, 2023. URL https://github.com/axolotl-ai-cloud/axolotl

  3. [3]

    Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu, and OpenAI

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas R. Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu, and OpenAI. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. ArXiv, abs/2312.09390, 2023. URL https://api.semanticscholar.org/CorpusID:266312608

  4. [4]

    Judgelrm: Large reasoning models as a judge

    Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He. Judgelrm: Large reasoning models as a judge. ArXiv, abs/2504.00050, 2025 a . URL https://api.semanticscholar.org/CorpusId:277467872

  5. [5]

    Do llm evaluators prefer themselves for a reason? arXiv preprint arXiv:2504.03846, 2025 b

    Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, and Yu Meng. Do llm evaluators prefer themselves for a reason? arXiv preprint arXiv:2504.03846, 2025 b

  6. [6]

    Rm-r1: Reward modeling as reasoning

    Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, and Heng Ji. Rm-r1: Reward modeling as reasoning. ArXiv, abs/2505.02387, 2025 c . URL https://api.semanticscholar.org/CorpusID:278327900

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. ArXiv, abs/2110.14168, 2021. URL https://api.semanticscholar.org/CorpusID:239998651

  8. [8]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony S. Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aur'elien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozi \`e re,...

  9. [9]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. ArXiv, abs/2305.14387, 2023. URL https://arxiv.org/pdf/2305.14387.pdf

  10. [10]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJz6tiCqYm

  11. [11]

    Themis: A reference-free nlg evaluation language model with flexibility and interpretability

    Xinyu Hu, Li Lin, Mingqi Gao, Xunjian Yin, and Xiaojun Wan. Themis: A reference-free nlg evaluation language model with flexibility and interpretability. arXiv preprint arXiv:2406.18365, 2024

  12. [12]

    Camels in a changing climate: Enhancing lm adaptation with tulu 2

    Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702, 2023

  13. [13]

    The comparative trap: Pairwise comparisons amplifies biased preferences of llm evaluators

    Hawon Jeong, chaeHun Park, Jimin Hong, and Jaegul Choo. The comparative trap: Pairwise comparisons amplifies biased preferences of llm evaluators. 2024. URL https://api.semanticscholar.org/CorpusID:270562681

  14. [14]

    Gemma Team Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram'e, Morgane Rivi \`e re, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gael Liu, Francesco Visin, Kathleen Kenealy, Lucas...

  15. [15]

    Prometheus: Inducing fine-grained evaluation capability in language models

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024 a . URL https://openreview.net/forum?id=8euJaTveKw

  16. [16]

    Prometheus 2: An open source language model specialized in evaluating other language models

    Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. ArXiv, abs/2405.01535, 2024 b . URL https://api.semanticscholar.org/CorpusID:269502688

  17. [17]

    Scaling evaluation-time compute with reasoning models as process evaluators.arXiv preprint arXiv:2503.19877, 2025

    Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Kiril Gashteovski, Carolin Lawrence, J. Hockenmaier, Graham Neubig, and S. Welleck. Scaling evaluation-time compute with reasoning models as process evaluators. ArXiv, abs/2503.19877, 2025. URL https://api.semanticscholar.org/CorpusId:277313538

  18. [18]

    Wilds: A benchmark of in-the-wild distribution shifts

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. Wilds: A ...

  19. [19]

    No free labels: Limitations of llm-as-a-judge without human grounding

    Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, and Chris Tanner. No free labels: Limitations of llm-as-a-judge without human grounding. arXiv preprint arXiv:2503.05061, 2025

  20. [20]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: A comprehensive survey on llm-based evaluation methods. ArXiv, abs/2412.05579, 2024 a . URL https://api.semanticscholar.org/CorpusID:274596907

  21. [21]

    Generative judge for evaluating alignment

    Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, hai zhao, and Pengfei Liu. Generative judge for evaluating alignment. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=gtkFw6sZGS

  22. [22]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuo Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Conference on Empirical Methods in Natural Language Processing, 2023. URL https://arxiv.org/pdf/2303.16634.pdf

  23. [23]

    Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog

  24. [24]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback...

  25. [25]

    Bowman, and Shi Feng

    Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=4NJBV6Wp0h

  26. [27]

    Offsetbias: Leveraging debiased data for tuning evaluators

    Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, and Sanghyuk Choi. Offsetbias: Leveraging debiased data for tuning evaluators. arXiv preprint arXiv:2407.06551, 2024 b

  27. [28]

    Large language models sensitivity to the order of options in multiple-choice questions

    Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of the Association for Computational Linguistics: NAACL 2024, pp.\ 2006--2017, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi:10.1865...

  28. [29]

    Vyas Raina, Adian Liusie, and Mark J. F. Gales. Is llm-as-a-judge robust? investigating universal adversarial attacks on zero-shot llm assessment. ArXiv, abs/2402.14016, 2024. URL https://api.semanticscholar.org/CorpusId:267770121

  29. [30]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L'eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram'e, Johan Ferret, Peter Liu, Pouya Dehghani Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stańczyk, Serta...

  30. [31]

    Lmunit: Fine-grained evaluation with natural language unit tests

    Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, and Shikib Mehri. Lmunit: Fine-grained evaluation with natural language unit tests. arXiv preprint arXiv:2412.13091, 2024

  31. [33]

    When to solve, when to verify: Compute-optimal problem solving and generative verification for llm reasoning

    Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, and Anna Rohrbach. When to solve, when to verify: Compute-optimal problem solving and generative verification for llm reasoning. ArXiv, abs/2504.01005, 2025. URL https://api.semanticscholar.org/CorpusId:277467695

  32. [34]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.\ 3008--3021. Curran Associates, Inc., 202...

  33. [35]

    Easy-to-hard generalization: Scalable alignment beyond human supervision

    Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, and Chuang Gan. Easy-to-hard generalization: Scalable alignment beyond human supervision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=qwgfh2fTtN

  34. [36]

    Smith, and Yejin Choi

    Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ ...

  35. [37]

    Judgebench: A benchmark for evaluating LLM -based judges

    Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, and Ion Stoica. Judgebench: A benchmark for evaluating LLM -based judges. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=G0dksFayVq

  36. [38]

    Un ministral, des ministraux, a

    Mistral Team. Un ministral, des ministraux, a . URL https://mistral.ai/news/ministraux

  37. [39]

    Mistral small 3, b

    Mistral Team. Mistral small 3, b . URL https://mistral.ai/news/mistral-small-3

  38. [40]

    Pairwise or pointwise? evaluating feedback protocols for bias in LLM -based evaluation

    Tuhina Tripathi, Manya Wadhwa, Greg Durrett, and Scott Niekum. Pairwise or pointwise? evaluating feedback protocols for bias in LLM -based evaluation. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=uyX5Vnow3U

  39. [41]

    Foundational autoraters: Taming large language models for better automatic evaluation

    Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, and Yun-Hsuan Sung. Foundational autoraters: Taming large language models for better automatic evaluation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 17086--17105, Miami, Flori...

  40. [42]

    Direct judgement preference optimization

    Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, and Shafiq Joty. Direct judgement preference optimization. arXiv preprint arXiv:2409.14664, 2024 a

  41. [43]

    Large Language Models are not Fair Evaluators

    Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. ArXiv, abs/2305.17926, 2023. URL https://api.semanticscholar.org/CorpusID:258960339

  42. [44]

    Self-taught evaluators

    Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li. Self-taught evaluators. arXiv preprint arXiv:2408.02666, 2024 b

  43. [45]

    J1: Incentivizing thinking in llm-as-a-judge via reinforcement learning

    Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, and Swarnadeep Saha. J1: Incentivizing thinking in llm-as-a-judge via reinforcement learning. arXiv preprint arXiv:2505.10320, 2025

  44. [46]

    Does context matter? contextualjudgebench for evaluating llm-based judges in contextual settings

    Austin Xu, Srijan Bansal, Yifei Ming, Semih Yavuz, and Shafiq Joty. Does context matter? contextualjudgebench for evaluating llm-based judges in contextual settings. arXiv preprint arXiv:2503.15620, 2025 a

  45. [47]

    J4r: Learning to judge with equivalent initial state group relative policy optimization

    Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty. J4r: Learning to judge with equivalent initial state group relative policy optimization. ArXiv, abs/2505.13346, 2025 b . URL https://api.semanticscholar.org/CorpusID:278768650

  46. [48]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Ke-Yang Chen, Kexin Yang, Mei Li, Min Xue...

  47. [49]

    Qwen2.5 Technical Report

    Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin,...

  48. [50]

    Beyond scalar reward model: Learning generative judge from preference data

    Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, and Yiqun Liu. Beyond scalar reward model: Learning generative judge from preference data. arXiv preprint arXiv:2410.03742, 2024

  49. [51]

    Self-rewarding language models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason E. Weston. Self-rewarding language models. ArXiv, abs/2401.10020, 2024. URL https://arxiv.org/pdf/2401.10020.pdf

  50. [52]

    Evaluating large language models at evaluating instruction following

    Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluating large language models at evaluating instruction following. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=tr0KidwPLc

  51. [53]

    Judging LLM-as-a-judge with MT-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://ope...

  52. [54]

    Evaluating judges as evaluators: The jetts benchmark of llm-as-judges as test-time scaling evaluators

    Yilun Zhou, Austin Xu, PeiFeng Wang, Caiming Xiong, and Shafiq Joty. Evaluating judges as evaluators: The jetts benchmark of llm-as-judges as test-time scaling evaluators. ArXiv, abs/2504.15253, 2025. URL https://api.semanticscholar.org/CorpusId:277955867

  53. [55]

    JudgeLM: Fine-tuned large language models are scalable judges

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. JudgeLM: Fine-tuned large language models are scalable judges. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=xsELpEPn4A
