On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
Pith reviewed 2026-05-18 12:50 UTC · model grok-4.3
The pith
Fine-tuned LLM judges lose effectiveness on responses from newer models but retain it on older ones and degrade on unseen questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training and testing judges under controlled shifts between older, current, and newer generator responses plus seen versus unseen questions, the work finds that future-proofing remains difficult for most models, backward-compatibility is relatively straightforward and improved by DPO, continual learning balances adaptation across shifts better than single-distribution training, and performance drops on unseen questions for every model and method examined.
What carries the argument
Unified evaluation framework that varies train and test distributions across older/current/newer response generators and seen/unseen questions to isolate future-proofing, backward-compatibility, and question generalization.
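A minimal sketch of what such a train/test grid could look like. The function names `train_judge` and `accuracy`, and the labels in `RESPONSE_SETS` and `QUESTION_SETS`, are hypothetical placeholders for illustration and are not taken from the paper.

```python
from itertools import product

# Assumed labels for the response generators and question splits.
RESPONSE_SETS = ("older", "current", "newer")
QUESTION_SETS = ("seen", "unseen")


def classify_shift(train_dist, test_dist):
    """Name the shift implied by a (train, test) pair of response sets."""
    rank = RESPONSE_SETS.index
    if rank(test_dist) > rank(train_dist):
        return "future-proofing"
    if rank(test_dist) < rank(train_dist):
        return "backward-compatibility"
    return "in-distribution"


def evaluation_grid(train_judge, accuracy):
    """Train one judge per response set and score it on every test cell.

    train_judge(train_dist) -> judge            (caller-supplied, hypothetical)
    accuracy(judge, test_dist, q_set) -> float  (caller-supplied, hypothetical)
    """
    results = {}
    for train_dist in RESPONSE_SETS:
        judge = train_judge(train_dist)
        for test_dist, q_set in product(RESPONSE_SETS, QUESTION_SETS):
            results[(train_dist, test_dist, q_set)] = accuracy(judge, test_dist, q_set)
    return results
```

Each of the 18 cells isolates one combination of response shift and question split, so the three properties can be read off the same table.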
If this is right
- DPO-trained judges should be preferred when backward-compatibility with older generators matters.
- Continual learning across multiple response distributions yields more stable performance than training on a single stronger or weaker set.
- Judges will require periodic retraining whenever generator models advance enough to shift response distributions.
- Additional techniques beyond standard fine-tuning are needed to reduce degradation on unseen questions.
Where Pith is reading between the lines
- If real generator improvements create larger stylistic shifts than the simulated ones, the shelf life of fine-tuned judges may be even shorter than measured here.
- The observed question-generalization gap suggests that training data should deliberately maximize question diversity rather than focus only on response quality.
- Hybrid systems that combine a fine-tuned judge with occasional frontier-model prompting could extend practical shelf life without full retraining.
Load-bearing premise
The simulated shifts between older, current, and newer generator responses on the two chosen reasoning datasets capture the real temporal and question-distribution changes that occur in deployed LLM systems.
What would settle it
A new experiment in which a judge fine-tuned on current responses matches or exceeds its accuracy on future-model responses, or shows no drop when tested on questions absent from training, would contradict the reported pattern.
Original abstract
The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and fine-tuning. Recently, fine-tuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of fine-tuned judges regarding their real-world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future-proofing and backward-compatibility -- how well judges fine-tuned on responses by today's generator models perform on responses by future models or past models, as well as question generalization -- how well judges generalize to unseen questions at test time. We study these three aspects under a unified framework with varying train and test distributions in two reasoning datasets, three SFT- and DPO-based fine-tuning algorithms, and three different backbone models. Experiments suggest that future-proofing is challenging for most models, while backward-compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models exhibit some degree of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes three practical aspects of fine-tuned LLM judges—future-proofing against newer generator responses, backward-compatibility with older responses, and generalization to unseen questions—and evaluates them in a unified experimental framework across two reasoning datasets, SFT and DPO fine-tuning algorithms, and three backbone models. Key findings are that future-proofing is challenging while backward-compatibility is relatively straightforward (especially under DPO), continual learning yields more balanced adaptation than training on stronger or weaker responses alone, and all models exhibit performance drops on unseen questions.
Significance. If the empirical results hold, the work supplies actionable guidance for deploying judge models amid rapidly evolving generators, highlighting DPO and continual learning as preferable strategies for extending shelf life. The multi-dataset, multi-algorithm, multi-backbone design is a strength that increases robustness of the comparative claims.
major comments (3)
- §3 (Distribution Shift Construction): The simulation of older/current/newer response distributions is load-bearing for the central claims on future-proofing difficulty and backward-compatibility ease. The manuscript must specify the exact generators, prompting techniques, or capability proxies used to create these shifts and provide evidence or argumentation that they reproduce key real-world dimensions (new failure modes, stylistic drift, capability jumps) rather than artifacts of the chosen simulation method.
- §4.2–4.3 (Results and Statistical Reporting): Claims such as “DPO-trained models consistently improving performance” and “continual learning provides a more balanced adaptation” are presented without reported error bars, number of random seeds, or statistical significance tests across the train/test distribution shifts. This weakens assessment of whether the observed differences are reliable or could be explained by variance in the chosen data splits.
- §4.4 (Question Generalization): The reported degradation on unseen questions is central to the practical takeaway that current judges do not fully generalize. The paper should quantify the magnitude of this drop relative to the distribution-shift effects and test whether it persists when question overlap is controlled more stringently (e.g., via explicit train/test question partitioning metrics).
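The statistical reporting requested for §4.2–4.3 could be sketched roughly as follows, using only the Python standard library. The accuracy values are invented placeholders, not results from the paper, and the paired t-test is one possible choice of test rather than the authors' method.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Mean and standard error of the mean over per-seed scores."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for two methods run on the same seeds."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-seed accuracies for one train/test cell (illustrative only).
dpo_acc = [0.71, 0.74, 0.72, 0.73, 0.70]
sft_acc = [0.68, 0.70, 0.69, 0.71, 0.66]

dpo_mean, dpo_se = mean_and_stderr(dpo_acc)
t_stat, dof = paired_t_statistic(dpo_acc, sft_acc)
```

The resulting `t_stat` would be compared against a t-distribution with `dof` degrees of freedom; pairing by seed removes seed-level variance that both methods share.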
minor comments (2)
- Figure legends and axis labels should explicitly name the train and test distribution combinations (older/current/newer) rather than relying on color alone.
- Ensure the related-work section cites recent studies on continual learning for LLM alignment and judge robustness to avoid under-claiming novelty.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us identify areas to improve the manuscript. We address each major comment below and indicate the revisions we will make.
- Referee: §3 (Distribution Shift Construction): The simulation of older/current/newer response distributions is load-bearing for the central claims on future-proofing difficulty and backward-compatibility ease. The manuscript must specify the exact generators, prompting techniques, or capability proxies used to create these shifts and provide evidence or argumentation that they reproduce key real-world dimensions (new failure modes, stylistic drift, capability jumps) rather than artifacts of the chosen simulation method.
Authors: We agree that greater transparency and validation of the distribution shift construction is important for supporting our claims. In the revised manuscript, we will expand Section 3 to provide detailed specifications of the generators, prompting techniques, and any capability proxies used. Additionally, we will include qualitative examples and argumentation demonstrating how these shifts capture aspects such as new failure modes and stylistic drift observed in evolving LLMs, to address concerns about potential artifacts.
Revision: yes
- Referee: §4.2–4.3 (Results and Statistical Reporting): Claims such as “DPO-trained models consistently improving performance” and “continual learning provides a more balanced adaptation” are presented without reported error bars, number of random seeds, or statistical significance tests across the train/test distribution shifts. This weakens assessment of whether the observed differences are reliable or could be explained by variance in the chosen data splits.
Authors: We acknowledge this limitation in our current statistical reporting. We will revise Sections 4.2 and 4.3 to include error bars based on multiple random seeds, specify the number of seeds used, and report results of statistical significance tests (such as t-tests) for the key comparisons. This will allow readers to better assess the reliability of the observed differences.
Revision: yes
- Referee: §4.4 (Question Generalization): The reported degradation on unseen questions is central to the practical takeaway that current judges do not fully generalize. The paper should quantify the magnitude of this drop relative to the distribution-shift effects and test whether it persists when question overlap is controlled more stringently (e.g., via explicit train/test question partitioning metrics).
Authors: We appreciate this suggestion to strengthen the analysis of question generalization. In the revision, we will quantify the magnitude of the performance degradation on unseen questions and compare it directly to the effects from response distribution shifts. We will also introduce stricter controls on question overlap, such as using embedding-based similarity metrics to partition questions, and present results under these conditions to confirm the persistence of the degradation.
Revision: yes
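The stricter overlap control described in the last response could be sketched as below. The rebuttal mentions embedding-based similarity; this sketch swaps in token-level Jaccard similarity as a dependency-free stand-in, and the function names and the `max_similarity` threshold are illustrative assumptions.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two questions (stand-in for embeddings)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_unseen(train_questions, candidate_questions, max_similarity=0.5):
    """Keep candidates whose similarity to every training question is at most the threshold."""
    return [
        q for q in candidate_questions
        if all(jaccard(q, t) <= max_similarity for t in train_questions)
    ]


# Hypothetical questions, for illustration only.
train = ["What is 2 plus 3?"]
candidates = [
    "What is 2 plus 4?",                                # near-duplicate of a training question
    "Prove that the sum of two even numbers is even.",  # clearly different
]
unseen = filter_unseen(train, candidates)
```

Re-running the seen/unseen comparison on the filtered set would show whether the degradation persists once near-duplicate questions are excluded from the test split.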
Circularity Check
No significant circularity in empirical evaluation framework
This paper conducts a purely empirical study measuring the performance of fine-tuned LLM judges across controlled shifts in response distributions on two reasoning datasets, using SFT and DPO algorithms with three backbone models. All reported findings on future-proofing, backward-compatibility, and question generalization derive directly from experimental train/test splits and accuracy metrics rather than any mathematical derivations, predictions, or first-principles results. No steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the evaluation framework remains self-contained against the observed data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected reasoning datasets and simulated response-distribution shifts adequately proxy real-world temporal changes in generator models and question distributions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
FutureProof = Acc_strong(J_weak) − Acc_weak(J_weak)
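A small sketch of how the quantity in this passage could be computed. It reads J_weak as a judge trained on responses from weaker generators and Acc_strong / Acc_weak as its accuracy on stronger and weaker test responses; that reading of the notation, the helper names, and the toy labels are all assumptions.

```python
def accuracy(predictions, labels):
    """Fraction of judge verdicts that match the gold labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def future_proof_gap(acc_on_strong, acc_on_weak):
    """FutureProof = Acc_strong(J_weak) - Acc_weak(J_weak)."""
    return acc_on_strong - acc_on_weak


# Toy verdicts from a hypothetical J_weak (1 = response judged correct).
weak_labels, weak_preds = [1, 0, 1, 1], [1, 0, 1, 0]      # 3 of 4 correct
strong_labels, strong_preds = [1, 1, 0, 0], [1, 0, 1, 0]  # 2 of 4 correct

gap = future_proof_gap(
    accuracy(strong_preds, strong_labels),
    accuracy(weak_preds, weak_labels),
)
```

Under this reading, a negative gap means the judge loses accuracy when moved from the distribution it was trained on to responses from stronger generators.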
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.