arxiv: 2509.21319 · v3 · pith:ZFFFMQIXnew · submitted 2025-09-25 · 💻 cs.CL · cs.AI · cs.LG

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Pith reviewed 2026-05-21 22:06 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords: reward models, RLHF, binary principles, LLM alignment, entailment, human feedback, verifiable rewards

The pith

Extracting binary yes-no principles from human feedback lets reward models beat traditional preference models on alignment benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to convert free-form human feedback into a set of simple binary principles, each of which can be answered yes or no. These principles turn reward model training into an entailment task: the model decides whether a response satisfies each principle. The method keeps the flexibility of human judgments while adding the structure of verifiable rules. Trained this way, the resulting models outperform standard Bradley-Terry reward models on the same data and reach top scores on existing leaderboards. Users can also choose which principles to apply at inference time, giving the model a custom focus without retraining.

Core claim

Decomposing natural language feedback into binary principles that a response either satisfies or does not satisfy, then training reward models to judge entailment against those principles, produces reward models that surpass Bradley-Terry models trained on matched data and reach 86.2 percent on RM-Bench and 81.4 percent on JudgeBench while allowing principle selection at inference time.

What carries the argument

Binary Flexible Feedback extraction that turns natural language comments into yes-no principles and frames reward modeling as an entailment task between a response and each principle.
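
The contrast between the two training objectives can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the entailment task is trained with binary cross-entropy on a single logit per (response, principle) pair, which this summary does not confirm, and all function names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Only the score difference matters, so no explicit criterion is involved.
    """
    return -log_sigmoid(score_chosen - score_rejected)


def entailment_loss(logit: float, satisfied: bool) -> float:
    """Binary cross-entropy for one (response, principle) pair.

    ASSUMPTION: the label says whether the response satisfies the principle,
    and the model emits one logit for that question.
    """
    # 1 - sigmoid(x) == sigmoid(-x), so the negative case flips the sign.
    return -log_sigmoid(logit if satisfied else -logit)


if __name__ == "__main__":
    print(bradley_terry_loss(2.0, 0.0))   # small loss: chosen scored higher
    print(entailment_loss(3.0, True))     # small loss: confident and correct
    print(entailment_loss(3.0, False))    # large loss: confident and wrong
```

The Bradley-Terry loss needs two responses and learns only their relative order, while the entailment loss needs one response plus a stated principle, which is what makes the criterion explicit.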

If this is right

  • Reward models achieve 86.2 percent on RM-Bench and 81.4 percent on JudgeBench.
  • An aligned Qwen3-32B model matches or exceeds o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2 at under five percent of the inference cost.
  • Users can specify any set of principles at inference time to steer the reward model toward chosen quality aspects.
  • The same data produces stronger results than Bradley-Terry training because the binary format supplies explicit criteria.
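
Principle selection at inference time can be pictured as follows. This is a sketch under stated assumptions: `Scorer` stands in for a trained reward model that returns the probability that a response satisfies a principle, `toy_scorer` is a stub with made-up values, and averaging the per-principle scores is an assumed aggregation rule that the paper summary does not specify.

```python
from typing import Callable, Dict, Iterable

# Hypothetical interface: (prompt, response, principle) -> P(principle satisfied).
Scorer = Callable[[str, str, str], float]


def principle_scores(
    prompt: str, response: str, principles: Iterable[str], scorer: Scorer
) -> Dict[str, float]:
    """Score one response against each user-selected principle."""
    return {p: scorer(prompt, response, p) for p in principles}


def aggregate(scores: Dict[str, float]) -> float:
    """Collapse per-principle scores into one reward (ASSUMED: plain mean)."""
    if not scores:
        raise ValueError("at least one principle is required")
    return sum(scores.values()) / len(scores)


def toy_scorer(prompt: str, response: str, principle: str) -> float:
    """Stub standing in for a real reward model; values are illustrative only."""
    table = {"accuracy of information": 0.9, "code readability": 0.2}
    return table.get(principle, 0.5)


if __name__ == "__main__":
    chosen = ["accuracy of information", "code readability"]
    scores = principle_scores("Explain quicksort.", "def qs(a): ...", chosen, toy_scorer)
    print(scores, aggregate(scores))
```

Changing the `chosen` list changes the reward without touching the model, which is the customization the bullet above describes.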

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Custom principle selection at inference time could support domain-specific alignment without new training runs.
  • Making feedback criteria explicit may reduce reward hacking by limiting the reward model to stated principles.
  • The binary decomposition approach might extend to other feedback sources such as automated verifiers or multi-turn conversations.

Load-bearing premise

Natural language feedback can be split into binary principles that keep the main aspects of response quality without losing important detail or adding extraction mistakes.

What would settle it

A head-to-head test on identical data where the binary-principle reward models score no higher than Bradley-Terry models on RM-Bench or JudgeBench would falsify the performance claim.
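
The comparison described above amounts to scoring both reward models on the same preference pairs. A minimal sketch, assuming the benchmark is a list of (chosen, rejected) response pairs and that each model is exposed as a hypothetical `score` callable; neither name comes from the paper.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (chosen, rejected); hypothetical data layout


def pairwise_accuracy(pairs: Sequence[Pair], score: Callable[[str], float]) -> float:
    """Fraction of pairs where the model scores the chosen response higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return wins / len(pairs)


def claim_survives(acc_binary: float, acc_bradley_terry: float) -> bool:
    """The claim fails if the binary-principle model scores no higher."""
    return acc_binary > acc_bradley_terry


if __name__ == "__main__":
    # Toy data and a toy scorer (response length) purely for illustration.
    pairs = [("a detailed answer", "short"), ("ok", "a longer wrong answer")]
    print(pairwise_accuracy(pairs, len))
```

A real test would also need a significance check across seeds or bootstrap resamples before treating a small gap as meaningful.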

Figures

Figures reproduced from arXiv: 2509.21319 by Daniel Egert, Ellie Evans, Felipe Soares, Hoo-Chang Shin, Jiaqi Zeng, Oleksii Kuchaiev, Olivier Delalleau, Yi Dong, Zhilin Wang.

Figure 1: Example of Binary Flexible Feedback in Natural Language.
Figure 2: 40 most frequent words in principles, excluding stop-words. Clarity, accuracy and relevance ...
Original abstract

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes RLBFF, which extracts binary principles (e.g., 'accuracy of information: yes') from natural language human feedback to train reward models as an entailment task rather than standard Bradley-Terry ranking. This enables interpretable rewards and inference-time customization by specifying principles of interest. The authors claim RLBFF reward models outperform matched Bradley-Terry models, achieve state-of-the-art results on RM-Bench (86.2%) and JudgeBench (81.4%, #1 as of September 24, 2025), and provide a fully open recipe (data and code) to align Qwen3-32B via RLBFF to match or exceed o3-mini and DeepSeek-R1 on MT-Bench, WildBench, and Arena Hard v2 at <5% inference cost.

Significance. If the results hold under scrutiny, RLBFF offers a practical bridge between the flexibility of RLHF and the precision of RLVR, with added benefits of interpretability and user-specified customization at inference time. The open release of models, data, and training recipe is a clear strength that supports reproducibility and adoption. The approach could influence post-training practices if the binary decomposition reliably captures nuanced preferences without systematic loss.

major comments (3)
  1. [§3] The principle extraction procedure (described conceptually in the abstract and presumably detailed in §3) provides no specifics on the LLM or prompts used for decomposition, filtering steps, error rates, or validation against original feedback distributions. This is load-bearing for the central claim that binary principles preserve key aspects of response quality, as any systematic loss of nuance or injection of artifacts could explain the reported gains over Bradley-Terry baselines rather than the entailment formulation itself.
  2. [§4, §5] No ablation studies, hyperparameter details, or controls are reported for the 'matched for data' comparison with Bradley-Terry models, nor is the contribution of entailment training separated from that of data curation. The headline results (86.2% RM-Bench, 81.4% JudgeBench) cannot be confidently attributed to RLBFF without these, especially given the free parameter of the extraction process.
  3. [§3] The manuscript does not include any quantitative assessment (e.g., agreement metrics or human validation) of how faithfully the extracted binary principles represent the original natural language feedback, which directly tests the weakest assumption underlying the performance claims.
minor comments (2)
  1. [Abstract] The abstract's reference to the JudgeBench leaderboard position 'as of September 24, 2025' would benefit from a direct link or archived snapshot for independent verification.
  2. [§3] Notation for the entailment task (response satisfies principle or not) could be formalized with a short equation or pseudocode for clarity, especially when contrasting with Bradley-Terry loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify areas where the manuscript can be strengthened. We address each major comment point by point below, indicating the revisions we will make in the next version of the paper.

  1. Referee: [§3] The principle extraction procedure (described conceptually in the abstract and presumably detailed in §3) provides no specifics on the LLM or prompts used for decomposition, filtering steps, error rates, or validation against original feedback distributions. This is load-bearing for the central claim that binary principles preserve key aspects of response quality, as any systematic loss of nuance or injection of artifacts could explain the reported gains over Bradley-Terry baselines rather than the entailment formulation itself.

    Authors: We agree that greater specificity on the extraction procedure is needed to support the central claims and enable reproducibility. While §3 outlines the conceptual approach, the revised manuscript will add a dedicated subsection detailing the exact LLM employed for decomposition, the complete prompts, all filtering steps, observed error rates, and direct validation comparing the extracted binary principles against the original natural language feedback distributions. This addition will allow readers to evaluate potential loss of nuance or introduction of artifacts. revision: yes

  2. Referee: [§4, §5] No ablation studies, hyperparameter details, or controls are reported for the 'matched for data' comparison with Bradley-Terry models, nor is the contribution of entailment training separated from that of data curation. The headline results (86.2% RM-Bench, 81.4% JudgeBench) cannot be confidently attributed to RLBFF without these, especially given the free parameter of the extraction process.

    Authors: We acknowledge that additional controls and ablations are required to confidently attribute performance to the entailment formulation rather than data curation or extraction choices. The current version describes the data-matching procedure, but the revised manuscript will incorporate expanded ablation studies, full hyperparameter details, and targeted controls that isolate the contribution of entailment training from the extraction process. These will be presented in §4 and §5 alongside the main results. revision: yes

  3. Referee: [§3] The manuscript does not include any quantitative assessment (e.g., agreement metrics or human validation) of how faithfully the extracted binary principles represent the original natural language feedback, which directly tests the weakest assumption underlying the performance claims.

    Authors: The referee correctly notes the absence of quantitative fidelity assessment. To directly evaluate whether binary principles preserve key aspects of the original feedback, the revised manuscript will add agreement metrics and human validation results in §3. These will include inter-annotator agreement scores and human ratings of how faithfully the extracted principles capture the original natural language feedback. revision: yes
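
One agreement metric that fits the fidelity check discussed in point 3 is Cohen's kappa between two sets of yes/no labels, for example extracted principle labels versus a human annotator's labels for the same items. The sketch below is a generic implementation of that statistic; the choice of kappa and the example data are assumptions, not something the paper or the rebuttal commits to.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    extracted = ["yes", "yes", "no", "no"]   # illustrative labels only
    human = ["yes", "no", "yes", "no"]
    print(cohens_kappa(extracted, human))
```

A kappa near 1 would indicate that the extracted labels track human judgment closely, while a value near 0 would mean agreement no better than chance.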

Circularity Check

0 steps flagged

No circularity: central claims rest on external benchmark evaluation independent of training inputs

full rationale

The paper introduces RLBFF by extracting binary principles from natural-language feedback and training reward models as an entailment task. Performance is reported on independent public benchmarks (RM-Bench 86.2%, JudgeBench 81.4%) and alignment suites (MT-Bench, WildBench, Arena Hard v2) rather than on any quantity defined from the training data or fitted parameters. No equations, derivations, or self-citations are shown that reduce the claimed improvements to the input feedback or extraction process by construction. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that binary principles extracted from human text can faithfully represent quality dimensions. No new physical entities or mathematical axioms are introduced; the main added element is the principle-extraction step whose details are not specified in the abstract.

free parameters (1)
  • Principle extraction procedure
    The process that turns natural language feedback into a set of binary principles is not described and likely involves modeling choices or additional data.
axioms (1)
  • domain assumption: Binary yes/no answers to extracted principles preserve the essential information in human feedback for reward modeling
    This premise is required for the entailment training to substitute for direct preference modeling.


