RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Daniel Egert; Ellie Evans; Felipe Soares; Hoo-Chang Shin; Jiaqi Zeng; Oleksii Kuchaiev; Olivier Delalleau; Yi Dong; Zhilin Wang

arxiv: 2509.21319 · v3 · pith:ZFFFMQIXnew · submitted 2025-09-25 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-21 22:06 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords reward models · RLHF · binary principles · LLM alignment · entailment · human feedback · verifiable rewards

The pith

Extracting binary yes-no principles from human feedback lets reward models beat traditional preference models on alignment benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to convert free-form human feedback into a set of simple binary principles that can be checked yes or no. These principles turn reward model training into an entailment task that decides whether a response satisfies each principle. The method keeps the flexibility of human judgments while adding the structure of verifiable rules. When trained this way, the resulting models outperform standard Bradley-Terry reward models on the same data and reach top scores on existing leaderboards (86.2 percent on RM-Bench, 81.4 percent on JudgeBench). Users can also choose which principles to apply at inference time, giving the model a custom focus without retraining.

Core claim

Decomposing natural language feedback into binary principles that a response either satisfies or does not satisfy, then training reward models to judge entailment against those principles, produces reward models that surpass Bradley-Terry models trained on matched data and reach 86.2 percent on RM-Bench and 81.4 percent on JudgeBench while allowing principle selection at inference time.

What carries the argument

Binary Flexible Feedback extraction that turns natural language comments into yes-no principles and frames reward modeling as an entailment task between a response and each principle.
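
One way to make the contrast with Bradley-Terry training concrete is sketched below. This is not the paper's own notation: the symbols $r_\theta$ (scalar reward), $q_\theta$ (probability that a response satisfies a principle), $\sigma$ (logistic function), and the binary cross-entropy form of the entailment objective are all assumptions introduced here for illustration. Here $x$ is a prompt, $y$ a response, $p$ a binary principle, and $\ell \in \{0, 1\}$ records whether $y$ satisfies $p$.

```latex
% Bradley-Terry: scalar reward trained on preference pairs (y_w preferred over y_l)
\mathcal{L}_{\mathrm{BT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\bigl( r_\theta(x, y_w) - r_\theta(x, y_l) \bigr) \right]

% Entailment-style (assumed form): does response y satisfy principle p?
\mathcal{L}_{\mathrm{ent}}(\theta)
  = -\,\mathbb{E}_{(x,\,y,\,p,\,\ell)}
    \left[ \ell \log q_\theta(x, y, p)
         + (1 - \ell) \log\bigl( 1 - q_\theta(x, y, p) \bigr) \right]
```

The structural difference is that the first objective only ever sees a relative ordering of two responses, while the second conditions on an explicit principle and judges each response on its own.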

If this is right

Reward models achieve 86.2 percent on RM-Bench and 81.4 percent on JudgeBench.
An aligned Qwen3-32B model matches or exceeds o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2 at under five percent of the inference cost.
Users can specify any set of principles at inference time to steer the reward model toward chosen quality aspects.
The same data produces stronger results than Bradley-Terry training because the binary format supplies explicit criteria.
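
The inference-time steering described above can be sketched as follows. This is a minimal illustration, assuming the reward model returns one "yes" probability per principle; the function name `aggregate_reward`, the example scores, and the choice of a plain average are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable


def aggregate_reward(principle_probs: Dict[str, float], selected: Iterable[str]) -> float:
    """Average the 'yes' probabilities of the selected principles.

    Averaging is an illustrative assumption, not the paper's stated aggregation.
    """
    chosen = list(selected)
    if not chosen:
        raise ValueError("select at least one principle")
    missing = [p for p in chosen if p not in principle_probs]
    if missing:
        raise KeyError(f"no score for principles: {missing}")
    return sum(principle_probs[p] for p in chosen) / len(chosen)


# Hypothetical per-principle 'yes' probabilities for a single response.
scores = {
    "accuracy of information": 0.9,
    "code readability": 0.3,
    "relevance": 0.6,
}

# Same response, two different user-chosen focuses.
all_reward = aggregate_reward(scores, scores.keys())
focused_reward = aggregate_reward(scores, ["accuracy of information"])
```

The point of the sketch is only that the set of principles is an argument at call time, so changing the focus needs no retraining.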

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Custom principle selection at inference time could support domain-specific alignment without new training runs.
Making feedback criteria explicit may reduce reward hacking by limiting the reward model to stated principles.
The binary decomposition approach might extend to other feedback sources such as automated verifiers or multi-turn conversations.

Load-bearing premise

Natural language feedback can be split into binary principles that keep the main aspects of response quality without losing important detail or adding extraction mistakes.

What would settle it

A head-to-head test on identical data where the binary-principle reward models score no higher than Bradley-Terry models on RM-Bench or JudgeBench would falsify the performance claim.
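
A simplified version of that head-to-head check might look like the sketch below. It assumes each benchmark item is a (prompt, chosen, rejected) triple and that each reward model is exposed as a function from (prompt, response) to a score; RM-Bench and JudgeBench have their own scoring protocols, so this is a stand-in, and the toy scorers and data are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# (prompt, chosen_response, rejected_response)
Pair = Tuple[str, str, str]
Scorer = Callable[[str, str], float]


def pairwise_accuracy(score: Scorer, pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the chosen response scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return wins / len(pairs)


def claim_survives(principle_rm: Scorer, bt_rm: Scorer, pairs: Sequence[Pair]) -> bool:
    """True if the principle-based model beats the Bradley-Terry model on these pairs."""
    return pairwise_accuracy(principle_rm, pairs) > pairwise_accuracy(bt_rm, pairs)


# Toy stand-ins for the two reward models (hypothetical, for illustration only).
toy_pairs = [
    ("q1", "good answer", "bad"),
    ("q2", "ok", "a much longer wrong answer"),
]


def toy_bt(prompt: str, response: str) -> float:
    # Length-biased scorer: prefers longer responses.
    return float(len(response))


def toy_principle(prompt: str, response: str) -> float:
    # Scores 0 for responses flagged as wrong, 1 otherwise.
    return 0.0 if ("wrong" in response or response == "bad") else 1.0


bt_acc = pairwise_accuracy(toy_bt, toy_pairs)
principle_acc = pairwise_accuracy(toy_principle, toy_pairs)
```

A real test would also need matched training data for both models and some estimate of variance before reading a small gap as meaningful.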

Figures

Figures reproduced from arXiv: 2509.21319 by Daniel Egert, Ellie Evans, Felipe Soares, Hoo-Chang Shin, Jiaqi Zeng, Oleksii Kuchaiev, Olivier Delalleau, Yi Dong, Zhilin Wang.

**Figure 2.** 40 most frequent words in principles, excluding stop-words. Clarity, accuracy and relevance

Original abstract

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
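
The abstract's examples ("accuracy of information: yes", "code readability: no") suggest a simple record shape for entailment training data. The sketch below assumes principle judgments arrive as a semicolon-separated string; that string format, the `EntailmentExample` type, and the helper names are hypothetical, and the paper's actual extraction step is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EntailmentExample:
    prompt: str
    response: str
    principle: str
    satisfied: bool  # True for "yes", False for "no"


def parse_principles(raw: str) -> List[Tuple[str, bool]]:
    """Parse 'principle: yes; principle: no' into (principle, satisfied) pairs."""
    parsed = []
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the last colon so principle names may themselves contain colons.
        name, sep, verdict = item.rpartition(":")
        verdict = verdict.strip().lower()
        if not sep or verdict not in ("yes", "no"):
            raise ValueError(f"cannot parse principle judgment: {item!r}")
        parsed.append((name.strip(), verdict == "yes"))
    return parsed


def build_examples(prompt: str, response: str, raw: str) -> List[EntailmentExample]:
    """Emit one entailment example per extracted principle."""
    return [
        EntailmentExample(prompt, response, name, satisfied)
        for name, satisfied in parse_principles(raw)
    ]


examples = build_examples(
    "Write a sort function.",
    "def s(x): return sorted(x)",
    "accuracy of information: yes; code readability: no",
)
```

Each feedback comment can yield several examples, one per principle, which is what lets a single response be judged along independent axes.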

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RLBFF converts natural-language feedback into binary principles for entailment-based reward modeling, which beats matched Bradley-Terry baselines on public benchmarks and adds inference-time customizability, backed by a full open release.

The main thing to know about this paper is that it shows how to extract binary principles from human feedback and train reward models to answer them as entailment questions. This setup beats matched Bradley-Terry models and hits top scores on RM-Bench and JudgeBench, while also letting you specify custom principles at inference time.

They do a good job releasing everything openly, including the data and a full recipe to fine-tune Qwen3-32B with their reward model. The aligned model matches or beats o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2 at much lower cost. That's concrete and useful for anyone replicating alignment work. The bridge between RLHF's versatility and RLVR's precision is handled cleanly by framing the reward as satisfaction of these principles, which speaks directly to the interpretability and reward hacking concerns.

The soft spot is the principle extraction itself. The stress test raises a fair point: if the conversion from free-form feedback to binary statements loses nuance or introduces artifacts, the performance edge might not hold up as well as claimed. The abstract mentions strong numbers but skips details on the extraction procedure, filtering, or ablations. Even with the open release, I'd want to see how sensitive the results are to those choices.

This work is for researchers focused on reward modeling in LLM post-training. Someone looking for a new way to handle preferences with more structure would find it relevant. It has enough going for it (benchmarks, open code, a clear technical step) that it merits a full peer review rather than a desk reject. I'd recommend sending it to referees.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes RLBFF, which extracts binary principles (e.g., 'accuracy of information: yes') from natural language human feedback to train reward models as an entailment task rather than standard Bradley-Terry ranking. This enables interpretable rewards and inference-time customization by specifying principles of interest. The authors claim RLBFF reward models outperform matched Bradley-Terry models, achieve state-of-the-art results on RM-Bench (86.2%) and JudgeBench (81.4%, #1 as of September 24, 2025), and provide a fully open recipe (data and code) to align Qwen3-32B via RLBFF to match or exceed o3-mini and DeepSeek-R1 on MT-Bench, WildBench, and Arena Hard v2 at <5% inference cost.

Significance. If the results hold under scrutiny, RLBFF offers a practical bridge between the flexibility of RLHF and the precision of RLVR, with added benefits of interpretability and user-specified customization at inference time. The open release of models, data, and training recipe is a clear strength that supports reproducibility and adoption. The approach could influence post-training practices if the binary decomposition reliably captures nuanced preferences without systematic loss.

major comments (3)

[§3] The principle extraction procedure (described conceptually in the abstract and presumably detailed in §3) provides no specifics on the LLM or prompts used for decomposition, filtering steps, error rates, or validation against original feedback distributions. This is load-bearing for the central claim that binary principles preserve key aspects of response quality, as any systematic loss of nuance or injection of artifacts could explain the reported gains over Bradley-Terry baselines rather than the entailment formulation itself.
[§4, §5] No ablation studies, hyperparameter details, or controls are reported for the 'matched for data' comparison with Bradley-Terry models, nor for the contribution of the entailment training versus data curation. The headline results (86.2% RM-Bench, 81.4% JudgeBench) cannot be confidently attributed to RLBFF without these, especially given the free parameter of the extraction process.
[§3] The manuscript does not include any quantitative assessment (e.g., agreement metrics or human validation) of how faithfully the extracted binary principles represent the original natural language feedback, which directly tests the weakest assumption underlying the performance claims.

minor comments (2)

[Abstract] The abstract's reference to the JudgeBench leaderboard position 'as of September 24, 2025' would benefit from a direct link or archived snapshot for independent verification.
[§3] Notation for the entailment task (response satisfies principle or not) could be formalized with a short equation or pseudocode for clarity, especially when contrasting with Bradley-Terry loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify areas where the manuscript can be strengthened. We address each major comment point by point below, indicating the revisions we will make in the next version of the paper.

Referee: [§3] The principle extraction procedure (described conceptually in the abstract and presumably detailed in §3) provides no specifics on the LLM or prompts used for decomposition, filtering steps, error rates, or validation against original feedback distributions. This is load-bearing for the central claim that binary principles preserve key aspects of response quality, as any systematic loss of nuance or injection of artifacts could explain the reported gains over Bradley-Terry baselines rather than the entailment formulation itself.

Authors: We agree that greater specificity on the extraction procedure is needed to support the central claims and enable reproducibility. While §3 outlines the conceptual approach, the revised manuscript will add a dedicated subsection detailing the exact LLM employed for decomposition, the complete prompts, all filtering steps, observed error rates, and direct validation comparing the extracted binary principles against the original natural language feedback distributions. This addition will allow readers to evaluate potential loss of nuance or introduction of artifacts. revision: yes
Referee: [§4, §5] No ablation studies, hyperparameter details, or controls are reported for the 'matched for data' comparison with Bradley-Terry models, nor for the contribution of the entailment training versus data curation. The headline results (86.2% RM-Bench, 81.4% JudgeBench) cannot be confidently attributed to RLBFF without these, especially given the free parameter of the extraction process.

Authors: We acknowledge that additional controls and ablations are required to confidently attribute performance to the entailment formulation rather than data curation or extraction choices. The current version describes the data-matching procedure, but the revised manuscript will incorporate expanded ablation studies, full hyperparameter details, and targeted controls that isolate the contribution of entailment training from the extraction process. These will be presented in §4 and §5 alongside the main results. revision: yes
Referee: [§3] The manuscript does not include any quantitative assessment (e.g., agreement metrics or human validation) of how faithfully the extracted binary principles represent the original natural language feedback, which directly tests the weakest assumption underlying the performance claims.

Authors: The referee correctly notes the absence of quantitative fidelity assessment. To directly evaluate whether binary principles preserve key aspects of the original feedback, the revised manuscript will add agreement metrics and human validation results in §3. These will include inter-annotator agreement scores and human ratings of how faithfully the extracted principles capture the original natural language feedback. revision: yes
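
The agreement metrics promised in the last response are not specified. One common choice for binary labels is Cohen's kappa, sketched here under the assumption that each extracted principle label is compared against an independent human yes/no label for the same item; the function name and the example labels are hypothetical.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal 'yes' rate.
    p_a_yes = sum(a) / n
    p_b_yes = sum(b) / n
    expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is formally
        # undefined, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: extracted principle verdicts vs. human validation.
extracted = [True, True, False, False]
human = [True, False, False, False]
kappa = cohen_kappa(extracted, human)
```

Kappa corrects raw agreement for chance, which matters when most principles are answered "yes" and raw agreement would look high by default.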

Circularity Check

0 steps flagged

No circularity: central claims rest on external benchmark evaluation independent of training inputs

full rationale

The paper introduces RLBFF by extracting binary principles from natural-language feedback and training reward models as an entailment task. Performance is reported on independent public benchmarks (RM-Bench 86.2%, JudgeBench 81.4%) and alignment suites (MT-Bench, WildBench, Arena Hard v2) rather than on any quantity defined from the training data or fitted parameters. No equations, derivations, or self-citations are shown that reduce the claimed improvements to the input feedback or extraction process by construction. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that binary principles extracted from human text can faithfully represent quality dimensions. No new physical entities or mathematical axioms are introduced; the main added element is the principle-extraction step whose details are not specified in the abstract.

free parameters (1)

Principle extraction procedure
The process that turns natural language feedback into a set of binary principles is not described and likely involves modeling choices or additional data.

axioms (1)

domain assumption: Binary yes/no answers to extracted principles preserve the essential information in human feedback for reward modeling
This premise is required for the entailment training to substitute for direct preference modeling.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 12 internal anchors

[1]

Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, and Genta Indra Winata

David Anugraha, Zilu Tang, Lester James V. Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, and Genta Indra Winata. R3: Robust rubric-agnostic reward models, 2025. URL https://arxiv.org/abs/2505.13388

work page arXiv 2025
[2]

rapidfuzz/rapidfuzz: Release 3.13.0, April 2025

Max Bachmann. rapidfuzz/rapidfuzz: Release 3.13.0, April 2025. URL https://doi.org/10.5281/zenodo.15133267

work page doi:10.5281/zenodo.15133267 2025
[3]

Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page 2022
[4]

In Practice and Experience in Advanced Research Comput- ing 2019: Rise of the Machines (Learning) , PEARC ’19

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, and Heng Ji. Rm-r1: Reward modeling as reasoning, 2025. URL https://arxiv.org/abs/2505.02387

work page arXiv 2025
[5]

Gonzalez, and Ion Stoica

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024

work page 2024
[6]

Lmunit-llama3.1-70b

ContextualAI. Lmunit-llama3.1-70b. https://huggingface.co/ContextualAI/LMUnit-llama3.1-70b, 2025 a

work page 2025
[7]

Lmunit-qwen2.5-72b

ContextualAI. Lmunit-qwen2.5-72b. https://huggingface.co/ContextualAI/LMUnit-qwen2.5-72b, 2025 b

work page 2025
[8]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2025. URL https://arxiv.org/abs/2404.04475

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization, 2024. URL https://arxiv.org/abs/2402.01306

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Team Gemma, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Pitfalls of rule- and model-based verifiers -- a case study on mathematical reasoning, 2025

Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, and Junxian He. Pitfalls of rule- and model-based verifiers -- a case study on mathematical reasoning, 2025. URL https://arxiv.org/abs/2505.22203

work page arXiv 2025
[13]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Characterizing non-chain restaurants’ yelp star-ratings: Generalizable findings from a representative sample of yelp reviews

Daniel Keller and Maria Kostromitina. Characterizing non-chain restaurants’ yelp star-ratings: Generalizable findings from a representative sample of yelp reviews. International Journal of Hospitality Management, 86: 0 102440, 2020. ISSN 0278-4319. doi:https://doi.org/10.1016/j.ijhm.2019.102440. URL https://www.sciencedirect.com/science/article/pii/S02784...

work page doi:10.1016/j.ijhm.2019.102440 2020
[15]

Prometheus 2: An open source language model specialized in evaluating other language models, 2024

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models, 2024. URL https://arxiv.org/abs/2405.01535

work page arXiv 2024
[16]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Gonzalez, and Ion Stoica

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The Arena-Hard pipeline. https://lmsys.org/blog/2024-04-19-arena-hard/, April 2024

work page 2024
[18]

Wildbench: Benchmarking LLM s with challenging tasks from real users in the wild

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking LLM s with challenging tasks from real users in the wild. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=MKEHCx25xp

work page 2025
[19]

Rm-bench: Benchmarking reward models of language models with subtlety and style, 2024

Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Rm-bench: Benchmarking reward models of language models with subtlety and style, 2024. URL https://arxiv.org/abs/2410.16184

work page arXiv 2024
[20]

RM -bench: Benchmarking reward models of language models with subtlety and style

Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. RM -bench: Benchmarking reward models of language models with subtlety and style. In The Thirteenth International Conference on Learning Representations, 2025 a . URL https://openreview.net/forum?id=QEHrmQPBdd

work page 2025
[21]

Inference-time scaling for generalist reward modeling,

Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling, 2025 b . URL https://arxiv.org/abs/2504.02495

work page arXiv 2025
[22]

Arena-hard-auto leaderboard

LMSys. Arena-hard-auto leaderboard. https://github.com/lm-sys/arena-hard-auto, 2024

work page 2024
[23]

SimPO : Simple preference optimization with a reference-free reward, 2024

Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO : Simple preference optimization with a reference-free reward, 2024

work page 2024
[24]

Rule based rewards for language model safety, 2024

Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, and Lilian Weng. Rule based rewards for language model safety, 2024. URL https://arxiv.org/abs/2411.01111

work page arXiv 2024
[25]

MTEB : Massive text embedding benchmark

Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB : Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.\ 2014--2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. d...

work page doi:10.18653/v1/2023.eacl-main.148 2014
[26]

nvidia/HelpSteer3\#feedback

NVIDIA. nvidia/HelpSteer3\#feedback . https://huggingface.co/datasets/nvidia/HelpSteer3#feedback, 2025 a

work page 2025
[27]

Nemo rl: A scalable and efficient post-training library

Nemo NVIDIA. Nemo rl: A scalable and efficient post-training library. https://github.com/NVIDIA-NeMo/RL, 2025 b . GitHub repository

work page 2025
[28]

Openai model spec, Apr 2025

OpenAI. Openai model spec, Apr 2025. URL https://model-spec.openai.com/2025-04-11.html

work page 2025
[29]

Openrouter

OpenRouter. Openrouter. https://openrouter.ai/models?fmt=table, 2025

work page 2025
[30]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022

work page 2022
[31]

Generalizing Verifiable Instruction Following

Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following, 2025. URL https://arxiv.org/abs/2507.02833

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Lmunit: Fine-grained evaluation with natural language unit tests, 2024

Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, and Shikib Mehri. Lmunit: Fine-grained evaluation with natural language unit tests, 2024. URL https://arxiv.org/abs/2412.13091

work page arXiv 2024
[33]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models, 2023

work page 2023
[35]

NeMo-Aligner : Scalable toolkit for efficient model alignment, 2024

Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, Markel Sanz Ausin, Ashwath Aithal, and Oleksii Kuchaiev. NeMo-Aligner : Scalable toolkit for efficient model alignment, 2024

work page 2024
[36]

Judgebench: A benchmark for evaluating LLM -based judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, and Ion Stoica. Judgebench: A benchmark for evaluating LLM -based judges. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=G0dksFayVq

work page 2025
[37]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Ha...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Rm-bench leaderboard

THU-KEG. Rm-bench leaderboard. https://github.com/THU-KEG/RM-Bench-Leaderboard, 2025

work page 2025
[39]

Alan Wake, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Fan Zhou, Feng Hu, Ge Zhang, Guoyin Wang, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qicheng Hu, Shawn Wang, Shijun Zhou, Shiming Yang, Shiyong Li, Tianhang Zhu, Wen Xie, Wenhao Huang, Xi...

work page arXiv 2025
[40]

Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev

Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. Helpsteer 2: Open-source dataset for training top-performing reward models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/...

work page 2024
[41]

Helpsteer2-preference: Complementing ratings with preferences

Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, and Yi Dong. Helpsteer2-preference: Complementing ratings with preferences. In The Thirteenth International Conference on Learning Representations, 2025 a . URL https://openreview.net/forum?id=MnfHxPP5gs

work page 2025
[42]

H elp S teer3: Human-annotated feedback and edit data to empower inference-time scaling in open-ended general-domain tasks

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Daniel Egert, Ellie Evans, Hoo-Chang Shin, Felipe Soares, Yi Dong, and Oleksii Kuchaiev. H elp S teer3: Human-annotated feedback and edit data to empower inference-time scaling in open-ended general-domain tasks. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings o...

work page doi:10.18653/v1/2025.acl-long.1246 2025
[43]

HelpSteer3-preference: Open human-annotated preference data across diverse tasks and languages,

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Hoo-Chang Shin, Felipe Soares, Alexander Bukharin, Ellie Evans, Yi Dong, and Oleksii Kuchaiev. Helpsteer3-preference: Open human-annotated preference data across diverse tasks and languages, 2025 c . URL https://arxiv.org/abs/2505.11475

work page arXiv 2025
[44]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022. URL https://arxiv.org/abs/2109.01652

work page internal anchor Pith review Pith/arXiv arXiv 2022
[45]

Reward hacking in reinforcement learning

Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, Nov 2024. URL https://lilianweng.github.io/posts/2024-11-28-reward-hacking/

work page 2024
[47]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Rewardanything: Generalizable principle-following reward models, 2025

Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, and Wei Ye. Rewardanything: Generalizable principle-following reward models, 2025. URL https://arxiv.org/abs/2506.03637

work page arXiv 2025
[49]

arXiv:2408.15240

Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction, 2024. URL https://arxiv. org/abs/2408.15240, 2024

work page arXiv 2024
[50]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models, 2025. URL https://arxiv.org/abs/2506.05176

[51]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

[52]

Iterative data smoothing: Mitigating reward overfitting and overoptimization in rlhf

Banghua Zhu, Michael I. Jordan, and Jiantao Jiao. Iterative data smoothing: Mitigating reward overfitting and overoptimization in rlhf, 2024. URL https://arxiv.org/abs/2401.16335


[1]

R3: Robust rubric-agnostic reward models

David Anugraha, Zilu Tang, Lester James V. Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, and Genta Indra Winata. R3: Robust rubric-agnostic reward models, 2025. URL https://arxiv.org/abs/2505.13388

[2]

rapidfuzz/rapidfuzz: Release 3.13.0

Max Bachmann. rapidfuzz/rapidfuzz: Release 3.13.0, April 2025. URL https://doi.org/10.5281/zenodo.15133267

[3]

Training a helpful and harmless assistant with reinforcement learning from human feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

[4]

Rm-r1: Reward modeling as reasoning

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, and Heng Ji. Rm-r1: Reward modeling as reasoning, 2025. URL https://arxiv.org/abs/2505.02387

[5]

Chatbot arena: An open platform for evaluating llms by human preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024

[6]

Lmunit-llama3.1-70b

ContextualAI. Lmunit-llama3.1-70b. https://huggingface.co/ContextualAI/LMUnit-llama3.1-70b, 2025a

[7]

Lmunit-qwen2.5-72b

ContextualAI. Lmunit-qwen2.5-72b. https://huggingface.co/ContextualAI/LMUnit-qwen2.5-72b, 2025b

[8]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

[9]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2025. URL https://arxiv.org/abs/2404.04475

[10]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization, 2024. URL https://arxiv.org/abs/2402.01306

[11]

Team Gemma, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

[12]

Pitfalls of rule- and model-based verifiers -- a case study on mathematical reasoning

Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, and Junxian He. Pitfalls of rule- and model-based verifiers -- a case study on mathematical reasoning, 2025. URL https://arxiv.org/abs/2505.22203

[13]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

[14]

Characterizing non-chain restaurants’ yelp star-ratings: Generalizable findings from a representative sample of yelp reviews

Daniel Keller and Maria Kostromitina. Characterizing non-chain restaurants’ yelp star-ratings: Generalizable findings from a representative sample of yelp reviews. International Journal of Hospitality Management, 86:102440, 2020. ISSN 0278-4319. doi:10.1016/j.ijhm.2019.102440. URL https://www.sciencedirect.com/science/article/pii/S02784...

[15]

Prometheus 2: An open source language model specialized in evaluating other language models

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models, 2024. URL https://arxiv.org/abs/2405.01535

[16]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

[17]

From live data to high-quality benchmarks: The Arena-Hard pipeline

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The Arena-Hard pipeline. https://lmsys.org/blog/2024-04-19-arena-hard/, April 2024

[18]

Wildbench: Benchmarking LLMs with challenging tasks from real users in the wild

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking LLMs with challenging tasks from real users in the wild. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=MKEHCx25xp

[19]

Rm-bench: Benchmarking reward models of language models with subtlety and style

Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Rm-bench: Benchmarking reward models of language models with subtlety and style, 2024. URL https://arxiv.org/abs/2410.16184

[20]

RM-bench: Benchmarking reward models of language models with subtlety and style

Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. RM-bench: Benchmarking reward models of language models with subtlety and style. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=QEHrmQPBdd

[21]

Inference-time scaling for generalist reward modeling

Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling, 2025b. URL https://arxiv.org/abs/2504.02495

[22]

Arena-hard-auto leaderboard

LMSys. Arena-hard-auto leaderboard. https://github.com/lm-sys/arena-hard-auto, 2024

[23]

SimPO: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple preference optimization with a reference-free reward, 2024

[24]

Rule based rewards for language model safety

Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, and Lilian Weng. Rule based rewards for language model safety, 2024. URL https://arxiv.org/abs/2411.01111

[25]

MTEB: Massive text embedding benchmark

Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. d...

[26]

nvidia/HelpSteer3#feedback

NVIDIA. nvidia/HelpSteer3#feedback. https://huggingface.co/datasets/nvidia/HelpSteer3#feedback, 2025a

[27]

Nemo rl: A scalable and efficient post-training library

Nemo NVIDIA. Nemo rl: A scalable and efficient post-training library. https://github.com/NVIDIA-NeMo/RL, 2025b. GitHub repository

[28]

Openai model spec

OpenAI. Openai model spec, Apr 2025. URL https://model-spec.openai.com/2025-04-11.html

[29]

Openrouter

OpenRouter. Openrouter. https://openrouter.ai/models?fmt=table, 2025

[30]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022

[31]

Generalizing Verifiable Instruction Following

Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following, 2025. URL https://arxiv.org/abs/2507.02833

[32]

Lmunit: Fine-grained evaluation with natural language unit tests

Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, and Shikib Mehri. Lmunit: Fine-grained evaluation with natural language unit tests, 2024. URL https://arxiv.org/abs/2412.13091

[33]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

[34]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models, 2023

[35]

NeMo-Aligner: Scalable toolkit for efficient model alignment

Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, Markel Sanz Ausin, Ashwath Aithal, and Oleksii Kuchaiev. NeMo-Aligner: Scalable toolkit for efficient model alignment, 2024

[36]

Judgebench: A benchmark for evaluating LLM-based judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, and Ion Stoica. Judgebench: A benchmark for evaluating LLM-based judges. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=G0dksFayVq

[37]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Ha...

[38]

Rm-bench leaderboard

THU-KEG. Rm-bench leaderboard. https://github.com/THU-KEG/RM-Bench-Leaderboard, 2025

[39]

Alan Wake, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Fan Zhou, Feng Hu, Ge Zhang, Guoyin Wang, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qicheng Hu, Shawn Wang, Shijun Zhou, Shiming Yang, Shiyong Li, Tianhang Zhu, Wen Xie, Wenhao Huang, Xi...

[40]

Helpsteer 2: Open-source dataset for training top-performing reward models

Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. Helpsteer 2: Open-source dataset for training top-performing reward models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/...

[41]

Helpsteer2-preference: Complementing ratings with preferences

Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, and Yi Dong. Helpsteer2-preference: Complementing ratings with preferences. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=MnfHxPP5gs

[42]

HelpSteer3: Human-annotated feedback and edit data to empower inference-time scaling in open-ended general-domain tasks

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Daniel Egert, Ellie Evans, Hoo-Chang Shin, Felipe Soares, Yi Dong, and Oleksii Kuchaiev. HelpSteer3: Human-annotated feedback and edit data to empower inference-time scaling in open-ended general-domain tasks. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings o...

[43]

HelpSteer3-preference: Open human-annotated preference data across diverse tasks and languages

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Hoo-Chang Shin, Felipe Soares, Alexander Bukharin, Ellie Evans, Yi Dong, and Oleksii Kuchaiev. Helpsteer3-preference: Open human-annotated preference data across diverse tasks and languages, 2025c. URL https://arxiv.org/abs/2505.11475

[44]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022. URL https://arxiv.org/abs/2109.01652

[45]

Reward hacking in reinforcement learning

Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, Nov 2024. URL https://lilianweng.github.io/posts/2024-11-28-reward-hacking/

[47]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025b

[48]

Rewardanything: Generalizable principle-following reward models

Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, and Wei Ye. Rewardanything: Generalizable principle-following reward models, 2025. URL https://arxiv.org/abs/2506.03637

[49]

Generative verifiers: Reward modeling as next-token prediction

Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction, 2024. URL https://arxiv.org/abs/2408.15240

[50]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models, 2025. URL https://arxiv.org/abs/2506.05176

[51]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

[52]

Iterative data smoothing: Mitigating reward overfitting and overoptimization in rlhf

Banghua Zhu, Michael I. Jordan, and Jiantao Jiao. Iterative data smoothing: Mitigating reward overfitting and overoptimization in rlhf, 2024. URL https://arxiv.org/abs/2401.16335
