arxiv: 2507.06419 · v2 · submitted 2025-07-08 · 💻 cs.CL

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Pankayaraj Pathmanathan, Furong Huang

Pith reviewed 2026-05-19 05:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords reward modeling · adversarial examples · self-improvement · LLM alignment · robustness · controlled decoding · preference learning · spurious correlations

The pith

Reward models can discover their own failure modes by guiding generation of incorrectly scored responses and retrain on them to gain robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces REFORM, a self-improving framework in which the reward model itself directs controlled decoding to produce responses that it scores falsely. These examples serve as adversarial training data that patches misaligned scoring without any external knowledge of preference distributions or failure types. Experiments on Anthropic HH and PKU Beavertails show gains in robustness while reward accuracy and downstream policy performance stay intact. The method also reduces spurious correlations that hurt alignment quality.

Core claim

By using the reward model to steer text generation toward responses that maximize or minimize its own scores in controlled ways, REFORM produces adversarial examples that expose and then correct the model's errors when added back into training data.

What carries the argument

Reward-guided controlled decoding, which steers generation using the reward model to surface falsely scored responses that serve as training patches.
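A minimal sketch of what reward-guided controlled decoding can look like, under heavy assumptions. The toy vocabulary, `toy_policy_logprobs`, `toy_reward`, and the guidance weight `w` are all hypothetical stand-ins, and the token-level rule used here (policy log-probability plus a weighted reward term over the policy's top-k candidates) is a common formulation from the reward-guided search literature, not necessarily the objective the paper uses. A negative `w` steers toward low-reward continuations and a positive `w` toward high-reward ones.

```python
import math

# Hypothetical toy vocabulary; a real system would use an LLM tokenizer.
VOCAB = ["sorry", "cannot", "help", "sure", "here", "<eos>"]

def toy_policy_logprobs(prefix):
    """Hypothetical base policy: discourages repeating tokens and becomes
    more likely to stop as the prefix grows."""
    weights = []
    for tok in VOCAB:
        if tok == "<eos>":
            weights.append(0.5 + 0.5 * len(prefix))
        elif tok in prefix:
            weights.append(0.2)
        else:
            weights.append(1.0)
    total = sum(weights)
    return {tok: math.log(w / total) for tok, w in zip(VOCAB, weights)}

def toy_reward(tokens):
    """Hypothetical reward model with a spurious correlation: it rewards the
    surface token 'sure' and penalizes 'sorry', regardless of content."""
    return 1.0 * tokens.count("sure") - 1.0 * tokens.count("sorry")

def reward_guided_decode(w, top_k=5, max_len=4):
    """Greedy decoding that rescores the policy's top-k candidates with
    score = log pi(token | prefix) + w * reward(prefix + [token])."""
    prefix = []
    for _ in range(max_len):
        logprobs = toy_policy_logprobs(prefix)
        candidates = sorted(logprobs, key=logprobs.get, reverse=True)[:top_k]
        best = max(
            candidates,
            key=lambda t: logprobs[t] + w * toy_reward(prefix + [t]),
        )
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix

low = reward_guided_decode(w=-2.0)   # steer toward low reward
high = reward_guided_decode(w=+2.0)  # steer toward high reward
print("low-reward sample: ", low, toy_reward(low))
print("high-reward sample:", high, toy_reward(high))
```

Restricting the search to the policy's top-k candidates is what keeps the samples plausible under the policy. Whether a sample is a real failure mode still depends on an independent judgment of its preference class, which this toy does not model.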

If this is right

Robustness increases on the Anthropic HH and PKU Beavertails datasets without loss of reward model accuracy.
Performance holds in both direct evaluation and when the improved reward model is used for downstream policy training.
Alignment quality rises because spurious correlations are removed from the learned preferences.
The approach requires no prior information about which attributes or distributions cause failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Iterative application of the same loop could produce repeated rounds of self-correction with diminishing returns on new data.
The technique might extend to other alignment modules such as safety filters or value heads that also rely on preference signals.
If the generated examples prove distributionally realistic, the method could lower the volume of human preference labels needed for reliable reward models.

Load-bearing premise

The responses created by reward-guided controlled decoding are genuine failure modes rather than artifacts produced only by the decoding procedure.

What would settle it

Retraining with the generated examples fails to improve accuracy on a separate set of human-collected or independently generated out-of-distribution preference pairs.
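The check described above reduces to comparing pairwise accuracy before and after retraining on preference pairs the reward model never saw. The sketch below shows the bookkeeping only. The three held-out pairs and both reward functions (`reward_before`, `reward_after`) are invented for illustration, and a real test would use human-collected or independently generated out-of-distribution pairs.

```python
def pairwise_accuracy(reward_fn, pairs):
    """Fraction of (prompt, chosen, rejected) triples in which the reward
    model scores the chosen response strictly above the rejected one."""
    correct = sum(
        reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)

# Hypothetical held-out preference pairs: (prompt, chosen, rejected).
heldout = [
    ("How do I reset my password?", "Use the account settings page.", "Sure sure sure!"),
    ("Is this safe to mix?", "No, that combination is dangerous.", "Sure, go ahead."),
    ("Summarize the memo.", "The memo announces a new policy.", "Memo."),
]

def reward_before(prompt, response):
    # Hypothetical reward with a spurious preference for the token "sure".
    return response.lower().count("sure") + 0.1 * len(response.split())

def reward_after(prompt, response):
    # Hypothetical reward after patching: the "sure" shortcut is penalized.
    return 0.1 * len(response.split()) - response.lower().count("sure")

acc_before = pairwise_accuracy(reward_before, heldout)
acc_after = pairwise_accuracy(reward_after, heldout)
print(f"before: {acc_before:.2f}  after: {acc_after:.2f}")
```

If `acc_after` does not exceed `acc_before` on pairs of this kind, the retraining has not delivered the robustness the method claims.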

Figures

Figures reproduced from arXiv: 2507.06419 by Furong Huang, Pankayaraj Pathmanathan.

**Figure 2.** Robust reward modeling via self-improvement.

**Figure 3.** Generating y′+ (false negatives) via controlled decoding: Given a prompt x, a base policy π𝒟 proposes likely continuations. A reward model rϕ then guides the decoding toward class-consistent but reward-inconsistent outputs (in this instance, toward low-reward preferred responses). To generate a valid failure mode, an example must satisfy two conditions: (W1) it must unambiguously belong to a known preference …

**Figure 4.** Generating y′− (false positives) via controlled decoding: Given a prompt x, a misaligned policy π𝒟−∞ proposes likely continuations. A reward model rϕ then guides the decoding toward class-consistent but reward-inconsistent outputs (in this instance, toward high-reward non-preferred responses). Generating y′− (false positives). To generate non-preferred responses with falsely high reward, we adopt a s…

**Figure 5.** Figure 5a shows that REFORM finds failure examples with better efficacy than the baselines in the category of preferred responses. Figure 5b shows that REFORM finds failure examples with reasonable efficacy relative to the baselines in the category of non-preferred responses. For a discussion of the drop in coverage for REFORM in this setting of non-preferred/rejected responses, refer to Appendix A. This highlight…

**Figure 6.** Robustness of the finetuned reward to perturbation as measured by drop in win rate (…

**Figure 7.** Quality of the finetuned reward in best-of-N alignment (PKU)

**Figure 8.** Quality of the finetuned reward in PPO (PKU)

**Figure 9.** Quality of the finetuned reward in DPO (PKU)

**Figure 10.** Rejection response reward attribution analysis (PKU)

**Figure 13.** Training loss landscape in reward learning

**Figure 11.** Falsely rejected/non-preferred responses (attribute-based method (…

**Figure 12.** Falsely rejected/non-preferred responses (attribute-based method (…

**Figure 14.** Quality of the finetuned reward in DPO (HH)

Original abstract

Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.
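The abstract does not state the training objective, so the following is only a worked formulation under an assumption: that the reward model r_ϕ is trained with the standard Bradley-Terry pairwise loss. The symbol 𝒟_adv for the set of generated adversarial pairs is introduced here for illustration and does not appear in the source. Under that assumption, a falsely scored pair and the augmented training loss can be written as:

```latex
% Assumed failure condition: the preference class is preserved
% (y'^{+} is still preferred over y'^{-}) but the reward ordering is inverted.
r_\phi(x, y'^{+}) < r_\phi(x, y'^{-})

% Assumed Bradley-Terry loss over the original data \mathcal{D}
% augmented with the generated pairs \mathcal{D}_{\mathrm{adv}}.
\mathcal{L}(\phi) =
  -\,\mathbb{E}_{(x,\, y^{+},\, y^{-}) \,\sim\, \mathcal{D} \,\cup\, \mathcal{D}_{\mathrm{adv}}}
  \left[ \log \sigma\!\left( r_\phi(x, y^{+}) - r_\phi(x, y^{-}) \right) \right]
```

Each pair in 𝒟_adv contributes a large loss term at the start of retraining precisely because its reward difference has the wrong sign, which is the sense in which the generated examples "patch" the model.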

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

REFORM gives reward models a self-augmentation loop via reward-guided decoding to patch their own failures, but the gains rest on unverified claims that the generated examples are real adversarial cases rather than decoding artifacts.

The main point is that this paper offers a practical way for reward models to find and fix their own weaknesses without needing upfront knowledge of what those weaknesses look like. It uses the current model to steer generation toward responses it will score wrongly, then folds those examples back into training. That loop is the core contribution and it is presented cleanly as preference-distribution agnostic, which sets it apart from earlier work that assumes you already know the failure attributes or distributions to target.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes REFORM, a self-improving reward modeling framework that discovers failure modes via reward-guided controlled decoding. The current reward model steers generation of 'falsely scored responses,' which are added to the training set to patch misaligned behavior. The approach is evaluated on the Anthropic HH and PKU Beavertails preference datasets and is claimed to improve robustness to distributional shifts and adversarial perturbations without degrading reward quality, while also enhancing downstream policy alignment by removing spurious correlations.

Significance. If the generated examples are verified to constitute genuine adversarial or out-of-distribution failure modes (rather than artifacts of the decoding procedure), the method would provide a practical, preference-distribution-agnostic route to RM self-improvement. This could meaningfully advance robust reward modeling for alignment, especially given the limited coverage of existing preference datasets. The self-referential loop is novel but hinges on the empirical isolation of the reward-guidance mechanism.

major comments (3)

[§4 and §5] §4 (Method) and §5 (Experiments): The central robustness claim rests on the generated responses representing genuine distributional-shift or adversarial failure modes. The manuscript must include an ablation that holds the number of augmented examples fixed while removing reward guidance (e.g., sampling from the base policy or a non-reward-guided adversarial generator) to isolate whether improvements are attributable to the proposed mechanism rather than decoding-induced biases such as length or fluency shifts.
[§5.1] §5.1 (Results on HH and Beavertails): The abstract asserts that REFORM 'significantly improves robustness' and 'further improves alignment quality,' yet the provided description supplies no quantitative metrics, baseline comparisons, statistical tests, or effect sizes. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.
[§3.2] §3.2 (Controlled Decoding Procedure): The description of how reward-guided decoding produces responses that lie outside the original data distribution in a manner reflecting real human-preference mismatches (as opposed to artifacts) is insufficient. Additional diagnostics, such as human evaluation of the generated pairs or distribution-shift metrics, are needed to support the failure-mode interpretation.
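The size-matched ablation requested in the first major comment can be sketched as follows. This is an illustration of the experimental bookkeeping only: `build_matched_sets`, `mean_length`, and the example strings are hypothetical, and the actual generation and retraining steps are out of scope here. The point is that both arms contain the same number of augmented examples, and that a simple length statistic is reported so a length shift is visible rather than hidden.

```python
import random

def build_matched_sets(guided, unguided_pool, seed=0):
    """Return two augmentation sets of equal size: the reward-guided examples
    and a random subsample of examples generated without reward guidance."""
    if len(unguided_pool) < len(guided):
        raise ValueError("unguided pool is smaller than the guided set")
    rng = random.Random(seed)
    unguided = rng.sample(unguided_pool, len(guided))
    return {"reward_guided": list(guided), "base_policy_only": unguided}

def mean_length(responses):
    """Average whitespace-token length, a crude proxy for length bias."""
    return sum(len(r.split()) for r in responses) / len(responses)

# Hypothetical generated responses for each arm.
guided = ["sure here is a very long detailed answer", "sorry no"]
pool = [
    "a plain answer",
    "another plain answer here",
    "short",
    "one more sampled answer",
]

arms = build_matched_sets(guided, pool)
for name, examples in arms.items():
    print(name, len(examples), mean_length(examples))
```

Each arm would then be used to retrain the same starting reward model, so that any robustness gap can be attributed to reward guidance rather than to the amount of extra data.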

minor comments (2)

[§3] Clarify the precise formulation of the controlled decoding objective (e.g., the balance between reward guidance and fluency constraints) with explicit equations or pseudocode.
[§2] Add a short related-work paragraph contrasting REFORM with prior adversarial or self-training approaches for reward models to better situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's suggestions for improving the clarity and rigor of our claims regarding REFORM. We respond to each major comment in turn and indicate the changes we will make in the revised version.

Referee: [§4 and §5] §4 (Method) and §5 (Experiments): The central robustness claim rests on the generated responses representing genuine distributional-shift or adversarial failure modes. The manuscript must include an ablation that holds the number of augmented examples fixed while removing reward guidance (e.g., sampling from the base policy or a non-reward-guided adversarial generator) to isolate whether improvements are attributable to the proposed mechanism rather than decoding-induced biases such as length or fluency shifts.

Authors: We fully agree that an ablation isolating the reward guidance is essential to rule out confounding factors. We will add this ablation to the revised manuscript, generating an equal number of augmented examples using non-reward-guided methods (e.g., standard sampling from the base LLM) and comparing the resulting robustness improvements directly to those from REFORM.
revision: yes

Referee: [§5.1] §5.1 (Results on HH and Beavertails): The abstract asserts that REFORM 'significantly improves robustness' and 'further improves alignment quality,' yet the provided description supplies no quantitative metrics, baseline comparisons, statistical tests, or effect sizes. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.

Authors: Section 5.1 of the full paper presents quantitative results, with tables comparing REFORM to baselines on robustness and alignment-quality metrics for both datasets. To make the size and reliability of the gains easier to assess, we will add explicit effect sizes, confidence intervals, and statistical tests (e.g., paired t-tests) in the revised version.
revision: yes

Referee: [§3.2] §3.2 (Controlled Decoding Procedure): The description of how reward-guided decoding produces responses that lie outside the original data distribution in a manner reflecting real human-preference mismatches (as opposed to artifacts) is insufficient. Additional diagnostics, such as human evaluation of the generated pairs or distribution-shift metrics, are needed to support the failure-mode interpretation.

Authors: We recognize the need for stronger evidence that the generated examples are true failure modes. In the revision, we will include distribution-shift metrics (e.g., KL divergence or cosine similarity in embedding space between generated and original responses) and conduct a small-scale human evaluation to verify that the pairs exhibit preference mismatches not present in the original data.
revision: yes
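Two of the diagnostics promised in these responses are simple enough to sketch with the Python standard library: a paired t-statistic over matched runs and a cosine similarity between embedding vectors. The numbers below are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic and degrees of freedom for matched measurements,
    e.g. the same seeds or test splits evaluated under two methods."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(n)), n - 1

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Placeholder per-seed robustness scores (illustrative only).
method_scores = [0.80, 0.82, 0.79, 0.85, 0.81]
baseline_scores = [0.70, 0.75, 0.72, 0.74, 0.73]
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")

# Placeholder embeddings for an original and a generated response.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Turning the t-statistic into a p-value requires the t-distribution CDF, for which a library routine such as `scipy.stats.ttest_rel` is the usual choice.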

Circularity Check

0 steps flagged

No circularity: REFORM is an empirical self-augmentation method evaluated on external benchmarks

The paper proposes REFORM as a method that uses the current reward model to guide controlled decoding for generating candidate adversarial examples, augments the training set with those examples, and retrains. This is a standard self-training loop rather than a derivation that reduces to its inputs by construction. The central claims of improved robustness and removal of spurious correlations are supported by direct evaluation on the held-out portions of Anthropic HH and PKU Beavertails plus downstream policy training, none of which are defined in terms of the generated examples themselves. No equations, fitted parameters, or uniqueness theorems are shown to be equivalent to the input data or prior self-citations; the framework therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full paper text unavailable, so ledger entries are inferred at the level of stated assumptions in the abstract.

axioms (1)

Domain assumption: Reward models can be used to guide controlled decoding toward responses that expose their own mis-scoring behavior. This premise underpins the entire failure-mode discovery step.

pith-pipeline@v0.9.0 · 5749 in / 1125 out tokens · 46583 ms · 2026-05-19T05:13:02.246248+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding... REFORM... augments the training data and patch the reward model's misaligned behavior."

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We define a failure mode... as a perturbed pair (y′+, y′−) such that the preference class is preserved but the reward ordering is inverted."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
