pith. the verified trust layer for science.

arxiv: 2507.06419 · v2 · submitted 2025-07-08 · 💻 cs.CL

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Pith reviewed 2026-05-19 05:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords reward modeling · adversarial examples · self-improvement · LLM alignment · robustness · controlled decoding · preference learning · spurious correlations

The pith

Reward models can discover their own failure modes by guiding generation of incorrectly scored responses and retrain on them to gain robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces REFORM, a self-improving framework in which the reward model itself directs controlled decoding to produce responses that it scores falsely. These examples serve as adversarial training data that patches misaligned scoring without any external knowledge of preference distributions or failure types. Experiments on Anthropic HH and PKU Beavertails show gains in robustness while reward accuracy and downstream policy performance stay intact. The method also reduces spurious correlations that hurt alignment quality.

Core claim

By using the reward model to steer text generation toward responses that maximize or minimize its own scores in controlled ways, REFORM produces adversarial examples that expose and then correct the model's errors when added back into training data.

What carries the argument

Reward-guided controlled decoding, which steers generation using the reward model to surface falsely scored responses that serve as training patches.
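
The idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and none of it is taken from the paper: the bigram table `BASE_LOGPROBS` stands in for a base policy, `toy_reward` stands in for a reward model, and `guidance_weight` is a hypothetical knob that trades off policy likelihood against reward. The sketch also ignores the class-consistency condition that the paper's figures mention.

```python
import math

# Hypothetical base policy: next-token log-probabilities that depend only on
# the previous token. These numbers are invented for illustration.
BASE_LOGPROBS = {
    "<s>": {"thanks": math.log(0.6), "sure": math.log(0.3), "no": math.log(0.1)},
    "thanks": {"friend": math.log(0.7), "anyway": math.log(0.3)},
    "sure": {"friend": math.log(0.5), "anyway": math.log(0.5)},
    "no": {"friend": math.log(0.2), "anyway": math.log(0.8)},
}


def toy_reward(tokens):
    """Stand-in reward model: a made-up additive score over tokens."""
    weights = {"thanks": 1.0, "sure": 0.2, "no": -1.0, "friend": 0.5, "anyway": -0.5}
    return sum(weights.get(tok, 0.0) for tok in tokens)


def guided_decode(reward_fn, guidance_weight, max_steps=2):
    """Greedy decoding where each candidate token is scored as
    base log-probability + guidance_weight * reward of the extended sequence.
    A negative guidance_weight steers toward low-reward continuations."""
    tokens = ["<s>"]
    for _ in range(max_steps):
        candidates = BASE_LOGPROBS.get(tokens[-1])
        if not candidates:
            break
        best = max(
            candidates,
            key=lambda tok: candidates[tok]
            + guidance_weight * reward_fn(tokens[1:] + [tok]),
        )
        tokens.append(best)
    return tokens[1:]


unguided = guided_decode(toy_reward, guidance_weight=0.0)
low_reward = guided_decode(toy_reward, guidance_weight=-2.0)
print(unguided, toy_reward(unguided))
print(low_reward, toy_reward(low_reward))
```

With a weight of zero the decoder follows the base policy alone. A negative weight pulls generation toward sequences the reward model scores low, which is the direction a search for falsely low-scored responses would take.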

If this is right

  • Robustness increases on the Anthropic HH and PKU Beavertails datasets without loss of reward model accuracy.
  • Performance holds in both direct evaluation and when the improved reward model is used for downstream policy training.
  • Alignment quality rises because spurious correlations are removed from the learned preferences.
  • The approach requires no prior information about which attributes or distributions cause failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Iterative application of the same loop could produce repeated rounds of self-correction with diminishing returns on new data.
  • The technique might extend to other alignment modules such as safety filters or value heads that also rely on preference signals.
  • If the generated examples prove distributionally realistic, the method could lower the volume of human preference labels needed for reliable reward models.

Load-bearing premise

The responses created by reward-guided controlled decoding are genuine failure modes rather than artifacts produced only by the decoding procedure.

What would settle it

Retraining with the generated examples fails to improve accuracy on a separate set of human-collected or independently generated out-of-distribution preference pairs.
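
A minimal sketch of that check, assuming the held-out data is a list of (prompt, chosen, rejected) triples. The two reward functions and the pairs below are hypothetical stand-ins: `reward_before` mimics a length-biased model and `reward_after` mimics a patched one. They are not results from the paper.

```python
def pairwise_accuracy(reward_fn, pairs):
    """Fraction of (prompt, chosen, rejected) triples where the reward model
    scores the chosen response strictly higher than the rejected one."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
    )
    return correct / len(pairs)


def reward_before(prompt, response):
    # Hypothetical spurious scorer: longer responses get higher reward.
    return len(response.split())


def reward_after(prompt, response):
    # Hypothetical patched scorer: counts a content cue instead of length.
    words = response.split()
    return words.count("helpful") - words.count("unhelpful")


# Invented held-out pairs, independent of any generated training data.
HELD_OUT = [
    ("q1", "short helpful answer", "a very long and rambling unhelpful answer text"),
    ("q2", "a detailed and helpful answer here", "unhelpful"),
    ("q3", "helpful", "unhelpful reply padded with many extra words"),
]

print(pairwise_accuracy(reward_before, HELD_OUT))
print(pairwise_accuracy(reward_after, HELD_OUT))
```

If the accuracy of the retrained model on such independent pairs does not exceed that of the original model, the premise fails the test described above.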

Figures

Figures reproduced from arXiv: 2507.06419 by Furong Huang, Pankayaraj Pathmanathan.

Figure 1: Failure mode detection as controlled decoding
Figure 2: Robust reward modeling via self-improvement
Figure 3: Generating y′+ (false negatives) via controlled decoding: Given a prompt x, a base policy π𝒟 proposes likely continuations. A reward model rϕ then guides the decoding toward class-consistent but reward-inconsistent outputs (in this instance to low-reward preferred responses). To generate a valid failure mode, an example must satisfy two conditions: (W1) it must unambiguously belong to a known preference …
Figure 4: Generating y′− (false positives) via controlled decoding: Given a prompt x, a misaligned policy π𝒟−∞ proposes likely continuations. A reward model rϕ then guides the decoding toward class-consistent but reward-inconsistent outputs (in this instance to high-reward non preferred responses). Generating y′− (false positives). To generate non-preferred responses with falsely high reward, we adopt a s…
Figure 5: Figure 5a shows that REFORM was able to find failure examples with better efficacy than baselines in the category of preferred responses. Figure 5b shows that REFORM finds failure examples with reasonable efficacy as the baselines in the category of not preferred responses. For discussion on drop in coverage in this setting of non-preferred/ rejected responses for REFORM refer to Appendix A. This highlight…
Figure 6: Robustness of the finetuned reward to perturbation as measured by drop in win rate (…
Figure 7: Quality of the finetuned reward in best of N alignment (PKU)
Figure 8: Quality of the finetuned reward in PPO (PKU)
Figure 9: Quality of the finetuned reward in DPO (PKU)
Figure 10: Rejection response reward attribution analysis (PKU)
Figure 13: Training loss landscape in reward learning
Figure 11: Falsely rejected/non preferred responses (attribute based method (…
Figure 12: Falsely rejected/non preferred responses (attribute based method (…
Figure 14: Quality of the finetuned reward in DPO (HH)
Original abstract

Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution agnostic method for discovering reward model failure modes via reward guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. We evaluate REFORM on two widely used preference datasets Anthropic Helpful Harmless (HH) and PKU Beavertails and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes REFORM, a self-improving reward modeling framework that discovers failure modes via reward-guided controlled decoding. The current reward model steers generation of 'falsely scored responses,' which are added to the training set to patch misaligned behavior. The approach is evaluated on the Anthropic HH and PKU Beavertails preference datasets and is claimed to improve robustness to distributional shifts and adversarial perturbations without degrading reward quality, while also enhancing downstream policy alignment by removing spurious correlations.

Significance. If the generated examples are verified to constitute genuine adversarial or out-of-distribution failure modes (rather than artifacts of the decoding procedure), the method would provide a practical, preference-distribution-agnostic route to RM self-improvement. This could meaningfully advance robust reward modeling for alignment, especially given the limited coverage of existing preference datasets. The self-referential loop is novel but hinges on the empirical isolation of the reward-guidance mechanism.

major comments (3)
  1. [§4 and §5] §4 (Method) and §5 (Experiments): The central robustness claim rests on the generated responses representing genuine distributional-shift or adversarial failure modes. The manuscript must include an ablation that holds the number of augmented examples fixed while removing reward guidance (e.g., sampling from the base policy or a non-reward-guided adversarial generator) to isolate whether improvements are attributable to the proposed mechanism rather than decoding-induced biases such as length or fluency shifts.
  2. [§5.1] §5.1 (Results on HH and Beavertails): The abstract asserts that REFORM 'significantly improves robustness' and 'further improves alignment quality,' yet the provided description supplies no quantitative metrics, baseline comparisons, statistical tests, or effect sizes. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.
  3. [§3.2] §3.2 (Controlled Decoding Procedure): The description of how reward-guided decoding produces responses that lie outside the original data distribution in a manner reflecting real human-preference mismatches (as opposed to artifacts) is insufficient. Additional diagnostics, such as human evaluation of the generated pairs or distribution-shift metrics, are needed to support the failure-mode interpretation.
minor comments (2)
  1. [§3] Clarify the precise formulation of the controlled decoding objective (e.g., the balance between reward guidance and fluency constraints) with explicit equations or pseudocode.
  2. [§2] Add a short related-work paragraph contrasting REFORM with prior adversarial or self-training approaches for reward models to better situate the contribution.
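
For the statistical reporting requested in major comment 2, a paired comparison over matched evaluation units is one reasonable shape. The sketch below is an assumption about how such a summary could be computed, not the authors' analysis: the function name `paired_summary`, the choice of Cohen's d_z as effect size, and the example numbers are all hypothetical. It does not compute a p-value, which would require the t distribution (for instance via `scipy.stats.ttest_rel`).

```python
import math
import statistics


def paired_summary(baseline_scores, treated_scores):
    """Mean difference, paired t statistic, and Cohen's d_z for two metric
    lists measured on the same evaluation units (e.g. seeds or prompts)."""
    if len(baseline_scores) != len(treated_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [t - b for b, t in zip(baseline_scores, treated_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    return {
        "mean_diff": mean_diff,
        "t": mean_diff / (sd_diff / math.sqrt(n)),
        "d_z": mean_diff / sd_diff,
        "dof": n - 1,
    }


# Invented scores for a baseline and a treated model on four matched units.
summary = paired_summary([1, 2, 3, 4], [2, 4, 5, 8])
print(summary)
```

Pairing matters here because both reward models are evaluated on the same prompts or seeds, so the per-unit difference removes variance that an unpaired test would leave in.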

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's suggestions for improving the clarity and rigor of our claims regarding REFORM. We respond to each major comment in turn and indicate the changes we will make in the revised version.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (Method) and §5 (Experiments): The central robustness claim rests on the generated responses representing genuine distributional-shift or adversarial failure modes. The manuscript must include an ablation that holds the number of augmented examples fixed while removing reward guidance (e.g., sampling from the base policy or a non-reward-guided adversarial generator) to isolate whether improvements are attributable to the proposed mechanism rather than decoding-induced biases such as length or fluency shifts.

    Authors: We fully agree that an ablation isolating the reward guidance is essential to rule out confounding factors. We will add this ablation to the revised manuscript, generating an equal number of augmented examples using non-reward-guided methods (e.g., standard sampling from the base LLM) and comparing the resulting robustness improvements directly to those from REFORM. revision: yes

  2. Referee: [§5.1] §5.1 (Results on HH and Beavertails): The abstract asserts that REFORM 'significantly improves robustness' and 'further improves alignment quality,' yet the provided description supplies no quantitative metrics, baseline comparisons, statistical tests, or effect sizes. Without these, the magnitude and reliability of the claimed improvements cannot be assessed.

    Authors: The full paper in Section 5.1 presents quantitative results with tables comparing REFORM to baselines on metrics for robustness and alignment quality on both datasets. To address this, we will add explicit effect sizes, confidence intervals, and statistical tests (e.g., paired t-tests) in the revised version to better highlight the significance of the gains. revision: yes

  3. Referee: [§3.2] §3.2 (Controlled Decoding Procedure): The description of how reward-guided decoding produces responses that lie outside the original data distribution in a manner reflecting real human-preference mismatches (as opposed to artifacts) is insufficient. Additional diagnostics, such as human evaluation of the generated pairs or distribution-shift metrics, are needed to support the failure-mode interpretation.

    Authors: We recognize the need for stronger evidence that the generated examples are true failure modes. In the revision, we will include distribution shift metrics (e.g., KL divergence or cosine similarity in embedding space between generated and original responses) and conduct a small-scale human evaluation to verify that the pairs exhibit preference mismatches not present in the original data. revision: yes
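
The distribution-shift metrics named in the third response can be sketched on a small scale. The version below is a simplification made for illustration: it compares smoothed unigram distributions rather than sentence embeddings, and the helper names, the smoothing constant, and the two example corpora are hypothetical.

```python
import math
from collections import Counter


def build_vocab(*corpora):
    """Sorted set of lowercase whitespace tokens across all corpora."""
    return sorted(
        {tok for texts in corpora for text in texts for tok in text.lower().split()}
    )


def unigram_distribution(texts, vocab, smoothing=1.0):
    """Additively smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts[w] for w in vocab) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats; both must be positive on the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def cosine_similarity(u, v):
    """Cosine similarity between two vectors stored as dicts with equal keys."""
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


# Invented corpora standing in for original and generated responses.
original = ["thank you for asking", "happy to help"]
generated = ["no no no", "happy to refuse"]

vocab = build_vocab(original, generated)
p = unigram_distribution(generated, vocab)
q = unigram_distribution(original, vocab)
print(kl_divergence(p, q), cosine_similarity(p, q))
```

A large divergence alone does not show that the generated responses are genuine failure modes. It only quantifies how far they sit from the original data, which is why the response pairs it with human evaluation.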

Circularity Check

0 steps flagged

No circularity: REFORM is an empirical self-augmentation method evaluated on external benchmarks

full rationale

The paper proposes REFORM as a method that uses the current reward model to guide controlled decoding for generating candidate adversarial examples, augments the training set with those examples, and retrains. This is a standard self-training loop rather than a derivation that reduces to its inputs by construction. The central claims of improved robustness and removal of spurious correlations are supported by direct evaluation on the held-out portions of Anthropic HH and PKU Beavertails plus downstream policy training, none of which are defined in terms of the generated examples themselves. No equations, fitted parameters, or uniqueness theorems are shown to be equivalent to the input data or prior self-citations; the framework therefore remains self-contained against external benchmarks.
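
The generate, augment, and retrain loop described above can be made concrete with a toy model. This is a sketch under strong assumptions and not the paper's procedure: the reward model is a two-weight linear scorer over hypothetical features, training uses a logistic pairwise loss chosen for illustration, failures are found by probing a fixed labeled pair instead of by controlled decoding, and retraining starts from scratch.

```python
import math


def features(text):
    # Hypothetical features: count of a content cue, and scaled length.
    words = text.split()
    return [float(words.count("helpful")), len(words) / 10.0]


def score(weights, text):
    return sum(w * x for w, x in zip(weights, features(text)))


def train(pairs, steps=200, lr=0.5):
    """Gradient descent on the mean of -log(sigmoid(score(chosen) - score(rejected)))."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for chosen, rejected in pairs:
            margin = score(weights, chosen) - score(weights, rejected)
            coeff = -1.0 / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. margin
            f_c, f_r = features(chosen), features(rejected)
            for i in range(len(grad)):
                grad[i] += coeff * (f_c[i] - f_r[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


def misranked(weights, pairs):
    """Pairs where the chosen response does not outscore the rejected one."""
    return [(c, r) for c, r in pairs if score(weights, c) <= score(weights, r)]


# Invented training pairs in which the chosen response is always longer,
# so length becomes a spurious signal.
TRAIN_PAIRS = [
    ("helpful and long detailed reply here", "short reply"),
    ("a helpful answer with extra detail", "no"),
]
# Invented probe pair: a short preferred response against a long rejected one.
PROBE_PAIRS = [("helpful", " ".join(["filler"] * 30))]

weights_v1 = train(TRAIN_PAIRS)
failures = misranked(weights_v1, PROBE_PAIRS)   # discovered failure modes
weights_v2 = train(TRAIN_PAIRS + failures)      # augment and retrain
print(len(failures), len(misranked(weights_v2, TRAIN_PAIRS + PROBE_PAIRS)))
```

In this toy the first model learns a positive length weight and misranks the probe pair. Adding that pair to the training set removes the error while the original pairs stay correctly ranked.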

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full paper text unavailable, so ledger entries are inferred at the level of stated assumptions in the abstract.

axioms (1)
  • domain assumption: Reward models can be used to guide controlled decoding toward responses that expose their own mis-scoring behavior.
    This premise underpins the entire failure-mode discovery step.

pith-pipeline@v0.9.0 · 5749 in / 1125 out tokens · 46583 ms · 2026-05-19T05:13:02.246248+00:00 · methodology


