Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Andreas Haupt; Daniel Fein; Marcel Hussing; Max Lamparth; Mykel J. Kochenderfer

arxiv: 2605.27996 · v2 · pith:WMFHIL7Vnew · submitted 2026-05-27 · 💻 cs.AI

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Max Lamparth , Daniel Fein , Andreas Haupt , Marcel Hussing , Mykel J. Kochenderfer This is my paper

Pith reviewed 2026-06-29 12:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords reward bias substitutionRLHFbias mitigationreward modelsoptimization pressurepreference learningpolicy-induced distribution

0 comments

The pith

Single-axis reward bias mitigations rotate optimization pressure onto correlated proxies instead of removing it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that efforts to reduce reliance on one biased signal in a reward model, such as length or sycophancy, commonly shift the pressure onto other correlated signals rather than eliminating bias. This substitution stays invisible under standard evaluation because audits score on a fixed distribution while the policy trains on a different one induced by its own outputs. The authors formalize outcomes into regimes and prove that successful mitigation, substitution, and overcorrection yield identical results on any audit-based metric, even with oracle access to the true reward. They document that no surveyed published method supplies the data needed to confirm actual mitigation occurred. They close the gap by requiring evaluation on policy-induced samples while tracking multiple biases at once.

Core claim

Successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward; the measurement-versus-optimization gap between audit and policy-induced distributions prevents detection of substitution.

What carries the argument

The measurement-versus-optimization gap between the distribution used for auditing a mitigation and the distribution induced by the trained policy.

If this is right

No published preference-learning mitigation reports the measurements required to certify that bias was removed rather than substituted.
Augmenting evaluation with policy-induced distributions and tracking multiple biases simultaneously distinguishes the three regimes.
A length penalty in GRPO training can compress responses while driving the policy into overconfidence and lowering factual accuracy.
A length-debiasing operator that zeros correlation on the audit set can reintroduce bias under best-of-N selection on multiple reward models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks limited to fixed audit sets will systematically miss substitution effects in deployed policies.
Methods that penalize one observable feature should be retested after each training stage on samples drawn from the current policy.
The same gap may affect other single-axis alignment techniques beyond reward modeling, such as direct preference optimization variants.

Load-bearing premise

The distribution used to score a mitigation after training differs from the distribution the policy actually produces during optimization.

What would settle it

Apply a published length-debiasing operator, then compare reward-length correlation on the original audit set versus on best-of-N samples drawn from the resulting policy; the correlation reappearing in the policy samples would confirm substitution.

Figures

Figures reproduced from arXiv: 2605.27996 by Andreas Haupt, Daniel Fein, Marcel Hussing, Max Lamparth, Mykel J. Kochenderfer.

**Figure 1.** Figure 1: RLHF of Llama-3.2-3B-Instruct with Skywork-Reward-V2-Llama-3.1-8B using GRPO (β = 2e−2, LR = 3e−5, 600 steps) on UltraFeedback with four random seeds for each λ ∈ {0, 4, 8}, 95% bootstrapped confidence intervals. To mitigate length, we modify the reward R˜ in the RLHF objective as R˜(x, y) = RRM(x, y) − λ ntok/100. Training with λ = 0 does not break confidence calibration. As length is decreased the optimi… view at source ↗

**Figure 2.** Figure 2: Reward bias substitution mechanism from Figure 1 (schematic). The correlated spurious [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: We instantiate all five R0–R3 as tune-able outcomes of a single mitigation operator and R4 [PITH_FULL_IMAGE:figures/full_fig_p030_3.png] view at source ↗

**Figure 4.** Figure 4: Shifting γ to produce different |∆J|-heatmaps of Figure 3a to illustrate regime robustness to non-linearity strength, as the regime boundaries do not change themselves. Simulation parameters. With this setup, we have the following parameters to tune and induce regime transitions: • ρ12 sets rotation for bias substitution and also the sign of ∆2 • ρ13 sets the sign of ∆J (if the sign of g1 does not change) … view at source ↗

**Figure 5.** Figure 5: 30 [PITH_FULL_IMAGE:figures/full_fig_p030_5.png] view at source ↗

**Figure 5.** Figure 5: Validating that the simulated regime transitions align with the theoretical predictions from [PITH_FULL_IMAGE:figures/full_fig_p031_5.png] view at source ↗

**Figure 6.** Figure 6: Training curves (shaded region shows the min-max range across 4 seeds) [PITH_FULL_IMAGE:figures/full_fig_p043_6.png] view at source ↗

read the original abstract

Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation evaluation and policy training. We formalize mitigation outcomes into a regime taxonomy and prove that successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward. Across published preference-learning mitigation work, no method we survey reports the evidence needed to certify successful mitigation. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap, and we translate this into actionable prescriptions for mitigation methods and benchmarks. We demonstrate bias substitution in language model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto confidence calibration, driving the policy into overconfidence while factual free-form accuracy falls. We also show a published length-debiasing operator that zeroes reward-length correlation on the audit distribution but reintroduces bias under best-of-N selection on three of four SOTA reward models, and a length-sycophancy coupling whose direction reverses under human-LLM judge disagreement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows single-axis bias fixes in reward models often just shift pressure to other proxies, with a proof that audits cannot distinguish this from real success.

read the letter

The key point is that mitigating one bias like length or sycophancy in a reward model can rotate the optimization onto a correlated proxy instead of removing the pressure. They name this reward bias substitution and prove that successful mitigation, substitution, and overcorrection are observationally identical under any scoring on the audit distribution, even with oracle true-reward access.

What is new is the regime taxonomy and the formal indistinguishability result. The survey finding that no prior mitigation paper supplies the checks needed to confirm actual mitigation rather than substitution is also useful. The demonstrations, including the GRPO length penalty case that increases overconfidence and the published debiasing operator that reintroduces bias under best-of-N, make the mechanism concrete.

The work does well at turning the math into prescriptions: augment evaluations with policy-induced distributions and track multiple biases at once. These steps follow directly from the gap between audit and training distributions.

The soft spots are proportionate. The empirical parts are targeted illustrations rather than large-scale sweeps, so they show the failure mode is possible but do not measure how often it occurs. The proof rests on the measurement-optimization gap, which is standard in RLHF but is doing the main work here. No internal contradictions appear in the argument.

This is for people working on reward modeling and RLHF who design or evaluate bias mitigations. Readers focused on benchmarks or preference learning methods will get the most from the taxonomy and the evaluation recommendations.

It deserves peer review. The formal contribution identifies a systematic evaluation blind spot that affects alignment work, and the practical suggestions are worth referee scrutiny.

Referee Report

0 major / 3 minor

Summary. The paper claims that single-axis mitigations of reward-model biases (e.g., length, sycophancy) can redirect optimization pressure onto correlated proxies rather than eliminate it, a failure mode termed reward bias substitution. It introduces a regime taxonomy and proves that successful mitigation, bias substitution, and overcorrection are observationally equivalent under any scoring restricted to the audit distribution (including ranking accuracy, win-rate, and oracle true-reward access). This is demonstrated via GRPO length-penalty training (which compresses responses but induces overconfidence and drops factual accuracy), a published length-debiasing operator that fails under best-of-N on multiple SOTA reward models, and reversal of length-sycophancy coupling under judge disagreement. The paper concludes that standard audit-only evaluations are insufficient and prescribes augmenting them with policy-induced distributions and multi-bias tracking.

Significance. If the indistinguishability result holds, the work is significant because it supplies a formal explanation for why many published single-axis debiasing methods in preference learning cannot be certified as successful under existing evaluation protocols. The proof directly leverages the documented measurement-versus-optimization gap in RLHF, the regime taxonomy organizes observable outcomes, and the concrete demonstrations (GRPO, best-of-N) plus actionable prescriptions for benchmark design constitute a constructive contribution to improving mitigation assessment.

minor comments (3)

[Abstract] The abstract states that the proof covers 'any audit-distribution scoring' but does not name the precise theorem or section containing the formal argument; adding an explicit pointer would improve traceability.
[Empirical section on GRPO] In the GRPO length-penalty demonstration, the claim that factual free-form accuracy falls is central to showing substitution; the manuscript should report the exact accuracy metric, dataset, and statistical significance to allow replication.
[Survey of published methods] The survey claim that 'no method we survey reports the evidence needed' would be strengthened by an explicit table or appendix listing the surveyed papers and the missing evidence categories for each.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the work, the accurate summary of our claims, and the recommendation for minor revision. The significance evaluation is appreciated. No specific major comments appear in the report, so we have no individual points requiring rebuttal or clarification at this time. We remain available to address any additional feedback or to perform minor revisions as directed by the editor.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central derivation is a formal proof that successful mitigation, bias substitution, and overcorrection are observationally equivalent under any scoring restricted to the audit distribution. This equivalence is shown to follow directly once the measurement-versus-optimization gap is granted; the gap is treated as a pre-existing documented property of RLHF rather than something derived from the paper's own fitted quantities or prior self-citations. The regime taxonomy, the indistinguishability result, and the empirical demonstrations (GRPO length penalty, debiasing operator, length-sycophancy reversal) are presented as independent content that does not reduce by construction to inputs defined by the paper itself. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper relies on the domain assumption of a persistent measurement-versus-optimization gap and the existence of correlated reward proxies; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption There exists a measurement-versus-optimization gap between audit and policy-induced distributions.
This gap is required for the proof that substitution remains undetectable under audit scoring.

invented entities (1)

reward bias substitution no independent evidence
purpose: Conceptual label for the redirection of optimization pressure to correlated proxies.
This is a descriptive term for the identified failure mode rather than a new postulated object with independent evidence.

pith-pipeline@v0.9.1-grok · 5784 in / 1284 out tokens · 31309 ms · 2026-06-29T12:17:42.139898+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

98 extracted references · 42 canonical work pages · 12 internal anchors

[1]

Models overview

Anthropic . Models overview. Blog post, 2026. Accessed: 2026-05-06

2026
[2]

Training large language models on narrow tasks can lead to broad misalignment

Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Mart \'i n Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment. Nature, 649: 0 584--589, 2026. doi:10.1038/s41586-025-09937-5

work page doi:10.1038/s41586-025-09937-5 2026
[3]

Beyond excess and deficiency: Adaptive length bias mitigation in reward models for RLHF

Yuyan Bu, Liangyu Huo, Yi Jing, and Qing Yang. Beyond excess and deficiency: Adaptive length bias mitigation in reward models for RLHF . In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3091--3098, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-naacl.169

work page doi:10.18653/v1/2025.findings-naacl.169 2025
[4]

Disentangling length bias in preference learning via response-conditioned modeling

Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Yue Wang, Li Li, Wengang Zhou, and Houqiang Li. Disentangling length bias in preference learning via response-conditioned modeling. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hKxYESOzen

2026
[5]

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J \'e r \'e my Scheurer, Javier Rando, Rachel Freedman, Tomek Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip J.K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric ...

2023
[6]

Humans or LLM s as the judge? a study on judgement bias

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLM s as the judge? a study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301--8327, Miami, Florida, USA, November 2024 a . Association for Computational Linguistics. doi:10.18653/v1/2024.emnlp-main.474

work page doi:10.18653/v1/2024.emnlp-main.474 2024
[7]

Odin: disentangled reward mitigates hacking in rlhf

Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. Odin: disentangled reward mitigates hacking in rlhf. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024 b

2024
[8]

Learning ordinal probabilistic reward from preferences

Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, and Min Yang. Learning ordinal probabilistic reward from preferences. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=0Vf5trUAVF

2026
[9]

Sycophantic AI decreases prosocial intentions and promotes dependence

Myra Cheng, Cinoo Lee, Pranav Khadpe, Sunny Yu, Dyllan Han, and Dan Jurafsky. Sycophantic AI decreases prosocial intentions and promotes dependence. Science, 391: 0 eaec8352, 2026 a . doi:10.1126/science.aec8352

work page doi:10.1126/science.aec8352 2026
[10]

ELEPHANT : Measuring and understanding social sycophancy in LLM s

Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. ELEPHANT : Measuring and understanding social sycophancy in LLM s. In The Fourteenth International Conference on Learning Representations, 2026 b . URL https://openreview.net/forum?id=igbRHKEiAs

2026
[11]

Learning correlated reward models: Statistical barriers and opportunities

Yeshwanth Cherapanamjeri, Constantinos Costis Daskalakis, Gabriele Farina, and Sobhan Mohammadpour. Learning correlated reward models: Statistical barriers and opportunities. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=TbEyl6krsY

2026
[12]

Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: an open platform for evaluating llms by human preference. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

2024
[13]

Reward models inherit value biases from pretraining

Brian Christian, Jessica A F Thompson, Elle, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, and Tsvetomira Dumbalska. Reward models inherit value biases from pretraining. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=dT399j1Azv

2026
[14]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, page 4302–4310, Red Hook, NY, USA, 2017. Curran Associates Inc

2017
[15]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168. arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Ultrafeedback: Boosting language models with high-quality feedback, 2023

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

2023
[17]

Ask don't tell: Reducing sycophancy in large language models

Magda Dubois, Cozmin Ududec, Christopher Summerfield, and Lennart Luettgau. Ask don't tell: Reducing sycophancy in large language models, 2026. URL https://arxiv.org/abs/2602.23971

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Length-controlled alpacaeval: A simple debiasing of automatic evaluators

Yann Dubois, Percy Liang, and Tatsunori Hashimoto. Length-controlled alpacaeval: A simple debiasing of automatic evaluators. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=CybBmzWBX0

2024
[19]

Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking

Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alexander Nicholas D'Amour, Krishnamurthy Dj Dvijotham, Adam Fisch, Katherine A Heller, Stephen Robert Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant. Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking. In First Conference on Language Modeling,...

2024
[20]

Syceval: Evaluating llm sycophancy

Aaron Fanous, Jacob Goldberg, Ank Agarwal, Joanna Lin, Anson Zhou, Sonnet Xu, Vasiliki Bikia, Roxana Daneshjou, and Sanmi Koyejo. Syceval: Evaluating llm sycophancy. In Proceedings of the Eighth AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 893--900, 2025. doi:10.1609/aies.v8i1.36598

work page doi:10.1609/aies.v8i1.36598 2025
[21]

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, and Nick Haber. One bias after another: Mechanistic reward shaping and persistent biases in language reward models, 2026. URL https://arxiv.org/abs/2603.03291

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

C2po: Diagnosing and disentangling bias shortcuts in llms, 2025

Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, and Shuai Zhao. C2po: Diagnosing and disentangling bias shortcuts in llms, 2025. URL https://arxiv.org/abs/2512.23430

work page arXiv 2025
[23]

The perils of optimizing learned reward functions: Low training error does not guarantee low regret

Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forr\' e , David Krueger, and Joar Max Viktor Skalse. The perils of optimizing learned reward functions: Low training error does not guarantee low regret. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of t...

2025
[24]

Reward shaping to mitigate reward hacking in RLHF

Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward shaping to mitigate reward hacking in RLHF . In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025. URL https://openreview.net/forum?id=62A4d5Mokc

2025
[25]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

2023
[26]

Shortcut learning in deep neural networks

Robert Geirhos, J \"o rn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence volume 2, pages665--673(2020), 2020

2020
[27]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

2021
[29]

Measuring sycophancy of language models in multi-turn dialogues

Jiseung Hong, Grace Byun, Seungone Kim, and Kai Shu. Measuring sycophancy of language models in multi-turn dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2239--2259, Suzhou, China, November 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-emnlp.121

work page doi:10.18653/v1/2025.findings-emnlp.121 2025
[30]

Lo RA : Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

2022
[31]

Explaining length bias in LLM -based preference evaluations

Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, and Hui Xiong. Explaining length bias in LLM -based preference evaluations. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6763--6794, Suzhou, China, November 2025. Association for Computational Ling...

work page doi:10.18653/v1/2025.findings-emnlp.358 2025
[32]

Post-hoc reward calibration: A case study on length bias

Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo Ponti, and Ivan Titov. Post-hoc reward calibration: A case study on length bias. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Iu8RytBaji

2025
[33]

Measuring and mitigating overreliance to build human-compatible AI

Lujain Ibrahim, Katherine M. Collins, Sunnie S. Y. Kim, Anka Reuel, Max Lamparth, Kevin Feng, Lama Ahmad, Prajna Soni, Alia El Kattan, Merlin Stein, Siddharth Swaroop, Vishakh Padmakumar, Ilia Sucholutsky, Andrew Strait, Diyi Yang, Q. Vera Liao, and Umang Bhatt. Measuring and mitigating overreliance to build human-compatible ai, 2026 a . URL https://arxiv...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Training language models to be warm can reduce accuracy and increase sycophancy

Lujain Ibrahim, Franziska Sofia Hafner, and Luc Rocher. Training language models to be warm can reduce accuracy and increase sycophancy. Nature, 652: 0 1159--1165, 2026 b . doi:10.1038/s41586-026-10410-0

work page doi:10.1038/s41586-026-10410-0 2026
[35]

T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada, July 2017. Association for Computational Linguist...

work page doi:10.18653/v1/p17-1147 2017
[36]

A survey of reinforcement learning from human feedback

Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke H \"u llermeier. A survey of reinforcement learning from human feedback. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=f7OkIurx4b. Survey Certification

2025
[37]

Preventing fairness gerrymandering: Auditing and learning for subgroup fairness

Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2564--2572. PMLR, 10--15 Jul 2018

2018
[38]

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Emaan Bilal Khan, Amy Winecoff, Miranda Bogen, and Dylan Hadfield-Menell. Safety drift after fine-tuning: Evidence from high-stakes domains, 2026. URL https://arxiv.org/abs/2604.24902

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Mitigating length bias in rlhf through a causal lens, 2025

Hyeonji Kim, Sujeong Oh, and Sanghack Lee. Mitigating length bias in rlhf through a causal lens, 2025. URL https://arxiv.org/abs/2511.12573

work page arXiv 2025
[40]

Reward identification in inverse reinforcement learning

Kuno Kim, Shivam Garg, Kirankumar Shiragur, and Stefano Ermon. Reward identification in inverse reinforcement learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5496--5505. PMLR, 18--24 Jul 2021

2021
[41]

Benchmarking cognitive biases in large language models as evaluators

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. In Findings of the Association for Computational Linguistics: ACL 2024, pages 517--545, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-acl.29

work page doi:10.18653/v1/2024.findings-acl.29 2024
[42]

Detecting prefix bias in llm-based reward models

Ashwin Kumar, Yuzi He, Aram H Markosyan, Bobbie Chern, and Imanol Arrieta-Ibarra. Detecting prefix bias in llm-based reward models. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT '25, page 3196–3206, New York, NY, USA, 2025. Association for Computing Machinery. doi:10.1145/3715275.3732204

work page doi:10.1145/3715275.3732204 2025
[43]

Catastrophic goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Thomas Kwa, Drake Thomas, and Adri \`a Garriga-Alonso. Catastrophic goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=UXuBzWoZGK

2024
[44]

Correlated proxies: A new definition and improved mitigation for reward hacking

Cassidy Laidlaw, Shivam Singhal, and Anca Dragan. Correlated proxies: A new definition and improved mitigation for reward hacking. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=msEr27EejF

2025
[45]

Reinforcement Learning from Human Feedback

Nathan Lambert. Reinforcement learning from human feedback, 2025. URL https://arxiv.org/abs/2504.12501

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Smith, and Hannaneh Hajishirzi

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. R eward B ench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755--1797, Albuquerque, New...

work page doi:10.18653/v1/2025.findings-naacl.96 2025
[47]

Taming overconfidence in LLM s: Reward calibration in RLHF

Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang. Taming overconfidence in LLM s: Reward calibration in RLHF . In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=l0tg0jzsdL

2025
[48]

Length-controlled margin-based preference optimization without reference model, 2025 a

Gengxu Li, Tingyu Xia, Yi Chang, and Yuan Wu. Length-controlled margin-based preference optimization without reference model, 2025 a . URL https://arxiv.org/abs/2502.14643

work page arXiv 2025
[49]

Does style matter? D isentangling style and substance in chatbot arena

Tianle Li, Anastasios Angelopoulos, and Wei-Lin Chiang. Does style matter? D isentangling style and substance in chatbot arena. Blog post, 2024. Accessed: 2026-05-06

2024
[50]

A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others

Zhiheng Li, Ivan Evtimov, Albert Gordo, Caner Hazirbas, Tal Hassner, Cristian Canton Ferrer, Chenliang Xu, and Mark Ibrahim. A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 2023

2023
[51]

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

Zhuo Li, Pengyu Cheng, Zhechao Yu, Feifei Tong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang. Eliminating inductive bias in reward models with information-theoretic guidance, 2025 b . URL https://arxiv.org/abs/2512.23461

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy, 2025. URL https://arxiv.org/abs/2507.01352. arXiv:2507.01352

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

Robust optimization for mitigating reward hacking with correlated proxies

Zixuan Liu, Xiaolin Sun, and Zizhan Zheng. Robust optimization for mitigating reward hacking with correlated proxies. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=O3shkBWM2s

2026
[54]

Eliminating biased length reliance of direct preference optimization via down-sampled KL divergence

Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, and Xing Sun. Eliminating biased length reliance of direct preference optimization via down-sampled KL divergence. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1047--1067. Association for Computational Linguistics, November 2024. doi:10.18653/v...

work page doi:10.18653/v1/2024.emnlp-main.60 2024
[55]

Smith, Hannaneh Hajishirzi, and Nathan Lambert

Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, and Nathan Lambert. Rewardbench 2: Advancing reward model evaluation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=fb0G86Dewb

2026
[56]

Simpo: simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: simple preference optimization with a reference-free reward. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24. Curran Associates Inc., 2024

2024
[57]

What's in my human feedback? learning interpretable descriptions of preference data

Rajiv Movva, Smitha Milli, Sewon Min, and Emma Pierson. What's in my human feedback? learning interpretable descriptions of preference data. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=sC6A1bFDUt

2026
[58]

Ng, Daishi Harada, and Stuart J

Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, page 278–287. Morgan Kaufmann Publishers Inc., 1999

1999
[59]

Debiasing reward models by representation learning with guarantees, 2025

Ignavier Ng, Patrick Blöbaum, Siddharth Bhandari, Kun Zhang, and Shiva Kasiviswanathan. Debiasing reward models by representation learning with guarantees, 2025. URL https://arxiv.org/abs/2510.23751

work page arXiv 2025
[60]

Nvidia, :, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorze...

work page arXiv 2024
[61]

OpenAssistant/reward-model-deberta-v3-large-v2

OpenAssistant . OpenAssistant/reward-model-deberta-v3-large-v2 . https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2, 2023. Hugging Face model card. Accessed: 2026-01-06

2023
[62]

Linear probe penalties reduce LLM sycophancy

Henry Papadatos and Rachel Freedman. Linear probe penalties reduce LLM sycophancy. In Workshop on Socially Responsible Language Modelling Research, 2024. URL https://openreview.net/forum?id=6N2yES22rG

2024
[63]

Disentangling length from quality in direct preference optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4998--5017, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-acl.297

work page doi:10.18653/v1/2024.findings-acl.297 2024
[64]

External Validity: From Do-Calculus to Transportability Across Populations, page 451–482

Judea Pearl and Elias Bareinboim. External Validity: From Do-Calculus to Transportability Across Populations, page 451–482. Association for Computing Machinery, New York, NY, USA, 1 edition, 2022. URL https://doi.org/10.1145/3501714.3501741

work page doi:10.1145/3501714.3501741 2022
[65]

Performative prediction

Juan Perdomo, Tijana Zrnic, Celestine Mendler-D \"u nner, and Moritz Hardt. Performative prediction. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7599--7609. PMLR, 13--18 Jul 2020

2020
[66]

Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Ke...

work page doi:10.18653/v1/2023.findings-acl.847 2023
[67]

Reinforcement learning by reward-weighted regression for operational space control

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, page 745–750. Association for Computing Machinery, 2007. doi:10.1145/1273496.1273590

work page doi:10.1145/1273496.1273590 2007
[68]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9

2023
[69]

Verbosity bias in preference labeling by large language models

Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023. URL https://openreview.net/forum?id=magEgFpK1y

2023
[70]

The pitfalls of simplicity bias in neural networks

Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020
[71]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[72]

Procaccia

Itai Shapira, Gerdus Benade, and Ariel D. Procaccia. How rlhf amplifies sycophancy, 2026. URL https://arxiv.org/abs/2602.01002

work page arXiv 2026
[73]

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin DURMUS, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In The Twelfth Internati...

2024
[74]

Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback

Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuanjing Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=qq6ctdUwCX

2023
[75]

A long way to go: Investigating length correlations in RLHF

Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in RLHF . In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=G8LaO1P0xv

2024
[76]

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22. Curran Associates Inc., 2022

2022
[77]

Invariance in policy optimisation and partial identifiability in reward learning

Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

2023
[78]

Causal reward adjustment: Mitigating reward hacking in external reasoning via backdoor correction, 2025

Ruike Song, Zeen Song, Huijie Guo, and Wenwen Qiang. Causal reward adjustment: Mitigating reward hacking in external reasoning via backdoor correction, 2025. URL https://arxiv.org/abs/2508.04216

work page arXiv 2025
[79]

Robust reward modeling via causal rubrics

Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, and Doina Precup. Robust reward modeling via causal rubrics. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview...

2026
[80]

Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20. Curran Associates Inc., 2020

2020

Showing first 80 references.

[1] [1]

Models overview

Anthropic . Models overview. Blog post, 2026. Accessed: 2026-05-06

2026

[2] [2]

Training large language models on narrow tasks can lead to broad misalignment

Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Mart \'i n Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment. Nature, 649: 0 584--589, 2026. doi:10.1038/s41586-025-09937-5

work page doi:10.1038/s41586-025-09937-5 2026

[3] [3]

Beyond excess and deficiency: Adaptive length bias mitigation in reward models for RLHF

Yuyan Bu, Liangyu Huo, Yi Jing, and Qing Yang. Beyond excess and deficiency: Adaptive length bias mitigation in reward models for RLHF . In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3091--3098, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-naacl.169

work page doi:10.18653/v1/2025.findings-naacl.169 2025

[4] [4]

Disentangling length bias in preference learning via response-conditioned modeling

Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Yue Wang, Li Li, Wengang Zhou, and Houqiang Li. Disentangling length bias in preference learning via response-conditioned modeling. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hKxYESOzen

2026

[5] [5]

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J \'e r \'e my Scheurer, Javier Rando, Rachel Freedman, Tomek Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip J.K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric ...

2023

[6] [6]

Humans or LLM s as the judge? a study on judgement bias

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLM s as the judge? a study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301--8327, Miami, Florida, USA, November 2024 a . Association for Computational Linguistics. doi:10.18653/v1/2024.emnlp-main.474

work page doi:10.18653/v1/2024.emnlp-main.474 2024

[7] [7]

Odin: disentangled reward mitigates hacking in rlhf

Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. Odin: disentangled reward mitigates hacking in rlhf. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024 b

2024

[8] [8]

Learning ordinal probabilistic reward from preferences

Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, and Min Yang. Learning ordinal probabilistic reward from preferences. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=0Vf5trUAVF

2026

[9] [9]

Sycophantic AI decreases prosocial intentions and promotes dependence

Myra Cheng, Cinoo Lee, Pranav Khadpe, Sunny Yu, Dyllan Han, and Dan Jurafsky. Sycophantic AI decreases prosocial intentions and promotes dependence. Science, 391: 0 eaec8352, 2026 a . doi:10.1126/science.aec8352

work page doi:10.1126/science.aec8352 2026

[10] [10]

ELEPHANT : Measuring and understanding social sycophancy in LLM s

Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. ELEPHANT : Measuring and understanding social sycophancy in LLM s. In The Fourteenth International Conference on Learning Representations, 2026 b . URL https://openreview.net/forum?id=igbRHKEiAs

2026

[11] [11]

Learning correlated reward models: Statistical barriers and opportunities

Yeshwanth Cherapanamjeri, Constantinos Costis Daskalakis, Gabriele Farina, and Sobhan Mohammadpour. Learning correlated reward models: Statistical barriers and opportunities. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=TbEyl6krsY

2026

[12] [12]

Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: an open platform for evaluating llms by human preference. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024

2024

[13] [13]

Reward models inherit value biases from pretraining

Brian Christian, Jessica A F Thompson, Elle, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, and Tsvetomira Dumbalska. Reward models inherit value biases from pretraining. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=dT399j1Azv

2026

[14] [14]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, page 4302–4310, Red Hook, NY, USA, 2017. Curran Associates Inc

2017

[15] [15]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168. arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021

[16] [16]

Ultrafeedback: Boosting language models with high-quality feedback, 2023

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

2023

[17] [17]

Ask don't tell: Reducing sycophancy in large language models

Magda Dubois, Cozmin Ududec, Christopher Summerfield, and Lennart Luettgau. Ask don't tell: Reducing sycophancy in large language models, 2026. URL https://arxiv.org/abs/2602.23971

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Length-controlled alpacaeval: A simple debiasing of automatic evaluators

Yann Dubois, Percy Liang, and Tatsunori Hashimoto. Length-controlled alpacaeval: A simple debiasing of automatic evaluators. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=CybBmzWBX0

2024

[19] [19]

Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking

Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alexander Nicholas D'Amour, Krishnamurthy Dj Dvijotham, Adam Fisch, Katherine A Heller, Stephen Robert Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant. Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking. In First Conference on Language Modeling,...

2024

[20] [20]

Syceval: Evaluating llm sycophancy

Aaron Fanous, Jacob Goldberg, Ank Agarwal, Joanna Lin, Anson Zhou, Sonnet Xu, Vasiliki Bikia, Roxana Daneshjou, and Sanmi Koyejo. Syceval: Evaluating llm sycophancy. In Proceedings of the Eighth AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 893--900, 2025. doi:10.1609/aies.v8i1.36598

work page doi:10.1609/aies.v8i1.36598 2025

[21] [21]

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, and Nick Haber. One bias after another: Mechanistic reward shaping and persistent biases in language reward models, 2026. URL https://arxiv.org/abs/2603.03291

work page internal anchor Pith review Pith/arXiv arXiv 2026

[22] [22]

C2po: Diagnosing and disentangling bias shortcuts in llms, 2025

Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, and Shuai Zhao. C2po: Diagnosing and disentangling bias shortcuts in llms, 2025. URL https://arxiv.org/abs/2512.23430

work page arXiv 2025

[23] [23]

The perils of optimizing learned reward functions: Low training error does not guarantee low regret

Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forr\' e , David Krueger, and Joar Max Viktor Skalse. The perils of optimizing learned reward functions: Low training error does not guarantee low regret. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of t...

2025

[24] [24]

Reward shaping to mitigate reward hacking in RLHF

Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward shaping to mitigate reward hacking in RLHF . In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025. URL https://openreview.net/forum?id=62A4d5Mokc

2025

[25] [25]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

2023

[26] [26]

Shortcut learning in deep neural networks

Robert Geirhos, J \"o rn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence volume 2, pages665--673(2020), 2020

2020

[27] [27]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

2021

[29] [29]

Measuring sycophancy of language models in multi-turn dialogues

Jiseung Hong, Grace Byun, Seungone Kim, and Kai Shu. Measuring sycophancy of language models in multi-turn dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2239--2259, Suzhou, China, November 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-emnlp.121

work page doi:10.18653/v1/2025.findings-emnlp.121 2025

[30] [30]

Lo RA : Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

2022

[31] [31]

Explaining length bias in LLM -based preference evaluations

Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, and Hui Xiong. Explaining length bias in LLM -based preference evaluations. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6763--6794, Suzhou, China, November 2025. Association for Computational Ling...

work page doi:10.18653/v1/2025.findings-emnlp.358 2025

[32] [32]

Post-hoc reward calibration: A case study on length bias

Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo Ponti, and Ivan Titov. Post-hoc reward calibration: A case study on length bias. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Iu8RytBaji

2025

[33] [33]

Measuring and mitigating overreliance to build human-compatible AI

Lujain Ibrahim, Katherine M. Collins, Sunnie S. Y. Kim, Anka Reuel, Max Lamparth, Kevin Feng, Lama Ahmad, Prajna Soni, Alia El Kattan, Merlin Stein, Siddharth Swaroop, Vishakh Padmakumar, Ilia Sucholutsky, Andrew Strait, Diyi Yang, Q. Vera Liao, and Umang Bhatt. Measuring and mitigating overreliance to build human-compatible ai, 2026 a . URL https://arxiv...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

Training language models to be warm can reduce accuracy and increase sycophancy

Lujain Ibrahim, Franziska Sofia Hafner, and Luc Rocher. Training language models to be warm can reduce accuracy and increase sycophancy. Nature, 652: 0 1159--1165, 2026 b . doi:10.1038/s41586-026-10410-0

work page doi:10.1038/s41586-026-10410-0 2026

[35] [35]

T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada, July 2017. Association for Computational Linguist...

work page doi:10.18653/v1/p17-1147 2017

[36] [36]

A survey of reinforcement learning from human feedback

Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke H \"u llermeier. A survey of reinforcement learning from human feedback. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=f7OkIurx4b. Survey Certification

2025

[37] [37]

Preventing fairness gerrymandering: Auditing and learning for subgroup fairness

Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2564--2572. PMLR, 10--15 Jul 2018

2018

[38] [38]

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Emaan Bilal Khan, Amy Winecoff, Miranda Bogen, and Dylan Hadfield-Menell. Safety drift after fine-tuning: Evidence from high-stakes domains, 2026. URL https://arxiv.org/abs/2604.24902

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

Mitigating length bias in rlhf through a causal lens, 2025

Hyeonji Kim, Sujeong Oh, and Sanghack Lee. Mitigating length bias in rlhf through a causal lens, 2025. URL https://arxiv.org/abs/2511.12573

work page arXiv 2025

[40] [40]

Reward identification in inverse reinforcement learning

Kuno Kim, Shivam Garg, Kirankumar Shiragur, and Stefano Ermon. Reward identification in inverse reinforcement learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5496--5505. PMLR, 18--24 Jul 2021

2021

[41] [41]

Benchmarking cognitive biases in large language models as evaluators

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. In Findings of the Association for Computational Linguistics: ACL 2024, pages 517--545, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-acl.29

work page doi:10.18653/v1/2024.findings-acl.29 2024

[42] [42]

Detecting prefix bias in llm-based reward models

Ashwin Kumar, Yuzi He, Aram H Markosyan, Bobbie Chern, and Imanol Arrieta-Ibarra. Detecting prefix bias in llm-based reward models. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT '25, page 3196–3206, New York, NY, USA, 2025. Association for Computing Machinery. doi:10.1145/3715275.3732204

work page doi:10.1145/3715275.3732204 2025

[43] [43]

Catastrophic goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Thomas Kwa, Drake Thomas, and Adri \`a Garriga-Alonso. Catastrophic goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=UXuBzWoZGK

2024

[44] [44]

Correlated proxies: A new definition and improved mitigation for reward hacking

Cassidy Laidlaw, Shivam Singhal, and Anca Dragan. Correlated proxies: A new definition and improved mitigation for reward hacking. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=msEr27EejF

2025

[45] [45]

Reinforcement Learning from Human Feedback

Nathan Lambert. Reinforcement learning from human feedback, 2025. URL https://arxiv.org/abs/2504.12501

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Smith, and Hannaneh Hajishirzi

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. R eward B ench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755--1797, Albuquerque, New...

work page doi:10.18653/v1/2025.findings-naacl.96 2025

[47] [47]

Taming overconfidence in LLM s: Reward calibration in RLHF

Jixuan Leng, Chengsong Huang, Banghua Zhu, and Jiaxin Huang. Taming overconfidence in LLM s: Reward calibration in RLHF . In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=l0tg0jzsdL

2025

[48] [48]

Length-controlled margin-based preference optimization without reference model, 2025 a

Gengxu Li, Tingyu Xia, Yi Chang, and Yuan Wu. Length-controlled margin-based preference optimization without reference model, 2025 a . URL https://arxiv.org/abs/2502.14643

work page arXiv 2025

[49] [49]

Does style matter? D isentangling style and substance in chatbot arena

Tianle Li, Anastasios Angelopoulos, and Wei-Lin Chiang. Does style matter? D isentangling style and substance in chatbot arena. Blog post, 2024. Accessed: 2026-05-06

2024

[50] [50]

A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others

Zhiheng Li, Ivan Evtimov, Albert Gordo, Caner Hazirbas, Tal Hassner, Cristian Canton Ferrer, Chenliang Xu, and Mark Ibrahim. A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 2023

2023

[51] [51]

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

Zhuo Li, Pengyu Cheng, Zhechao Yu, Feifei Tong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang. Eliminating inductive bias in reward models with information-theoretic guidance, 2025 b . URL https://arxiv.org/abs/2512.23461

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [52]

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy, 2025. URL https://arxiv.org/abs/2507.01352. arXiv:2507.01352

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

Robust optimization for mitigating reward hacking with correlated proxies

Zixuan Liu, Xiaolin Sun, and Zizhan Zheng. Robust optimization for mitigating reward hacking with correlated proxies. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=O3shkBWM2s

2026

[54] [54]

Eliminating biased length reliance of direct preference optimization via down-sampled KL divergence

Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, and Xing Sun. Eliminating biased length reliance of direct preference optimization via down-sampled KL divergence. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1047--1067. Association for Computational Linguistics, November 2024. doi:10.18653/v...

work page doi:10.18653/v1/2024.emnlp-main.60 2024

[55] [55]

Smith, Hannaneh Hajishirzi, and Nathan Lambert

Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, and Nathan Lambert. Rewardbench 2: Advancing reward model evaluation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=fb0G86Dewb

2026

[56] [56]

Simpo: simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: simple preference optimization with a reference-free reward. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24. Curran Associates Inc., 2024

2024

[57] [57]

What's in my human feedback? learning interpretable descriptions of preference data

Rajiv Movva, Smitha Milli, Sewon Min, and Emma Pierson. What's in my human feedback? learning interpretable descriptions of preference data. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=sC6A1bFDUt

2026

[58] [58]

Ng, Daishi Harada, and Stuart J

Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, page 278–287. Morgan Kaufmann Publishers Inc., 1999

1999

[59] [59]

Debiasing reward models by representation learning with guarantees, 2025

Ignavier Ng, Patrick Blöbaum, Siddharth Bhandari, Kun Zhang, and Shiva Kasiviswanathan. Debiasing reward models by representation learning with guarantees, 2025. URL https://arxiv.org/abs/2510.23751

work page arXiv 2025

[60] [60]

Nvidia, :, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorze...

work page arXiv 2024

[61] [61]

OpenAssistant/reward-model-deberta-v3-large-v2

OpenAssistant . OpenAssistant/reward-model-deberta-v3-large-v2 . https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2, 2023. Hugging Face model card. Accessed: 2026-01-06

2023

[62] [62]

Linear probe penalties reduce LLM sycophancy

Henry Papadatos and Rachel Freedman. Linear probe penalties reduce LLM sycophancy. In Workshop on Socially Responsible Language Modelling Research, 2024. URL https://openreview.net/forum?id=6N2yES22rG

2024

[63] [63]

Disentangling length from quality in direct preference optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4998--5017, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-acl.297

work page doi:10.18653/v1/2024.findings-acl.297 2024

[64] [64]

External Validity: From Do-Calculus to Transportability Across Populations, page 451–482

Judea Pearl and Elias Bareinboim. External Validity: From Do-Calculus to Transportability Across Populations, page 451–482. Association for Computing Machinery, New York, NY, USA, 1 edition, 2022. URL https://doi.org/10.1145/3501714.3501741

work page doi:10.1145/3501714.3501741 2022

[65] [65]

Performative prediction

Juan Perdomo, Tijana Zrnic, Celestine Mendler-D \"u nner, and Moritz Hardt. Performative prediction. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 7599--7609. PMLR, 13--18 Jul 2020

2020

[66] [66]

Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Ke...

work page doi:10.18653/v1/2023.findings-acl.847 2023

[67] [67]

Reinforcement learning by reward-weighted regression for operational space control

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, page 745–750. Association for Computing Machinery, 2007. doi:10.1145/1273496.1273590

work page doi:10.1145/1273496.1273590 2007

[68] [68]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9

2023

[69] [69]

Verbosity bias in preference labeling by large language models

Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023. URL https://openreview.net/forum?id=magEgFpK1y

2023

[70] [70]

The pitfalls of simplicity bias in neural networks

Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020

[71] [71]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024

[72] [72]

Procaccia

Itai Shapira, Gerdus Benade, and Ariel D. Procaccia. How rlhf amplifies sycophancy, 2026. URL https://arxiv.org/abs/2602.01002

work page arXiv 2026

[73] [73]

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin DURMUS, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In The Twelfth Internati...

2024

[74] [74]

Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback

Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, and Xuanjing Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=qq6ctdUwCX

2023

[75] [75]

A long way to go: Investigating length correlations in RLHF

Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in RLHF . In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=G8LaO1P0xv

2024

[76] [76]

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22. Curran Associates Inc., 2022

2022

[77] [77]

Invariance in policy optimisation and partial identifiability in reward learning

Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, and Adam Gleave. Invariance in policy optimisation and partial identifiability in reward learning. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

2023

[78] [78]

Causal reward adjustment: Mitigating reward hacking in external reasoning via backdoor correction, 2025

Ruike Song, Zeen Song, Huijie Guo, and Wenwen Qiang. Causal reward adjustment: Mitigating reward hacking in external reasoning via backdoor correction, 2025. URL https://arxiv.org/abs/2508.04216

work page arXiv 2025

[79] [79]

Robust reward modeling via causal rubrics

Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, and Doina Precup. Robust reward modeling via causal rubrics. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview...

2026

[80] [80]

Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20. Curran Associates Inc., 2020

2020