
arxiv: 2605.18721 · v3 · pith:MRW6BJ2Inew · submitted 2026-05-18 · 💻 cs.LG · cs.CL

General Preference Reinforcement Learning

Pith reviewed 2026-05-22 09:21 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords preference optimization · reinforcement learning from human feedback · LLM alignment · reward hacking · general preference model · online RL · multi-dimensional preferences

The pith

General Preference Reinforcement Learning uses multi-dimensional signals from skew-symmetric subspaces to resist reward hacking in LLM policy updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to bridge online reinforcement learning with verifiable rewards and preference optimization for open-ended tasks by replacing the scalar reward model with a structured General Preference Model (GPM). A sympathetic reader would care because a scalar score lets the policy exploit a single dimension, such as response length, which leads to reward hacking over long training runs. GPRL carries the GPM's k-way structure into the policy update by computing per-dimension advantages, normalizing each on its own scale, and using a drift monitor to reweight dimensions when single-axis exploitation is detected. The paper reports that this yields stable improvements on benchmarks such as AlpacaEval 2.0 without collapse.

Core claim

Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs. It does this by embedding responses into k skew-symmetric subspaces, computing per-dimension group-relative advantages normalized on their own scales, aggregating them with context-dependent eigenvalues, and employing a closed-loop drift monitor that detects and corrects single-axis exploitation.

What carries the argument

The k-way structure of the General Preference Model is carried into the policy update: per-dimension advantages are normalized separately and aggregated using context-dependent eigenvalues, so that no single axis dominates.
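
As a rough illustration of the normalization and aggregation step described above, the sketch below works on toy data. It assumes that each dimension supplies a pairwise score matrix over a group of sampled responses, that a response's per-dimension score is its mean score against the group, and that fixed weights stand in for the context-dependent eigenvalues. The function names and these choices are illustrative and are not taken from the paper.

```python
import statistics


def per_dimension_advantages(score_matrices, eps=1e-8):
    """score_matrices[l][i][j]: dimension-l score of response i over response j."""
    advantages = []
    for matrix in score_matrices:
        # Mean score of each response against its group (assumed reduction).
        row_means = [sum(row) / len(row) for row in matrix]
        mu = statistics.fmean(row_means)
        sigma = statistics.pstdev(row_means)
        # Normalize on this dimension's own scale.
        advantages.append([(r - mu) / (sigma + eps) for r in row_means])
    return advantages


def aggregate(advantages, weights):
    """Weighted average across dimensions; weights are hypothetical stand-ins
    for the context-dependent eigenvalues."""
    total = sum(weights)
    group_size = len(advantages[0])
    return [
        sum(w * adv[i] for w, adv in zip(weights, advantages)) / total
        for i in range(group_size)
    ]


# Two dimensions with the same ranking but a 100x difference in raw scale.
base = [[0.0, 1.0, 2.0], [-1.0, 0.0, 1.0], [-2.0, -1.0, 0.0]]
scaled = [[100.0 * s for s in row] for row in base]

per_dim = per_dimension_advantages([base, scaled])
combined = aggregate(per_dim, weights=[0.5, 0.5])
```

In this toy case the two dimensions end up with nearly identical advantages even though their raw scores differ by a factor of 100, which is the sense in which per-dimension normalization keeps a large-scale axis from dominating the aggregate.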

If this is right

  • GPRL achieves higher length-controlled win rates on AlpacaEval 2.0 compared to baselines.
  • It maintains or improves performance on Arena-Hard, MT-Bench, and WildBench even after extended training.
  • The drift monitor allows correction of exploitation by reweighting dimensions and tightening the trust region.
  • Policy updates remain stable because no single preference dimension can overwhelm the signal.
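
One way to picture the drift monitor mentioned in the list above is the sketch below. Everything concrete in it is an assumption made for illustration: the variance profile is taken to be each dimension's share of raw score variance, drift is measured as total variation distance from a reference profile, the threshold `tau` and the factor `beta_scale` are arbitrary, and `beta` is treated as a KL-style coefficient whose increase tightens the trust region. The paper's actual definitions are not given in this summary.

```python
import statistics


def variance_profile(raw_scores):
    """raw_scores[l] holds raw dimension-l scores for one group of responses."""
    variances = [statistics.pvariance(dim) for dim in raw_scores]
    total = sum(variances)
    if total == 0:
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]


def drift(profile, reference):
    # Total variation distance between profiles (an assumed choice of D).
    return 0.5 * sum(abs(p - q) for p, q in zip(profile, reference))


def controller_step(profile, reference, weights, beta, tau=0.2, beta_scale=2.0):
    """Return (weights, beta, drift); only intervene when drift exceeds tau."""
    d = drift(profile, reference)
    if d <= tau:
        return weights, beta, d
    # Shift weight away from dimensions whose variance share has grown.
    adjusted = [w * q / max(p, 1e-12) for w, p, q in zip(weights, profile, reference)]
    norm = sum(adjusted)
    # A larger beta is assumed to mean a tighter trust region.
    return [a / norm for a in adjusted], beta * beta_scale, d


reference = [0.25, 0.25, 0.25, 0.25]
uniform = [0.25, 0.25, 0.25, 0.25]

healthy = controller_step([0.26, 0.24, 0.25, 0.25], reference, uniform, beta=0.05)
hacked = controller_step([0.70, 0.10, 0.10, 0.10], reference, uniform, beta=0.05)
```

In the `hacked` case the profile has collapsed toward the first dimension, so the controller lowers that dimension's weight and raises `beta`; in the `healthy` case it leaves both unchanged.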

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this to even longer training or different model sizes could show whether the resistance to hacking scales with model capacity.
  • This approach may generalize to other multi-objective RL settings where balancing criteria prevents mode collapse.
  • If the skew-symmetric embedding captures intransitivities effectively, it could improve preference modeling in domains with cyclic human judgments.

Load-bearing premise

The assumption that embedding responses into k skew-symmetric subspaces and aggregating per-dimension advantages with context-dependent eigenvalues produces a preference signal that is meaningfully more resistant to single-axis exploitation than a scalar reward model.
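
To make the skew-symmetric part of this premise concrete, the sketch below scores pairs of embeddings with one 2-D skew-symmetric form per subspace. The specific bilinear form, the sign convention (positive means the first response is preferred), and the toy unit-vector embeddings are assumptions chosen for illustration rather than the paper's exact construction.

```python
import math


def subspace_scores(u, v):
    """Per-subspace skew-symmetric scores for two embeddings of length 2k.

    Each consecutive pair of coordinates is treated as one 2-D subspace, and
    the score in that subspace is u_x * v_y - u_y * v_x.
    """
    return [u[i] * v[i + 1] - u[i + 1] * v[i] for i in range(0, len(u), 2)]


def unit(angle_deg):
    angle = math.radians(angle_deg)
    return [math.cos(angle), math.sin(angle)]


# Three toy responses embedded 120 degrees apart in a single subspace (k = 1).
a, b, c = unit(0), unit(120), unit(240)

a_over_b = subspace_scores(a, b)[0]
b_over_c = subspace_scores(b, c)[0]
c_over_a = subspace_scores(c, a)[0]
```

All three scores are positive, so under this convention a beats b, b beats c, and c beats a. A single scalar reward cannot represent such a cycle, while swapping the arguments of `subspace_scores` always flips the sign of the result.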

What would settle it

The claim would be undercut if extended GPRL training runs showed the same reward-hacking degradation on benchmarks as scalar-reward methods, or if the per-dimension advantages became imbalanced despite the normalization.

Figures

Figures reproduced from arXiv: 2605.18721 by Ahsan Bilal, Andreas Haupt, Arslan Chaudhry, Emily Fox, John M. Cioffi, Muhammad Ahmed Mohsin, Muhammad Umer, Sanmi Koyejo.

Figure 1
Landscape of LLM post-training methods, organized by supervision source and training regime. Online RL with a scalar RM reaches open-ended tasks but suffers reward hacking; GPRL fills the gap with a structured, multi-dimensional reward.
Figure 2
Overview of GPRL. The policy πθ samples G responses per prompt, GPM embeds them, and R≻ produces k pairwise score matrices that yield per-dimension advantages. The aggregate drives the GRPO-style clipped surrogate, while a drift monitor D(t) adapts the dimensional weights and β to suppress reward hacking.
Figure 3
Dimensional drift distinguishes healthy training from reward hacking. (a) The variance profile α(t) holds its initial shape on a healthy GPRL run. (b) Under hacking, it collapses onto a single dimension l⋆. (c) D(t) stays near zero on the healthy run and crosses τ at t′ on the hacked one, allowing the corrected trajectory to engage the controller at t′ and pull back as the profile rebalances, while a …
Figure 4
Scaling and per-category breakdown. (a) AlpacaEval 2.0 LC WR across five training epochs at both reward-model scales, with the controller enabled holding near its peak through epoch 5 and the controller disabled degrading once drift develops. (b, c) Per-category scores on MT-Bench and WildBench, where GPRL leads on the categories that match the supervision and on structural categories while remaining with…
Original abstract

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the k-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
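
The abstract speaks of a trust region, and the Figure 2 caption of a GRPO-style clipped surrogate. For readers unfamiliar with that objective, the sketch below shows the standard PPO-style clipped term computed from an aggregated advantage. It is a simplified illustration that assumes one log-probability per response (sequence level) and omits any KL penalty; the paper's exact objective is not reproduced on this page, and the names here are hypothetical.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over a group."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # importance ratio between new and old policy
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # The min keeps the pessimistic value, so large ratio moves earn no bonus.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# A response whose probability rose by 50% under the new policy.
gain_capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
loss_uncapped = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])
```

With a positive advantage the contribution is capped at `1 + clip_eps` times the advantage, while with a negative advantage the unclipped, more pessimistic value is kept. A smaller `clip_eps` is one common way to tighten the trust region.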

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces General Preference Reinforcement Learning (GPRL), which builds on the General Preference Model (GPM), a model that embeds responses into k skew-symmetric subspaces, and carries that k-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each axis independently, aggregates them using context-dependent eigenvalues, and incorporates a closed-loop drift monitor to detect and mitigate single-axis exploitation. Starting from Llama-3-8B-Instruct, the method is reported to achieve a length-controlled win rate of 56.51% on AlpacaEval 2.0 while outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench through sustained resistance to reward hacking over extended training.

Significance. If the multi-dimensional structure and drift monitor demonstrably prevent collapse to exploitable axes, the work could help bridge online RL with verifiable rewards and preference optimization for open-ended LLM tasks. The reported benchmark gains suggest a practical path beyond scalar reward models, provided the independence of subspaces and effectiveness of the monitor are substantiated.

major comments (3)
  1. [Abstract] The central claim that GPRL resists reward hacking across extended runs rests on the k-subspace embedding, per-dimension normalization, and drift monitor, yet the text provides no direct supporting evidence such as eigenvalue trajectories, subspace correlation metrics, or ablations isolating the monitor's contribution; this is load-bearing for the outperformance and resistance assertions.
  2. [Method description, paragraph introducing GPRL] The aggregation via context-dependent eigenvalues is presented as ensuring no single axis dominates, but without a derivation or empirical check showing that the subspaces remain sufficiently independent (e.g., low cross-dimension correlation under policy updates), the structure may still permit coordinated exploitation as raised in the skeptic analysis.
  3. [Drift monitor] The closed-loop correction mechanism (reweighting dimensions and tightening the trust region) is described qualitatively, but the manuscript lacks quantitative results demonstrating its impact on hacking susceptibility or before/after comparisons, leaving the mechanism for sustained multi-dimensional behavior underspecified.
minor comments (2)
  1. [Experimental setup] The choice of k and the procedure for selecting context-dependent eigenvalues are introduced without sensitivity analysis or ablation tables showing robustness to these hyperparameters.
  2. [Results] Ensure all benchmark comparisons include the exact baseline scores and evaluation protocols used for GPRL to allow direct replication of the reported margins.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger direct evidence on the multi-dimensional structure and drift monitor. We address each major comment below and commit to revisions that add the requested empirical support and clarifications without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract] The central claim that GPRL resists reward hacking across extended runs rests on the k-subspace embedding, per-dimension normalization, and drift monitor, yet the text provides no direct supporting evidence such as eigenvalue trajectories, subspace correlation metrics, or ablations isolating the monitor's contribution; this is load-bearing for the outperformance and resistance assertions.

    Authors: We agree that direct visualizations and ablations would make the resistance claim more robust. The sustained outperformance on AlpacaEval 2.0 (56.51% length-controlled win rate) and other benchmarks over extended training, in contrast to degradation observed in SimPO and SPPO, provides supporting evidence through the maintained multi-dimensional behavior. In revision we will add eigenvalue trajectories, subspace correlation metrics, and an ablation isolating the drift monitor. revision: yes

  2. Referee: [Method description, paragraph introducing GPRL] The aggregation via context-dependent eigenvalues is presented as ensuring no single axis dominates, but without a derivation or empirical check showing that the subspaces remain sufficiently independent (e.g., low cross-dimension correlation under policy updates), the structure may still permit coordinated exploitation as raised in the skeptic analysis.

    Authors: The skew-symmetric embedding from the General Preference Model ensures orthogonality by construction, with each dimension representing distinct preference aspects. We will include a brief derivation of the independence property and report empirical cross-dimension correlation values measured during training to confirm they remain low, addressing potential coordinated exploitation. revision: yes

  3. Referee: [Drift monitor] The closed-loop correction mechanism (reweighting dimensions and tightening the trust region) is described qualitatively, but the manuscript lacks quantitative results demonstrating its impact on hacking susceptibility or before/after comparisons, leaving the mechanism for sustained multi-dimensional behavior underspecified.

    Authors: We acknowledge the description is primarily qualitative. In the revised manuscript we will add quantitative metrics on hacking susceptibility before and after monitor activation, along with comparisons showing its role in preserving multi-dimensional behavior across training runs. revision: yes
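
The cross-dimension correlation check that the referee requests and the simulated authors promise could be computed along the following lines. This is a minimal sketch under stated assumptions: it takes one advantage vector per dimension for a single group of responses and reports the most strongly correlated pair using Pearson correlation. The function names and toy numbers are hypothetical and do not come from the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


def max_cross_correlation(per_dim_advantages):
    """Return (|r|, (i, j)) for the most correlated pair of dimensions."""
    best = (0.0, None)
    for i, j in combinations(range(len(per_dim_advantages)), 2):
        r = abs(pearson(per_dim_advantages[i], per_dim_advantages[j]))
        if r > best[0]:
            best = (r, (i, j))
    return best


# Toy advantages: dimensions 0 and 1 move together, dimension 2 does not.
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, -1.0, 1.0, -1.0],
]
strength, pair = max_cross_correlation(toy)
```

A value of `strength` close to 1 for some pair would suggest that two nominally separate dimensions carry the same signal, which is the coordinated-exploitation scenario the referee raises.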

Circularity Check

0 steps flagged

No circularity: GPRL method and benchmark results are independent of fitted inputs or self-referential definitions

Full rationale

The paper introduces GPRL as an extension of the General Preference Model (GPM) that preserves k-subspace structure through per-dimension group-relative advantages, per-axis normalization, context-dependent eigenvalue aggregation, and a closed-loop drift monitor. The central empirical claims are length-controlled win rates (e.g., 56.51% on AlpacaEval 2.0) and outperformance versus SimPO/SPPO on Arena-Hard, MT-Bench, and WildBench. These outcomes are measured on standard external benchmarks whose scoring protocols are defined independently of the paper's parameters or normalizations. No equations or derivations in the abstract reduce the reported performance to quantities defined solely by the method's own fitted values or by a self-citation chain. The multi-dimensional preference signal is presented as a design choice motivated by the limitations of scalar rewards, not as a tautological restatement of the inputs. The derivation chain therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the prior existence of the General Preference Model and on the modeling choice that k-dimensional intransitivity-aware comparisons plus per-axis normalization prevent reward hacking better than scalar rewards. No new physical entities are postulated.

free parameters (2)
  • k (number of subspaces)
    Chosen to embed responses; value not stated in abstract but required for the structured comparison.
  • context-dependent eigenvalues
    Used to aggregate per-dimension advantages; introduced without external derivation or fixed values shown.
axioms (1)
  • domain assumption: Quality of open-ended responses is better captured by multi-dimensional intransitive comparisons than by any scalar proxy.
    Invoked in the motivation for moving from scalar reward models to GPM.

pith-pipeline@v0.9.0 · 5828 in / 1493 out tokens · 40130 ms · 2026-05-22T09:21:00.870945+00:00 · methodology

