pith. machine review for the scientific record.

arxiv: 2605.07105 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.CL · cs.CY · cs.IT · math.IT

Recognition: no theorem link

Theoretical Limits of Language Model Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CY · cs.IT · math.IT
keywords language model alignment · KL-regularized objective · Jeffreys divergence · reward hacking · best-of-N sampling · information-theoretic limits · proxy rewards · covariance estimator

The pith

The maximum reward improvement in KL-regularized language model alignment equals a Jeffreys divergence term that can be estimated directly from base model samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a closed-form upper bound on how much expected reward any alignment procedure can achieve for a given KL-divergence budget from the base model. This bound is governed by the Jeffreys divergence between the base and aligned distributions rather than the square-root KL term used in earlier work, and it can be rewritten as a simple covariance of the reward under the base model. The same analysis quantifies how proxy-reward errors create reward hacking whose size grows as the KL penalty shrinks, yet shows that averaging multiple reward models reduces that gap. Experiments on safety and summarization tasks confirm that best-of-N sampling nearly meets the bound while standard RL methods remain well below it.

Core claim

Under the standard KL-regularized objective, the largest possible increase in expected reward for a fixed KL budget is exactly the Jeffreys divergence between the base-model distribution and the optimally aligned distribution. This quantity is also equal to the covariance between the reward and the log-probability ratio under the base model, which yields an estimator that requires only samples from the unaligned model. When the reward is a noisy proxy, the difference between ideal and realized reward scales with the magnitude of the reward error and is amplified by smaller KL penalties; ensembling several independent proxy rewards shrinks this difference.
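
A minimal sketch of a sample-only estimator of this kind, assuming the tilted-policy form π_{r,λ} ∝ π_base · exp(r/λ) implied by the KL-regularized objective, under which the gain equals Cov_{π_base}(r, e^{r/λ}) / E_{π_base}[e^{r/λ}]. The function name, the Gaussian toy rewards, and the λ values are illustrative, not the paper's code.

    import numpy as np

    def reward_gain_and_kl(rewards, lam):
        """Estimate the maximum reward gain and the KL divergence of the tilted
        policy pi_{r,lam} proportional to pi_base * exp(r/lam), using only rewards
        r(x) scored on samples x ~ pi_base (self-normalized importance weights)."""
        r = np.asarray(rewards, dtype=float)
        s = r / lam
        w = np.exp(s - s.max())
        w /= w.sum()                             # weights proportional to exp(r/lam)
        gain = float(np.sum(w * r) - r.mean())   # covariance form: E_tilt[r] - E_base[r]
        log_z = np.log(np.mean(np.exp(s - s.max()))) + s.max()
        kl = float(np.sum(w * s) - log_z)        # KL(pi_{r,lam} || pi_base)
        return gain, kl

    # Toy usage: rewards scored on base-model samples for a single prompt.
    rng = np.random.default_rng(0)
    base_rewards = rng.normal(size=10_000)
    for lam in (5.0, 1.0, 0.5):
        gain, kl = reward_gain_and_kl(base_rewards, lam)
        print(f"lambda={lam:.2f}  predicted max reward gain={gain:.3f}  KL={kl:.3f}")

For the standard-normal toy rewards used here, the exact values are gain 1/λ and KL 1/(2λ²), which the printed estimates approach as the number of base-model samples grows.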

What carries the argument

The Jeffreys divergence between base and aligned distributions, which supplies the exact maximum reward gain under a KL budget and equals the covariance of reward under the base model.

If this is right

  • Best-of-N sampling approaches the information-theoretic reward limit for moderate KL budgets.
  • Standard RL methods such as PPO and GRPO fall short of the bound and therefore leave reward gains on the table.
  • Reward ensembling reduces the performance gap caused by proxy-reward errors.
  • Alignment potential on a new task can be predicted from base-model samples alone via the covariance estimator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • New alignment algorithms could target the covariance expression directly to close the remaining gap to the bound without increasing inference cost.
  • Tasks with high base-model reward variance will have larger possible alignment gains, offering a way to rank tasks by difficulty before any training.
  • The same bounding technique may apply to other constrained optimization settings in which a divergence penalty is traded against an external score.
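
A toy illustration of the second bullet, under assumptions that are not taken from the paper: two hypothetical tasks are represented only by reward scores on base-model samples, and the sample-only estimator sketched under "Core claim" above is evaluated at a common λ to rank them before any training. Task names and reward distributions are invented for the example.

    import numpy as np

    def predicted_gain(rewards, lam):
        """Sample-only estimate of E_tilt[r] - E_base[r] for the tilt exp(r/lam)."""
        r = np.asarray(rewards, dtype=float)
        w = np.exp((r - r.max()) / lam)
        w /= w.sum()
        return float(np.sum(w * r) - r.mean())

    rng = np.random.default_rng(3)
    tasks = {
        "low-variance task":  rng.normal(0.0, 0.3, size=20_000),   # hypothetical
        "high-variance task": rng.normal(0.0, 1.5, size=20_000),   # hypothetical
    }
    for name, r in sorted(tasks.items(), key=lambda kv: -predicted_gain(kv[1], 1.0)):
        print(f"{name}: predicted gain at lambda=1.0 = {predicted_gain(r, 1.0):.3f}")

The higher-variance task comes out on top, consistent with the intuition that base-model reward spread bounds how much a KL-constrained tilt can buy.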

Load-bearing premise

The KL-regularized objective is taken as the correct formalization of alignment, and the reward function is assumed to exist independently of the sampling process.

What would settle it

An alignment algorithm that produces a higher expected reward than the computed Jeffreys divergence value at the same KL level on a fixed task and reward model.

Figures

Figures reproduced from arXiv: 2605.07105 by Barry-John Theobald, Federico Danieli, Lucas Monteiro Paes, Natalie Mackraz.

Figure 1
Figure 1: Fundamental Limits of Alignment. Reward gain vs. KL divergence from π_base for best-of-N (BoN), GRPO, PPO, and the theoretical limit computed using Def. 1 and Def. 2. Each PPO/GRPO point corresponds to a checkpoint and each BoN point to a value of N. In both cases, BoN closely tracks the theoretical limit, whereas GRPO and PPO remain sub-optimal. The x-axis goes up to the maximum KL measu… view at source ↗
Figure 2
Figure 2: Convergence of the reward-gain estimator. Estimated reward gain (∆̂_n(r, r), Def. 1) on the y-axis as a function of the number of base-model samples n used to estimate it on the x-axis. We vary the KL penalty across λ ∈ {0.05, 0.1, 0.5, 1.0, 5.0}. Shaded bands show 95% confidence intervals using bootstrap from Seaborn [51]. view at source ↗
Figure 3
Figure 3: Convergence of the KL estimator. Estimated KL divergence between the aligned and base models (KL̂(π_{r,λ} ∥ π_base), Def. 2) on the y-axis as a function of the number of samples used to estimate it on the x-axis. The KL penalty varies across λ ∈ {0.05, 0.1, 0.5, 1.0, 5.0}. 95% confidence intervals computed using bootstrap [51]. view at source ↗
Figure 4
Figure 4: Convergence of the Pareto front. Estimated reward gain (Def. 1) vs. KL divergence (Def. 2) Pareto frontier using different numbers of samples per prompt. Shaded bands show 95% confidence intervals using bootstrap from Seaborn [51]. view at source ↗
Figure 5
Figure 5: Hyperparameter sweep for fine-tuning Zephyr 7B SFT Full using GRPO on the Beavertails dataset. The plot shows the reward gain as the trained model deviates from the base model. Abnormalities in the plotted lines are caused by training runs either plateauing before a KL of 9.14 or oscillating within a small range of KL values. view at source ↗
Figure 6
Figure 6: Hyperparameter sweep for fine-tuning Zephyr 7B SFT Full using PPO on the Beavertails dataset. The plot shows the reward gain as the trained model deviates from the base model. Abnormalities in the plotted lines are caused by training runs either plateauing before a KL of 9.14 or oscillating within a small range of KL values. view at source ↗
Figure 7
Figure 7: Hyperparameter sweep for fine-tuning Pythia 1B SFT using GRPO on the TLDR dataset. The plot shows the reward gain as the trained model deviates from the base model. Abnormalities in the plotted lines are caused by training runs either plateauing before a KL of 7.96 or oscillating within a small range of KL values. view at source ↗
Figure 8
Figure 8: Hyperparameter sweep for fine-tuning Pythia 1B SFT using PPO on the TLDR dataset. The plot shows the reward gain as the trained model deviates from the base model. Abnormalities in the plotted lines are caused by training runs either plateauing before a KL of 7.96 or oscillating within a small range of KL values. view at source ↗
read the original abstract

Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives information-theoretic limits on KL-regularized LM alignment, claiming a closed-form expression for the maximum expected reward gain under a fixed KL budget that is governed by a Jeffreys divergence (rather than prior √KL bounds). It reformulates this as a covariance under the base model for practical estimation from samples alone, extends the analysis to proxy-reward settings to bound reward hacking, proves that reward ensembling reduces the hacking gap, and empirically shows that best-of-N approaches the derived frontier on safety and summarization tasks while PPO/GRPO remain suboptimal.

Significance. If the central derivation is exact, the work supplies a precise benchmark for alignment methods, a sample-only estimator of achievable gains, and theoretical justification for ensembling; these are concrete strengths. The empirical Pareto-frontier computation on two tasks is consistent with the theory but limited in scope and detail.

major comments (2)
  1. [§3] §3 (main theorem on optimal reward gain): The claim of a closed-form expression for max_{p: KL(p||p0)≤δ} (E_p[r]−E_{p0}[r]) governed by a Jeffreys term is load-bearing. The optimizing distribution is the exponential tilt p_λ ∝ p0 exp(λ r), yet enforcing exact KL(p_λ||p0)=δ requires solving a monotone scalar equation for λ; if the Jeffreys expression bypasses this solve for arbitrary δ and r, it is either an upper bound, an approximation, or holds only for special cases. The subsequent covariance reformulation inherits the same limitation.
  2. [§4] §4 (proxy-reward and ensembling results): The growth of the ideal-vs-proxy gap with reward error magnitude and decreasing KL penalty is derived from the same optimization; any implicit dependence on λ in the primary result propagates here and must be clarified before the reward-hacking bounds can be treated as exact.
minor comments (2)
  1. [Abstract, §5] The abstract and §5 refer to 'closed-form' without explicitly stating whether the expression is free of numerical root-finding for λ; a short clarifying sentence would remove ambiguity.
  2. [§6] Empirical section: the two tasks are described only at high level; adding the precise reward models, sampling temperatures, and number of base-model samples used for the covariance estimator would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. The points raised about the exactness of the closed-form result in §3 and the λ-dependence in §4 are well-taken. We clarify below that our derivations are exact (not bounds or approximations) and provide a practical sample-based estimator; we will revise the manuscript to make the role of the Lagrange multiplier λ fully explicit.

read point-by-point responses
  1. Referee: [§3] §3 (main theorem on optimal reward gain): The claim of a closed-form expression for max_{p: KL(p||p0)≤δ} (E_p[r]−E_{p0}[r]) governed by a Jeffreys term is load-bearing. The optimizing distribution is the exponential tilt p_λ ∝ p0 exp(λ r), yet enforcing exact KL(p_λ||p0)=δ requires solving a monotone scalar equation for λ; if the Jeffreys expression bypasses this solve for arbitrary δ and r, it is either an upper bound, an approximation, or holds only for special cases. The subsequent covariance reformulation inherits the same limitation.

    Authors: Our central result is exact for arbitrary r and δ. Let p_λ be the exponential tilt with λ chosen so that KL(p_λ || p_0) = δ. Then the maximum reward gain satisfies Δ = D_J(p_λ || p_0) / λ exactly, where D_J is the Jeffreys divergence. This is a closed-form expression governed by the Jeffreys term (in contrast to the looser O(√δ) bounds in prior work). Equivalently, Δ = cov_{p_0}(r, exp(λ r)) / E_{p_0}[exp(λ r)]. The covariance form is directly estimable from base-model samples: draw i.i.d. samples from p_0, compute the associated rewards, then numerically solve for the λ that achieves the target KL budget via Monte-Carlo estimates of the moment-generating function and covariance. The procedure does not bypass the scalar solve for λ, but it yields an exact, sample-only characterization of the achievable frontier. We will add a clarifying paragraph and pseudocode in §3 to state this procedure explicitly. revision: partial

  2. Referee: [§4] §4 (proxy-reward and ensembling results): The growth of the ideal-vs-proxy gap with reward error magnitude and decreasing KL penalty is derived from the same optimization; any implicit dependence on λ in the primary result propagates here and must be clarified before the reward-hacking bounds can be treated as exact.

    Authors: We agree that the proxy-reward and ensembling analyses inherit the same optimizing tilt p_λ from §3. Consequently the ideal-vs-proxy gap and the benefit of ensembling are expressed exactly in terms of the λ (or equivalently the KL budget δ) corresponding to each setting. The growth of the gap with reward error magnitude and with decreasing KL penalty (i.e., smaller β or larger λ) follows directly from the same exponential-tilt expressions. We will revise §4 to state the λ-dependence explicitly in the theorem statements and to note that all bounds are to be understood for a fixed KL constraint. revision: yes
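
An editorial sketch pulling the two responses above together, under toy assumptions not taken from the paper: rewards are scored on base-model samples, the tilt is written as exp(r/λ) with λ the KL penalty (the responses write the same family as exp(λ·r)), the λ matching a KL budget is found by the monotone scalar solve the authors describe, and each proxy is the true reward plus independent Gaussian noise. Names, the budget, and the noise scale are illustrative.

    import numpy as np

    def tilt_weights(score, lam):
        s = score / lam
        w = np.exp(s - s.max())
        return w / w.sum()

    def kl_of_tilt(score, lam):
        """KL(tilted || base) for the tilt exp(score/lam), estimated from base samples."""
        s = score / lam
        log_z = np.log(np.mean(np.exp(s - s.max()))) + s.max()
        return float(np.sum(tilt_weights(score, lam) * s) - log_z)

    def lam_for_budget(score, delta, lo=1e-2, hi=1e2, iters=60):
        """Monotone scalar solve: smaller lam tilts harder and raises the KL,
        so bisect (geometrically) for the lam whose KL matches the budget delta."""
        for _ in range(iters):
            mid = (lo * hi) ** 0.5
            if kl_of_tilt(score, mid) > delta:
                lo = mid
            else:
                hi = mid
        return (lo * hi) ** 0.5

    def realized_true_gain(score, r_true, delta):
        """True-reward gain when aligning to `score` at KL budget delta."""
        w = tilt_weights(score, lam_for_budget(score, delta))
        return float(np.sum(w * r_true) - r_true.mean())

    rng = np.random.default_rng(1)
    n, k, noise_std, budget = 200_000, 4, 1.0, 2.0
    r_true = rng.normal(size=n)                               # true reward on base samples
    proxies = r_true + noise_std * rng.normal(size=(k, n))    # K independent noisy proxies

    ideal = realized_true_gain(r_true, r_true, budget)
    single = realized_true_gain(proxies[0], r_true, budget)
    ensemble = realized_true_gain(proxies.mean(axis=0), r_true, budget)
    print(f"ideal gain at KL={budget}: {ideal:.3f}")
    print(f"single-proxy gain:  {single:.3f}   hacking gap {ideal - single:.3f}")
    print(f"{k}-proxy ensemble:   {ensemble:.3f}   hacking gap {ideal - ensemble:.3f}")

For these Gaussian toys the exact values are √(2δ), √(2δ/(1+σ²)), and √(2δ/(1+σ²/K)) respectively, so the ensemble recovers part of the hacking gap, which is the qualitative content of the ensembling claim.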

Circularity Check

0 steps flagged

No significant circularity; derivation is first-principles optimization

full rationale

The central result follows from standard Lagrange-multiplier optimization of E_p[r] subject to KL(p || p0) ≤ δ, yielding the exponential tilt p_λ ∝ p0 exp(λ r) whose value can be rewritten in terms of Jeffreys divergence or covariance under p0. This is an algebraic identity and equivalent reformulation, not a self-definition or fitted parameter renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are required; the derivation remains self-contained against external information-theoretic benchmarks.
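
A compact restatement of that optimization step, in the p_λ ∝ p_0 exp(λr) parameterization used in the rationale above; this is a sketch of the standard tilting argument, not a transcription of the paper's proof.

    \begin{align*}
    % Tilted maximizer of E_p[r] subject to KL(p || p_0) <= delta
    p_{\lambda}(x) &= \frac{p_0(x)\, e^{\lambda r(x)}}{Z_{\lambda}},
      \qquad Z_{\lambda} = \mathbb{E}_{p_0}\!\left[e^{\lambda r}\right],
      \qquad \lambda \text{ chosen so that } \mathrm{KL}(p_{\lambda}\,\|\,p_0) = \delta, \\[4pt]
    % Jeffreys divergence of the tilt is lambda times the reward gain
    D_J(p_{\lambda}, p_0) &= \mathrm{KL}(p_{\lambda}\,\|\,p_0) + \mathrm{KL}(p_0\,\|\,p_{\lambda})
      = \left(\mathbb{E}_{p_{\lambda}} - \mathbb{E}_{p_0}\right)\!\left[\lambda r - \log Z_{\lambda}\right]
      = \lambda\,\Delta, \\[4pt]
    % Maximum reward gain and its covariance (base-sample) form
    \Delta &= \mathbb{E}_{p_{\lambda}}[r] - \mathbb{E}_{p_0}[r]
      = \frac{D_J(p_{\lambda}, p_0)}{\lambda}
      = \frac{\mathrm{Cov}_{p_0}\!\left(r,\, e^{\lambda r}\right)}{\mathbb{E}_{p_0}\!\left[e^{\lambda r}\right]}.
    \end{align*}

The middle equality uses log(p_λ/p_0) = λr − log Z_λ; the last follows from E_{p_λ}[r] = E_{p_0}[r e^{λr}]/Z_λ.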

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard information theory and optimization assumptions with no new free parameters or invented entities introduced.

axioms (2)
  • domain assumption KL-divergence serves as a valid constraint measuring deviation from the base model distribution in the alignment objective.
    Invoked to define the feasible set for the maximum reward gain derivation.
  • standard math Expectations and divergences are well-defined over the policy and reward distributions.
    Background assumption from probability theory used throughout the information-theoretic analysis.

pith-pipeline@v0.9.0 · 5604 in / 1346 out tokens · 74853 ms · 2026-05-11T01:14:28.283350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 10 internal anchors

  1. [1]

    Scalable ensembling for mitigating reward overoptimisation

    Ahmed M. Ahmed, Rafael Rafailov, Stepan Sharkov, Xuechen Li, and Sanmi Koyejo. Scalable ensembling for mitigating reward overoptimisation, 2024. URL https://arxiv.org/abs/2406.01013

  2. [2]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016. URL https://arxiv.org/abs/1606.06565

  3. [3]

    Claude opus 4.6 system card

    Anthropic. Claude opus 4.6 system card. Technical report, Anthropic, February 2026. URL https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf

  4. [4]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  5. [5]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  6. [6]

    Theoretical guarantees on the best-of-n alignment policy

    Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander Nicholas D’Amour, Jacob Eisenstein, Chirag Nagpal, and Ananda Theertha Suresh. Theoretical guarantees on the best-of-n alignment policy. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=u3U8qzFV7w

  7. [7]

    Managing extreme ai risks amid rapid progress

    Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, et al. Managing extreme ai risks amid rapid progress. Science, 384(6698):842–845, 2024

  8. [8]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373

  9. [9]

    Ai alignment at your discretion

    Maarten Buyl, Hadi Khalaf, Claudio Mayrink Verdun, Lucas Monteiro Paes, Caio Cesar Vieira Machado, and Flavio du Pin Calmon. Ai alignment at your discretion. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’25, page 3046–3074, New York, NY, USA,

  10. [10]

    Ownership, Not Just Happy Talk

    Association for Computing Machinery. ISBN 9798400714825. doi: 10.1145/3715275.3732194. URL https://doi.org/10.1145/3715275.3732194

  11. [11]

    Large language models reflect the ideology of their creators.npj Artificial Intelligence, 2(1), January 2026

    Maarten Buyl, Alexander Rogiers, Sander Noels, Guillaume Bied, Iris Dominguez-Catena, Edith Heiter, Iman Johary, Alexandru-Cristian Mara, Raphaël Romero, Jefrey Lijffijt, and Tijl De Bie. Large language models reflect the ideology of their creators. npj Artificial Intelligence, 2(1), January 2026. ISSN 3005-1460. doi: 10.1038/s44387-025-00048-0. URL http://...

  12. [12]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  13. [13]

    URL https://arxiv.org/abs/2107.03374

  14. [14]

    Deep reinforcement learning from human preferences

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 4302–4310, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964

  15. [15]

    Eleutherai_pythia-1b-deduped__reward__tldr (reward model), 2023

    CleanRL. Eleutherai_pythia-1b-deduped__reward__tldr (reward model), 2023. URL https://huggingface.co/cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr

  16. [16]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  17. [17]

    Reward model ensembles help mitigate overoptimization

    Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743, 2023

  18. [18]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, Singapore, Dec...

  19. [19]

    Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking

    Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alexander Nicholas D’Amour, Krishnamurthy Dj Dvijotham, Adam Fisch, Katherine A Heller, Stephen Robert Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant. Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking. In First Conference on Language Modeling...

  20. [20]

    pythia-1b-deduped-tldr-sft, 2024

    TRL (Hugging Face). pythia-1b-deduped-tldr-sft, 2024. URL https://huggingface.co/trl-lib/pythia-1b-deduped-tldr-sft

  21. [21]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  22. [22]

    BoNBon alignment for large language models and the sweetness of best-of-n sampling

    Lin Gui, Cristina Garbacea, and Victor Veitch. BoNBon alignment for large language models and the sweetness of best-of-n sampling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=haSKMlrbX5

  23. [23]

    Values in the wild: Discovering and analyzing values in real-world language model interactions

    Saffron Huang, Esin Durmus, Miles McCain, Kunal Handa, Alex Tamkin, Jerry Hong, Michael Stern, Arushi Somani, Xiuruo Zhang, and Deep Ganguli. Values in the wild: Discovering and analyzing values in real-world language model interactions, 2025. URL https://arxiv.org/abs/2504.15236

  24. [24]

    The n+ implementation details of rlhf with ppo: A case study on tl;dr summarization, 2024

    Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, and Lewis Tunstall. The n+ implementation details of rlhf with ppo: A case study on tl;dr summarization, 2024. URL https://arxiv.org/abs/2403.17031

  25. [25]

    The theory of probability

    Harold Jeffreys. The theory of probability. OuP Oxford, 1998

  26. [26]

    Beavertails: Towards improved safety alignment of llm via a human-preference dataset

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset, 2023. URL https://arxiv.org/abs/2307.04657

  27. [27]

    Ai alignment: A comprehensive survey

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023

  28. [28]

    Pku-saferlhf: Towards multi-level safety alignment for llms with human preference

    Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Juntao Dai, Boren Zheng, Tianyi Qiu, Jiayi Zhou, Kaile Wang, Boxuan Li, Sirui Han, Yike Guo, and Yaodong Yang. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference, 2025. URL https://arxiv.org/abs/2406.15513

  29. [29]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxiv...

  30. [30]

    Shopping MMLU: A massive multi-task online shopping benchmark for large language models

    Yilun Jin, Zheng Li, Chenwei Zhang, Tianyu Cao, Yifan Gao, Pratik Sridatt Jayarao, Mao Li, Xin Liu, Ritesh Sarkhel, Xianfeng Tang, Haodong Wang, Zhengyang Wang, Wenju Xu, Jingfeng Yang, Qingyu Yin, Xian Li, Priyanka Nigam, Yi Xu, Kai Chen, Qiang Yang, Meng Jiang, and Bing Yin. Shopping MMLU: A massive multi-task online shopping benchmark for large langua...

  31. [31]

    Watch your language: Investigating content moderation with large language models

    Deepak Kumar, Yousef Anees AbuHashem, and Zakir Durumeric. Watch your language: Investigating content moderation with large language models. In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 865–878, 2024

  32. [32]

    Information theoretic guarantees for policy alignment in large language models

    Youssef Mroueh and Apoorva Nitsure. Information theoretic guarantees for policy alignment in large language models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=Uz9J77Riul

  33. [33]

    Controlled decoding from language models

    Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, and Ahmad Beirami. Controlled decoding from language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  34. [34]

    Using gpt-4 for content moderation, 2023

    OpenAI. Using gpt-4 for content moderation, 2023. URL https://openai.com/index/using-gpt-4-for-content-moderation. Accessed: 2024-05-01

  35. [35]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  36. [36]

    Dso: Direct steering optimization for bias mitigation

    Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina, Barry-John Theobald, Luca Zappella, and Nicholas Apostoloff. Dso: Direct steering optimization for bias mitigation, 2026. URL https://arxiv.org/abs/2512.15926

  37. [37]

    Ai deception: A survey of examples, risks, and potential solutions

    Peter S Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. Ai deception: A survey of examples, risks, and potential solutions. Patterns, 5(5), 2024

  38. [38]

    Pre-trained models for natural language processing: A survey

    XiPeng Qiu, TianXiang Sun, YiGe Xu, YunFan Shao, Ning Dai, and XuanJing Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897, September 2020. ISSN 1869-1900. doi: 10.1007/s11431-020-1647-3. URL http://dx.doi.org/10.1007/s11431-020-1647-3

  39. [39]

    Direct preference optimization: your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc

  40. [40]

    Mismatched guesswork, 2019

    Salman Salamatian, Litian Liu, Ahmad Beirami, and Muriel Médard. Mismatched guesswork, 2019. URL https://arxiv.org/abs/1907.00531

  41. [41]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  42. [42]

    Bond: Aligning llms with best-of-n distillation

    Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shariari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, and Olivier Bachem. Bond: Aligning llms with best-of...

  43. [43]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  44. [45]

    Learning to summarize from human feedback

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

  45. [46]

    Machine-assisted proof

    Terence Tao. Machine-assisted proof. Notices of the American Mathematical Society, 72(1):6–13, 2025

  46. [47]

    Yu, and Jianfeng Gao

    Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, 13 Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, and Jianfeng Gao. A survey on post-training ...

  47. [48]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  48. [49]

    Tl;dr dataset for trl

    TRL Team. Tl;dr dataset for trl. https://huggingface.co/datasets/trl-lib/tldr, 2025

  49. [50]

    Zephyr: Direct distillation of lm alignment

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment, 2023. URL https://arxiv.org/abs/2310.16944

  50. [51]

    Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, and Flavio P. Calmon. Soft best-of-n sampling for model alignment, 2025. URL https://arxiv.org/abs/2505.03156

  51. [52]

    TRL: Transformers Reinforcement Learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

  52. [53]

    Michael L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021

  53. [54]

    Asymptotics of language model alignment

    Joy Qiping Yang, Salman Salamatian, Ziteng Sun, Ananda Theertha Suresh, and Ahmad Beirami. Asymptotics of language model alignment. In 2024 IEEE International Symposium on Information Theory (ISIT), pages 2027–2032, 2024. doi: 10.1109/ISIT57864.2024.10619456

  54. [55]

    Webshop: towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: towards scalable real-world web interaction with grounded language agents. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088

  55. [56]

    Improving reinforcement learning from human feedback with efficient reward model ensemble, 2024

    Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, and Chuang Gan. Improving reinforcement learning from human feedback with efficient reward model ensemble, 2024. URL https://arxiv.org/abs/2401.16635

  56. [57]

    Calibrating sequence likelihood improves conditional language generation

    Yao Zhao, Mikhail Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J Liu. Calibrating sequence likelihood improves conditional language generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=0qSOodKmJaN
