Theoretical Limits of Language Model Alignment
Pith reviewed 2026-05-11 01:14 UTC · model grok-4.3
The pith
The maximum reward improvement in KL-regularized language model alignment equals a Jeffreys divergence term that can be estimated directly from base model samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the standard KL-regularized objective, the largest possible increase in expected reward for a fixed KL budget is given exactly by the Jeffreys divergence between the base-model distribution and the optimally aligned distribution, scaled by the inverse of the tilt parameter that meets the budget. This quantity also equals the covariance, under the base model, between the reward and the aligned-to-base likelihood ratio, which yields an estimator that requires only samples from the unaligned model. When the reward is a noisy proxy, the gap between ideal and realized reward scales with the magnitude of the reward error and is amplified by smaller KL penalties; ensembling several independent proxy rewards shrinks this gap.
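To make the claimed identity concrete, here is a minimal derivation sketch consistent with the core claim and with the authors' rebuttal below; it assumes only that the optimally aligned distribution is the exponential tilt p_λ ∝ p0 exp(λ r), with λ set by the KL budget.

```latex
p_\lambda(x) = \frac{p_0(x)\, e^{\lambda r(x)}}{Z_\lambda},
\qquad
\log\frac{p_\lambda}{p_0} = \lambda r - \log Z_\lambda ,
\\[4pt]
D_J(p_\lambda \,\|\, p_0)
 = \mathrm{KL}(p_\lambda \| p_0) + \mathrm{KL}(p_0 \| p_\lambda)
 = \lambda \big( \mathbb{E}_{p_\lambda}[r] - \mathbb{E}_{p_0}[r] \big),
\\[4pt]
\Delta
 \;=\; \mathbb{E}_{p_\lambda}[r] - \mathbb{E}_{p_0}[r]
 \;=\; \tfrac{1}{\lambda}\, D_J(p_\lambda \,\|\, p_0)
 \;=\; \frac{\operatorname{cov}_{p_0}\!\big(r,\, e^{\lambda r}\big)}{\mathbb{E}_{p_0}\!\big[e^{\lambda r}\big]} .
```

The last equality follows from E_{p_λ}[r] = E_{p0}[r e^{λr}] / E_{p0}[e^{λr}], which is why every term is estimable from base-model samples once λ is known.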
What carries the argument
The Jeffreys divergence between the base and optimally aligned distributions, which supplies the exact maximum reward gain under a KL budget and can be rewritten as a covariance involving the reward under the base model.
If this is right
- Best-of-N sampling approaches the information-theoretic reward limit for moderate KL budgets.
- Standard RL methods such as PPO and GRPO fall short of the bound and therefore leave reward gains on the table.
- Reward ensembling reduces the performance gap caused by proxy-reward errors.
- Alignment potential on a new task can be predicted from base-model samples alone via the covariance estimator.
Where Pith is reading between the lines
- New alignment algorithms could target the covariance expression directly to close the remaining gap to the bound without increasing inference cost.
- Tasks with high base-model reward variance will have larger possible alignment gains, offering a way to rank tasks by difficulty before any training.
- The same bounding technique may apply to other constrained optimization settings in which a divergence penalty is traded against an external score.
Load-bearing premise
The KL-regularized objective is taken as the correct formalization of alignment, and the reward function is assumed to exist independently of the sampling process.
What would settle it
An alignment algorithm that produces a higher expected reward than the computed Jeffreys divergence value at the same KL level on a fixed task and reward model.
Original abstract
Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives information-theoretic limits on KL-regularized LM alignment, claiming a closed-form expression for the maximum expected reward gain under a fixed KL budget that is governed by a Jeffreys divergence (rather than prior √KL bounds). It reformulates this as a covariance under the base model for practical estimation from samples alone, extends the analysis to proxy-reward settings to bound reward hacking, proves that reward ensembling reduces the hacking gap, and empirically shows that best-of-N approaches the derived frontier on safety and summarization tasks while PPO/GRPO remain suboptimal.
Significance. If the central derivation is exact, the work supplies a precise benchmark for alignment methods, a sample-only estimator of achievable gains, and theoretical justification for ensembling; these are concrete strengths. The empirical Pareto-frontier computation on two tasks is consistent with the theory but limited in scope and detail.
major comments (2)
- [§3] Main theorem on optimal reward gain: The claim of a closed-form expression for max_{p: KL(p||p0)≤δ} (E_p[r]−E_{p0}[r]) governed by a Jeffreys term is load-bearing. The optimizing distribution is the exponential tilt p_λ ∝ p0 exp(λ r), yet enforcing exact KL(p_λ||p0)=δ requires solving a monotone scalar equation for λ; if the Jeffreys expression bypasses this solve for arbitrary δ and r, it is either an upper bound, an approximation, or holds only for special cases. The subsequent covariance reformulation inherits the same limitation.
- [§4] Proxy-reward and ensembling results: The growth of the ideal-vs-proxy gap with reward error magnitude and decreasing KL penalty is derived from the same optimization; any implicit dependence on λ in the primary result propagates here and must be clarified before the reward-hacking bounds can be treated as exact.
minor comments (2)
- [Abstract, §5] The abstract and §5 refer to 'closed-form' without explicitly stating whether the expression is free of numerical root-finding for λ; a short clarifying sentence would remove ambiguity.
- [§6] Empirical section: the two tasks are described only at high level; adding the precise reward models, sampling temperatures, and number of base-model samples used for the covariance estimator would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. The points raised about the exactness of the closed-form result in §3 and the λ-dependence in §4 are well-taken. We clarify below that our derivations are exact (not bounds or approximations) and provide a practical sample-based estimator; we will revise the manuscript to make the role of the Lagrange multiplier λ fully explicit.
Point-by-point responses
- Referee: [§3] Main theorem on optimal reward gain: The claim of a closed-form expression for max_{p: KL(p||p0)≤δ} (E_p[r]−E_{p0}[r]) governed by a Jeffreys term is load-bearing. The optimizing distribution is the exponential tilt p_λ ∝ p0 exp(λ r), yet enforcing exact KL(p_λ||p0)=δ requires solving a monotone scalar equation for λ; if the Jeffreys expression bypasses this solve for arbitrary δ and r, it is either an upper bound, an approximation, or holds only for special cases. The subsequent covariance reformulation inherits the same limitation.
Authors: Our central result is exact for arbitrary r and δ. Let p_λ be the exponential tilt with λ chosen so that KL(p_λ || p_0) = δ. Then the maximum reward gain satisfies Δ = D_J(p_λ || p_0) / λ exactly, where D_J is the Jeffreys divergence. This is a closed-form expression governed by the Jeffreys term (in contrast to the looser O(√δ) bounds in prior work). Equivalently, Δ = cov_{p_0}(r, exp(λ r)) / E_{p_0}[exp(λ r)]. The covariance form is directly estimable from base-model samples: draw i.i.d. samples from p_0, compute the associated rewards, then numerically solve for the λ that achieves the target KL budget via Monte-Carlo estimates of the moment-generating function and covariance. The procedure does not bypass the scalar solve for λ, but it yields an exact, sample-only characterization of the achievable frontier. We will add a clarifying paragraph and pseudocode in §3 to state this procedure explicitly. revision: partial
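As a concrete reading of the procedure described in this response, here is a minimal sample-only sketch (helper names are ours, not the paper's; assumes only numpy/scipy): score i.i.d. base-model samples, solve the monotone scalar equation KL(p_λ||p0)=δ for λ, then evaluate the covariance-form gain.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import logsumexp

def tilted_stats(rewards, lam):
    """Monte-Carlo estimates under the exponential tilt p_lam ~ p0 * exp(lam*r),
    computed from rewards of i.i.d. base-model samples via self-normalized weights."""
    logw = lam * rewards
    log_z = logsumexp(logw) - np.log(rewards.size)   # log E_p0[exp(lam*r)]
    w = np.exp(logw - logsumexp(logw))               # normalized tilt weights
    tilted_mean = float(np.sum(w * rewards))         # estimate of E_{p_lam}[r]
    kl = lam * tilted_mean - log_z                   # estimate of KL(p_lam || p0)
    return tilted_mean, kl

def estimate_max_gain(rewards, delta, lam_hi=50.0):
    """Solve KL(p_lam || p0) = delta for lam (monotone for lam >= 0), then return
    the estimated maximal gain Delta = E_{p_lam}[r] - E_{p0}[r] and lam itself."""
    lam = brentq(lambda l: tilted_stats(rewards, l)[1] - delta, 1e-8, lam_hi)
    tilted_mean, _ = tilted_stats(rewards, lam)
    return tilted_mean - float(rewards.mean()), lam

# Usage: rewards[i] = r(x_i) with x_i drawn i.i.d. from the base model p0.
rewards = np.random.default_rng(0).normal(size=50_000)  # synthetic stand-in scores
gain, lam = estimate_max_gain(rewards, delta=2.0)       # predicted frontier point
```

The scalar solve for λ is explicit here (a bracketed root-find), consistent with the authors' statement that the procedure does not bypass it.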
- Referee: [§4] Proxy-reward and ensembling results: The growth of the ideal-vs-proxy gap with reward error magnitude and decreasing KL penalty is derived from the same optimization; any implicit dependence on λ in the primary result propagates here and must be clarified before the reward-hacking bounds can be treated as exact.
Authors: We agree that the proxy-reward and ensembling analyses inherit the same optimizing tilt p_λ from §3. Consequently the ideal-vs-proxy gap and the benefit of ensembling are expressed exactly in terms of the λ (or equivalently the KL budget δ) corresponding to each setting. The growth of the gap with reward error magnitude and with decreasing KL penalty (i.e., smaller β or larger λ) follows directly from the same exponential-tilt expressions. We will revise §4 to state the λ-dependence explicitly in the theorem statements and to note that all bounds are to be understood for a fixed KL constraint. revision: yes
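A toy numerical illustration of the claimed trends, with made-up Gaussian rewards and noise rather than anything from the paper: at a matched KL budget δ, tilting toward a noisy proxy r + ε realizes less true reward than tilting toward the true reward, and the gap shrinks as K independent proxies are averaged.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import logsumexp

def tilt_weights_at_budget(scores, delta, lam_hi=50.0):
    """Exponential-tilt weights toward `scores`, with lam solving KL(p_lam||p0)=delta."""
    def kl(lam):
        logw = lam * scores
        log_z = logsumexp(logw) - np.log(scores.size)
        w = np.exp(logw - logsumexp(logw))
        return lam * float(np.sum(w * scores)) - log_z
    lam = brentq(lambda l: kl(l) - delta, 1e-8, lam_hi)
    logw = lam * scores
    return np.exp(logw - logsumexp(logw))

rng = np.random.default_rng(0)
r = rng.normal(size=200_000)              # made-up true rewards under p0
delta = 2.0                               # shared KL budget for every policy
ideal = float(np.sum(tilt_weights_at_budget(r, delta) * r) - r.mean())
for K in (1, 4, 16):                      # ensemble = mean of K independent proxies
    proxy = r + rng.normal(scale=0.5, size=(K, r.size)).mean(axis=0)
    realized = float(np.sum(tilt_weights_at_budget(proxy, delta) * r) - r.mean())
    print(K, round(ideal - realized, 3))  # hacking gap; shrinks as K grows
```

Matching KL budgets rather than λ is the point: the proxy tilt spends part of its budget chasing noise, so its realized true-reward gain falls below the ideal frontier, and averaging proxies reduces the noise variance by 1/K.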
Circularity Check
No significant circularity; derivation is first-principles optimization
Full rationale
The central result follows from standard Lagrange-multiplier optimization of E_p[r] subject to KL(p || p0) ≤ δ, yielding the exponential tilt p_λ ∝ p0 exp(λ r) whose value can be rewritten in terms of Jeffreys divergence or covariance under p0. This is an algebraic identity and equivalent reformulation, not a self-definition or fitted parameter renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are required; the derivation remains self-contained against external information-theoretic benchmarks.
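For reference, the standard Lagrange-multiplier step the rationale appeals to, sketched in textbook form (not quoted from the paper's proof):

```latex
L(p) = \mathbb{E}_p[r] - \tfrac{1}{\lambda}\,\mathrm{KL}(p \,\|\, p_0)
       + \mu \Big( \textstyle\sum_x p(x) - 1 \Big),
\\[4pt]
\frac{\partial L}{\partial p(x)}
 = r(x) - \tfrac{1}{\lambda}\Big( \log\tfrac{p(x)}{p_0(x)} + 1 \Big) + \mu = 0
\;\Longrightarrow\;
p(x) \,\propto\, p_0(x)\, e^{\lambda r(x)},
```

i.e. the exponential tilt p_λ, with λ chosen so that the KL constraint binds.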
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: KL divergence is a valid measure of deviation from the base-model distribution in the alignment objective.
- standard math: Expectations and divergences are well-defined over the policy and reward distributions.