How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

Akifumi Wachi; Kohei Miyaguchi; Rei Higuchi; Ryotaro Kawata; Shokichi Takakura; Taiji Suzuki

arxiv: 2605.24749 · v1 · pith:574LWMSMnew · submitted 2026-05-23 · 📊 stat.ML · cs.LG

Rei Higuchi, Ryotaro Kawata, Akifumi Wachi, Shokichi Takakura, Kohei Miyaguchi, Taiji Suzuki

Pith reviewed 2026-06-30 11:58 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords: reward modeling · single-index model · feature learning · policy optimization · temperature scaling · value gap bounds · neural networks · KL-regularized optimization

The pith

Above a constant temperature threshold, neural reward models recover the hidden direction in a single-index model and bound the value gap of the exponentiated policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines reward modeling inside KL-regularized policy optimization, where the learned reward is exponentiated to shape the final policy and downstream value depends on errors in high-reward regions. It works in the Gaussian single-index model in which the true reward is an unknown function of the projection onto one hidden vector. A two-stage neural network first learns the hidden direction from exponentially weighted samples and then fits the readout by weighted ridge regression. When the feature-learning temperature exceeds a dimension-free O(1) threshold, a constant fraction of neurons recover the direction; the paper then supplies explicit value-gap bounds that track the deployment temperature and compare ideal label weighting against practical surrogate weighting.
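
For orientation, the "exponentiated" policy mentioned above is the standard closed-form solution of a KL-regularized objective. The display below is a sketch under assumptions: the base policy \pi_0, the notation \pi_r^{\beta_2}, and the use of the deployment temperature \beta_2 as the KL coefficient are introduced here for illustration and are not taken from the paper.

```latex
\pi_r^{\beta_2}
  \;=\; \arg\max_{\pi}\;
  \Big\{ \mathbb{E}_{x\sim\pi}\big[r(x)\big] \;-\; \beta_2\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big) \Big\},
\qquad
\pi_r^{\beta_2}(x)
  \;=\; \frac{\pi_0(x)\, e^{r(x)/\beta_2}}
             {\mathbb{E}_{x'\sim\pi_0}\!\big[e^{r(x')/\beta_2}\big]} .
```

Because a learned reward enters only through e^{r(x)/\beta_2}, its errors count most where the reward is high, and more so as \beta_2 decreases.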

Core claim

In the Gaussian single-index model r^*(x) = σ^*(⟨θ^*, x⟩) with x ∼ N(0, I_d), the two-stage neural reward model recovers the hidden direction θ^* in a constant fraction of neurons for any feature-learning temperature β1 above a dimension-free O(1) threshold, with weak-recovery complexity governed by the generative exponent. After recovery, weighted ridge regression on the readout layer produces tilted-policy value-gap bounds for both an idealized label-weighted fit with weights e^{y/β2} and a practical surrogate-weighted fit with weights e^{r_{a_0}(x)/β2}. Keeping β2 explicit identifies an admissible set of deployment temperatures that trades the gain from lowering β2 against the learning cost amplified by exponential weighting.
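
To make "value gap of the tilted policy" concrete, here is a minimal numerical sketch. It rests on several assumptions that are not from the paper: the gap is measured with the KL-regularized objective (the paper's exact definition may differ), the base policy is the empirical distribution of Gaussian samples, the link is tanh, and `r_model` is a hypothetical imperfect reward model.

```python
import numpy as np

def tilted_log_probs(reward, beta):
    """Log-probabilities of the policy proportional to exp(reward / beta) over the samples."""
    logits = reward / beta
    shifted = logits - logits.max()  # stabilize the exponentials
    return shifted - np.log(np.sum(np.exp(shifted)))

def regularized_value(r_true, r_model, beta):
    """KL-regularized value of the policy tilted by r_model, scored with r_true.

    Assumption: the base policy is uniform over the given samples.
    """
    n = r_true.shape[0]
    log_p = tilted_log_probs(r_model, beta)
    p = np.exp(log_p)
    kl_to_base = np.sum(p * (log_p + np.log(n)))
    return np.sum(p * r_true) - beta * kl_to_base

rng = np.random.default_rng(0)
d, n, beta2 = 10, 50_000, 0.5
theta_star = np.zeros(d)
theta_star[0] = 1.0                      # hypothetical hidden direction
x = rng.standard_normal((n, d))          # x ~ N(0, I_d)
z = x @ theta_star
r_true = np.tanh(z)                      # hypothetical bounded link
r_model = np.tanh(0.8 * z + 0.1)         # hypothetical imperfect reward model

v_star = regularized_value(r_true, r_true, beta2)
v_hat = regularized_value(r_true, r_model, beta2)
gap = v_star - v_hat
print(f"regularized value gap at beta2={beta2}: {gap:.5f}")
```

Under this definition the gap equals beta2 times the KL divergence from the model-tilted policy to the truly optimal tilted policy, so it is nonnegative and vanishes only when the two rewards differ by a constant on the samples.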

What carries the argument

A two-stage neural reward model: the first layer learns the hidden direction from reward-weighted samples at the feature-learning temperature β1, and the readout layer is then fit by weighted ridge regression at a second, deployment temperature β2.
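
The two stages can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's algorithm: stage 1 uses a reward-weighted first-moment estimator as a simple stand-in for gradient-based training of the first layer, the link is tanh, and the ReLU feature map, sample split, and all constants are hypothetical choices.

```python
import numpy as np

def learn_direction(x, y, beta1):
    """Stage 1 stand-in: reward-weighted first moment of x, normalized to unit length."""
    w = np.exp(y / beta1)
    v = (w[:, None] * x).mean(axis=0)
    return v / np.linalg.norm(v)

def features(x, theta_hat, signs, biases):
    """ReLU features of the learned one-dimensional projection <theta_hat, x>."""
    z = x @ theta_hat
    return np.maximum(signs * z[:, None] + biases, 0.0)

def weighted_ridge(phi, y, w, lam):
    """Closed-form minimizer of mean(w * (y - phi @ a)**2) + lam * ||a||^2."""
    n, m = phi.shape
    weighted_phi = phi * w[:, None]
    gram = weighted_phi.T @ phi / n
    rhs = weighted_phi.T @ y / n
    return np.linalg.solve(gram + lam * np.eye(m), rhs)

rng = np.random.default_rng(0)
d, n, m = 20, 20_000, 50
beta1, beta2, lam = 1.0, 1.0, 1e-3       # hypothetical temperatures and ridge penalty

theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
x = rng.standard_normal((n, d))          # x ~ N(0, I_d)
y = np.tanh(x @ theta_star)              # hypothetical bounded link

# Use disjoint halves of the data for the two stages.
x1, y1, x2, y2 = x[: n // 2], y[: n // 2], x[n // 2 :], y[n // 2 :]

theta_hat = learn_direction(x1, y1, beta1)
alignment = abs(theta_hat @ theta_star)

signs = rng.choice([-1.0, 1.0], size=m)
biases = rng.uniform(-2.0, 2.0, size=m)
phi = features(x2, theta_hat, signs, biases)
w = np.exp(y2 / beta2)                   # idealized label weights e^{y / beta2}
a_hat = weighted_ridge(phi, y2, w, lam)

weighted_mse = np.average((y2 - phi @ a_hat) ** 2, weights=w)
baseline = np.average(y2 ** 2, weights=w)  # weighted error of predicting zero
print(f"alignment={alignment:.3f}, weighted_mse={weighted_mse:.4f}, baseline={baseline:.4f}")
```

Replacing `w` with weights computed from a proxy reward evaluated at `x2` would give the surrogate-weighted variant; the proxy itself would be an additional assumption.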

If this is right

For β1 above the threshold a constant fraction of neurons recover the hidden direction.
Weak-recovery complexity is governed by the generative exponent of the link function.
Value-gap bounds hold after recovery for both label-weighted and surrogate-weighted readout fits.
An admissible interval for the deployment temperature β2 balances gain from lower temperature against amplified learning cost.
In the surrogate-weighted case, proxy-dependent factors shrink the admissible interval for β2.
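
One way to see why lowering the temperature amplifies the learning cost is to track how unevenly exponential weights spread over a sample. The sketch below uses the Kish effective sample size as a rough proxy for that cost; this proxy, the link z^2 - 1, and the temperature grid are assumptions made here for illustration, not quantities from the paper.

```python
import numpy as np

def effective_sample_size(y, beta):
    """Kish effective sample size of the weights exp(y / beta)."""
    log_w = y / beta
    w = np.exp(log_w - log_w.max())  # rescaling does not change the ratio below
    return w.sum() ** 2 / np.sum(w ** 2)

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
y = z ** 2 - 1.0                     # hypothetical polynomial link

betas = [8.0, 4.0, 2.0, 1.0, 0.5]
ess = [effective_sample_size(y, b) for b in betas]
for b, e in zip(betas, ess):
    print(f"beta={b:>4}: effective sample size = {e:,.1f}")
```

For any sample with non-constant labels the effective sample size shrinks as the temperature drops, so a sharper policy is bought with fewer effectively used samples in the weighted fit.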

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The separation of feature-learning and deployment temperatures suggests staged tuning may improve final policy performance without changing the model architecture.
When true labels are unavailable, the surrogate-weighted bounds indicate how much the admissible temperature range shrinks relative to the ideal case.
The single-index recovery mechanism could be tested by measuring neuron alignment statistics on trained reward models from standard RL benchmarks.

Load-bearing premise

The true reward is generated exactly by an unknown function of the inner product with one fixed hidden vector and inputs are drawn from a standard Gaussian.

What would settle it

Training the two-stage network on single-index data and finding that fewer than a constant fraction of first-layer neurons align with the true direction when the feature-learning temperature exceeds the O(1) threshold would falsify the recovery claim.
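
The statistic this test calls for can be computed directly from a first-layer weight matrix. The sketch below is a hypothetical measurement helper: the cosine threshold of 0.9, the use of absolute cosine (ignoring sign), and the synthetic "trained" weights are all assumptions for illustration, not values from the paper.

```python
import numpy as np

def aligned_fraction(weights, theta_star, threshold=0.9):
    """Fraction of rows of `weights` whose absolute cosine with theta_star is >= threshold."""
    w_unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    t_unit = theta_star / np.linalg.norm(theta_star)
    cosines = np.abs(w_unit @ t_unit)
    return float(np.mean(cosines >= threshold))

rng = np.random.default_rng(0)
d, n_neurons = 100, 200
theta_star = np.zeros(d)
theta_star[0] = 1.0

# Random initialization: in high dimension, no neuron is aligned by chance.
random_init = rng.standard_normal((n_neurons, d))

# Synthetic stand-in for a trained layer: half the neurons sit close to theta_star.
trained = random_init.copy()
trained[: n_neurons // 2] = theta_star + 0.02 * random_init[: n_neurons // 2]

frac_init = aligned_fraction(random_init, theta_star)
frac_trained = aligned_fraction(trained, theta_star)
print(frac_init, frac_trained)
```

In an actual experiment, `trained` would be the first-layer weights after training at a given feature-learning temperature, and the fraction would be tracked as the input dimension grows.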

Original abstract

Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with $r^*(x) = \sigma^*(\langle \theta^*, x\rangle)$ and $x \sim N(0, I_d)$. We analyze a two-stage neural reward model that first learns the hidden direction $\theta^*$ from reward-weighted samples and then fits the readout layer by weighted ridge regression. Exponential reward weighting changes the Hermite signal available to the first layer; for any feature-learning temperature $\beta_1$ above a dimension-free $O(1)$ threshold, a constant fraction of neurons recover the hidden direction, with weak-recovery complexity governed by the generative exponent. After feature recovery, we derive tilted-policy value-gap bounds for an idealized label-weighted fit with weights $e^{y/\beta_2}$ and a more practical surrogate-weighted fit with weights $e^{r_{a_0}(x)/\beta_2}$. Keeping the $\beta_2$-dependence explicit yields an admissible set of deployment temperatures, balancing the gain from lowering $\beta_2$ against the learning cost amplified by exponential weighting; in the surrogate-weighted case, proxy-dependent factors shrink this admissible set.
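
The remark that reweighting changes the Hermite signal can be illustrated numerically. The sketch below computes Hermite coefficients E[f(z) He_k(z)] by Gauss-Hermite quadrature for a hypothetical cubic link, first for the raw label and then for a transformed label. Two assumptions are made here: the link He_3 is chosen for convenience, and the polynomial transform y ↦ y^2 stands in for the exponential weight, whose Gaussian expectation is not finite for this cubic link without truncation. As we understand the usual definitions, the information exponent is the lowest nonzero degree for the raw label, and the generative exponent is the lowest degree achievable over label transforms.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coefficients(f, max_degree=4, n_nodes=40):
    """Return E[f(z) He_k(z)] for k = 0..max_degree with z ~ N(0, 1)."""
    nodes, weights = hermegauss(n_nodes)       # quadrature for weight exp(-z^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)   # normalize to the N(0, 1) density
    values = f(nodes)
    coeffs = []
    for k in range(max_degree + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                         # coefficient vector selecting He_k
        coeffs.append(np.sum(weights * values * hermeval(nodes, basis)))
    return np.array(coeffs)

def lowest_nonzero_degree(coeffs, tol=1e-8):
    """Smallest k >= 1 with |coeffs[k]| > tol, or None if there is none."""
    for k in range(1, len(coeffs)):
        if abs(coeffs[k]) > tol:
            return k
    return None

def link(z):
    """Hypothetical polynomial link: the third probabilists' Hermite polynomial."""
    return z ** 3 - 3.0 * z

raw = hermite_coefficients(link)
squared = hermite_coefficients(lambda z: link(z) ** 2)
print("raw label:        lowest nonzero degree =", lowest_nonzero_degree(raw))
print("squared label:    lowest nonzero degree =", lowest_nonzero_degree(squared))
```

For this link the raw label has no signal below degree 3, while the squared label already has a degree-2 component, which is the kind of shift a nonlinear reweighting of labels can produce.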

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper derives explicit O(1) thresholds on feature-learning temperature β1 for neuron recovery in a single-index reward model and gives β2-dependent value-gap bounds for both label-weighted and surrogate-weighted fits.

The core contribution is a single-index analysis of how exponential weighting at temperature β1 affects the Hermite signal seen by the first layer of a two-stage neural reward model, followed by explicit tilted-policy value-gap bounds after recovery. For β1 above a dimension-free constant, a constant fraction of neurons recover the hidden direction θ*, with sample complexity tied to the generative exponent. The paper then tracks how β2 enters the surrogate-weighted ridge fit and produces an admissible interval for deployment temperatures that trades off policy improvement against amplified estimation error.

This is new relative to prior single-index work on supervised learning or standard reward modeling: the feedback loop through the KL-regularized policy and the surrogate weighting e^{r_{a_0}(x)/β2} are handled directly, and all β-dependence is kept visible rather than hidden in big-O notation. The derivations rely on standard Hermite expansions of the weighted population gradient plus ridge concentration, which are appropriate for the exact Gaussian single-index generative model assumed.

The main limitation is the model itself. Everything is derived under r*(x) = σ*(⟨θ*, x⟩) with x ~ N(0,I_d) and a prescribed two-stage architecture; how much carries over when the reward is not exactly single-index or when the policy optimization loop is closed is left open. No circularity appears in the stated claims, and the stress-test confirms the steps are the usual ones for this setting.

The work is aimed at theorists working on reward modeling and feature learning in RLHF-style pipelines. It is worth sending to referees because the temperature trade-offs are made concrete and the analysis is self-contained within its stated assumptions.

Referee Report

0 major / 2 minor

Summary. The paper analyzes a two-stage neural reward model in a Gaussian single-index generative model r^*(x) = σ^*(⟨θ^*, x⟩) with x ∼ N(0, I_d). It shows that for any feature-learning temperature β1 above a dimension-free O(1) threshold, a constant fraction of neurons recover the hidden direction θ^*, with weak-recovery complexity governed by the generative exponent via Hermite expansions of the weighted population gradient. After recovery, it derives tilted-policy value-gap bounds for an idealized label-weighted fit (weights e^{y/β2}) and a practical surrogate-weighted fit (weights e^{r_{a_0}(x)/β2}), identifying an admissible set of deployment temperatures β2 that balances the gain from lowering β2 against the amplified learning cost.

Significance. If the results hold, the work supplies a precise single-index analysis of how exponential weighting in reward modeling affects both feature recovery and downstream KL-regularized policy value, with all β-dependence kept explicit. Notable strengths are the dimension-free threshold on β1, the explicit admissible temperature sets, and the distinction between label-weighted and surrogate-weighted regimes. The derivations rely on standard Hermite-signal calculations and ridge-regression concentration, applied to the policy-optimization feedback loop.

minor comments (2)

Abstract: the phrase 'weak-recovery complexity governed by the generative exponent' should be expanded with a brief parenthetical reference to the specific Hermite coefficient or link function exponent that controls the rate.
The manuscript should include a short table or paragraph contrasting the admissible β2 ranges for the label-weighted versus surrogate-weighted cases, to make the shrinkage effect of proxy-dependent factors immediately visible.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The provided summary correctly reflects the paper's contributions on feature recovery and tilted-policy value gaps in the single-index setting. No specific major comments appear in the report, so we have no point-by-point items to address.

Circularity Check

0 steps flagged

No significant circularity

The paper conducts a theoretical analysis of feature recovery and value-gap bounds inside an exactly-specified Gaussian single-index generative model r^*(x) = σ^*(⟨θ^*, x⟩) with x ∼ N(0, I_d). The derivations rely on standard tools (Hermite expansions of the weighted population gradient and ridge-regression concentration) whose assumptions are stated explicitly and do not include the target claims. All β-dependence is kept explicit; no step reduces a claimed prediction to a fitted quantity by construction, nor does any load-bearing premise rest on a self-citation chain. The central results are therefore independent of the paper's own fitted outputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The analysis rests on the Gaussian single-index generative assumption and the two-stage neural architecture; temperatures β1 and β2 function as tunable parameters rather than fitted constants.

free parameters (2)

β1: feature-learning temperature that must exceed an O(1) threshold for recovery; chosen by the analyst.
β2: deployment temperature controlling policy sharpness and weighting strength; appears in the admissible-set derivation.

axioms (1)

Domain assumption: data are generated from the Gaussian single-index model r^*(x) = σ^*(⟨θ^*, x⟩) with x ∼ N(0, I_d). This is the core modeling choice that enables the Hermite-signal analysis and the recovery guarantees.

pith-pipeline@v0.9.1-grok · 5796 in / 1378 out tokens · 40137 ms · 2026-06-30T11:58:41.518752+00:00 · methodology

[1] [1]

SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

Emmanuel Abbe, Enric Boix Adserà, and Theodor Misiakiewicz. SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. In Gergely Neu and Lorenzo Rosasco, editors, The Thirty Sixth Annual Conference on Learning Theory, COLT 2023, 12-15 July 2023, Bangalore, India, Proceedings of Machine Learning Research, pages 2552–2623. PMLR,

2023

[2] [2]

Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Luca Pesce, and Ludovic Stephan

URL https: //proceedings.mlr.press/v195/abbe23a.html. Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Luca Pesce, and Ludovic Stephan. Repetita iuvant: Data repetition allows SGD to learn high-dimensional multi-index functions.CoRR, abs/2405.15459,

work page arXiv

[3] [4]

Jimmy Ba, Murat A

URLhttps://jmlr.org/papers/v22/20-1288.html. Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High- dimensional asymptotics of feature learning: How one gradient step improves the representation. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing...

2022

[4] [5]

Jimmy Ba, Murat A

URL http://papers.nips.cc/paper_files/paper/2022/hash/ f7e7fabd73b3df96c54a320862afcb78-Abstract-Conference.html. Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the presence of low-dimensional structure: A spiked random matrix perspective. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Serge...

2022

[5] [6]

Yu Bai and Jason D

URL http://papers.nips.cc/paper_files/paper/2023/hash/ 38a1671ab0747b6ffe4d1c6ef117a3a9-Abstract-Conference.html. Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30,

2023

[6] [7]

net/forum?id=rkllGyBFPH

URL https://openreview. net/forum?id=rkllGyBFPH. Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Proce...

2022

[7] [8]

10 Paul F

URL http://papers.nips.cc/paper_files/paper/2022/ hash/3fb6c52aeb11e09053c16eabee74dd7b-Abstract-Conference.html. 10 Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwa...

2022

[8] [9]

Alex Damian, Eshaan Nichani, Rong Ge, and Jason D

URL https://proceedings.neurips.cc/paper/2017/hash/ d5e2c0adad503c91f91df240d0cd4e49-Abstract.html. Alex Damian, Eshaan Nichani, Rong Ge, and Jason D. Lee. Smoothing the landscape boosts the signal for SGD: optimal sample complexity for learning single index models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine...

2017

[9] [10]

Rishabh Dudeja and Daniel Hsu

URL http://papers.nips.cc/paper_files/paper/2023/hash/ 02763667a5761ff92bb15d8751bcd223-Abstract-Conference.html. Rishabh Dudeja and Daniel Hsu. Learning single-index models in gaussian space. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors,Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, Proceedings of Mach...

2023

[10] [11]

URLhttp://proceedings.mlr.press/v75/dudeja18a.html. Dylan J. Foster, Zakaria Mhammedi, and Dhruv Rohatgi. Is a good foundation necessary for efficient reinforcement learning? the computational role of the base model in exploration. In Nika Haghtalab and Ankur Moitra, editors,The Thirty Eighth Annual Conference on Learning Theory, 30-4 July 2025, Lyon, Fra...

2025

[11] [12]

Leo Gao, John Schulman, and Jacob Hilton

URLhttps://proceedings.mlr.press/v291/foster25a.html. Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, P...

2023

[12] [13]

Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymet- man

URLhttps://proceedings.mlr.press/v202/gao23h.html. Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymet- man. Aligning language models with preferences through f-divergence minimization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Confere...

2023

[13] [14]

doi: 10.7551/mitpress/7921.003.0013

ISBN 9780262255103. doi: 10.7551/mitpress/7921.003.0013. URL https://doi.org/10.7551/mitpress/7921.003.0013. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30,

work page doi:10.7551/mitpress/7921.003.0013 2020

[14] [15]

Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D

URL https://openreview.net/ forum?id=rygGQyrFvH. Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J. Foster. Correcting the mythos of kl-regularization: Direct alignment without overopti- mization via chi-squared preference optimization. InThe Thirteenth International Conference on Learning Representations, I...

2025

[15] [16]

Kaixuan Ji, Jiafan He, and Quanquan Gu

URL https://openreview.net/forum?id=hXm0Wu2U9K. Kaixuan Ji, Jiafan He, and Quanquan Gu. Reinforcement learning from human feedback with active queries.Trans. Mach. Learn. Res., 2025,

2025

[16] [17]

11 Yuu Jinnai, Ukyo Honda, Tetsuro Morimura, and Peinan Zhang

URL https://openreview.net/forum?id= EScatQaRxz. 11 Yuu Jinnai, Ukyo Honda, Tetsuro Morimura, and Peinan Zhang. Generating diverse and high-quality texts by minimum bayes risk decoding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, Augus...

2024

[17] [18]

doi: 10.18653/V1/2024.FINDINGS-ACL.503. URL https://doi.org/10.18653/v1/2024.findings-acl.503. Tomasz Korbak, Hady Elsahar, Germán Kruszewski, and Marc Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. ...

[18] [19]

URL http://papers.nips.cc/paper_files/paper/2022/hash/67496dfa96afddab795530cc7c69b57a-Abstract-Conference.html. Jason D. Lee, Kazusato Oko, Taiji Suzuki, and Denny Wu. Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit. In Amir Globerson, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. ...

2022

[19] [20]

URL http://papers.nips.cc/paper_files/paper/2024/hash/6bd5fca2074dcd9ede9de50f71f7ec28-Abstract-Conference.html. Arvind V. Mahankali, Haochen Zhang, Kefan Dong, Margalit Glasgow, and Tengyu Ma. Beyond NTK with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time. In Alice Oh, Tristan Naumann, Amir G...

2024

[20] [21]

URL http://papers.nips.cc/paper_files/paper/2023/hash/b3748cdac932d91f0a51a37db90dec50-Abstract-Conference.html. Naoki Nishikawa, Yujin Song, Kazusato Oko, Denny Wu, and Taiji Suzuki. Nonlinear transformers can perform inference-time feature learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wa...

2023

[21] [22]

URL https://proceedings.mlr.press/v267/nishikawa25a.html. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Tr...

2022

[22] [24]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

URL http://arxiv.org/abs/1910.00177. Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Zoubin Ghahramani, editor, Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, ACM International Conference Proceeding Serie...

[23] [25]

doi: 10.1145/1273496.1273590. URL https://doi.org/10.1145/1273496.1273590. Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In Maria Fox and David Poole, editors, Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010, pages 1607–1612. AAAI Press,

[24] [26]

doi: 10.1609/AAAI.V24I1.7727. URL https://doi.org/10.1609/aaai.v24i1.7727. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information ...

[25] [27]

URL https://proceedings.neurips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html. Antoine Scheid, Etienne Boursier, Alain Durmus, Michael I. Jordan, Pierre Ménard, Eric Moulines, and Michal Valko. Optimal design for reward modeling in RLHF. CoRR, abs/2410.17055,

[26] [29]

URL https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html. Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. J. Mach. Learn. Res., 8:985–1005,

2020

[27] [30]

doi: 10.5555/1314498.1390324. URL https://dl.acm.org/doi/10.5555/1314498.1390324. Mirac Suzgun, Luke Melas-Kyriazi, and Dan Jurafsky. Follow the wisdom of the crowd: Effective text generation via minimum bayes risk decoding. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 20...

[28] [31]

doi: 10.18653/V1/2023.FINDINGS-ACL.262. URL https://doi.org/10.18653/v1/2023.findings-acl.262. Konstantinos Christopher Tsiolis, Alireza Mousavi Hosseini, and Murat A. Erdogdu. From information to generative exponent: Learning rate induces phase transitions in SGD. CoRR, abs/2510.21020,

[29] [32]

Zaletel, and Joel E

doi: 10.48550/ARXIV.2510.21020. URL https://doi.org/10.48550/arXiv.2510.21020. Zhichao Wang, Denny Wu, and Zhou Fan. Nonlinear spiked covariance matrices and signal propagation in deep neural networks. In Shipra Agrawal and Aaron Roth, editors, The Thirty Seventh Annual Conference on Learning Theory, June 30 - July 3, 2023, Edmonton, Canada, Proceeding...

[30] [33]

URL https://proceedings.mlr.press/v247/wang24b.html. Banghua Zhu, Michael I. Jordan, and Jiantao Jiao. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, IC...

2023

[31] [35]

Fine-Tuning Language Models from Human Preferences

URL http://arxiv.org/abs/1909.08593.

A Reward-weighted feature recovery

This appendix gives the detailed feature-recovery statements summarized by Theorem 4.3. The analysis leverages the online spherical SGD framework of Tsiolis et al

[32] [36]

to treat the exponentially weighted reward oracle induced by our objective. We use the information exponent, generative exponent, and Hermite-coefficient notation from Definition 4.1. For the target activation, write $p_{\mathrm{gen}} := \mathrm{GE}(\sigma^*)$. Since $\sigma^*$ is a nonconstant polynomial, the polynomial-link characterization of Lee et al. [2024, Proposition 6 and Lemma...

2024

[33] [37]

Integrating over $t$ and then taking the infimum over $c \in \mathbb{R}$ proves the first claim. It remains to prove (4). Use
$$\bigl|\mathbb{E}_{x\sim\pi_t}[Y_t(x)\,h_c(x)]\bigr| \le \sup_{x\in B_R}|Y_t(x)|\;\mathbb{E}_{x\sim\nu_R}\bigl[|h_c(x)|\,\rho_t(x)\bigr] \le 2M_R\,\|h_c\|_{L^2(\nu_R)}\,\|\rho_t\|_{L^2(\nu_R)},$$
where $\sup_{x\in B_R}|Y_t(x)| \le 2M_R$, because $\pi_t$ is supported on $B_R$. We now apply the simplified $L^2$ form of the preceding lemma to the interpolation paths defined in Subsect...

2024

[34] [38]

Let
$$\hat{a}_W \in \arg\min_{a\in\mathbb{R}^N}\left\{\frac{1}{T_2}\sum_{i=1}^{T_2} W_R(x_i,\zeta_i)\,\bigl(y_i - r_a(x_i)\bigr)^2 + \lambda\|a\|_2^2\right\}.$$
For any comparator $a^\sharp \in \mathbb{R}^N$, set $\mathcal{E}_{W,T_2}(a^\sharp) := \|r_{a^\sharp} - r_{\mathrm{target}}\|^2_{W,T_2}$. There exists a universal constant $C > 0$ such that, with probability at least $1 - 4\delta_0$ over the ridge-fitting sample, if $\lambda \ge C\sqrt{\frac{M_{4,W,R}}{T_2\delta_0}}$, then
$$\|r_{\hat{a}_W} - r_{\mathrm{target}}\|^2_{L^2(\nu_W)} \lesssim \frac{1}{Z_W}\,\mathcal{E}_{W,T_2}(a^\sharp) + M_{2,W,R} + M_{4,W,R} + G_{4,W,R} ...$$
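The weighted ridge objective defining $\hat{a}_W$ has a closed-form minimizer whenever the reward model is linear in the readout vector $a$. The NumPy sketch below illustrates this under stated assumptions: the feature matrix `phi` stands in for the frozen first-layer features so that $r_a(x_i)$ is the inner product of row $i$ with `a`, the array `w` stands in for the weights $W_R(x_i,\zeta_i)$ and is set here to the label-weighted choice $e^{y_i/\beta_2}$, and the synthetic data, dimensions, and the name `weighted_ridge` are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def weighted_ridge(phi, y, w, lam):
    """Minimise (1/T) * sum_i w_i * (y_i - phi_i @ a)**2 + lam * ||a||_2**2 over a."""
    T, N = phi.shape
    weighted_phi = phi * w[:, None]            # row i scaled by w_i
    gram = weighted_phi.T @ phi / T            # (1/T) * Phi^T W Phi
    rhs = weighted_phi.T @ y / T               # (1/T) * Phi^T W y
    # Setting the gradient to zero gives (gram + lam * I) a = rhs.
    return np.linalg.solve(gram + lam * np.eye(N), rhs)

# Hypothetical synthetic data: T samples, N readout features.
rng = np.random.default_rng(0)
T, N = 500, 8
beta2, lam = 1.0, 1e-2                         # deployment temperature, ridge penalty
phi = rng.standard_normal((T, N))
a_true = rng.standard_normal(N) / np.sqrt(N)
y = phi @ a_true + 0.1 * rng.standard_normal(T)

w = np.exp(y / beta2)                          # assumed label-weighted choice e^{y / beta2}
a_hat = weighted_ridge(phi, y, w, lam)
```

Lowering `beta2` concentrates the weights on high-reward samples, which is the regime where the condition on $\lambda$ above becomes more demanding through the weighted moment $M_{4,W,R}$.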

2024