How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis
Pith reviewed 2026-06-30 11:58 UTC · model grok-4.3
The pith
Above a constant temperature threshold, neural reward models recover the hidden direction in a single-index model and bound the value gap of the exponentiated policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the Gaussian single-index model r^*(x) = σ^*(⟨θ^*, x⟩) with x ∼ N(0, I_d), the two-stage neural reward model recovers the hidden direction θ^* in a constant fraction of neurons for any feature-learning temperature β1 above a dimension-free O(1) threshold, with weak-recovery complexity governed by the generative exponent. After recovery, weighted ridge regression on the readout layer produces tilted-policy value-gap bounds for both an idealized label-weighted fit with weights e^{y/β2} and a practical surrogate-weighted fit with weights e^{r_{a0}(x)/β2}. Keeping β2 explicit identifies an admissible set of deployment temperatures that trades the gain from lowering β2 against the learning cost amplified by exponential weighting.
What carries the argument
Two-stage neural reward model that learns the hidden direction from reward-weighted samples in the first layer and then fits the readout by weighted ridge regression at a second temperature.
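The second stage described above can be sketched as a closed-form weighted ridge fit. This is a minimal illustration, not the paper's procedure: the first layer is simply frozen at the true direction (standing in for a successful stage one), and the `tanh` reward, the ReLU features, and the values of `beta2` and `lam` are all hypothetical choices made for the example.

```python
import numpy as np

def weighted_ridge(features, y, weights, lam):
    """Minimize (1/T) * sum_i w_i * (y_i - features_i @ a)^2 + lam * ||a||^2 over a."""
    T, N = features.shape
    weighted = features * weights[:, None]          # rows scaled by w_i
    gram = features.T @ weighted / T + lam * np.eye(N)
    rhs = weighted.T @ y / T
    return np.linalg.solve(gram, rhs)               # solve the normal equations

rng = np.random.default_rng(0)
T, d, N = 4000, 8, 64
beta2, lam = 1.0, 1e-3                              # hypothetical values
theta = np.zeros(d)
theta[0] = 1.0                                      # stand-in for the recovered direction
x = rng.standard_normal((T, d))
y = np.tanh(x @ theta)                              # hypothetical bounded, noiseless reward

# Frozen first layer: ReLU neurons aligned with +/- theta and random biases.
signs = rng.choice([-1.0, 1.0], size=N)
biases = rng.uniform(-2.0, 2.0, size=N)
phi = np.maximum(np.outer(x @ theta, signs) + biases, 0.0)

weights = np.exp(y / beta2)                         # label-weighted variant e^{y / beta2}
a_hat = weighted_ridge(phi, y, weights, lam)
weighted_mse = np.average((phi @ a_hat - y) ** 2, weights=weights)
```

Replacing `weights` with `np.exp((phi @ a0) / beta2)` for some initial readout `a0` (a hypothetical name) would give a surrogate-weighted variant in the same spirit.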
If this is right
- For β1 above the threshold a constant fraction of neurons recover the hidden direction.
- Weak-recovery complexity is governed by the generative exponent of the link function.
- Value-gap bounds hold after recovery for both label-weighted and surrogate-weighted readout fits.
- An admissible interval for the deployment temperature β2 balances gain from lower temperature against amplified learning cost.
- In the surrogate-weighted case, proxy-dependent factors shrink the admissible interval for β2.
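One informal way to see why lowering β2 amplifies the learning cost is to track how many samples effectively contribute to a fit with weights e^{y/β2}. The sketch below uses the Kish effective sample size as a proxy; it is not the quantity bounded in the paper, and the `tanh` reward and the grid of temperatures are assumptions made for illustration.

```python
import numpy as np

def effective_sample_size(y, beta):
    """Kish effective sample size of the weights w_i = exp(y_i / beta)."""
    w = np.exp((y - np.max(y)) / beta)  # shift by the max for numerical stability
    return w.sum() ** 2 / np.sum(w ** 2)

rng = np.random.default_rng(0)
z = rng.standard_normal(20_000)
y = np.tanh(z)  # hypothetical bounded reward along the hidden direction

betas = [4.0, 1.0, 0.25, 0.0625]
ess = [effective_sample_size(y, b) for b in betas]
for b, e in zip(betas, ess):
    print(f"beta2={b:<7} ESS fraction={e / y.size:.3f}")
```

The effective sample size shrinks as β2 decreases, because the weights concentrate on the highest-reward samples. That is the cost side of the trade-off described in the list above.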
Where Pith is reading between the lines
- The separation of feature-learning and deployment temperatures suggests staged tuning may improve final policy performance without changing the model architecture.
- When true labels are unavailable the surrogate-weighted bounds indicate how much the admissible temperature range shrinks relative to the ideal case.
- The single-index recovery mechanism could be tested by measuring neuron alignment statistics on trained reward models from standard RL benchmarks.
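The alignment statistic mentioned in the last item could be measured along the following lines. This is a sketch under stated assumptions: the function name, the 0.9 cosine threshold, and the synthetic weight matrices are hypothetical, and a real test would plug in first-layer weights from a trained model and a known or estimated direction.

```python
import numpy as np

def alignment_fraction(W, theta_star, threshold=0.9):
    """Return per-neuron |cos(w_j, theta_star)| and the fraction at or above threshold."""
    theta = theta_star / np.linalg.norm(theta_star)
    norms = np.linalg.norm(W, axis=1)
    cos = np.abs(W @ theta) / np.maximum(norms, 1e-12)  # guard against zero rows
    return cos, float(np.mean(cos >= threshold))

rng = np.random.default_rng(0)
d, m = 256, 100
theta_star = np.zeros(d)
theta_star[0] = 1.0

# Synthetic first layers: one random, one with half the neurons pushed toward theta_star.
W_random = rng.standard_normal((m, d))
W_aligned = W_random.copy()
W_aligned[: m // 2] = 0.05 * W_random[: m // 2] + 5.0 * theta_star

_, frac_random = alignment_fraction(W_random, theta_star)
_, frac_aligned = alignment_fraction(W_aligned, theta_star)
```

The absolute value treats neurons aligned with −θ^* as recovered too, which is a modeling choice rather than something the source specifies.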
Load-bearing premise
The true reward is generated exactly by an unknown function of the inner product with one fixed hidden vector and inputs are drawn from a standard Gaussian.
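The premise can be made concrete with a few lines of sampling code. This is only an illustrative sketch: the cubic Hermite link `he3`, the dimension, and the noiseless labels are assumptions chosen for the example, not details taken from the paper.

```python
import numpy as np

def sample_single_index(n, d, theta_star, link, rng):
    """Draw x ~ N(0, I_d) and noiseless rewards y = link(<theta_star, x>)."""
    x = rng.standard_normal((n, d))
    y = link(x @ theta_star)
    return x, y

def he3(z):
    """Probabilists' Hermite polynomial He_3, used here as a hypothetical polynomial link."""
    return z ** 3 - 3.0 * z

rng = np.random.default_rng(0)
d = 32
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)  # unit-norm hidden direction
x, y = sample_single_index(10_000, d, theta_star, he3, rng)

# The reward depends on x only through the one-dimensional projection z.
z = x @ theta_star
```

Any perturbation of `x` orthogonal to `theta_star` leaves `y` unchanged, which is exactly the structure that makes the hidden direction the only feature worth learning.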
What would settle it
Training the two-stage network on single-index data and finding that fewer than a constant fraction of first-layer neurons align with the true direction when the feature-learning temperature exceeds the O(1) threshold would falsify the recovery claim.
Original abstract
Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with $r^*(x) = \sigma^*(\langle \theta^*, x\rangle)$ and $x \sim N(0, I_d)$. We analyze a two-stage neural reward model that first learns the hidden direction $\theta^*$ from reward-weighted samples and then fits the readout layer by weighted ridge regression. Exponential reward weighting changes the Hermite signal available to the first layer; for any feature-learning temperature $\beta_1$ above a dimension-free $O(1)$ threshold, a constant fraction of neurons recover the hidden direction, with weak-recovery complexity governed by the generative exponent. After feature recovery, we derive tilted-policy value-gap bounds for an idealized label-weighted fit with weights $e^{y/\beta_2}$ and a more practical surrogate-weighted fit with weights $e^{r_{a_0}(x)/\beta_2}$. Keeping the $\beta_2$-dependence explicit yields an admissible set of deployment temperatures, balancing the gain from lowering $\beta_2$ against the learning cost amplified by exponential weighting; in the surrogate-weighted case, proxy-dependent factors shrink this admissible set.
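For reference, the exponentiation the abstract refers to is the standard closed form of a KL-regularized objective. The block below is a sketch under assumptions: it takes the base distribution to be the Gaussian input law $\nu = N(0, I_d)$ and writes $\hat r$ for the learned reward and $\beta$ for the temperature. These symbol choices and the choice of base measure are not taken from the paper.

```latex
\pi_{\beta}
  = \arg\max_{\pi}\; \Big\{ \mathbb{E}_{x\sim\pi}\big[\hat r(x)\big]
      - \beta\,\mathrm{KL}(\pi \,\|\, \nu) \Big\},
\qquad
\pi_{\beta}(x)
  = \frac{\nu(x)\, e^{\hat r(x)/\beta}}
         {\mathbb{E}_{x'\sim\nu}\big[e^{\hat r(x')/\beta}\big]} .
```

Because $\hat r$ enters through $e^{\hat r/\beta}$, errors of $\hat r$ in high-reward regions are amplified as $\beta$ decreases, which is why the abstract stresses errors in reward-tilted regions rather than average prediction error.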
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes a two-stage neural reward model in a Gaussian single-index generative model r^*(x) = σ^*(⟨θ^*, x⟩) with x ∼ N(0, I_d). It shows that for any feature-learning temperature β1 above a dimension-free O(1) threshold, a constant fraction of neurons recover the hidden direction θ^*, with weak-recovery complexity governed by the generative exponent via Hermite expansions of the weighted population gradient. After recovery, it derives tilted-policy value-gap bounds for an idealized label-weighted fit (weights e^{y/β2}) and a practical surrogate-weighted fit (weights e^{r_{a0}(x)/β2}), identifying an admissible set of deployment temperatures β2 that balances the gain from lowering β2 against the amplified learning cost.
Significance. If the results hold, the work supplies a precise single-index analysis of how exponential weighting in reward modeling affects both feature recovery and downstream KL-regularized policy value, with all β-dependence kept explicit. Notable strengths are the dimension-free threshold on β1, the explicit admissible temperature sets, and the distinction between label-weighted and surrogate-weighted regimes. The derivations rely on standard Hermite-signal calculations and ridge-regression concentration, applied to the policy-optimization feedback loop.
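The Hermite-signal calculations mentioned in the report can be illustrated numerically. The sketch below computes Hermite coefficients of transformed labels by Gauss-Hermite quadrature and reports the first nonzero index. It is not the paper's calculation: the link He_3 is a hypothetical choice, and monomials y^p (the Taylor terms of e^{y/β}) stand in for the exponential weight, since an untruncated exponential of a cubic is not integrable against the Gaussian.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeffs(f, k_max, n_nodes=40):
    """Coefficients c_k = E[f(z) He_k(z)] / k! for z ~ N(0, 1), by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to the N(0, 1) density
    fz = f(nodes)
    coeffs = []
    for k in range(k_max + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                        # selects He_k in the HermiteE series
        he_k = hermeval(nodes, basis)
        coeffs.append(np.sum(weights * fz * he_k) / factorial(k))
    return np.array(coeffs)

def first_nonzero_index(coeffs, tol=1e-8):
    """Smallest k >= 1 with |c_k| > tol, or None if there is none."""
    for k in range(1, len(coeffs)):
        if abs(coeffs[k]) > tol:
            return k
    return None

def link(z):
    """Hypothetical link: the probabilists' Hermite polynomial He_3."""
    return z ** 3 - 3.0 * z

# First nonzero Hermite index of y**p for p = 1, 2, 3, where y = link(z).
orders = {
    p: first_nonzero_index(hermite_coeffs(lambda z, p=p: link(z) ** p, k_max=4))
    for p in (1, 2, 3)
}
print(orders)
```

For this link the raw label first correlates with He_3, its square with He_2, and its cube with He_1, so a nonlinear transformation of the label can expose lower-order signal. This is the kind of effect the generative exponent is meant to capture.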
minor comments (2)
- Abstract: the phrase 'weak-recovery complexity governed by the generative exponent' should be expanded with a brief parenthetical reference to the specific Hermite coefficient or link function exponent that controls the rate.
- The manuscript should include a short table or paragraph contrasting the admissible β2 ranges for the label-weighted versus surrogate-weighted cases, to make the shrinkage effect of proxy-dependent factors immediately visible.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The provided summary correctly reflects the paper's contributions on feature recovery and tilted-policy value gaps in the single-index setting. No specific major comments appear in the report, so we have no point-by-point items to address.
Circularity Check
No significant circularity
The paper conducts a theoretical analysis of feature recovery and value-gap bounds inside an exactly-specified Gaussian single-index generative model r^*(x) = σ^*(⟨θ^*, x⟩) with x ∼ N(0, I_d). The derivations rely on standard tools (Hermite expansions of the weighted population gradient and ridge-regression concentration) whose assumptions are stated explicitly and do not include the target claims. All β-dependence is kept explicit; no step reduces a claimed prediction to a fitted quantity by construction, nor does any load-bearing premise rest on a self-citation chain. The central results are therefore independent of the paper's own fitted outputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- β1 (feature-learning temperature)
- β2 (deployment temperature)
axioms (1)
- domain assumption: data are generated from the Gaussian single-index model r^*(x) = σ^*(⟨θ^*, x⟩) with x ∼ N(0, I_d)
A Reward-weighted feature recovery
This appendix gives the detailed feature-recovery statements summarized by Theorem 4.3. The analysis leverages the online spherical SGD framework of Tsiolis et al. to treat the exponentially weighted reward oracle induced by our objective. We use the information exponent, generative exponent, and Hermite-coefficient notation from Definition 4.1. For the target activation, write $p_{\mathrm{gen}} := \mathrm{GE}(\sigma^*)$. Since $\sigma^*$ is a nonconstant polynomial, the polynomial-link characterization of Lee et al. [2024, Proposition 6 and Lemma...
Integrating over $t$ and then taking the infimum over $c \in \mathbb{R}$ proves the first claim. It remains to prove (4). Use $\big|\mathbb{E}_{x\sim\pi_t}[Y_t(x) h_c(x)]\big| \le \sup_{x\in B_R}|Y_t(x)|\; \mathbb{E}_{x\sim\nu_R}\big[|h_c(x)|\,\rho_t(x)\big] \le 2M_R\,\|h_c\|_{L^2(\nu_R)}\,\|\rho_t\|_{L^2(\nu_R)}$, where $\sup_{x\in B_R}|Y_t(x)| \le 2M_R$, because $\pi_t$ is supported on $B_R$. We now apply the simplified $L^2$ form of the preceding lemma to the interpolation paths defined in Subsect...
Let $\hat a_W \in \arg\min_{a\in\mathbb{R}^N} \Big\{ \frac{1}{T_2}\sum_{i=1}^{T_2} W_R(x_i,\zeta_i)\,\big(y_i - r_a(x_i)\big)^2 + \lambda\|a\|_2^2 \Big\}$. For any comparator $a^\sharp \in \mathbb{R}^N$, set $E_{W,T_2}(a^\sharp) := \|r_{a^\sharp} - r_{\mathrm{target}}\|_{W,T_2}^2$. There exists a universal constant $C > 0$ such that, with probability at least $1 - 4\delta_0$ over the ridge-fitting sample, if $\lambda \ge C\sqrt{M_{4,W,R}/(T_2\delta_0)}$, then $\|r_{\hat a_W} - r_{\mathrm{target}}\|_{L^2(\nu_W)}^2 \lesssim \frac{1}{Z_W}\,E_{W,T_2}(a^\sharp) + M_{2,W,R} + M_{4,W,R} + G_{4,W,R}$ ...
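If the readout is linear in its coefficients, the weighted ridge objective in the excerpt above has a closed-form minimizer. The block below is a sketch under that assumption: it writes $r_a(x) = \langle a, \phi(x) \rangle$ for a feature map $\phi : \mathbb{R}^d \to \mathbb{R}^N$ and abbreviates $w_i := W_R(x_i, \zeta_i)$. Both the symbol $\phi$ and the abbreviation $w_i$ are introduced here for illustration and do not appear in the source.

```latex
% Setting the gradient of the objective to zero:
%   -\tfrac{2}{T_2}\sum_i w_i\,\phi(x_i)\big(y_i - \phi(x_i)^\top a\big) + 2\lambda a = 0
\hat a_W
  = \Big( \tfrac{1}{T_2} \sum_{i=1}^{T_2} w_i\, \phi(x_i)\phi(x_i)^{\top} + \lambda I_N \Big)^{-1}
    \tfrac{1}{T_2} \sum_{i=1}^{T_2} w_i\, y_i\, \phi(x_i),
\qquad w_i := W_R(x_i, \zeta_i).
```

With nonnegative weights and $\lambda > 0$ the matrix being inverted is positive definite, so the minimizer is unique.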