pith. sign in

arxiv: 2605.30154 · v1 · pith:KH3JMQ5Lnew · submitted 2026-05-28 · 💻 cs.LG

RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

Pith reviewed 2026-06-29 08:32 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learningmaximum likelihoodsurrogate objectivesfinite rolloutsgradient estimationlanguage model trainingbinary feedbackupdate scale
0
0 comments X

The pith

A family of finite-rollout surrogate objectives continuously connects reinforcement learning to maximum likelihood training while preserving unbiased gradient estimators under fixed rollout budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs RL2ML as a family of surrogate objectives for training language models on binary feedback from sampled outputs. This family interpolates between standard reinforcement learning objectives, maximum-likelihood-like training, and objectives beyond maximum likelihood. It supplies closed-form unbiased gradient estimators that remain aligned with the objective even when only finite rollout groups are available. The work introduces the group-level update scale to track reweighting of rollout groups after their empirical success counts are observed, exposing a subcritical-to-supercritical transition invisible in population-level notation. Choice among members of the family is shown to depend jointly on the evaluation metric, local sensitivity, and estimator variance rather than proximity to maximum likelihood alone.

Core claim

The RL2ML family supplies finite-rollout surrogate objectives equipped with closed-form exactly unbiased gradient estimators; the family interpolates between reinforcement learning, maximum-likelihood-like training, and beyond-maximum-likelihood objectives while keeping estimator-objective alignment under a fixed rollout budget, and the newly defined group-level update scale reveals a subcritical-supercritical transition in reweighting that population-level objective notation conceals.

What carries the argument

The RL2ML family of surrogate objectives together with the group-level update scale that records how each rollout group is reweighted once its empirical success count is observed.

If this is right

  • The optimal surrogate is selected jointly by the evaluation metric, local sensitivity, and estimator variance rather than by nearness to maximum likelihood.
  • The remaining free parameter in the family reduces to a one-dimensional optimization problem instead of an unconstrained hyperparameter.
  • Calibrated metric-gain analysis combined with exact variance decomposition determines the best surrogate for a given setting.
  • Population-level objective notation alone hides the subcritical-supercritical update-scale transition.
  • Estimator-objective alignment holds for the entire family under any fixed rollout budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The one-dimensional optimization could be performed adaptively during training by tracking running estimates of sensitivity and variance.
  • The update-scale transition may appear in other reinforcement-learning settings that rely on small batches of rollouts rather than infinite-sample limits.
  • The same finite-rollout construction could be applied to objectives with non-binary rewards by replacing success counts with appropriate summary statistics.
  • If the transition point can be located analytically, training runs could switch surrogate mid-training to stay in the regime that minimizes variance for the target metric.

Load-bearing premise

A closed-form exactly unbiased gradient estimator exists for every member of the surrogate objective family when only finite rollout groups are used.

What would settle it

A direct computation or Monte Carlo check that demonstrates bias in the closed-form gradient estimator for any surrogate in the family under a concrete finite rollout budget would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.30154 by Yifu Zheng.

Figure 1
Figure 1. Figure 1: Population-level weights and sample-level update scales for RL2ML. The left panel [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Stylized finite-horizon selection curves. This example illustrates that the best [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Noise curves induced by the exact variance decomposition for [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: A representative triad-family frontier for [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
read the original abstract

Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups are often conflated. This paper develops RL2ML, a family of finite-rollout surrogate objectives with a closed-form, exactly unbiased gradient estimator. The family continuously connects standard reinforcement learning, maximum-likelihood-like training, and beyond-maximum-likelihood objectives while preserving estimator-objective alignment under a fixed rollout budget. We introduce the group-level update scale to characterize how a rollout group is reweighted after its empirical success count is observed, revealing a subcritical-supercritical update-scale transition that is hidden by population-level objective notation alone. Building on this distinction, calibrated metric-gain analysis and exact variance decomposition show that the best choice of surrogate objective is determined neither by proximity to maximum likelihood nor by the population-level weight alone. Instead, it depends jointly on the evaluation metric, local sensitivity, and estimator variance. The remaining degree of freedom in the surrogate objective family can therefore be formulated as a one-dimensional optimization problem rather than treated as an unconstrained hyperparameter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes RL2ML, a family of finite-rollout surrogate objectives for correctness-based RL with verifiable rewards. It claims to provide a closed-form, exactly unbiased gradient estimator for the family, which continuously connects RL, ML-like, and beyond-ML objectives while maintaining alignment under fixed rollout budgets. The paper introduces the group-level update scale to characterize reweighting after observing empirical success counts, identifying a subcritical-supercritical transition. It further uses calibrated metric-gain analysis and exact variance decomposition to argue that the optimal surrogate is determined by the evaluation metric, local sensitivity, and estimator variance, allowing the remaining freedom to be optimized in one dimension.

Significance. If the unbiased estimator and the variance decomposition hold, this work offers a significant contribution by clarifying the distinction between population-level objectives and the stochastic geometry induced by finite groups in RLVR training. The update-scale transition and the reduction to a 1D optimization problem could provide practical guidance for choosing objectives in language model training with binary feedback, moving beyond ad-hoc hyperparameter tuning.

major comments (1)
  1. [Abstract (and presumed Methods derivation of the estimator)] The central claim of a closed-form, exactly unbiased gradient estimator for the entire surrogate objective family under finite rollout groups (stated in the abstract) is load-bearing for the continuous-connection and update-scale results. The derivation must be examined to confirm that no interchange of expectation and sampling occurs that holds only in the infinite-group limit, and that reweighting after empirical success counts introduces no uncanceled bias; otherwise the claimed preservation of estimator-objective alignment and the subcritical-supercritical transition cannot be realized with finite budgets.
minor comments (1)
  1. [Notation and surrogate family definition] Define the surrogate family parameter explicitly and show how it interpolates between standard RL and maximum-likelihood objectives with a concrete equation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful review and for focusing on the technical foundation of the unbiased estimator claim, which is indeed central to the paper's contributions. We address the concern point by point below.

read point-by-point responses
  1. Referee: [Abstract (and presumed Methods derivation of the estimator)] The central claim of a closed-form, exactly unbiased gradient estimator for the entire surrogate objective family under finite rollout groups (stated in the abstract) is load-bearing for the continuous-connection and update-scale results. The derivation must be examined to confirm that no interchange of expectation and sampling occurs that holds only in the infinite-group limit, and that reweighting after empirical success counts introduces no uncanceled bias; otherwise the claimed preservation of estimator-objective alignment and the subcritical-supercritical transition cannot be realized with finite budgets.

    Authors: We appreciate the referee highlighting this point. The derivation (Section 3, Equations 7–12) conditions explicitly on the observed success count within each finite-sized rollout group before taking the gradient. The reweighting is a deterministic function of the fully observed empirical count, and the expectation is computed exactly over the policy-induced distribution of that count; no interchange of limits or sampling is performed. The algebra shows that any potential bias terms from the reweighting cancel exactly for the parameterized family, yielding an unbiased estimator at any finite group size. This finite-group exactness is what permits the update-scale transition analysis. We have revised the manuscript to include an expanded step-by-step derivation in a new Appendix B that isolates the finite-group case and explicitly notes the absence of infinite-limit assumptions. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain self-contained with independent estimator derivation

full rationale

The paper introduces RL2ML as a parameterized family of finite-rollout surrogate objectives and asserts a closed-form exactly unbiased gradient estimator for the entire family. The abstract and provided text derive the group-level update scale and subcritical-supercritical transition directly from the finite-group reweighting after observing empirical success counts, without reducing any claimed prediction or transition to a fitted quantity by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled via prior work, and no renaming of known empirical patterns occurs. The one-dimensional optimization over the remaining degree of freedom is presented as an analysis result rather than a statistical fit. The central claims therefore rest on the (unverified here) algebraic derivation of the unbiased estimator rather than on any circular reduction to inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Abstract-only; the central addition is the surrogate family itself. No explicit free parameters, axioms, or invented entities are stated beyond the existence of the one-dimensional remaining freedom.

free parameters (1)
  • surrogate family parameter
    The remaining degree of freedom in the objective family that is reframed as a one-dimensional optimization problem.

pith-pipeline@v0.9.1-grok · 5727 in / 1225 out tokens · 32064 ms · 2026-06-29T08:32:08.329870+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Kevin P. Murphy. Machine Learning: A Probabilistic Perspective . MIT Press, 2012. 21

  2. [2]

    Maximum likelihood reinforcement learning, 2026

    Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning. arXiv preprint arXiv:2602.02710 , 2026

  3. [3]

    Williams

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist rein- forcement learning. Machine Learning, 8(3–4):229–256, 1992

  4. [4]

    Sutton, David McAllester, Satinder Singh, and Yishay Mansour

    Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems , volume 12, pages 1057–1063, 1999

  5. [5]

    What is the objective of reasoning with reinforcement learning? arXiv preprint arXiv:2510.13651 , 2025

    Damek Davis and Benjamin Recht. What is the objective of reasoning with reinforcement learning? arXiv preprint arXiv:2510.13651 , 2025

  6. [6]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017

  7. [7]

    Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744, 2022

  8. [8]

    Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , pages 12248–12267. Association for Compu...

  9. [9]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 , 2024

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 , 2025

  11. [11]

    W. N. Bailey. Generalized Hypergeometric Series, volume 32 of Cambridge Tracts in Mathe- matics and Mathematical Physics . Cambridge University Press, 1935

  12. [12]

    Hypergeometric Summation: An Algorithmic Approach to Summation and Special Function Identities

    Wolfram Koepf. Hypergeometric Summation: An Algorithmic Approach to Summation and Special Function Identities . Vieweg, 1998

  13. [13]

    verl: Volcano engine reinforcement learning for LLMs

    verl contributors. verl: Volcano engine reinforcement learning for LLMs. https://github. com/verl-project/verl, 2025. Accessed 2026-05-28

  14. [14]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 , 2025. 22