pith. machine review for the scientific record.

arxiv: 2605.02469 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links


Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KL-regularized RLVR · weighted SFT · Boltzmann projection · policy mirror descent · verifiable rewards · one-shot training · reference sampling · finite-sample analysis

The pith

A reference-sampled weighted SFT objective induces exactly the Boltzmann policy of fixed-reference KL-regularized RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a weighted supervised fine-tuning loss, when samples are drawn from a fixed reference policy and weights are set to the prompt-normalized Boltzmann factor exp(r(x,y)/β)/Z(x), produces the identical policy that would be reached by running KL-regularized reinforcement learning with verifiable rewards. This equivalence removes the need to keep rollout generation, verifier scoring, and reference evaluations inside every optimization step. Instead, a one-time reference dataset can be stored and used for training, with explicit finite-sample error terms that separate coverage gaps from estimation and optimization noise. The work further shows that repeating the projection with refreshed samples corresponds to steps of KL policy mirror descent, where inexact inner solves appear only as bounded drift.
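
A minimal sketch of what that objective looks like in code, assuming a stored set of reference rollouts with verifier rewards for each prompt; the function names and tensor shapes are illustrative, not the paper's implementation.

```python
import torch

def boltzmann_weights(rewards: torch.Tensor, beta: float) -> torch.Tensor:
    """Prompt-normalized Boltzmann weights, a self-normalized stand-in for exp(r/beta)/Z(x).

    rewards: [N] verifier scores for N rollouts of one prompt, all drawn once
    from the fixed reference policy and stored offline.
    """
    return torch.softmax(rewards / beta, dim=0)

def weighted_sft_loss(logprobs: torch.Tensor, rewards: torch.Tensor, beta: float) -> torch.Tensor:
    """Reference-sampled weighted SFT loss for one prompt.

    logprobs: [N] summed token log-probabilities of each stored rollout under
    the policy being trained. Minimizing this pushes the trained policy toward
    the Boltzmann target pi_ref(y|x) * exp(r(x,y)/beta) / Z(x) on the stored support.
    """
    weights = boltzmann_weights(rewards, beta).detach()  # weights depend only on stored rewards
    return -(weights * logprobs).sum()
```

In a training loop this loss is simply averaged over prompts in the stored dataset; nothing on the optimization path calls for fresh rollouts, verifier scoring, or reference log-probability evaluations.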

Core claim

The paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x). BOLT is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price β log(1/π*(S_N|x)) from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors.
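
In symbols (a standard identity written in the abstract's notation, not a quotation from the paper's proofs), the fixed-reference objective and its optimizer are

$$
\max_{\pi}\;\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r(x,y)]-\beta\,D_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\;\;\Longrightarrow\;\;
\pi^{*}(y\mid x)=\frac{\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)}{Z(x)},\qquad
Z(x)=\sum_{y}\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big),
$$

so the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$ is exactly the density ratio $\pi^{*}(y\mid x)/\pi_{\mathrm{ref}}(y\mid x)$ that reference sampling must be reweighted by.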

What carries the argument

The reference-sampled weighted-SFT objective whose weights reduce to the prompt-normalized Boltzmann factor exp(r(x,y)/β)/Z(x), thereby inducing the exact KL-regularized target policy.

If this is right

  • Extra SFT epochs cannot compensate for missing coverage in the stored reference support.
  • Refreshed Boltzmann projections with adaptive sampling reduce to KL policy mirror descent steps.
  • Finite inner optimization of each projection appears only as additive drift from the exact mirror step.
  • The temperature-coverage-variance frontier governs the choice of β and sample size for acceptable weight variance.
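
On the last point, a hedged sketch of the diagnostic the frontier suggests: for a stored reward set, shrinking β sharpens the normalized Boltzmann weights and collapses the effective sample size (the ESS formula here is the standard importance-sampling one, not necessarily the quantity the paper reports).

```python
import numpy as np

def boltzmann_ess(rewards: np.ndarray, beta: float) -> float:
    """Effective sample size of prompt-normalized Boltzmann weights.

    Small beta concentrates weight on the best-rewarded stored rollouts, so
    few samples carry the weighted-SFT gradient; large beta flattens the
    weights but pulls the target policy back toward the reference.
    """
    logits = rewards / beta
    logits -= logits.max()        # numerical stabilization
    w = np.exp(logits)
    w /= w.sum()                  # self-normalized estimate of exp(r/beta)/Z(x)
    return 1.0 / np.sum(w ** 2)   # ESS = 1 / sum_i w_i^2 for normalized weights

rng = np.random.default_rng(0)
rewards = rng.normal(size=256)    # toy verifier scores for one prompt
for beta in (0.1, 0.5, 1.0, 5.0):
    print(f"beta={beta}: ESS={boltzmann_ess(rewards, beta):.1f} of {rewards.size}")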

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Precomputing and storing a single large reference rollout set could allow fully offline training loops for many RLVR tasks.
  • The explicit separation of coverage price from other errors suggests monitoring reference diversity as a practical diagnostic during dataset construction.
  • Iterating the projection with periodic reference refreshes offers a controllable middle ground between one-shot SFT and full online RL.

Load-bearing premise

The reference policy is held fixed and the sampler plus density-ratio weights can be chosen to induce the Boltzmann target policy exactly.

What would settle it

Compute the policy obtained from BOLT on a fixed reference set, then run full KL-regularized RLVR to convergence on the same reward and reference; the two policies should agree in expected reward and KL divergence up to the predicted finite-sample gaps.
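
A hedged sketch of that check; the policy and verifier objects below are placeholders with an assumed sample/logprob interface, not an API from the paper.

```python
import torch

@torch.no_grad()
def compare_policies(pi_bolt, pi_rlvr, pi_ref, prompts, verifier, n_samples=64):
    """Estimate expected verifier reward and KL(pi || pi_ref) for both policies.

    The prediction being tested: the BOLT row and the RLVR row should agree up
    to the paper's finite-sample gaps (coverage price, partition estimation,
    ESS variance, optimization error).
    """
    report = {}
    for name, pi in (("BOLT", pi_bolt), ("RLVR", pi_rlvr)):
        rewards, kl_terms = [], []
        for x in prompts:
            for y in pi.sample(x, n_samples):                              # rollouts y ~ pi(.|x)
                rewards.append(verifier(x, y))
                kl_terms.append(pi.logprob(x, y) - pi_ref.logprob(x, y))   # MC term for KL(pi || pi_ref)
        report[name] = {
            "expected_reward": sum(rewards) / len(rewards),
            "kl_to_reference": sum(kl_terms) / len(kl_terms),
        }
    return report
```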

read the original abstract

Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$. BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price $\beta\log(1/\pi^*(S_N\mid x))$ from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage and exposes the temperature--coverage--variance frontier. When coverage needs adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings, within the stated single-run scope.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer (the standard Boltzmann target obtained by exponentially tilting the reference policy by verifier reward). In the reference-sampled subclass, the required density-ratio weights reduce uniquely (up to prompt scaling) to the prompt-normalized form exp(r(x,y)/β)/Z(x). BOLT is introduced as the empirical estimator of this projection. The finite one-shot analysis decomposes the exact stored-support price β log(1/π*(S_N|x)) from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. Refreshed Boltzmann projections are shown to correspond to KL policy mirror descent, with finite inner solves entering as additive drift. Single-run Qwen experiments provide supporting evidence for the target-matched weights, one-shot saturation, refreshed-sampler gains, and optimization-time savings.

Significance. If the central equivalence and uniqueness reduction hold, the work supplies a theoretically grounded static weighted-SFT procedure that exactly matches the fixed-reference KL-regularized RLVR optimum, together with an error decomposition that isolates coverage limitations and a temperature-coverage-variance frontier. The explicit link to policy mirror descent for the adaptive case is a useful conceptual bridge. These elements could simplify RLVR pipelines by enabling precomputed rollouts while preserving the target policy, and the decomposition offers concrete guidance on when extra SFT epochs cannot compensate for missing reference coverage.

major comments (2)
  1. [Abstract and §3] Abstract and main theorem (likely §3): the uniqueness (up to prompt scaling) of the reduction to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x) is asserted via density-ratio matching, but the derivation steps establishing that no other weight forms in the reference-sampled class induce the same policy are not shown; without an explicit uniqueness argument the claim risks appearing tautological once the weight form is chosen to reproduce the target.
  2. [Experiments] Experiments section: the single-run Qwen results on projection evidence, one-shot saturation, and optimization savings lack error bars, multiple random seeds, or statistical controls, leaving the empirical support for the practical advantages thin relative to the strength of the theoretical claims.
minor comments (2)
  1. [§4] The finite one-shot error decomposition would be clearer if the individual terms (coverage, partition, ESS variance, etc.) were collected in a single table with their scaling and dependence on β and coverage.
  2. [Notation] Notation for the partition function Z(x) and the reference policy π_ref should be introduced once and used consistently; minor inconsistencies appear in the abstract versus later sections.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and main theorem (likely §3): the uniqueness (up to prompt scaling) of the reduction to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x) is asserted via density-ratio matching, but the derivation steps establishing that no other weight forms in the reference-sampled class induce the same policy are not shown; without an explicit uniqueness argument the claim risks appearing tautological once the weight form is chosen to reproduce the target.

    Authors: We appreciate the observation. The uniqueness is a direct consequence of the density-ratio matching condition: any reference-sampled weighted-SFT objective that induces exactly the Boltzmann target π* must have weights satisfying w(y|x) ∝ π*(y|x)/π_ref(y|x) (up to prompt-dependent scaling), which forces the prompt-normalized form exp(r(x,y)/β)/Z(x). To make this fully explicit and eliminate any risk of appearing tautological, we will add a short dedicated lemma in §3 that derives the uniqueness from the induced-policy equality, showing that any other weight function in the reference-sampled class fails to recover π* unless it reduces to this form. revision: yes

  2. Referee: [Experiments] Experiments section: the single-run Qwen results on projection evidence, one-shot saturation, and optimization savings lack error bars, multiple random seeds, or statistical controls, leaving the empirical support for the practical advantages thin relative to the strength of the theoretical claims.

    Authors: We acknowledge the limitation. The manuscript already qualifies the Qwen runs as single-run and illustrative within the stated scope, with the core contributions being the theoretical equivalence, uniqueness reduction, and finite error decomposition. The experiments serve only to confirm the predicted qualitative behaviors. In revision we will strengthen the presentation by more explicitly framing the results as illustrative and, where feasible, add a small number of additional seeds with basic variance reporting for the key metrics (target matching and one-shot saturation). revision: partial
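
The density-ratio step in the first response can be written in one line (an editorial reconstruction in the abstract's notation, not the paper's lemma): requiring that reference sampling with weight $w(x,y)$ induce the target for every prompt gives

$$
\pi_{\mathrm{ref}}(y\mid x)\,w(x,y)\;\propto\;\pi^{*}(y\mid x)
\quad\Longrightarrow\quad
w(x,y)\;=\;c(x)\,\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\;=\;c(x)\,\frac{\exp\!\big(r(x,y)/\beta\big)}{Z(x)},
$$

so within the reference-sampled class any admissible weight differs from the prompt-normalized Boltzmann factor only by the prompt-dependent scale $c(x)$, which is exactly the "up to prompt scaling" qualifier.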

Circularity Check

1 step flagged

Core identification of target-matched weights reduces to construction from known Boltzmann form

specific steps
  1. self-definitional [Abstract]
    "This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x)."

    The KL-regularized optimizer is defined as the Boltzmann policy π(y|x) ∝ π_ref(y|x) exp(r(x,y)/β) / Z(x). The paper then selects the weighted-SFT weights to force the induced policy (under reference sampling) to equal this exact target, so the claimed equality and the unique reduction to exp(r/β)/Z(x) hold by the explicit construction of the weights rather than emerging from an independent argument.

full rationale

The paper's central result identifies a reference-sampled weighted-SFT objective whose induced policy equals the KL-regularized RLVR optimizer (the Boltzmann target). This equality is achieved by setting the weights to the exact form that reproduces the target when sampling from the reference policy. Since the target policy is the standard closed-form solution of the KL-regularized objective, the weight derivation is equivalent to the input definition rather than an independent derivation. The one-shot error decomposition and mirror-descent connections build upon this constructed objective without introducing additional circular reductions in the provided text.
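
The same construction, written as a change of measure (an editorial reconstruction consistent with the weighted-MLE identity the paper works with): with $\pi^{*}(y\mid x)=\pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta)/Z(x)$,

$$
D_{\mathrm{KL}}\big(\pi^{*}\,\|\,\pi_{\theta}\big)
=-H(\pi^{*})-\mathbb{E}_{y\sim\pi^{*}(\cdot\mid x)}\big[\log\pi_{\theta}(y\mid x)\big],
\qquad
\mathbb{E}_{y\sim\pi^{*}(\cdot\mid x)}\big[\log\pi_{\theta}(y\mid x)\big]
=\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\!\left[\frac{\exp\!\big(r(x,y)/\beta\big)}{Z(x)}\,\log\pi_{\theta}(y\mid x)\right],
$$

and since the entropy term does not depend on $\theta$, minimizing the forward KL to the Boltzmann target coincides with maximizing the reference-sampled, Boltzmann-weighted log-likelihood; the circularity flag is about this equality being built into the weight choice rather than derived from an independent premise.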

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The derivation rests on the standard definition of the KL-regularized RLVR objective as the Boltzmann target and on the assumption that a weighted likelihood can be made to induce that target via density ratios; beta appears as a temperature hyperparameter.

free parameters (1)
  • beta
    Temperature scaling the reward in the Boltzmann weight exp(r/β)/Z(x); chosen to control the target policy sharpness.
axioms (1)
  • domain assumption The fixed-reference KL-regularized RLVR optimizer is exactly the Boltzmann target policy obtained by exponentially tilting the reference policy by verifier reward.
    Invoked to define the target that the weighted SFT must match.

pith-pipeline@v0.9.0 · 5618 in / 1394 out tokens · 76807 ms · 2026-05-08T19:30:57.073786+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 23 canonical work pages · 13 internal anchors
