pith. machine review for the scientific record.

arxiv: 2605.02469 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links


Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KL-regularized RLVR · weighted SFT · Boltzmann projection · policy mirror descent · verifiable rewards · one-shot training · reference sampling · finite-sample analysis

The pith

A reference-sampled weighted SFT objective induces exactly the Boltzmann policy of fixed-reference KL-regularized RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a weighted supervised fine-tuning loss, when samples are drawn from a fixed reference policy and weights are set to the prompt-normalized Boltzmann factor exp(r(x,y)/β)/Z(x), produces the identical policy that would be reached by running KL-regularized reinforcement learning with verifiable rewards. This equivalence removes the need to keep rollout generation, verifier scoring, and reference evaluations inside every optimization step. Instead, a one-time reference dataset can be stored and used for training, with explicit finite-sample error terms that separate coverage gaps from estimation and optimization noise. The work further shows that repeating the projection with refreshed samples corresponds to steps of KL policy mirror descent, where inexact inner solves appear only as bounded drift.
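
A minimal sketch of what that objective looks like in code, assuming a stored set of reference rollouts with verifier rewards for each prompt; the function names and tensor shapes are illustrative, not the paper's implementation.

```python
import torch

def boltzmann_weights(rewards: torch.Tensor, beta: float) -> torch.Tensor:
    """Prompt-normalized Boltzmann weights, a self-normalized stand-in for exp(r/beta)/Z(x).

    rewards: [N] verifier scores for N rollouts of one prompt, all drawn once
    from the fixed reference policy and stored offline.
    """
    return torch.softmax(rewards / beta, dim=0)

def weighted_sft_loss(logprobs: torch.Tensor, rewards: torch.Tensor, beta: float) -> torch.Tensor:
    """Reference-sampled weighted SFT loss for one prompt.

    logprobs: [N] summed token log-probabilities of each stored rollout under
    the policy being trained. Minimizing this pushes the trained policy toward
    the Boltzmann target pi_ref(y|x) * exp(r(x,y)/beta) / Z(x) on the stored support.
    """
    weights = boltzmann_weights(rewards, beta).detach()  # weights depend only on stored rewards
    return -(weights * logprobs).sum()
```

In a training loop this loss is simply averaged over prompts in the stored dataset; nothing on the optimization path calls for fresh rollouts, verifier scoring, or reference log-probability evaluations.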

Core claim

The paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x). BOLT is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price β log(1/π*(S_N|x)) from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors.
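
In symbols (a standard identity written in the abstract's notation, not a quotation from the paper's proofs), the fixed-reference objective and its optimizer are

$$
\max_{\pi}\;\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r(x,y)]-\beta\,D_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\;\;\Longrightarrow\;\;
\pi^{*}(y\mid x)=\frac{\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)}{Z(x)},\qquad
Z(x)=\sum_{y}\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big),
$$

so the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$ is exactly the density ratio $\pi^{*}(y\mid x)/\pi_{\mathrm{ref}}(y\mid x)$ that reference sampling must be reweighted by.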

What carries the argument

The reference-sampled weighted-SFT objective whose weights reduce to the prompt-normalized Boltzmann factor exp(r(x,y)/β)/Z(x), thereby inducing the exact KL-regularized target policy.

If this is right

  • Extra SFT epochs cannot compensate for missing coverage in the stored reference support.
  • Refreshed Boltzmann projections with adaptive sampling reduce to KL policy mirror descent steps.
  • Finite inner optimization of each projection appears only as additive drift from the exact mirror step.
  • The temperature-coverage-variance frontier governs the choice of β and sample size for acceptable weight variance.
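
On the last point, a hedged sketch of the diagnostic the frontier suggests: for a stored reward set, shrinking β sharpens the normalized Boltzmann weights and collapses the effective sample size (the ESS formula here is the standard importance-sampling one, not necessarily the quantity the paper reports).

```python
import numpy as np

def boltzmann_ess(rewards: np.ndarray, beta: float) -> float:
    """Effective sample size of prompt-normalized Boltzmann weights.

    Small beta concentrates weight on the best-rewarded stored rollouts, so
    few samples carry the weighted-SFT gradient; large beta flattens the
    weights but pulls the target policy back toward the reference.
    """
    logits = rewards / beta
    logits -= logits.max()        # numerical stabilization
    w = np.exp(logits)
    w /= w.sum()                  # self-normalized estimate of exp(r/beta)/Z(x)
    return 1.0 / np.sum(w ** 2)   # ESS = 1 / sum_i w_i^2 for normalized weights

rng = np.random.default_rng(0)
rewards = rng.normal(size=256)    # toy verifier scores for one prompt
for beta in (0.1, 0.5, 1.0, 5.0):
    print(f"beta={beta}: ESS={boltzmann_ess(rewards, beta):.1f} of {rewards.size}")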

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Precomputing and storing a single large reference rollout set could allow fully offline training loops for many RLVR tasks.
  • The explicit separation of coverage price from other errors suggests monitoring reference diversity as a practical diagnostic during dataset construction.
  • Iterating the projection with periodic reference refreshes offers a controllable middle ground between one-shot SFT and full online RL.

Load-bearing premise

The reference policy is held fixed and the sampler plus density-ratio weights can be chosen to induce the Boltzmann target policy exactly.

What would settle it

Compute the policy obtained from BOLT on a fixed reference set, then run full KL-regularized RLVR to convergence on the same reward and reference; the two policies should agree in expected reward and KL divergence up to the predicted finite-sample gaps.
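
A hedged sketch of that check; the policy and verifier objects below are placeholders with an assumed sample/logprob interface, not an API from the paper.

```python
import torch

@torch.no_grad()
def compare_policies(pi_bolt, pi_rlvr, pi_ref, prompts, verifier, n_samples=64):
    """Estimate expected verifier reward and KL(pi || pi_ref) for both policies.

    The prediction being tested: the BOLT row and the RLVR row should agree up
    to the paper's finite-sample gaps (coverage price, partition estimation,
    ESS variance, optimization error).
    """
    report = {}
    for name, pi in (("BOLT", pi_bolt), ("RLVR", pi_rlvr)):
        rewards, kl_terms = [], []
        for x in prompts:
            for y in pi.sample(x, n_samples):                              # rollouts y ~ pi(.|x)
                rewards.append(verifier(x, y))
                kl_terms.append(pi.logprob(x, y) - pi_ref.logprob(x, y))   # MC term for KL(pi || pi_ref)
        report[name] = {
            "expected_reward": sum(rewards) / len(rewards),
            "kl_to_reference": sum(kl_terms) / len(kl_terms),
        }
    return report
```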

read the original abstract

Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$. BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price $\beta\log(1/\pi^*(S_N\mid x))$ from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage and exposes the temperature--coverage--variance frontier. When coverage needs adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings, within the stated single-run scope.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer (the standard Boltzmann target obtained by exponentially tilting the reference policy by verifier reward). In the reference-sampled subclass, the required density-ratio weights reduce uniquely (up to prompt scaling) to the prompt-normalized form exp(r(x,y)/β)/Z(x). BOLT is introduced as the empirical estimator of this projection. The finite one-shot analysis decomposes the exact stored-support price β log(1/π*(S_N|x)) from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. Refreshed Boltzmann projections are shown to correspond to KL policy mirror descent, with finite inner solves entering as additive drift. Single-run Qwen experiments provide supporting evidence for the target-matched weights, one-shot saturation, refreshed-sampler gains, and optimization-time savings.

Significance. If the central equivalence and uniqueness reduction hold, the work supplies a theoretically grounded static weighted-SFT procedure that exactly matches the fixed-reference KL-regularized RLVR optimum, together with an error decomposition that isolates coverage limitations and a temperature-coverage-variance frontier. The explicit link to policy mirror descent for the adaptive case is a useful conceptual bridge. These elements could simplify RLVR pipelines by enabling precomputed rollouts while preserving the target policy, and the decomposition offers concrete guidance on when extra SFT epochs cannot compensate for missing reference coverage.

major comments (2)
  1. [Abstract and §3] Abstract and main theorem (likely §3): the uniqueness (up to prompt scaling) of the reduction to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x) is asserted via density-ratio matching, but the derivation steps establishing that no other weight forms in the reference-sampled class induce the same policy are not shown; without an explicit uniqueness argument the claim risks appearing tautological once the weight form is chosen to reproduce the target.
  2. [Experiments] Experiments section: the single-run Qwen results on projection evidence, one-shot saturation, and optimization savings lack error bars, multiple random seeds, or statistical controls, leaving the empirical support for the practical advantages thin relative to the strength of the theoretical claims.
minor comments (2)
  1. [§4] The finite one-shot error decomposition would be clearer if the individual terms (coverage, partition, ESS variance, etc.) were collected in a single table with their scaling and dependence on β and coverage.
  2. [Notation] Notation for the partition function Z(x) and the reference policy π_ref should be introduced once and used consistently; minor inconsistencies appear in the abstract versus later sections.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and main theorem (likely §3): the uniqueness (up to prompt scaling) of the reduction to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x) is asserted via density-ratio matching, but the derivation steps establishing that no other weight forms in the reference-sampled class induce the same policy are not shown; without an explicit uniqueness argument the claim risks appearing tautological once the weight form is chosen to reproduce the target.

    Authors: We appreciate the observation. The uniqueness is a direct consequence of the density-ratio matching condition: any reference-sampled weighted-SFT objective that induces exactly the Boltzmann target π* must have weights satisfying w(y|x) ∝ π*(y|x)/π_ref(y|x) (up to prompt-dependent scaling), which forces the prompt-normalized form exp(r(x,y)/β)/Z(x). To make this fully explicit and eliminate any risk of appearing tautological, we will add a short dedicated lemma in §3 that derives the uniqueness from the induced-policy equality, showing that any other weight function in the reference-sampled class fails to recover π* unless it reduces to this form. revision: yes

  2. Referee: [Experiments] Experiments section: the single-run Qwen results on projection evidence, one-shot saturation, and optimization savings lack error bars, multiple random seeds, or statistical controls, leaving the empirical support for the practical advantages thin relative to the strength of the theoretical claims.

    Authors: We acknowledge the limitation. The manuscript already qualifies the Qwen runs as single-run and illustrative within the stated scope, with the core contributions being the theoretical equivalence, uniqueness reduction, and finite error decomposition. The experiments serve only to confirm the predicted qualitative behaviors. In revision we will strengthen the presentation by more explicitly framing the results as illustrative and, where feasible, add a small number of additional seeds with basic variance reporting for the key metrics (target matching and one-shot saturation). revision: partial
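
The density-ratio step in the first response can be written in one line (an editorial reconstruction in the abstract's notation, not the paper's lemma): requiring that reference sampling with weight $w(x,y)$ induce the target for every prompt gives

$$
\pi_{\mathrm{ref}}(y\mid x)\,w(x,y)\;\propto\;\pi^{*}(y\mid x)
\quad\Longrightarrow\quad
w(x,y)\;=\;c(x)\,\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\;=\;c(x)\,\frac{\exp\!\big(r(x,y)/\beta\big)}{Z(x)},
$$

so within the reference-sampled class any admissible weight differs from the prompt-normalized Boltzmann factor only by the prompt-dependent scale $c(x)$, which is exactly the "up to prompt scaling" qualifier.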

Circularity Check

1 step flagged

Core identification of target-matched weights reduces to construction from known Boltzmann form

specific steps
  1. self-definitional [Abstract]
    "This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x)."

    The KL-regularized optimizer is defined as the Boltzmann policy π(y|x) ∝ π_ref(y|x) exp(r(x,y)/β) / Z(x). The paper then selects the weighted-SFT weights to force the induced policy (under reference sampling) to equal this exact target, so the claimed equality and the unique reduction to exp(r/β)/Z(x) hold by the explicit construction of the weights rather than emerging from an independent argument.

full rationale

The paper's central result identifies a reference-sampled weighted-SFT objective whose induced policy equals the KL-regularized RLVR optimizer (the Boltzmann target). This equality is achieved by setting the weights to the exact form that reproduces the target when sampling from the reference policy. Since the target policy is the standard closed-form solution of the KL-regularized objective, the weight derivation is equivalent to the input definition rather than an independent derivation. The one-shot error decomposition and mirror-descent connections build upon this constructed objective without introducing additional circular reductions in the provided text.
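
The same construction, written as a change of measure (an editorial reconstruction consistent with the weighted-MLE identity the paper works with): with $\pi^{*}(y\mid x)=\pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta)/Z(x)$,

$$
D_{\mathrm{KL}}\big(\pi^{*}\,\|\,\pi_{\theta}\big)
=-H(\pi^{*})-\mathbb{E}_{y\sim\pi^{*}(\cdot\mid x)}\big[\log\pi_{\theta}(y\mid x)\big],
\qquad
\mathbb{E}_{y\sim\pi^{*}(\cdot\mid x)}\big[\log\pi_{\theta}(y\mid x)\big]
=\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\!\left[\frac{\exp\!\big(r(x,y)/\beta\big)}{Z(x)}\,\log\pi_{\theta}(y\mid x)\right],
$$

and since the entropy term does not depend on $\theta$, minimizing the forward KL to the Boltzmann target coincides with maximizing the reference-sampled, Boltzmann-weighted log-likelihood; the circularity flag is about this equality being built into the weight choice rather than derived from an independent premise.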

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The derivation rests on the standard definition of the KL-regularized RLVR objective as the Boltzmann target and on the assumption that a weighted likelihood can be made to induce that target via density ratios; beta appears as a temperature hyperparameter.

free parameters (1)
  • beta
    Temperature scaling the reward in the Boltzmann weight exp(r/β)/Z(x); chosen to control the target policy sharpness.
axioms (1)
  • domain assumption The fixed-reference KL-regularized RLVR optimizer is exactly the Boltzmann target policy obtained by exponentially tilting the reference policy by verifier reward.
    Invoked to define the target that the weighted SFT must match.

pith-pipeline@v0.9.0 · 5618 in / 1394 out tokens · 76807 ms · 2026-05-08T19:30:57.073786+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 23 canonical work pages · 13 internal anchors
