Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
Pith reviewed 2026-05-08 19:30 UTC · model grok-4.3 · Recognition: 3 Lean theorem links
The pith
A reference-sampled weighted SFT objective induces exactly the Boltzmann policy of fixed-reference KL-regularized RLVR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x). BOLT is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price β log(1/π*(S_N|x)) from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors.
What carries the argument
The reference-sampled weighted-SFT objective whose weights reduce to the prompt-normalized Boltzmann factor exp(r(x,y)/β)/Z(x), thereby inducing the exact KL-regularized target policy.
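As a concrete sketch of that weight, here is a self-normalized estimate of exp(r(x,y)/β)/Z(x) over stored rollouts for one prompt. The function name and sample rewards are ours for illustration, not the paper's code; Z(x) is replaced by its empirical mean over the rollouts.

```python
import numpy as np

def boltzmann_weights(rewards, beta):
    """Prompt-normalized Boltzmann weights exp(r/beta) / Z_hat for one prompt.

    `rewards` are verifier scores r(x, y_i) for rollouts y_i ~ pi_ref(.|x);
    Z(x) is estimated by the empirical mean of exp(r/beta) over the rollouts.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the self-normalized ratio.
    shifted = np.exp((rewards - rewards.max()) / beta)
    return shifted / shifted.mean()  # weights average to 1 per prompt

# A weighted-SFT loss would then be -mean_i( w_i * log pi_theta(y_i | x) ).
rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. binary verifier outcomes
w = boltzmann_weights(rewards, beta=0.5)
```

Equal rewards receive equal weight, and the per-prompt normalization is what makes the induced policy the exponential tilt of π_ref rather than of the empirical sample.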
If this is right
- Extra SFT epochs cannot compensate for missing coverage in the stored reference support.
- Refreshed Boltzmann projections with adaptive sampling reduce to KL policy mirror descent steps.
- Finite inner optimization of each projection appears only as additive drift from the exact mirror step.
- The temperature-coverage-variance frontier governs the choice of β and sample size for acceptable weight variance.
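The variance side of that frontier can be made visible with an effective-sample-size (ESS) calculation; the reward distribution below is a made-up stand-in for sparse verifier successes.

```python
import numpy as np

def ess_fraction(rewards, beta):
    """Effective sample size fraction of the Boltzmann weights.

    ESS = (sum w)^2 / (N * sum w^2); a low value means a few rollouts
    dominate the weighted-SFT loss, i.e. high weight variance.
    """
    w = np.exp((np.asarray(rewards, float) - max(rewards)) / beta)
    return w.sum() ** 2 / (len(w) * (w ** 2).sum())

rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.2, size=256)  # sparse binary verifier successes
for beta in (2.0, 0.5, 0.1):
    print(f"beta={beta}: ESS fraction = {ess_fraction(rewards, beta):.3f}")
```

As β shrinks the Boltzmann target concentrates on the successful rollouts, so the ESS fraction falls toward the success rate: sharper targets cost sample efficiency at fixed coverage.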
Where Pith is reading between the lines
- Precomputing and storing a single large reference rollout set could allow fully offline training loops for many RLVR tasks.
- The explicit separation of coverage price from other errors suggests monitoring reference diversity as a practical diagnostic during dataset construction.
- Iterating the projection with periodic reference refreshes offers a controllable middle ground between one-shot SFT and full online RL.
Load-bearing premise
The reference policy is held fixed and the sampler plus density-ratio weights can be chosen to induce the Boltzmann target policy exactly.
What would settle it
Compute the policy obtained from BOLT on a fixed reference set, then run full KL-regularized RLVR to convergence on the same reward and reference; the two policies should agree in expected reward and KL divergence up to the predicted finite-sample gaps.
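A zero-cost version of this check runs on a toy discrete prompt, where the KL-regularized optimum has a closed form; the distributions, rewards, and the missing-support index below are illustrative, not the paper's setup.

```python
import numpy as np

# One prompt x with 6 candidate outputs; pi_ref and r are made up.
beta = 0.5
pi_ref = np.array([0.3, 0.25, 0.2, 0.15, 0.07, 0.03])
r = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

# Exact fixed-reference KL-regularized optimum: Boltzmann tilt of pi_ref.
tilt = pi_ref * np.exp(r / beta)
pi_star = tilt / tilt.sum()

# Weighted-SFT target under reference sampling with weights exp(r/beta)/Z:
# the induced distribution pi_ref * w recovers pi_star exactly.
Z = (pi_ref * np.exp(r / beta)).sum()
w = np.exp(r / beta) / Z
induced = pi_ref * w
assert np.allclose(induced, pi_star)

# Stored-support price: if the stored set S misses output 4 (a reward-1
# output), no amount of SFT on S can recover it; the exact price is
# beta * log(1 / pi_star(S)).
S = [0, 1, 2, 3, 5]
price = beta * np.log(1.0 / pi_star[S].sum())
```

On this toy prompt the projection matches the RLVR optimum to machine precision, and the coverage price is strictly positive whenever the stored support misses target mass, which is the predicted finite-sample gap.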
read the original abstract
Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$. BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price $\beta\log(1/\pi^*(S_N\mid x))$ from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage and exposes the temperature--coverage--variance frontier. When coverage needs adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings, within the stated single-run scope.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer (the standard Boltzmann target obtained by exponentially tilting the reference policy by verifier reward). In the reference-sampled subclass, the required density-ratio weights reduce uniquely (up to prompt scaling) to the prompt-normalized form exp(r(x,y)/β)/Z(x). BOLT is introduced as the empirical estimator of this projection. The finite one-shot analysis decomposes the exact stored-support price β log(1/π*(S_N|x)) from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. Refreshed Boltzmann projections are shown to correspond to KL policy mirror descent, with finite inner solves entering as additive drift. Single-run Qwen experiments provide supporting evidence for the target-matched weights, one-shot saturation, refreshed-sampler gains, and optimization-time savings.
Significance. If the central equivalence and uniqueness reduction hold, the work supplies a theoretically grounded static weighted-SFT procedure that exactly matches the fixed-reference KL-regularized RLVR optimum, together with an error decomposition that isolates coverage limitations and a temperature-coverage-variance frontier. The explicit link to policy mirror descent for the adaptive case is a useful conceptual bridge. These elements could simplify RLVR pipelines by enabling precomputed rollouts while preserving the target policy, and the decomposition offers concrete guidance on when extra SFT epochs cannot compensate for missing reference coverage.
major comments (2)
- [Abstract and §3] Abstract and main theorem (likely §3): the uniqueness (up to prompt scaling) of the reduction to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x) is asserted via density-ratio matching, but the derivation steps establishing that no other weight forms in the reference-sampled class induce the same policy are not shown; without an explicit uniqueness argument the claim risks appearing tautological once the weight form is chosen to reproduce the target.
- [Experiments] Experiments section: the single-run Qwen results on projection evidence, one-shot saturation, and optimization savings lack error bars, multiple random seeds, or statistical controls, leaving the empirical support for the practical advantages thin relative to the strength of the theoretical claims.
minor comments (2)
- [§4] The finite one-shot error decomposition would be clearer if the individual terms (coverage, partition, ESS variance, etc.) were collected in a single table with their scaling and dependence on β and coverage.
- [Notation] Notation for the partition function Z(x) and the reference policy π_ref should be introduced once and used consistently; minor inconsistencies appear in the abstract versus later sections.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments. We address each major point below and will revise the manuscript accordingly.
read point-by-point responses
Referee: [Abstract and §3] Abstract and main theorem (likely §3): the uniqueness (up to prompt scaling) of the reduction to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x) is asserted via density-ratio matching, but the derivation steps establishing that no other weight forms in the reference-sampled class induce the same policy are not shown; without an explicit uniqueness argument the claim risks appearing tautological once the weight form is chosen to reproduce the target.
Authors: We appreciate the observation. The uniqueness is a direct consequence of the density-ratio matching condition: any reference-sampled weighted-SFT objective that induces exactly the Boltzmann target π* must have weights satisfying w(y|x) ∝ π*(y|x)/π_ref(y|x) (up to prompt-dependent scaling), which forces the prompt-normalized form exp(r(x,y)/β)/Z(x). To make this fully explicit and eliminate any risk of appearing tautological, we will add a short dedicated lemma in §3 that derives the uniqueness from the induced-policy equality, showing that any other weight function in the reference-sampled class fails to recover π* unless it reduces to this form. revision: yes
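The uniqueness step the authors describe can be sketched in two lines (our notation, not the paper's lemma). The population reference-sampled weighted-SFT objective and its induced maximizer over all policies are

```latex
L(\theta) \;=\; \mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\!\big[\,w(x,y)\,\log\pi_\theta(y\mid x)\,\big],
\qquad
\tilde\pi_w(y\mid x) \;=\; \frac{\pi_{\mathrm{ref}}(y\mid x)\,w(x,y)}{\mathbb{E}_{y'\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\!\left[w(x,y')\right]}.
```

Requiring π̃_w = π* pointwise forces w(x,y) = c(x) π*(y|x)/π_ref(y|x) = c(x) exp(r(x,y)/β)/Z(x), i.e. the Boltzmann weight is unique up to the prompt scaling c(x).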
Referee: [Experiments] Experiments section: the single-run Qwen results on projection evidence, one-shot saturation, and optimization savings lack error bars, multiple random seeds, or statistical controls, leaving the empirical support for the practical advantages thin relative to the strength of the theoretical claims.
Authors: We acknowledge the limitation. The manuscript already qualifies the Qwen runs as single-run and illustrative within the stated scope, with the core contributions being the theoretical equivalence, uniqueness reduction, and finite error decomposition. The experiments serve only to confirm the predicted qualitative behaviors. In revision we will strengthen the presentation by more explicitly framing the results as illustrative and, where feasible, add a small number of additional seeds with basic variance reporting for the key metrics (target matching and one-shot saturation). revision: partial
Circularity Check
The core identification of the target-matched weights reduces to a construction from the known Boltzmann form.
specific steps
- self-definitional [Abstract]
"This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight exp(r(x,y)/β)/Z(x)."
The KL-regularized optimizer is defined as the Boltzmann policy π(y|x) ∝ π_ref(y|x) exp(r(x,y)/β) / Z(x). The paper then selects the weighted-SFT weights to force the induced policy (under reference sampling) to equal this exact target, so the claimed equality and the unique reduction to exp(r/β)/Z(x) hold by the explicit construction of the weights rather than emerging from an independent argument.
full rationale
The paper's central result identifies a reference-sampled weighted-SFT objective whose induced policy equals the KL-regularized RLVR optimizer (the Boltzmann target). This equality is achieved by setting the weights to the exact form that reproduces the target when sampling from the reference policy. Since the target policy is the standard closed-form solution of the KL-regularized objective, the weight derivation is equivalent to the input definition rather than an independent derivation. The one-shot error decomposition and mirror-descent connections build upon this constructed objective without introducing additional circular reductions in the provided text.
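The "standard closed-form solution" invoked above follows from completing the KL (a Donsker–Varadhan-style identity; this sketch is ours, not quoted from the paper):

```latex
\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r(x,y)]
\;-\; \beta\,\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\;=\; \beta\log Z(x) \;-\; \beta\,\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big),
```

with π*(y|x) = π_ref(y|x) exp(r(x,y)/β)/Z(x). Since the first term on the right is constant in π, the objective is maximized exactly at π = π*, which is why the circularity check classifies the weight derivation as a construction from this known form rather than an independent result.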
Axiom & Free-Parameter Ledger
free parameters (1)
- β (KL temperature)
axioms (1)
- domain assumption: The fixed-reference KL-regularized RLVR optimizer is exactly the Boltzmann target policy obtained by exponentially tilting the reference policy by verifier reward.
Lean theorems connected to this paper
- IndisputableMonolith/Cost (Jcost = (1/2)(x+x⁻¹)−1) · washburn_uniqueness_aczel · tag: unclear
  Unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Passage: π*(y|x) ≜ π_ref(y|x) exp(r(x,y)/β) / Z(x), Z(x) ≜ E_{y∼π_ref}[exp(r(x,y)/β)].
- IndisputableMonolith/Foundation/ArithmeticFromLogic (orbit γ^n under multiplicative iteration) · embed_eq_pow · tag: echoes
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: Refreshed Boltzmann projection ≡ KL policy mirror descent: π_{θ_{k+1}} ∝ π_{θ_k} exp(r/β); after K rounds π ∝ π_0 exp(Kr/β).
- IndisputableMonolith/Constants (c=1, ℏ, G as φ-powers; zero adjustable parameters) · reality_from_one_distinction · tag: unclear
  Unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Passage: Temperature β is a free hyperparameter controlling the concentration of the Boltzmann target; the reward r is a learned/verifier signal, not derived.
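The mirror-descent recursion π_{θ_{k+1}} ∝ π_{θ_k} exp(r/β) quoted above can be checked numerically on a toy discrete policy (a sketch with made-up numbers, assuming exact inner solves at each round):

```python
import numpy as np

# Each exact refreshed Boltzmann projection multiplies the current policy
# by exp(r/beta), so after K rounds the policy is proportional to
# pi_0 * exp(K * r / beta).
beta, K = 1.0, 3
pi0 = np.array([0.4, 0.3, 0.2, 0.1])
r = np.array([0.0, 1.0, 0.5, 2.0])

pi = pi0.copy()
for _ in range(K):          # one refreshed projection = one exact mirror step
    pi = pi * np.exp(r / beta)
    pi = pi / pi.sum()

closed_form = pi0 * np.exp(K * r / beta)
closed_form /= closed_form.sum()
assert np.allclose(pi, closed_form)
```

Mass migrates toward the highest-reward output as K grows, matching the closed form; finite inner optimization would enter as additive drift from each exact step.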
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- echoes: The paper passage has the same mathematical shape or conceptual pattern as a Recognition theorem, but is not a direct formal dependency.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)