Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
Pith reviewed 2026-05-21 23:35 UTC · model grok-4.3
The pith
Prefix-RFT blends demonstration prefixes with reinforcement fine-tuning to exceed standalone SFT and RFT on math reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prefix-RFT samples prefixes from demonstration trajectories and applies reinforcement fine-tuning starting from those points. This approach unifies the two post-training paradigms so the model first follows demonstration behavior and then explores improvements. On mathematical reasoning tasks it delivers higher performance than either method used alone or combined in parallel, while staying robust to differences in demonstration data.
What carries the argument
Prefix sampling from demonstration data, which supplies initial trajectory segments as starting states for the reinforcement learning policy to continue from.
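The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the uniform draw of the cut point, the function names `sample_prefix` and `prefix_rollout`, and the `continue_fn` callback standing in for the policy are all assumptions made for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def sample_prefix(demo_tokens: Sequence[int], rng: random.Random,
                  min_frac: float = 0.0, max_frac: float = 1.0) -> List[int]:
    """Return the first k tokens of a demonstration, with k chosen at random.

    The cut fraction is drawn uniformly from [min_frac, max_frac]; the uniform
    distribution is an assumption of this sketch, not a detail from the paper.
    """
    if not 0.0 <= min_frac <= max_frac <= 1.0:
        raise ValueError("need 0 <= min_frac <= max_frac <= 1")
    frac = rng.uniform(min_frac, max_frac)
    cut = int(len(demo_tokens) * frac)  # number of demonstration tokens kept
    return list(demo_tokens[:cut])


def prefix_rollout(prompt: Sequence[int], demo_tokens: Sequence[int],
                   continue_fn: Callable[[List[int]], List[int]],
                   rng: random.Random) -> Tuple[List[int], List[int]]:
    """Start the policy from a demonstration prefix and let it finish.

    continue_fn is a hypothetical stand-in for the policy's sampler: it
    receives prompt + prefix and returns the generated suffix tokens.
    """
    prefix = sample_prefix(demo_tokens, rng)
    suffix = continue_fn(list(prompt) + prefix)
    return prefix, suffix
```

With an empty prefix the rollout is an ordinary on-policy sample, and with the full demonstration as prefix there is nothing left to explore, which is the sense in which the prefix length interpolates between the two paradigms.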
If this is right
- Prefix-RFT reaches higher accuracy on mathematical reasoning benchmarks than SFT or RFT used separately.
- It surpasses mixed-policy methods that run SFT and RFT in parallel.
- Performance stays stable across changes in the amount or quality of available demonstration data.
- The method offers a straightforward way to combine imitation and exploration in a single training run.
Where Pith is reading between the lines
- The prefix-sampling idea could extend to other domains that need both example-following and open-ended improvement, such as code generation.
- Systematic variation of prefix length might reveal an optimal balance point between imitation and exploration.
- Hybrid prefix methods may lower the total demonstration data needed for effective fine-tuning.
Load-bearing premise
Prefix sampling from demonstrations can reliably merge the strengths of SFT and RFT without creating new generalization problems or strong sensitivity to prefix length and sampling choices.
What would settle it
An experiment on the same math reasoning benchmarks where Prefix-RFT shows no accuracy gain over the stronger of SFT or RFT, or where results shift sharply when prefix length is changed.
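The two failure conditions named above can be written as an explicit check. This is a sketch under stated assumptions: the function name, the choice to compare the best Prefix-RFT accuracy across prefix lengths against the stronger baseline, and the `max_spread` threshold for what counts as a "sharp" shift are all hypothetical and would need to be fixed before running the experiment.

```python
from typing import Dict, Mapping


def falsification_check(prefix_rft_acc: Mapping[float, float],
                        sft_acc: float, rft_acc: float,
                        max_spread: float) -> Dict[str, object]:
    """Evaluate the two conditions that would undermine the core claim.

    prefix_rft_acc maps a prefix-length setting to benchmark accuracy.
    max_spread is an assumed tolerance for variation across prefix lengths.
    """
    if not prefix_rft_acc:
        raise ValueError("need at least one Prefix-RFT result")
    best_baseline = max(sft_acc, rft_acc)  # the stronger of SFT and RFT
    accs = list(prefix_rft_acc.values())
    spread = max(accs) - min(accs)
    return {
        # Even the best prefix setting fails to beat the stronger baseline.
        "no_gain": max(accs) <= best_baseline,
        # Accuracy varies across prefix lengths by more than the tolerance.
        "sharp_shift": spread > max_spread,
        "spread": spread,
    }
```

Either flag being true on the same benchmarks would count against the claim; a paired comparison over several seeds would be needed before treating a small margin as a real gain.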
Original abstract
Existing LLM post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Prefix-RFT, a hybrid post-training method for LLMs that blends supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) by sampling prefixes from demonstration data. Using mathematical reasoning as the testbed, it claims Prefix-RFT outperforms standalone SFT and RFT as well as parallel mixed-policy RFT baselines, while remaining robust to variations in demonstration data quality and quantity; the work frames SFT and RFT as complementary and presents Prefix-RFT as a simple harmonization technique.
Significance. If the empirical gains are robust, the result would offer a practical, low-overhead way to combine imitation learning with exploration in LLM alignment, addressing known generalization issues in pure SFT and policy sensitivity in pure RFT. The emphasis on prefix sampling as a lightweight bridge between the two paradigms could influence hybrid fine-tuning designs, particularly for reasoning tasks where demonstration data is available but limited.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The robustness claim is supported only by ablations on demonstration data quality and quantity; no equivalent controls or sensitivity analysis are reported for the prefix length hyperparameter or the choice of sampling strategy (random vs. deterministic), which are load-bearing for the central harmonization argument and could explain the reported gains over baselines.
- [§3] §3 (Method): The unified view of SFT and RFT is presented at a high level, but the precise interaction between the prefix sampling distribution and the subsequent RFT objective is not formalized with an equation or pseudocode that would allow readers to verify whether Prefix-RFT reduces to a known mixed-policy objective or introduces a distinct bias.
- [Results tables] Table 2 or equivalent results table: Outperformance is reported against standalone SFT, RFT, and mixed-policy baselines, yet the manuscript does not indicate whether results are averaged over multiple random seeds or include statistical significance tests; without these, the magnitude of improvement cannot be assessed as reliable rather than run-specific.
minor comments (2)
- [§3] Notation for the prefix sampling procedure could be clarified with a small diagram or explicit probability expression to distinguish it from standard behavior cloning.
- [Ablation figures] The abstract states that Prefix-RFT 'remains robust' to data variations, but the corresponding figures or tables should explicitly label the range of data quantities tested (e.g., 10%, 50%, 100% of demonstrations) for reproducibility.
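One possible shape for the formalization requested in the §3 comments is given here as an illustration only. The notation is assumed for this sketch and is not taken from the manuscript: $\mathcal{D}$ is the demonstration set, $y^{\star}_{1:k}$ the first $k$ tokens of a demonstration, $q$ the prefix-length distribution, $\pi_{\theta}$ the policy, $r$ the outcome reward, and $\oplus$ concatenation.

```latex
J(\theta)
  = \mathbb{E}_{(x,\,y^{\star}) \sim \mathcal{D}}\;
    \mathbb{E}_{k \sim q(\cdot \mid y^{\star})}\;
    \mathbb{E}_{\hat{y} \sim \pi_{\theta}(\cdot \mid x,\; y^{\star}_{1:k})}
    \Bigl[\, r\bigl(x,\; y^{\star}_{1:k} \oplus \hat{y}\bigr) \Bigr]
```

Under this reading, $k = 0$ recovers a standard RFT objective over on-policy samples, while $k = |y^{\star}|$ leaves no suffix to generate. Whether the prefix tokens themselves contribute to the gradient, and how that compares with mixed-policy objectives, is exactly what the report asks the authors to state.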
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions that strengthen the empirical and methodological presentation of Prefix-RFT.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): The robustness claim is supported only by ablations on demonstration data quality and quantity; no equivalent controls or sensitivity analysis are reported for the prefix length hyperparameter or the choice of sampling strategy (random vs. deterministic), which are load-bearing for the central harmonization argument and could explain the reported gains over baselines.
  Authors: We agree that additional sensitivity analyses would further substantiate the robustness of Prefix-RFT. While our existing ablations target demonstration data quality and quantity because they are most relevant to practical deployment, we will expand §4 with new experiments that vary prefix length (e.g., 10–60% of sequence length) and compare random prefix sampling against deterministic alternatives. These results will be reported alongside the existing ablations to directly address the referee’s concern.
  Revision: yes
- Referee: [§3] §3 (Method): The unified view of SFT and RFT is presented at a high level, but the precise interaction between the prefix sampling distribution and the subsequent RFT objective is not formalized with an equation or pseudocode that would allow readers to verify whether Prefix-RFT reduces to a known mixed-policy objective or introduces a distinct bias.
  Authors: We accept that a more precise formalization is needed. In the revised §3 we will add an explicit objective equation that decomposes the Prefix-RFT loss into an expectation over prefixes drawn from the demonstration distribution followed by the standard RFT objective on the generated suffix. We will also include pseudocode for the full training loop and a short discussion clarifying how the prefix-conditioning step introduces a distinct bias relative to standard mixed-policy baselines.
  Revision: yes
- Referee: [Results tables] Table 2 or equivalent results table: Outperformance is reported against standalone SFT, RFT, and mixed-policy baselines, yet the manuscript does not indicate whether results are averaged over multiple random seeds or include statistical significance tests; without these, the magnitude of improvement cannot be assessed as reliable rather than run-specific.
  Authors: The reported numbers in Table 2 reflect single training runs, which is common under the computational budget of large-scale RFT. To improve reliability, we will re-execute the main comparisons across three random seeds, report means and standard deviations, and add paired statistical significance tests (e.g., t-tests) between Prefix-RFT and each baseline. The updated table and accompanying text will appear in the revised manuscript.
  Revision: yes
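The seed-level reporting promised in the last response (means, standard deviations, and a paired t-test) could be computed along these lines. This is a sketch using only the Python standard library; the function names are hypothetical, and the scores passed in must be paired by seed, which is an assumption about how the runs would be organized.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one method's per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method_a: Sequence[float],
                       method_b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed score pairs.

    method_a[i] and method_b[i] must come from the same seed.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With three seeds the test has only two degrees of freedom, so the t statistic must be large before a difference reaches conventional significance. The p-value can be read from a t table for the returned degrees of freedom, or obtained directly from `scipy.stats.ttest_rel` if SciPy is available.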
Circularity Check
No circularity: empirical hybrid method validated against external baselines
The paper introduces Prefix-RFT as a practical blending of SFT and RFT via prefix sampling from demonstrations, then evaluates it through direct performance comparisons on mathematical reasoning benchmarks. No load-bearing mathematical derivation, uniqueness theorem, or fitted-parameter prediction is present; the central claims rest on experimental outperformance relative to standalone SFT, RFT, and mixed-policy baselines. Ablations address data quality and quantity but do not create self-referential loops. The work is self-contained against external benchmarks with no reduction of results to quantities defined inside its own equations or prior self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Mathematical reasoning problems serve as a representative test bed for comparing SFT, RFT, and hybrid methods.
Forward citations
Cited by 12 Pith papers
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
  The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...
- Near-Future Policy Optimization
  NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
- SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
  Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
- Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
  HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
- Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
  FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.
- EasyVideoR1: Easier RL for Video Understanding
  EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
- A Survey of Reinforcement Learning for Large Reasoning Models
  A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.