
arxiv: 2507.01679 · v3 · pith:EDPRMAQQnew · submitted 2025-07-02 · 💻 cs.LG · cs.AI · cs.CL

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Pith reviewed 2026-05-21 23:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords: supervised fine-tuning · reinforcement fine-tuning · prefix sampling · mathematical reasoning · LLM post-training · hybrid fine-tuning

The pith

Prefix-RFT blends demonstration prefixes with reinforcement fine-tuning to exceed standalone SFT and RFT on math reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Prefix-RFT to unify supervised fine-tuning and reinforcement fine-tuning for language models. It samples initial prefixes from demonstration data and then lets reinforcement learning continue from those points. This setup aims to capture reliable imitation from demonstrations while gaining performance boosts from exploration. Experiments on mathematical reasoning problems show the hybrid method beats pure SFT, pure RFT, and parallel mixed-policy approaches. It also holds up when demonstration data varies in quality or quantity.

Core claim

Prefix-RFT samples prefixes from demonstration trajectories and applies reinforcement fine-tuning starting from those points. This approach unifies the two post-training paradigms so the model first follows demonstration behavior and then explores improvements. On mathematical reasoning tasks it delivers higher performance than either method used alone or combined in parallel, while staying robust to differences in demonstration data.
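One way to write this claim down, offered only as a sketch and not as the paper's own formulation: let $x$ be a problem, $y^{*}$ its demonstration, $\ell$ a prefix length drawn from a hypothetical distribution $p(\ell)$, $\pi_\theta$ the policy, and $R$ the task reward. Every symbol here is an assumption introduced for illustration.

```latex
J(\theta) =
  \mathbb{E}_{(x,\, y^{*}) \sim \mathcal{D}} \;
  \mathbb{E}_{\ell \sim p(\ell)} \;
  \mathbb{E}_{y_{>\ell} \sim \pi_\theta(\cdot \mid x,\, y^{*}_{\le \ell})}
  \left[ R\!\left(x,\; y^{*}_{\le \ell} \oplus y_{>\ell}\right) \right]
```

Under this reading, $\ell = 0$ gives a fully on-policy rollout as in plain RFT, while $\ell = |y^{*}|$ hands the policy the whole demonstration. The expression does not capture how the demonstration tokens themselves are updated, which the Figure 1 caption says is constrained by entropy-based clipping.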

What carries the argument

Prefix sampling from demonstration data, which supplies initial trajectory segments as starting states for the reinforcement learning policy to continue from.
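The mechanism can be illustrated with a minimal Python sketch. The function names, the token-list representation, and the uniform draw of the prefix ratio are all assumptions made for illustration; the page does not state how the paper samples prefix lengths, and `continue_fn` is a hypothetical stand-in for the policy's generation call.

```python
import random
from typing import Callable, List, Optional, Tuple


def sample_prefix(demo_tokens: List[str], ratio: float) -> List[str]:
    """Return the leading `ratio` fraction of a demonstration's tokens."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    cut = int(len(demo_tokens) * ratio)
    return demo_tokens[:cut]


def prefix_guided_rollout(
    problem: str,
    demo_tokens: List[str],
    continue_fn: Callable[[str, List[str]], List[str]],
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Build one rollout: a demonstration prefix plus an on-policy continuation.

    Returns the concatenated sequence and the prefix length, so a trainer
    can tell demonstration tokens apart from sampled ones.
    """
    rng = rng or random.Random()
    ratio = rng.random()  # assumption: prefix ratio drawn uniformly from [0, 1)
    prefix = sample_prefix(demo_tokens, ratio)
    continuation = continue_fn(problem, prefix)  # hypothetical policy call
    return prefix + continuation, len(prefix)
```

Returning the prefix length alongside the sequence is a design choice of this sketch: it keeps the boundary between imitated and explored tokens available to whatever loss is applied afterwards.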

If this is right

  • Prefix-RFT reaches higher accuracy on mathematical reasoning benchmarks than SFT or RFT used separately.
  • It surpasses mixed-policy methods that run SFT and RFT in parallel.
  • Performance stays stable across changes in the amount or quality of available demonstration data.
  • The method offers a straightforward way to combine imitation and exploration in a single training run.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The prefix-sampling idea could extend to other domains that need both example-following and open-ended improvement, such as code generation.
  • Systematic variation of prefix length might reveal an optimal balance point between imitation and exploration.
  • Hybrid prefix methods may lower the total demonstration data needed for effective fine-tuning.

Load-bearing premise

Prefix sampling from demonstrations can reliably merge the strengths of SFT and RFT without creating new generalization problems or strong sensitivity to prefix length and sampling choices.

What would settle it

An experiment on the same math reasoning benchmarks where Prefix-RFT shows no accuracy gain over the stronger of SFT or RFT, or where results shift sharply when prefix length is changed.

Figures

Figures reproduced from arXiv: 2507.01679 by Edoardo M. Ponti, Ivan Titov, Tianhao Cheng, Yinghui Xu, Zeyu Huang, Zihan Qiu, Zili Wang.

Figure 1
Figure 1: The training pipeline of Prefix-RFT. The method does minimal modification to the existing RFT training pipeline. Given a problem and a demonstration, a prefix is sampled to guide the online continuation. The concatenated sequence yn is mixed with other online rollouts to perform RFT-style training. We also utilize an entropy-based clipping strategy to constrain the update on demonstration.
Figure 2
Figure 2: Averaged performance of math reasoning benchmarks of Prefix-RFT and other baselines.
Results on more models: In addition to Qwen2.5-Math-7B, we also test our method on a smaller-scale model, Qwen2.5-Math-1.5B, and a weaker base model, LLaMA-3.1-8B. The training settings for LLaMA models follow exactly (Yan et al., 2025). As the overall entropy of LLaMA models is relatively higher, we set the ratio for en…
Figure 3
Figure 3: Training trajectories of SFT, RFT, and Prefix-RFT. The x-axis denotes the SFT loss on the demonstrations. The y-axis represents the Avg@16 and Best@16 scores. The final step is marked with the yellow star, annotated using the final SFT loss and the final score.
Comparing SFT and RFT: As mentioned above, SFT and RFT present distinct training paradigms, with the former focusing on minimizing the likelihood o…
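The Avg@16 and Best@16 scores on the y-axis are not defined on this page. A common reading, assumed here, is that Avg@k is the fraction of k sampled answers that are correct and Best@k is 1 when at least one of the k is correct. A minimal Python sketch under that assumption, with hypothetical function names:

```python
from typing import Sequence


def avg_at_k(correct: Sequence[bool]) -> float:
    """Fraction of the k sampled answers that are correct (assumed Avg@k)."""
    if not correct:
        raise ValueError("need at least one sample")
    return sum(correct) / len(correct)


def best_at_k(correct: Sequence[bool]) -> float:
    """1.0 if any of the k sampled answers is correct, else 0.0 (assumed Best@k)."""
    if not correct:
        raise ValueError("need at least one sample")
    return 1.0 if any(correct) else 0.0
```

A benchmark-level score would then average these per-problem values over all problems.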
Figure 4
Figure 4: The average reward of rollouts initiated with a prefix and the overall reward. The shaded area represents the advantage assigned to the prefix. As the advantage diminishes, the training gradually transitions from SFT to RFT, aligning with the widely used SFT-then-RFT pipeline.
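The quantity shaded in Figure 4 can be illustrated with a small Python sketch. It assumes a mean-baseline, group-relative advantage (in the style of critic-free RFT methods); the page does not say which estimator the paper uses, and both function names are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Advantage of each rollout relative to the group's mean reward."""
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


def prefix_gap(prefix_rewards: Sequence[float],
               online_rewards: Sequence[float]) -> float:
    """Mean reward of prefix-initiated rollouts minus the overall mean reward."""
    overall = fmean(list(prefix_rewards) + list(online_rewards))
    return fmean(prefix_rewards) - overall
```

For example, one successful prefix-guided rollout among three failed online rollouts gives `prefix_gap([1.0], [0.0, 0.0, 0.0]) == 0.75`, while a group in which every rollout succeeds gives a gap of zero. This matches the caption's description of the demonstration signal fading as the policy improves.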
Figure 5
Figure 5: The change in SFT loss on demonstrations on problems of varying difficulties, suggesting Prefix-RFT provides more supervision for more challenging problems.
We then investigate whether such a transition also exists at the example level. We thus analyze the change in SFT loss on demonstration for problems of varying difficulty. This analysis focuses on the training interval from the first epoch to the th…
Figure 6
Figure 6: Ablation study results on the entropy-based clipping strategy. From left to right: training
Figure 7
Figure 7: Effect of the scheduling strategy on benchmark performance and training dynamics. Left: Performance comparison between the proposed cosine decay scheduler and the baseline Uniform Scheduler. Right: Training reward dynamics when employing the Uniform Scheduler.
To investigate the effect of the proposed cosine decay scheduler, we conduct an ablation study comparing it against a Uniform Scheduler baseline.…
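A cosine decay schedule of the kind named in Figure 7 can be sketched in Python as follows. What exactly the paper schedules is not stated on this page; the sketch assumes it is a scalar such as the prefix ratio, and the function name, `start`, and `end` are hypothetical.

```python
import math


def cosine_decay(step: int, total_steps: int,
                 start: float = 1.0, end: float = 0.0) -> float:
    """Decay a value from `start` to `end` along a half cosine over training."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress so steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))
```

The value changes slowly near the start and the end of training and fastest in the middle. A uniform baseline, by contrast, would presumably not depend on `step` at all.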
Original abstract

Existing LLM post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Prefix-RFT, a hybrid post-training method for LLMs that blends supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) by sampling prefixes from demonstration data. Using mathematical reasoning as the testbed, it claims Prefix-RFT outperforms standalone SFT and RFT as well as parallel mixed-policy RFT baselines, while remaining robust to variations in demonstration data quality and quantity; the work frames SFT and RFT as complementary and presents Prefix-RFT as a simple harmonization technique.

Significance. If the empirical gains are robust, the result would offer a practical, low-overhead way to combine imitation learning with exploration in LLM alignment, addressing known generalization issues in pure SFT and policy sensitivity in pure RFT. The emphasis on prefix sampling as a lightweight bridge between the two paradigms could influence hybrid fine-tuning designs, particularly for reasoning tasks where demonstration data is available but limited.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The robustness claim is supported only by ablations on demonstration data quality and quantity; no equivalent controls or sensitivity analysis are reported for the prefix length hyperparameter or the choice of sampling strategy (random vs. deterministic), which are load-bearing for the central harmonization argument and could explain the reported gains over baselines.
  2. [§3] §3 (Method): The unified view of SFT and RFT is presented at a high level, but the precise interaction between the prefix sampling distribution and the subsequent RFT objective is not formalized with an equation or pseudocode that would allow readers to verify whether Prefix-RFT reduces to a known mixed-policy objective or introduces a distinct bias.
  3. [Results tables] Table 2 or equivalent results table: Outperformance is reported against standalone SFT, RFT, and mixed-policy baselines, yet the manuscript does not indicate whether results are averaged over multiple random seeds or include statistical significance tests; without these, the magnitude of improvement cannot be assessed as reliable rather than run-specific.
minor comments (2)
  1. [§3] Notation for the prefix sampling procedure could be clarified with a small diagram or explicit probability expression to distinguish it from standard behavior cloning.
  2. [Ablation figures] The abstract claims robustness to variations in demonstration data, but the corresponding figures or tables should explicitly label the range of data quantities tested (e.g., 10%, 50%, 100% of demonstrations) for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions that strengthen the empirical and methodological presentation of Prefix-RFT.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The robustness claim is supported only by ablations on demonstration data quality and quantity; no equivalent controls or sensitivity analysis are reported for the prefix length hyperparameter or the choice of sampling strategy (random vs. deterministic), which are load-bearing for the central harmonization argument and could explain the reported gains over baselines.

    Authors: We agree that additional sensitivity analyses would further substantiate the robustness of Prefix-RFT. While our existing ablations target demonstration data quality and quantity because they are most relevant to practical deployment, we will expand §4 with new experiments that vary prefix length (e.g., 10–60% of sequence length) and compare random prefix sampling against deterministic alternatives. These results will be reported alongside the existing ablations to directly address the referee’s concern. revision: yes

  2. Referee: [§3] §3 (Method): The unified view of SFT and RFT is presented at a high level, but the precise interaction between the prefix sampling distribution and the subsequent RFT objective is not formalized with an equation or pseudocode that would allow readers to verify whether Prefix-RFT reduces to a known mixed-policy objective or introduces a distinct bias.

    Authors: We accept that a more precise formalization is needed. In the revised §3 we will add an explicit objective equation that decomposes the Prefix-RFT loss into an expectation over prefixes drawn from the demonstration distribution followed by the standard RFT objective on the generated suffix. We will also include pseudocode for the full training loop and a short discussion clarifying how the prefix-conditioning step introduces a distinct bias relative to standard mixed-policy baselines. revision: yes

  3. Referee: [Results tables] Table 2 or equivalent results table: Outperformance is reported against standalone SFT, RFT, and mixed-policy baselines, yet the manuscript does not indicate whether results are averaged over multiple random seeds or include statistical significance tests; without these, the magnitude of improvement cannot be assessed as reliable rather than run-specific.

    Authors: The reported numbers in Table 2 reflect single training runs, which is common under the computational budget of large-scale RFT. To improve reliability, we will re-execute the main comparisons across three random seeds, report means and standard deviations, and add paired statistical significance tests (e.g., t-tests) between Prefix-RFT and each baseline. The updated table and accompanying text will appear in the revised manuscript. revision: yes
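The seed-level reporting promised in the third response could be computed along the lines of the following Python sketch. It uses only the standard library; the function names are hypothetical, and nothing here reflects numbers or code from the paper.

```python
import math
from statistics import fmean, stdev
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed scores."""
    return fmean(scores), stdev(scores)


def paired_t_statistic(method: Sequence[float],
                       baseline: Sequence[float]) -> float:
    """Paired t statistic; entry i of both inputs must come from the same seed."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equal-length sequences with at least 2 seeds")
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = stdev(diffs)
    if spread == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return fmean(diffs) / (spread / math.sqrt(len(diffs)))
```

With three seeds the test has only two degrees of freedom, so the statistic must exceed roughly 4.30 in absolute value for two-sided significance at the 0.05 level. Small gains may therefore fail to reach significance even when they are consistent across seeds.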

Circularity Check

0 steps flagged

No circularity: empirical hybrid method validated against external baselines

Full rationale

The paper introduces Prefix-RFT as a practical blending of SFT and RFT via prefix sampling from demonstrations, then evaluates it through direct performance comparisons on mathematical reasoning benchmarks. No load-bearing mathematical derivation, uniqueness theorem, or fitted-parameter prediction is present; the central claims rest on experimental outperformance relative to standalone SFT, RFT, and mixed-policy baselines. Ablations address data quality and quantity but do not create self-referential loops. The work is self-contained against external benchmarks with no reduction of results to quantities defined inside its own equations or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the central claim rests on the domain assumption that mathematical reasoning problems form a suitable test bed for evaluating fine-tuning methods and that demonstration data of varying quality can be used directly for prefix sampling. No explicit free parameters or invented entities are described.

axioms (1)
  • domain assumption: Mathematical reasoning problems serve as a representative test bed for comparing SFT, RFT, and hybrid methods.
    Invoked when the authors state they use math problems to empirically demonstrate the method's effectiveness.



Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  3. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 7.0

    AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...

  4. Near-Future Policy Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...

  5. SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.

  6. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  7. Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

    cs.LG 2026-03 unverdicted novelty 6.0

    HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.

  8. Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    cs.LG 2026-05 unverdicted novelty 5.0

    FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.

  9. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 5.0

    Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...

  10. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 5.0

    Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.

  11. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  12. A Survey of Reinforcement Learning for Large Reasoning Models

    cs.CL 2025-09 accept novelty 3.0

    A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 9 Pith papers · 21 internal anchors

  1. [1]

    Efficient online reinforcement learning with offline data

    Philip J. Ball, Laura M. Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 1577-1594. PMLR, 2023. URL https://proceedings.mlr.press/v202/ball23a.html

  2. [2]

    How much backtracking is enough? exploring the interplay of sft and rl in enhancing llm reasoning

    Hongyi James Cai, Junlin Wang, Xiaoyin Chen, and Bhuwan Dhingra. How much backtracking is enough? exploring the interplay of sft and rl in enhancing llm reasoning. arXiv preprint arXiv:2505.24273, 2025

  3. [3]

    SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. SFT or RL? An early investigation into training R1-like reasoning large vision-language models. CoRR, abs/2504.11468, 2025a. URL https://doi.org/10.48550/arXiv.2504.11468

  4. [4]

    Step-wise adaptive integration of supervised fine-tuning and reinforcement learning for task-specific llms

    Jack Chen, Fazhong Liu, Naruto Liu, Yuhan Luo, Erqu Qin, Harry Zheng, Tian Dong, Haojin Zhu, Yan Meng, and Xiao Wang. Step-wise adaptive integration of supervised fine-tuning and reinforcement learning for task-specific llms. arXiv preprint arXiv:2505.13026, 2025 b

  5. [5]

    Revisiting reinforcement learning for llm reasoning from a cross-domain perspective

    Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, et al. Revisiting reinforcement learning for llm reasoning from a cross-domain perspective. arXiv preprint arXiv:2506.14965, 2025

  6. [7]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025 b

  7. [8]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457

  8. [9]

    Open r1: A fully open reproduction of deepseek-r1, January 2025

    Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

  9. [10]

    Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

  10. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  11. [12]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for C...

  12. [13]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://da...

  13. [14]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025

  14. [15]

    Opencoder: The open cookbook for top-tier code large language models

    Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J Yang, JH Liu, Chenchen Zhang, Linzheng Chai, et al. Opencoder: The open cookbook for top-tier code large language models. arXiv preprint arXiv:2411.04905, 2024

  15. [16]

    Post-hoc reward calibration: A case study on length bias

    Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M. Ponti, and Ivan Titov. Post-hoc reward calibration: A case study on length bias. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=Iu8RytBaji

  16. [17]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  17. [18]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  18. [19]

    Batch reinforcement learning

    Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. In Reinforcement Learning, volume 12 of Adaptation, Learning, and Optimization, pp. 45-73. Springer, 2012. URL https://doi.org/10.1007/978-3-642-27645-3_2

  19. [20]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR, abs/2005.01643

  20. [21]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems 35: Annual Conferenc...

  21. [22]

    NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. https://huggingface.co/datasets/Numinamath, 2024. Hugging Face repository, 13:9

  22. [23]

    Code-r1: Reproducing r1 for code with reliable rewards

    Jiawei Liu and Lingming Zhang. Code-r1: Reproducing r1 for code with reliable rewards. 2025

  23. [24]

    ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025 a

  24. [25]

    UFT: Unifying supervised and reinforcement fine-tuning

    Mingyang Liu, Gabriele Farina, and Asuman Ozdaglar. Uft: Unifying supervised and reinforcement fine-tuning. arXiv preprint arXiv:2505.16984, 2025 b

  25. [26]

    Xuefeng Liu, Hung T. C. Le, Siyu Chen, Rick Stevens, Zhuoran Yang, Matthew R. Walter, and Yuxin Chen. Active advantage-aligned online reinforcement learning with offline data. CoRR, abs/2502.07937, 2025 c . URL https://doi.org/10.48550/arXiv.2502.07937

  26. [27]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. CoRR, abs/2503.20783, 2025 d . URL https://doi.org/10.48550/arXiv.2503.20783

  27. [28]

    Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions

    Yicheng Luo, Jackie Kay, Edward Grefenstette, and Marc Peter Deisenroth. Finetuning from offline reinforcement learning: Challenges, trade-offs and practical solutions. CoRR, abs/2303.17396, 2023. URL https://doi.org/10.48550/arXiv.2303.17396

  28. [29]

    Learning what reinforcement learning can't: Interleaved online fine-tuning for hardest questions

    Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, et al. Learning what reinforcement learning can't: Interleaved online fine-tuning for hardest questions. arXiv preprint arXiv:2506.07527, 2025

  29. [30]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4 . CoRR, abs/2304.03277, 2023. URL https://doi.org/10.48550/arXiv.2304.03277

  30. [31]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q & a benchmark. CoRR, abs/2311.12022, 2023. URL https://doi.org/10.48550/arXiv.2311.12022

  31. [32]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  32. [33]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  33. [34]

    Hybrid RL: using both offline and online data can make RL efficient

    Yuda Song, Yifei Zhou, Ayush Sekhari, Drew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. URL https://openreview.net/forum?id=yyBis80iUuU

  34. [35]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

  35. [36]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025 a

  36. [37]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems 38: Annu...

  37. [38]

    Octothinker: Mid-training incentivizes reinforcement learning scaling

    Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu. Octothinker: Mid-training incentivizes reinforcement learning scaling. arXiv preprint arXiv:2506.20512, 2025 b

  38. [39]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022

  39. [40]

    On memorization of large language models in logical reasoning

    Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. On memorization of large language models in logical reasoning. CoRR, abs/2410.23123, 2024. URL https://doi.org/10.48550/arXiv.2410.23123

  40. [41]

    Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768, 2025

  41. [42]

    KodCode: A diverse, challenging, and verifiable synthetic dataset for coding

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A diverse, challenging, and verifiable synthetic dataset for coding. 2025. URL https://arxiv.org/abs/2503.02951

  42. [43]

    Learning to Reason under Off-Policy Guidance

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. CoRR, abs/2504.14945, 2025

  43. [44]

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  44. [45]

    What's behind ppo's collapse in long-cot? value optimization holds the secret

    Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What's behind ppo's collapse in long-cot? value optimization holds the secret. arXiv preprint arXiv:2503.01491, 2025

  45. [46]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025a

  46. [47]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? CoRR, abs/2504.13837, 2025b. URL https://doi.org/10.48550/arXiv.2504.13837

  47. [48]

    7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient

    Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. https://hkust-nlp.notion.site/simplerl-reason, 2025. Notion Blog

  48. [49]

    Echo chamber: Rl post-training amplifies behaviors learned in pretraining

    Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: Rl post-training amplifies behaviors learned in pretraining. arXiv preprint arXiv:2504.07912, 2025

  49. [50]

    Cheating automatic LLM benchmarks: Null models achieve high win rates

    Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. Cheating automatic LLM benchmarks: Null models achieve high win rates. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=syThiTmWWm

  50. [51]

    Judgelm: Fine-tuned large language models are scalable judges

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. Judgelm: Fine-tuned large language models are scalable judges. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=xsELpEPn4A
