A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning

Anthony Man-Cho So; Lei Zhao; Mengqi Li; Ruoyu Sun; Xiao Li

arXiv: 2510.18814 · v3 · pith:XP5QRJWOnew · submitted 2025-10-21 · cs.LG · cs.AI

Pith reviewed 2026-05-21 20:01 UTC · model grok-4.3

classification: cs.LG, cs.AI

keywords: self-training, LLM reasoning, reward-free, math benchmarks, post-training, self-generated data, online data refresh

The pith

Language models can improve reasoning by training on responses they generate themselves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether language models can raise their reasoning performance without external rewards or human labels, using only their own sampled outputs for training. It presents Self-evolving Post-Training (SePT), which runs repeated cycles of sampling questions, generating responses at a specified sampling temperature, and fine-tuning the model on those responses. An online data refresh ensures that every new training batch is produced by the most recently updated version of the model. Across six math reasoning benchmarks, SePT improves on a strong no-training baseline, the untuned base model evaluated at its best swept decoding temperature, on several tested models. Ablations indicate that the online data refresh and the temperature dynamics are important drivers of the observed improvements.
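To make the role of the sampling temperature concrete, here is a minimal sketch of how a temperature reshapes the probabilities a model samples from. The function name and the logits are illustrative assumptions and are not taken from the paper or its code release.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate tokens (assumption, for illustration only).
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: mass concentrates
flat = softmax_with_temperature(logits, 2.0)   # high temperature: mass spreads out
print(sharp, flat)
```

A lower temperature concentrates probability on the highest-scoring option, while a higher one spreads it across alternatives, which is why the choice of temperature changes what data the loop trains on.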

Core claim

Self-evolving Post-Training enables a model to improve its reasoning by alternating between self-generation of responses at a chosen sampling temperature and training on the resulting data, with each batch refreshed online from the latest model version. This process produces accuracy gains on math reasoning tasks relative to the untuned base model at its optimal decoding temperature, demonstrating that self-generated supervision alone can support capability improvement in a practical regime.
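One way to write the update described here, assuming a standard negative log-likelihood (supervised fine-tuning) loss on self-generated responses. The notation is an assumption made for illustration: $\mathcal{D}$ is the question set, $\pi_\theta$ the model, $\tau$ the sampling temperature, and $\eta$ the learning rate. The paper's exact objective may differ.

```latex
\mathcal{L}_t(\theta)
  = -\,\mathbb{E}_{q \sim \mathcal{D},\; o \sim \pi^{\tau}_{\theta_t}(\cdot \mid q)}
    \big[ \log \pi_\theta(o \mid q) \big],
\qquad
\theta_{t+1} = \theta_t - \eta \,\nabla_\theta \mathcal{L}_t(\theta)\Big|_{\theta = \theta_t}
```

The subscript $t$ on the sampling distribution is what the online refresh amounts to in this notation: responses for step $t$ are drawn from the parameters $\theta_t$ that are about to be updated, not from the initial $\theta_0$.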

What carries the argument

The iterative self-training loop with online data refresh, in which each training batch is generated by the most recent model version after each update.
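The loop can be sketched structurally as follows. This is a toy under stated assumptions, not the authors' implementation: `self_training_loop`, `toy_generate`, and `toy_train_step` are hypothetical names, and the "model" is a plain dictionary whose version number stands in for updated weights.

```python
import random

def self_training_loop(model, questions, generate, train_step,
                       steps, batch_size, temperature, seed=0):
    """Alternate self-generation and training, regenerating data every step."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch_questions = rng.sample(questions, batch_size)
        # Online refresh: responses come from the current model, not a stale snapshot.
        batch = [(q, generate(model, q, temperature, rng)) for q in batch_questions]
        model = train_step(model, batch)
    return model

def toy_generate(model, question, temperature, rng):
    # Stand-in for sampling a response; tags it with the generating model version.
    return {"text": f"response to {question}", "generated_by": model["version"]}

def toy_train_step(model, batch):
    # Stand-in for one fine-tuning update; returns a new model version.
    sources = {response["generated_by"] for _, response in batch}
    return {"version": model["version"] + 1,
            "history": model["history"] + [sorted(sources)]}

final = self_training_loop(
    model={"version": 0, "history": []},
    questions=["q1", "q2", "q3", "q4"],
    generate=toy_generate, train_step=toy_train_step,
    steps=3, batch_size=2, temperature=1.0,
)
print(final["history"])  # [[0], [1], [2]]
```

The recorded history shows that each batch was produced by the version that had just been updated. An offline variant would generate all data once from version 0 and would log `[[0], [0], [0]]` instead.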

If this is right

On several tested models, accuracy on six math reasoning benchmarks rises above the untuned base model at its best swept decoding temperature.
Ablations indicate that the online data refresh step is an important contributor to the performance gains.
Sampling temperature choices affect how well the self-training loop works.
Reasoning gains are possible when supervision comes entirely from the model's own outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the assumption of reliable self-generated signals holds outside mathematics, the same loop could support self-improvement on other reasoning or generation tasks.
Periodic self-training during deployment could allow models to keep adapting without new external data.
Repeated cycles might produce compounding gains until the model reaches a performance plateau determined by its architecture.

Load-bearing premise

Self-generated responses produced under the chosen sampling temperature contain sufficiently reliable reasoning signals to drive genuine capability improvement rather than merely reinforcing existing patterns or errors.

What would settle it

Applying SePT to one of the tested models and finding that accuracy on the math benchmarks does not rise above the untuned base model at its best decoding temperature would falsify the central claim.
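The comparison behind this test can be sketched as below. The helper names are hypothetical, and the accuracies are placeholder numbers chosen only to exercise the code; they are not results from the paper.

```python
def best_swept_baseline(accuracy_by_temperature):
    """Return (temperature, accuracy) of the strongest no-training setting."""
    return max(accuracy_by_temperature.items(), key=lambda item: item[1])

def claim_survives(trained_accuracy, accuracy_by_temperature):
    """True if the trained model beats the base model at its best temperature."""
    _, baseline_accuracy = best_swept_baseline(accuracy_by_temperature)
    return trained_accuracy > baseline_accuracy

# Placeholder sweep for an untuned base model (assumed values, not measurements).
placeholder_sweep = {0.2: 0.40, 0.6: 0.45, 1.0: 0.38}
print(best_swept_baseline(placeholder_sweep))
print(claim_survives(0.47, placeholder_sweep))
```

Comparing against the maximum over the sweep, instead of a single default temperature, is what makes the baseline strong: a gain that only appears against a badly chosen decoding temperature would not pass this check.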

Figures

Figures reproduced from arXiv: 2510.18814 by Anthony Man-Cho So, Lei Zhao, Mengqi Li, Ruoyu Sun, Xiao Li.

**Figure 2.** Given the same question q, the base model generates the reasoning steps [A, B] with B being the wrong response (highlighted in light red, picked from one of all 8 wrong tries), while the OSFT model generates the path [Â, B̂] with B̂ containing the correct response (highlighted in light blue). It can be seen that OSFT facilitates the base model’s existing preference obtained from pretraining, which large…

**Figure 3.** PPL of models trained using OSFT and RLVR (GRPO, DAPO, and Dr. GRPO), where …

**Figure 4.** Performance of OSFT and RLVR (GRPO) on Qwen2.5-Math 1.5B (dashed lines) and …

**Figure 5.** Performance of OSFT and RLVR (GRPO) on the Qwen2.5-7B base model across six math …

**Figure 6.** Performance impact of training data source. Peak scores (within 300 steps) are compared for models trained on DeepScaleR (blue baseline) versus Openthoughts math-only (orange). Percentages show the performance change from using OpenthoughtsMath. To evaluate the impact of the training data scope, we substitute the default dataset DeepScaleR with the Openthoughts (Guha et al., 2025) math-only (Openthoughts…

**Figure 7.** Ablation study on the decoupled temperature dynamics in OSFT. The figure illustrates …

**Figure 8.** Ablation study on the number of self-generated samples ( …

**Figure 9.** Ablation study of evaluation temperature …

**Figure 10.** Chat template, including special tokens, for the Qwen-2.5 and Llama-3.1 series. The …

**Figure 11.** Performance comparison of OSFT against RLVR (GRPO, DAPO, and Dr. GRPO) on …

**Figure 12.** Empirical validation for the choice of a higher sampling temperature ( …

**Figure 13.** The iterative workflow of OSFT. The model alternates between generating its own train…

**Figure 14.** Full question and the incorrect response generated by the base model, corresponding to …

**Figure 15.** Full question and the correct response generated by the OSFT model, corresponding to …

Original abstract

Can language models improve their reasoning performance without external rewards, using only their own sampled responses for training? We show that they can. We propose Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on self-generated responses. It repeatedly samples questions, uses the model itself to generate responses under a specified sampling temperature, and then trains the model on the self-generated data. In this self-training loop, we use an online data refresh mechanism, where each new batch is generated by the most recently updated model. Across six math reasoning benchmarks, SePT improves a strong no-training baseline, defined as the untuned base model evaluated at its best swept decoding temperature, on several tested models. Additional ablations demonstrate the importance of online data refresh and temperature dynamics. Overall, our results identify a practical regime where reasoning can be improved using self-generated supervision alone. Our code is available at https://github.com/ElementQi/SePT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SePT shows benchmark gains from training on self-generated responses with online refresh, but the lack of any verification step makes it unclear if reasoning actually improves or if errors just get reinforced.

The main thing here is that SePT runs a loop of sampling responses from the current model and training directly on them, with an online data refresh so each batch comes from the latest version, plus a specified sampling temperature. It reports lifts over a no-training baseline (the base model at its best swept temperature) across six math reasoning benchmarks on several tested models. The ablations point to the refresh and temperature choices as useful ingredients. The code release is a plus for anyone who wants to try it out quickly.

What stands out as new is the concrete combination of those two mechanisms inside a reward-free self-training setup, as opposed to a one-shot self-distillation pass. The paper keeps the method simple and focuses on practical post-training for reasoning.

The soft spot is the data quality assumption. Nothing in the procedure filters or checks the sampled responses for correctness or coherent reasoning steps. If a sizable fraction of them contain mistakes, the updates could entrench those patterns instead of correcting them, so the benchmark gains might reflect longer outputs, distributional shifts, or calibration changes more than real capability growth. The abstract reports improvements and helpful ablations, but without error bars, exact numbers, or stronger baseline details it is hard to judge how robust the effect is.

This paper is for researchers working on post-training methods that try to reduce dependence on external rewards or labeled data. Someone testing self-improvement ideas for math reasoning would find the setup and the refresh ablation worth looking at. It deserves peer review because the core experiment is straightforward, the results are positive enough to be worth checking, and referees can push for the missing controls on data quality and statistical reliability.

Referee Report

2 major / 1 minor

Summary. The paper proposes Self-evolving Post-Training (SePT), a reward-free method that alternates between sampling responses from the model itself at a chosen temperature and training on those self-generated responses, using an online data refresh in which each batch comes from the latest model version. It claims improvements over a strong no-training baseline (the untuned base model at its best swept decoding temperature) across six math reasoning benchmarks on several tested models, with ablations showing the value of online data refresh and temperature dynamics.

Significance. If the result holds, the work demonstrates a practical regime for LLM reasoning improvement using only self-generated supervision without external rewards, verification, or additional data, which could inform scalable post-training approaches. The public code release at https://github.com/ElementQi/SePT is a strength that supports reproducibility.

major comments (2)

[Method / procedure description] The method description (implicit in the abstract and procedure outline) samples responses and trains directly on them with only online refresh and temperature scheduling; no filtering, correctness verification, or analysis of error rates in the self-generated data is described. This assumption is load-bearing for the central claim, as flawed reasoning in a non-trivial fraction of responses could entrench errors rather than yield capability gains, and the reported ablations address only refresh and temperature but not data quality.
[Experiments] The experiments section reports improvements but provides no exact metrics, error bars, full baseline comparisons, or per-benchmark breakdowns in the abstract-level summary; without these, it is not possible to assess whether gains reflect genuine reasoning improvement or artifacts such as output length or calibration shifts.
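The error bars requested in the second comment are usually a mean and sample standard deviation over repeated runs. A minimal sketch, assuming per-seed accuracies are available as a list; `summarize_runs` is a hypothetical helper and the input values are placeholders, not numbers from the paper.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated runs."""
    mean = statistics.mean(accuracies)
    # stdev needs at least two data points; report 0.0 for a single run.
    spread = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, spread

# Placeholder accuracies from three hypothetical seeds.
mean, spread = summarize_runs([0.5, 0.6, 0.7])
print(f"{mean:.3f} ± {spread:.3f}")
```

Reporting the spread alongside the mean lets a reader judge whether a gain over the baseline exceeds run-to-run variation.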

minor comments (1)

[Abstract] The abstract states improvements occur 'on several tested models' but does not name the models or quantify the gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on clarifying the method's assumptions and strengthening the experimental reporting. Below we provide point-by-point responses to the major comments and describe the revisions we will incorporate.

Referee: [Method / procedure description] The method description (implicit in the abstract and procedure outline) samples responses and trains directly on them with only online refresh and temperature scheduling; no filtering, correctness verification, or analysis of error rates in the self-generated data is described. This assumption is load-bearing for the central claim, as flawed reasoning in a non-trivial fraction of responses could entrench errors rather than yield capability gains, and the reported ablations address only refresh and temperature but not data quality.

Authors: We agree that the absence of explicit filtering or verification is central to the SePT approach, which deliberately operates without external rewards or correctness checks. The online data refresh is intended to allow gradual improvement as the model generates higher-quality responses over iterations. While the current ablations focus on refresh and temperature, we acknowledge that a direct examination of data quality would strengthen the claims. In the revised manuscript we will add a dedicated subsection that reports estimated error rates on a verifiable subset of questions and tracks how the fraction of correct self-generated responses evolves across training steps. This addition will directly address the concern about potential error entrenchment. revision: yes
Referee: [Experiments] The experiments section reports improvements but provides no exact metrics, error bars, full baseline comparisons, or per-benchmark breakdowns in the abstract-level summary; without these, it is not possible to assess whether gains reflect genuine reasoning improvement or artifacts such as output length or calibration shifts.

Authors: The full experiments section already contains per-benchmark accuracy tables, comparisons against the untuned baseline at its optimal temperature, and results across multiple models. Standard deviations from repeated runs are reported. To make these details more accessible, we will update the abstract to include the main quantitative gains and add a short paragraph in the experiments section that explicitly rules out output-length inflation and calibration shifts as explanations for the observed improvements. These changes will allow readers to evaluate the results more readily without altering the core findings. revision: yes
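The data-quality analysis promised in the first response could be computed along these lines. This is a sketch under assumptions: `correct_fraction_by_step` is a hypothetical helper, correctness is judged by exact string match on a final answer, and the reference answers are used only for measurement, never for training.

```python
from collections import defaultdict

def correct_fraction_by_step(records):
    """records: iterable of (step, model_answer, reference_answer) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for step, model_answer, reference_answer in records:
        totals[step] += 1
        hits[step] += int(model_answer.strip() == reference_answer.strip())
    return {step: hits[step] / totals[step] for step in sorted(totals)}

# Placeholder log of self-generated answers on a verifiable subset.
log = [(0, "4", "4"), (0, "5", "4"), (1, "4", "4"), (1, " 4", "4")]
print(correct_fraction_by_step(log))  # {0: 0.5, 1: 1.0}
```

A curve that rises across steps would support the authors' account of gradual improvement; a flat or falling curve alongside rising benchmark scores would point toward the artifacts the referee raises.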

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical procedure (SePT) that alternates sampling responses from the current model and training on those responses with online refresh and temperature scheduling. Reported gains are measured performance differences on six external math reasoning benchmarks against an explicitly defined no-training baseline (untuned base model at its best swept decoding temperature). No equation or claim reduces by construction to a fitted input, self-definition, or self-citation chain; the central result remains an observed quantity on independent test sets rather than a tautological restatement of the training loop itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that self-generated data can serve as useful supervision; no new mathematical axioms or invented physical entities are introduced.

axioms (1)

Domain assumption: self-generated responses under controlled temperature contain net-positive reasoning signals.
Invoked throughout the description of the self-training loop; if false, the observed gains would not occur.

pith-pipeline@v0.9.0 · 5709 in / 1150 out tokens · 39619 ms · 2026-05-21T20:01:02.380888+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL · 2026-05 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
cs.LG · 2026-05 · unverdicted · novelty 6.0

The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 2 Pith papers · 24 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[2]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in LLM reasoning. arXiv preprint arXiv:2505.15134.

[3]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

[4]

Reasoning Models Don't Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410.

[5]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.

[6]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[7]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178.

[8]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.

[9]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

[10]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025a. Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to...

[11]

LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters!

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G Patil, Matei Zaharia, et al. LLMs can easily learn to reason from demonstrations: structure, not content, is what matters! arXiv preprint arXiv:2502.07374.

[12]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783.

[13]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time s...

[14]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.

[15]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[16]

Spurious Rewards: Rethinking Training Signals in RLVR

Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, and Luke Zettlemoyer. Spurious rewards: Rethinking training signals in RLVR. arXiv preprint arXiv:2506.10947.

[17]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[18]

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941.

[19]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419.

[20]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.01939, 2025a. Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Ch...

[21]

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-R1: Curriculum SFT, DPO and RL for long COT from scratch and beyond. arXiv preprint arXiv:2503.10460.

[22]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

[23]

LIMO: Less is More for Reasoning

Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. LIMO: Less is more for reasoning. arXiv preprint arXiv:2502.03387.

[24]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[25]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837.

[26]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892.

[27]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.

[28]

Contents

1 Introduction; 2 Preliminaries; 2.1 Language Models; 2.2 Supervised Finetuning (SFT); 2.3 Reinforcement Learning with Verifiable Rewards; 3 The OSFT Self-Tuning Para...

[29]

We also provide ablation study for τ_eval in Section 4.3.3

By using τ_eval = 1 as our default, we aim to provide a direct assessment of the model’s original capabilities as learned during training, without post-hoc optimization of decoding parameters. We also provide ablation study for τ_eval in Section 4.3.3. Pass@k Metric. The direct estimation of pass@k by generating only k samples per problem can result in high v...

[30]

GRPO) is similar under our experimental conditions

A key observation from these results is that the performance of GRPO and its variants (DAPO, Dr. GRPO) is similar under our experimental conditions. The learning trajectories for these three RL-based methods are nearly indistinguishable across the benchmarks. Our simple, reward-free OSFT paradigm demonstrates a comparable performance. Its learning curves ...

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effec- tiveness of entropy minimization in llm reasoning.arXiv preprint arXiv:2505.15134,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

S´ebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Ka- mar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4.arXiv preprint arXiv:2303.12712,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Reasoning Models Don't Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don’t always say what they think.arXiv preprint arXiv:2505.05410,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

10 Preprint. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

OpenThoughts: Data Recipes for Reasoning Models

URLhttps: //github.com/huggingface/open-r1. Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reason- ing models.arXiv preprint arXiv:2506.04178,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual mul- timodal scientific problems.arXiv preprint arXiv:2402.14008,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290, 2025a. Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to...

[11]

LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters!

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G Patil, Matei Zaharia, et al. LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters! arXiv preprint arXiv:2502.07374,

[12]

Understanding R1-Zero-Like Training: A Critical Perspective

URL https://github.com/project-numina/aimo-progress-prize.

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783,

[13]

s1: Simple test-time scaling

URL https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2. Notion Blog.

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time s...

[14]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

URL https://github.com/Jiayi-Pan/TinyZero. Accessed: 2025-01-24.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741,

[15]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

[16]

Spurious Rewards: Rethinking Training Signals in RLVR

Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, and Luke Zettlemoyer. Spurious rewards: Rethinking training signals in RLVR. arXiv preprint arXiv:2506.10947,

[17]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[18]

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941,

[19]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419,

[20]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.01939, 2025a.

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Ch...

[21]

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-R1: Curriculum SFT, DPO and RL for long COT from scratch and beyond. arXiv preprint arXiv:2503.10460,

[22]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

[23]

LIMO: Less is More for Reasoning

Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. LIMO: Less is more for reasoning. arXiv preprint arXiv:2502.03387,

[24]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

[25]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837,

[26]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892,

[27]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,

[28]

CONTENTS

1 Introduction
2 Preliminaries
2.1 Language Models
2.2 Supervised Finetuning (SFT)
2.3 Reinforcement Learning with Verifiable Rewards
3 The OSFT Self-Tuning Para...

[29]

We also provide an ablation study for τ_eval in Section 4.3.3

By using τ_eval = 1 as our default, we aim to provide a direct assessment of the model’s original capabilities as learned during training, without post-hoc optimization of decoding parameters. We also provide an ablation study for τ_eval in Section 4.3.3.

Pass@k Metric. The direct estimation of pass@k by generating only k samples per problem can result in high v...
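The excerpt above is cut off before it names the estimator used. As a minimal sketch, assuming the widely used unbiased pass@k estimator (draw n ≥ k samples per problem, count the c correct ones, and compute 1 − C(n−c, k)/C(n, k)), the calculation could look like the following. The function name `pass_at_k` and the arguments `n`, `c`, `k` are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated for the problem (assumed n >= k)
    c: number of those samples judged correct
    k: budget of attempts being evaluated
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples
    # contains no correct sample is C(n - c, k) / C(n, k).
    # math.comb returns 0 when k > n - c, which gives pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Averaging over all subsets of size k, rather than scoring a single draw of k samples, is what lowers the variance relative to the direct estimate described in the text.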

[30]

The performance of GRPO and its variants (DAPO, Dr. GRPO) is similar under our experimental conditions

A key observation from these results is that the performance of GRPO and its variants (DAPO, Dr. GRPO) is similar under our experimental conditions. The learning trajectories for these three RL-based methods are nearly indistinguishable across the benchmarks. Our simple, reward-free OSFT paradigm demonstrates comparable performance. Its learning curves ...
