pith. machine review for the scientific record.

arxiv: 2504.20571 · v3 · submitted 2025-04-29 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 1 theorem link

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:47 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning · large language models · math reasoning · one-shot learning · verifiable reward · policy gradient · generalization · MATH benchmark

The pith

Reinforcement learning on a single training example lifts an LLM's math-reasoning accuracy on MATH500 from 36.0% to 73.6%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning with verifiable reward using a single training example can substantially strengthen mathematical reasoning in large language models. Applied to the base model Qwen2.5-Math-1.5B, this 1-shot RLVR approach raises accuracy on the MATH500 benchmark from 36.0% to 73.6% and improves average performance across six common math reasoning benchmarks from 17.6% to 35.7%. These results match the performance obtained by training on a 1.2k-example subset that contains the same single example, and similar gains appear with two examples or across other base models and algorithms. The improvements stem mainly from the policy gradient loss rather than incidental training effects, and the process produces cross-category generalization plus continued test gains after training accuracy has plateaued.

Core claim

Reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models. Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% (8.6% improvement beyond format correction), and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7% (7.0% non-format gain). This result matches the performance obtained using the 1.2k DeepScaleR subset, and RLVR with only two examples even slightly exceeds these results. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples.

What carries the argument

1-shot RLVR: reinforcement learning with verifiable reward applied to a single training example, using policy-gradient updates to reinforce correct reasoning trajectories while promoting exploration through an entropy loss term.
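
The objective behind that sentence, as the paper's Eq. 3 states it (reconstructed here from a garbled span in the extracted reference list below), is the GRPO policy-gradient loss plus KL and entropy regularizers:

    \mathcal{L}(\cdot, \theta) \;=\; \mathcal{L}_{\mathrm{PG\text{-}GRPO}}(\cdot, \theta) \;+\; \beta\,\mathcal{L}_{\mathrm{KL}}(\cdot, \theta, \theta_{\mathrm{ref}}) \;+\; \alpha\,\mathcal{L}_{\mathrm{Entropy}}(\cdot, \theta), \qquad \beta > 0, \quad \alpha < 0

The sign convention is the notable part: with α negative and the loss minimized, the entropy term acts as a bonus, which is the exploration role the paper assigns it.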

If this is right

  • Performance with one example matches results from a 1.2k-example training set on both MATH500 and the six-benchmark average.
  • Two examples produce slightly higher scores than one example across the same benchmarks.
  • The method yields consistent gains when applied to other base models such as Qwen2.5-Math-7B and Llama3.2-3B-Instruct, and with both GRPO and PPO algorithms (see the update sketch after this list).
  • Training produces cross-category generalization and an increase in self-reflection behavior in the model's outputs.
  • Test performance continues to improve even after training accuracy saturates, a pattern termed post-saturation generalization.
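
To make the mechanism concrete, here is a minimal sketch of one GRPO-style 1-shot RLVR update, assuming a HuggingFace-style causal LM interface. Everything below (function names, group size, the entropy coefficient, the reward helper) is illustrative rather than the paper's released code, and padding/masking details are omitted; the paper's full objective also carries the KL term of Eq. 3 above, dropped here for brevity.

    import torch

    def one_shot_rlvr_step(model, tokenizer, prompt, gold_answer, optimizer,
                           group_size=8, ent_coef=0.01):
        """One GRPO-style policy-gradient update on a single training example."""
        enc = tokenizer([prompt] * group_size, return_tensors="pt", padding=True)
        prompt_len = enc["input_ids"].shape[1]
        # Sample a group of rollouts from the current policy.
        with torch.no_grad():
            seqs = model.generate(**enc, do_sample=True, max_new_tokens=1024)
        completions = tokenizer.batch_decode(seqs[:, prompt_len:],
                                             skip_special_tokens=True)
        # Verifiable reward: 1.0 if the final answer checks out, else 0.0
        # (see the reward sketch further down the page).
        rewards = torch.tensor([verifiable_reward(c, gold_answer) for c in completions])
        # GRPO advantage: normalize rewards within the sampled group.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        # Re-score the sampled sequences under the policy to get token log-probs.
        logprobs = torch.log_softmax(model(seqs).logits[:, :-1], dim=-1)
        token_lp = logprobs.gather(-1, seqs[:, 1:].unsqueeze(-1)).squeeze(-1)
        completion_lp = token_lp[:, prompt_len - 1:].sum(-1)
        # Policy-gradient loss plus an entropy bonus to sustain exploration.
        entropy = -(logprobs.exp() * logprobs).sum(-1).mean()
        loss = -(adv * completion_lp).mean() - ent_coef * entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), rewards.mean().item()

Because the single prompt is fixed, the only stochasticity is in the rollouts; post-saturation generalization is the observation that test accuracy keeps moving even after these rollouts are almost always correct.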

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If one well-chosen example suffices, the data volume required for effective RL fine-tuning on reasoning tasks could be reduced by orders of magnitude.
  • The post-saturation generalization effect suggests that RL may continue refining internal solution strategies beyond what accuracy metrics capture during training.
  • Similar one-shot RLVR might be testable on other domains with verifiable outcomes, such as code generation or symbolic manipulation.
  • The distinction from grokking implies that future work can focus on policy-gradient dynamics rather than memorization-like phenomena when studying minimal-data RL.

Load-bearing premise

The single chosen example is not specially selected to inflate results, and the observed gains arise specifically from the reinforcement learning policy gradient rather than from prompt format, training setup, or other incidental factors.
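
The "verifiable reward" in that premise is mechanically simple, which is what makes the premise checkable. A minimal stand-in, assuming MATH-style answers reported in a final \boxed{...} span (the paper's actual matcher may normalize expressions more aggressively):

    import re

    def verifiable_reward(completion: str, gold_answer: str) -> float:
        """Return 1.0 iff the last \\boxed{...} span matches the gold answer."""
        # Note: this simple pattern does not handle nested braces inside \boxed{...}.
        matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        if not matches:
            return 0.0  # no parseable answer: a format failure earns zero reward
        return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

Because a malformed answer and a wrong answer both score zero, part of any early gain is the model learning to emit a parseable final answer at all; that is the format correction the headline figures are adjusted for (the 8.6% non-format gain).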

What would settle it

Training with a randomly selected single math example instead of the identified one and finding no comparable lift on MATH500 or the other benchmarks would show that the gains depend on special selection rather than the general 1-shot RLVR mechanism.
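
In code, the settling experiment is a small loop over fresh copies of the base model; all helpers here (load_math_train, train_one_shot, eval_math500) are hypothetical stand-ins for the paper's pipeline, with train_one_shot understood as repeated one_shot_rlvr_step updates from the sketch above:

    import random

    def compare_one_shot_seeds(model_factory, identified_example, n_controls=5, seed=0):
        """Train fresh base-model copies on the identified example vs. random ones."""
        rng = random.Random(seed)
        controls = rng.sample(load_math_train(), n_controls)  # arbitrary single examples
        runs = [("identified", identified_example)]
        runs += [(f"control_{i}", ex) for i, ex in enumerate(controls)]
        results = {}
        for name, example in runs:
            model = model_factory()          # fresh copy of the base model each run
            train_one_shot(model, example)   # 1-shot RLVR run to convergence
            results[name] = eval_math500(model)  # held-out accuracy
        return results

Comparable lifts from the random controls would favor the general mechanism; a lift unique to the identified example would point to selection.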

read the original abstract

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% (8.6% improvement beyond format correction), and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7% (7.0% non-format gain). This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which contains the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples. In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. All resources are open source at https://github.com/ypwang61/One-Shot-RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that reinforcement learning with verifiable reward (RLVR) using only one training example (1-shot RLVR) can substantially improve mathematical reasoning in LLMs. Applying it to Qwen2.5-Math-1.5B raises MATH500 accuracy from 36.0% to 73.6% (8.6% non-format gain) and average performance across six benchmarks from 17.6% to 35.7% (7.0% non-format gain), matching results from the 1.2k-example DeepScaleR subset that contains the example. Comparable gains hold across models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), algorithms (GRPO, PPO), and multiple math examples. The work also reports cross-category generalization, increased self-reflection, post-saturation generalization, the primacy of policy-gradient loss over grokking, and the necessity of entropy regularization for exploration.

Significance. If the central result holds, the finding is significant because it shows that RLVR for reasoning can be effective with minimal supervision, achieving parity with much larger datasets. The multi-model, multi-algorithm validation, open-source code, and explicit separation of policy-gradient effects from incidental training artifacts strengthen the contribution and invite re-examination of data-efficiency assumptions in recent RLVR literature.

major comments (2)
  1. [Experimental setup and results sections describing example selection and 1-shot RLVR runs] The manuscript states that it 'identifies' a single example yielding the headline gains (36.0% → 73.6% on MATH500) and that 'similar substantial improvements' occur for 'different math examples,' yet supplies no protocol for how candidate examples were drawn, how many were evaluated, or the selection criteria. If the reported example was retained after testing multiple candidates and choosing the highest performer, the result demonstrates existence of at least one effective seed rather than that an arbitrary single example suffices. This selection step is load-bearing for both the headline numbers and the claimed equivalence to the 1.2k DeepScaleR subset.
  2. [Ablation and analysis sections on policy gradient vs. grokking] The claim that gains arise primarily from the policy-gradient loss (distinguishing the method from grokking) rests on the specific training dynamics observed with the chosen example. Without a documented, reproducible selection procedure, it remains possible that the observed dynamics are particular to the retained example rather than general to 1-shot RLVR.
minor comments (2)
  1. [Training details] The exact coefficient schedule and range tested for the entropy loss term should be stated explicitly, as the paper emphasizes its critical role in promoting exploration.
  2. [Results on multiple examples] Clarify whether the reported 'different math examples' were drawn from the same distribution as the primary example or from a broader pool, and report the number of examples tried.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of reproducibility and the scope of our claims regarding example selection and the generality of the policy-gradient findings. We address each point below and will make revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experimental setup and results sections describing example selection and 1-shot RLVR runs] The manuscript states that it 'identifies' a single example yielding the headline gains (36.0% → 73.6% on MATH500) and that 'similar substantial improvements' occur for 'different math examples,' yet supplies no protocol for how candidate examples were drawn, how many were evaluated, or the selection criteria. If the reported example was retained after testing multiple candidates and choosing the highest performer, the result demonstrates existence of at least one effective seed rather than that an arbitrary single example suffices. This selection step is load-bearing for both the headline numbers and the claimed equivalence to the 1.2k DeepScaleR subset.

    Authors: We agree that a clear description of the example selection process is needed to support reproducibility and to precisely delineate the scope of the claims. The manuscript already reports substantial gains for multiple distinct math examples, indicating that the phenomenon is not limited to a single instance. In the experiments, candidate examples were drawn from the MATH training set, and several were evaluated to identify one yielding the headline results while confirming similar behavior for others. This supports the interpretation that effective single examples exist and can match the performance of the 1.2k-example subset, rather than asserting that an arbitrary example would produce identical gains. We will revise the experimental setup section to document the sampling approach for candidates and the evaluation criteria used, making the selection process explicit and reproducible. This revision will also reinforce the existing multi-example results to clarify the contribution. revision: yes

  2. Referee: [Ablation and analysis sections on policy gradient vs. grokking] The claim that gains arise primarily from the policy-gradient loss (distinguishing the method from grokking) rests on the specific training dynamics observed with the chosen example. Without a documented, reproducible selection procedure, it remains possible that the observed dynamics are particular to the retained example rather than general to 1-shot RLVR.

    Authors: We acknowledge that the detailed training dynamics and ablations were presented primarily for the main reported example. However, the paper already notes consistent improvements and related phenomena across different math examples. To directly address the concern about example-specific effects, we will expand the analysis section (and add an appendix if needed) with training curves and policy-gradient ablations for at least two additional examples. This will demonstrate that the dominance of the policy-gradient loss over grokking-like behavior holds more generally for 1-shot RLVR. The revision will also include a brief statement clarifying that while the primary plots focus on the representative example, the core conclusion is supported by results across examples. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in empirical claims

full rationale

The paper reports empirical performance gains from applying 1-shot RLVR (e.g., MATH500 rising from 36.0% to 73.6% on Qwen2.5-Math-1.5B) measured on held-out benchmarks. No equations, derivations, or fitted parameters are presented that reduce the reported results to their inputs by construction. The identification of the single example is stated as an empirical finding without any self-definitional loop, load-bearing self-citation, or renaming of known results. Open-source code and cross-model, cross-algorithm verification further support that the claims are independent of internal fitting or circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions plus the empirical choice of a single effective example and an entropy coefficient that promotes exploration.

free parameters (1)
  • entropy loss coefficient
    Chosen to encourage exploration during 1-shot RLVR training; its specific value is tuned for the reported gains.
axioms (1)
  • domain assumption: Math answers admit an automatically verifiable reward based on final correctness and format
    Invoked throughout the RLVR setup to define the reward signal.

pith-pipeline@v0.9.0 · 5746 in / 1359 out tokens · 24396 ms · 2026-05-15T19:47:48.868083+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  2. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  3. Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    SGAC replaces reward-variance heuristics with a multi-feature learnable selector emphasizing output entropy, yielding 68% accuracy on Hendrycks MATH with Qwen2.5-Math-1.5B versus 64-66% baselines.

  4. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  5. Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

    cs.CL 2026-05 unverdicted novelty 6.0

    Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.

  6. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

  7. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  8. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  9. Gradient Extrapolation-Based Policy Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...

  10. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  11. Cost-Aware Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

  12. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

  13. Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.

  14. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  15. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB 2026-04 unverdicted novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

  16. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  17. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV 2026-04 unverdicted novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

  18. Hierarchical Reasoning Model

    cs.AI 2025-06 unverdicted novelty 5.0

    HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...

  19. Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

    cs.LG 2026-05 unverdicted novelty 3.0

    Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 18 Pith papers · 28 internal anchors

  1. [1]

    Learning to reason with llms

     OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/, 2024. Accessed: 2025-04-10.

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  3. [3]

     Kimi k1.5: Scaling Reinforcement Learning with LLMs

     Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

  4. [4]

     On Designing Effective RL Reward at Training Time for LLM Reasoning

    Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, and Yi Wu. On designing effective rl reward at training time for llm reasoning.arXiv preprint arXiv:2410.15115, 2024

  5. [5]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V . Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...

  6. [6]

     Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

  7. [7]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  8. [8]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  9. [9]

    Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment.arXiv preprint arXiv:2410.01679, 2024

  10. [10]

     What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret

    Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind ppo’s collapse in long-cot? value optimization holds the secret.arXiv preprint arXiv:2503.01491, 2025

  11. [11]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  12. [12]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks.arXiv preprint arXiv:2504.05118, 2025

  13. [13]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  14. [14]

    Deepcoder: A fully open-source 14b coder at o3-mini level

     Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak, Qingyang Wu, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51,

  15. [15]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025

  16. [16]

     SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

     Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, Shimiao Jiang, Shiqi Kuang, Shouyu Yin, Chaohang Wen, Haotian Zhang, Bin Chen, and Bing Yu. Srpo: A cross-domain implementation of large-scale reinforcement learning on llm, 2025.

  17. [17]

    Numinamath

     Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. https://huggingface.co/AI-MO/NuminaMath-CoT (https://github.com/project-numina/aimo-progress-prize/blob/main/report/nu...

  18. [18]

     DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

     Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2,

  19. [19]

     LIMR: Less is More for RL Scaling

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025

  20. [20]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025. Submitted on April 18, 2025

  21. [21]

     Rethinking Reflection in Pre-Training

    Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, et al. Rethinking reflection in pre-training.arXiv preprint arXiv:2504.04022, 2025

  22. [22]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  23. [23]

     What Makes a Reward Model a Good Teacher? An Optimization Perspective

    Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D Lee, and Sanjeev Arora. What makes a reward model a good teacher? an optimization perspective.arXiv preprint arXiv:2503.15477, 2025

  24. [24]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  25. [25]

     Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

     An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

  26. [26]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  27. [27]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  28. [28]

     Efficient Memory Management for Large Language Model Serving with PagedAttention

     Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  29. [29]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

  30. [30]

     AIME Problems and Solutions

     Art of Problem Solving. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions. Accessed: 2025-04-20

  31. [31]

     AMC Problems and Solutions

     Art of Problem Solving. Amc problems and solutions. https://artofproblemsolving.com/wiki/index.php?title=AMC_Problems_and_Solutions. Accessed: 2025-04-20.

  32. [32]

     Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in Neural Information Processing Systems, 35:3843–3857, 2022

  33. [33]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.arXiv preprint arXiv:2402.14008, 2024

  34. [34]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  35. [35]

     EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

    Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, and Pang Wei Koh. Evaltree: Profiling language model weaknesses via hierarchical capability trees.arXiv preprint arXiv:2503.08893, 2025

  36. [36]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

     Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

  37. [37]

     Deep Grokking: Would Deep Neural Networks Generalize Better?

    Simin Fan, Razvan Pascanu, and Martin Jaggi. Deep grokking: Would deep neural networks generalize better?arXiv preprint arXiv:2405.19454, 2024

  38. [38]

    Progress measures for grokking via mechanistic interpretability

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability.arXiv preprint arXiv:2301.05217, 2023

  39. [39]

     Towards Understanding Grokking: An Effective Theory of Representation Learning

    Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning.Advances in Neural Information Processing Systems, 35:34651–34663, 2022

  40. [40]

     The Complexity Dynamics of Grokking

     Branton DeMoss, Silvia Sapora, Jakob Foerster, Nick Hawes, and Ingmar Posner. The complexity dynamics of grokking.arXiv preprint arXiv:2412.09810, 2024

  41. [41]

     Grokking at the Edge of Numerical Stability

    Lucas Prieto, Melih Barsbey, Pedro AM Mediano, and Tolga Birdal. Grokking at the edge of numerical stability.arXiv preprint arXiv:2501.04697, 2025

  42. [42]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild.arXiv preprint arXiv:2503.18892, 2025

  43. [43]

     Light-R1: Curriculum SFT, DPO and RL for Long CoT from Scratch and Beyond

    Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond.arXiv preprint arXiv:2503.10460, 2025

  44. [44]

     FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models

    Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, and Feng Zhang. Fastcurl: Curriculum reinforcement learning with progressive context extension for efficient training r1-like reasoning models.arXiv preprint arXiv:2503.17287, 2025

  45. [45]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335, 2025

  46. [46]

     Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

    Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

  47. [47]

    TTRL: Test-Time Reinforcement Learning

    Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning.arXiv preprint arXiv:2504.16084, 2025

  48. [48]

     Large-Scale Data Selection for Instruction Tuning

     Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, and Pradeep Dasigi. Large-scale data selection for instruction tuning.arXiv preprint arXiv:2503.01807, 2025.

  49. [49]

    Alpagasus: Training a better alpaca with fewer data

     Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. Alpagasus: Training a better alpaca with fewer data. In International Conference on Learning Representations, 2024

  50. [50]

     Data-Efficient Finetuning Using Cross-Task Nearest Neighbors

     Hamish Ivison, Noah A. Smith, Hannaneh Hajishirzi, and Pradeep Dasigi. Data-efficient finetuning using cross-task nearest neighbors. In Findings of the Association for Computational Linguistics, 2023

  51. [51]

    LESS: selecting influential data for targeted instruction tuning

     Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: selecting influential data for targeted instruction tuning. In International Conference on Machine Learning, 2024

  52. [52]

    Active preference learning for large language models

     William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. In International Conference on Machine Learning, 2024

  53. [53]

     Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

    Zijun Liu, Boqun Kou, Peng Li, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu. Enabling weak llms to judge response reliability via meta ranking.arXiv preprint arXiv:2402.12146, 2024

  54. [54]

     Active Preference Optimization for Sample Efficient RLHF

    Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimization for sample efficient rlhf.arXiv preprint arXiv:2402.10500, 2024

  55. [55]

     Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 2022

  56. [56]

     Concise Reasoning via Reinforcement Learning

    Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, and Kartik Talamadupula. Concise reasoning via reinforcement learning.arXiv preprint arXiv:2504.05185, 2025

  57. [57]

     Approximating KL Divergence

     J. Schulman. Approximating kl divergence. http://joschu.net/blog/kl-approx.html.

  58. [58]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985, 2024

  59. [59]

     Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-Thinking Reasoning Systems

    Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.arXiv preprint arXiv:2412.09413, 2024

  60. [60]

    Skywork open reasoner series

     Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. No...

  61. [61]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

  62. [62]

     QwQ-32B: Embracing the Power of Reinforcement Learning

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  63. [63]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning.arXiv preprint arXiv:2301.00234, 2022

  64. [64]

    Deep Learning is Robust to Massive Label Noise

    David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise.arXiv preprint arXiv:1705.10694, 2017

  65. [65]

     Deep Double Descent: Where Bigger Models and More Data Hurt

    Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

  66. [66]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

     Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv: 1609.04836, 2016.

  67. [67]

     On the Origin of Implicit Regularization in Stochastic Gradient Descent

     Samuel L. Smith, Benoit Dherin, David G. T. Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. In ICLR, 2021

  68. [68]

     AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

    Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling.arXiv preprint, 2024

  69. [69]

     Entropic Distribution Matching for Supervised Fine-Tuning of LLMs: Less Overfitting and Better Diversity

     Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Ruoyu Sun, and Zhi-Quan Luo. Entropic distribution matching for supervised fine-tuning of llms: Less overfitting and better diversity. In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability, 2024