pith. machine review for the scientific record.

arxiv: 2504.20571 · v3 · submitted 2025-04-29 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 1 theorem link

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:47 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning · large language models · math reasoning · one-shot learning · verifiable reward · policy gradient · generalization · MATH benchmark

The pith

Reinforcement learning on a single training example lifts an LLM's math-reasoning accuracy on MATH500 from 36.0% to 73.6%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning with verifiable reward using a single training example can substantially strengthen mathematical reasoning in large language models. Applied to the base model Qwen2.5-Math-1.5B, this 1-shot RLVR approach raises accuracy on the MATH500 benchmark from 36.0% to 73.6% and improves average performance across six common math reasoning benchmarks from 17.6% to 35.7%. These results match the performance obtained by training on a 1.2k-example subset that contains the same single example, and similar gains appear with two examples or across other base models and algorithms. The improvements stem mainly from the policy gradient loss rather than incidental training effects, and the process produces cross-category generalization plus continued test gains after training accuracy has plateaued.

Core claim

Reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models. Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% (8.6% improvement beyond format correction), and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7% (7.0% non-format gain). This result matches the performance obtained using the 1.2k DeepScaleR subset, and RLVR with only two examples even slightly exceeds these results. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples.

What carries the argument

1-shot RLVR: reinforcement learning with verifiable reward applied to a single training example, using policy-gradient updates to reinforce correct reasoning trajectories while promoting exploration through an entropy loss term.
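
The objective behind that sentence, as the paper's Eq. 3 states it (reconstructed here from a garbled span in the extracted reference list below), is the GRPO policy-gradient loss plus KL and entropy regularizers:

    \mathcal{L}(\cdot, \theta) \;=\; \mathcal{L}_{\mathrm{PG\text{-}GRPO}}(\cdot, \theta) \;+\; \beta\,\mathcal{L}_{\mathrm{KL}}(\cdot, \theta, \theta_{\mathrm{ref}}) \;+\; \alpha\,\mathcal{L}_{\mathrm{Entropy}}(\cdot, \theta), \qquad \beta > 0, \quad \alpha < 0

The sign convention is the notable part: with α negative and the loss minimized, the entropy term acts as a bonus, which is the exploration role the paper assigns it.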

If this is right

  • Performance with one example matches results from a 1.2k-example training set on both MATH500 and the six-benchmark average.
  • Two examples produce slightly higher scores than one example across the same benchmarks.
  • The method yields consistent gains when applied to other base models such as Qwen2.5-Math-7B and Llama3.2-3B-Instruct, and with both GRPO and PPO algorithms (see the update sketch after this list).
  • Training produces cross-category generalization and an increase in self-reflection behavior in the model's outputs.
  • Test performance continues to improve even after training accuracy saturates, a pattern termed post-saturation generalization.
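
To make the mechanism concrete, here is a minimal sketch of one GRPO-style 1-shot RLVR update, assuming a HuggingFace-style causal LM interface. Everything below (function names, group size, the entropy coefficient, the reward helper) is illustrative rather than the paper's released code, and padding/masking details are omitted; the paper's full objective also carries the KL term of Eq. 3 above, dropped here for brevity.

    import torch

    def one_shot_rlvr_step(model, tokenizer, prompt, gold_answer, optimizer,
                           group_size=8, ent_coef=0.01):
        """One GRPO-style policy-gradient update on a single training example."""
        enc = tokenizer([prompt] * group_size, return_tensors="pt", padding=True)
        prompt_len = enc["input_ids"].shape[1]
        # Sample a group of rollouts from the current policy.
        with torch.no_grad():
            seqs = model.generate(**enc, do_sample=True, max_new_tokens=1024)
        completions = tokenizer.batch_decode(seqs[:, prompt_len:],
                                             skip_special_tokens=True)
        # Verifiable reward: 1.0 if the final answer checks out, else 0.0
        # (see the reward sketch further down the page).
        rewards = torch.tensor([verifiable_reward(c, gold_answer) for c in completions])
        # GRPO advantage: normalize rewards within the sampled group.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        # Re-score the sampled sequences under the policy to get token log-probs.
        logprobs = torch.log_softmax(model(seqs).logits[:, :-1], dim=-1)
        token_lp = logprobs.gather(-1, seqs[:, 1:].unsqueeze(-1)).squeeze(-1)
        completion_lp = token_lp[:, prompt_len - 1:].sum(-1)
        # Policy-gradient loss plus an entropy bonus to sustain exploration.
        entropy = -(logprobs.exp() * logprobs).sum(-1).mean()
        loss = -(adv * completion_lp).mean() - ent_coef * entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), rewards.mean().item()

Because the single prompt is fixed, the only stochasticity is in the rollouts; post-saturation generalization is the observation that test accuracy keeps moving even after these rollouts are almost always correct.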

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If one well-chosen example suffices, the data volume required for effective RL fine-tuning on reasoning tasks could be reduced by orders of magnitude.
  • The post-saturation generalization effect suggests that RL may continue refining internal solution strategies beyond what accuracy metrics capture during training.
  • Similar one-shot RLVR might be testable on other domains with verifiable outcomes, such as code generation or symbolic manipulation.
  • The distinction from grokking implies that future work can focus on policy-gradient dynamics rather than memorization-like phenomena when studying minimal-data RL.

Load-bearing premise

The single chosen example is not specially selected to inflate results, and the observed gains arise specifically from the reinforcement learning policy gradient rather than from prompt format, training setup, or other incidental factors.
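
The "verifiable reward" in that premise is mechanically simple, which is what makes the premise checkable. A minimal stand-in, assuming MATH-style answers reported in a final \boxed{...} span (the paper's actual matcher may normalize expressions more aggressively):

    import re

    def verifiable_reward(completion: str, gold_answer: str) -> float:
        """Return 1.0 iff the last \\boxed{...} span matches the gold answer."""
        # Note: this simple pattern does not handle nested braces inside \boxed{...}.
        matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        if not matches:
            return 0.0  # no parseable answer: a format failure earns zero reward
        return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

Because a malformed answer and a wrong answer both score zero, part of any early gain is the model learning to emit a parseable final answer at all; that is the format correction the headline figures are adjusted for (the 8.6% non-format gain).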

What would settle it

Training with a randomly selected single math example instead of the identified one and finding no comparable lift on MATH500 or the other benchmarks would show that the gains depend on special selection rather than the general 1-shot RLVR mechanism.
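
In code, the settling experiment is a small loop over fresh copies of the base model; all helpers here (load_math_train, train_one_shot, eval_math500) are hypothetical stand-ins for the paper's pipeline, with train_one_shot understood as repeated one_shot_rlvr_step updates from the sketch above:

    import random

    def compare_one_shot_seeds(model_factory, identified_example, n_controls=5, seed=0):
        """Train fresh base-model copies on the identified example vs. random ones."""
        rng = random.Random(seed)
        controls = rng.sample(load_math_train(), n_controls)  # arbitrary single examples
        runs = [("identified", identified_example)]
        runs += [(f"control_{i}", ex) for i, ex in enumerate(controls)]
        results = {}
        for name, example in runs:
            model = model_factory()          # fresh copy of the base model each run
            train_one_shot(model, example)   # 1-shot RLVR run to convergence
            results[name] = eval_math500(model)  # held-out accuracy
        return results

Comparable lifts from the random controls would favor the general mechanism; a lift unique to the identified example would point to selection.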

read the original abstract

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% (8.6% improvement beyond format correction), and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7% (7.0% non-format gain). This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which contains the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples. In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. All resources are open source at https://github.com/ypwang61/One-Shot-RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that reinforcement learning with verifiable reward (RLVR) using only one training example (1-shot RLVR) can substantially improve mathematical reasoning in LLMs. Applying it to Qwen2.5-Math-1.5B raises MATH500 accuracy from 36.0% to 73.6% (8.6% non-format gain) and average performance across six benchmarks from 17.6% to 35.7% (7.0% non-format gain), matching results from the 1.2k-example DeepScaleR subset that contains the example. Comparable gains hold across models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), algorithms (GRPO, PPO), and multiple math examples. The work also reports cross-category generalization, increased self-reflection, post-saturation generalization, the primacy of policy-gradient loss over grokking, and the necessity of entropy regularization for exploration.

Significance. If the central result holds, the finding is significant because it shows that RLVR for reasoning can be effective with minimal supervision, achieving parity with much larger datasets. The multi-model, multi-algorithm validation, open-source code, and explicit separation of policy-gradient effects from incidental training artifacts strengthen the contribution and invite re-examination of data-efficiency assumptions in recent RLVR literature.

major comments (2)
  1. [Experimental setup and results sections describing example selection and 1-shot RLVR runs] The manuscript states that it 'identifies' a single example yielding the headline gains (36.0% → 73.6% on MATH500) and that 'similar substantial improvements' occur for 'different math examples,' yet supplies no protocol for how candidate examples were drawn, how many were evaluated, or the selection criteria. If the reported example was retained after testing multiple candidates and choosing the highest performer, the result demonstrates existence of at least one effective seed rather than that an arbitrary single example suffices. This selection step is load-bearing for both the headline numbers and the claimed equivalence to the 1.2k DeepScaleR subset.
  2. [Ablation and analysis sections on policy gradient vs. grokking] The claim that gains arise primarily from the policy-gradient loss (distinguishing the method from grokking) rests on the specific training dynamics observed with the chosen example. Without a documented, reproducible selection procedure, it remains possible that the observed dynamics are particular to the retained example rather than general to 1-shot RLVR.
minor comments (2)
  1. [Training details] The exact coefficient schedule and range tested for the entropy loss term should be stated explicitly, as the paper emphasizes its critical role in promoting exploration.
  2. [Results on multiple examples] Clarify whether the reported 'different math examples' were drawn from the same distribution as the primary example or from a broader pool, and report the number of examples tried.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of reproducibility and the scope of our claims regarding example selection and the generality of the policy-gradient findings. We address each point below and will make revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experimental setup and results sections describing example selection and 1-shot RLVR runs] The manuscript states that it 'identifies' a single example yielding the headline gains (36.0% → 73.6% on MATH500) and that 'similar substantial improvements' occur for 'different math examples,' yet supplies no protocol for how candidate examples were drawn, how many were evaluated, or the selection criteria. If the reported example was retained after testing multiple candidates and choosing the highest performer, the result demonstrates existence of at least one effective seed rather than that an arbitrary single example suffices. This selection step is load-bearing for both the headline numbers and the claimed equivalence to the 1.2k DeepScaleR subset.

    Authors: We agree that a clear description of the example selection process is needed to support reproducibility and to precisely delineate the scope of the claims. The manuscript already reports substantial gains for multiple distinct math examples, indicating that the phenomenon is not limited to a single instance. In the experiments, candidate examples were drawn from the MATH training set, and several were evaluated to identify one yielding the headline results while confirming similar behavior for others. This supports the interpretation that effective single examples exist and can match the performance of the 1.2k-example subset, rather than asserting that an arbitrary example would produce identical gains. We will revise the experimental setup section to document the sampling approach for candidates and the evaluation criteria used, making the selection process explicit and reproducible. This revision will also reinforce the existing multi-example results to clarify the contribution. revision: yes

  2. Referee: [Ablation and analysis sections on policy gradient vs. grokking] The claim that gains arise primarily from the policy-gradient loss (distinguishing the method from grokking) rests on the specific training dynamics observed with the chosen example. Without a documented, reproducible selection procedure, it remains possible that the observed dynamics are particular to the retained example rather than general to 1-shot RLVR.

    Authors: We acknowledge that the detailed training dynamics and ablations were presented primarily for the main reported example. However, the paper already notes consistent improvements and related phenomena across different math examples. To directly address the concern about example-specific effects, we will expand the analysis section (and add an appendix if needed) with training curves and policy-gradient ablations for at least two additional examples. This will demonstrate that the dominance of the policy-gradient loss over grokking-like behavior holds more generally for 1-shot RLVR. The revision will also include a brief statement clarifying that while the primary plots focus on the representative example, the core conclusion is supported by results across examples. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in empirical claims

full rationale

The paper reports empirical performance gains from applying 1-shot RLVR (e.g., MATH500 rising from 36.0% to 73.6% on Qwen2.5-Math-1.5B) measured on held-out benchmarks. No equations, derivations, or fitted parameters are presented that reduce the reported results to their inputs by construction. The identification of the single example is stated as an empirical finding without any self-definitional loop, load-bearing self-citation, or renaming of known results. Open-source code and cross-model, cross-algorithm verification further support that the claims are independent of internal fitting or circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions plus the empirical choice of a single effective example and an entropy coefficient that promotes exploration.

free parameters (1)
  • entropy loss coefficient
    Chosen to encourage exploration during 1-shot RLVR training; its specific value is tuned for the reported gains.
axioms (1)
  • domain assumption: Math answers admit an automatically verifiable reward based on final correctness and format
    Invoked throughout the RLVR setup to define the reward signal.

pith-pipeline@v0.9.0 · 5746 in / 1359 out tokens · 24396 ms · 2026-05-15T19:47:48.868083+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  2. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  3. Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    SGAC replaces reward-variance heuristics with a multi-feature learnable selector emphasizing output entropy, yielding 68% accuracy on Hendrycks MATH with Qwen2.5-Math-1.5B versus 64-66% baselines.

  4. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  5. Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

    cs.CL 2026-05 unverdicted novelty 6.0

    Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.

  6. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

  7. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  8. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  9. Gradient Extrapolation-Based Policy Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...

  10. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  11. Cost-Aware Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

  12. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

  13. Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.

  14. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  15. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB 2026-04 unverdicted novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

  16. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  17. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV 2026-04 unverdicted novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

  18. Hierarchical Reasoning Model

    cs.AI 2025-06 unverdicted novelty 5.0

    HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...

  19. Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

    cs.LG 2026-05 unverdicted novelty 3.0

    Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 18 Pith papers · 28 internal anchors

  1. [1]

    Learning to reason with llms

     OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/, 2024. Accessed: 2025-04-10.

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  3. [3]

     Kimi k1.5: Scaling Reinforcement Learning with LLMs

     Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

  4. [4]

     On Designing Effective RL Reward at Training Time for LLM Reasoning

    Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, and Yi Wu. On designing effective rl reward at training time for llm reasoning.arXiv preprint arXiv:2410.15115, 2024

  5. [5]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V . Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...

  6. [6]

     Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

  7. [7]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  8. [8]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  9. [9]

    Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment.arXiv preprint arXiv:2410.01679, 2024

  10. [10]

     What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret

    Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind ppo’s collapse in long-cot? value optimization holds the secret.arXiv preprint arXiv:2503.01491, 2025

  11. [11]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  12. [12]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks.arXiv preprint arXiv:2504.05118, 2025

  13. [13]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  14. [14]

    Deepcoder: A fully open-source 14b coder at o3-mini level

     Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak, Qingyang Wu, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51,

  15. [15]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025

  16. [16]

     SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM

     Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, Shimiao Jiang, Shiqi Kuang, Shouyu Yin, Chaohang Wen, Haotian Zhang, Bin Chen, and Bing Yu. Srpo: A cross-domain implementation of large-scale reinforcement learning on llm, 2025.

  17. [17]

    Numinamath

     Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. https://huggingface.co/AI-MO/NuminaMath-CoT (https://github.com/project-numina/aimo-progress-prize/blob/main/report/nu...

  18. [18]

     DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

     Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2,

  19. [19]

     LIMR: Less is More for RL Scaling

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Limr: Less is more for rl scaling, 2025

  20. [20]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025. Submitted on April 18, 2025

  21. [21]

     Rethinking Reflection in Pre-Training

    Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, et al. Rethinking reflection in pre-training.arXiv preprint arXiv:2504.04022, 2025

  22. [22]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  23. [23]

     What Makes a Reward Model a Good Teacher? An Optimization Perspective

    Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D Lee, and Sanjeev Arora. What makes a reward model a good teacher? an optimization perspective.arXiv preprint arXiv:2503.15477, 2025

  24. [24]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  25. [25]

     Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

     An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

  26. [26]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  27. [27]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  28. [28]

     Efficient Memory Management for Large Language Model Serving with PagedAttention

     Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  29. [29]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

  30. [30]

     AIME Problems and Solutions

     Art of Problem Solving. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions. Accessed: 2025-04-20

  31. [31]

     AMC Problems and Solutions

     Art of Problem Solving. Amc problems and solutions. https://artofproblemsolving.com/wiki/index.php?title=AMC_Problems_and_Solutions. Accessed: 2025-04-20.

  32. [32]

     Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in Neural Information Processing Systems, 35:3843–3857, 2022

  33. [33]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.arXiv preprint arXiv:2402.14008, 2024

  34. [34]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  35. [35]

     EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

    Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, and Pang Wei Koh. Evaltree: Profiling language model weaknesses via hierarchical capability trees.arXiv preprint arXiv:2503.08893, 2025

  36. [36]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

     Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022

  37. [37]

     Deep Grokking: Would Deep Neural Networks Generalize Better?

    Simin Fan, Razvan Pascanu, and Martin Jaggi. Deep grokking: Would deep neural networks generalize better?arXiv preprint arXiv:2405.19454, 2024

  38. [38]

    Progress measures for grokking via mechanistic interpretability

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability.arXiv preprint arXiv:2301.05217, 2023

  39. [39]

     Towards Understanding Grokking: An Effective Theory of Representation Learning

    Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning.Advances in Neural Information Processing Systems, 35:34651–34663, 2022

  40. [40]

     The Complexity Dynamics of Grokking

     Branton DeMoss, Silvia Sapora, Jakob Foerster, Nick Hawes, and Ingmar Posner. The complexity dynamics of grokking.arXiv preprint arXiv:2412.09810, 2024

  41. [41]

     Grokking at the Edge of Numerical Stability

    Lucas Prieto, Melih Barsbey, Pedro AM Mediano, and Tolga Birdal. Grokking at the edge of numerical stability.arXiv preprint arXiv:2501.04697, 2025

  42. [42]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild.arXiv preprint arXiv:2503.18892, 2025

  43. [43]

     Light-R1: Curriculum SFT, DPO and RL for Long CoT from Scratch and Beyond

    Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond.arXiv preprint arXiv:2503.10460, 2025

  44. [44]

     FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models

    Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, and Feng Zhang. Fastcurl: Curriculum reinforcement learning with progressive context extension for efficient training r1-like reasoning models.arXiv preprint arXiv:2503.17287, 2025

  45. [45]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335, 2025

  46. [46]

     Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

    Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

  47. [47]

    TTRL: Test-Time Reinforcement Learning

    Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning.arXiv preprint arXiv:2504.16084, 2025

  48. [48]

     Large-Scale Data Selection for Instruction Tuning

     Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, and Pradeep Dasigi. Large-scale data selection for instruction tuning.arXiv preprint arXiv:2503.01807, 2025.

  49. [49]

    Alpagasus: Training a better alpaca with fewer data

     Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. Alpagasus: Training a better alpaca with fewer data. In International Conference on Learning Representations, 2024

  50. [50]

     Data-Efficient Finetuning Using Cross-Task Nearest Neighbors

     Hamish Ivison, Noah A. Smith, Hannaneh Hajishirzi, and Pradeep Dasigi. Data-efficient finetuning using cross-task nearest neighbors. In Findings of the Association for Computational Linguistics, 2023

  51. [51]

    LESS: selecting influential data for targeted instruction tuning

     Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: selecting influential data for targeted instruction tuning. In International Conference on Machine Learning, 2024

  52. [52]

    Active preference learning for large language models

     William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. In International Conference on Machine Learning, 2024

  53. [53]

     Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

    Zijun Liu, Boqun Kou, Peng Li, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu. Enabling weak llms to judge response reliability via meta ranking.arXiv preprint arXiv:2402.12146, 2024

  54. [54]

     Active Preference Optimization for Sample Efficient RLHF

    Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimization for sample efficient rlhf.arXiv preprint arXiv:2402.10500, 2024

  55. [55]

     Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 2022

  56. [56]

     Concise Reasoning via Reinforcement Learning

    Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, and Kartik Talamadupula. Concise reasoning via reinforcement learning.arXiv preprint arXiv:2504.05185, 2025

  57. [57]

     Approximating KL Divergence

     J. Schulman. Approximating kl divergence. http://joschu.net/blog/kl-approx.html.

  58. [58]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985, 2024

  59. [59]

     Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-Thinking Reasoning Systems

    Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.arXiv preprint arXiv:2412.09413, 2024

  60. [60]

    Skywork open reasoner series

     Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. No...

  61. [61]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

  62. [62]

     QwQ-32B: Embracing the Power of Reinforcement Learning

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  63. [63]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning.arXiv preprint arXiv:2301.00234, 2022

  64. [64]

    Deep Learning is Robust to Massive Label Noise

    David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise.arXiv preprint arXiv:1705.10694, 2017

  65. [65]

     Deep Double Descent: Where Bigger Models and More Data Hurt

    Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt.Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021

  66. [66]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

     Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv: 1609.04836, 2016.

  67. [67]

     On the Origin of Implicit Regularization in Stochastic Gradient Descent

     Samuel L. Smith, Benoit Dherin, David G. T. Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. In ICLR, 2021

  68. [68]

     AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

    Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling.arXiv preprint, 2024

  69. [69]

     Entropic Distribution Matching for Supervised Fine-Tuning of LLMs: Less Overfitting and Better Diversity

     Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Ruoyu Sun, and Zhi-Quan Luo. Entropic distribution matching for supervised fine-tuning of llms: Less overfitting and better diversity. In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability, 2024