Curriculum Reinforcement Learning Can Incentivize Reasoning Capacity in LLMs Beyond the Base Model

Guocong Li; Jintai Chen; Pengxiang Cai; Qingyuan Zeng; Tianchen Fang; Xiaohan Li

arxiv: 2606.22317 · v1 · pith:WGZKCZ5Bnew · submitted 2026-06-21 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-26 11:08 UTC · model grok-4.3


keywords Curriculum Reinforcement Learning · LLM Reasoning · RLVR · Reasoning Capacity Boundary · Pass@k Evaluation · Teacher Guidance


The pith

Boundary-aware curriculum RL expands LLM reasoning capacity beyond the base model by introducing new patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RL with verifiable rewards mainly reallocates probability among trajectories the base model already knows, raising single-try accuracy but leaving the outer limit of solvable problems unchanged. The paper introduces a curriculum method that first maps that outer limit by drawing many samples per problem, then supplies teacher guidance on examples sitting at or past the limit, and finally applies RL to embed the new solution patterns. When tested on Qwen, Llama, and DeepSeek families, the procedure lifts both one-attempt success and the fraction of problems solved inside 256 draws. The authors treat the second metric as an empirical stand-in for the model's overall reasoning boundary. If the gains hold, the technique supplies a repeatable way to keep extending what an LLM can reason about instead of saturating at its starting distribution.

Core claim

Boundary-aware Curriculum RL first uses pass@k sampling to locate the current reasoning capacity boundary, then applies targeted teacher guidance to examples near or beyond that boundary, and finally uses RL to consolidate the newly introduced reasoning patterns. Across multiple base models this yields simultaneous gains in pass@1 and pass@256, with the latter serving as a proxy for expanded capacity. The approach thereby moves beyond the reallocation-only behavior observed in vanilla RLVR.
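Both metrics in the core claim are pass@k values. A minimal sketch of how pass@k is commonly estimated from n samples per problem follows, using the unbiased estimator from "Evaluating large language models trained on code" (Chen et al., 2021). Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples drawn, c: number of correct samples, k: budget.
    Equals 1 - C(n - c, k) / C(n, k), the chance that a random size-k
    subset of the n samples contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One correct sample out of 256 draws (illustrative counts, not paper data).
single_try = pass_at_k(256, 1, 1)       # 1/256
full_budget = pass_at_k(256, 1, 256)    # 1.0
never_solved = pass_at_k(256, 0, 256)   # 0.0
```

When k equals the number of draws, the per-problem estimate is 1.0 as soon as one sample is correct, so a benchmark-level pass@256 computed this way is simply the fraction of problems with at least one correct draw.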

What carries the argument

Boundary-aware Curriculum RL, which detects the reasoning boundary via pass@k sampling and uses teacher guidance plus RL to embed new patterns on difficult examples.
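The boundary-detection step can be pictured as sorting training problems by how often sampled attempts succeed. The sketch below is one possible reading, not the authors' procedure: the bucket names, the `near_rate` threshold of 0.1, and the toy counts are all assumptions introduced for illustration.

```python
from typing import Dict, List


def bucket_by_boundary(
    correct_counts: Dict[str, int],
    n_samples: int,
    near_rate: float = 0.1,  # hypothetical cutoff for "near the boundary"
) -> Dict[str, List[str]]:
    """Split problems into 'inside', 'near', and 'beyond' buckets."""
    buckets: Dict[str, List[str]] = {"inside": [], "near": [], "beyond": []}
    for problem_id, c in correct_counts.items():
        if not 0 <= c <= n_samples:
            raise ValueError(f"invalid correct count for {problem_id}")
        if c == 0:
            # Never solved within the sampling budget.
            buckets["beyond"].append(problem_id)
        elif c / n_samples <= near_rate:
            # Solved, but only rarely.
            buckets["near"].append(problem_id)
        else:
            buckets["inside"].append(problem_id)
    return buckets


# Toy counts of correct samples out of 256 draws (not paper data).
counts = {"p1": 200, "p2": 3, "p3": 0}
buckets = bucket_by_boundary(counts, n_samples=256)

# Problems that would receive teacher guidance under this reading.
guided = buckets["near"] + buckets["beyond"]
```

Under this reading, RL alone gets no reward signal from the "beyond" bucket, since every sampled attempt fails, which is where teacher guidance would matter most.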

If this is right

Both pass@1 and pass@256 rise together, indicating gains in both efficiency and reachable capacity.
The same curriculum sequence works across Qwen, Llama, and DeepSeek base models.
Average pass@256 improves 9.8 points over the base models and 10.3 points over vanilla RLVR.
The method supplies a concrete training loop that can be repeated to push the empirical boundary further.
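
A quick consistency check on the two averages quoted above, assuming both are computed over the same set of model–benchmark pairs (the summary does not state this). Write $\bar{B}$, $\bar{V}$, and $\bar{C}$ for the average pass@256 of the base models, vanilla RLVR, and boundary-aware Curriculum RL; these symbols are introduced here for illustration only.

```latex
\begin{align*}
\bar{C} - \bar{B} &= 9.8, \\
\bar{C} - \bar{V} &= 10.3, \\
\bar{V} - \bar{B} &= (\bar{C} - \bar{B}) - (\bar{C} - \bar{V}) = 9.8 - 10.3 = -0.5 .
\end{align*}
```

Under that assumption, vanilla RLVR lands about 0.5 points below the base models on average pass@256, which agrees with the abstract's remark that RLVR can decrease pass@k when k is large.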

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repeated application of the loop could produce a sequence of models whose reachable problem sets keep growing without external architectural changes.
The teacher-guidance step might eventually be replaced by an automated verifier that proposes corrections only on boundary cases.
The same boundary-detection idea could be tested on non-verifiable tasks if a reliable proxy for solution correctness can be defined.

Load-bearing premise

Higher pass@256 after the procedure means the model has acquired reasoning patterns absent from the base model rather than merely sampling the same patterns more efficiently.

What would settle it

An analysis showing that every correct trajectory found by the trained model is already generable by the base model under sufficiently large sampling, or a run in which pass@256 gains vanish when sampling temperature or decoding method changes.
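The phrase "sufficiently large sampling" can be made concrete. If the base model solves a problem with per-sample probability p, it fails n independent draws in a row with probability (1 - p)^n. The sketch below computes the largest p still compatible with n straight failures, and the number of draws needed to make a given p visible. Independent draws and the 5% level `alpha` are assumptions, and the function names are illustrative.

```python
import math


def max_success_prob(n_failures: int, alpha: float = 0.05) -> float:
    """Largest p for which n straight failures still occur with prob >= alpha."""
    if n_failures < 1 or not 0.0 < alpha < 1.0:
        raise ValueError("need n_failures >= 1 and 0 < alpha < 1")
    # Solve (1 - p) ** n_failures = alpha for p.
    return 1.0 - alpha ** (1.0 / n_failures)


def draws_needed(p: float, alpha: float = 0.05) -> int:
    """Smallest n with (1 - p) ** n <= alpha."""
    if not 0.0 < p < 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("need 0 < p < 1 and 0 < alpha < 1")
    return math.ceil(math.log(alpha) / math.log1p(-p))


# After 256 failed base-model draws, p could still be roughly 0.012.
bound_after_256 = max_success_prob(256)

# Draws needed before a 1-in-1000 solution is likely to show up at least once.
budget_for_rare = draws_needed(0.001)
```

So a problem the base model never solves in 256 draws may still have a success probability above 1%, and detecting a 1-in-1000 solution takes close to 3,000 draws. This is why a zero count at k = 256 does not by itself place a solution outside the base model's support.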

Figures

Figures reproduced from arXiv: 2606.22317 by Guocong Li, Jintai Chen, Pengxiang Cai, Qingyuan Zeng, Tianchen Fang, Xiaohan Li.

**Figure 2.** Full pass@k curves across all model–benchmark pairs. Gray, purple dashed, and blue denote the base model, the Vanilla RLVR model, and the reported boundary-aware Curriculum RL model, respectively. Improvements near k = 256 indicate that more problems become solvable under a large sampling budget, rather than only being sampled more easily at small k.

**Figure 3.** Summary of the main evaluation results. A, average pass@1 scores over AIME 2024, AIME 2025, and MATH500. B, average pass@256 scores over the same benchmarks. C, pass@256 gains over the base model for the Vanilla RLVR model and the boundary-aware Curriculum RL model. D, number of evaluation problems that remain unsolved within 256 sampled attempts for the base model and the boundary-aware Curriculum RL mode…

**Figure 4.** Expansion of training-example regions that provide useful reward-driven signal across …

**Figure 5.** Representative pass@k curves for three challenging model–benchmark pairs. These panels give a compact view of the large-k behavior shown more comprehensively in …

**Figure 6.** Per-benchmark pass@256 gain maps. Each cell reports the percentage-point change …

**Figure 7.** Training-example difficulty shares across curriculum rounds. We record the difficulty …

**Figure 8.** Absolute training-example difficulty counts across curriculum rounds. The top row separates …

**Figure 9.** Training-example difficulty transitions from the base model to the third curriculum round.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) is widely viewed as a promising path toward continuously improving large language models. Recent works, however, suggest that mainstream RLVR often reallocates sampling probabilities among trajectories already present in the base model: it can improve sampling efficiency, reflected by higher pass@1 scores, but yields limited gains, and can even decrease pass@k scores when k is large, and therefore may fail to expand the base model's reasoning capacity boundary. In this paper, we present a boundary-aware Curriculum RL approach to move beyond the base model's reasoning capacity boundary. Our approach first uses pass@k sampling to locate the current reasoning capacity boundary, then applies targeted teacher guidance to examples near or beyond that boundary, and finally uses RL to consolidate the newly introduced reasoning patterns. Across Qwen, Llama, and DeepSeek base models, boundary-aware Curriculum RL improves both pass@1 scores and pass@256 scores, with pass@1 reflecting one-attempt performance and pass@256 serving as an empirical proxy for the reasoning capacity boundary. In our experiments, average pass@256 improves by 9.8 percentage points over the base models and by 10.3 percentage points over Vanilla RLVR. These results suggest that boundary-aware Curriculum RL can provide a scalable route for LLMs to continuously improve beyond the base model's empirical reasoning capacity boundary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper reports pass@256 gains from boundary-aware curriculum RL but does not show those gains come from new reasoning patterns rather than better sampling of trajectories already in the base model.


The central result is that boundary-aware curriculum RL lifts average pass@256 by 9.8 points over the base models and 10.3 points over vanilla RLVR across Qwen, Llama, and DeepSeek. The method first samples with pass@k to find the current edge, applies teacher guidance near or past that edge, then runs RL to lock in the patterns. This sequence differs from the standard RLVR recipe that the abstract contrasts it with.

The approach is straightforward and the multi-model coverage is useful. Using pass@256 as the proxy for capacity boundary is a reasonable choice given the problem they set up.

The main weakness is that the results do not distinguish between two possibilities: the model now produces reasoning steps it could not produce before, or it simply assigns higher probability to low-probability but already-present successful trajectories. The abstract gives no trajectory-level checks, no overlap analysis with base-model rollouts, and no ablation that removes the teacher guidance step. Without those, the claim that the method expands the empirical boundary rests on an untested assumption.
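
An overlap analysis of the kind described above can start at the problem level. The sketch below compares which problems each model solves at least once, using hypothetical per-problem correct counts. It is only a first step: it cannot separate new reasoning patterns from rare base-model trajectories, which would need trajectory-level probabilities. All names and numbers are assumptions for illustration.

```python
from typing import Dict, Set


def solved_set(correct_counts: Dict[str, int]) -> Set[str]:
    """Problems with at least one correct sample."""
    return {pid for pid, c in correct_counts.items() if c > 0}


def compare_solved(
    base_counts: Dict[str, int],
    trained_counts: Dict[str, int],
) -> Dict[str, Set[str]]:
    """Problem-level overlap between base and trained models."""
    base_solved = solved_set(base_counts)
    trained_solved = solved_set(trained_counts)
    return {
        "newly_solved": trained_solved - base_solved,  # candidates for expansion
        "lost": base_solved - trained_solved,          # coverage that regressed
        "shared": base_solved & trained_solved,
    }


# Toy counts of correct samples out of 256 draws (not paper data).
base = {"p1": 40, "p2": 0, "p3": 2, "p4": 0}
trained = {"p1": 180, "p2": 5, "p3": 0, "p4": 0}
overlap = compare_solved(base, trained)
```

Reporting the "lost" set alongside "newly_solved" matters because a net pass@256 gain can hide problems that the trained model no longer solves.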

The paper is worth sending to referees because the question it asks matters for RLVR work and the proposed recipe is concrete enough to test. A serious review would focus on whether the full experiments close the gap between pass@256 improvement and evidence of genuinely new patterns. I would bring it to a reading group to walk through the exact experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes a boundary-aware Curriculum RL method for LLMs: it first uses pass@k sampling to identify the current reasoning capacity boundary, applies targeted teacher guidance to examples near or beyond that boundary, and then uses RLVR to consolidate the introduced reasoning patterns. Experiments across Qwen, Llama, and DeepSeek base models report that the approach improves both pass@1 and pass@256 (the latter treated as an empirical proxy for the capacity boundary), with average pass@256 gains of 9.8 pp over base models and 10.3 pp over vanilla RLVR. The central claim is that this provides a scalable route to expand reasoning capacity beyond the base model's empirical boundary, unlike standard RLVR which the authors argue mainly reallocates probability mass among already-present trajectories.

Significance. If the pass@256 gains demonstrably reflect reasoning patterns absent from the base model (rather than higher-probability sampling of low-probability but already-supported trajectories), the result would be significant: it would supply an empirical route for continuous post-training expansion of LLM reasoning boundaries, directly addressing the reallocation limitation the authors attribute to vanilla RLVR. The multi-model evaluation and explicit contrast with vanilla RLVR strengthen the potential impact if the key assumption holds.

major comments (2)

[Experimental results / abstract] Experimental results (as summarized in the abstract and implied in the full manuscript): the claim that pass@256 improvements demonstrate expansion beyond the base model's reasoning capacity boundary rests on the assumption that newly successful trajectories lie outside the support of the base model's distribution. No direct verification is reported (e.g., no measurement of base-model probability on the additional successful rollouts, no comparison of trajectory novelty, and no ablation confirming that the patterns are absent before training). This is load-bearing for the central claim, as the skeptic's concern (more efficient sampling of existing patterns) is not ruled out by the presented evidence.
[Method / curriculum construction] Method description (curriculum construction): the boundary-aware procedure relies on pass@k to locate the capacity boundary and teacher guidance on boundary examples, yet the manuscript supplies no ablation isolating the contribution of each component to the observed pass@256 lift. Without these controls it is difficult to attribute the gains specifically to introduction of new patterns versus other factors such as curriculum ordering or teacher signal strength.

minor comments (2)

[Abstract] The abstract states average gains but does not report per-model breakdowns or variance; adding these would improve clarity of the multi-model claim.
[Abstract / experiments] Notation for pass@k is used without explicit definition of the sampling temperature or decoding strategy; a brief clarification would aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which correctly identify areas where additional evidence would strengthen our central claim. We address each point below and will revise the manuscript accordingly.


Referee: [Experimental results / abstract] Experimental results (as summarized in the abstract and implied in the full manuscript): the claim that pass@256 improvements demonstrate expansion beyond the base model's reasoning capacity boundary rests on the assumption that newly successful trajectories lie outside the support of the base model's distribution. No direct verification is reported (e.g., no measurement of base-model probability on the additional successful rollouts, no comparison of trajectory novelty, and no ablation confirming that the patterns are absent before training). This is load-bearing for the central claim, as the skeptic's concern (more efficient sampling of existing patterns) is not ruled out by the presented evidence.

Authors: We acknowledge that the evidence is indirect and that direct verification (e.g., base-model log-probabilities on newly successful trajectories) would more conclusively address the reallocation concern. Our current argument relies on pass@256 as an established empirical proxy for distribution support together with the contrast to vanilla RLVR, which shows no comparable gains. To strengthen the manuscript, we will add an analysis computing base-model probabilities on the additional successful rollouts produced by our method.

revision: yes

Referee: [Method / curriculum construction] Method description (curriculum construction): the boundary-aware procedure relies on pass@k to locate the capacity boundary and teacher guidance on boundary examples, yet the manuscript supplies no ablation isolating the contribution of each component to the observed pass@256 lift. Without these controls it is difficult to attribute the gains specifically to introduction of new patterns versus other factors such as curriculum ordering or teacher signal strength.

Authors: We agree that isolating the contributions of boundary identification and targeted teacher guidance is important. The revised manuscript will include two new ablations: (1) boundary-aware selection without teacher guidance and (2) teacher guidance applied without boundary awareness. These will quantify each component's role in the observed pass@256 improvements.

revision: yes
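
The base-model probability analysis promised in the first response could be summarized along the lines of the sketch below. It assumes the per-token log-probabilities of a trained-model trajectory have already been scored under the base model; how they are obtained is outside this sketch, and the function names and toy values are hypothetical.

```python
import math
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Natural-log probability of a full trajectory under the base model."""
    return math.fsum(token_logprobs)


def log10_expected_draws(token_logprobs: Sequence[float]) -> float:
    """log10 of 1/p, the expected number of base-model draws before this
    exact token sequence appears. Kept in log space because 1/p overflows
    a float for long trajectories."""
    return -sequence_logprob(token_logprobs) / math.log(10.0)


# Toy trajectory: 10 tokens, each with base-model probability 0.5.
toy_logprobs = [math.log(0.5)] * 10
toy_log10_draws = log10_expected_draws(toy_logprobs)  # log10(2 ** 10)
```

One caveat on interpretation: the probability of an exact token sequence understates the probability of the underlying reasoning pattern, because many different token sequences can express the same pattern. A very low value is therefore suggestive of novelty rather than proof.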

Circularity Check

0 steps flagged

No derivation chain; empirical results only


The paper reports experimental improvements in pass@1 and pass@256 after applying boundary-aware Curriculum RL to multiple base models (Qwen, Llama, DeepSeek). No equations, derivations, fitted parameters, or mathematical predictions appear in the provided text. The central claim rests on measured deltas between base models, vanilla RLVR, and the proposed method, with pass@256 treated as an empirical proxy rather than a quantity defined in terms of itself. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes, and no renaming of known results occurs. The work is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that pass@256 functions as a reliable proxy for the reasoning capacity boundary and that teacher guidance can inject patterns outside the base model's support.

axioms (1)

domain assumption: pass@256 serves as a valid empirical proxy for the reasoning capacity boundary
Used both to locate the boundary and to quantify expansion.


