pith. sign in

arxiv: 2606.22317 · v1 · pith:WGZKCZ5Bnew · submitted 2026-06-21 · 💻 cs.LG · cs.AI

Curriculum Reinforcement Learning Can Incentivize Reasoning Capacity in LLMs Beyond the Base Model

Pith reviewed 2026-06-26 11:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Curriculum Reinforcement LearningLLM ReasoningRLVRReasoning Capacity BoundaryPass@k EvaluationTeacher Guidance
0
0 comments X

The pith

Boundary-aware curriculum RL expands LLM reasoning capacity beyond the base model by introducing new patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RL with verifiable rewards mainly reallocates probability among trajectories the base model already knows, raising single-try accuracy but leaving the outer limit of solvable problems unchanged. The paper introduces a curriculum method that first maps that outer limit by drawing many samples per problem, then supplies teacher guidance on examples sitting at or past the limit, and finally applies RL to embed the new solution patterns. When tested on Qwen, Llama, and DeepSeek families, the procedure lifts both one-attempt success and the fraction of problems solved inside 256 draws. The authors treat the second metric as an empirical stand-in for the model's overall reasoning boundary. If the gains hold, the technique supplies a repeatable way to keep extending what an LLM can reason about instead of saturating at its starting distribution.

Core claim

Boundary-aware Curriculum RL first uses pass@k sampling to locate the current reasoning capacity boundary, then applies targeted teacher guidance to examples near or beyond that boundary, and finally uses RL to consolidate the newly introduced reasoning patterns. Across multiple base models this yields simultaneous gains in pass@1 and pass@256, with the latter serving as a proxy for expanded capacity. The approach thereby moves beyond the reallocation-only behavior observed in vanilla RLVR.

What carries the argument

Boundary-aware Curriculum RL, which detects the reasoning boundary via pass@k sampling and uses teacher guidance plus RL to embed new patterns on difficult examples.

If this is right

  • Both pass@1 and pass@256 rise together, indicating gains in both efficiency and reachable capacity.
  • The same curriculum sequence works across Qwen, Llama, and DeepSeek base models.
  • Average pass@256 improves 9.8 points over the base models and 10.3 points over vanilla RLVR.
  • The method supplies a concrete training loop that can be repeated to push the empirical boundary further.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Repeated application of the loop could produce a sequence of models whose reachable problem sets keep growing without external architectural changes.
  • The teacher-guidance step might eventually be replaced by an automated verifier that proposes corrections only on boundary cases.
  • The same boundary-detection idea could be tested on non-verifiable tasks if a reliable proxy for solution correctness can be defined.

Load-bearing premise

Higher pass@256 after the procedure means the model has acquired reasoning patterns absent from the base model rather than merely sampling the same patterns more efficiently.

What would settle it

An analysis showing that every correct trajectory found by the trained model is already generable by the base model under sufficiently large sampling, or a run in which pass@256 gains vanish when sampling temperature or decoding method changes.

Figures

Figures reproduced from arXiv: 2606.22317 by Guocong Li, Jintai Chen, Pengxiang Cai, Qingyuan Zeng, Tianchen Fang, Xiaohan Li.

Figure 1
Figure 1. Figure 1: Overview of Boundary-Aware Curriculum RL. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Full pass@k curves across all model–benchmark pairs. Gray, purple dashed, and blue denote the base model, the Vanilla RLVR model, and the reported boundary-aware Curriculum RL model, respectively. Improvements near k = 256 indicate that more problems become solvable under a large sampling budget, rather than only being sampled more easily at small k. 7 [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Summary of the main evaluation results. A, average pass@1 scores over AIME 2024, AIME 2025, and MATH500. B, average pass@256 scores over the same benchmarks. C, pass@256 gains over the base model for the Vanilla RLVR model and the boundary-aware Curriculum RL model. D, number of evaluation problems that remain unsolved within 256 sampled attempts for the base model and the boundary-aware Curriculum RL mode… view at source ↗
Figure 4
Figure 4. Figure 4: Expansion of training-example regions that provide useful reward-driven signal across [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Representative pass@k curves for three challenging model–benchmark pairs. These panels give a compact view of the large-k behavior shown more comprehensively in [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-benchmark pass@256 gain maps. Each cell reports the percentage-point change [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Training-example difficulty shares across curriculum rounds. We record the difficulty [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Absolute training-example difficulty counts across curriculum rounds. The top row separates [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Training-example difficulty transitions from the base model to the third curriculum round. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
read the original abstract

Reinforcement learning with verifiable rewards (RLVR) is widely viewed as a promising path toward continuously improving large language models. Recent works, however, suggest that mainstream RLVR often reallocates sampling probabilities among trajectories already present in the base model: it can improve sampling efficiency, reflected by higher pass@1 scores, but yields limited gains, and can even decrease pass@k scores when k is large, and therefore may fail to expand the base model's reasoning capacity boundary. In this paper, we present a boundary-aware Curriculum RL approach to move beyond the base model's reasoning capacity boundary. Our approach first uses pass@k sampling to locate the current reasoning capacity boundary, then applies targeted teacher guidance to examples near or beyond that boundary, and finally uses RL to consolidate the newly introduced reasoning patterns. Across Qwen, Llama, and DeepSeek base models, boundary-aware Curriculum RL improves both pass@1 scores and pass@256 scores, with pass@1 reflecting one-attempt performance and pass@256 serving as an empirical proxy for the reasoning capacity boundary. In our experiments, average pass@256 improves by 9.8 percentage points over the base models and by 10.3 percentage points over Vanilla RLVR. These results suggest that boundary-aware Curriculum RL can provide a scalable route for LLMs to continuously improve beyond the base model's empirical reasoning capacity boundary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a boundary-aware Curriculum RL method for LLMs: it first uses pass@k sampling to identify the current reasoning capacity boundary, applies targeted teacher guidance to examples near or beyond that boundary, and then uses RLVR to consolidate the introduced reasoning patterns. Experiments across Qwen, Llama, and DeepSeek base models report that the approach improves both pass@1 and pass@256 (the latter treated as an empirical proxy for the capacity boundary), with average pass@256 gains of 9.8 pp over base models and 10.3 pp over vanilla RLVR. The central claim is that this provides a scalable route to expand reasoning capacity beyond the base model's empirical boundary, unlike standard RLVR which the authors argue mainly reallocates probability mass among already-present trajectories.

Significance. If the pass@256 gains demonstrably reflect reasoning patterns absent from the base model (rather than higher-probability sampling of low-probability but already-supported trajectories), the result would be significant: it would supply an empirical route for continuous post-training expansion of LLM reasoning boundaries, directly addressing the reallocation limitation the authors attribute to vanilla RLVR. The multi-model evaluation and explicit contrast with vanilla RLVR strengthen the potential impact if the key assumption holds.

major comments (2)
  1. [Experimental results / abstract] Experimental results (as summarized in the abstract and implied in the full manuscript): the claim that pass@256 improvements demonstrate expansion beyond the base model's reasoning capacity boundary rests on the assumption that newly successful trajectories lie outside the support of the base model's distribution. No direct verification is reported (e.g., no measurement of base-model probability on the additional successful rollouts, no comparison of trajectory novelty, or ablation confirming absence of the patterns pre-training). This is load-bearing for the central claim, as the skeptic's concern (more efficient sampling of existing patterns) remains unruled out by the presented evidence.
  2. [Method / curriculum construction] Method description (curriculum construction): the boundary-aware procedure relies on pass@k to locate the capacity boundary and teacher guidance on boundary examples, yet the manuscript supplies no ablation isolating the contribution of each component to the observed pass@256 lift. Without these controls it is difficult to attribute the gains specifically to introduction of new patterns versus other factors such as curriculum ordering or teacher signal strength.
minor comments (2)
  1. [Abstract] The abstract states average gains but does not report per-model breakdowns or variance; adding these would improve clarity of the multi-model claim.
  2. [Abstract / experiments] Notation for pass@k is used without explicit definition of the sampling temperature or decoding strategy; a brief clarification would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which correctly identify areas where additional evidence would strengthen our central claim. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experimental results / abstract] Experimental results (as summarized in the abstract and implied in the full manuscript): the claim that pass@256 improvements demonstrate expansion beyond the base model's reasoning capacity boundary rests on the assumption that newly successful trajectories lie outside the support of the base model's distribution. No direct verification is reported (e.g., no measurement of base-model probability on the additional successful rollouts, no comparison of trajectory novelty, or ablation confirming absence of the patterns pre-training). This is load-bearing for the central claim, as the skeptic's concern (more efficient sampling of existing patterns) remains unruled out by the presented evidence.

    Authors: We acknowledge that the evidence is indirect and that direct verification (e.g., base-model log-probabilities on newly successful trajectories) would more conclusively address the reallocation concern. Our current argument relies on pass@256 as an established empirical proxy for distribution support together with the contrast to vanilla RLVR, which shows no comparable gains. To strengthen the manuscript, we will add an analysis computing base-model probabilities on the additional successful rollouts produced by our method. revision: yes

  2. Referee: [Method / curriculum construction] Method description (curriculum construction): the boundary-aware procedure relies on pass@k to locate the capacity boundary and teacher guidance on boundary examples, yet the manuscript supplies no ablation isolating the contribution of each component to the observed pass@256 lift. Without these controls it is difficult to attribute the gains specifically to introduction of new patterns versus other factors such as curriculum ordering or teacher signal strength.

    Authors: We agree that isolating the contributions of boundary identification and targeted teacher guidance is important. The revised manuscript will include two new ablations: (1) boundary-aware selection without teacher guidance and (2) teacher guidance applied without boundary awareness. These will quantify each component's role in the observed pass@256 improvements. revision: yes

Circularity Check

0 steps flagged

No derivation chain; empirical results only

full rationale

The paper reports experimental improvements in pass@1 and pass@256 after applying boundary-aware Curriculum RL to multiple base models (Qwen, Llama, DeepSeek). No equations, derivations, fitted parameters, or mathematical predictions appear in the provided text. The central claim rests on measured deltas between base models, vanilla RLVR, and the proposed method, with pass@256 treated as an empirical proxy rather than a quantity defined in terms of itself. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes, and no renaming of known results occurs. The work is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that pass@256 functions as a reliable proxy for the reasoning capacity boundary and that teacher guidance can inject patterns outside the base model's support.

axioms (1)
  • domain assumption pass@256 serves as a valid empirical proxy for the reasoning capacity boundary
    Used both to locate the boundary and to quantify expansion.

pith-pipeline@v0.9.1-grok · 5789 in / 1142 out tokens · 36751 ms · 2026-06-26T11:08:59.850419+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 14 linked inside Pith

  1. [1]

    Rl for reasoning by adaptively revealing rationales

    Mohammad Hossein Amani, Aryo Lotfi, Nicolas Baldwin, Samy Bengio, Mehrdad Farajtabar, Emmanuel Abbe, and Robert West. Rl for reasoning by adaptively revealing rationales. InThe F ourteenth International Conference on Learning Representations, 2025

  2. [2]

    Online difficulty filtering for reasoning oriented reinforcement learning

    Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, Jeongyeon Nam, and Donghyun Kwak. Online difficulty filtering for reasoning oriented reinforcement learning. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (V olume 1: Long Papers), pages 700–719, 2026

  3. [3]

    Evaluating large language models trained on code

    Mark Chen, Jerry Tworek, Heewoo Jun, et al. Evaluating large language models trained on code. https://arxiv.org/abs/2107.03374v2, 2021

  4. [4]

    Unveiling the key factors for distilling chain-of- thought reasoning

    Xinghao Chen, Zhijing Sun, Guo Wenjin, et al. Unveiling the key factors for distilling chain-of- thought reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 15094–15119, 2025

  5. [5]

    Training verifiers to solve math word problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. Training verifiers to solve math word problems. https://arxiv.org/abs/2110.14168v2, 2021

  6. [6]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. https://arxiv.org/abs/2507.06261v6, 2025

  7. [7]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 645(8081):633–638, 2025

    DeepSeek-AI, Daya Guo, Dejian Yang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 645(8081):633–638, 2025

  8. [8]

    Omni-math: A universal olympiad level mathematic benchmark for large language models

    Bofei Gao, Feifan Song, Zhe Yang, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. InThe Thirteenth International Conference on Learning Representations, 2024

  9. [9]

    Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai

    Elliot Glazer, Ege Erdil, Tamay Besiroglu, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai. https://arxiv.org/abs/2411.04872v7, 2024

  10. [10]

    Rewarding the unlikely: Lifting grpo beyond distribution sharpening

    Andre Wang He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting grpo beyond distribution sharpening. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25548–25560, 2025

  11. [11]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  12. [12]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, et al. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. InFindings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017, 2023

  13. [13]

    R-zero: Self-evolving reasoning llm from zero data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. https://arxiv.org/abs/2508.05004v4, 2025. 10

  14. [14]

    Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal

    Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self- synthesized rehearsal. https://arxiv.org/abs/2403.01244v2, 2024

  15. [15]

    Mitigating catastrophic forgetting in large language models with forgetting-aware pruning

    Wei Huang, Anda Cheng, and Yinggui Wang. Mitigating catastrophic forgetting in large language models with forgetting-aware pruning. https://arxiv.org/abs/2509.08255v1, 2025

  16. [16]

    Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning

    Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, and Ying Wei. Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning. InThe Thirteenth International Conference on Learning Representations, 2024

  17. [17]

    Tacler: Tailored curriculum reinforcement learning for efficient reasoning

    Huiyuan Lai and Malvina Nissim. Tacler: Tailored curriculum reinforcement learning for efficient reasoning. https://arxiv.org/abs/2601.21711v1, 2026

  18. [18]

    Language models can easily learn to reason from demonstrations

    Dacheng Li, Shiyi Cao, Tyler Griggs, et al. Language models can easily learn to reason from demonstrations. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 15979–15997, 2025

  19. [19]

    Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. InThe Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025

  20. [20]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, 2023

  21. [21]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 33:3776–3786, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 33:3776–3786, 2025

  22. [22]

    S1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. S1: Simple test-time scaling. https://arxiv.org/abs/2501.19393v3, 2025

  23. [23]

    Openai o1 system card, 2024

    openai. Openai o1 system card, 2024

  24. [24]

    Curriculum reinforcement learning from easy to hard tasks improves llm reasoning

    Shubham Parashar, Shurui Gui, Xiner Li, et al. Curriculum reinforcement learning from easy to hard tasks improves llm reasoning. InThe F ourteenth International Conference on Learning Representations, 2025

  25. [25]

    Seed1.5-thinking: Advancing superb reasoning models with reinforcement learning

    ByteDance Seed, Jiaze Chen, Tiantian Fan, et al. Seed1.5-thinking: Advancing superb reasoning models with reinforcement learning. https://arxiv.org/abs/2504.13914v3, 2025

  26. [26]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

    Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  27. [27]

    Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning

    Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, and Yuting Liu. Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning. https://arxiv.org/abs/2504.04524v2, 2025

  28. [28]

    Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models

    Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, and Ji-Rong Wen. Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models. https://arxiv.org/abs/2503.21380v3, 2025

  29. [29]

    Kimi k1.5: Scaling reinforcement learning with llms

    Kimi Team, Angang Du, Bofei Gao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  30. [30]

    Continual gradient low-rank projec- tion fine-tuning for llms

    Chenxu Wang, Yilin Lyu, Zicheng Sun, and Liping Jing. Continual gradient low-rank projec- tion fine-tuning for llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 14815–14829, 2025. 11

  31. [31]

    Inscl: A data-efficient continual learning paradigm for fine-tuning large language models with instructions

    Yifan Wang, Yafei Liu, Chufan Shi, Haoling Li, Chen Chen, Haonan Lu, and Yujiu Yang. Inscl: A data-efficient continual learning paradigm for fine-tuning large language models with instructions. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long P...

  32. [32]

    Reasoning scaffolding: Distilling the flow of thought from llms

    Xiangyu Wen, Junhua Huang, Zeju Li, Min Li, Jianyuan Zhong, Zhijian Xu, Mingxuan Yuan, Yongxiang Huang, and Qiang Xu. Reasoning scaffolding: Distilling the flow of thought from llms. InThe F ourteenth International Conference on Learning Representations, 2025

  33. [33]

    Reinforcement learning with verifiable rewards im- plicitly incentivizes correct reasoning in base llms

    Xumeng Wen, Zihan Liu, Shun Zheng, et al. Reinforcement learning with verifiable rewards im- plicitly incentivizes correct reasoning in base llms. InThe F ourteenth International Conference on Learning Representations, 2025

  34. [34]

    Enhancing long-chain reasoning distillation through error- aware self-reflection

    Zhuoyang Wu, Xinze Li, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Minghe Yu, Cheng Yang, Yu Gu, Ge Yu, and Maosong Sun. Enhancing long-chain reasoning distillation through error- aware self-reflection. https://arxiv.org/abs/2505.22131v2, 2025

  35. [35]

    Training large language models for reasoning through reverse curriculum reinforcement learning

    Zhiheng Xi, Wenxiang Chen, Boyang Hong, et al. Training large language models for reasoning through reverse curriculum reinforcement learning. InF orty-First International Conference on Machine Learning, 2024

  36. [36]

    Learning to reason under off-policy guidance

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. https://arxiv.org/abs/2504.14945v5, 2025

  37. [37]

    Qwen2.5-math technical report: Toward mathe- matical expert model via self-improvement

    An Yang, Beichen Zhang, Binyuan Hui, et al. Qwen2.5-math technical report: Toward mathe- matical expert model via self-improvement. https://arxiv.org/abs/2409.12122v1, 2024

  38. [38]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report. https://arxiv.org/abs/2505.09388v1, 2025

  39. [39]

    Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  40. [40]

    Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? InThe Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? InThe Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025

  41. [41]

    Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. InSecond Conference on Language Modeling, 2025

  42. [42]

    Cures: From gradient analysis to efficient curriculum learning for reasoning llms

    Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, and Jun Wang. Cures: From gradient analysis to efficient curriculum learning for reasoning llms. InThe F ourteenth International Conference on Learning Representations, 2025

  43. [43]

    On the interplay of pre-training, mid-training, and rl on reasoning language models, 2025

    Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and rl on reasoning language models, 2025

  44. [44]

    Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach

    Rosie Zhao, Alexandru Meterez, Sham M. Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: Rl post-training amplifies behaviors learned in pretraining. InSecond Conference on Language Modeling, 2025

  45. [45]

    Automatic curricu- lum expert iteration for reliable llm reasoning

    Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, and Doyen Sahoo. Automatic curricu- lum expert iteration for reliable llm reasoning. InThe Thirteenth International Conference on Learning Representations, 2024

  46. [46]

    Processbench: Identifying process errors in mathematical reasoning

    Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. https://arxiv.org/abs/2412.06559v4, 2024

  47. [47]

    Group sequence policy optimization, 2025

    Chujie Zheng, Shixuan Liu, Mingze Li, et al. Group sequence policy optimization, 2025. 12

  48. [48]

    Spurious forgetting in continual learning of language models

    Junhao Zheng, Xidi Cai, Shengjie Qiu, and Qianli Ma. Spurious forgetting in continual learning of language models. https://arxiv.org/abs/2501.13453v1, 2025

  49. [49]

    Ttrl: Test-time reinforcement learning

    Yuxin Zuo, Kaiyan Zhang, Li Sheng, et al. Ttrl: Test-time reinforcement learning. https://arxiv.org/abs/2504.16084v3, 2025. 13 A Supplementary Experimental Figures This appendix provides additional visual evidence for the main empirical claims: boundary-aware Curriculum RL improves large-k behavior more consistently than Vanilla RLVR, and the curriculum e...