Pith · machine review for the scientific record

arxiv: 2505.22312 · v2 · submitted 2025-05-28 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Skywork Open Reasoner 1 Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 04:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning · chain of thought · reasoning models · entropy collapse · large language models · AIME benchmark · LiveCodeBench

The pith

Skywork-OR1 applies reinforcement learning to long chain-of-thought models and raises average accuracy on AIME24, AIME25, and LiveCodeBench by 13 to 15 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Skywork-OR1 as a scalable RL method for long chain-of-thought reasoning models built on the DeepSeek-R1-Distill series. It reports accuracy gains from 57.8 percent to 72.8 percent for the 32B model and from 43.6 percent to 57.5 percent for the 7B model across the three benchmarks. Ablation studies test the main pipeline components while separate analysis shows that preventing early entropy collapse supports better final performance. The authors fully release model weights, training code, and datasets.

Core claim

An effective RL implementation for long CoT models that includes entropy-mitigation techniques produces substantial reasoning gains, with the 32B version surpassing DeepSeek-R1 and Qwen3-32B on AIME24 and AIME25 while remaining comparable on LiveCodeBench.

What carries the argument

The RL training pipeline with entropy-collapse mitigation applied to DeepSeek-R1-Distill models.
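The pipeline itself is not reproduced on this page, but the shape of the carrying mechanism can be sketched. Below is a minimal, hypothetical entropy-regularized policy-gradient surrogate in the spirit the paper describes; the function names, the `beta` coefficient, and all values are illustrative assumptions, not the paper's implementation.

```python
import math

def token_entropy(logits):
    # Shannon entropy (nats) of the softmax distribution at one decoding step.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs)

def pg_loss_with_entropy_bonus(logps, advantages, step_entropies, beta=0.01):
    # Policy-gradient surrogate -E[A * log pi], minus a beta-weighted entropy
    # bonus that pushes back against premature entropy collapse.
    # beta is an illustrative coefficient, not a value from the paper.
    n = len(logps)
    pg = -sum(a * lp for a, lp in zip(advantages, logps)) / n
    return pg - beta * sum(step_entropies) / len(step_entropies)
```

Raising `beta` trades a slightly worse surrogate for a higher-entropy, more exploratory policy; the paper's claim is that keeping this entropy from collapsing early is what protects final benchmark accuracy.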

Load-bearing premise

The measured accuracy gains come mainly from the described RL components and entropy techniques rather than from hidden data choices or evaluation differences.

What would settle it

A controlled rerun of the training that removes only the entropy-mitigation steps and measures whether the accuracy lifts largely disappear.
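That falsification test can be made concrete as a matched-control experiment plan. Everything below is hypothetical (model name, step budget, data pool, hyperparameters); the point is only that the two arms differ in the entropy-mitigation switches and nothing else.

```python
# Hypothetical two-arm ablation: identical budgets, data, and hyperparameters,
# differing only in the entropy-mitigation switches, so any accuracy gap is
# attributable to those switches alone. All values are illustrative.
BASE = {
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "total_steps": 2000,          # matched training budget
    "dataset": "math+code pool",  # same data composition and volume
    "lr": 1e-6,
}

ENTROPY_KNOBS = ("entropy_bonus", "adaptive_entropy_control")

def make_arms(base):
    treatment = {**base, "entropy_bonus": 0.01, "adaptive_entropy_control": True}
    control = {**base, "entropy_bonus": 0.0, "adaptive_entropy_control": False}
    return treatment, control

def arms_are_matched(treatment, control):
    # Verify the arms agree on everything except the entropy knobs.
    shared_t = {k: v for k, v in treatment.items() if k not in ENTROPY_KNOBS}
    shared_c = {k: v for k, v in control.items() if k not in ENTROPY_KNOBS}
    return shared_t == shared_c
```

If accuracy in the control arm falls back toward the distilled baseline while the treatment arm keeps the lift, the load-bearing premise holds; if both arms land together, the gains came from elsewhere in the pipeline.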

read the original abstract

The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Skywork-OR1, a scalable RL implementation for long Chain-of-Thought reasoning built on the DeepSeek-R1-Distill model series. It reports concrete accuracy gains on AIME24, AIME25, and LiveCodeBench (+15.0% average for the 32B model from 57.8% to 72.8%, and +13.9% for the 7B model from 43.6% to 57.5%), attributes these to the RL pipeline and entropy-mitigation techniques, presents ablation studies validating core components, analyzes entropy collapse dynamics, and fully open-sources model weights, training code, and datasets.

Significance. If the reported gains are shown to be robustly caused by the described RL components and entropy interventions rather than confounds, the work would advance practical understanding of RL for LLM reasoning, particularly by demonstrating the importance of mitigating premature entropy collapse. The open-sourcing of code, datasets, and models is a clear strength that supports reproducibility and community follow-up research.

major comments (3)
  1. [Abstract and Experimental Results] The central performance claim of +15.0% and +13.9% average accuracy lifts requires evidence that these deltas are attributable to the RL pipeline and entropy-mitigation methods. The manuscript does not report error bars, standard deviations across runs, or statistical significance tests on the benchmark results, which is necessary to establish that the improvements exceed evaluation noise or minor setup variations.
  2. [Ablation Studies] The paper states that comprehensive ablations were performed to validate the core components. However, to support the causal attribution of gains to the entropy-dynamics interventions, the ablations must control for total training steps, data volume/quality, and hyperparameter budgets; without explicit description of such matched controls, the load-bearing claim that the described techniques drive the observed improvements cannot be fully evaluated.
  3. [Entropy Collapse Analysis] The investigation identifies key factors affecting entropy dynamics and claims that mitigating premature collapse is critical for test performance. This section should include quantitative correlations (e.g., plots or tables linking specific entropy values or mitigation thresholds to downstream accuracy changes) to make the causal link between entropy management and the reported benchmark gains explicit and falsifiable.
minor comments (2)
  1. [Abstract] The abstract and introduction could more clearly distinguish the base DeepSeek-R1-Distill models from the final Skywork-OR1 checkpoints to avoid any ambiguity in the comparison setup.
  2. [Figures and Tables] Figure captions and table headers should explicitly state the number of evaluation samples or prompts used for each benchmark to improve clarity and reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment by clarifying experimental controls, adding quantitative analysis, and acknowledging limitations where multiple runs were not feasible due to computational costs. Revisions have been made to strengthen the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Experimental Results] The central performance claim of +15.0% and +13.9% average accuracy lifts requires evidence that these deltas are attributable to the RL pipeline and entropy-mitigation methods. The manuscript does not report error bars, standard deviations across runs, or statistical significance tests on the benchmark results, which is necessary to establish that the improvements exceed evaluation noise or minor setup variations.

    Authors: We agree that reporting error bars or multiple runs would strengthen claims of robustness. Due to the high computational cost of RL training for 7B and 32B models, we conducted single runs per configuration. In the revised manuscript, we have added a discussion in the Experimental Results section explicitly noting this limitation and highlighting that gains are consistent across model sizes and benchmarks. We also note that evaluations use fixed test sets and deterministic decoding, reducing certain sources of variance. revision: partial

  2. Referee: [Ablation Studies] The paper states that comprehensive ablations were performed to validate the core components. However, to support the causal attribution of gains to the entropy-dynamics interventions, the ablations must control for total training steps, data volume/quality, and hyperparameter budgets; without explicit description of such matched controls, the load-bearing claim that the described techniques drive the observed improvements cannot be fully evaluated.

    Authors: We appreciate this clarification request. Our entropy-related ablations were performed with matched total training steps and identical data volumes. We have revised the Ablation Studies section to explicitly describe these controls, including constant step counts, same dataset composition, and comparable hyperparameter budgets across variants. This makes the experimental design and causal attribution clearer. revision: yes

  3. Referee: [Entropy Collapse Analysis] The investigation identifies key factors affecting entropy dynamics and claims that mitigating premature collapse is critical for test performance. This section should include quantitative correlations (e.g., plots or tables linking specific entropy values or mitigation thresholds to downstream accuracy changes) to make the causal link between entropy management and the reported benchmark gains explicit and falsifiable.

    Authors: We agree that quantitative links would make the analysis more rigorous. We have updated the Entropy Collapse Analysis section with additional plots and tables correlating entropy values at key training stages with final benchmark accuracies across different mitigation thresholds. These provide explicit quantitative evidence supporting the importance of avoiding premature entropy collapse. revision: yes
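The evidence the referee asks for reduces to a simple statistic: the correlation, across runs or mitigation thresholds, between policy entropy measured mid-training and final benchmark accuracy. A minimal sketch on synthetic numbers, which are purely illustrative and not results from the paper:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation between two equal-length samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic example runs (hypothetical, for illustration only):
# mid-training policy entropy (nats/token) vs. final benchmark accuracy (%).
mid_training_entropy = [0.15, 0.40, 0.65, 0.90]
final_accuracy = [52.0, 58.5, 63.0, 64.5]
r = pearson_r(mid_training_entropy, final_accuracy)
```

A consistently strong positive r of this kind is what would make the "premature collapse hurts test performance" claim quantitative and falsifiable.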

Circularity Check

0 steps flagged

No circularity: purely empirical RL performance report with ablations

full rationale

The paper reports post-training accuracy gains on AIME24/AIME25/LiveCodeBench after RL fine-tuning of DeepSeek-R1-Distill models, supported by ablation studies on entropy dynamics and training components. There is no mathematical derivation chain, no first-principles prediction, and no equation that could reduce by construction to fitted parameters, self-citations, or ansatzes. The claims rest on experimental outcomes and open-sourced artifacts rather than self-referential logic, satisfying the self-contained criterion with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical outcome of an RL training run rather than on any derivation from first principles. No new physical or mathematical axioms are introduced; the work assumes standard RL-for-LLM practices and benchmark validity.

free parameters (1)
  • RL training hyperparameters
    Learning rates, reward scaling, entropy regularization coefficients and other knobs that were tuned to produce the reported accuracy numbers.
axioms (1)
  • domain assumption Reinforcement learning on final-answer correctness improves long-CoT reasoning capability
    Invoked throughout the training pipeline description.
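The ledger's single axiom presupposes a verifiable reward signal on final answers. A minimal sketch of such a binary outcome reward is below; the string normalization is a toy stand-in of my own, whereas real verifiers such as Math-Verify parse and compare mathematical expressions.

```python
def correctness_reward(model_answer: str, gold_answer: str) -> float:
    # Binary outcome reward: 1.0 if the final answers match after a crude
    # normalization, else 0.0. Toy stand-in for an expression-level verifier.
    norm = lambda s: s.strip().lower().replace(" ", "")
    return 1.0 if norm(model_answer) == norm(gold_answer) else 0.0
```

Whether optimizing this sparse signal improves long-CoT reasoning in general, rather than just final-answer matching, is exactly the domain assumption the ledger flags.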

pith-pipeline@v0.9.0 · 5615 in / 1309 out tokens · 27608 ms · 2026-05-17T04:22:53.912748+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

    cs.LG 2026-05 unverdicted novelty 7.0

    DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...

  2. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

    cs.LG 2026-05 conditional novelty 7.0

    DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

  3. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  4. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 7.0

    ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

  5. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 7.0

    An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

  6. Faithful Mobile GUI Agents with Guided Advantage Estimator

    cs.AI 2026-05 unverdicted novelty 7.0

    Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.

  7. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  8. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  9. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  10. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 6.0

    ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.

  11. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 6.0

    An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.

  12. Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

    cs.CL 2026-04 unverdicted novelty 6.0

    PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...

  13. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  14. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    cs.LG 2026-04 unverdicted novelty 6.0

    On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

  15. Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

    cs.AI 2026-03 unverdicted novelty 6.0

    Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.

  16. Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

    cs.LG 2025-12 unverdicted novelty 6.0

    Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.

  17. EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

    cs.LG 2026-05 unverdicted novelty 5.0

    EP-GRPO adds entropy-gated modulation, implicit process signals from policy divergence, and cumulative entropy mapping to GRPO, yielding higher accuracy and efficiency on math reasoning benchmarks.

  18. Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

    cs.LG 2025-12 unverdicted novelty 5.0

    Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.

  19. OneRec-V2 Technical Report

    cs.IR 2025-08 unverdicted novelty 5.0

    OneRec-V2 scales generative recommendation to 8B parameters via decoder-only design and real-world preference alignment, improving user engagement metrics in production A/B tests.

  20. Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation

    cs.LG 2026-05 unverdicted novelty 4.0

    Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 17 Pith papers · 13 internal anchors
