Recognition: unknown
Skywork Open Reasoner 1 Technical Report
Pith reviewed 2026-05-17 04:22 UTC · model grok-4.3
The pith
Skywork-OR1 applies reinforcement learning to long chain-of-thought models and raises average accuracy across AIME24, AIME25, and LiveCodeBench by 15.0 points for the 32B model and 13.9 points for the 7B model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An effective RL implementation for long CoT models that includes entropy-mitigation techniques produces substantial reasoning gains, with the 32B version surpassing DeepSeek-R1 and Qwen3-32B on AIME24 and AIME25 while remaining comparable on LiveCodeBench.
What carries the argument
The RL training pipeline with entropy-collapse mitigation applied to DeepSeek-R1-Distill models.
Load-bearing premise
The measured accuracy gains come mainly from the described RL components and entropy techniques rather than from hidden data choices or evaluation differences.
What would settle it
A controlled rerun of the training that removes only the entropy-mitigation steps and measures whether the accuracy lifts largely disappear.
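The report's exact entropy-mitigation mechanics are not spelled out in this summary, so the following is only a minimal sketch of the general idea, assuming a PyTorch-style training loop: track the policy's mean token entropy during RL and keep an explicit entropy term in the objective so the sampling distribution is not driven to premature collapse. The function name and coefficient are illustrative, not Skywork-OR1's implementation.

    import torch
    import torch.nn.functional as F

    def pg_loss_with_entropy_bonus(logits, actions, advantages, entropy_coef=1e-3):
        """Minimal sketch: policy-gradient token loss plus an entropy bonus.

        logits:     [batch, seq, vocab] policy logits over sampled completions
        actions:    [batch, seq] token ids actually sampled
        advantages: [batch, seq] per-token advantages (estimator left unspecified)
        """
        log_probs = F.log_softmax(logits, dim=-1)
        probs = log_probs.exp()

        # log-probability of the tokens that were actually sampled
        act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

        # mean token-level entropy; "entropy collapse" shows up as this shrinking early
        entropy = -(probs * log_probs).sum(dim=-1).mean()

        pg_loss = -(advantages * act_logp).mean()

        # subtracting the entropy term rewards keeping the distribution spread out
        return pg_loss - entropy_coef * entropy, entropy.detach()

Logging the returned entropy alongside benchmark accuracy is what would make the controlled rerun above checkable: remove or zero the bonus and watch whether both curves fall together.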
read the original abstract
The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Skywork-OR1, a scalable RL implementation for long Chain-of-Thought reasoning built on the DeepSeek-R1-Distill model series. It reports concrete accuracy gains on AIME24, AIME25, and LiveCodeBench (+15.0% average for the 32B model from 57.8% to 72.8%, and +13.9% for the 7B model from 43.6% to 57.5%), attributes these to the RL pipeline and entropy-mitigation techniques, presents ablation studies validating core components, analyzes entropy collapse dynamics, and fully open-sources model weights, training code, and datasets.
Significance. If the reported gains are shown to be robustly caused by the described RL components and entropy interventions rather than confounds, the work would advance practical understanding of RL for LLM reasoning, particularly by demonstrating the importance of mitigating premature entropy collapse. The open-sourcing of code, datasets, and models is a clear strength that supports reproducibility and community follow-up research.
major comments (3)
- [Abstract and Experimental Results] Abstract and Experimental Results section: The central performance claim of +15.0% and +13.9% average accuracy lifts requires evidence that these deltas are attributable to the RL pipeline and entropy-mitigation methods. The manuscript does not report error bars, standard deviations across runs, or statistical significance tests on the benchmark results; such reporting is necessary to establish that the improvements exceed evaluation noise or minor setup variations (a bootstrap sketch for per-problem error bars follows after this list).
- [Ablation Studies] Ablation Studies section: The paper states that comprehensive ablations were performed to validate the core components. However, to support the causal attribution of gains to the entropy-dynamics interventions, the ablations must control for total training steps, data volume/quality, and hyperparameter budgets; without explicit description of such matched controls, the load-bearing claim that the described techniques drive the observed improvements cannot be fully evaluated.
- [Entropy Collapse Analysis] Entropy Collapse Analysis section: The investigation identifies key factors affecting entropy dynamics and claims that mitigating premature collapse is critical for test performance. This section should include quantitative correlations (e.g., plots or tables linking specific entropy values or mitigation thresholds to downstream accuracy changes) to make the causal link between entropy management and the reported benchmark gains explicit and falsifiable.
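On the first major comment above: a cheap way to attach error bars to a single evaluation run is to bootstrap over per-problem (or per-sample) correctness. The sketch below is generic and assumes only a vector of 0/1 outcomes; it is not taken from the manuscript.

    import numpy as np

    def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
        """Bootstrap a (1 - alpha) confidence interval for benchmark accuracy.

        correct: 1D array of 0/1 outcomes, one per evaluation item
                 (per problem, or per problem x sample pair for avg@k scoring).
        """
        rng = np.random.default_rng(seed)
        correct = np.asarray(correct, dtype=float)
        resampled = rng.choice(correct, size=(n_boot, correct.size), replace=True)
        lo, hi = np.quantile(resampled.mean(axis=1), [alpha / 2, 1 - alpha / 2])
        return correct.mean(), (lo, hi)

    # illustrative call only: 30 AIME-style problems, 22 answered correctly
    acc, (lo, hi) = bootstrap_accuracy_ci(np.r_[np.ones(22), np.zeros(8)])

With only 30 problems per AIME set, such intervals are wide, which is exactly why the referee's request for variance reporting matters.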
minor comments (2)
- [Abstract] The abstract and introduction could more clearly distinguish the base DeepSeek-R1-Distill models from the final Skywork-OR1 checkpoints to avoid any ambiguity in the comparison setup.
- [Figures and Tables] Figure captions and table headers should explicitly state the number of evaluation samples or prompts used for each benchmark to improve clarity and reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed each major comment by clarifying experimental controls, adding quantitative analysis, and acknowledging limitations where multiple runs were not feasible due to computational costs. Revisions have been made to strengthen the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract and Experimental Results] Abstract and Experimental Results section: The central performance claim of +15.0% and +13.9% average accuracy lifts requires evidence that these deltas are attributable to the RL pipeline and entropy-mitigation methods. The manuscript does not report error bars, standard deviations across runs, or statistical significance tests on the benchmark results, which is necessary to establish that the improvements exceed evaluation noise or minor setup variations.
Authors: We agree that reporting error bars or multiple runs would strengthen claims of robustness. Due to the high computational cost of RL training for 7B and 32B models, we conducted single runs per configuration. In the revised manuscript, we have added a discussion in the Experimental Results section explicitly noting this limitation and highlighting that gains are consistent across model sizes and benchmarks. We also note that evaluations use fixed test sets and deterministic decoding, reducing certain sources of variance. revision: partial
-
Referee: [Ablation Studies] Ablation Studies section: The paper states that comprehensive ablations were performed to validate the core components. However, to support the causal attribution of gains to the entropy-dynamics interventions, the ablations must control for total training steps, data volume/quality, and hyperparameter budgets; without explicit description of such matched controls, the load-bearing claim that the described techniques drive the observed improvements cannot be fully evaluated.
Authors: We appreciate this clarification request. Our entropy-related ablations were performed with matched total training steps and identical data volumes. We have revised the Ablation Studies section to explicitly describe these controls, including constant step counts, same dataset composition, and comparable hyperparameter budgets across variants. This makes the experimental design and causal attribution clearer. revision: yes
-
Referee: [Entropy Collapse Analysis] Entropy Collapse Analysis section: The investigation identifies key factors affecting entropy dynamics and claims that mitigating premature collapse is critical for test performance. This section should include quantitative correlations (e.g., plots or tables linking specific entropy values or mitigation thresholds to downstream accuracy changes) to make the causal link between entropy management and the reported benchmark gains explicit and falsifiable.
Authors: We agree that quantitative links would make the analysis more rigorous. We have updated the Entropy Collapse Analysis section with additional plots and tables correlating entropy values at key training stages with final benchmark accuracies across different mitigation thresholds. These provide explicit quantitative evidence supporting the importance of avoiding premature entropy collapse. revision: yes
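The quantitative link the referee asks for is simple to compute once per-run entropy traces and final scores are logged. A minimal sketch (not the authors' analysis code), assuming one logged entropy value and one final accuracy per ablation run:

    import numpy as np
    from scipy import stats

    def entropy_accuracy_link(entropies, accuracies):
        """Correlate mean policy entropy at a fixed training step (one value per
        ablation run) with that run's final benchmark accuracy."""
        entropies = np.asarray(entropies, dtype=float)
        accuracies = np.asarray(accuracies, dtype=float)
        rho, p_rank = stats.spearmanr(entropies, accuracies)  # rank correlation
        lin = stats.linregress(entropies, accuracies)          # linear fit
        return {"spearman_rho": rho, "spearman_p": p_rank,
                "pearson_r": lin.rvalue, "slope": lin.slope}

A table of such correlations across mitigation thresholds would make the claimed entropy-accuracy link directly falsifiable.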
Circularity Check
No circularity: purely empirical RL performance report with ablations
full rationale
The paper reports post-training accuracy gains on AIME24/AIME25/LiveCodeBench after RL fine-tuning of DeepSeek-R1-Distill models, supported by ablation studies on entropy dynamics and training components. No mathematical derivation chain, first-principles predictions, or equations exist that could reduce by construction to fitted parameters, self-citations, or ansatzes. Claims rest on experimental outcomes and open-sourced artifacts rather than any self-referential logic, satisfying the self-contained criterion with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL training hyperparameters
axioms (1)
- domain assumption: reinforcement learning on final-answer correctness improves long-CoT reasoning capability
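Read concretely, this assumption refers to RL whose reward is binary final-answer correctness (for math) or test-case pass/fail (for code). A toy sketch follows, with a naive string check standing in for a robust verifier such as the Math-Verify package in the reference list, and a GRPO-style group normalization shown as one common advantage choice; the report's actual reward and estimator may differ.

    import re
    import statistics

    def final_answer_reward(completion: str, gold: str) -> float:
        """Binary reward on final-answer correctness (toy sketch).

        Looks for a \\boxed{...} answer and string-compares it to the gold answer;
        a real pipeline would use a robust checker (e.g. Math-Verify) instead.
        """
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

    def group_normalized_advantages(rewards):
        """GRPO-style normalization over the rollouts sampled for one prompt."""
        mu = statistics.mean(rewards)
        sd = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
        return [(r - mu) / sd for r in rewards]

If rewards of this shape did not translate into better reasoning, the headline gains would have to come from somewhere else, which is what this ledger entry flags.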
Forward citations
Cited by 20 Pith papers
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
Teaching Language Models to Think in Code
ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
Faithful Mobile GUI Agents with Guided Advantage Estimator
Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.
-
Teacher-Guided Policy Optimization for LLM Distillation
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
-
SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
-
Teaching Language Models to Think in Code
ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
-
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
-
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.
-
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.
-
EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance
EP-GRPO adds entropy-gated modulation, implicit process signals from policy divergence, and cumulative entropy mapping to GRPO, yielding higher accuracy and efficiency on math reasoning benchmarks.
-
Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
-
OneRec-V2 Technical Report
OneRec-V2 scales generative recommendation to 8B parameters via decoder-only design and real-world preference alignment, improving user engagement metrics in production A/B tests.
-
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.
Reference graph
Works this paper leans on
-
[1]
Acereason-nemotron: Advancing math and code reasoning through reinforcement learning, 2025
Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Mohammad Shoeybi, Peng Xu, Wei Ping, and Bryan Catanzaro. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning, 2025.
work page 2025
-
[2]
Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. CoRR, abs/2502.01456, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025
work page 2025
-
[4]
Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024
work page 2024
-
[5]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. Notion Blog
work page 2025
-
[8]
Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025
work page 2025
-
[9]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint, 2024
work page 2024
-
[11]
Hynek Kydlíček. Math-verify: A robust mathematical expression evaluation system. https://github.com/huggingface/Math-Verify, 2025. Version 0.6.1
work page 2025
-
[12]
Coderl: Mastering code generation through pretrained models and deep reinforcement learning
Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu-Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - Decembe...
work page 2022
-
[13]
Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/project-numina/aimo-progress-prize/blob/main/report/num...
work page 2024
-
[15]
Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023
Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023
-
[16]
Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3b...
work page 2025
-
[17]
Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025
Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, ...
work page 2025
-
[18]
Areal: Ant reasoning rl. https://github.com/inclusionAI/AReaL, 2025
Ant Research RL Lab. Areal: Ant reasoning rl. https://github.com/inclusionAI/AReaL, 2025
work page 2025
-
[19]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[20]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[21]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2nd edition, 2018
work page 2018
-
[23]
Policy gradient methods for reinforcement learning with function approximation
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 1999
work page 1999
-
[24]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Qwq-32b: Embracing the power of reinforcement learning, March 2025
Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025
work page 2025
-
[26]
RUCAIBox STILL Team. Still-3-1.5b-preview: Enhancing slow thinking abilities of small models through reinforcement learning. 2025
work page 2025
-
[27]
Superdistillation achieves near-r1 performance with just 5
TinyR1 Team. Superdistillation achieves near-r1 performance with just 5
-
[28]
Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond. arXiv preprint arXiv:2503.10460, 2025
-
[29]
Light-r1: Surpassing r1-distill from scratch with $1000 through curriculum sft & dpo, 2025
Liang Wen, Fenrui Xiao, Xin He, Yunke Cai, Zhenyu Duan, Qi An, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng Zhang. Light-r1: Surpassing r1-distill from scratch with $1000 through curriculum sft & dpo, 2025
work page 2025
-
[30]
Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, and Xiaolong Xu. Leetcodedataset: A temporal dataset for robust evaluation and efficient training of code llms. arXiv preprint arXiv:2504.14655, 2025
-
[31]
Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025
Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, 2025
work page 2025
-
[32]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yan...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)