
arxiv: 2505.24864 · v1 · pith:2NL7GLWH · submitted 2025-05-30 · 💻 cs.CL · cs.AI

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Pith reviewed 2026-05-18 20:47 UTC · model grok-4.3

keywords: reinforcement learning · reasoning · large language models · prolonged training · reasoning boundaries · KL control · pass@k

The pith

Prolonged reinforcement learning can uncover novel reasoning strategies inaccessible to base models even with extensive sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to show that reinforcement learning, continued over long training runs with appropriate controls, can lead a large language model to develop reasoning approaches that the base model does not produce no matter how many times one samples from it. A sympathetic reader would care because this would mean RL is not limited to reinforcing what is already possible but can genuinely expand the set of problems a model can solve. The work introduces a training approach called ProRL, which adds mechanisms to keep long training runs stable, and applies it to a variety of tasks. The trained models score higher on pass@k metrics, including complete success on problems where the base model scores zero across all attempts, and the size of the gain correlates with both training duration and the base model's starting competence on the task.

Core claim

Prolonged RL training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. This is achieved through a methodology incorporating KL divergence control, reference policy resetting, and a diverse suite of tasks, leading to consistent outperformance on pass@k evaluations and a correlation between reasoning boundary improvements and both base model competence and training duration.
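
The claim leans on pass@k, so it helps to see how such a number is usually computed. The page does not say which estimator the paper uses; the sketch below assumes the common unbiased estimator, which draws n samples per problem, counts the c correct ones, and estimates the chance that at least one of k samples is correct. The function name and the sample counts are illustrative assumptions, not figures from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts, for illustration only (not results from the paper).
base_score = pass_at_k(n=256, c=0, k=256)   # base model never succeeds
tuned_score = pass_at_k(n=256, c=3, k=16)   # tuned model succeeds 3 times in 256
```

With zero correct samples the estimate is exactly 0 for every k, which is the "fails regardless of the number of attempts" case the claim refers to, limited to the n samples actually drawn.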

What carries the argument

The ProRL training methodology that incorporates KL divergence control, reference policy resetting, and diverse tasks to enable stable prolonged reinforcement learning. It works by allowing the model to explore and populate new regions of the solution space over extended training periods without collapsing into repetitive outputs.
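
A minimal sketch of the two stabilizing mechanisms named above, under stated assumptions. The page does not give the paper's loss, the KL estimator, or the reset rule, so everything here is hypothetical: `kl_penalized_reward` subtracts a sampled log-ratio KL estimate scaled by a coefficient `beta`, and `ReferenceResetter` copies the current policy parameters into the reference on a fixed interval. Parameters are plain lists of floats to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List


def kl_penalized_reward(reward: float,
                        logp_policy: List[float],
                        logp_ref: List[float],
                        beta: float) -> float:
    """Task reward minus beta times a sampled KL estimate.

    The estimate is the mean of log pi(token) - log pi_ref(token)
    over the tokens of one sampled response (an assumed estimator).
    """
    log_ratios = [p - q for p, q in zip(logp_policy, logp_ref)]
    kl_estimate = sum(log_ratios) / len(log_ratios)
    return reward - beta * kl_estimate


@dataclass
class ReferenceResetter:
    """Hypothetical helper: refresh the reference policy every reset_every steps."""
    reset_every: int
    reference: List[float]
    resets: int = 0

    def maybe_reset(self, step: int, policy: List[float]) -> bool:
        # A fixed interval is an assumption; other triggers are possible.
        if step > 0 and step % self.reset_every == 0:
            self.reference = list(policy)  # hard copy of current parameters
            self.resets += 1
            return True
        return False
```

After a reset the log-ratio is zero by construction, so the penalty stops pulling the policy back toward the original base model and starts anchoring it to the most recent snapshot instead.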

If this is right

  • RL-trained models achieve success on tasks where base models fail completely no matter the sampling budget.
  • Reasoning performance continues to improve with increased training duration rather than plateauing quickly.
  • The size of the improvement correlates with how competent the base model already is on the task.
  • RL can be viewed as a method to explore and fill new areas in the space of possible solutions over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Longer training horizons could unlock reasoning capabilities in even more complex domains not tested here.
  • Examining the actual generated reasoning chains could provide direct evidence of strategy novelty beyond performance metrics.
  • Base models may contain latent potential that only becomes accessible after sustained optimization rather than short fine-tuning.

Load-bearing premise

That outperformance on pass@k, especially complete success where the base model scores zero across all samples, indicates genuinely new reasoning strategies instead of better exploitation of capabilities already present in the base distribution.

What would settle it

Finding that all successful solutions generated by the ProRL model use the same intermediate reasoning steps as the rare successful samples from the base model on the same problems would suggest the strategies are not novel.
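
One crude way to run such a check is sketched below. It is not a method from the paper: it assumes each reasoning trace is plain text with one step per line, normalizes the steps, and reports the highest Jaccard overlap between a ProRL trace and any base-model trace. All function names are hypothetical, and exact-match overlap misses paraphrased steps, so a low score is weak evidence on its own.

```python
from typing import Iterable, Set


def normalize_steps(trace: str) -> Set[str]:
    """Split a trace into lowercase, whitespace-normalized, non-empty lines."""
    return {" ".join(line.lower().split())
            for line in trace.splitlines() if line.strip()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two step sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_step_overlap(tuned_trace: str, base_traces: Iterable[str]) -> float:
    """Highest step overlap between one tuned trace and any base trace."""
    tuned_steps = normalize_steps(tuned_trace)
    return max((jaccard(tuned_steps, normalize_steps(t)) for t in base_traces),
               default=0.0)
```

An overlap near 1.0 across all solved problems would point toward the "same strategy, higher probability" reading; consistently low overlap would be compatible with, but not proof of, new strategies.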

Original abstract

Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the task competence of the base model and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ProRL, a prolonged RL training procedure incorporating KL divergence control, reference policy resetting, and a diverse task suite. It claims that extended RL training uncovers novel reasoning strategies inaccessible to base models even under extensive sampling, supported by consistent outperformance on pass@k metrics (including tasks where base models achieve zero success regardless of attempts) and strong correlations between reasoning boundary improvements and both base-model competence and training duration. Model weights are released for further research.

Significance. If the central claim is substantiated, the work would be significant for clarifying when and how RL expands the solution space of LLMs beyond amplifying latent high-reward outputs. The empirical correlations with training duration and the public release of weights are strengths that enable reproducibility and follow-on analysis.

major comments (2)
  1. [Empirical Analysis and Results] The claim that ProRL discovers strategies 'inaccessible to base models, even under extensive sampling' is load-bearing for the paper's contribution. Pass@k results showing base-model success = 0 while ProRL succeeds are reported, but the manuscript provides no qualitative comparison of solution traces, no structural analysis of reasoning steps, and no targeted prompting or few-shot elicitation experiments using ProRL-derived examples on the base model. Without such evidence, the results remain compatible with RL raising the probability of already-present but rare patterns or improving format adherence and search efficiency.
  2. [§4] §4 (Methodology) and the experimental setup: the paper mentions KL divergence control and reference policy resetting as key to prolonged training, yet provides no ablation on the strength of KL control or the frequency of policy resets. These choices directly affect the exploration-exploitation balance and are therefore central to interpreting whether prolonged training genuinely populates new regions of solution space or simply optimizes existing ones more effectively.
minor comments (2)
  1. [Figures] Figure captions and axis labels in the pass@k plots should explicitly state the number of samples used for each k and whether temperature or other decoding parameters were held constant across base and ProRL models.
  2. [Correlation Analysis] The correlation analysis between base-model competence and reasoning-boundary improvement would benefit from reporting the exact statistical test (e.g., Pearson r with p-value) and the number of tasks included.
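
The reporting asked for in the second minor comment could look like the sketch below. It computes Pearson r by hand and, to stay within the standard library, swaps the usual t-test p-value for a two-sided permutation p-value. The function names, the number of permutations, and the seed are assumptions for illustration; no data from the paper is used.

```python
import random
from math import sqrt
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pearson_with_permutation_p(x: Sequence[float],
                               y: Sequence[float],
                               n_perm: int = 10_000,
                               seed: int = 0) -> Tuple[float, float]:
    """Return (r, p) where p is a two-sided permutation p-value."""
    r_obs = pearson_r(x, y)
    rng = random.Random(seed)
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(pearson_r(x, y_shuffled)) >= abs(r_obs):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return r_obs, (hits + 1) / (n_perm + 1)
```

Here x would hold one base-model competence score per task and y the matching reasoning-boundary improvement; the number of tasks is simply `len(x)` and belongs in the report alongside r and p.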

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our claims and experimental design. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

point-by-point responses
  1. Referee: [Empirical Analysis and Results] The claim that ProRL discovers strategies 'inaccessible to base models, even under extensive sampling' is load-bearing for the paper's contribution. Pass@k results showing base-model success = 0 while ProRL succeeds are reported, but the manuscript provides no qualitative comparison of solution traces, no structural analysis of reasoning steps, and no targeted prompting or few-shot elicitation experiments using ProRL-derived examples on the base model. Without such evidence, the results remain compatible with RL raising the probability of already-present but rare patterns or improving format adherence and search efficiency.

    Authors: We agree that the inaccessibility claim is central and that pass@k alone, while quantitative evidence of expanded boundaries, leaves room for alternative interpretations such as probability amplification of rare patterns. In the revised manuscript we will add qualitative comparisons of solution traces on tasks where the base model records zero success at high k, along with a structural breakdown of reasoning steps to illustrate differences in approach. We will also include targeted prompting experiments that inject ProRL-derived solution examples into the base model to test whether success rates improve. revision: yes

  2. Referee: [§4] §4 (Methodology) and the experimental setup: the paper mentions KL divergence control and reference policy resetting as key to prolonged training, yet provides no ablation on the strength of KL control or the frequency of policy resets. These choices directly affect the exploration-exploitation balance and are therefore central to interpreting whether prolonged training genuinely populates new regions of solution space or simply optimizes existing ones more effectively.

    Authors: We acknowledge that the absence of these ablations limits interpretability of the prolonged-training results. In the revision we will add experiments that vary the KL coefficient and the reset interval, reporting their effects on both pass@k performance and training stability. These results will clarify the role of the chosen hyperparameters in enabling exploration of new solution regions. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical evaluation of RL training

full rationale

The paper conducts an empirical study of prolonged RL (ProRL) by training models with KL control, reference resetting, and diverse tasks, then measuring performance gains on external pass@k benchmarks where base models sometimes score zero. Claims about uncovering inaccessible strategies rest on these observed differences and released weights rather than any derivation chain, self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations. The central results are presented as direct comparisons against independent task suites and sampling efforts, making the work self-contained against external metrics without reduction to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard RL assumptions for verifiable rewards plus the introduced ProRL techniques; no major new free parameters or invented physical entities are introduced beyond methodological choices.

free parameters (1)
  • KL divergence control strength
    Hyperparameter used to regularize policy updates during prolonged training.
axioms (1)
  • domain assumption: Verifiable rewards exist for the chosen reasoning tasks and can guide policy improvement.
    Invoked throughout the RL setup described in the abstract.
invented entities (1)
  • ProRL training methodology (no independent evidence)
    purpose: Framework incorporating KL control, reference policy resetting, and diverse tasks for extended RL.
    Introduced as the novel training approach; no independent falsifiable prediction outside the empirical results is provided.



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  3. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  4. Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

    cs.LG 2026-01 unverdicted novelty 7.0

    Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.

  5. The Art of Scaling Reinforcement Learning Compute for LLMs

    cs.LG 2025-10 unverdicted novelty 7.0

    A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale prediction...

  6. Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

    cs.LG 2025-08 unverdicted novelty 7.0

    TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

  7. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  8. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  9. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  10. Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.

  11. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  12. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  13. SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.

  14. Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

    cs.LG 2025-12 unverdicted novelty 6.0

    Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.

  15. DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

    cs.AI 2025-09 unverdicted novelty 6.0

    DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than ext...

  16. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  17. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI 2025-09 accept novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

  18. Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

    cs.LG 2025-10 unverdicted novelty 5.0

    Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve p...

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 18 Pith papers · 8 internal anchors

  1. [1]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y . Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty- radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling- RL-19681902c1468005bed8ca303013a4e...

  4. [4]

    Dapo: An open-source llm reinforcement learning system at scale, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  5. [5]

    Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks, 2025

    Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, and Lin Yan. Vapo: Efficient and reliable rei...

  6. [6]

    Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild, 2025

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild, 2025

  7. [7]

    Deepcoder: A fully open-source 14b coder at o3-mini level

    Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak, Qingyang Wu, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level. https://pretty-radio-b75.notion.site/DeepCoder-A- Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51, ...

  8. [8]

    Code-r1: Reproducing r1 for code with reliable rewards

    Jiawei Liu and Lingming Zhang. Code-r1: Reproducing r1 for code with reliable rewards. 2025

  9. [9]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016. 10

  10. [10]

    Reward hacking in reinforcement learning

    Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, Nov 2024

  11. [11]

    and He, He and Feng, Shi , month = dec, year =

    Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R Bowman, He He, and Shi Feng. Language models learn to mislead humans via rlhf. arXiv preprint arXiv:2409.12822, 2024

  12. [12]

    Ai as humanity’s salieri: Quantifying linguistic creativity of language models via systematic attribution of machine text against web text, 2025

    Ximing Lu, Melanie Sclar, Skyler Hallinan, Niloofar Mireshghallah, Jiacheng Liu, Seungju Han, Allyson Ettinger, Liwei Jiang, Khyathi Chandu, Nouha Dziri, and Yejin Choi. Ai as humanity’s salieri: Quantifying linguistic creativity of language models via systematic attribution of machine text against web text, 2025

  13. [13]

    Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

  14. [14]

    Zico Kolter, and Aditi Raghunathan

    Xingyu Dang, Christina Baek, J. Zico Kolter, and Aditi Raghunathan. Assessing Diversity Collapse in Reasoning, February 2025

  15. [15]

    Echo chamber: Rl post-training amplifies behaviors learned in pretraining, 2025

    Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: Rl post-training amplifies behaviors learned in pretraining, 2025

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  17. [17]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  18. [18]

    Skywork open reasoner series

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xi- aoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/ Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680 , 2025. No...

  19. [19]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 1279–1297. ACM, March 2025

  20. [20]

    Decoupled weight decay regularization, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

  21. [21]

    7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient

    Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. https://hkust-nlp.notion.site/simplerl-reason, 2025. Notion Blog

  22. [22]

    American invitational mathematics examination - aime

    MAA. American invitational mathematics examination - aime. In American Invitational Mathematics Examination - AIME 2024, February 2024

  23. [23]

    American invitational mathematics examination - aime

    MAA. American invitational mathematics examination - aime. In American Invitational Mathematics Examination - AIME 2025, February 2025

  24. [24]

    American mathematics competition - amc

    MAA. American mathematics competition - amc. In American Mathematics Competition - AMC

  25. [25]

    Measuring mathematical problem solving with the math dataset, 2021

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021

  26. [26]

    Solving quantitative reasoning problems with language models, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. 11

  27. [27]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

  28. [28]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

  29. [29]

    Measuring coding challenge competence with apps, 2021

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps, 2021

  30. [30]

    Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

  31. [31]

    Taco: Topics in algorithmic code generation dataset, 2023

    Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset, 2023

  32. [32]

    Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  33. [33]

    Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

  34. [34]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023

  35. [35]

    Online difficulty filtering for reasoning oriented reinforcement learning, 2025

    Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, and Donghyun Kwak. Online difficulty filtering for reasoning oriented reinforcement learning, 2025

  36. [36]

    Instruction-following evaluation for large language models, 2023

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023

  37. [37]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  38. [38]

    The curious case of neural text degeneration, 2020

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration, 2020

  39. [39]

    Stop overthinking: A survey on efficient reasoning for large language models, 2025

    Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models, 2025

  40. [40]

    Ai as humanity’s salieri: Quantifying linguistic creativity of language models via systematic attribution of machine text against web text

    Ximing Lu, Melanie Sclar, Skyler Hallinan, Niloofar Mireshghallah, Jiacheng Liu, Seungju Han, Allyson Ettinger, Liwei Jiang, Khyathi Raghavi Chandu, Nouha Dziri, and Yejin Choi. Ai as humanity’s salieri: Quantifying linguistic creativity of language models via systematic attribution of machine text against web text. ArXiv, abs/2410.04265, 2024

  41. [41]

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichand...

  42. [42]

    Learning to reason with llms, September 2024

    OpenAI. Learning to reason with llms, September 2024. https://openai.com/index/learning-to-reason-with-llms

  43. [43]

    OpenAI: Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...

  44. [44]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

  45. [45]

    Mirror descent policy optimization, 2021

    Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization, 2021

  46. [46]

    Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024

  47. [47]

    Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024

  48. [48]

    Reinforcement learning enhanced llms: A survey, 2025

    Shuhe Wang, Shengyu Zhang, Jie Zhang, Runyi Hu, Xiaoya Li, Tianwei Zhang, Jiwei Li, Fei Wu, Guoyin Wang, and Eduard Hovy. Reinforcement learning enhanced llms: A survey, 2025

  49. [49]

    Playing atari with deep reinforcement learning, 2013

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning, 2013

  50. [50]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

  51. [51]

    Mastering the game of go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017

  52. [52]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025

  53. [53]

    Llms are greedy agents: Effects of rl fine-tuning on decision-making abilities, 2025

    Thomas Schmied, Jörg Bornschein, Jordi Grau-Moya, Markus Wulfmeier, and Razvan Pascanu. Llms are greedy agents: Effects of rl fine-tuning on decision-making abilities, 2025

  54. [54]

    Star: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  55. [55]

    Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

    Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024

  56. [56]

    Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

  57. [57]

    Scp-116k: A high-quality problem-solution dataset and a generalized pipeline for automated extraction in the higher education science domain, 2025

    Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, and Yuan Qi. Scp-116k: A high-quality problem-solution dataset and a generalized pipeline for automated extraction in the higher education science domain, 2025

  58. [58]

    Nvidia: Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorze...

  59. [59]

    Single Action per Agent: Assign only one action to each agent at a time. However, the final answer should be a list of action plans for multiple steps

  60. [60]

    Unique Agent Keys: Use unique keys for each agent in the JSON format action plan. The key should be the agent’s coordinates in the format "Agent[x, y]"

  61. [61]

    Prioritize Matching Boxes to Targets: Always prioritize actions that will match a box to its target over moving a box to an adjacent square

  62. [62]

    Sequential Action Planning: The whole returned answer should be a list of action plans for multiple steps, do not just return one step plan

  63. [63]

    Clear Formatting: Ensure the action plan is clearly formatted in JSON, with each agent’s action specified as a key-value pair.

  64. [64]

    Conflict Resolution: Ensure that no two agents are assigned actions that would interfere with each other

  65. [65]

    Optimize Efficiency: Aim to minimize the number of moves required to match all boxes with their targets. Here is the format for your action plan: Please provide your final answer as a list of action dictionaries. For example: ‘‘‘json [{"Agent[0.5, 0.5]": "move(box_blue, square[0.5, 1.5])", "Agent[1.5, 0.5]": "move(box_red, target_red)"}, {"Agent[0.5, 1.5]...