pith. sign in

arxiv: 2606.09932 · v1 · pith:QYW44JPAnew · submitted 2026-06-07 · 💻 cs.LG · cs.AI

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

Pith reviewed 2026-06-27 18:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords supervised fine-tuningreinforcement learningmodel plasticityLLM post-trainingSFT-to-RL handoffover-confident distributionsneuron resetmodel fusion
0
0 comments X

The pith

Excessive SFT creates over-confident token distributions and sharp parameter landscapes that hinder RL optimization, but Rejuvenation restores plasticity through base-anchored fusion and neuron reset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why reinforcement learning stages often yield little gain when applied to models that received too much supervised fine-tuning. Analysis of parameter changes, output distributions, and optimization behavior shows that over-SFT produces rigid models with peaked probabilities and steep loss surfaces. The proposed Rejuvenation technique fuses the model toward its base checkpoint to reduce drift and selectively resets neurons to loosen rigidity while keeping useful SFT priors. Tests on math reasoning and agentic tasks find that the method raises RL performance and improves out-of-distribution generalization. The work targets the common SFT-then-RL pipeline and aims to make the transition more reliable.

Core claim

Models from excessive SFT lose plasticity, shown by over-confident token distributions and sharp parameter landscapes that resist effective reshaping during RL. Rejuvenation counters this with base-anchored model fusion to limit SFT-induced drift together with targeted neuron reset to reduce rigidity, thereby restoring adaptability without discarding SFT-acquired priors.

What carries the argument

Rejuvenation, which applies base-anchored model fusion to curb SFT drift and targeted neuron reset to ease model rigidity.

If this is right

  • Rejuvenation raises final RL performance on math reasoning tasks.
  • Rejuvenation raises final RL performance on agentic tasks.
  • Rejuvenation improves generalization to out-of-distribution tasks.
  • The method preserves SFT priors while reducing optimization difficulty in the RL stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines may need explicit controls on SFT duration or strength to leave sufficient plasticity for later RL.
  • The same fusion-and-reset pattern could be tested on other sequential training regimes such as continued pretraining followed by fine-tuning.
  • Measuring landscape sharpness or distribution entropy before RL could serve as a practical diagnostic for deciding when to apply rejuvenation.

Load-bearing premise

The limited RL improvement after excessive SFT is caused mainly by loss of plasticity rather than reward model quality, optimization choices, or data mismatch.

What would settle it

Apply Rejuvenation to an over-SFT model and observe no reduction in parameter landscape sharpness or no increase in RL reward improvement.

Figures

Figures reproduced from arXiv: 2606.09932 by Jiashun Liu, Ling Pan, Runze Liu, Xu Wan, Yuqian Fu.

Figure 1
Figure 1. Figure 1: Overview of Rejuvenation. limited room for RL to further improve the policy and generalize (Chu et al., 2025). Thus, despite its widespread success, the handoff from SFT to RL remains a fragile and computationally expensive. A critical yet often overlooked challenge is to determine which SFT checkpoint should be used as the RL initial policy. Standard SFT metrics (e.g., training loss, validation accuracy) … view at source ↗
Figure 2
Figure 2. Figure 2: Parameter changes and statis￾tics of different SFT checkpoints. Excessive SFT induces large parameter shifts and weight magnitude. We first examine how SFT changes the model parameters. For each checkpoint, we visualize the element-wise parameter difference with respect to the base model. As shown in [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Parameter changes of layers.0.self_attn.v_proj.weight induced by SFT and RL. changes compared with the SFT-induced shift. This suggests that once SFT has caused relatively high parameter distortion, RL is unlikely to reverse it through standard policy optimization. 3.1.2 Output Space Excessive SFT saturates the policy before RL. We compare the output diversity of different SFT checkpoints in [PITH_FULL_IM… view at source ↗
Figure 5
Figure 5. Figure 5: Evaluation results of RL-trained models from different SFT checkpoints. (a) Average score [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Average gradient norm and weight update magnitude of RL with different SFT checkpoints. Over-trained models exhibit larger RL gradient norms. We next study whether these SFT-induced changes affect RL opti￾mization [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Token logit comparison between normal SFT models and over-trained models: (a) [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Logit attribution ranking: (a) module level; (b) neuron level. [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Wordcloud of selected tokens in logit attribution. [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of token entropy: (a) Before neuron reset; (b) After neuron reset. [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Per layer and per module visualization of neuron reset ratio in EvoLM-4B. [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Parameter changes of layers.0.self_attn.k_proj.weight induced by SFT and RL. SFT - Base Delta 0.5 Epochs (Under-Trained) 2 Epochs (Moderately Trained) 8 Epochs (Over-Trained) 32 Epochs (Severely Over-Trained) RL - SFT Delta RL - Base Delta 0.00100 0.00075 0.00050 0.00025 0.00000 0.00025 0.00050 0.00075 0.00100 [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Parameter changes of layers.27.self_attn.v_proj.weight induced by SFT and RL. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Parameter changes of layers.27.self_attn.k_proj.weight induced by SFT and RL. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_13.png] view at source ↗
read the original abstract

Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become a standard pipeline for Large Language Model (LLM) post-training. SFT is expected to provide a useful behavioral prior for RL to further enhance model capabilities. However, checkpoints with excessive SFT often show limited improvement during RL. We attribute this failure to the loss of model plasticity: the reduced ability of an SFT-initialized policy to be effectively reshaped by subsequent RL. To better understand this phenomenon, we conduct detailed analysis from multiple perspectives, including parameter changes, output spaces, and RL optimization dynamics. Our results show that models from excessive SFT tend to produce over-confident token distributions and exhibit sharp parameter landscapes, which make them harder to optimize in the RL stage. To enable a more robust SFT-to-RL handoff, we propose \texttt{Rejuvenation}, a simple yet effective method that restores plasticity while preserving useful SFT-acquired priors. Rejuvenation leverages base-anchored model fusion to reduce excessive SFT-induced drift with targeted neuron reset to mitigate model rigidity. Experimental results on both math reasoning tasks and agentic tasks demonstrate that our approach consistently improves RL performance on over-trained SFT models, while also enhancing generalization to out-of-distribution tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that excessive SFT produces over-confident token distributions and sharp parameter landscapes that reduce model plasticity and limit gains during subsequent RL. The authors support this via analysis of parameter changes, output spaces, and RL optimization dynamics, then introduce Rejuvenation (base-anchored model fusion plus targeted neuron reset) to restore plasticity while retaining SFT priors. Experiments on math reasoning and agentic tasks report consistent RL improvements and better OOD generalization for over-trained SFT checkpoints.

Significance. If the plasticity-loss attribution is isolated from confounds and the empirical gains hold under controlled conditions, the work would be significant for LLM post-training pipelines by providing a practical intervention for SFT-to-RL transitions. The multi-perspective analysis and task diversity are positive features; reproducible code or parameter-free derivations are not mentioned.

major comments (2)
  1. [Abstract] Abstract: the central claim attributes limited RL improvement after excessive SFT specifically to loss of plasticity (over-confident distributions and sharp landscapes) rather than confounds such as reward-model quality, KL coefficient, or data mismatch. No explicit controls are described that hold the reward model, optimization hyperparameters, and data fixed while varying only SFT duration or plasticity metrics; without such isolation the observed Rejuvenation gains could arise from incidental regularization rather than restored plasticity.
  2. [Abstract] Abstract (experimental results paragraph): the claim of 'consistent' RL performance improvements and enhanced OOD generalization is stated without reference to error bars, ablation details, dataset sizes, or statistical significance tests. These omissions are load-bearing because they prevent verification that the gains are robustly attributable to the proposed plasticity restoration rather than other factors.
minor comments (1)
  1. The method description ('base-anchored model fusion' and 'targeted neuron reset') is high-level; adding precise definitions, hyperparameters, or pseudocode would improve clarity without altering the central claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on isolating the plasticity effect and strengthening experimental reporting. We address each major comment below and will revise the manuscript to improve clarity and verifiability.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim attributes limited RL improvement after excessive SFT specifically to loss of plasticity (over-confident distributions and sharp landscapes) rather than confounds such as reward-model quality, KL coefficient, or data mismatch. No explicit controls are described that hold the reward model, optimization hyperparameters, and data fixed while varying only SFT duration or plasticity metrics; without such isolation the observed Rejuvenation gains could arise from incidental regularization rather than restored plasticity.

    Authors: We agree that explicit isolation of SFT duration is necessary to support the plasticity attribution. In the full experimental setup, the reward model, KL coefficient, optimization hyperparameters, and training data were held fixed while varying only the number of SFT steps across checkpoints; the same base model and RL procedure were used for all comparisons. However, this controlled design was not stated explicitly in the abstract. We will revise the abstract and add a dedicated paragraph in Section 3 (or an appendix) describing the fixed components and the SFT-duration sweep to make the isolation clear. revision: yes

  2. Referee: [Abstract] Abstract (experimental results paragraph): the claim of 'consistent' RL performance improvements and enhanced OOD generalization is stated without reference to error bars, ablation details, dataset sizes, or statistical significance tests. These omissions are load-bearing because they prevent verification that the gains are robustly attributable to the proposed plasticity restoration rather than other factors.

    Authors: We acknowledge that the abstract lacks these details. The full manuscript reports results over multiple random seeds with standard deviations, includes ablation studies on the fusion and reset components, specifies dataset sizes, and uses statistical tests for key comparisons. We will expand the abstract's results paragraph to reference error bars, note the ablation scope, and mention significance testing. We will also ensure these elements are highlighted in the main results section for easier verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical attribution and intervention with no derivations or self-referential predictions

full rationale

The paper conducts empirical analysis of SFT-to-RL handoff, attributes limited RL gains after excessive SFT to loss of plasticity (over-confident distributions and sharp landscapes), and proposes the Rejuvenation method (base-anchored fusion plus neuron reset) with experimental validation on math and agentic tasks. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claim is an observational attribution supported by analysis of parameter changes, output spaces, and optimization dynamics rather than any quantity that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements. The method is presented as an empirical intervention, not a closed-form result equivalent to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, parameters, or background assumptions are stated in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5773 in / 1082 out tokens · 13169 ms · 2026-06-27T18:48:09.863353+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

146 extracted references · 13 canonical work pages

  1. [1]

    Nature , volume =

    Loss of plasticity in deep continual learning , author =. Nature , volume =. 2024 , publisher =

  2. [2]

    arXiv preprint arXiv:2602.11137 , year=

    Weight Decay Improves Language Model Plasticity , author=. arXiv preprint arXiv:2602.11137 , year=

  3. [3]

    arXiv preprint arXiv:2604.14164 , year=

    How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data , author=. arXiv preprint arXiv:2604.14164 , year=

  4. [4]

    Proceedings of the 39th International Conference on Machine Learning , pages =

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

  5. [5]

    The Eleventh International Conference on Learning Representations , year=

    Editing models with task arithmetic , author=. The Eleventh International Conference on Learning Representations , year=

  6. [6]

    2021 , journal=

    A Mathematical Framework for Transformer Circuits , author=. 2021 , journal=

  7. [7]

    International Conference on Learning Representations , year=

    Pruning Convolutional Neural Networks for Resource Efficient Inference , author=. International Conference on Learning Representations , year=

  8. [8]

    Are Sixteen Heads Really Better than One? , url =

    Michel, Paul and Levy, Omer and Neubig, Graham , booktitle =. Are Sixteen Heads Really Better than One? , url =

  9. [9]

    Knowledge Neurons in Pretrained Transformers

    Dai, Damai and Dong, Li and Hao, Yaru and Sui, Zhifang and Chang, Baobao and Wei, Furu. Knowledge Neurons in Pretrained Transformers. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.581

  10. [10]

    An Adversarial Example for Direct Logit Attribution: Memory Management in GELU -4 L

    Janiak, Jett and Rager, Can and Dao, James and Lau, Yeu-Tong. An Adversarial Example for Direct Logit Attribution: Memory Management in GELU -4 L. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.15

  11. [11]

    2025 , editor =

    Chu, Tianzhe and Zhai, Yuexiang and Yang, Jihan and Tong, Shengbang and Xie, Saining and Schuurmans, Dale and Le, Quoc V and Levine, Sergey and Ma, Yi , booktitle =. 2025 , editor =

  12. [12]

    reinforcement learning fine-tuning for llms , author =

    Rl is neither a panacea nor a mirage: Understanding supervised vs. reinforcement learning fine-tuning for llms , author =. arXiv preprint arXiv:2508.16546 , year =

  13. [13]

    arXiv preprint arXiv:2509.12235 , year =

    Rl fine-tuning heals ood forgetting in sft , author =. arXiv preprint arXiv:2509.12235 , year =

  14. [14]

    Quagmires in

    Feiyang Kang and Michael Kuchnik and Karthik Padthe and Marin Vlastelica and Ruoxi Jia and Carole-Jean Wu and Newsha Ardalani , booktitle =. Quagmires in. 2026 , url =

  15. [15]

    arXiv preprint arXiv:2602.01058 , year =

    Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning , author =. arXiv preprint arXiv:2602.01058 , year =

  16. [16]

    arXiv preprint arXiv:2602.14012 , year =

    From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection , author =. arXiv preprint arXiv:2602.14012 , year =

  17. [17]

    The Thirteenth International Conference on Learning Representations , year =

    Preserving Diversity in Supervised Fine-Tuning of Large Language Models , author =. The Thirteenth International Conference on Learning Representations , year =

  18. [18]

    arXiv preprint arXiv:2507.12856 , year =

    Supervised fine tuning on curated data is reinforcement learning (and can be improved) , author =. arXiv preprint arXiv:2507.12856 , year =

  19. [19]

    On the Generalization of

    Yongliang Wu and Yizhou Zhou and Zhou Ziheng and Yingzhe Peng and Xinyu Ye and Xinting Hu and Wenbo Zhu and Lu Qi and Ming-Hsuan Yang and Xu Yang , booktitle =. On the Generalization of. 2026 , url =

  20. [20]

    The Fourteenth International Conference on Learning Representations , year =

    Anchored Supervised Fine-Tuning , author =. The Fourteenth International Conference on Learning Representations , year =

  21. [21]

    Getting Your

    Xinran Li and Guangda Huzhang and Siqi Shen and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Jun Zhang , booktitle =. Getting Your. 2026 , url =

  22. [22]

    arXiv preprint arXiv:2602.02244 , year =

    Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models , author =. arXiv preprint arXiv:2602.02244 , year =

  23. [23]

    arXiv preprint arXiv:2601.09195 , year =

    ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection , author =. arXiv preprint arXiv:2601.09195 , year =

  24. [24]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

    Learning to Reason under Off-Policy Guidance , author =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

  25. [25]

    arXiv preprint arXiv:2505.13026 , year =

    Step-wise adaptive integration of supervised fine-tuning and reinforcement learning for task-specific llms , author =. arXiv preprint arXiv:2505.13026 , year =

  26. [26]

    Ozdaglar , booktitle =

    Mingyang Liu and Gabriele Farina and Asuman E. Ozdaglar , booktitle =. 2025 , url =

  27. [27]

    arXiv preprint arXiv:2506.01096 , year =

    SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning , author =. arXiv preprint arXiv:2506.01096 , year =

  28. [28]

    The Fourteenth International Conference on Learning Representations , year =

    Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions , author =. The Fourteenth International Conference on Learning Representations , year =

  29. [29]

    2026 , url =

    Yuqian Fu and Tinghong Chen and Jiajun Chai and Xihuai Wang and Songjun Tu and Guojun Yin and Wei Lin and Qichao Zhang and Yuanheng Zhu and Dongbin Zhao , booktitle =. 2026 , url =

  30. [30]

    arXiv preprint arXiv:2507.01679 , year =

    Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling , author =. arXiv preprint arXiv:2507.01679 , year =

  31. [31]

    arXiv preprint arXiv:2508.00222 , year =

    RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization , author =. arXiv preprint arXiv:2508.00222 , year =

  32. [32]

    On-Policy

    Wenhao Zhang and Yuexiang Xie and Yuchang Sun and Yanxi Chen and Guoyin Wang and Yaliang Li and Bolin Ding and Jingren Zhou , booktitle =. On-Policy. 2026 , url =

  33. [33]

    The Fourteenth International Conference on Learning Representations , year =

    Proximal Supervised Fine-Tuning , author =. The Fourteenth International Conference on Learning Representations , year =

  34. [34]

    arXiv preprint arXiv:2509.04419 , year =

    Towards a unified view of large language model post-training , author =. arXiv preprint arXiv:2509.04419 , year =

  35. [35]

    arXiv preprint arXiv:2604.14258 , year=

    GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification , author=. arXiv preprint arXiv:2604.14258 , year=

  36. [36]

    arXiv preprint arXiv:2604.23747 , year =

    SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning , author =. arXiv preprint arXiv:2604.23747 , year =

  37. [37]

    Xing and Sham M

    Zhenting Qi and Fan Nie and Alexandre Alahi and James Zou and Himabindu Lakkaraju and Yilun Du and Eric P. Xing and Sham M. Kakade and Hanlin Zhang , booktitle =. Evo. 2025 , url =

  38. [38]

    arXiv preprint arXiv:2512.07783 , year =

    On the interplay of pre-training, mid-training, and rl on reasoning language models , author =. arXiv preprint arXiv:2512.07783 , year =

  39. [39]

    arXiv preprint arXiv:1707.06347 , year =

    Proximal Policy Optimization Algorithms , author =. arXiv preprint arXiv:1707.06347 , year =

  40. [40]

    AAAI Conference on Artificial Intelligence (AAAI) , author =

    Mastering Complex Control in MOBA Games with Deep Reinforcement Learning , volume =. AAAI Conference on Artificial Intelligence (AAAI) , author =. 2020 , month =. doi:10.1609/aaai.v34i04.6144 , number =

  41. [41]

    Reinforcement learning: An introduction , author =

  42. [42]

    The Bitter Lesson , author =

  43. [43]

    Advances in neural information processing systems (NeurIPS) , volume =

    Direct preference optimization: Your language model is secretly a reward model , author =. Advances in neural information processing systems (NeurIPS) , volume =

  44. [44]

    Conference on Language Modeling (COLM) , year =

    From r to Q star: Your Language Model is Secretly a Q-Function , author =. Conference on Language Modeling (COLM) , year =

  45. [45]

    arXiv preprint arXiv:2211.14275 , year =

    Solving math word problems with process-and outcome-based feedback , author =. arXiv preprint arXiv:2211.14275 , year =

  46. [46]

    The Twelfth International Conference on Learning Representations , year =

    Let's Verify Step by Step , author =. The Twelfth International Conference on Learning Representations , year =

  47. [47]

    Math-Shepherd: Verify and Reinforce LLM s Step-by-step without Human Annotations

    Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang , editor =. Math-Shepherd: Verify and Reinforce. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = aug, year =. doi:10.18653/v1/2024.acl-long.510 , pages =

  48. [48]

    The Lessons of Developing Process Reward Models in Mathematical Reasoning

    Zhang, Zhenru and Zheng, Chujie and Wu, Yangzhen and Zhang, Beichen and Lin, Runji and Yu, Bowen and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang. The Lessons of Developing Process Reward Models in Mathematical Reasoning. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.547

  49. [49]

    Proceedings of the AAAI Conference on Artificial Intelligence , author =

    GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning , volume =. Proceedings of the AAAI Conference on Artificial Intelligence , author =. 2026 , month =. doi:10.1609/aaai.v40i41.40797 , number =

  50. [50]

    arXiv preprint arXiv:2502.01456 , year =

    Process Reinforcement through Implicit Rewards , author =. arXiv preprint arXiv:2502.01456 , year =

  51. [51]

    The Fourteenth International Conference on Learning Representations , year =

    Agentic Reinforced Policy Optimization , author =. The Fourteenth International Conference on Learning Representations , year =

  52. [52]

    International Conference on Machine Learning (ICML) , pages =

    AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training , author =. International Conference on Machine Learning (ICML) , pages =. 2024 , editor =

  53. [53]

    2023 , note =

    Introducing Claude , author =. 2023 , note =

  54. [54]

    arXiv preprint arXiv:2303.08774 , year =

    GPT-4 Technical Report , author =. arXiv preprint arXiv:2303.08774 , year =

  55. [55]

    arXiv preprint arXiv:2410.21276 , year =

    GPT-4o System Card , author =. arXiv preprint arXiv:2410.21276 , year =

  56. [56]

    2024 , note =

    Learning to reason with LLMs , author =. 2024 , note =

  57. [57]

    2024 , note =

    OpenAI o3-mini , author =. 2024 , note =

  58. [58]

    arXiv preprint arXiv:2501.12599 , year =

    Kimi k1.5: Scaling Reinforcement Learning with LLMs , author =. arXiv preprint arXiv:2501.12599 , year =

  59. [59]

    arXiv preprint arXiv:2402.03300 , year =

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. arXiv preprint arXiv:2402.03300 , year =

  60. [60]

    Nature , volume =

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , author =. Nature , volume =

  61. [61]

    arXiv preprint arXiv:2412.15115 , year =

    Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =

  62. [62]

    arXiv preprint arXiv:2409.12122 , year =

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement , author =. arXiv preprint arXiv:2409.12122 , year =

  63. [63]

    arXiv preprint arXiv:2409.12186 , year =

    Qwen2.5-Coder Technical Report , author =. arXiv preprint arXiv:2409.12186 , year =

  64. [64]

    arXiv preprint arXiv:2505.09388 , year =

    Qwen3 Technical Report , author =. arXiv preprint arXiv:2505.09388 , year =

  65. [65]

    QwQ-32B: Embracing the Power of Reinforcement Learning , url =

  66. [66]

    arXiv preprint arXiv:2407.21783 , year =

    The Llama 3 Herd of Models , author =. arXiv preprint arXiv:2407.21783 , year =

  67. [67]

    Introducing Gemma 3: The most capable model you can run on a single GPU or TPU , url =

  68. [68]

    Measuring Mathematical Problem Solving With the MATH Dataset , url =

    Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving With the MATH Dataset , url =

  69. [69]

    arXiv preprint arXiv:2110.14168 , year =

    Training verifiers to solve math word problems , author =. arXiv preprint arXiv:2110.14168 , year =

  70. [70]

    2023 , note =

    American Mathematics Contest 12 (AMC 12) , author =. 2023 , note =

  71. [71]

    2024 , note =

    American Invitational Mathematics Examination (AIME) , author =. 2024 , note =

  72. [72]

    2025 , note =

    American Invitational Mathematics Examination (AIME) , author =. 2025 , note =

  73. [73]

    Solving Quantitative Reasoning Problems with Language Models , url =

    Lewkowycz, Aitor and Andreassen, Anders and Dohan, David and Dyer, Ethan and Michalewski, Henryk and Ramasesh, Vinay and Slone, Ambrose and Anil, Cem and Schlag, Imanol and Gutman-Solo, Theo and Wu, Yuhuai and Neyshabur, Behnam and Gur-Ari, Guy and Misra, Vedant , booktitle =. Solving Quantitative Reasoning Problems with Language Models , url =

  74. [74]

    O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    He, Chaoqun and Luo, Renjie and Bai, Yuzhuo and Hu, Shengding and Thai, Zhen and Shen, Junhao and Hu, Jinyi and Han, Xu and Huang, Yujie and Zhang, Yuxiang and Liu, Jie and Qi, Lei and Liu, Zhiyuan and Sun, Maosong , editor =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = aug, yea...

  75. [75]

    Bowman , booktitle =

    David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman , booktitle =. 2024 , url =

  76. [76]

    arXiv preprint arXiv:1803.05457 , year =

    Think you have solved question answering? try arc, the ai2 reasoning challenge , author =. arXiv preprint arXiv:1803.05457 , year =

  77. [77]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark , url =

    Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu , booktitle =. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understandi...

  78. [78]

    The Thirteenth International Conference on Learning Representations , year =

    -bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author =. The Thirteenth International Conference on Learning Representations , year =

  79. [79]

    GitHub repository , howpublished =

    Jia LI and Edward Beeching and Lewis Tunstall and Ben Lipkin and Roman Soletskyi and Shengyi Costa Huang and Kashif Rasul and Longhui Yu and Albert Jiang and Ziju Shen and Zihan Qin and Bin Dong and Li Zhou and Yann Fleureau and Guillaume Lample and Stanislas Polu , title =. GitHub repository , howpublished =. 2024 , publisher =

  80. [80]

    The Fourteenth International Conference on Learning Representations , year =

    DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning , author =. The Fourteenth International Conference on Learning Representations , year =

Showing first 80 references.