
arxiv: 2507.00432 · v2 · pith:4B5WMF72new · submitted 2025-07-01 · 💻 cs.AI · cs.CL

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Pith reviewed 2026-05-19 00:55 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM reasoning · transferability · supervised fine-tuning · reinforcement learning · general capabilities · representation drift · math reasoning · post-training

The pith

Reinforcement learning on math data preserves general LLM capabilities while supervised fine-tuning erodes them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether strong math reasoning in large language models reflects genuine broader problem-solving gains or narrow specialization. Broad evaluations of existing models show limited transfer to science, coding, planning and instruction tasks. Controlled experiments on one base model using only math data then isolate the tuning method as the key variable: reinforcement learning maintains performance across domains while supervised fine-tuning produces forgetting. Internal analyses trace the difference to large shifts in latent representations and output distributions under supervised fine-tuning, shifts that reinforcement learning largely avoids.

Core claim

When the same base model is trained on identical math-only data, reinforcement learning versions retain strong performance on scientific QA, agent planning, coding and instruction-following, whereas supervised fine-tuning versions lose these general capabilities. Latent-space and token-distribution measurements show that supervised fine-tuning produces substantial drift away from the model's original general-domain structure while reinforcement learning largely preserves that structure.

What carries the argument

Direct comparison of reinforcement learning versus supervised fine-tuning on fixed math reasoning data, with latent representation similarity and output token distribution shift as the metrics that quantify preservation or loss of general capabilities.
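The output-distribution half of this comparison can be illustrated with a minimal sketch. The summary here does not state which divergence the paper uses, so KL divergence between next-token distributions is an assumption, and the function names and synthetic logits below are hypothetical stand-ins for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_token_kl(base_logits, tuned_logits):
    """Mean per-position KL(tuned || base) between next-token distributions.

    Both arrays have shape (num_positions, vocab_size) and must come from
    the same input tokens so that positions line up.
    """
    log_p = log_softmax(np.asarray(tuned_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(base_logits, dtype=np.float64))
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())

# Synthetic logits (hypothetical data, not results from the paper).
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 50))
small_shift = base + 0.05 * rng.normal(size=base.shape)
large_shift = base + 1.0 * rng.normal(size=base.shape)

kl_small = mean_token_kl(base, small_shift)
kl_large = mean_token_kl(base, large_shift)
print(f"small perturbation KL: {kl_small:.4f}")
print(f"large perturbation KL: {kl_large:.4f}")
```

In practice the two logit arrays would come from running the base and tuned checkpoints on the same general-domain prompts; a larger mean KL indicates a larger output shift.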

If this is right

  • Reasoning models trained with reinforcement learning on math data should retain higher performance on non-math tasks than those trained with supervised fine-tuning.
  • Post-training pipelines that rely heavily on supervised fine-tuning of distilled reasoning data are likely to produce models with reduced general capabilities.
  • Standard math benchmark gains alone are unreliable indicators of overall model improvement because transfer to other domains is weak under supervised fine-tuning.
  • Latent representation stability during training is a practical diagnostic for whether a reasoning model will keep its general abilities.
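A representation-stability diagnostic of the kind described in the last bullet could look like the sketch below. The similarity measure is an assumption: linear centered kernel alignment (CKA) is swapped in as one common choice, since this summary does not name the metric the paper uses. The activation matrices are synthetic and hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_examples, dim).

    Returns a value in [0, 1]; 1 means the two representations match up to
    rotation and uniform scaling.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Center each feature so the measure ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Synthetic hidden states (hypothetical data, not results from the paper).
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(200, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated_acts = 3.0 * base_acts @ rotation      # same structure, new basis
unrelated_acts = rng.normal(size=(200, 16))    # no shared structure

cka_rotated = linear_cka(base_acts, rotated_acts)
cka_unrelated = linear_cka(base_acts, unrelated_acts)
print(f"CKA vs rotated copy: {cka_rotated:.3f}")
print(f"CKA vs unrelated:    {cka_unrelated:.3f}")
```

Tracking one minus this similarity across checkpoints, on a fixed set of general-domain prompts, gives a simple drift score that can be logged alongside benchmark results.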

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hybrid training schedules that begin with supervised fine-tuning for rapid math gains and then switch to reinforcement learning could reduce drift while controlling compute cost.
  • The same stability difference between the two tuning methods may appear when the narrow domain is coding or science rather than math.
  • Developers evaluating new reasoning models should include representation-drift measurements alongside benchmark scores to predict generalization.

Load-bearing premise

The chosen mix of math, scientific QA, agent planning, coding and instruction tasks is assumed to be broad enough to stand in for general capabilities, and the observed differences are assumed to trace to the tuning method rather than to data composition or scale details.

What would settle it

A new training run on the same base model and math data in which supervised fine-tuning is modified to explicitly limit representation drift and is then shown to retain general-task performance at levels comparable to reinforcement learning.
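One way to make "explicitly limit representation drift" concrete is to add a penalty on the distance between the tuned and base hidden states to the usual supervised loss. This is an illustrative formalization, not an objective taken from the paper. All symbols are hypothetical: $\pi_\theta$ is the model being tuned, $h_\theta(x)$ and $h_{\text{base}}(x)$ are hidden states of the tuned and base models on input $x$, $\mathcal{D}_{\text{math}}$ is the math-only training set, and $\beta \ge 0$ sets the strength of the constraint.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{math}}}
      \left[ -\log \pi_\theta(y \mid x) \right]
  + \beta \, \mathbb{E}_{x \sim \mathcal{D}_{\text{math}}}
      \left[ \left\lVert h_\theta(x) - h_{\text{base}}(x) \right\rVert_2^2 \right]
```

Setting $\beta = 0$ recovers plain supervised fine-tuning, so sweeping $\beta$ would show whether reducing drift alone is enough to recover general-task performance.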

Original abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that math reasoning gains in LLMs largely fail to transfer to other domains such as scientific QA, agent planning, coding, and instruction-following. Evaluations across more than 20 open-weight models show limited generalization, while controlled experiments on Qwen3-14B using math-only data demonstrate that RL-tuned models preserve general capabilities and latent-domain structure better than SFT-tuned models, which show substantial representation and output drift.

Significance. If the central empirical findings hold after addressing controls, the work is significant for post-training research: it provides evidence favoring RL over SFT for reasoning improvements without sacrificing breadth, supported by both large-scale model evaluations and mechanistic analyses of latent spaces and token distributions. The scale of the model survey and the focus on transferability rather than benchmark chasing are strengths.

major comments (2)
  1. [§4] §4 (Controlled Experiments): The claim that RL and SFT conditions are matched on math-only data requires explicit confirmation that auxiliary components (e.g., reward model, verifier, or sampling strategy) in the RL setup were trained exclusively on the same math-only corpus without additional general-domain exposure or differing optimization steps; without these details the attribution of generalization and drift differences to the tuning method versus training dynamics remains open to the confound noted in the stress-test.
  2. [§3] §3 (Task Suite and Metrics): The broad suite is presented as capturing general capabilities, yet the manuscript provides limited justification or controls for why performance differences can be attributed to tuning method rather than uneven task difficulty, data overlap with math corpora, or model-scale specifics; adding per-task statistical tests or ablation on task selection would make the cross-domain transfer claim more robust.
minor comments (2)
  1. [Figures 3-5] Figure captions and legends for the latent-space and distribution-shift plots should explicitly state the number of samples and the exact distance metrics used to improve reproducibility.
  2. [Abstract and §1] The abstract and introduction use 'forget general capabilities' without a precise operational definition; a short clarification tying it to the specific metrics reported would aid readers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our controlled experiments and strengthen the robustness of our transferability claims. We address each major comment below.

Point-by-point responses
  1. Referee: [§4] §4 (Controlled Experiments): The claim that RL and SFT conditions are matched on math-only data requires explicit confirmation that auxiliary components (e.g., reward model, verifier, or sampling strategy) in the RL setup were trained exclusively on the same math-only corpus without additional general-domain exposure or differing optimization steps; without these details the attribution of generalization and drift differences to the tuning method versus training dynamics remains open to the confound noted in the stress-test.

    Authors: We agree that explicit confirmation is required to rule out confounds. In the revised manuscript we will add a dedicated paragraph (and supporting appendix) that details the construction of the reward model, verifier, and sampling strategy. This section will confirm that all auxiliary components were derived exclusively from the math-only corpus, with no general-domain data and with optimization steps matched to the SFT baseline except for the RL objective itself. These additions will directly support attribution to the tuning method.

    Revision: yes

  2. Referee: [§3] §3 (Task Suite and Metrics): The broad suite is presented as capturing general capabilities, yet the manuscript provides limited justification or controls for why performance differences can be attributed to tuning method rather than uneven task difficulty, data overlap with math corpora, or model-scale specifics; adding per-task statistical tests or ablation on task selection would make the cross-domain transfer claim more robust.

    Authors: We appreciate the suggestion for additional controls. While the task suite draws from standard, widely adopted benchmarks chosen to span distinct capability domains, we will strengthen the manuscript by (1) reporting per-task statistical tests (bootstrap confidence intervals and paired significance tests) on the RL vs. SFT differences and (2) adding a short ablation that repeats the main comparisons on a reduced task subset. We will also include a brief overlap analysis (n-gram and embedding similarity) between the math training data and each evaluation domain. These revisions will make the attribution to tuning method more rigorous without altering the core findings.

    Revision: yes
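The per-task bootstrap confidence intervals mentioned in this response could be computed along the following lines. This is a minimal sketch under stated assumptions: a percentile bootstrap over paired per-example scores, with a hypothetical function name and synthetic 0/1 correctness data. The rebuttal does not specify the exact procedure.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    scores_a and scores_b are per-example scores (e.g. 0/1 correctness) for
    two models on the same examples, in the same order.
    Returns (point_estimate, lower, upper).
    """
    a = np.asarray(scores_a, dtype=np.float64)
    b = np.asarray(scores_b, dtype=np.float64)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("scores must be 1-D arrays of equal length")
    diffs = a - b
    rng = np.random.default_rng(seed)
    # Resample example indices with replacement, keeping each pair together.
    idx = rng.integers(0, diffs.size, size=(num_resamples, diffs.size))
    resampled_means = diffs[idx].mean(axis=1)
    lower, upper = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lower), float(upper)

# Synthetic per-example correctness (hypothetical data, not paper results).
data_rng = np.random.default_rng(1)
rl_correct = (data_rng.random(400) < 0.8).astype(float)
sft_correct = (data_rng.random(400) < 0.4).astype(float)

point, low, high = paired_bootstrap_ci(rl_correct, sft_correct)
print(f"accuracy difference: {point:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Resampling the paired differences, rather than each model's scores separately, keeps example difficulty matched between the two conditions. An interval that excludes zero suggests the gap on that task is unlikely to be resampling noise.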

Circularity Check

0 steps flagged

No circularity in empirical evaluation chain

Full rationale

The paper reports direct empirical measurements of model performance on external benchmarks (math, scientific QA, coding, instruction-following) plus controlled RL-versus-SFT experiments on Qwen3-14B with math-only data, followed by post-hoc analyses of latent representations and token distributions. No equations, fitted parameters, or predictions are presented as derivations; all claims rest on observed differences against held-out tasks rather than reducing to self-referential inputs or self-citation chains. The central contrast between RL preservation and SFT drift is therefore self-contained observational content, not tautological by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about benchmark validity and that observed differences stem from tuning method rather than unmeasured factors.

axioms (1)
  • Domain assumption: standard benchmarks like MATH, AIME, scientific QA, coding, and planning tasks measure distinct and transferable capabilities.
    Invoked when interpreting lack of transfer as evidence against broad generalization.



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

    cs.CL 2026-05 unverdicted novelty 7.0

    DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

  2. SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

    cs.AI 2026-04 unverdicted novelty 7.0

    SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

  3. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  4. Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

    cs.LG 2025-08 unverdicted novelty 7.0

    TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

  5. Rotation-Preserving Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

  6. Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

    cs.CL 2026-04 conditional novelty 6.0

    RL generalizes better than SFT by preserving and slowly evolving a compact set of task-agnostic features from the base model rather than introducing many specialized ones.

  7. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  8. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  9. ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

    cs.LG 2026-01 conditional novelty 6.0

    ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.

  10. Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping

    cs.LG 2026-01 unverdicted novelty 6.0

    ARS shapes reasoning trace representations by clustering states that produce consistent answers and separating those that produce inconsistent ones via latent perturbations, improving plug-and-play hallucination detec...

  11. On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

    cs.LG 2026-01 unverdicted novelty 6.0

    SFT and RL cannot be decoupled in LLM post-training because each step increases the loss or lowers the reward of the prior step under KL and PL analyses.

  12. Understanding Task Transfer in Vision-Language Models

    cs.CV 2025-11 unverdicted novelty 6.0

    Finetuning VLMs on perception tasks produces positive and negative transfers that can be mapped with a new normalized metric called Perfection Gap Factor across 13 tasks and three models.

  13. M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.

  14. Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

    cs.LG 2025-12 unverdicted novelty 5.0

    Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.

  15. Proximal Supervised Fine-Tuning

    cs.LG 2025-08 unverdicted novelty 5.0

    PSFT modifies supervised fine-tuning by incorporating trust-region ideas from RL to constrain policy changes, yielding better out-of-domain generalization in math and human-value tasks without entropy collapse.

  16. Sample-efficient LLM Optimization with Reset Replay

    cs.LG 2025-08 unverdicted novelty 5.0

    LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoni...

  17. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  18. A Survey of Reinforcement Learning for Large Reasoning Models

    cs.CL 2025-09 accept novelty 3.0

    A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
