Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Pith reviewed 2026-05-19 00:55 UTC · model grok-4.3
The pith
Reinforcement learning on math data preserves general LLM capabilities while supervised fine-tuning erodes them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the same base model is trained on identical math-only data, the variants tuned with reinforcement learning retain strong performance on scientific QA, agent planning, coding, and instruction-following, whereas the variants tuned with supervised fine-tuning often lose these general capabilities. Latent-space and token-distribution measurements show that supervised fine-tuning produces substantial drift away from the model's original general-domain structure, while reinforcement learning largely preserves that structure.
What carries the argument
Direct comparison of reinforcement learning versus supervised fine-tuning on fixed math reasoning data, with latent representation similarity and output token distribution shift as the metrics that quantify preservation or loss of general capabilities.
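The output token distribution shift mentioned above can be illustrated with a small sketch. This page does not state the exact distance the paper uses, so the code assumes KL divergence between the tuned and base models' next-token distributions, averaged over positions. The function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def mean_token_kl(base_logits, tuned_logits):
    """Average per-position KL(tuned || base) over aligned logit vectors.

    Assumption: both models were run on the same token sequence, so
    position i in one list corresponds to position i in the other.
    """
    if len(base_logits) != len(tuned_logits) or not base_logits:
        raise ValueError("need two non-empty, equally long sequences")
    kls = [kl_divergence(softmax(t), softmax(b))
           for b, t in zip(base_logits, tuned_logits)]
    return sum(kls) / len(kls)

# Toy logits over a 3-token vocabulary at two positions (illustrative only).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]
close = [[2.1, 1.0, 0.1], [0.5, 0.6, 0.5]]   # small shift from base
far = [[0.1, 1.0, 3.0], [3.0, 0.0, 0.0]]     # large shift from base

print("small-shift KL:", mean_token_kl(base, close))
print("large-shift KL:", mean_token_kl(base, far))
```

A lower average KL on general-domain prompts would be read as better preservation of the base model's output behavior. The direction of the divergence (tuned against base) is a choice made for this sketch.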
If this is right
- Reasoning models trained with reinforcement learning on math data should retain higher performance on non-math tasks than those trained with supervised fine-tuning.
- Post-training pipelines that rely heavily on supervised fine-tuning of distilled reasoning data are likely to produce models with reduced general capabilities.
- Standard math benchmark gains alone are unreliable indicators of overall model improvement because transfer to other domains is weak under supervised fine-tuning.
- Latent representation stability during training is a practical diagnostic for whether a reasoning model will keep its general abilities.
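As a rough illustration of the representation-stability diagnostic in the last point, the sketch below scores drift as one minus the mean cosine similarity between hidden states of the base and tuned models on the same inputs. This is an assumed stand-in, not the paper's metric, and it only makes sense when the tuned model was initialized from the base model so that both share a coordinate basis. The names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)

def representation_drift(base_states, tuned_states):
    """Return 1 - mean cosine similarity over paired hidden-state vectors.

    0.0 means the tuned representations point in the same directions as
    the base ones; larger values mean more drift.
    """
    if len(base_states) != len(tuned_states) or not base_states:
        raise ValueError("need two non-empty, equally long lists of vectors")
    sims = [cosine(b, t) for b, t in zip(base_states, tuned_states)]
    return 1.0 - sum(sims) / len(sims)

# Toy 2-dimensional hidden states for two inputs (illustrative only).
base_states = [[1.0, 0.0], [0.0, 1.0]]
mild_shift = [[0.9, 0.1], [0.1, 0.9]]
strong_shift = [[0.0, 1.0], [1.0, 0.0]]

print("mild drift:", representation_drift(base_states, mild_shift))
print("strong drift:", representation_drift(base_states, strong_shift))
```

In practice the vectors would come from a fixed layer of each checkpoint, evaluated on a held-out set of general-domain prompts and tracked over training steps.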
Where Pith is reading between the lines
- Hybrid training schedules that begin with supervised fine-tuning for rapid math gains and then switch to reinforcement learning could reduce drift while controlling compute cost.
- The same stability difference between the two tuning methods may appear when the narrow domain is coding or science rather than math.
- Developers evaluating new reasoning models should include representation-drift measurements alongside benchmark scores to predict generalization.
Load-bearing premise
That the chosen mix of math, scientific QA, agent planning, coding, and instruction tasks is broad enough to stand in for general capabilities, and that the observed differences trace to the tuning method rather than to data composition or scale details.
What would settle it
A new training run on the same base model and math data in which supervised fine-tuning is modified to explicitly limit representation drift and is then shown to retain general-task performance at levels comparable to reinforcement learning.
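One hypothetical way to write such a drift-limited objective is to add a penalty toward the base model to the standard SFT loss. The form below is an assumption made for illustration, not something the paper proposes: $\pi_\theta$ is the model being tuned, $\pi_{\text{base}}$ the frozen base model, $\mathcal{D}_{\text{math}}$ the math-only training set, and $\beta \ge 0$ a penalty weight.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{math}}}
      \Big[\sum_{t} \log \pi_\theta\big(y_t \mid x, y_{<t}\big)\Big]
  \;+\; \beta\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{math}}}
      \Big[\sum_{t} \mathrm{KL}\big(\pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_{\text{base}}(\cdot \mid x, y_{<t})\big)\Big]
```

Note that this penalty constrains output-distribution drift directly; limiting latent-representation drift would need a separate term on hidden states. Setting $\beta = 0$ recovers plain SFT.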
Original abstract
Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that math reasoning gains in LLMs largely fail to transfer to other domains such as scientific QA, agent planning, coding, and instruction-following. Evaluations across more than 20 open-weight models show limited generalization, while controlled experiments on Qwen3-14B using math-only data demonstrate that RL-tuned models preserve general capabilities and latent-domain structure better than SFT-tuned models, which induce substantial representation and output drift.
Significance. If the central empirical findings hold after addressing controls, the work is significant for post-training research: it provides evidence favoring RL over SFT for reasoning improvements without sacrificing breadth, supported by both large-scale model evaluations and mechanistic analyses of latent spaces and token distributions. The scale of the model survey and the focus on transferability rather than benchmark chasing are strengths.
major comments (2)
- [§4] §4 (Controlled Experiments): The claim that RL and SFT conditions are matched on math-only data requires explicit confirmation that auxiliary components (e.g., reward model, verifier, or sampling strategy) in the RL setup were trained exclusively on the same math-only corpus without additional general-domain exposure or differing optimization steps; without these details the attribution of generalization and drift differences to the tuning method versus training dynamics remains open to the confound noted in the stress-test.
- [§3] §3 (Task Suite and Metrics): The broad suite is presented as capturing general capabilities, yet the manuscript provides limited justification or controls for why performance differences can be attributed to tuning method rather than uneven task difficulty, data overlap with math corpora, or model-scale specifics; adding per-task statistical tests or ablation on task selection would make the cross-domain transfer claim more robust.
minor comments (2)
- [Figures 3-5] Figure captions and legends for the latent-space and distribution-shift plots should explicitly state the number of samples and the exact distance metrics used to improve reproducibility.
- [Abstract and §1] The abstract and introduction use 'forget general capabilities' without a precise operational definition; a short clarification tying it to the specific metrics reported would aid readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our controlled experiments and strengthen the robustness of our transferability claims. We address each major comment below.
- Referee: [§4] §4 (Controlled Experiments): The claim that RL and SFT conditions are matched on math-only data requires explicit confirmation that auxiliary components (e.g., reward model, verifier, or sampling strategy) in the RL setup were trained exclusively on the same math-only corpus without additional general-domain exposure or differing optimization steps; without these details the attribution of generalization and drift differences to the tuning method versus training dynamics remains open to the confound noted in the stress-test.
  Authors: We agree that explicit confirmation is required to rule out confounds. In the revised manuscript we will add a dedicated paragraph (and supporting appendix) that details the construction of the reward model, verifier, and sampling strategy. This section will confirm that all auxiliary components were derived exclusively from the math-only corpus, with no general-domain data and with optimization steps matched to the SFT baseline except for the RL objective itself. These additions will directly support attribution to the tuning method.
  Revision: yes
- Referee: [§3] §3 (Task Suite and Metrics): The broad suite is presented as capturing general capabilities, yet the manuscript provides limited justification or controls for why performance differences can be attributed to tuning method rather than uneven task difficulty, data overlap with math corpora, or model-scale specifics; adding per-task statistical tests or ablation on task selection would make the cross-domain transfer claim more robust.
  Authors: We appreciate the suggestion for additional controls. While the task suite draws from standard, widely adopted benchmarks chosen to span distinct capability domains, we will strengthen the manuscript by (1) reporting per-task statistical tests (bootstrap confidence intervals and paired significance tests) on the RL vs. SFT differences and (2) adding a short ablation that repeats the main comparisons on a reduced task subset. We will also include a brief overlap analysis (n-gram and embedding similarity) between the math training data and each evaluation domain. These revisions will make the attribution to tuning method more rigorous without altering the core findings.
  Revision: yes
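The per-task bootstrap confidence intervals mentioned in the response could look roughly like the following. This is a minimal sketch of a paired percentile bootstrap on per-example score differences, assuming both models were scored on the same examples. The function name, the resample count, and the toy 0/1 scores are hypothetical and not taken from the paper.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    scores_a[i] and scores_b[i] must be scores of two models on the
    same example i, so that resampling keeps the pairing intact.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample example indices with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper

# Toy per-example correctness on one task (illustrative only):
# model A solves 80 of 100 examples, model B solves 50 of the same 100.
rl_scores = [1] * 80 + [0] * 20
sft_scores = [1] * 50 + [0] * 50

observed, lower, upper = paired_bootstrap_ci(rl_scores, sft_scores)
print(f"mean difference {observed:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero on a given task would support a real difference between the two tuning methods on that task; running this per task gives the breakdown the referee asks for.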
Circularity Check
No circularity in empirical evaluation chain
The paper reports direct empirical measurements of model performance on external benchmarks (math, scientific QA, coding, instruction-following) plus controlled RL-versus-SFT experiments on Qwen3-14B with math-only data, followed by post-hoc analyses of latent representations and token distributions. No equations, fitted parameters, or predictions are presented as derivations; all claims rest on observed differences against held-out tasks rather than reducing to self-referential inputs or self-citation chains. The central contrast between RL preservation and SFT drift is therefore self-contained observational content, not tautological by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard benchmarks like MATH, AIME, scientific QA, coding, and planning tasks measure distinct and transferable capabilities.
Forward citations
Cited by 18 Pith papers
- Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
  DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
- SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
  SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
  A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
- Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
  TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
- Rotation-Preserving Supervised Fine-Tuning
  RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
- Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
  RL generalizes better than SFT by preserving and slowly evolving a compact set of task-agnostic features from the base model rather than introducing many specialized ones.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Characterizing Model-Native Skills
  Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
  ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.
- Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
  ARS shapes reasoning trace representations by clustering states that produce consistent answers and separating those that produce inconsistent ones via latent perturbations, improving plug-and-play hallucination detec...
- On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
  SFT and RL cannot be decoupled in LLM post-training because each step increases the loss or lowers the reward of the prior step under KL and PL analyses.
- Understanding Task Transfer in Vision-Language Models
  Finetuning VLMs on perception tasks produces positive and negative transfers that can be mapped with a new normalized metric called Perfection Gap Factor across 13 tasks and three models.
- M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models
  M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.
- Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
  Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
- Proximal Supervised Fine-Tuning
  PSFT modifies supervised fine-tuning by incorporating trust-region ideas from RL to constrain policy changes, yielding better out-of-domain generalization in math and human-value tasks without entropy collapse.
- Sample-efficient LLM Optimization with Reset Replay
  LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoni...
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- A Survey of Reinforcement Learning for Large Reasoning Models
  A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.