Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

arxiv: 2507.00432 · v2 · pith:4B5WMF72 · submitted 2025-07-01 · 💻 cs.AI · cs.CL

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue

Pith reviewed 2026-05-19 00:55 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords LLM reasoning, transferability, supervised fine-tuning, reinforcement learning, general capabilities, representation drift, math reasoning, post-training

The pith

Reinforcement learning on math data preserves general LLM capabilities while supervised fine-tuning erodes them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether strong math reasoning in large language models reflects genuine broader problem-solving gains or narrow specialization. Broad evaluations of existing models show limited transfer to science, coding, planning and instruction tasks. Controlled experiments on one base model using only math data then isolate the tuning method as the key variable: reinforcement learning maintains performance across domains while supervised fine-tuning produces forgetting. Internal analyses trace the difference to large shifts in latent representations and output distributions under supervised fine-tuning, shifts that reinforcement learning largely avoids.

Core claim

When the same base model is trained on identical math-only data, reinforcement learning versions retain strong performance on scientific QA, agent planning, coding and instruction-following, whereas supervised fine-tuning versions lose these general capabilities. Latent-space and token-distribution measurements show that supervised fine-tuning produces substantial drift away from the model's original general-domain structure while reinforcement learning largely preserves that structure.

What carries the argument

Direct comparison of reinforcement learning versus supervised fine-tuning on fixed math reasoning data, with latent representation similarity and output token distribution shift as the metrics that quantify preservation or loss of general capabilities.
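The output token distribution shift mentioned above can be illustrated with a small sketch. It assumes the shift is measured as the average per-position KL divergence between the tuned model's and the base model's next-token distributions; this page does not name the exact metric, so treat KL as an assumption. The function names and the toy logits are hypothetical and are not results from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_token_shift(base_logits, tuned_logits):
    """Average per-position KL(tuned || base) over a sequence of logit vectors."""
    kls = [
        kl_divergence(softmax(t), softmax(b))
        for b, t in zip(base_logits, tuned_logits)
    ]
    return sum(kls) / len(kls)

# Toy logits for 2 positions over a 3-token vocabulary (illustrative values only).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]
small_shift = [[2.1, 1.0, 0.1], [0.5, 0.6, 0.5]]   # tuned model stays near the base
large_shift = [[0.1, 1.0, 3.0], [3.0, 0.5, 0.1]]   # tuned model moves far from the base

print(mean_token_shift(base, small_shift))
print(mean_token_shift(base, large_shift))
```

In practice the two logit sequences would come from running the base and tuned checkpoints on the same prompts and continuations, so that positions line up one to one.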

If this is right

Reasoning models trained with reinforcement learning on math data should retain higher performance on non-math tasks than those trained with supervised fine-tuning.
Post-training pipelines that rely heavily on supervised fine-tuning of distilled reasoning data are likely to produce models with reduced general capabilities.
Standard math benchmark gains alone are unreliable indicators of overall model improvement because transfer to other domains is weak under supervised fine-tuning.
Latent representation stability during training is a practical diagnostic for whether a reasoning model will keep its general abilities.
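As a rough illustration of the diagnostic in the last point, the sketch below scores representation drift as the mean cosine distance between hidden-state vectors from the base model and the tuned model on the same inputs. Cosine distance is an assumption made for this sketch; the page does not specify which similarity measure the paper uses. `representation_drift` and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def representation_drift(base_states, tuned_states):
    """Mean (1 - cosine similarity) over paired hidden-state vectors.

    Each list holds one vector per input, taken from the same layer and
    token position in the base and tuned models.
    """
    if len(base_states) != len(tuned_states) or not base_states:
        raise ValueError("need two non-empty lists of equal length")
    drifts = [1.0 - cosine_similarity(b, t) for b, t in zip(base_states, tuned_states)]
    return sum(drifts) / len(drifts)

# Toy 3-dimensional hidden states for two inputs (illustrative values only).
base_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tuned_states = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.3]]
print(representation_drift(base_states, tuned_states))
```

Tracking this number per checkpoint on a fixed set of general-domain prompts would give a cheap signal during training, under the assumption that the chosen layer and pooling are representative.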

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid training schedules that begin with supervised fine-tuning for rapid math gains and then switch to reinforcement learning could reduce drift while controlling compute cost.
The same stability difference between the two tuning methods may appear when the narrow domain is coding or science rather than math.
Developers evaluating new reasoning models should include representation-drift measurements alongside benchmark scores to predict generalization.

Load-bearing premise

The premise is that the chosen mix of math, scientific QA, agent planning, coding and instruction tasks is broad enough to stand in for general capabilities, and that the observed differences trace to the tuning method rather than to data composition or scale details.

What would settle it

A new training run on the same base model and math data in which supervised fine-tuning is modified to explicitly limit representation drift and is then shown to retain general-task performance at levels comparable to reinforcement learning.
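One way such a modified supervised fine-tuning run could be set up, offered only as an illustrative assumption and not as a method from the paper, is to add a penalty that keeps the tuned model close to the frozen base model. Here $\pi_\theta$ is the model being tuned, $\pi_{\mathrm{ref}}$ is the base model, $\mathcal{D}$ is the math-only training set, and $\beta$ is a hypothetical penalty weight; none of these symbols appear in the source.

```latex
\mathcal{L}(\theta)
  = \underbrace{-\,\mathbb{E}_{(x,y)\sim\mathcal{D}}
      \big[\log \pi_\theta(y \mid x)\big]}_{\text{standard SFT loss}}
  \;+\; \beta\,\mathbb{E}_{(x,y)\sim\mathcal{D}}
      \Big[\tfrac{1}{|y|}\sum_{t=1}^{|y|}
        D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x, y_{<t})
          \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x, y_{<t})\big)\Big]
```

The penalty acts on output distributions, which is only a proxy for drift in latent representations; a test of the claim would still need to measure the representations directly.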

Original abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RL on math-only data transfers to other domains better than SFT does, with representation analysis showing less drift in the RL case.

The main thing to know is that this paper finds RL-tuned models keep general capabilities when trained only on math data, while SFT versions often lose them. The authors link the difference to smaller shifts in latent representations and token distributions under RL. They first evaluate more than 20 open models and see that math gains rarely carry over to scientific QA, coding, planning, or instruction tasks. Then they run controlled experiments on Qwen3-14B using the same math data but switching the tuning method. RL versions generalize across the suite; SFT versions show larger representation drift and output changes.

The controlled setup on one base model with fixed data, together with the shift measurements, is the clearest new part. It improves on earlier transfer studies by trying to isolate the tuning method and by adding internal analysis to explain the pattern. The broad initial evaluation across models also gives the claim some grounding.

The soft spot is whether the RL and SFT runs are matched on everything except the objective. RL setups usually include a reward model or verifier, and if that component saw extra data, different steps, or auxiliary objectives, the generalization gap could partly reflect those factors rather than RL itself. The abstract says math-only for both, but the training logs and exact auxiliary components need checking to confirm there is no confound. Minor gaps in reporting statistical controls and precise task metrics could also be filled.

This paper is aimed at people who design post-training for reasoning models and want to avoid breaking other skills. Readers focused on generalization from narrow domains or on RL versus SFT tradeoffs will get direct value from the comparison and analysis. It has enough empirical substance and a plausible mechanism to deserve a serious referee, even if the controls need some tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that math reasoning gains in LLMs largely fail to transfer to other domains such as scientific QA, agent planning, coding, and instruction-following. Evaluations across more than 20 open-weight models show limited generalization, while controlled experiments on Qwen3-14B using math-only data demonstrate that RL-tuned models preserve general capabilities and latent-domain structure better than SFT-tuned models, which induce substantial representation and output drift.

Significance. If the central empirical findings hold after addressing controls, the work is significant for post-training research: it provides evidence favoring RL over SFT for reasoning improvements without sacrificing breadth, supported by both large-scale model evaluations and mechanistic analyses of latent spaces and token distributions. The scale of the model survey and the focus on transferability rather than benchmark chasing are strengths.

major comments (2)

[§4] §4 (Controlled Experiments): The claim that RL and SFT conditions are matched on math-only data requires explicit confirmation that auxiliary components (e.g., reward model, verifier, or sampling strategy) in the RL setup were trained exclusively on the same math-only corpus without additional general-domain exposure or differing optimization steps; without these details the attribution of generalization and drift differences to the tuning method versus training dynamics remains open to the confound noted in the stress-test.
[§3] §3 (Task Suite and Metrics): The broad suite is presented as capturing general capabilities, yet the manuscript provides limited justification or controls for why performance differences can be attributed to tuning method rather than uneven task difficulty, data overlap with math corpora, or model-scale specifics; adding per-task statistical tests or ablation on task selection would make the cross-domain transfer claim more robust.
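The data-overlap concern in the second comment can be made concrete with a simple check. The sketch below computes the fraction of an evaluation item's word n-grams that also occur in the training texts. Whitespace tokenization, the default `n=3`, and the function names are assumptions chosen for illustration; they are not taken from the paper or the referee report.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a text, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(eval_text, train_texts, n=3):
    """Fraction of the evaluation item's n-grams that appear in any training text."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0  # item is shorter than n tokens, so nothing can overlap
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return len(eval_grams & train_grams) / len(eval_grams)

# Toy example (made-up strings, not benchmark data).
train = ["solve for x in the equation two x plus three equals seven"]
item = "solve for x in the inequality below"
print(overlap_fraction(item, train, n=3))
```

A high fraction for a supposedly non-math benchmark item would suggest that part of any measured transfer could come from shared surface text rather than from the tuning method.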

minor comments (2)

[Figures 3-5] Figure captions and legends for the latent-space and distribution-shift plots should explicitly state the number of samples and the exact distance metrics used to improve reproducibility.
[Abstract and §1] The abstract and introduction use 'forget general capabilities' without a precise operational definition; a short clarification tying it to the specific metrics reported would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our controlled experiments and strengthen the robustness of our transferability claims. We address each major comment below.

Referee: [§4] §4 (Controlled Experiments): The claim that RL and SFT conditions are matched on math-only data requires explicit confirmation that auxiliary components (e.g., reward model, verifier, or sampling strategy) in the RL setup were trained exclusively on the same math-only corpus without additional general-domain exposure or differing optimization steps; without these details the attribution of generalization and drift differences to the tuning method versus training dynamics remains open to the confound noted in the stress-test.

Authors: We agree that explicit confirmation is required to rule out confounds. In the revised manuscript we will add a dedicated paragraph (and supporting appendix) that details the construction of the reward model, verifier, and sampling strategy. This section will confirm that all auxiliary components were derived exclusively from the math-only corpus, with no general-domain data and with optimization steps matched to the SFT baseline except for the RL objective itself. These additions will directly support attribution to the tuning method.

revision: yes

Referee: [§3] §3 (Task Suite and Metrics): The broad suite is presented as capturing general capabilities, yet the manuscript provides limited justification or controls for why performance differences can be attributed to tuning method rather than uneven task difficulty, data overlap with math corpora, or model-scale specifics; adding per-task statistical tests or ablation on task selection would make the cross-domain transfer claim more robust.

Authors: We appreciate the suggestion for additional controls. While the task suite draws from standard, widely adopted benchmarks chosen to span distinct capability domains, we will strengthen the manuscript by (1) reporting per-task statistical tests (bootstrap confidence intervals and paired significance tests) on the RL vs. SFT differences and (2) adding a short ablation that repeats the main comparisons on a reduced task subset. We will also include a brief overlap analysis (n-gram and embedding similarity) between the math training data and each evaluation domain. These revisions will make the attribution to tuning method more rigorous without altering the core findings.

revision: yes
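The bootstrap confidence intervals promised in this response could look roughly like the following sketch. It resamples per-item score differences between two models evaluated on the same items and reports a 95% percentile interval. The function name, the resample count, the fixed seed, and the toy scores are all assumptions for illustration; the rebuttal does not specify the procedure.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean difference, CI low, CI high) for a - b on paired items.

    Uses a percentile bootstrap with a 95% interval over the per-item
    differences, so both score lists must refer to the same items in order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return observed, low, high

# Toy per-item correctness (1 = correct, 0 = wrong); illustrative values only.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
model_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
print(paired_bootstrap(model_a, model_b))
```

If the interval excludes zero, the difference on that task is unlikely to be resampling noise, although with many tasks a correction for multiple comparisons would still be needed.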

Circularity Check

0 steps flagged

No circularity in empirical evaluation chain

The paper reports direct empirical measurements of model performance on external benchmarks (math, scientific QA, coding, instruction-following) plus controlled RL-versus-SFT experiments on Qwen3-14B with math-only data, followed by post-hoc analyses of latent representations and token distributions. No equations, fitted parameters, or predictions are presented as derivations; all claims rest on observed differences against held-out tasks rather than reducing to self-referential inputs or self-citation chains. The central contrast between RL preservation and SFT drift is therefore self-contained observational content, not tautological by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about benchmark validity and that observed differences stem from tuning method rather than unmeasured factors.

axioms (1)

domain assumption: Standard benchmarks like MATH, AIME, scientific QA, coding, and planning tasks measure distinct and transferable capabilities.
Invoked when interpreting lack of transfer as evidence against broad generalization.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
cs.CL 2026-05 unverdicted novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
cs.AI 2026-04 unverdicted novelty 7.0

SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG 2026-01 unverdicted novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
cs.LG 2025-08 unverdicted novelty 7.0

TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

Rotation-Preserving Supervised Fine-Tuning
cs.LG 2026-05 unverdicted novelty 6.0

RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
cs.CL 2026-04 conditional novelty 6.0

RL generalizes better than SFT by preserving and slowly evolving a compact set of task-agnostic features from the base model rather than introducing many specialized ones.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
cs.LG 2026-04 unverdicted novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

Characterizing Model-Native Skills
cs.AI 2026-04 conditional novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
cs.LG 2026-01 conditional novelty 6.0

ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.

Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
cs.LG 2026-01 unverdicted novelty 6.0

ARS shapes reasoning trace representations by clustering states that produce consistent answers and separating those that produce inconsistent ones via latent perturbations, improving plug-and-play hallucination detec...

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
cs.LG 2026-01 unverdicted novelty 6.0

SFT and RL cannot be decoupled in LLM post-training because each step increases the loss or lowers the reward of the prior step under KL and PL analyses.

Understanding Task Transfer in Vision-Language Models
cs.CV 2025-11 unverdicted novelty 6.0

Finetuning VLMs on perception tasks produces positive and negative transfers that can be mapped with a new normalized metric called Perfection Gap Factor across 13 tasks and three models.

M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models
cs.AI 2026-05 unverdicted novelty 5.0

M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
cs.LG 2025-12 unverdicted novelty 5.0

Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.

Proximal Supervised Fine-Tuning
cs.LG 2025-08 unverdicted novelty 5.0

PSFT modifies supervised fine-tuning by incorporating trust-region ideas from RL to constrain policy changes, yielding better out-of-domain generalization in math and human-value tasks without entropy collapse.

Sample-efficient LLM Optimization with Reset Replay
cs.LG 2025-08 unverdicted novelty 5.0

LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoni...

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL 2025-09 accept novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 18 Pith papers · 55 internal anchors

[1]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page
[2]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page
[3]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[4]

Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation , year =

Bespoke Labs , howpublished =. Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation , year =

work page
[5]

Temporal Sampling for Forgotten Reasoning in LLMs , url =

Yuetai Li and Zhangchen Xu and Fengqing Jiang and Bhaskar Ramasubramanian and Luyao Niu and Bill Yuchen Lin and Xiang Yue and Radha Poovendran , journal =. Temporal Sampling for Forgotten Reasoning in LLMs , url =

work page
[6]

TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning , url =

Zhangchen Xu and Yuetai Li and Fengqing Jiang and Bhaskar Ramasubramanian and Luyao Niu and Bill Yuchen Lin and Radha Poovendran , journal =. TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning , url =

work page
[7]

VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL , url =

Yichen Feng and Zhangchen Xu and Fengqing Jiang and Yuetai Li and Bhaskar Ramasubramanian and Luyao Niu and Bill Yuchen Lin and Radha Poovendran , journal =. VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL , url =

work page
[8]

CoRR , volume=

Yuetai Li and Xiang Yue and Zhangchen Xu and Fengqing Jiang and Luyao Niu and Bill Yuchen Lin and Bhaskar Ramasubramanian and Radha Poovendran , title=. CoRR , volume=. 2025 , month=

work page 2025
[9]

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning , url =

Bill Yuchen Lin and Abhilasha Ravichander and Ximing Lu and Nouha Dziri and Melanie Sclar and Khyathi Chandu and Chandra Bhagavatula and Yejin Choi , journal =. The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning , url =

work page
[10]

LIMO: Less is More for Reasoning , url =

Ye, Yixin and Huang, Zhen and Xiao, Yang and Chern, Ethan and Xia, Shijie and Liu, Pengfei , journal =. LIMO: Less is More for Reasoning , url =

work page
[11]

OpenThoughts: Data Recipes for Reasoning Models , url =

Guha, Etash and Marten, Ryan and Keh, Sedrick and Raoof, Negin and Smyrnis, Georgios and Bansal, Hritik and Nezhurina, Marianna and Mercat, Jean and Vu, Trung and Sprague, Zayne and others , journal =. OpenThoughts: Data Recipes for Reasoning Models , url =

work page
[12]

Liu, and Matt Gardner

Welbl, Johannes and Liu, Nelson F. and Gardner, Matt , booktitle =. Crowdsourcing Multiple Choice Science Questions , url =. doi:10.18653/v1/W17-4413 , pages =

work page doi:10.18653/v1/w17-4413
[13]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , url =

Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish , booktitle =. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , url =. doi:10.18653/v1/D18-1260 , pages =

work page doi:10.18653/v1/d18-1260
[14]

The 2023 Conference on Empirical Methods in Natural Language Processing , year=

HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models , author=. The 2023 Conference on Empirical Methods in Natural Language Processing , year=

work page 2023
[15]

Reddy, Siva and Chen, Danqi and Manning, Christopher D. , doi =. Transactions of the Association for Computational Linguistics , pages =

work page
[16]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , doi =

Vilares, David and G. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , doi =

work page
[17]

General-reasoner: Advancing llm reasoning across all domains , url =

Ma, Xueguang and Liu, Qian and Jiang, Dongfu and Zhang, Ge and Ma, Zejun and Chen, Wenhu , journal =. General-reasoner: Advancing llm reasoning across all domains , url =

work page
[18]

Proceedings of the Twentieth European Conference on Computer Systems , pages =

Sheng, Guangming and Zhang, Chi and Ye, Zilingfeng and Wu, Xibin and Zhang, Wang and Zhang, Ru and Peng, Yanghua and Lin, Haibin and Wu, Chuan , title =. Proceedings of the Twentieth European Conference on Computer Systems , pages =. 2025 , isbn =. doi:10.1145/3689031.3696075 , abstract =

work page doi:10.1145/3689031.3696075 2025
[19]

L lama F actory: Unified Efficient Fine-Tuning of 100+ Language Models

Zheng, Yaowei and Zhang, Richong and Zhang, Junhao and Ye, Yanhan and Luo, Zheyan. L lama F actory: Unified Efficient Fine-Tuning of 100+ Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2024. doi:10.18653/v1/2024.acl-demos.38

work page doi:10.18653/v1/2024.acl-demos.38 2024
[20]

Hwang and Soumya Sanyal and Sean Welleck and Xiang Ren and Allyson Ettinger and Za

Nouha Dziri and Ximing Lu and Melanie Sclar and Xiang Lorraine Li and Liwei Jian and Bill Yuchen Lin and Peter West and Chandra Bhagavatula and Ronan Le Bras and Jena D. Hwang and Soumya Sanyal and Sean Welleck and Xiang Ren and Allyson Ettinger and Za. Advances in Neural Information Processing Systems , title =

work page
[21]

doi:10.18653/v1/D19-1332 , pages =

Zhou, Ben and Khashabi, Daniel and Ning, Qiang and Roth, Dan , booktitle =. doi:10.18653/v1/D19-1332 , pages =

work page doi:10.18653/v1/d19-1332
[22]

2025 , url =

Raoof, Negin and Guha, Etash Kumar and Marten, Ryan and Mercat, Jean and Frankel, Eric and Keh, Sedrick and Bansal, Hritik and Smyrnis, Georgios and Nezhurina, Marianna and Vu, Trung and Sprague, Zayne Rea and Merrill, Mike A and Chen, Liangyu and Choi, Caroline and Khan, Zaid and Grover, Sachin and Feuer, Benjamin and Suvarna, Ashima and Su, Shiye and Zh...

work page 2025
[23]

The Language Model Evaluation Harness , url =

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

work page
[24]

Acpbench: Reasoning about action, change, and planning , url =

Kokel, Harsha and Katz, Michael and Srinivas, Kavitha and Sohrabi, Shirin , booktitle =. Acpbench: Reasoning about action, change, and planning , url =

work page
[25]

Wong and Rui Wang , booktitle=

Yiming Wang and Pei Zhang and Baosong Yang and Derek F. Wong and Rui Wang , booktitle=. Latent Space Chain-of-Embedding Enables Output-free. 2025 , url=

work page 2025
[26]

ArXiv preprint , title =

Zhou, Lexin and Pacchiardi, Lorenzo and Mart. ArXiv preprint , title =

work page
[27]

Good Idea or Not, Representation of LLM Could Tell , url =

Xu, Yi and Xue, Bo and Sheng, Shuqian and Deng, Cheng and Ding, Jiaxin and Shen, Zanwei and Fu, Luoyi and Wang, Xinbing and Zhou, Chenghu , journal =. Good Idea or Not, Representation of LLM Could Tell , url =

work page
[28]

R ep E val: Effective Text Evaluation with LLM Representation

Sheng, Shuqian and Xu, Yi and Zhang, Tianhang and Shen, Zanwei and Fu, Luoyi and Ding, Jiaxin and Zhou, Lei and Gan, Xiaoying and Wang, Xinbing and Zhou, Chenghu. R ep E val: Effective Text Evaluation with LLM Representation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.398

work page doi:10.18653/v1/2024.emnlp-main.398 2024
[29]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI Team , year=. 2501.12948 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

2025 , eprint=

Efficient Test-Time Scaling via Self-Calibration , author=. 2025 , eprint=

work page 2025
[31]

2025 , eprint=

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? , author=. 2025 , eprint=

work page 2025
[32]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and Yu Yue and Tiantian Fan and Gaohong Liu and Lingjun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and Jiangjie Chen and Chengyi Wang and Hongli ...

work page internal anchor Pith review Pith/arXiv arXiv
[33]

2025 , url =

DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL , author=. 2025 , url =

work page 2025
[34]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

work page 2024
[35]

2024 , url=

Xiang Yue and Tianyu Zheng and Ge Zhang and Wenhu Chen , booktitle=. 2024 , url=

work page 2024
[36]

2024 , eprint=

Gemini: A Family of Highly Capable Multimodal Models , author=. 2024 , eprint=

work page 2024
[37]

2024 , eprint=

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement , author=. 2024 , eprint=

work page 2024
[38]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

work page 2025
[39]

2024 , url=

Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen , booktitle=. 2024 , url=

work page 2024
[40]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

M-A-P , year=. 2502.14739 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Advances in Neural Information Processing Systems , editor=

Solving Quantitative Reasoning Problems with Language Models , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

work page 2022
[42]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

work page 2021
[43]

GPT-4o System Card

OpenAI , year=. 2410.21276 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

2025 , eprint=

s1: Simple test-time scaling , author=. 2025 , eprint=

work page 2025
[45]

2025 , eprint=

BIG-Bench Extra Hard , author=. 2025 , eprint=

work page 2025
[46]

2022 , eprint=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. 2022 , eprint=

work page 2022
[47]

2024 , eprint=

OpenAI o1 System Card , author=. 2024 , eprint=

work page 2024
[48]

2022 , eprint=

Training language models to follow instructions with human feedback , author=. 2022 , eprint=

work page 2022
[49]

2504.13941 , archivePrefix=

Syeda Nahida Akter and Shrimai Prabhumoye and Matvei Novikov and Seungju Han and Ying Lin and Evelina Bakhturina and Eric Nyberg and Yejin Choi and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro , year=. 2504.13941 , archivePrefix=

work page arXiv
[50]

2025 , eprint=

Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains , author=. 2025 , eprint=

work page 2025
[51] Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia. TheoremQA: A Theorem-driven Question Answering Dataset. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.489
[52] Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model. 2025.
[53] MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining. 2025.
[54] Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica. DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level.
[55] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving. Fine-tuning language models from human preferences.
[56] Jie Chen, Xintian Han, Yu Ma, Xun Zhou, Liang Xiang. Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models.
[57] Renxi Wang, Haonan Li, Minghao Wu, Yuxia Wang, Xudong Han, Chiyu Zhang, Timothy Baldwin. 2024.
[58] Elita Lobo, Chirag Agarwal, Himabindu Lakkaraju. On the impact of fine-tuning on chain-of-thought reasoning.
[59] Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.
[60] Yuxiang Zhang, Yuqi Yang, Jiangming Shu, Yuhang Wang, Jinlin Xiao, Jitao Sang. OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning.
[61] Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs. 2025.
[62] Towards A Unified Agent with Foundation Models. 2023.
[63] Reinforcement learning: An introduction. A Bradford Book.
[64] Martin L. Puterman.
[65] Paolo Dell’Aversana.
[66] Andreas Junghanns, Jonathan Schaeffer. Sokoban: Enhancing general single-agent search methods using domain knowledge. 2001. doi:10.1016/S0004-3702(01)00109-6
[67] Training Language Models to Self-Correct via Reinforcement Learning. 2024.
[68] Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF. 2024.
[69] Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998.
[70] Finite-time Analysis of the Multiarmed Bandit Problem. 2002.
[71] OpenAI Gym. 2016.
[72] STaR: Bootstrapping Reasoning With Reasoning. 2022.
[73] LoRA: Low-Rank Adaptation of Large Language Models. 2021.
[74] Understanding R1-Zero-Like Training: A Critical Perspective. 2025.
[75] Proximal Policy Optimization Algorithms. 2017.
[76] Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents. 2025.
[77] Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency. 2024.
[78] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. 2025.
[79] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning. 2025.
[80] An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents. 2025.

Showing first 80 references.
