Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
Pith reviewed 2026-05-16 22:39 UTC · model grok-4.3
The pith
Sequential SFT followed by RL reaches a higher performance ceiling than synchronized training: SFT first locks in a stable foundation, and RL then exploits the plasticity that remains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Plasticity-Ceiling Framework decomposes the final performance ceiling into foundational SFT performance and subsequent RL plasticity. The sequential SFT-then-RL pipeline is superior to synchronized approaches because it avoids their stability deficits and premature convergence. Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the ceiling; data scale determines the primary post-training potential while trajectory difficulty acts as a multiplier; and the minimum validation loss during SFT serves as a reliable indicator for selecting expert trajectories that maximize the ultimate ceiling.
What carries the argument
The Plasticity-Ceiling Framework, which decomposes the final performance ceiling into the foundational SFT performance level and the additional improvement available through RL plasticity.
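Written out in our own notation (the summary does not reproduce the paper's symbols, so read this as a hedged paraphrase rather than the paper's equation), the decomposition and the scaling guidelines take roughly this shape:

```latex
% Plasticity-Ceiling decomposition, in our notation (assumed, not the paper's):
% final ceiling = SFT foundation + RL plasticity.
\[
  C_{\text{final}} \;=\; P_{\text{SFT}} \;+\; \Delta_{\text{RL}}
\]
% The scaling guidelines then read, for data scale $N$ and trajectory
% difficulty $d$, as a potential set by $N$ and multiplied by difficulty:
\[
  C_{\text{final}}(N, d) \;\approx\; f(N)\cdot g(d),
  \qquad f \text{ increasing in } N, \quad g \text{ a difficulty multiplier}
\]
```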
If this is right
- Transitioning from SFT to RL at the stable or mild overfitting regime secures both a robust foundation and substantial remaining plasticity for the highest overall ceiling (a monitoring sketch follows this list).
- Larger data scale sets the primary post-training potential while harder trajectories multiply the achievable performance.
- Selecting expert trajectories by their minimum validation loss during SFT reliably maximizes the final ceiling after RL.
- The sequential pipeline overcomes the stability problems and premature convergence that appear when SFT and RL run simultaneously.
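The transition rule suggests a simple monitor over the SFT validation-loss curve. A minimal sketch in Python, assuming the regime can be judged by how far the current loss sits above its running minimum; the thresholds and every name here are our illustrative choices, not the paper's:

```python
# Hypothetical SFT-regime monitor for timing the switch to RL.
# Thresholds below are illustrative assumptions, not values from the paper.

def sft_regime(val_losses: list[float],
               plateau_tol: float = 0.005,
               overfit_tol: float = 0.02) -> str:
    """Classify the SFT regime from the validation-loss history so far."""
    if len(val_losses) < 2:
        return "warmup"
    rise = val_losses[-1] - min(val_losses)  # distance above the running minimum
    if rise <= plateau_tol:
        return "stable"              # at or near the minimum validation loss
    if rise <= overfit_tol:
        return "mild_overfitting"    # the recommended hand-off window to RL
    return "severe_overfitting"      # past the window; plasticity is being spent

def should_switch_to_rl(val_losses: list[float]) -> bool:
    """Switch once the loss has stopped improving inside the safe window."""
    regime = sft_regime(val_losses)
    plateaued = len(val_losses) >= 3 and val_losses[-1] >= val_losses[-2]
    return regime in ("stable", "mild_overfitting") and plateaued

history = [1.20, 0.95, 0.81, 0.78, 0.79]          # per-epoch validation losses
print(sft_regime(history), should_switch_to_rl(history))  # mild_overfitting True
```

The thresholds would need calibrating per dataset; the point is only that the regime test needs nothing beyond the validation-loss curve SFT already tracks.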
Where Pith is reading between the lines
- The same transition timing and scaling rules could apply to other reasoning tasks if the foundation-plus-plasticity split holds outside mathematics.
- Real-time monitoring of validation loss during SFT could let practitioners switch to RL without running separate scaling experiments for every new dataset.
- Testing whether synthetic trajectories follow the same data-scale and difficulty-multiplier pattern would show how far the guidelines extend beyond human expert data.
Load-bearing premise
The split between what supervised fine-tuning alone can achieve and what reinforcement learning can still add on top remains consistent across different models and mathematical reasoning tasks.
What would settle it
A direct test would train both the sequential SFT-then-RL pipeline and a synchronized joint approach on the same model and benchmark set, under a strictly matched data and compute budget, and check whether the sequential version still produces a strictly higher final performance ceiling.
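Concretely, that comparison only settles the ordering question if both arms share one training budget. A hedged harness sketch in Python; every function and field below is a placeholder of ours, not the paper's code:

```python
# Illustrative A/B harness: sequential SFT-then-RL vs. synchronized SFT+RL
# under one shared, explicitly matched budget. All names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    total_grad_steps: int  # identical for both arms by construction
    sft_fraction: float    # SFT share of steps (sequential) / data mix (synchronized)

# Placeholder hooks: plug in real training and evaluation loops for an actual run.
def train_sft(model, steps): return model
def train_rl(model, steps): return model
def train_joint(model, steps, mix): return model
def evaluate(model) -> float: return 0.0  # benchmark accuracy

def run_sequential(model, b: Budget) -> float:
    sft_steps = round(b.total_grad_steps * b.sft_fraction)
    model = train_sft(model, sft_steps)
    model = train_rl(model, b.total_grad_steps - sft_steps)
    return evaluate(model)

def run_synchronized(model, b: Budget) -> float:
    return evaluate(train_joint(model, b.total_grad_steps, mix=b.sft_fraction))

def sequential_wins(make_model, b: Budget) -> bool:
    """True iff the ordering claim survives a strictly matched budget."""
    return run_sequential(make_model(), b) > run_synchronized(make_model(), b)
```

Because both arms draw every knob from the same frozen `Budget`, any gap that remains is attributable to ordering rather than to unequal effective training.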
Original abstract
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) dominate the post-training landscape for mathematical reasoning, yet differ fundamentally in their reliance on expert trajectories. To understand the optimal way to harness these trajectories for maximizing performance, we propose the Plasticity-Ceiling Framework. This framework empirically grounds the post-training landscape by decomposing the final performance ceiling into the foundational SFT performance and the subsequent RL plasticity (i.e., the maximum improvement via RL). Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability and premature convergence deficits inherent in synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final ceiling by securing a robust SFT foundation with substantial RL plasticity; (2) Refuting the "Less is More" hypothesis in SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) The Minimum Validation Loss of SFT serves as a reliable indicator for selecting the expert trajectories that maximize the ultimate performance ceiling. Our findings provide actionable guidelines for extracting maximum value from expert trajectories.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Plasticity-Ceiling Framework to decompose LLM post-training performance on mathematical reasoning into an SFT foundation component and subsequent RL plasticity (incremental gain from RL). Through benchmarking, it claims that the sequential SFT-then-RL pipeline outperforms synchronized SFT+RL approaches by avoiding stability issues and premature convergence. It further derives scaling guidelines: transition to RL at the Stable or Mild Overfitting regime of SFT, data scale as the primary driver of post-training potential (refuting 'Less is More'), trajectory difficulty as a multiplier, and minimum SFT validation loss as a reliable selector for expert trajectories.
Significance. If the empirical decomposition and superiority claims hold under controlled conditions, the work supplies concrete, actionable guidelines for ordering and scaling SFT and RL stages when leveraging expert trajectories. This could standardize post-training pipelines for math reasoning and shift emphasis toward data volume over trajectory curation, with the minimum-validation-loss indicator offering a practical checkpoint for trajectory selection.
major comments (3)
- [Experimental comparisons (likely §4)] The central claim that sequential SFT-then-RL is strictly superior (overcoming stability and convergence deficits) requires explicit confirmation that synchronized baselines received identical total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules; otherwise the reported advantages may be artifacts of unequal effective training budgets rather than ordering per se.
- [Plasticity-Ceiling Framework definition and results] The Plasticity-Ceiling decomposition treats RL plasticity as an additive increment after SFT, but this additivity is not isolated from model-specific reward dynamics or trajectory overlap; without ablations that hold total data and compute fixed while varying only the SFT/RL ordering and synchronization, the framework's predictive power for the final ceiling remains unverified.
- [Scaling analysis and guidelines] The scaling guidelines (transition at Stable/Mild Overfitting regime, data scale as primary driver, trajectory difficulty as multiplier) rest on the same untested additivity assumption across the reported models and math tasks; the manuscript should include cross-model and cross-task validation to show that the minimum-validation-loss indicator and regime recommendations generalize beyond the specific benchmark suite.
minor comments (2)
- [Abstract] The abstract provides no quantitative results, model sizes, benchmark names, or error bars, which hinders immediate assessment of effect sizes and statistical reliability.
- [Framework and notation] Clarify the precise operational definition of 'RL plasticity' (e.g., is it the absolute gain, relative gain, or normalized improvement) and how it is computed from the reported curves.
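For concreteness, the three readings the comment distinguishes, with $A_{\text{SFT}}$ and $A_{\text{RL}}$ standing for benchmark accuracy before and after RL (our notation, an assumption about how the reported curves are summarized):

```latex
% Candidate operationalizations of RL plasticity (our notation, assumed):
\[
  \Delta_{\text{abs}}  = A_{\text{RL}} - A_{\text{SFT}}, \qquad
  \Delta_{\text{rel}}  = \frac{A_{\text{RL}} - A_{\text{SFT}}}{A_{\text{SFT}}}, \qquad
  \Delta_{\text{norm}} = \frac{A_{\text{RL}} - A_{\text{SFT}}}{1 - A_{\text{SFT}}}
\]
```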
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our experimental controls and outlining revisions to strengthen the empirical support for the Plasticity-Ceiling Framework and scaling guidelines.
Point-by-point responses
Referee: The central claim that sequential SFT-then-RL is strictly superior (overcoming stability and convergence deficits) requires explicit confirmation that synchronized baselines received identical total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules; otherwise the reported advantages may be artifacts of unequal effective training budgets rather than ordering per se.
Authors: We appreciate the referee's emphasis on ensuring fair experimental comparisons. In our original setup, the synchronized SFT+RL baselines were trained with exactly the same total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules as the sequential SFT-then-RL pipeline. This matching was implemented to isolate the effect of training order. To eliminate any potential ambiguity, we will add a dedicated subsection in §4 with a hyperparameter comparison table and explicit confirmation of matched training budgets across all methods. revision: yes
Referee: The Plasticity-Ceiling decomposition treats RL plasticity as an additive increment after SFT, but this additivity is not isolated from model-specific reward dynamics or trajectory overlap; without ablations that hold total data and compute fixed while varying only the SFT/RL ordering and synchronization, the framework's predictive power for the final ceiling remains unverified.
Authors: The Plasticity-Ceiling Framework is an empirical decomposition based on observed performance ceilings rather than a claim of strict theoretical additivity. Our benchmarking across configurations shows consistent correlations between the decomposed components and final outcomes. We agree that further isolation is valuable. In the revision, we will incorporate new ablations that hold total data volume and compute budget fixed while varying only SFT/RL ordering and synchronization to more rigorously test the framework's predictive utility. revision: yes
Referee: The scaling guidelines (transition at Stable/Mild Overfitting regime, data scale as primary driver, trajectory difficulty as multiplier) rest on the same untested additivity assumption across the reported models and math tasks; the manuscript should include cross-model and cross-task validation to show that the minimum-validation-loss indicator and regime recommendations generalize beyond the specific benchmark suite.
Authors: The scaling guidelines and minimum-validation-loss indicator are derived from our primary benchmark suite. To strengthen claims of generalization, we will expand the revised manuscript with additional experiments across different model scales and supplementary mathematical reasoning tasks. These results will be presented to demonstrate the robustness of the regime recommendations and trajectory selection criterion beyond the current evaluation set. revision: yes
Circularity Check
Empirical benchmarking establishes claims without definitional circularity
Full rationale
The paper proposes the Plasticity-Ceiling Framework as an empirical decomposition of final performance into SFT foundation plus RL plasticity, then validates the sequential SFT-then-RL pipeline and scaling guidelines through extensive benchmarking on mathematical reasoning tasks. No load-bearing equations, self-definitions, or derivations are present that reduce by construction to fitted inputs or prior self-citations; the central claims rest on direct experimental comparisons of stability, convergence, and performance ceilings across regimes. The work is therefore self-contained as standard empirical analysis rather than a closed theoretical loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the final performance ceiling decomposes additively into SFT performance and RL plasticity.