pith. machine review for the scientific record.

arxiv: 2512.11470 · v2 · submitted 2025-12-12 · 💻 cs.LG · cs.CL

Recognition: no theorem link

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:39 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM post-training · mathematical reasoning · supervised fine-tuning · reinforcement learning · expert trajectories · scaling guidelines · plasticity ceiling

The pith

Sequential SFT followed by RL reaches a higher performance ceiling than synchronized training by first locking in a stable foundation and then unlocking additional plasticity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Plasticity-Ceiling Framework to separate final model performance into the base level reached by supervised fine-tuning on expert trajectories and the extra gains still available from reinforcement learning afterward. It finds that running SFT first, until the stable or mild-overfitting stage, and then switching to RL outperforms joint training, because the sequential order avoids the instability and premature convergence that synchronized training suffers. The work supplies concrete scaling rules: data volume sets the main ceiling height, trajectory difficulty multiplies the outcome, and the lowest validation loss during SFT reliably flags the best trajectories to use.

Core claim

The Plasticity-Ceiling Framework decomposes the final performance ceiling into foundational SFT performance and subsequent RL plasticity. The sequential SFT-then-RL pipeline is superior to synchronized approaches because it avoids stability deficits and premature convergence. Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the ceiling; data scale determines primary post-training potential while trajectory difficulty acts as a multiplier; and the minimum validation loss of SFT serves as a reliable indicator for selecting expert trajectories that maximize the ultimate ceiling.
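As a minimal formalization of that decomposition (A_post is the paper's notation, visible in Figures 1 and 5; the A_SFT and Δ_RL labels are ours, following the abstract's reading of plasticity as "the maximum improvement via RL"):

```latex
% Plasticity-Ceiling decomposition: final ceiling = SFT foundation + RL plasticity.
% A_post appears in the paper's figures; A_SFT and \Delta_RL are our labels.
A_{\mathrm{post}} = A_{\mathrm{SFT}} + \Delta_{\mathrm{RL}},
\qquad
\Delta_{\mathrm{RL}} = \max_{\theta \in \mathcal{R}(\theta_{\mathrm{SFT}})} A(\theta) - A_{\mathrm{SFT}}
```

Here \mathcal{R}(\theta_SFT) denotes the set of models reachable by RL from the SFT checkpoint; whether this split is predictive across orderings is exactly what the referee's second major comment probes.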

What carries the argument

The Plasticity-Ceiling Framework, which decomposes the final performance ceiling into the foundational SFT performance level and the additional improvement available through RL plasticity.

If this is right

  • Transitioning from SFT to RL at the stable or mild overfitting regime secures both a robust foundation and substantial remaining plasticity for the highest overall ceiling.
  • Larger data scale sets the primary post-training potential while harder trajectories multiply the achievable performance (see the sketch after this list).
  • Selecting expert trajectories by their minimum validation loss during SFT reliably maximizes the final ceiling after RL.
  • The sequential pipeline overcomes the stability problems and premature convergence that appear when SFT and RL run simultaneously.
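A toy rendering of the data-scale, difficulty, and validation-loss guidelines above; the saturating functional form, constants, and names are illustrative assumptions, not the paper's fitted scaling curves:

```python
import math

def predicted_ceiling(n_trajectories: int, difficulty: float) -> float:
    """Toy ceiling model: data scale sets the primary ceiling (saturating
    in volume); trajectory difficulty (a score in [0, 1]) multiplies it.
    Constants are illustrative only."""
    base = 1.0 - math.exp(-0.5 * math.log1p(n_trajectories))  # saturates with data volume
    return base * (1.0 + 0.3 * difficulty)                    # difficulty as multiplier

def select_trajectory_set(min_val_losses: dict[str, float]) -> str:
    """Prefer the expert-trajectory set whose SFT run reached the lowest
    validation loss, mirroring the paper's selection indicator."""
    return min(min_val_losses, key=min_val_losses.get)

# Example with hypothetical trajectory sets and their minimum SFT validation losses.
best = select_trajectory_set({"human_cot": 0.42, "distilled_long_cot": 0.35, "short_answers": 0.58})
print(best)  # -> "distilled_long_cot"
```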

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transition timing and scaling rules could apply to other reasoning tasks if the foundation-plus-plasticity split holds outside mathematics.
  • Real-time monitoring of validation loss during SFT could let practitioners switch to RL without running separate scaling experiments for every new dataset (a minimal heuristic is sketched after this list).
  • Testing whether synthetic trajectories follow the same data-scale and difficulty-multiplier pattern would show how far the guidelines extend beyond human expert data.
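A minimal sketch of the monitoring idea in the second bullet, assuming a sliding-window slope test with made-up tolerances (the paper identifies regimes from full scaling dynamics, not from this heuristic):

```python
def should_switch_to_rl(val_losses: list[float], window: int = 5,
                        plateau_tol: float = 1e-3, rise_tol: float = 5e-3) -> bool:
    """Return True once the per-step slope of recent validation loss sits
    between a small negative plateau tolerance (stable regime) and a mild
    positive rise (mild overfitting)."""
    if len(val_losses) < window:
        return False
    recent = val_losses[-window:]
    slope = (recent[-1] - recent[0]) / (window - 1)
    return -plateau_tol <= slope <= rise_tol

# Steeply descending -> keep training SFT; flattened -> switch to RL.
print(should_switch_to_rl([1.2, 1.0, 0.8, 0.6, 0.4]))            # False
print(should_switch_to_rl([0.401, 0.400, 0.400, 0.400, 0.400]))  # True
```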

Load-bearing premise

The split between what supervised fine-tuning alone can achieve and what reinforcement learning can still add on top remains consistent across different models and mathematical reasoning tasks.

What would settle it

A direct test would train both the sequential SFT-then-RL pipeline and a synchronized joint approach on the same model and benchmark set, under matched data and compute budgets, and check whether the sequential version still produces a strictly higher final performance ceiling.
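In outline, such a test could look like the sketch below; the train_* and evaluate functions are stand-in stubs, not the paper's or any library's API:

```python
# Settling experiment with matched budgets: same model, data, and total
# update count; only the ordering of SFT and RL differs.

TOTAL_STEPS = 10_000  # identical optimization budget for both pipelines

def train_sft(model, trajectories, steps): return model                      # stub
def train_rl(model, benchmark, steps): return model                          # stub
def train_synchronized(model, trajectories, benchmark, steps): return model  # stub
def evaluate(model, benchmark) -> float: return 0.0                          # stub: e.g. pass@1

def run_sequential(model, trajectories, benchmark, sft_steps: int) -> float:
    model = train_sft(model, trajectories, steps=sft_steps)            # foundation phase
    model = train_rl(model, benchmark, steps=TOTAL_STEPS - sft_steps)  # plasticity phase
    return evaluate(model, benchmark)

def run_synchronized(model, trajectories, benchmark) -> float:
    # Mixed SFT+RL updates throughout, same total step budget as above.
    model = train_synchronized(model, trajectories, benchmark, steps=TOTAL_STEPS)
    return evaluate(model, benchmark)

# The sequential claim survives only if run_sequential consistently beats
# run_synchronized across models and benchmarks at matched budgets.
```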

Figures

Figures reproduced from arXiv: 2512.11470 by Bowen Ding, Dantong Zhu, Fei Mi, Futing Wang, Heyuan Deng, Jiayang Lyv, Jiyao Yuan, Lifeng Shang, Qi Zhu, Shuangshuang Tian, Tao Lin, Yuhan Chen.

Figure 1
Figure 1. The conceptual overview of LLM post-training. Sequential SFT-then-RL (blue→orange) achieves the highest performance ceiling A_post, outperforming Pure RL (orange) and Synchronized SFT-RL (striped blue–orange) paths. Insets highlight that larger, harder data increases plasticity, and RL should start during the Stable SFT regime. Conversely, some LLM practitioners (Yang et al., 2025; GLM et al., 2025; DeepSeek-AI, …
Figure 2
Figure 2. Compute–performance scaling of post-training paradigms under different initialization …
Figure 3
Figure 3. SFT Compute Scaling Dynamics of the SFT-then-RL Pipeline across Diverse Data Prop…
Figure 4
Figure 4. The analysis of the max post-training performance …
Figure 5
Figure 5. Visualization of SFT-then-RL fitting across different SFT data configurations. (a) Correlation analysis between A_post and Minimum Validation Loss. (b)–(f) The SFT-then-RL scaling dynamics under various data configurations. The SFT trajectory is depicted by a black dashed line. RL scaling curves initiated from different SFT steps are distinguished by a color gradient, where lighter shades indicate a higher …
Original abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) dominate the post-training landscape for mathematical reasoning, yet differ fundamentally in their reliance on expert trajectories. To understand the optimal way to harness these trajectories for maximizing performance, we propose the Plasticity-Ceiling Framework. This framework empirically grounds the post-training landscape by decomposing the final performance ceiling into the foundational SFT performance and the subsequent RL plasticity (i.e., the maximum improvement via RL). Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability and premature convergence deficits inherent in synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final ceiling by securing a robust SFT foundation with substantial RL plasticity; (2) Refuting the "Less is More" hypothesis in SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) The Minimum Validation Loss of SFT serves as a reliable indicator for selecting the expert trajectories that maximize the ultimate performance ceiling. Our findings provide actionable guidelines for extracting maximum value from expert trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Plasticity-Ceiling Framework to decompose LLM post-training performance on mathematical reasoning into an SFT foundation component and subsequent RL plasticity (incremental gain from RL). Through benchmarking, it claims that the sequential SFT-then-RL pipeline outperforms synchronized SFT+RL approaches by avoiding stability issues and premature convergence. It further derives scaling guidelines: transition to RL at the Stable or Mild Overfitting regime of SFT, data scale as the primary driver of post-training potential (refuting 'Less is More'), trajectory difficulty as a multiplier, and minimum SFT validation loss as a reliable selector for expert trajectories.

Significance. If the empirical decomposition and superiority claims hold under controlled conditions, the work supplies concrete, actionable guidelines for ordering and scaling SFT and RL stages when leveraging expert trajectories. This could standardize post-training pipelines for math reasoning and shift emphasis toward data volume over trajectory curation, with the minimum-validation-loss indicator offering a practical checkpoint for trajectory selection.

major comments (3)
  1. [Experimental comparisons (likely §4)] The central claim that sequential SFT-then-RL is strictly superior (overcoming stability and convergence deficits) requires explicit confirmation that synchronized baselines received identical total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules; otherwise the reported advantages may be artifacts of unequal effective training budgets rather than ordering per se.
  2. [Plasticity-Ceiling Framework definition and results] The Plasticity-Ceiling decomposition treats RL plasticity as an additive increment after SFT, but this additivity is not isolated from model-specific reward dynamics or trajectory overlap; without ablations that hold total data and compute fixed while varying only the SFT/RL ordering and synchronization, the framework's predictive power for the final ceiling remains unverified.
  3. [Scaling analysis and guidelines] The scaling guidelines (transition at Stable/Mild Overfitting regime, data scale as primary driver, trajectory difficulty as multiplier) rest on the same untested additivity assumption across the reported models and math tasks; the manuscript should include cross-model and cross-task validation to show that the minimum-validation-loss indicator and regime recommendations generalize beyond the specific benchmark suite.
minor comments (2)
  1. [Abstract] The abstract provides no quantitative results, model sizes, benchmark names, or error bars, which hinders immediate assessment of effect sizes and statistical reliability.
  2. [Framework and notation] Clarify the precise operational definition of 'RL plasticity' (e.g., is it the absolute gain, relative gain, or normalized improvement) and how it is computed from the reported curves.
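For concreteness, the three readings the referee lists could be written as follows (notation ours, with A_max the benchmark's maximum attainable score):

```latex
% Candidate operationalizations of RL plasticity: absolute, relative,
% and normalized gain over the SFT foundation. All symbols are ours.
\Delta_{\mathrm{abs}} = A_{\mathrm{post}} - A_{\mathrm{SFT}}, \qquad
\Delta_{\mathrm{rel}} = \frac{A_{\mathrm{post}} - A_{\mathrm{SFT}}}{A_{\mathrm{SFT}}}, \qquad
\Delta_{\mathrm{norm}} = \frac{A_{\mathrm{post}} - A_{\mathrm{SFT}}}{A_{\max} - A_{\mathrm{SFT}}}
```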

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our experimental controls and outlining revisions to strengthen the empirical support for the Plasticity-Ceiling Framework and scaling guidelines.

Point-by-point responses
  1. Referee: The central claim that sequential SFT-then-RL is strictly superior (overcoming stability and convergence deficits) requires explicit confirmation that synchronized baselines received identical total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules; otherwise the reported advantages may be artifacts of unequal effective training budgets rather than ordering per se.

    Authors: We appreciate the referee's emphasis on ensuring fair experimental comparisons. In our original setup, the synchronized SFT+RL baselines were trained with exactly the same total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules as the sequential SFT-then-RL pipeline. This matching was implemented to isolate the effect of training order. To eliminate any potential ambiguity, we will add a dedicated subsection in §4 with a hyperparameter comparison table and explicit confirmation of matched training budgets across all methods. revision: yes

  2. Referee: The Plasticity-Ceiling decomposition treats RL plasticity as an additive increment after SFT, but this additivity is not isolated from model-specific reward dynamics or trajectory overlap; without ablations that hold total data and compute fixed while varying only the SFT/RL ordering and synchronization, the framework's predictive power for the final ceiling remains unverified.

    Authors: The Plasticity-Ceiling Framework is an empirical decomposition based on observed performance ceilings rather than a claim of strict theoretical additivity. Our benchmarking across configurations shows consistent correlations between the decomposed components and final outcomes. We agree that further isolation is valuable. In the revision, we will incorporate new ablations that hold total data volume and compute budget fixed while varying only SFT/RL ordering and synchronization to more rigorously test the framework's predictive utility. revision: yes

  3. Referee: The scaling guidelines (transition at Stable/Mild Overfitting regime, data scale as primary driver, trajectory difficulty as multiplier) rest on the same untested additivity assumption across the reported models and math tasks; the manuscript should include cross-model and cross-task validation to show that the minimum-validation-loss indicator and regime recommendations generalize beyond the specific benchmark suite.

    Authors: The scaling guidelines and minimum-validation-loss indicator are derived from our primary benchmark suite. To strengthen claims of generalization, we will expand the revised manuscript with additional experiments across different model scales and supplementary mathematical reasoning tasks. These results will be presented to demonstrate the robustness of the regime recommendations and trajectory selection criterion beyond the current evaluation set. revision: yes

Circularity Check

0 steps flagged

Empirical benchmarking establishes claims without definitional circularity

Full rationale

The paper proposes the Plasticity-Ceiling Framework as an empirical decomposition of final performance into SFT foundation plus RL plasticity, then validates the sequential SFT-then-RL pipeline and scaling guidelines through extensive benchmarking on mathematical reasoning tasks. No load-bearing equations, self-definitions, or derivations are present that reduce by construction to fitted inputs or prior self-citations; the central claims rest on direct experimental comparisons of stability, convergence, and performance ceilings across regimes. The work is therefore self-contained as standard empirical analysis rather than a closed theoretical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework relies on the assumption that SFT and RL effects can be separated and measured independently through benchmarking.

axioms (1)
  • domain assumption Final performance ceiling decomposes additively into SFT performance and RL plasticity
    This is the core of the proposed framework as stated in the abstract.

pith-pipeline@v0.9.0 · 7475 in / 1051 out tokens · 67310 ms · 2026-05-16T22:39:29.481400+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 23 internal anchors


  2. [2]

    Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chr...

  3. [3]

    Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023

    Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023. URL https://arxiv.org/abs/2311.18232

  4. [4]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023. URL https://arxiv.org/abs/2305.13245

  5. [5]

    Scaling laws for predicting downstream performance in llms, 2025

    Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, and Heng Ji. Scaling laws for predicting downstream performance in llms, 2025. URL https://arxiv.org/abs/2410.08527

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  8. [8]

    VERL utils: FLOPs counter (line 149)

    Volcano Engine. VERL utils: FLOPs counter (line 149). https://github.com/volcengine/verl/blob/59049a66/verl/utils/flops_counter.py#L149, 2023. Version 59049a6; accessed 2024-12-01

  9. [9]

    SRFT : A single-stage method with supervised and reinforcement fine-tuning for reasoning

    Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. SRFT : A single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv preprint arXiv:2506.19767, 2025. doi:10.48550/arXiv.2506.19767. URL https://arxiv.org/abs/2506.19767

  10. [10]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Team GLM, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bo...

  11. [11]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

  12. [12]

    Skywork Open Reasoner 1 Technical Report

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025

  13. [13]

    Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning

    Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning, 2025. URL https://arxiv.org/abs/2507.00432

  14. [14]

    Robust Statistics

    P.J. Huber and E.M. Ronchetti. Robust Statistics. Wiley Series in Probability and Statistics. Wiley, 2011. ISBN 9781118210338. URL https://books.google.com.hk/books?id=j1OhquR_j88C

  15. [15]

    Math-verify

    Hugging Face. Math-Verify. https://github.com/huggingface/Math-Verify, 2024

  16. [16]

    How to detect and handle outliers, volume 16

    Boris Iglewicz and David C Hoaglin. How to detect and handle outliers, volume 16. Asqc Quality Press Milwaukee, WI, 1993

  17. [17]

    Quagmires in sft-rl post-training: When high sft scores mislead and what to use instead, 2025

    Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, and Newsha Ardalani. Quagmires in sft-rl post-training: When high sft scores mislead and what to use instead, 2025. URL https://arxiv.org/abs/2510.01624

  18. [18]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361

  19. [19]

    The Art of Scaling Reinforcement Learning Compute for LLMs

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms, 2025. URL https://arxiv.org/abs/2510.13786

  20. [20]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858

  21. [21]

    Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median

    Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4): 764–766, 2013. ISSN 0022-1031. doi:10.1016/j.jesp.2013.03.013. URL https://www.sciencedire...

  22. [22]

    Numinamath

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. NuminaMath. https://github.com/project-numina/aimo-progress-prize, …

  23. [23]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

  24. [24]

    Towards a unified view of large language model post-training, 2025

    Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, and Bowen Zhou. Towards a unified view of large language model post-training, 2025. URL https://arxiv.org/abs/2509.04419

  25. [25]

    Llama 3.2: Revolutionizing edge ai and vision with open, customizable models

    Meta AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/, September 2024. Meta AI blog; accessed 2025-04-13

  26. [26]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

  27. [27]

    Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021. URL https://arxiv.org/abs/2104.04473

  28. [28]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  29. [29]

    Qwen2.5 Technical Report

    Qwen Team: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  30. [30]

    Least median of squares regression

    Peter J Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388): 871–880, 1984

  31. [31]

    Computing LTS Regression for Large Data Sets

    Peter J. Rousseeuw and Katrien Driessen. Computing LTS regression for large data sets. Data Min. Knowl. Discov., 12(1): 29–45, January 2006. ISSN 1384-5810. doi:10.1007/s10618-005-0024-4. URL https://doi.org/10.1007/s10618-005-0024-4

  32. [32]

    Robust regression and outlier detection

    Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection. John Wiley & Sons, 1987

  33. [33]

    Observational Scaling Laws and the Predictability of Language Model Performance

    Yangjun Ruan, Chris J. Maddison, and Tatsunori Hashimoto. Observational scaling laws and the predictability of language model performance, 2024. URL https://arxiv.org/abs/2405.10938

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  35. [35]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl's razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259

  36. [36]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024

  37. [37]

    Climbing the ladder of reasoning: What llms can-and still can't-solve after sft?, 2025

    Yiyou Sun, Georgia Zhou, Hao Wang, Dacheng Li, Nouha Dziri, and Dawn Song. Climbing the ladder of reasoning: What llms can-and still can't-solve after sft?, 2025. URL https://arxiv.org/abs/2504.11741

  38. [38]

    All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

    Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, and J. Andrew Bagnell. All roads lead to likelihood: The value of reinforcement learning in fine-tuning, 2025. URL https://arxiv.org/abs/2503.01067

  39. [39]

    Deepdistill: Enhancing llm reasoning capabilities via large-scale difficulty-graded data training, 2025

    Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, and Xiangang Li. Deepdistill: Enhancing llm reasoning capabilities via large-scale difficulty-graded data training, 2025. URL https://arxiv.org/abs/2504.17565

  40. [40]

    Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving

    Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving, 2024. URL https://arxiv.org/abs/2407.13690. arXiv:2407.13690, cs.CL

  41. [41]

    How to train your LLM web agent: A statistical diagnosis

    Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Peñaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, and Massimo Caccia. How to train your LLM web agent: A s...

  42. [42]

    Implicit reward as the bridge: A unified view of sft and dpo connections, 2025

    Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, and Xipeng Qiu. Implicit reward as the bridge: A unified view of sft and dpo connections, 2025. URL https://arxiv.org/abs/2507.00018

  43. [43]

    Learning to Reason under Off-Policy Guidance

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance, 2025. URL https://arxiv.org/abs/2504.14945

  44. [44]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  45. [45]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  46. [46]

    LIMO: Less is More for Reasoning

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025. URL https://arxiv.org/abs/2502.03387

  47. [47]

    A practical two-stage recipe for mathematical llms: Maximizing accuracy with sft and efficiency with reinforcement learning, 2025

    Hiroshi Yoshihara, Taiki Yamaguchi, and Yuichi Inoue. A practical two-stage recipe for mathematical llms: Maximizing accuracy with sft and efficiency with reinforcement learning, 2025. URL https://arxiv.org/abs/2507.08267

  48. [48]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023

  49. [49]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  50. [50]

    D3: Diversity, difficulty, and dependability-aware data selection for sample-efficient llm instruction tuning, 2025a

    Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, and Lan-Zhe Guo. D3: Diversity, difficulty, and dependability-aware data selection for sample-efficient llm instruction tuning, 2025a. URL https://arxiv.org/abs/2503.11441

  51. [51]

    On-policy rl meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting, 2025b

    Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy rl meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting, 2025b. URL https://arxiv.org/abs/2508.11408

  52. [52]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025c

  53. [53]

    1.4 million open-source distilled reasoning dataset to empower large language model training, 2025

    Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training, 2025. URL https://arxiv.org/abs/2503.19633