pith. machine review for the scientific record.

arxiv: 2512.11470 · v2 · submitted 2025-12-12 · 💻 cs.LG · cs.CL

Recognition: no theorem link

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:39 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM post-training · mathematical reasoning · supervised fine-tuning · reinforcement learning · expert trajectories · scaling guidelines · plasticity ceiling

The pith

Sequential SFT followed by RL reaches a higher performance ceiling than synchronized training by first locking in a stable foundation and then unlocking additional plasticity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Plasticity-Ceiling Framework to separate final model performance into the base level reached by supervised fine-tuning on expert trajectories and the extra gains still available from reinforcement learning afterward. It finds that running SFT first, until the stable or mild-overfitting stage, and then switching to RL outperforms joint training, because the sequential order avoids the instability and premature convergence that synchronized training suffers. The work supplies concrete scaling rules: data volume sets the main ceiling height, trajectory difficulty multiplies the outcome, and the lowest validation loss during SFT reliably flags the best trajectories to use.

Core claim

The Plasticity-Ceiling Framework decomposes the final performance ceiling into foundational SFT performance and subsequent RL plasticity. The sequential SFT-then-RL pipeline is superior to synchronized approaches because it avoids stability deficits and premature convergence. Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the ceiling; data scale determines primary post-training potential while trajectory difficulty acts as a multiplier; and the minimum validation loss of SFT serves as a reliable indicator for selecting expert trajectories that maximize the ultimate ceiling.
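As a minimal formalization of that decomposition (A_post is the paper's notation, visible in Figures 1 and 5; the A_SFT and Δ_RL labels are ours, following the abstract's reading of plasticity as "the maximum improvement via RL"):

```latex
% Plasticity-Ceiling decomposition: final ceiling = SFT foundation + RL plasticity.
% A_post appears in the paper's figures; A_SFT and \Delta_RL are our labels.
A_{\mathrm{post}} = A_{\mathrm{SFT}} + \Delta_{\mathrm{RL}},
\qquad
\Delta_{\mathrm{RL}} = \max_{\theta \in \mathcal{R}(\theta_{\mathrm{SFT}})} A(\theta) - A_{\mathrm{SFT}}
```

Here \mathcal{R}(\theta_SFT) denotes the set of models reachable by RL from the SFT checkpoint; whether this split is predictive across orderings is exactly what the referee's second major comment probes.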

What carries the argument

The Plasticity-Ceiling Framework, which decomposes the final performance ceiling into the foundational SFT performance level and the additional improvement available through RL plasticity.

If this is right

  • Transitioning from SFT to RL at the stable or mild overfitting regime secures both a robust foundation and substantial remaining plasticity for the highest overall ceiling.
  • Larger data scale sets the primary post-training potential while harder trajectories multiply the achievable performance (see the sketch after this list).
  • Selecting expert trajectories by their minimum validation loss during SFT reliably maximizes the final ceiling after RL.
  • The sequential pipeline overcomes the stability problems and premature convergence that appear when SFT and RL run simultaneously.
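A toy rendering of the data-scale, difficulty, and validation-loss guidelines above; the saturating functional form, constants, and names are illustrative assumptions, not the paper's fitted scaling curves:

```python
import math

def predicted_ceiling(n_trajectories: int, difficulty: float) -> float:
    """Toy ceiling model: data scale sets the primary ceiling (saturating
    in volume); trajectory difficulty (a score in [0, 1]) multiplies it.
    Constants are illustrative only."""
    base = 1.0 - math.exp(-0.5 * math.log1p(n_trajectories))  # saturates with data volume
    return base * (1.0 + 0.3 * difficulty)                    # difficulty as multiplier

def select_trajectory_set(min_val_losses: dict[str, float]) -> str:
    """Prefer the expert-trajectory set whose SFT run reached the lowest
    validation loss, mirroring the paper's selection indicator."""
    return min(min_val_losses, key=min_val_losses.get)

# Example with hypothetical trajectory sets and their minimum SFT validation losses.
best = select_trajectory_set({"human_cot": 0.42, "distilled_long_cot": 0.35, "short_answers": 0.58})
print(best)  # -> "distilled_long_cot"
```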

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transition timing and scaling rules could apply to other reasoning tasks if the foundation-plus-plasticity split holds outside mathematics.
  • Real-time monitoring of validation loss during SFT could let practitioners switch to RL without running separate scaling experiments for every new dataset (a minimal heuristic is sketched after this list).
  • Testing whether synthetic trajectories follow the same data-scale and difficulty-multiplier pattern would show how far the guidelines extend beyond human expert data.
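A minimal sketch of the monitoring idea in the second bullet, assuming a sliding-window slope test with made-up tolerances (the paper identifies regimes from full scaling dynamics, not from this heuristic):

```python
def should_switch_to_rl(val_losses: list[float], window: int = 5,
                        plateau_tol: float = 1e-3, rise_tol: float = 5e-3) -> bool:
    """Return True once the per-step slope of recent validation loss sits
    between a small negative plateau tolerance (stable regime) and a mild
    positive rise (mild overfitting)."""
    if len(val_losses) < window:
        return False
    recent = val_losses[-window:]
    slope = (recent[-1] - recent[0]) / (window - 1)
    return -plateau_tol <= slope <= rise_tol

# Steeply descending -> keep training SFT; flattened -> switch to RL.
print(should_switch_to_rl([1.2, 1.0, 0.8, 0.6, 0.4]))            # False
print(should_switch_to_rl([0.401, 0.400, 0.400, 0.400, 0.400]))  # True
```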

Load-bearing premise

The split between what supervised fine-tuning alone can achieve and what reinforcement learning can still add on top remains consistent across different models and mathematical reasoning tasks.

What would settle it

A direct test would train both the sequential SFT-then-RL pipeline and a synchronized joint approach on the same model and benchmark set, under matched data and compute budgets, and check whether the sequential version still produces a strictly higher final performance ceiling.
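In outline, such a test could look like the sketch below; the train_* and evaluate functions are stand-in stubs, not the paper's or any library's API:

```python
# Settling experiment with matched budgets: same model, data, and total
# update count; only the ordering of SFT and RL differs.

TOTAL_STEPS = 10_000  # identical optimization budget for both pipelines

def train_sft(model, trajectories, steps): return model                      # stub
def train_rl(model, benchmark, steps): return model                          # stub
def train_synchronized(model, trajectories, benchmark, steps): return model  # stub
def evaluate(model, benchmark) -> float: return 0.0                          # stub: e.g. pass@1

def run_sequential(model, trajectories, benchmark, sft_steps: int) -> float:
    model = train_sft(model, trajectories, steps=sft_steps)            # foundation phase
    model = train_rl(model, benchmark, steps=TOTAL_STEPS - sft_steps)  # plasticity phase
    return evaluate(model, benchmark)

def run_synchronized(model, trajectories, benchmark) -> float:
    # Mixed SFT+RL updates throughout, same total step budget as above.
    model = train_synchronized(model, trajectories, benchmark, steps=TOTAL_STEPS)
    return evaluate(model, benchmark)

# The sequential claim survives only if run_sequential consistently beats
# run_synchronized across models and benchmarks at matched budgets.
```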

Figures

Figures reproduced from arXiv: 2512.11470 by Bowen Ding, Dantong Zhu, Fei Mi, Futing Wang, Heyuan Deng, Jiayang Lyv, Jiyao Yuan, Lifeng Shang, Qi Zhu, Shuangshuang Tian, Tao Lin, Yuhan Chen.

Figure 1
Figure 1. The conceptual overview of LLM post-training. Sequential SFT-then-RL (blue→orange) achieves the highest performance ceiling A_post, outperforming Pure RL (orange) and Synchronized SFT-RL (striped blue–orange) paths. Insets highlight that larger, harder data increases plasticity, and RL should start during the Stable SFT regime. Conversely, some LLM practitioners (Yang et al., 2025; GLM et al., 2025; DeepSeek-AI, …
Figure 2
Figure 2. Compute–performance scaling of post-training paradigms under different initialization …
Figure 3
Figure 3. SFT Compute Scaling Dynamics of the SFT-then-RL Pipeline across Diverse Data Prop…
Figure 4
Figure 4. The analysis of the max post-training performance …
Figure 5
Figure 5. Visualization of SFT-then-RL fitting across different SFT data configurations. (a) Correlation analysis between A_post and Minimum Validation Loss. (b)–(f) The SFT-then-RL scaling dynamics under various data configurations. The SFT trajectory is depicted by a black dashed line. RL scaling curves initiated from different SFT steps are distinguished by a color gradient, where lighter shades indicate a higher …
Original abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) dominate the post-training landscape for mathematical reasoning, yet differ fundamentally in their reliance on expert trajectories. To understand the optimal way to harness these trajectories for maximizing performance, we propose the Plasticity-Ceiling Framework. This framework empirically grounds the post-training landscape by decomposing the final performance ceiling into the foundational SFT performance and the subsequent RL plasticity (i.e., the maximum improvement via RL). Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability and premature convergence deficits inherent in synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final ceiling by securing a robust SFT foundation with substantial RL plasticity; (2) Refuting the "Less is More" hypothesis in SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) The Minimum Validation Loss of SFT serves as a reliable indicator for selecting the expert trajectories that maximize the ultimate performance ceiling. Our findings provide actionable guidelines for extracting maximum value from expert trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Plasticity-Ceiling Framework to decompose LLM post-training performance on mathematical reasoning into an SFT foundation component and subsequent RL plasticity (incremental gain from RL). Through benchmarking, it claims that the sequential SFT-then-RL pipeline outperforms synchronized SFT+RL approaches by avoiding stability issues and premature convergence. It further derives scaling guidelines: transition to RL at the Stable or Mild Overfitting regime of SFT, data scale as the primary driver of post-training potential (refuting 'Less is More'), trajectory difficulty as a multiplier, and minimum SFT validation loss as a reliable selector for expert trajectories.

Significance. If the empirical decomposition and superiority claims hold under controlled conditions, the work supplies concrete, actionable guidelines for ordering and scaling SFT and RL stages when leveraging expert trajectories. This could standardize post-training pipelines for math reasoning and shift emphasis toward data volume over trajectory curation, with the minimum-validation-loss indicator offering a practical checkpoint for trajectory selection.

major comments (3)
  1. [Experimental comparisons (likely §4)] The central claim that sequential SFT-then-RL is strictly superior (overcoming stability and convergence deficits) requires explicit confirmation that synchronized baselines received identical total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules; otherwise the reported advantages may be artifacts of unequal effective training budgets rather than ordering per se.
  2. [Plasticity-Ceiling Framework definition and results] The Plasticity-Ceiling decomposition treats RL plasticity as an additive increment after SFT, but this additivity is not isolated from model-specific reward dynamics or trajectory overlap; without ablations that hold total data and compute fixed while varying only the SFT/RL ordering and synchronization, the framework's predictive power for the final ceiling remains unverified.
  3. [Scaling analysis and guidelines] The scaling guidelines (transition at Stable/Mild Overfitting regime, data scale as primary driver, trajectory difficulty as multiplier) rest on the same untested additivity assumption across the reported models and math tasks; the manuscript should include cross-model and cross-task validation to show that the minimum-validation-loss indicator and regime recommendations generalize beyond the specific benchmark suite.
minor comments (2)
  1. [Abstract] The abstract provides no quantitative results, model sizes, benchmark names, or error bars, which hinders immediate assessment of effect sizes and statistical reliability.
  2. [Framework and notation] Clarify the precise operational definition of 'RL plasticity' (e.g., is it the absolute gain, relative gain, or normalized improvement) and how it is computed from the reported curves.
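For concreteness, the three readings the referee lists could be written as follows (notation ours, with A_max the benchmark's maximum attainable score):

```latex
% Candidate operationalizations of RL plasticity: absolute, relative,
% and normalized gain over the SFT foundation. All symbols are ours.
\Delta_{\mathrm{abs}} = A_{\mathrm{post}} - A_{\mathrm{SFT}}, \qquad
\Delta_{\mathrm{rel}} = \frac{A_{\mathrm{post}} - A_{\mathrm{SFT}}}{A_{\mathrm{SFT}}}, \qquad
\Delta_{\mathrm{norm}} = \frac{A_{\mathrm{post}} - A_{\mathrm{SFT}}}{A_{\max} - A_{\mathrm{SFT}}}
```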

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our experimental controls and outlining revisions to strengthen the empirical support for the Plasticity-Ceiling Framework and scaling guidelines.

Point-by-point responses
  1. Referee: The central claim that sequential SFT-then-RL is strictly superior (overcoming stability and convergence deficits) requires explicit confirmation that synchronized baselines received identical total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules; otherwise the reported advantages may be artifacts of unequal effective training budgets rather than ordering per se.

    Authors: We appreciate the referee's emphasis on ensuring fair experimental comparisons. In our original setup, the synchronized SFT+RL baselines were trained with exactly the same total gradient steps, replay-buffer size, data mixing ratios, and optimization schedules as the sequential SFT-then-RL pipeline. This matching was implemented to isolate the effect of training order. To eliminate any potential ambiguity, we will add a dedicated subsection in §4 with a hyperparameter comparison table and explicit confirmation of matched training budgets across all methods. revision: yes

  2. Referee: The Plasticity-Ceiling decomposition treats RL plasticity as an additive increment after SFT, but this additivity is not isolated from model-specific reward dynamics or trajectory overlap; without ablations that hold total data and compute fixed while varying only the SFT/RL ordering and synchronization, the framework's predictive power for the final ceiling remains unverified.

    Authors: The Plasticity-Ceiling Framework is an empirical decomposition based on observed performance ceilings rather than a claim of strict theoretical additivity. Our benchmarking across configurations shows consistent correlations between the decomposed components and final outcomes. We agree that further isolation is valuable. In the revision, we will incorporate new ablations that hold total data volume and compute budget fixed while varying only SFT/RL ordering and synchronization to more rigorously test the framework's predictive utility. revision: yes

  3. Referee: The scaling guidelines (transition at Stable/Mild Overfitting regime, data scale as primary driver, trajectory difficulty as multiplier) rest on the same untested additivity assumption across the reported models and math tasks; the manuscript should include cross-model and cross-task validation to show that the minimum-validation-loss indicator and regime recommendations generalize beyond the specific benchmark suite.

    Authors: The scaling guidelines and minimum-validation-loss indicator are derived from our primary benchmark suite. To strengthen claims of generalization, we will expand the revised manuscript with additional experiments across different model scales and supplementary mathematical reasoning tasks. These results will be presented to demonstrate the robustness of the regime recommendations and trajectory selection criterion beyond the current evaluation set. revision: yes

Circularity Check

0 steps flagged

Empirical benchmarking establishes claims without definitional circularity

Full rationale

The paper proposes the Plasticity-Ceiling Framework as an empirical decomposition of final performance into SFT foundation plus RL plasticity, then validates the sequential SFT-then-RL pipeline and scaling guidelines through extensive benchmarking on mathematical reasoning tasks. No load-bearing equations, self-definitions, or derivations are present that reduce by construction to fitted inputs or prior self-citations; the central claims rest on direct experimental comparisons of stability, convergence, and performance ceilings across regimes. The work is therefore self-contained as standard empirical analysis rather than a closed theoretical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework relies on the assumption that SFT and RL effects can be separated and measured independently through benchmarking.

axioms (1)
  • domain assumption Final performance ceiling decomposes additively into SFT performance and RL plasticity
    This is the core of the proposed framework as stated in the abstract.

pith-pipeline@v0.9.0 · 7475 in / 1051 out tokens · 67310 ms · 2026-05-16T22:39:29.481400+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 23 internal anchors


  2. [2]

    Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chr...

  3. [3]

    Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023

    Marwa Abdulhai, Isadora White, Charlie Snell, Charles Sun, Joey Hong, Yuexiang Zhai, Kelvin Xu, and Sergey Levine. Lmrl gym: Benchmarks for multi-turn reinforcement learning with language models, 2023. URL https://arxiv.org/abs/2311.18232

  4. [4]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023. URL https://arxiv.org/abs/2305.13245

  5. [5]

    Scaling laws for predicting downstream performance in llms, 2025

    Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, and Heng Ji. Scaling laws for predicting downstream performance in llms, 2025. URL https://arxiv.org/abs/2410.08527

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  8. [8]

    VERL utils: FLOPs counter (line 149)

    Volcano Engine. VERL utils: FLOPs counter (line 149). https://github.com/volcengine/verl/blob/59049a66/verl/utils/flops_counter.py#L149, 2023. Version 59049a6; accessed 2024-12-01

  9. [9]

    SRFT : A single-stage method with supervised and reinforcement fine-tuning for reasoning

    Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. SRFT : A single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv preprint arXiv:2506.19767, 2025. doi:10.48550/arXiv.2506.19767. URL https://arxiv.org/abs/2506.19767

  10. [10]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Team GLM, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bo...

  11. [11]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024

  12. [12]

    Skywork Open Reasoner 1 Technical Report

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025

  13. [13]

    Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning

    Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning, 2025. URL https://arxiv.org/abs/2507.00432

  14. [14]

    Robust Statistics

    P.J. Huber and E.M. Ronchetti. Robust Statistics. Wiley Series in Probability and Statistics. Wiley, 2011. ISBN 9781118210338. URL https://books.google.com.hk/books?id=j1OhquR_j88C

  15. [15]

    Math-verify

    Hugging Face. Math-Verify. https://github.com/huggingface/Math-Verify, 2024

  16. [16]

    How to detect and handle outliers, volume 16

    Boris Iglewicz and David C Hoaglin. How to detect and handle outliers, volume 16. Asqc Quality Press Milwaukee, WI, 1993

  17. [17]

    Quagmires in sft-rl post-training: When high sft scores mislead and what to use instead, 2025

    Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, and Newsha Ardalani. Quagmires in sft-rl post-training: When high sft scores mislead and what to use instead, 2025. URL https://arxiv.org/abs/2510.01624

  18. [18]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361

  19. [19]

    The Art of Scaling Reinforcement Learning Compute for LLMs

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms, 2025. URL https://arxiv.org/abs/2510.13786

  20. [20]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858

  21. [21]

    Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median

    Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4): 764–766, 2013. ISSN 0022-1031. doi:10.1016/j.jesp.2013.03.013. URL https://www.sciencedire...

  22. [22]

    Numinamath

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. NuminaMath. https://github.com/project-numina/aimo-progress-prize, …

  23. [23]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

  24. [24]

    Towards a unified view of large language model post-training, 2025

    Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, and Bowen Zhou. Towards a unified view of large language model post-training, 2025. URL https://arxiv.org/abs/2509.04419

  25. [25]

    Llama 3.2: Revolutionizing edge ai and vision with open, customizable models

    Meta AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/, September 2024. Meta AI blog; accessed 2025-04-13

  26. [26]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

  27. [27]

    Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021. URL https://arxiv.org/abs/2104.04473

  28. [28]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  29. [29]

    Qwen2.5 Technical Report

    Qwen Team: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  30. [30]

    Least median of squares regression

    Peter J Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388): 871–880, 1984

  31. [31]

    Computing LTS Regression for Large Data Sets

    Peter J. Rousseeuw and Katrien Driessen. Computing LTS regression for large data sets. Data Min. Knowl. Discov., 12(1): 29–45, January 2006. ISSN 1384-5810. doi:10.1007/s10618-005-0024-4. URL https://doi.org/10.1007/s10618-005-0024-4

  32. [32]

    Robust regression and outlier detection

    Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection. John Wiley & Sons, 1987

  33. [33]

    Observational Scaling Laws and the Predictability of Language Model Performance

    Yangjun Ruan, Chris J. Maddison, and Tatsunori Hashimoto. Observational scaling laws and the predictability of language model performance, 2024. URL https://arxiv.org/abs/2405.10938

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  35. [35]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl's razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259

  36. [36]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024

  37. [37]

    Climbing the ladder of reasoning: What llms can-and still can't-solve after sft?, 2025

    Yiyou Sun, Georgia Zhou, Hao Wang, Dacheng Li, Nouha Dziri, and Dawn Song. Climbing the ladder of reasoning: What llms can-and still can't-solve after sft?, 2025. URL https://arxiv.org/abs/2504.11741

  38. [38]

    All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

    Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, and J. Andrew Bagnell. All roads lead to likelihood: The value of reinforcement learning in fine-tuning, 2025. URL https://arxiv.org/abs/2503.01067

  39. [39]

    Deepdistill: Enhancing llm reasoning capabilities via large-scale difficulty-graded data training, 2025

    Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, and Xiangang Li. Deepdistill: Enhancing llm reasoning capabilities via large-scale difficulty-graded data training, 2025. URL https://arxiv.org/abs/2504.17565

  40. [40]

    Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving

    Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving, 2024. URL https://arxiv.org/abs/2407.13690. arXiv:2407.13690, cs.CL

  41. [41]

    How to train your LLM web agent: A statistical diagnosis

    Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Peñaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, and Massimo Caccia. How to train your LLM web agent: A s...

  42. [42]

    Implicit reward as the bridge: A unified view of sft and dpo connections, 2025

    Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, and Xipeng Qiu. Implicit reward as the bridge: A unified view of sft and dpo connections, 2025. URL https://arxiv.org/abs/2507.00018

  43. [43]

    Learning to Reason under Off-Policy Guidance

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance, 2025. URL https://arxiv.org/abs/2504.14945

  44. [44]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  45. [45]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  46. [46]

    LIMO: Less is More for Reasoning

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025. URL https://arxiv.org/abs/2502.03387

  47. [47]

    A practical two-stage recipe for mathematical llms: Maximizing accuracy with sft and efficiency with reinforcement learning, 2025

    Hiroshi Yoshihara, Taiki Yamaguchi, and Yuichi Inoue. A practical two-stage recipe for mathematical llms: Maximizing accuracy with sft and efficiency with reinforcement learning, 2025. URL https://arxiv.org/abs/2507.08267

  48. [48]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023

  49. [49]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  50. [50]

    D3: Diversity, difficulty, and dependability-aware data selection for sample-efficient llm instruction tuning, 2025a

    Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, and Lan-Zhe Guo. D3: Diversity, difficulty, and dependability-aware data selection for sample-efficient llm instruction tuning, 2025a. URL https://arxiv.org/abs/2503.11441

  51. [51]

    On-policy rl meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting, 2025b

    Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy rl meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting, 2025b. URL https://arxiv.org/abs/2508.11408

  52. [52]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025c

  53. [53]

    1.4 million open-source distilled reasoning dataset to empower large language model training, 2025

    Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, and Xiangang Li. 1.4 million open-source distilled reasoning dataset to empower large language model training, 2025. URL https://arxiv.org/abs/2503.19633