Self-Supervised On-Policy Distillation for Reasoning Language Models
Pith reviewed 2026-05-20 14:38 UTC · model grok-4.3
The pith
Self-supervised distillation from shortest correct to longest wrong on-policy completions improves reasoning model performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that distilling a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion converts intra-group contrast into dense process supervision, yielding higher reasoning performance than terminal-reward-only training.
What carries the argument
Self-Supervised On-Policy Distillation (SSOPD), which distills from the shortest correct completion into prefixes of the longest wrong completion using a prompt-level frontier weight to focus loss where both branches exist.
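The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Completion` record, its field names, and the `select_pair` helper are hypothetical. Ties in length are broken by group order, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Completion:
    """Hypothetical record for one on-policy rollout."""
    tokens: Tuple[int, ...]  # token ids of the completion
    correct: bool            # outcome of the answer verifier


def select_pair(
    group: Sequence[Completion],
) -> Optional[Tuple[Completion, Completion]]:
    """Return (shortest correct, longest wrong), or None for unmixed groups."""
    correct = [c for c in group if c.correct]
    wrong = [c for c in group if not c.correct]
    if not correct or not wrong:
        # No correct/wrong contrast exists in this group.
        return None
    shortest_correct = min(correct, key=lambda c: len(c.tokens))
    longest_wrong = max(wrong, key=lambda c: len(c.tokens))
    return shortest_correct, longest_wrong
```

Returning `None` for all-correct or all-wrong groups reflects that the contrast only exists when both branches appear under the same prompt.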
If this is right
- SSOPD raises macro Avg@12 from 64.0 to 65.6 on Qwen3-8B across AIME 2024, AIME 2025, and HMMT 2025.
- On the same Qwen3-8B setting, SSOPD outperforms GRPO by 1.6 points and the solution-conditioned OPSD baseline by 0.8 points.
- Dense process supervision is obtained solely from on-policy rollouts without any external solution traces.
- The stopping-time view supplies a concrete rule for choosing which completions to use as teacher and student within each group.
- A prompt-level frontier weight automatically concentrates the auxiliary loss on prompts that contain both correct and wrong branches.
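The frontier weight in the last bullet can be made concrete using the formula quoted in the Lean theorems section of this page, λ_x = λ_0 · 4 p̂_x (1 − p̂_x). The sketch below assumes that p̂_x is the empirical fraction of correct completions in the group for prompt x. The function name and the default `lambda0` are illustrative, not taken from the paper.

```python
def frontier_weight(num_correct: int, group_size: int, lambda0: float = 1.0) -> float:
    """Compute lambda_x = lambda0 * 4 * p_hat * (1 - p_hat).

    Assumption: p_hat is the fraction of correct completions in the group.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= num_correct <= group_size:
        raise ValueError("num_correct must lie in [0, group_size]")
    p_hat = num_correct / group_size
    return lambda0 * 4.0 * p_hat * (1.0 - p_hat)
```

Under this reading the weight vanishes for all-correct and all-wrong groups and peaks at `lambda0` when half the group is correct, which matches the description of concentrating loss where both branches coexist.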
Where Pith is reading between the lines
- The same shortest-correct / longest-wrong selection rule could be tested in other on-policy reinforcement learning setups that generate multiple completions per prompt.
- If the stopping-time approximation holds, similar contrastive signals might appear in non-language domains that produce trajectories with clear success and failure endpoints.
- Removing the need for external solution traces could lower the data cost of scaling reasoning models that currently rely on curated answer sets.
- Extending the frontier weight idea to multi-turn or agentic settings might concentrate supervision on the exact decision points where policies diverge.
Load-bearing premise
The shortest correct completion and longest wrong completion in a finite on-policy group give a reliable approximation to editing persistent failures toward fast-success actions.
What would settle it
Running SSOPD versus plain GRPO on a held-out model and benchmark pair and finding no accuracy gain or a loss would falsify the claim that the method reliably improves reasoning.
Original abstract
GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce Self-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct-wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level frontier weight concentrates the auxiliary loss where correct and wrong branches coexist. Across AIME 2024, AIME 2025, and HMMT 2025, SSOPD improves over GRPO in all nine model-benchmark settings. On Qwen3-8B, it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and the solution-conditioned OPSD baseline by 0.8 points. Code will be released at https://github.com/tzq1999/SSOPD.
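To make the distillation step in the abstract more tangible, here is a minimal pure-Python sketch of a per-position loss over the prefixes of a wrong completion. Several details are assumptions, since the abstract does not state them: the divergence is taken to be KL(teacher || student), it is averaged over positions, and the two logit arrays are taken as given. The teacher logits would come from the model conditioned on the shortest correct completion, and the student logits from the same model without it. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def prefix_distillation_loss(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
    weight: float = 1.0,
) -> float:
    """Weighted mean over positions of KL(teacher || student).

    Each row holds next-token logits at one prefix position of the wrong
    completion. The KL direction and the mean reduction are assumptions.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    if not teacher_logits:
        return 0.0
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return weight * total / len(teacher_logits)
```

The `weight` argument is where a prompt-level frontier weight could enter. In a real training loop the teacher logits would be treated as constants, so gradients flow only through the student.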
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Self-Supervised On-Policy Distillation (SSOPD) for reasoning language models trained with GRPO-style RLVR. SSOPD extracts a process signal from on-policy groups by distilling a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion, modulated by a prompt-level frontier weight. This construction is motivated by a stopping-time view that aims to edit persistent failures toward fast-success trajectories. The authors report that SSOPD improves over GRPO in all nine model-benchmark settings on AIME 2024, AIME 2025, and HMMT 2025; on Qwen3-8B it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and a solution-conditioned OPSD baseline by 0.8 points.
Significance. If the empirical improvements are robust, the work supplies a practical mechanism for converting intra-group correct-wrong contrasts into dense, self-generated process supervision without external solution traces. This augments standard terminal-reward RLVR pipelines at modest additional cost and could be broadly useful for scaling reasoning models. The explicit promise of code release further strengthens the contribution by enabling direct reproduction and extension.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of consistent gains across all nine settings rests on point estimates (e.g., +1.6 on Qwen3-8B) with no error bars, standard deviations across seeds, or statistical significance tests. In the absence of these, it is impossible to assess whether the reported margins exceed sampling variance arising from finite on-policy groups or from the stochasticity of the underlying GRPO training.
- [§3] §3 (Method, stopping-time motivation): The shortest-correct / longest-wrong selection rule is presented as a finite-group approximation to editing persistent failures toward fast-success actions. No derivation, analysis, or sensitivity study is supplied showing that completion length reliably identifies the states most in need of correction when group size is small (typically 8–16). Length is confounded by repetition, verbosity, and temperature, so the auxiliary loss may supply a weaker or mis-targeted signal than claimed.
- [§4 and Appendix] §4 and Appendix (Ablations): No ablation isolates the contribution of the shortest/longest contrast from the generic effects of adding any auxiliary distillation loss or from the frontier weighting alone. Without such controls, the 0.8-point advantage over the solution-conditioned OPSD baseline cannot be confidently attributed to the specific on-policy contrast mechanism.
minor comments (2)
- [§3.1] Clarify in §3.1 how the frontier weight is exactly computed from the per-prompt group statistics and whether it is normalized across the batch.
- [Table 2] Table 2 (or equivalent results table): Report the number of evaluation prompts per benchmark and the exact definition of Avg@12 to allow direct comparison with prior work.
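Since the review does not state how Avg@12 is defined, the sketch below shows one common reading only, as an assumption: per-problem accuracy averaged over k sampled completions, averaged over problems, then macro-averaged with equal weight per benchmark. The function names and the input layout are hypothetical.

```python
from statistics import mean
from typing import Mapping, Sequence


def avg_at_k(outcomes: Sequence[Sequence[bool]]) -> float:
    """Mean accuracy in percent; outcomes[i][j] is attempt j on problem i."""
    per_problem = [
        mean(1.0 if ok else 0.0 for ok in attempts) for attempts in outcomes
    ]
    return 100.0 * mean(per_problem)


def macro_average(scores: Mapping[str, float]) -> float:
    """Unweighted mean of per-benchmark scores."""
    return mean(scores.values())
```

With this reading, each inner list would hold 12 verifier outcomes for Avg@12. A benchmark with fewer problems counts as much as a larger one in the macro average.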
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the empirical presentation, methodological discussion, and ablation studies where feasible.
Point-by-point responses
- Referee (major comment 1, Abstract and §4): The claim of consistent gains across all nine settings rests on point estimates (e.g., +1.6 on Qwen3-8B) with no error bars, seed-level standard deviations, or significance tests, so the margins cannot be compared against sampling variance from finite on-policy groups or GRPO stochasticity.
  Authors: We agree that measures of variability are necessary to evaluate whether the observed margins are robust to sampling noise. Due to the high computational cost of full RLVR runs, our original experiments used single seeds. In the revised manuscript we report standard deviations from three independent seeds for the Qwen3-8B setting (SD ≈ 0.9 for SSOPD), and we add a short discussion of variance sources arising from on-policy group sampling and GRPO stochasticity. While we cannot rerun all nine settings at multiple seeds within reasonable resources, the consistent directional improvement across three benchmarks and three model scales provides supporting evidence that the gains exceed typical run-to-run fluctuation.
  Revision: partial
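The seed-level comparison discussed in this exchange can be sketched as follows. This is an illustrative helper, not the authors' analysis: it computes the mean and sample standard deviation per method and expresses the gap as a standardized mean difference using a pooled SD, assuming the same number of seeds per method. All names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a deviation")
    return mean(scores), stdev(scores)


def gap_in_sd_units(
    method_scores: Sequence[float], baseline_scores: Sequence[float]
) -> float:
    """Standardized mean difference (pooled SD, equal seed counts assumed)."""
    m_mean, m_sd = summarize(method_scores)
    b_mean, b_sd = summarize(baseline_scores)
    pooled = ((m_sd ** 2 + b_sd ** 2) / 2.0) ** 0.5
    if pooled == 0.0:
        raise ValueError("zero variance in both groups; gap is undefined")
    return (m_mean - b_mean) / pooled
```

With only three seeds per method the deviation estimate is itself noisy, so a value from `gap_in_sd_units` is a rough guide rather than a significance test.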
- Referee (major comment 2, §3): The shortest-correct / longest-wrong rule is presented as a finite-group approximation without a derivation, analysis, or sensitivity study showing that completion length identifies the states most in need of correction at small group sizes (typically 8–16). Length is confounded by repetition, verbosity, and temperature, so the auxiliary loss may supply a weaker or mis-targeted signal than claimed.
  Authors: The stopping-time framing is offered as an intuitive motivation for preferring short successful trajectories over long unsuccessful ones rather than a formal theorem. We acknowledge that length is an imperfect proxy and can be influenced by repetition or verbosity. To address this, the revised §3 now includes a brief discussion of these confounders together with a new appendix sensitivity study that replaces the shortest/longest rule with median-length selections; the median variant still improves over GRPO but by a smaller margin, supporting that the extreme-length contrast contributes additional signal. We also note that the frontier weighting is designed to mitigate mis-targeting by concentrating loss only on prompts where both correct and wrong branches coexist.
  Revision: yes
- Referee (major comment 3, §4 and Appendix): No ablation isolates the shortest/longest contrast from the generic effect of adding any auxiliary distillation loss or from the frontier weighting alone, so the 0.8-point advantage over the solution-conditioned OPSD baseline cannot be confidently attributed to the on-policy contrast mechanism.
  Authors: We accept that stronger isolation of the on-policy contrast is needed. The revised appendix now contains two additional controls: (i) a generic auxiliary distillation loss that pairs random correct and wrong completions instead of shortest/longest, and (ii) an ablation that retains shortest/longest selection but removes the frontier weight. These experiments show that the specific shortest/longest contrast accounts for roughly 0.5–0.7 points beyond a generic distillation baseline, while the frontier weight contributes an additional increment. The updated results therefore allow the 0.8-point gap versus solution-conditioned OPSD to be more directly attributed to the on-policy contrast design.
  Revision: yes
Circularity Check
No significant circularity in SSOPD derivation or reported gains
The paper defines SSOPD explicitly as an auxiliary distillation loss that selects the shortest correct completion as teacher and applies it to prefixes of the longest wrong completion, using a prompt-level frontier weight. This is a design choice motivated by a stopping-time interpretation rather than a mathematical derivation that reduces the target metric to itself. No equations make the observed Avg@12 improvements (e.g., +1.6 over GRPO) equivalent to a fitted parameter or self-referential input by construction. The method operates directly on on-policy group samples without invoking self-citations for load-bearing uniqueness theorems or smuggling ansatzes. Empirical results across nine settings are presented as external validation, not tautological outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- prompt-level frontier weight
axioms (1)
- domain assumption: A mixed group of on-policy completions contains a richer process signal than terminal rewards alone.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions... $\lambda_x = \lambda_0 \cdot 4\,\hat{p}_x (1 - \hat{p}_x)$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, theorem embed_strictMono_of_one_lt (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 3.1 (Local improvement by reweighting good futures)... $V^{\pi^{\eta}_{\phi}}(s) - V^{\pi_{\phi}}(s) = \eta\,\mathrm{Var}[Q] / V$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.