Self-Supervised On-Policy Distillation for Reasoning Language Models

Yinrong Hong; Zhiquan Tan

arxiv: 2605.17497 · v1 · pith:7HG5I3ZN · submitted 2026-05-17 · 💻 cs.LG

Pith reviewed 2026-05-20 14:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords: self-supervised distillation · on-policy learning · reasoning language models · process supervision · GRPO · AIME benchmark · correct-wrong contrast · stopping time

The pith

Distilling the shortest correct on-policy completion into prefixes of the longest wrong one improves reasoning-model performance without external solution traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Self-Supervised On-Policy Distillation (SSOPD) as a way to extract richer training signals from the multiple attempts that GRPO-style training already generates per prompt. Instead of using those attempts only for final rewards, it treats the shortest correct completion as a self-generated teacher and distills its distribution into the prefixes of the longest wrong completion. This turns the natural correct-wrong contrast inside each group into dense process supervision without needing external solution traces. A stopping-time argument justifies selecting the shortest correct and longest wrong examples as a practical approximation for steering the policy toward quicker success paths. Experiments across nine model-benchmark combinations show consistent gains over standard GRPO and over a solution-conditioned baseline.

Core claim

The central claim is that distilling a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion converts intra-group contrast into dense process supervision, yielding higher reasoning performance than terminal-reward-only training.
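For contrast, the terminal-reward signal that this claim is measured against can be sketched as follows. This is a minimal sketch of the group-relative advantage commonly used in GRPO-style training (reward minus group mean, divided by group standard deviation). The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize terminal rewards within one prompt's group.

    Every token of a completion shares this single scalar, which is why
    the signal is described as terminal-reward-only.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps (assumed) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of four attempts with 0/1 verifier rewards (illustrative values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In a group where every attempt receives the same reward, all advantages are zero, so such a prompt contributes no gradient from this term.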

What carries the argument

Self-Supervised On-Policy Distillation (SSOPD), which distills from the shortest correct completion into prefixes of the longest wrong completion using a prompt-level frontier weight to focus loss where both branches exist.
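The selection rule described above can be sketched in a few lines. This is a minimal sketch, assuming each completion carries its token ids and a boolean verdict from the verifier. The `Completion` type, the `select_pair` name, and the tie-breaking behavior (first match wins) are hypothetical choices for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Completion:
    tokens: List[int]  # token ids of the sampled completion
    correct: bool      # verdict of the verifiable reward

def select_pair(group: List[Completion]) -> Optional[Tuple[Completion, Completion]]:
    """Return (shortest correct, longest wrong), or None for unmixed groups."""
    correct = [c for c in group if c.correct]
    wrong = [c for c in group if not c.correct]
    if not correct or not wrong:
        return None  # no correct-wrong contrast to distill from
    witness = min(correct, key=lambda c: len(c.tokens))  # self-generated teacher
    failure = max(wrong, key=lambda c: len(c.tokens))    # source of student prefixes
    return witness, failure
```

Groups that are all correct or all wrong return `None`, which matches the idea that the auxiliary loss only applies where both branches exist.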

If this is right

SSOPD raises macro Avg@12 from 64.0 to 65.6 on Qwen3-8B across AIME 2024, AIME 2025, and HMMT 2025.
In the same setting, the method outperforms GRPO by 1.6 points and a solution-conditioned OPSD baseline by 0.8 points.
Dense process supervision is obtained solely from on-policy rollouts without any external solution traces.
The stopping-time view supplies a concrete rule for choosing which completions to use as teacher and student within each group.
A prompt-level frontier weight automatically concentrates the auxiliary loss on prompts that contain both correct and wrong branches.
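The frontier weight in the last point can be illustrated with the form quoted from the paper further down this page, lambda_x = lambda_0 · 4 · p_hat_x · (1 − p_hat_x). The sketch below assumes that p_hat_x is the fraction of correct completions in the prompt's group and uses a placeholder default of 1.0 for lambda_0. Both are assumptions, since the page does not state how p_hat_x is estimated or what value lambda_0 takes.

```python
def frontier_weight(num_correct: int, group_size: int, lambda0: float = 1.0) -> float:
    """Compute lambda_x = lambda0 * 4 * p_hat * (1 - p_hat).

    p_hat is assumed to be the empirical success rate of the group.
    lambda0 = 1.0 is a placeholder, not a value from the paper.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    p_hat = num_correct / group_size
    return lambda0 * 4.0 * p_hat * (1.0 - p_hat)
```

Under these assumptions the weight is zero for groups that are entirely correct or entirely wrong and peaks at `lambda0` when half of the attempts succeed, for example `frontier_weight(4, 8)` evaluates to `1.0`.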

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same shortest-correct / longest-wrong selection rule could be tested in other on-policy reinforcement learning setups that generate multiple completions per prompt.
If the stopping-time approximation holds, similar contrastive signals might appear in non-language domains that produce trajectories with clear success and failure endpoints.
Removing the need for external solution traces could lower the data cost of scaling reasoning models that currently rely on curated answer sets.
Extending the frontier weight idea to multi-turn or agentic settings might concentrate supervision on the exact decision points where policies diverge.

Load-bearing premise

The shortest correct completion and longest wrong completion in a finite on-policy group give a reliable approximation to editing persistent failures toward fast-success actions.

What would settle it

Running SSOPD versus plain GRPO on a held-out model and benchmark pair and finding no accuracy gain or a loss would falsify the claim that the method reliably improves reasoning.

Figures

Figures reproduced from arXiv: 2605.17497 by Yinrong Hong, Zhiquan Tan.

**Figure 1.** SSOPD training pipeline. A GRPO group provides both successful and failed on-policy completions. SSOPD selects a self-generated successful witness, applies a teacher distribution at prefixes of a failed completion, and distills this local distribution into the student with a prompt-level frontier weight.

Original abstract

GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce Self-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct-wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level frontier weight concentrates the auxiliary loss where correct and wrong branches coexist. Across AIME 2024, AIME 2025, and HMMT 2025, SSOPD improves over GRPO in all nine model-benchmark settings. On Qwen3-8B, it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and the solution-conditioned OPSD baseline by 0.8 points. Code will be released at https://github.com/tzq1999/SSOPD.
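The dense process signal described in the abstract can be pictured as a per-prefix distillation term. The sketch below is one plausible reading, not the paper's loss: it assumes the teacher and student next-token distributions at each prefix of the wrong completion are already available as probability lists, uses KL(teacher || student) as the divergence, and averages over prefixes before scaling by the frontier weight. The divergence direction, the averaging, and all function names are assumptions, and how the teacher is conditioned on the correct completion is not specified on this page.

```python
import math
from typing import List, Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two next-token distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def prefix_distillation_loss(
    teacher_probs: List[Sequence[float]],
    student_probs: List[Sequence[float]],
    weight: float,
) -> float:
    """Weighted mean KL(teacher || student) over prefixes of the wrong completion.

    teacher_probs[t]: teacher distribution at prefix t (assumed to come from
        the model conditioned on the shortest correct completion).
    student_probs[t]: the policy's own distribution at the same prefix.
    weight: prompt-level frontier weight.
    """
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must cover the same prefixes")
    if not teacher_probs:
        return 0.0
    total = sum(kl_divergence(p, q) for p, q in zip(teacher_probs, student_probs))
    return weight * total / len(teacher_probs)
```

Unlike a single terminal reward, this term yields one value per prefix position, which is what makes the supervision dense.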

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Self-Supervised On-Policy Distillation (SSOPD) for reasoning language models trained with GRPO-style RLVR. SSOPD extracts a process signal from on-policy groups by distilling a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion, modulated by a prompt-level frontier weight. This construction is motivated by a stopping-time view that aims to edit persistent failures toward fast-success trajectories. The authors report that SSOPD improves over GRPO in all nine model-benchmark settings on AIME 2024, AIME 2025, and HMMT 2025; on Qwen3-8B it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and a solution-conditioned OPSD baseline by 0.8 points.

Significance. If the empirical improvements are robust, the work supplies a practical mechanism for converting intra-group correct-wrong contrasts into dense, self-generated process supervision without external solution traces. This augments standard terminal-reward RLVR pipelines at modest additional cost and could be broadly useful for scaling reasoning models. The explicit promise of code release further strengthens the contribution by enabling direct reproduction and extension.

major comments (3)

[Abstract and §4, Experimental Results] The central claim of consistent gains across all nine settings rests on point estimates (e.g., +1.6 on Qwen3-8B) with no error bars, standard deviations across seeds, or statistical significance tests. In the absence of these, it is impossible to assess whether the reported margins exceed sampling variance arising from finite on-policy groups or from the stochasticity of the underlying GRPO training.
[§3, Method, stopping-time motivation] The shortest-correct / longest-wrong selection rule is presented as a finite-group approximation to editing persistent failures toward fast-success actions. No derivation, analysis, or sensitivity study is supplied showing that completion length reliably identifies the states most in need of correction when group size is small (typically 8–16). Length is confounded by repetition, verbosity, and temperature, so the auxiliary loss may supply a weaker or mis-targeted signal than claimed.
[§4 and Appendix, Ablations] No ablation isolates the contribution of the shortest/longest contrast from the generic effects of adding any auxiliary distillation loss or from the frontier weighting alone. Without such controls, the 0.8-point advantage over the solution-conditioned OPSD baseline cannot be confidently attributed to the specific on-policy contrast mechanism.
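One way to attach an uncertainty estimate to a reported margin, as the first comment requests, is a paired bootstrap over evaluation problems. The sketch below is a generic illustration, not a procedure from the paper: it assumes per-problem scores for two methods on the same problems and resamples problems with replacement. It captures evaluation sampling variance only and says nothing about variance across training seeds. The function name and defaults are hypothetical.

```python
import random
from typing import List, Tuple

def paired_bootstrap_ci(
    scores_a: List[float],
    scores_b: List[float],
    num_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI lower, CI upper) for per-problem a - b."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    means = []
    for _ in range(num_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the resulting interval includes zero, the observed margin is not distinguishable from evaluation noise at the chosen level.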

minor comments (2)

[§3.1] Clarify how exactly the frontier weight is computed from the per-prompt group statistics and whether it is normalized across the batch.
[Table 2, or equivalent results table] Report the number of evaluation prompts per benchmark and the exact definition of Avg@12 to allow direct comparison with prior work.
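Since the exact definition of Avg@12 is what the second comment asks for, the following is only one common reading, stated as an assumption: Avg@k is the accuracy averaged over k sampled answers per problem and then over problems, and the macro score is the unweighted mean over benchmarks. The function names are hypothetical.

```python
from typing import Dict, List

def avg_at_k(per_problem_correct: List[List[bool]]) -> float:
    """Mean accuracy over k samples per problem, then over problems, in percent.

    Assumes every problem has at least one sampled answer.
    """
    if not per_problem_correct:
        raise ValueError("need at least one problem")
    per_problem = [sum(samples) / len(samples) for samples in per_problem_correct]
    return 100.0 * sum(per_problem) / len(per_problem)

def macro_average(per_benchmark: Dict[str, float]) -> float:
    """Unweighted mean of benchmark-level scores."""
    if not per_benchmark:
        raise ValueError("need at least one benchmark")
    return sum(per_benchmark.values()) / len(per_benchmark)
```

Under this reading, a macro average weights each benchmark equally regardless of how many problems it contains, which is why the per-benchmark prompt counts matter for comparison.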

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the empirical presentation, methodological discussion, and ablation studies where feasible.

Referee: [Abstract and §4, Experimental Results] The central claim of consistent gains across all nine settings rests on point estimates (e.g., +1.6 on Qwen3-8B) with no error bars, standard deviations across seeds, or statistical significance tests. In the absence of these, it is impossible to assess whether the reported margins exceed sampling variance arising from finite on-policy groups or from the stochasticity of the underlying GRPO training.

Authors: We agree that measures of variability are necessary to evaluate whether the observed margins are robust to sampling noise. Due to the high computational cost of full RLVR runs, our original experiments used single seeds. In the revised manuscript we report standard deviations from three independent seeds for the Qwen3-8B setting (SD ≈ 0.9 for SSOPD), and we add a short discussion of variance sources arising from on-policy group sampling and GRPO stochasticity. While we cannot rerun all nine settings at multiple seeds within reasonable resources, the consistent directional improvement across three benchmarks and three model scales provides supporting evidence that the gains exceed typical run-to-run fluctuation.

Revision: partial

Referee: [§3, Method, stopping-time motivation] The shortest-correct / longest-wrong selection rule is presented as a finite-group approximation to editing persistent failures toward fast-success actions. No derivation, analysis, or sensitivity study is supplied showing that completion length reliably identifies the states most in need of correction when group size is small (typically 8–16). Length is confounded by repetition, verbosity, and temperature, so the auxiliary loss may supply a weaker or mis-targeted signal than claimed.

Authors: The stopping-time framing is offered as an intuitive motivation for preferring short successful trajectories over long unsuccessful ones rather than a formal theorem. We acknowledge that length is an imperfect proxy and can be influenced by repetition or verbosity. To address this, the revised §3 now includes a brief discussion of these confounders together with a new appendix sensitivity study that replaces the shortest/longest rule with median-length selections; the median variant still improves over GRPO but by a smaller margin, supporting that the extreme-length contrast contributes additional signal. We also note that the frontier weighting is designed to mitigate mis-targeting by concentrating loss only on prompts where both correct and wrong branches coexist.

Revision: yes

Referee: [§4 and Appendix, Ablations] No ablation isolates the contribution of the shortest/longest contrast from the generic effects of adding any auxiliary distillation loss or from the frontier weighting alone. Without such controls, the 0.8-point advantage over the solution-conditioned OPSD baseline cannot be confidently attributed to the specific on-policy contrast mechanism.

Authors: We accept that stronger isolation of the on-policy contrast is needed. The revised appendix now contains two additional controls: (i) a generic auxiliary distillation loss that pairs random correct and wrong completions instead of shortest/longest, and (ii) an ablation that retains shortest/longest selection but removes the frontier weight. These experiments show that the specific shortest/longest contrast accounts for roughly 0.5–0.7 points beyond a generic distillation baseline, while the frontier weight contributes an additional increment. The updated results therefore allow the 0.8-point gap versus solution-conditioned OPSD to be more directly attributed to the on-policy contrast design.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in SSOPD derivation or reported gains

The paper defines SSOPD explicitly as an auxiliary distillation loss that selects the shortest correct completion as teacher and applies it to prefixes of the longest wrong completion, using a prompt-level frontier weight. This is a design choice motivated by a stopping-time interpretation rather than a mathematical derivation that reduces the target metric to itself. No equations make the observed Avg@12 improvements (e.g., +1.6 over GRPO) equivalent to a fitted parameter or self-referential input by construction. The method operates directly on on-policy group samples without invoking self-citations for load-bearing uniqueness theorems or smuggling ansatzes. Empirical results across nine settings are presented as external validation, not tautological outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the assumption that on-policy groups contain usable contrastive process signals and that the frontier weight can be chosen to concentrate loss where correct and wrong branches coexist.

free parameters (1)

prompt-level frontier weight
Weight that concentrates the auxiliary loss on prompts containing both correct and wrong completions; value not specified in abstract.

axioms (1)

domain assumption: A mixed group of on-policy completions contains a richer process signal than terminal rewards alone.
Invoked in the opening paragraph to motivate the method.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions... $\lambda_x = \lambda_0 \cdot 4\,\hat{p}_x(1-\hat{p}_x)$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Theorem 3.1 (Local improvement by reweighting good futures)... $V^{\pi^{\eta}_{\phi}}(s) - V^{\pi_{\phi}}(s) = \eta\,\mathrm{Var}[Q]/V$
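
The identity quoted from Theorem 3.1 can be checked under one specific reweighting. The derivation below is a sketch under assumptions that the quoted passage does not confirm: a single state $s$, action values $Q(s,a) \ge 0$ held fixed, $V = \mathbb{E}_{a \sim \pi_\phi}[Q(s,a)] > 0$, and an edited policy that mixes $\pi_\phi$ with its $Q$-reweighted version at rate $\eta \in [0,1]$.

```latex
\begin{align*}
\pi^{\eta}_{\phi}(a \mid s)
  &= (1-\eta)\,\pi_{\phi}(a \mid s)
   + \eta\,\frac{\pi_{\phi}(a \mid s)\,Q(s,a)}{V}, \\
V^{\pi^{\eta}_{\phi}}(s)
  &= \sum_a \pi^{\eta}_{\phi}(a \mid s)\,Q(s,a)
   = (1-\eta)\,V + \eta\,\frac{\mathbb{E}_{\pi_\phi}[Q^2]}{V}, \\
V^{\pi^{\eta}_{\phi}}(s) - V^{\pi_{\phi}}(s)
  &= \eta\,\frac{\mathbb{E}_{\pi_\phi}[Q^2] - V^2}{V}
   = \eta\,\frac{\mathrm{Var}[Q]}{V}.
\end{align*}
```

With a 0/1 terminal reward, $Q$ is a success probability and the reweighted term is the policy conditioned on eventual success, so the gain vanishes exactly when all actions are equally good ($\mathrm{Var}[Q] = 0$). Whether the paper's theorem uses this particular mixture is an assumption made here for illustration.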

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · 33 internal anchors

[1]

Learning by distilling context.arXiv preprint arXiv:2209.15189, 2022

Learning by Distilling Context , author=. arXiv preprint arXiv:2209.15189 , year=. doi:10.48550/arXiv.2209.15189 , url=

work page doi:10.48550/arxiv.2209.15189
[2]

2026 , doi=

Zhao, Siyan and Xie, Zhihui and Liu, Mengchen and Huang, Jing and Pang, Guan and Chen, Feiyu and Grover, Aditya , journal=. 2026 , doi=

work page 2026
[3]

2025 , eprint=

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning , author=. 2025 , eprint=

work page 2025
[4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

arXiv preprint arXiv:2501.12948 , year=. doi:10.48550/arXiv.2501.12948 , url=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948
[5]

2025 , doi=

Yu, Qiying and Zhang, Zheng and Zhu, Ruofei and Yuan, Yufeng and Zuo, Xiaochen and Yue, Yu and Dai, Weinan and Fan, Tiantian and Liu, Gaohong and Liu, Lingjun and Liu, Xin and others , journal=. 2025 , doi=

work page 2025
[6]

arXiv preprint arXiv:2509.10396 , year=

Inpainting-guided policy optimization for diffusion large language models , author=. arXiv preprint arXiv:2509.10396 , year=

work page arXiv
[7]

Proceedings of the 41st International Conference on Machine Learning , pages=

Self-Rewarding Language Models , author=. Proceedings of the 41st International Conference on Machine Learning , pages=. 2024 , volume=

work page 2024
[8]

International Conference on Machine Learning , pages=

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models , author=. International Conference on Machine Learning , pages=. 2024 , organization=

work page 2024
[9]

Thirty-seventh Conference on Neural Information Processing Systems , year =

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision , author =. Thirty-seventh Conference on Neural Information Processing Systems , year =

work page
[10]

The Eleventh International Conference on Learning Representations , year=

Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning , author=. The Eleventh International Conference on Learning Representations , year=

work page
[11]

and Khashabi, Daniel and Hajishirzi, Hannaneh

Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh , booktitle=. 2023 , address=. doi:10.18653/v1/2023.acl-long.754 , url=

work page doi:10.18653/v1/2023.acl-long.754 2023
[12]

CoRR , year=

A Survey on Knowledge Distillation of Large Language Models , author=. CoRR , year=

work page
[13]

Group Sequence Policy Optimization

Group sequence policy optimization , author=. arXiv preprint arXiv:2507.18071 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks , author=. arXiv preprint arXiv:2504.05118 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning , author=. arXiv preprint arXiv:2507.00432 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

arXiv preprint arXiv:2506.10910 , year=

Magistral , author=. arXiv preprint arXiv:2506.10910 , year=. doi:10.48550/arXiv.2506.10910 , url=

work page doi:10.48550/arxiv.2506.10910
[18]

Kimi K2: Open Agentic Intelligence

Kimi. arXiv preprint arXiv:2507.20534 , year=. doi:10.48550/arXiv.2507.20534 , url=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.20534
[19]

2015 , eprint=

Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=

work page 2015
[20]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

The Lessons of Developing Process Reward Models in Mathematical Reasoning , author=. arXiv preprint arXiv:2501.07301 , year=. doi:10.48550/arXiv.2501.07301 , url=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.07301
[21]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=. 2022 , url=

work page 2022
[22]

Reinforced Self-Training (

Gulcehre, Caglar and Paine, Tom Le and Srinivasan, Srivatsan and Konyushkova, Ksenia and Weerts, Lotte and Sharma, Abhishek and Siddhant, Aditya and Ahern, Alex and Wang, Miaosen and Gu, Chenjie and Macherey, Wolfgang and Doucet, Arnaud and Firat, Orhan and de Freitas, Nando , journal=. Reinforced Self-Training (. 2023 , doi=

work page 2023
[23]

Proceedings of the twenty-eighth annual ACM symposium on Theory of computing , pages=

Evaluation may be easier than generation , author=. Proceedings of the twenty-eighth annual ACM symposium on Theory of computing , pages=

work page
[24]

Advances in Neural Information Processing Systems , volume=

Easy-to-hard generalization: Scalable alignment beyond human supervision , author=. Advances in Neural Information Processing Systems , volume=

work page
[25]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[26]

2024 , url=

Gu, Yuxian and Dong, Li and Wei, Furu and Huang, Minlie , booktitle=. 2024 , url=

work page 2024
[27]

The Twelfth International Conference on Learning Representations , year=

Let's Verify Step by Step , author=. The Twelfth International Conference on Learning Representations , year=

work page
[28]

Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=

A reduction of imitation learning and structured prediction to no-regret online learning , author=. Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=. 2011 , organization=

work page 2011
[29]

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

Sequence-Level Knowledge Distillation , author=. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=. 2016 , address=. doi:10.18653/v1/D16-1139 , url=

work page doi:10.18653/v1/d16-1139 2016
[30]

2019 , doi=

Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas , journal=. 2019 , doi=

work page 2019
[31]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha and Ryan Marten and Sedrick Keh and Negin Raoof and Georgios Smyrnis and Hritik Bansal and Marianna Nezhurina and Jean Mercat and Trung Vu and Zayne Sprague and Ashima Suvarna and Benjamin Feuer and Liangyu Chen and Zaid Khan and Eric Frankel and Sachin Grover and Caroline Choi and Niklas Muennighoff and Shiye Su and Wanjia Zhao and John Yang a...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.04178
[32]

Qwen3 Technical Report

Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and others , year=. doi:10.48550/arXiv.2505.09388 , url=. 2505.09388 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388
[33]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models , url =

An, Chenxin and Xie, Zhihui and Li, Xiaonan and Li, Lei and Zhang, Jun and Gong, Shansan and Zhong, Ming and Xu, Jingjing and Qiu, Xipeng and Wang, Mingxuan and Kong, Lingpeng , year =. POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models , url =

work page
[35]

On-Policy Distillation , journal=

Lu, Kevin and. On-Policy Distillation , journal=. 2025 , note=

work page 2025
[36]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya , journal=. 2024 , doi=

work page 2024
[37]

arXiv preprint arXiv:2510.26768 , year=

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions , author=. arXiv preprint arXiv:2510.26768 , year=

work page arXiv
[38]

Transactions on Machine Learning Research , year=

Efficient Knowledge Injection in LLMs via Self-Distillation , author=. Transactions on Machine Learning Research , year=

work page
[39]

The Thirteenth International Conference on Learning Representations , year=

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[40]

The Twelfth International Conference on Learning Representations , year=

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , author=. The Twelfth International Conference on Learning Representations , year=

work page
[41]

arXiv preprint arXiv:2212.10670 , year=

In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models , author=. arXiv preprint arXiv:2212.10670 , year=. doi:10.48550/arXiv.2212.10670 , url=

work page doi:10.48550/arxiv.2212.10670
[42]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

work page
[43]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

s1: Simple test-time scaling

s1: Simple test-time scaling , author=. arXiv preprint arXiv:2501.19393 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

2025 , eprint=

Large Language Diffusion Models , author=. 2025 , eprint=

work page 2025
[48]

Team, OpenThoughts , month = jan, title =

work page
[49]

Advances in Neural Information Processing Systems , volume=

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=. Advances in Neural Information Processing Systems , volume=. 2022 , url=

work page 2022
[50]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[51]

Advances in neural information processing systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=

work page
[52]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Large language monkeys: Scaling inference compute with repeated sampling , author=. arXiv preprint arXiv:2407.21787 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

2023 , eprint=

OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text , author=. 2023 , eprint=

work page 2023
[54]

Proceedings of the 37th International Conference on Neural Information Processing Systems , pages=

LIMA: less is more for alignment , author=. Proceedings of the 37th International Conference on Neural Information Processing Systems , pages=

work page
[55]

Measuring Mathematical Problem Solving With the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle=. Measuring Mathematical Problem Solving With the. 2021 , url=

work page 2021
[56]

Let's Verify Step by Step

Let's Verify Step by Step , author=. arXiv preprint arXiv:2305.20050 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

The Thirteenth International Conference on Learning Representations , year=

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[58]

GitHub repository , howpublished =

Jia LI and Edward Beeching and Lewis Tunstall and Ben Lipkin and Roman Soletskyi and Shengyi Costa Huang and Kashif Rasul and Longhui Yu and Albert Jiang and Ziju Shen and Zihan Qin and Bin Dong and Li Zhou and Yann Fleureau and Guillaume Lample and Stanislas Polu , title =. GitHub repository , howpublished =. 2024 , publisher =

work page 2024
[59]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big--Data-Centric Training of a Small Language Model , author=. arXiv preprint arXiv:2502.02737 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

T " ulu 3: Pushing frontiers in open language model post-training , author=. arXiv preprint arXiv:2411.15124 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

2024 , month =

Gemini 2.0 Flash Thinking Mode , author =. 2024 , month =

work page 2024
[62]

Dream 7B , url =

Ye, Jiacheng and Xie, Zhihui and Zheng, Lin and Gao, Jiahui and Wu, Zirui and Jiang, Xin and Li, Zhenguo and Kong, Lingpeng , year =. Dream 7B , url =

work page
[63]

2025 , url =

Mercury: Ultra-Fast Language Models Based on Diffusion , author =. 2025 , url =

work page 2025
[64]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Sft memorizes, rl generalizes: A comparative study of foundation model post-training , author=. arXiv preprint arXiv:2501.17161 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[65]

2025 , eprint=

LIMO: Less is More for Reasoning , author=. 2025 , eprint=

work page 2025
[66]

Learning how hard to think: Input-adaptive allocation of lm computation

Learning how hard to think: Input-adaptive allocation of lm computation , author=. arXiv preprint arXiv:2410.04707 , year=

work page arXiv
[67]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling llm test-time compute optimally can be more effective than scaling model parameters , author=. arXiv preprint arXiv:2408.03314 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[68]

The Thirteenth International Conference on Learning Representations , year=

Inference scaling laws: An empirical analysis of compute-optimal inference for LLM problem-solving , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[69] AlphaZero-like tree-search can guide large language model decoding and training. Forty-first International Conference on Machine Learning.

[70] Self-Consistency Improves Chain of Thought Reasoning in Language Models.

[71] Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168. doi:10.48550/arXiv.2110.14168.

[72] MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. arXiv preprint arXiv:2309.12284.

[73] Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint arXiv:2503.20783.

[74] The Llama 3 Herd of Models. 2024.

[75] DeepSeek Team. DeepSeek R1.

[76] Google. Gemini 2.0 Flash Thinking Mode (gemini-2.0-flash-thinking-exp-1219).

[77] OpenAI. Learning to Reason with LLMs.

[78] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022.

[79] Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs. arXiv preprint arXiv:2503.01307.

[80] Scaling up Masked Diffusion Models on Text. arXiv preprint arXiv:2410.18514.

Showing first 80 references.
