pith. machine review for the scientific record.

arxiv: 2603.11178 · v3 · submitted 2026-03-11 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:07 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM distillation · self-distillation · math reasoning · problem weighting · curriculum learning · Beta kernel · gradient SNR · forgetting reduction

The pith

Weighting distillation problems by student pass rate p(1-p) focuses training on the competence frontier and improves math benchmark scores by up to 8.2 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard LLM distillation wastes compute on problems the student has already mastered or cannot yet solve. The paper shows this waste has a clear signature: the gradient signal-to-noise ratio across problems forms a bell curve that peaks at intermediate pass rates and collapses at the extremes. PACED therefore weights each problem by w(p) = p(1-p), where p is the student's empirical pass rate on that problem. The resulting schedule requires only student rollouts, introduces no new hyperparameters, and raises performance on MATH-500, AIME 2024, and AIME 2025 while cutting forgetting to 1.4 percent in distillation and 0.6 percent in self-distillation. A two-stage forward-then-reverse KL variant adds a further +5.8 points over standard forward KL on the hardest benchmark.

Core claim

PACED weights each training problem by w(p) = p(1-p) where p is the student's empirical pass rate, thereby concentrating gradient updates on problems that sit at the frontier of current competence. The authors prove that the Beta kernel family w(p) = p^α(1-p)^β is the leading-order optimal weighting under the observed SNR boundary-collapse structure and remains minimax-robust to misspecification, with worst-case efficiency loss O(δ²). Across Qwen3, Qwen2.5, and Llama-3 families the method sets new state-of-the-art numbers, within the authors' experimental setting, on MATH-500, AIME 2024, and AIME 2025, delivering gains of up to +8.2 over unweighted distillation and +3.6 over the AKL baseline while reducing forgetting to 1.4 percent in distillation and 0.6 percent in self-distillation.

What carries the argument

The Beta kernel weighting w(p) = p(1-p), which multiplies each problem's loss by the product of its pass rate and failure rate to emphasize intermediate competence.
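
A minimal sketch of how such weights could be computed from rollout statistics alone. The function name and the mean-one rescaling are assumptions made here for illustration; the Beta-kernel form and the α = β = 1 default correspond to the paper's stated w(p) = p(1-p).

```python
import numpy as np

def paced_weights(correct_counts, num_rollouts, alpha=1.0, beta=1.0):
    """Per-problem weights w(p) = p^alpha (1-p)^beta from student rollouts.

    correct_counts: number of correct rollouts per problem.
    num_rollouts:   rollouts sampled per problem (K).
    alpha, beta:    Beta-kernel exponents; alpha = beta = 1 gives the paper's
                    default w(p) = p(1-p).
    """
    p = np.asarray(correct_counts, dtype=float) / num_rollouts  # empirical pass rate
    w = p ** alpha * (1.0 - p) ** beta                          # Beta-kernel weight
    # Rescale to mean 1 (an assumption here) so the effective learning rate is
    # unchanged; the paper absorbs the overall scale into the learning rate.
    mean = w.mean()
    return w / mean if mean > 0 else w

# Example: five problems, K = 8 rollouts each.
counts = [0, 2, 4, 7, 8]
print(paced_weights(counts, 8))  # zero weight at p = 0 and p = 1, peak near p = 0.5
```

On the example counts, mastered (p = 1) and unsolved (p = 0) problems receive zero weight, and the problem nearest p = 0.5 receives the largest.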

If this is right

  • New state-of-the-art results on MATH-500, AIME 2024, and AIME 2025 across Qwen and Llama model families.
  • Forgetting drops to 1.4 percent during distillation and 0.6 percent during self-distillation.
  • A two-stage forward-then-reverse KL schedule adds up to +5.8 points on the hardest benchmarks over standard forward KL.
  • Gains of up to +3.6 over the strong AKL baseline are obtained with no architectural changes.
  • The method uses only student rollouts and requires no extra hyperparameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same weighting could be applied to non-math reasoning domains where problem difficulty varies across a student's competence range.
  • Combining PACED with other curriculum or difficulty-scheduling techniques may produce further efficiency gains.
  • The bell-curve SNR pattern may appear in reinforcement learning from human feedback or other on-policy training settings.
  • Testing the approach on models larger than those in the current experiments would check whether the Beta kernel remains near-optimal at scale.

Load-bearing premise

The cross-problem gradient signal-to-noise ratio follows a bell curve over student pass rate, collapsing at both extremes.

What would settle it

Measuring the gradient SNR across problems on a fresh math benchmark and checking whether the bell-curve shape appears; if it does not, the optimality derivation for the Beta kernel would not hold.
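
One way to run that check is sketched below, under assumptions the review does not pin down: the cross-problem SNR estimator used here (squared norm of the mean gradient over the mean squared deviation within each pass-rate bin) and the equal-width binning are our choices; the paper's exact estimator may differ.

```python
import numpy as np

def binned_gradient_snr(per_problem_grads, pass_rates, num_bins=10):
    """Cross-problem gradient SNR per equal-width pass-rate bin.

    per_problem_grads: (N, D) array, one flattened gradient per problem
                       (e.g. from its fixed teacher reference, as in Figure 2).
    pass_rates:        (N,) empirical pass rates estimated from student rollouts.

    SNR is taken here as ||mean gradient||^2 divided by the mean squared
    deviation around that mean; the paper's exact estimator may differ.
    """
    g = np.asarray(per_problem_grads, dtype=float)
    p = np.asarray(pass_rates, dtype=float)
    bin_idx = np.minimum((p * num_bins).astype(int), num_bins - 1)
    snr = np.full(num_bins, np.nan)
    for b in range(num_bins):
        gb = g[bin_idx == b]
        if len(gb) < 2:
            continue  # not enough problems in this bin to estimate noise
        mu = gb.mean(axis=0)
        signal = float(mu @ mu)
        noise = float(np.mean(np.sum((gb - mu) ** 2, axis=1)))
        if noise > 0:
            snr[b] = signal / noise
    # Normalize by the bin maximum, as in the paper's Figure 2, before checking
    # whether the profile traces the bell curve the PACED derivation assumes.
    if np.isfinite(snr).any():
        snr = snr / np.nanmax(snr)
    return snr
```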

Figures

Figures reproduced from arXiv: 2603.11178 by Hejian Sang, Ran He, Yuanda Xu, Zhengze Zhou, Zhipeng Wang.

Figure 1. Overview of PACED. Left: the pipeline, in which an expert provides reference solutions and the student learns via a distillation loss weighted by pass rate. Right: the Beta-kernel weighting w(p) = p^α(1−p)^β concentrates training on the zone of proximal development, suppressing trivial and intractable problems.

Figure 2. Cross-problem gradient SNR vs. student pass rate at two training stages (Qwen3-1.7B, forward KL). Left: step 1. Right: step 20. Each problem contributes one gradient (from its fixed teacher reference); K=10 student rollouts are used only for pass-rate estimation. Problems are grouped into equal-width pass-rate bins, and empirical and theoretical values are normalized by the respective bin maximum.

Figure 3. Prompt example for student and teacher policies. Both policies share the same model family but differ in conditioning context: the teacher receives the expert solution y_E as additional context, while the student receives only the original problem. This contextual asymmetry enables black-box expert guidance to be transferred into white-box teacher logits for distillation.
Original abstract

Standard LLM distillation treats all training problems equally -- wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1{-}p)$ where $p$ is the student's empirical pass rate -- concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove the Beta kernel $w(p) = p^\alpha(1{-}p)^\beta$ is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and is minimax-robust under misspecification (worst-case efficiency loss $O(\delta^2)$). Across Qwen3, Qwen2.5, and Llama-3 families, PACED sets a new state of the art in our experimental setting on MATH-500, AIME~2024, and AIME~2025, improving over unweighted distillation by up to $\mathbf{+8.2}$ and over the strong AKL baseline by up to $\mathbf{+3.6}$, while reducing forgetting to $\mathbf{1.4\%}$ and $\mathbf{0.6\%}$ in distillation and self-distillation. A two-stage forward-then-reverse KL schedule pushes gains further to $\mathbf{+5.8}$ over standard forward KL on the hardest benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces PACED, a weighting scheme for LLM distillation and self-distillation that assigns weights w(p) = p(1-p) to training problems based on the student's empirical pass rate p. This is motivated by an empirical observation that the gradient signal-to-noise ratio (SNR) across problems follows a bell curve over p, collapsing at low and high pass rates. The authors claim to prove that the Beta kernel family is the leading-order optimal weighting arising from this SNR structure and is minimax-robust to misspecification with O(δ²) loss. Empirically, PACED achieves new state-of-the-art results on MATH-500, AIME 2024, and AIME 2025 across Qwen3, Qwen2.5, and Llama-3 model families, with improvements of up to +8.2 over unweighted distillation and +3.6 over the AKL baseline, while reducing forgetting to 1.4% and 0.6%.

Significance. If the SNR-based optimality holds, PACED offers a simple, hyperparameter-free improvement to distillation efficiency by focusing compute on the zone of proximal development. The reported gains across multiple model families and the reduction in forgetting are substantial and would represent a meaningful advance for LLM training pipelines if reproducible. The minimax-robustness claim adds theoretical value if the underlying SNR functional form is validated.

major comments (3)
  1. [§3] §3 (SNR optimality derivation): the claim that the observed bell-curve SNR structure implies the Beta kernel w(p)=p(1-p) is leading-order optimal requires an explicit step-by-step derivation showing how SNR(p) ∝ p(1-p) produces this weighting; if the empirical SNR is asymmetric or contains higher-order terms once pass-rate estimator variance is included, the specific functional form loses its justification.
  2. [§4] §4 (Empirical SNR analysis): the central assumption that SNR follows a symmetric bell curve peaking near p=0.5 with collapse at extremes must be supported by quantitative plots, statistics, and error bars that account for finite-sample variance in the pass-rate estimator p; without this, the optimality argument rests on an unverified functional form.
  3. [Results] Results section (Tables 1-3): the attribution of the +3.6 gain over AKL and +8.2 over unweighted to the theoretical construction (rather than incidental concentration effects) needs an ablation comparing the Beta kernel to other weightings that achieve similar focus but deviate from the SNR-derived form.
minor comments (3)
  1. [Abstract] Abstract: the phrase 'in our experimental setting' when claiming SOTA should be replaced with a precise description of the training data, rollout budget, and evaluation protocol.
  2. [Methods] Notation: the definition of empirical pass rate p (including number of rollouts per problem and handling of ties) should be stated explicitly in the methods section.
  3. [§5] §5 (two-stage KL schedule): details on how the forward-then-reverse KL interacts with the PACED weighting function are needed to understand whether the additional +5.8 gain is independent of the Beta kernel.
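
For orientation, the chain of steps that major comment 1 asks to see spelled out can be sketched schematically. The boundary-collapse form below follows the paper's appendix statement SNR²(p) ∝ p^{a′}(1−p)^{b′} with a bounded remainder; taking the optimal weight proportional to that squared SNR is an assumption about the proof's structure made here, not a quotation of it.

```latex
% Schematic reconstruction of the step requested in major comment 1; not the paper's proof.
% Assumed: boundary collapse of the squared SNR, and the optimal weight taken
% proportional to it; the leading-order (maximum-parsimony) step then drops r.
\begin{align*}
  \mathrm{SNR}^2(p) &= p^{a'}(1-p)^{b'}\, e^{r(p)}, \qquad r \text{ bounded} \\
  w^{*}(p) &\propto \mathrm{SNR}^2(p) \approx p^{a'}(1-p)^{b'} \\
  w^{*}(p) &= p(1-p) \quad \text{when } a' = b' = 1 .
\end{align*}
```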

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We appreciate the opportunity to clarify the theoretical foundations and strengthen the empirical validation of PACED. Below we address each major comment point by point, indicating the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (SNR optimality derivation): the claim that the observed bell-curve SNR structure implies the Beta kernel w(p)=p(1-p) is leading-order optimal requires an explicit step-by-step derivation showing how SNR(p) ∝ p(1-p) produces this weighting; if the empirical SNR is asymmetric or contains higher-order terms once pass-rate estimator variance is included, the specific functional form loses its justification.

    Authors: We will add an explicit step-by-step derivation in §3 showing how the SNR(p) ∝ p(1-p) structure leads to the leading-order optimality of the Beta kernel w(p)=p(1-p). The derivation starts from the gradient variance and signal terms, derives the optimal weighting as proportional to SNR(p), and shows that under the boundary collapse assumption, the Beta(2,2) kernel (i.e., p(1-p)) emerges as the minimax-robust choice. Regarding potential asymmetry or higher-order terms, we will include a discussion noting that the leading-order approximation holds even with moderate asymmetry, with the robustness bound O(δ²) covering misspecification. We will also add a note on the pass-rate estimator variance and its impact. revision: yes

  2. Referee: [§4] §4 (Empirical SNR analysis): the central assumption that SNR follows a symmetric bell curve peaking near p=0.5 with collapse at extremes must be supported by quantitative plots, statistics, and error bars that account for finite-sample variance in the pass-rate estimator p; without this, the optimality argument rests on an unverified functional form.

    Authors: We agree that the empirical validation needs strengthening. In the revised manuscript, we will include quantitative plots of SNR vs. p with error bars computed via bootstrap resampling to account for finite-sample variance in the pass-rate estimator. We will report statistics such as the peak location, symmetry measures, and goodness-of-fit to the bell curve. This will be added to §4, confirming the symmetric collapse at extremes. revision: yes

  3. Referee: [Results] Results section (Tables 1-3): the attribution of the +3.6 gain over AKL and +8.2 over unweighted to the theoretical construction (rather than incidental concentration effects) needs an ablation comparing the Beta kernel to other weightings that achieve similar focus but deviate from the SNR-derived form.

    Authors: To address this, we will add an ablation study in the results section comparing the Beta kernel w(p)=p(1-p) to other concentration-based weightings, such as a Gaussian kernel centered at p=0.5 with similar variance, and a uniform weighting over a focused interval [0.2,0.8]. This will demonstrate that the specific SNR-derived form provides additional gains beyond mere concentration, supporting the theoretical attribution. The ablation will be included in Tables 1-3 or as a new table. revision: yes
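
A minimal sketch of what such an ablation harness could look like, assuming the two comparison weightings the response names (a Gaussian bump centered at p = 0.5 and a uniform window on [0.2, 0.8]); only the Beta kernel itself is the paper's weighting.

```python
import numpy as np

def beta_kernel(p, alpha=1.0, beta=1.0):
    """The paper's weighting: w(p) = p^alpha (1-p)^beta."""
    return p ** alpha * (1.0 - p) ** beta

def gaussian_kernel(p, center=0.5, width=0.2):
    """Concentration control (assumed here): Gaussian bump centered at p = 0.5."""
    return np.exp(-0.5 * ((p - center) / width) ** 2)

def windowed_uniform(p, lo=0.2, hi=0.8):
    """Concentration control (assumed here): uniform weight on [lo, hi], zero outside."""
    return ((p >= lo) & (p <= hi)).astype(float)

def to_mean_one(w):
    """Rescale weights to mean 1 so every run sees the same effective learning rate."""
    m = w.mean()
    return w / m if m > 0 else w

# Pass rates would come from K student rollouts per problem before training.
pass_rates = np.array([0.0, 0.125, 0.25, 0.5, 0.75, 0.875, 1.0])
for name, kernel in [("beta", beta_kernel), ("gaussian", gaussian_kernel), ("window", windowed_uniform)]:
    print(f"{name:8s}", np.round(to_mean_one(kernel(pass_rates)), 3))
```

All three concentrate weight on intermediate pass rates; the ablation is informative only if the Beta form outperforms the two controls, which would tie the reported gains to the SNR-derived construction rather than to concentration alone.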

Circularity Check

0 steps flagged

No circularity: SNR observation independently motivates Beta weighting whose optimality is proved from that structure, with gains reported as separate empirical results

Full rationale

The paper first reports an empirical observation that cross-problem gradient SNR follows a bell curve over student pass rate p, collapsing at extremes. It then defines w(p)=p(1-p) and proves the Beta family is leading-order optimal for any such boundary-collapse structure (with O(δ²) minimax robustness). The reported SOTA gains (+8.2 over unweighted, +3.6 over AKL) and forgetting reductions are presented as experimental outcomes on MATH-500/AIME, not as quantities derived from the weighting by construction. No equation equates the final performance metric to the input SNR fit; the theoretical step takes the observed functional signature as given and derives the weight family from it, which is a non-circular empirical-to-analytic pipeline. No self-citation chains or fitted-input-as-prediction patterns appear in the provided derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about gradient SNR behavior; no free parameters are introduced and no new entities are postulated.

axioms (1)
  • domain assumption: Cross-problem gradient signal-to-noise ratio follows a bell curve over student pass rate, collapsing at both extremes.
    Stated as an empirical observation that motivates the weighting choice.

pith-pipeline@v0.9.0 · 5607 in / 1242 out tokens · 47789 ms · 2026-05-15T13:07:09.088847+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

  2. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  3. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

    cs.LG 2026-05 unverdicted novelty 6.0

    Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

  4. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 4 Pith papers · 13 internal anchors
