pith. machine review for the scientific record.

arxiv: 2603.11178 · v3 · submitted 2026-03-11 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:07 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM distillation · self-distillation · math reasoning · problem weighting · curriculum learning · Beta kernel · gradient SNR · forgetting reduction

The pith

Weighting distillation problems by student pass rate p(1-p) focuses training on the competence frontier and improves math benchmark scores by up to 8.2 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard LLM distillation wastes compute on problems the student has already mastered or cannot yet solve. The paper shows this waste has a clear signature: the gradient signal-to-noise ratio across problems forms a bell curve that peaks at intermediate pass rates and collapses at the extremes. PACED therefore weights each problem by w(p) = p(1-p), where p is the student's empirical pass rate on that problem. The resulting schedule requires only student rollouts, introduces no new hyperparameters, and raises performance on MATH-500, AIME 2024, and AIME 2025 while cutting forgetting to 1.4 percent in distillation and 0.6 percent in self-distillation. A two-stage forward-then-reverse KL variant adds a further +5.8 points over standard forward KL on the hardest benchmark.

Core claim

PACED weights each training problem by w(p) = p(1-p) where p is the student's empirical pass rate, thereby concentrating gradient updates on problems that sit at the frontier of current competence. The authors prove that the Beta kernel family w(p) = p^α(1-p)^β is the leading-order optimal weighting under the observed SNR boundary-collapse structure and remains minimax-robust to misspecification, with worst-case efficiency loss O(δ²). Across Qwen3, Qwen2.5, and Llama-3 families the method sets new state-of-the-art numbers, within the authors' experimental setting, on MATH-500, AIME 2024, and AIME 2025, delivering gains of up to +8.2 over unweighted distillation and +3.6 over the AKL baseline while reducing forgetting to 1.4 percent in distillation and 0.6 percent in self-distillation.

What carries the argument

The Beta kernel weighting w(p) = p(1-p), which multiplies each problem's loss by the product of its pass rate and failure rate to emphasize intermediate competence.
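
A minimal sketch of how such weights could be computed from rollout statistics alone. The function name and the mean-one rescaling are assumptions made here for illustration; the Beta-kernel form and the α = β = 1 default correspond to the paper's stated w(p) = p(1-p).

```python
import numpy as np

def paced_weights(correct_counts, num_rollouts, alpha=1.0, beta=1.0):
    """Per-problem weights w(p) = p^alpha (1-p)^beta from student rollouts.

    correct_counts: number of correct rollouts per problem.
    num_rollouts:   rollouts sampled per problem (K).
    alpha, beta:    Beta-kernel exponents; alpha = beta = 1 gives the paper's
                    default w(p) = p(1-p).
    """
    p = np.asarray(correct_counts, dtype=float) / num_rollouts  # empirical pass rate
    w = p ** alpha * (1.0 - p) ** beta                          # Beta-kernel weight
    # Rescale to mean 1 (an assumption here) so the effective learning rate is
    # unchanged; the paper absorbs the overall scale into the learning rate.
    mean = w.mean()
    return w / mean if mean > 0 else w

# Example: five problems, K = 8 rollouts each.
counts = [0, 2, 4, 7, 8]
print(paced_weights(counts, 8))  # zero weight at p = 0 and p = 1, peak near p = 0.5
```

On the example counts, mastered (p = 1) and unsolved (p = 0) problems receive zero weight, and the problem nearest p = 0.5 receives the largest.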

If this is right

  • New state-of-the-art results on MATH-500, AIME 2024, and AIME 2025 across Qwen and Llama model families.
  • Forgetting drops to 1.4 percent during distillation and 0.6 percent during self-distillation.
  • A two-stage forward-then-reverse KL schedule adds up to +5.8 points on the hardest benchmarks over standard forward KL.
  • Gains of up to +3.6 over the strong AKL baseline are obtained with no architectural changes.
  • The method uses only student rollouts and requires no extra hyperparameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same weighting could be applied to non-math reasoning domains where problem difficulty varies across a student's competence range.
  • Combining PACED with other curriculum or difficulty-scheduling techniques may produce further efficiency gains.
  • The bell-curve SNR pattern may appear in reinforcement learning from human feedback or other on-policy training settings.
  • Testing the approach on models larger than those in the current experiments would check whether the Beta kernel remains near-optimal at scale.

Load-bearing premise

The cross-problem gradient signal-to-noise ratio follows a bell curve over student pass rate, collapsing at both extremes.

What would settle it

Measuring the gradient SNR across problems on a fresh math benchmark and checking whether the bell-curve shape appears; if it does not, the optimality derivation for the Beta kernel would not hold.
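
One way to run that check is sketched below, under assumptions the review does not pin down: the cross-problem SNR estimator used here (squared norm of the mean gradient over the mean squared deviation within each pass-rate bin) and the equal-width binning are our choices; the paper's exact estimator may differ.

```python
import numpy as np

def binned_gradient_snr(per_problem_grads, pass_rates, num_bins=10):
    """Cross-problem gradient SNR per equal-width pass-rate bin.

    per_problem_grads: (N, D) array, one flattened gradient per problem
                       (e.g. from its fixed teacher reference, as in Figure 2).
    pass_rates:        (N,) empirical pass rates estimated from student rollouts.

    SNR is taken here as ||mean gradient||^2 divided by the mean squared
    deviation around that mean; the paper's exact estimator may differ.
    """
    g = np.asarray(per_problem_grads, dtype=float)
    p = np.asarray(pass_rates, dtype=float)
    bin_idx = np.minimum((p * num_bins).astype(int), num_bins - 1)
    snr = np.full(num_bins, np.nan)
    for b in range(num_bins):
        gb = g[bin_idx == b]
        if len(gb) < 2:
            continue  # not enough problems in this bin to estimate noise
        mu = gb.mean(axis=0)
        signal = float(mu @ mu)
        noise = float(np.mean(np.sum((gb - mu) ** 2, axis=1)))
        if noise > 0:
            snr[b] = signal / noise
    # Normalize by the bin maximum, as in the paper's Figure 2, before checking
    # whether the profile traces the bell curve the PACED derivation assumes.
    if np.isfinite(snr).any():
        snr = snr / np.nanmax(snr)
    return snr
```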

Figures

Figures reproduced from arXiv: 2603.11178 by Hejian Sang, Ran He, Yuanda Xu, Zhengze Zhou, Zhipeng Wang.

Figure 1. Overview of PACED. Left: the pipeline, in which an expert provides reference solutions and the student learns via a distillation loss weighted by pass rate. Right: the Beta-kernel weighting w(p) = p^α(1−p)^β concentrates training on the zone of proximal development, suppressing trivial and intractable problems.

Figure 2. Cross-problem gradient SNR vs. student pass rate at two training stages (Qwen3-1.7B, forward KL). Left: step 1. Right: step 20. Each problem contributes one gradient (from its fixed teacher reference); K=10 student rollouts are used only for pass-rate estimation. Problems are grouped into equal-width pass-rate bins, and empirical and theoretical values are normalized by the respective bin maximum.

Figure 3. Prompt example for student and teacher policies. Both policies share the same model family but differ in conditioning context: the teacher receives the expert solution y_E as additional context, while the student receives only the original problem. This contextual asymmetry enables black-box expert guidance to be transferred into white-box teacher logits for distillation.
Original abstract

Standard LLM distillation treats all training problems equally -- wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1{-}p)$ where $p$ is the student's empirical pass rate -- concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove the Beta kernel $w(p) = p^\alpha(1{-}p)^\beta$ is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and is minimax-robust under misspecification (worst-case efficiency loss $O(\delta^2)$). Across Qwen3, Qwen2.5, and Llama-3 families, PACED sets a new state of the art in our experimental setting on MATH-500, AIME~2024, and AIME~2025, improving over unweighted distillation by up to $\mathbf{+8.2}$ and over the strong AKL baseline by up to $\mathbf{+3.6}$, while reducing forgetting to $\mathbf{1.4\%}$ and $\mathbf{0.6\%}$ in distillation and self-distillation. A two-stage forward-then-reverse KL schedule pushes gains further to $\mathbf{+5.8}$ over standard forward KL on the hardest benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces PACED, a weighting scheme for LLM distillation and self-distillation that assigns weights w(p) = p(1-p) to training problems based on the student's empirical pass rate p. This is motivated by an empirical observation that the gradient signal-to-noise ratio (SNR) across problems follows a bell curve over p, collapsing at low and high pass rates. The authors claim to prove that the Beta kernel family is the leading-order optimal weighting arising from this SNR structure and is minimax-robust to misspecification with O(δ²) loss. Empirically, PACED achieves new state-of-the-art results on MATH-500, AIME 2024, and AIME 2025 across Qwen3, Qwen2.5, and Llama-3 model families, with improvements of up to +8.2 over unweighted distillation and +3.6 over the AKL baseline, while reducing forgetting to 1.4% and 0.6%.

Significance. If the SNR-based optimality holds, PACED offers a simple, hyperparameter-free improvement to distillation efficiency by focusing compute on the zone of proximal development. The reported gains across multiple model families and the reduction in forgetting are substantial and would represent a meaningful advance for LLM training pipelines if reproducible. The minimax-robustness claim adds theoretical value if the underlying SNR functional form is validated.

major comments (3)
  1. [§3] §3 (SNR optimality derivation): the claim that the observed bell-curve SNR structure implies the Beta kernel w(p)=p(1-p) is leading-order optimal requires an explicit step-by-step derivation showing how SNR(p) ∝ p(1-p) produces this weighting; if the empirical SNR is asymmetric or contains higher-order terms once pass-rate estimator variance is included, the specific functional form loses its justification.
  2. [§4] §4 (Empirical SNR analysis): the central assumption that SNR follows a symmetric bell curve peaking near p=0.5 with collapse at extremes must be supported by quantitative plots, statistics, and error bars that account for finite-sample variance in the pass-rate estimator p; without this, the optimality argument rests on an unverified functional form.
  3. [Results] Results section (Tables 1-3): the attribution of the +3.6 gain over AKL and +8.2 over unweighted to the theoretical construction (rather than incidental concentration effects) needs an ablation comparing the Beta kernel to other weightings that achieve similar focus but deviate from the SNR-derived form.
minor comments (3)
  1. [Abstract] Abstract: the phrase 'in our experimental setting' when claiming SOTA should be replaced with a precise description of the training data, rollout budget, and evaluation protocol.
  2. [Methods] Notation: the definition of empirical pass rate p (including number of rollouts per problem and handling of ties) should be stated explicitly in the methods section.
  3. [§5] §5 (two-stage KL schedule): details on how the forward-then-reverse KL interacts with the PACED weighting function are needed to understand whether the additional +5.8 gain is independent of the Beta kernel.
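
For orientation, the chain of steps that major comment 1 asks to see spelled out can be sketched schematically. The boundary-collapse form below follows the paper's appendix statement SNR²(p) ∝ p^{a′}(1−p)^{b′} with a bounded remainder; taking the optimal weight proportional to that squared SNR is an assumption about the proof's structure made here, not a quotation of it.

```latex
% Schematic reconstruction of the step requested in major comment 1; not the paper's proof.
% Assumed: boundary collapse of the squared SNR, and the optimal weight taken
% proportional to it; the leading-order (maximum-parsimony) step then drops r.
\begin{align*}
  \mathrm{SNR}^2(p) &= p^{a'}(1-p)^{b'}\, e^{r(p)}, \qquad r \text{ bounded} \\
  w^{*}(p) &\propto \mathrm{SNR}^2(p) \approx p^{a'}(1-p)^{b'} \\
  w^{*}(p) &= p(1-p) \quad \text{when } a' = b' = 1 .
\end{align*}
```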

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We appreciate the opportunity to clarify the theoretical foundations and strengthen the empirical validation of PACED. Below we address each major comment point by point, indicating the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (SNR optimality derivation): the claim that the observed bell-curve SNR structure implies the Beta kernel w(p)=p(1-p) is leading-order optimal requires an explicit step-by-step derivation showing how SNR(p) ∝ p(1-p) produces this weighting; if the empirical SNR is asymmetric or contains higher-order terms once pass-rate estimator variance is included, the specific functional form loses its justification.

    Authors: We will add an explicit step-by-step derivation in §3 showing how the SNR(p) ∝ p(1-p) structure leads to the leading-order optimality of the Beta kernel w(p)=p(1-p). The derivation starts from the gradient variance and signal terms, derives the optimal weighting as proportional to SNR(p), and shows that under the boundary collapse assumption, the Beta(2,2) kernel (i.e., p(1-p)) emerges as the minimax-robust choice. Regarding potential asymmetry or higher-order terms, we will include a discussion noting that the leading-order approximation holds even with moderate asymmetry, with the robustness bound O(δ²) covering misspecification. We will also add a note on the pass-rate estimator variance and its impact. revision: yes

  2. Referee: [§4] §4 (Empirical SNR analysis): the central assumption that SNR follows a symmetric bell curve peaking near p=0.5 with collapse at extremes must be supported by quantitative plots, statistics, and error bars that account for finite-sample variance in the pass-rate estimator p; without this, the optimality argument rests on an unverified functional form.

    Authors: We agree that the empirical validation needs strengthening. In the revised manuscript, we will include quantitative plots of SNR vs. p with error bars computed via bootstrap resampling to account for finite-sample variance in the pass-rate estimator. We will report statistics such as the peak location, symmetry measures, and goodness-of-fit to the bell curve. This will be added to §4, confirming the symmetric collapse at extremes. revision: yes

  3. Referee: [Results] Results section (Tables 1-3): the attribution of the +3.6 gain over AKL and +8.2 over unweighted to the theoretical construction (rather than incidental concentration effects) needs an ablation comparing the Beta kernel to other weightings that achieve similar focus but deviate from the SNR-derived form.

    Authors: To address this, we will add an ablation study in the results section comparing the Beta kernel w(p)=p(1-p) to other concentration-based weightings, such as a Gaussian kernel centered at p=0.5 with similar variance, and a uniform weighting over a focused interval [0.2,0.8]. This will demonstrate that the specific SNR-derived form provides additional gains beyond mere concentration, supporting the theoretical attribution. The ablation will be included in Tables 1-3 or as a new table. revision: yes
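
A minimal sketch of what such an ablation harness could look like, assuming the two comparison weightings the response names (a Gaussian bump centered at p = 0.5 and a uniform window on [0.2, 0.8]); only the Beta kernel itself is the paper's weighting.

```python
import numpy as np

def beta_kernel(p, alpha=1.0, beta=1.0):
    """The paper's weighting: w(p) = p^alpha (1-p)^beta."""
    return p ** alpha * (1.0 - p) ** beta

def gaussian_kernel(p, center=0.5, width=0.2):
    """Concentration control (assumed here): Gaussian bump centered at p = 0.5."""
    return np.exp(-0.5 * ((p - center) / width) ** 2)

def windowed_uniform(p, lo=0.2, hi=0.8):
    """Concentration control (assumed here): uniform weight on [lo, hi], zero outside."""
    return ((p >= lo) & (p <= hi)).astype(float)

def to_mean_one(w):
    """Rescale weights to mean 1 so every run sees the same effective learning rate."""
    m = w.mean()
    return w / m if m > 0 else w

# Pass rates would come from K student rollouts per problem before training.
pass_rates = np.array([0.0, 0.125, 0.25, 0.5, 0.75, 0.875, 1.0])
for name, kernel in [("beta", beta_kernel), ("gaussian", gaussian_kernel), ("window", windowed_uniform)]:
    print(f"{name:8s}", np.round(to_mean_one(kernel(pass_rates)), 3))
```

All three concentrate weight on intermediate pass rates; the ablation is informative only if the Beta form outperforms the two controls, which would tie the reported gains to the SNR-derived construction rather than to concentration alone.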

Circularity Check

0 steps flagged

No circularity: SNR observation independently motivates Beta weighting whose optimality is proved from that structure, with gains reported as separate empirical results

Full rationale

The paper first reports an empirical observation that cross-problem gradient SNR follows a bell curve over student pass rate p, collapsing at extremes. It then defines w(p)=p(1-p) and proves the Beta family is leading-order optimal for any such boundary-collapse structure (with O(δ²) minimax robustness). The reported SOTA gains (+8.2 over unweighted, +3.6 over AKL) and forgetting reductions are presented as experimental outcomes on MATH-500/AIME, not as quantities derived from the weighting by construction. No equation equates the final performance metric to the input SNR fit; the theoretical step takes the observed functional signature as given and derives the weight family from it, which is a non-circular empirical-to-analytic pipeline. No self-citation chains or fitted-input-as-prediction patterns appear in the provided derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about gradient SNR behavior; no free parameters are introduced and no new entities are postulated.

axioms (1)
  • domain assumption: Cross-problem gradient signal-to-noise ratio follows a bell curve over student pass rate, collapsing at both extremes.
    Stated as an empirical observation that motivates the weighting choice.

pith-pipeline@v0.9.0 · 5607 in / 1242 out tokens · 47789 ms · 2026-05-15T13:07:09.088847+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

  2. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  3. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

    cs.LG 2026-05 unverdicted novelty 6.0

    Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

  4. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 4 Pith papers · 13 internal anchors
