Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Pith reviewed 2026-05-21 09:32 UTC · model grok-4.3
The pith
Self-distillation suppresses uncertainty expression and degrades LLM reasoning on new problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-distillation degrades the reasoning capability of LLMs by suppressing epistemic verbalization, the model's expression of uncertainty during reasoning. Conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly.
What carries the argument
Epistemic verbalization, the model's expression of uncertainty while reasoning, whose suppression by rich teacher conditioning produces faster in-domain gains at the cost of out-of-domain generalization.
If this is right
- Conditioning the teacher on rich information enables rapid in-domain optimization even with limited task coverage.
- Out-of-domain performance declines because models lose the capacity to express uncertainty and adjust to unseen problems.
- Performance drops of up to 40 percent appear across tested models including Qwen3-1.7B, Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct.
- Robust reasoning requires optimizing behavior beyond merely reinforcing correct answer traces.
Where Pith is reading between the lines
- Other post-training methods that favor short, certain outputs may produce similar losses in generalization.
- Adding explicit rewards for uncertainty phrases during training could help retain out-of-domain capability.
- A simple diagnostic could track the frequency of hedging language on held-out problems to detect this form of degradation early.
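The diagnostic in the last bullet can be sketched in a few lines. This is a minimal illustration, not the paper's annotation scheme: the marker list `HEDGE_PATTERNS` and the function names are hypothetical, and a real study would need a validated set of epistemic markers.

```python
import re
from typing import Sequence

# Hypothetical marker list: these phrases are illustrative assumptions,
# not the markers annotated in the paper.
HEDGE_PATTERNS = [
    r"\bwait\b",
    r"\bhmm+\b",
    r"\bmaybe\b",
    r"\bperhaps\b",
    r"\bnot sure\b",
    r"\blet me (?:re)?check\b",
    r"\balternatively\b",
]
HEDGE_RE = re.compile("|".join(HEDGE_PATTERNS), flags=re.IGNORECASE)


def hedges_per_100_words(trace: str) -> float:
    """Return the number of hedging-marker matches per 100 words of a trace."""
    n_words = len(trace.split())
    if n_words == 0:
        return 0.0
    return 100.0 * len(HEDGE_RE.findall(trace)) / n_words


def mean_hedge_rate(traces: Sequence[str]) -> float:
    """Average the per-trace hedge rate over a set of held-out traces."""
    if len(traces) == 0:
        return 0.0
    return sum(hedges_per_100_words(t) for t in traces) / len(traces)
```

Comparing `mean_hedge_rate` on the same held-out problems before and after self-distillation would give a rough early signal. Normalizing per 100 words keeps the measure from simply tracking the shorter responses that self-distillation tends to produce.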
Load-bearing premise
Varying conditioning context richness and task coverage in the experiments isolates the causal effect of uncertainty suppression without other confounding changes in reasoning behavior or data distribution.
What would settle it
An experiment that measures uncertainty phrases in reasoning traces before and after self-distillation and checks whether their reduction directly predicts the size of out-of-domain accuracy drops would test the claim.
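One way to read "directly predicts" is as a correlation across (model, condition) pairs between the reduction in uncertainty markers and the drop in out-of-domain accuracy. The sketch below assumes that reading; the function name and the choice of Pearson correlation are illustrative assumptions, and no data from the paper is used.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences.

    Intended use (an assumption of this sketch): xs holds the reduction in
    uncertainty-marker rate per (model, condition) pair, ys holds the
    matching drop in out-of-domain accuracy.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two aligned sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A strong positive value would be consistent with the claim but would not rule out confounders such as response length, so this check complements a controlled ablation and does not replace it.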
Original abstract
Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates cases where self-distillation of LLMs on mathematical reasoning tasks reduces response length but degrades performance. It attributes the degradation to suppression of epistemic verbalization (expression of uncertainty during reasoning) induced by conditioning the teacher on rich context. Controlled experiments varying context richness and task coverage are used to show that this enables rapid in-domain gains with limited coverage but harms OOD generalization, with drops up to 40% observed across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct. The work concludes that appropriate uncertainty expression is important for robust reasoning.
Significance. If the causal attribution to uncertainty suppression holds after addressing controls, the result would usefully highlight a trade-off in self-distillation: reinforcing correct traces can inadvertently reduce the model's ability to express and act on uncertainty, which is beneficial for OOD problems. The multi-model evaluation and focus on OOD effects are positive features; the absence of parameter-free derivations or machine-checked proofs is consistent with the experimental nature of the work.
Major comments (2)
- [Abstract / Experiments] Abstract and experimental description: performance drops of up to 40% are reported without statistical significance tests, confidence intervals, exact data splits, or explicit controls separating response length changes from content/uncertainty changes. This leaves the magnitude and reliability of the central degradation claim under-supported.
- [Controlled experiments] Controlled experiments section: varying conditioning context richness and task coverage simultaneously alters reasoning depth, presence of explicit justifications, answer correctness rates, and output distribution. Without an ablation that holds length, step accuracy, and problem distribution fixed while only modulating uncertainty markers, the performance drop cannot be attributed specifically to epistemic verbalization suppression rather than these confounders.
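The length control requested in the first major comment could be approximated by comparing accuracy within bins of similar response length. The following is a minimal sketch under stated assumptions: the record layout `(length_in_tokens, is_correct)`, the function name, and the default bin width are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_length_bin(
    records: Iterable[Tuple[int, bool]], bin_width: int = 200
) -> Dict[int, float]:
    """Group (length_in_tokens, is_correct) records into length bins.

    Returns accuracy per bin, keyed by the lower edge of each bin.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for length, correct in records:
        bin_start = (length // bin_width) * bin_width
        totals[bin_start] += 1
        hits[bin_start] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Running this separately for the base and the self-distilled model and comparing matching bins would show whether accuracy still drops at a fixed length. It does not hold step accuracy or problem distribution fixed, so it addresses only part of the second major comment.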
Minor comments (2)
- [Introduction] The introduction would benefit from a concise operational definition or example of 'epistemic verbalization' to distinguish it from general uncertainty or hedging language.
- [Results] Figure or table captions should explicitly state the number of runs or seeds used for each reported metric to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and outline revisions to improve statistical reporting and experimental isolation.
Point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and experimental description: performance drops of up to 40% are reported without statistical significance tests, confidence intervals, exact data splits, or explicit controls separating response length changes from content/uncertainty changes. This leaves the magnitude and reliability of the central degradation claim under-supported.
Authors: We agree that statistical tests and confidence intervals would strengthen the claims. In the revision we will add bootstrap 95% confidence intervals and paired significance tests (paired t-test or Wilcoxon signed-rank test) for all reported performance differences. Exact data splits follow the canonical GSM8K and MATH test partitions; we will state this explicitly in Section 3 and the abstract. For length versus content separation, the existing experiments already include length-matched subsets and manual annotation of epistemic markers; we will expand the description of these controls and add a supplementary table showing performance at fixed response lengths.
Revision: yes
-
Referee: [Controlled experiments] Controlled experiments section: varying conditioning context richness and task coverage simultaneously alters reasoning depth, presence of explicit justifications, answer correctness rates, and output distribution. Without an ablation that holds length, step accuracy, and problem distribution fixed while only modulating uncertainty markers, the performance drop cannot be attributed specifically to epistemic verbalization suppression rather than these confounders.
Authors: We partially agree that simultaneous variation introduces potential confounders. Our current design already holds task coverage fixed in several conditions and reports that drops track reduced epistemic markers even among length-similar outputs. Nevertheless, we will add a new ablation that fixes problem distribution, enforces step-verified correctness, and matches response length via constrained decoding, varying only the presence of uncertainty phrases through targeted prompting. This will be included as an additional figure and table in the revised manuscript.
Revision: partial
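The bootstrap confidence intervals promised in the first response could look roughly like the sketch below. It is a percentile bootstrap over per-problem paired differences, chosen here for illustration; the function name, defaults, and the 0/1 correctness encoding are assumptions and not the authors' procedure.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    base_correct: Sequence[int],
    distilled_correct: Sequence[int],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed_diff, ci_low, ci_high) for
    accuracy(distilled) - accuracy(base).

    Both inputs hold 0/1 correctness for the same problems in the same order,
    so resampling problems keeps the pairing intact.
    """
    n = len(base_correct)
    if n == 0 or n != len(distilled_correct):
        raise ValueError("inputs must be non-empty and aligned per problem")
    diffs = [d - b for b, d in zip(base_correct, distilled_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means: List[float] = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample problems with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return observed, means[lo_idx], means[hi_idx]
```

An interval that excludes zero would indicate that the accuracy difference on that benchmark is unlikely to be resampling noise. It says nothing about variation across training seeds, which the minor comment on reporting the number of runs addresses separately.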
Circularity Check
No circularity: empirical claims rest on experimental observations rather than self-referential derivation
Full rationale
The paper presents an empirical investigation into self-distillation effects on LLM reasoning, tracing performance degradation to suppressed epistemic verbalization via controlled experiments that vary conditioning context richness and task coverage. No mathematical derivation chain, equations, or fitted parameters are described that reduce to inputs by construction. The central claim relies on observed correlations between context richness, uncertainty markers in outputs, and OOD performance drops across multiple models, without self-definitional loops or load-bearing self-citations that would force the result. This is a standard experimental setup where interpretations are grounded in data rather than tautological renaming or ansatz smuggling.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Epistemic verbalization (expression of uncertainty) is a measurable and causally relevant component of reasoning traces that can be suppressed by conditioning context.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We trace this degradation to the suppression of epistemic verbalization... conditioning the teacher on rich information suppresses uncertainty expression
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
varying conditioning context richness and task coverage... performance drops of up to 40%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
-
Learning from Language Feedback via Variational Policy Distillation
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...
-
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
-
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
-
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
-
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and b...
-
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
MixSD achieves superior memorization-retention trade-off in knowledge injection by using mixed self-generated supervision from the base model's conditionals, retaining up to 100% held-out capability versus 1% for stan...
-
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
-
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
-
Multilingual Safety Alignment via Self-Distillation
MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
-
A Brief Overview: On-Policy Self-Distillation In Large Language Models
OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...
-
A Brief Overview: On-Policy Self-Distillation In Large Language Models
This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.