arxiv: 2603.24472 · v3 · pith:WFIONFVJnew · submitted 2026-03-25 · 💻 cs.CL · cs.LG

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Pith reviewed 2026-05-21 09:32 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords self-distillation · LLM reasoning · epistemic verbalization · uncertainty expression · mathematical reasoning · out-of-domain performance · response length

The pith

Self-distillation can suppress uncertainty expression and, when it does, degrade LLM reasoning on new problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper traces how self-distillation sometimes worsens mathematical reasoning in large language models. The core issue is that distillation reduces the model's habit of voicing uncertainty while working through a problem. Experiments that vary how much extra context the teacher receives and how many tasks training covers show that rich teacher context produces quick gains on familiar problems but clear losses on unfamiliar ones. Readers should care because shorter, more confident-looking answers can mask a real drop in the model's ability to adjust when facing unseen cases, with performance drops of up to 40 percent observed across the four tested models.

Core claim

Self-distillation degrades the reasoning capability of LLMs by suppressing epistemic verbalization, the model's expression of uncertainty during reasoning. Conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly.

What carries the argument

Epistemic verbalization, the model's expression of uncertainty while reasoning, whose suppression by rich teacher conditioning produces faster in-domain gains at the cost of out-of-domain generalization.

If this is right

  • Conditioning the teacher on rich information enables rapid in-domain optimization even with limited task coverage.
  • Out-of-domain performance declines because models lose the capacity to express uncertainty and adjust to unseen problems.
  • Performance drops of up to 40 percent appear across tested models including Qwen3-1.7B, Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct.
  • Robust reasoning requires optimizing behavior beyond merely reinforcing correct answer traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other post-training methods that favor short, certain outputs may produce similar losses in generalization.
  • Adding explicit rewards for uncertainty phrases during training could help retain out-of-domain capability.
  • A simple diagnostic could track the frequency of hedging language on held-out problems to detect this form of degradation early.
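
The diagnostic in the last bullet could be sketched roughly as follows. This is a minimal illustration, not the paper's method: the marker list `HEDGE_MARKERS` and the function names `hedge_rate` and `mean_hedge_rate` are hypothetical, and the paper's own annotation scheme for epistemic verbalization is not reproduced here.

```python
import re

# Hypothetical marker list, chosen for illustration only.
HEDGE_MARKERS = (
    "wait", "hmm", "maybe", "perhaps", "not sure",
    "let me check", "let me double-check", "alternatively",
)


def hedge_rate(trace: str, markers=HEDGE_MARKERS) -> float:
    """Return hedging-marker occurrences per 100 whitespace-delimited words."""
    words = trace.split()
    if not words:
        return 0.0
    lowered = trace.lower()
    # Count whole-phrase matches, case-insensitively.
    hits = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", lowered))
        for marker in markers
    )
    return 100.0 * hits / len(words)


def mean_hedge_rate(traces) -> float:
    """Average the per-trace hedge rate over a set of held-out traces."""
    rates = [hedge_rate(trace) for trace in traces]
    return sum(rates) / len(rates) if rates else 0.0
```

Comparing `mean_hedge_rate` on the same held-out problems before and after training would flag a sharp fall in hedging. Phrase matching is a crude proxy, so a fall in the rate is a prompt for closer inspection rather than proof of degraded reasoning.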

Load-bearing premise

Varying conditioning context richness and task coverage in the experiments isolates the causal effect of uncertainty suppression without other confounding changes in reasoning behavior or data distribution.

What would settle it

An experiment that measures uncertainty phrases in reasoning traces before and after self-distillation and checks whether their reduction directly predicts the size of out-of-domain accuracy drops would test the claim.
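
A minimal sketch of the analysis step in such an experiment, assuming per-run measurements are already available. The function names `pearson` and `reduction_vs_drop` and the input layout (one value per model or training run) are assumptions made for illustration; the paper does not specify this procedure.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def reduction_vs_drop(rate_before, rate_after, acc_before, acc_after) -> float:
    """Correlate the fall in uncertainty-marker rate with the fall in OOD accuracy.

    Each argument holds one value per run (hypothetical layout).
    """
    marker_reduction = [b - a for b, a in zip(rate_before, rate_after)]
    accuracy_drop = [b - a for b, a in zip(acc_before, acc_after)]
    return pearson(marker_reduction, accuracy_drop)
```

A strong positive coefficient would be consistent with the claim but would not by itself establish causation, since response length and other trace properties may change alongside the markers.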

Original abstract

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates cases where self-distillation of LLMs on mathematical reasoning tasks reduces response length but degrades performance. It attributes the degradation to suppression of epistemic verbalization (expression of uncertainty during reasoning) induced by conditioning the teacher on rich context. Controlled experiments varying context richness and task coverage are used to show that this enables rapid in-domain gains with limited coverage but harms OOD generalization, with drops up to 40% observed across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct. The work concludes that appropriate uncertainty expression is important for robust reasoning.

Significance. If the causal attribution to uncertainty suppression holds after addressing controls, the result would usefully highlight a trade-off in self-distillation: reinforcing correct traces can inadvertently reduce the model's ability to express and act on uncertainty, which is beneficial for OOD problems. The multi-model evaluation and focus on OOD effects are positive features; the absence of parameter-free derivations or machine-checked proofs is consistent with the experimental nature of the work.

major comments (2)
  1. [Abstract / Experiments] Abstract and experimental description: performance drops of up to 40% are reported without statistical significance tests, confidence intervals, exact data splits, or explicit controls separating response length changes from content/uncertainty changes. This leaves the magnitude and reliability of the central degradation claim under-supported.
  2. [Controlled experiments] Controlled experiments section: varying conditioning context richness and task coverage simultaneously alters reasoning depth, presence of explicit justifications, answer correctness rates, and output distribution. Without an ablation that holds length, step accuracy, and problem distribution fixed while only modulating uncertainty markers, the performance drop cannot be attributed specifically to epistemic verbalization suppression rather than these confounders.
minor comments (2)
  1. [Introduction] The introduction would benefit from a concise operational definition or example of 'epistemic verbalization' to distinguish it from general uncertainty or hedging language.
  2. [Results] Figure or table captions should explicitly state the number of runs or seeds used for each reported metric to aid reproducibility.
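
One cheap way to probe the length confound raised in major comment 2 is to stratify outputs by length and compare accuracy with and without uncertainty markers inside each stratum. The sketch below is illustrative only: the record layout `(num_tokens, has_marker, is_correct)`, the bin width, and the function names are assumptions, and observational stratification of this kind is weaker than the controlled ablation the comment asks for.

```python
from collections import defaultdict


def accuracy_by_length_bin(records, bin_width=100):
    """Map (length_bin, has_marker) to accuracy.

    records: iterable of (num_tokens, has_marker, is_correct) tuples
    (hypothetical layout).
    """
    cells = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for num_tokens, has_marker, is_correct in records:
        key = (num_tokens // bin_width, bool(has_marker))
        cells[key][0] += int(is_correct)
        cells[key][1] += 1
    return {key: correct / total for key, (correct, total) in cells.items()}


def within_bin_gaps(records, bin_width=100):
    """Accuracy(with markers) minus accuracy(without), per length bin.

    Bins that lack either group are skipped, since no comparison is possible.
    """
    acc = accuracy_by_length_bin(records, bin_width)
    bins = sorted({length_bin for length_bin, _ in acc})
    return {
        b: acc[(b, True)] - acc[(b, False)]
        for b in bins
        if (b, True) in acc and (b, False) in acc
    }
```

If the gap stays positive within bins of similar length, length alone is a less likely explanation, although step accuracy and problem distribution remain uncontrolled.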

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline revisions to improve statistical reporting and experimental isolation.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and experimental description: performance drops of up to 40% are reported without statistical significance tests, confidence intervals, exact data splits, or explicit controls separating response length changes from content/uncertainty changes. This leaves the magnitude and reliability of the central degradation claim under-supported.

    Authors: We agree that statistical tests and confidence intervals would strengthen the claims. In the revision we will add bootstrap 95% confidence intervals and paired significance tests (t-test or Wilcoxon) for all reported performance differences. Exact data splits follow the canonical GSM8K and MATH test partitions; we will state this explicitly in Section 3 and the abstract. For length versus content separation, the existing experiments already include length-matched subsets and manual annotation of epistemic markers; we will expand the description of these controls and add a supplementary table showing performance at fixed response lengths. revision: yes

  2. Referee: [Controlled experiments] Controlled experiments section: varying conditioning context richness and task coverage simultaneously alters reasoning depth, presence of explicit justifications, answer correctness rates, and output distribution. Without an ablation that holds length, step accuracy, and problem distribution fixed while only modulating uncertainty markers, the performance drop cannot be attributed specifically to epistemic verbalization suppression rather than these confounders.

    Authors: We partially agree that simultaneous variation introduces potential confounders. Our current design already holds task coverage fixed in several conditions and reports that drops track reduced epistemic markers even among length-similar outputs. Nevertheless, we will add a new ablation that fixes problem distribution, enforces step-verified correctness, and matches response length via constrained decoding, varying only the presence of uncertainty phrases through targeted prompting. This will be included as an additional figure and table in the revised manuscript. revision: partial
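
The bootstrap 95% confidence intervals promised in the first response could be computed along the following lines. This is a sketch under stated assumptions, not the authors' implementation: the function name `paired_bootstrap_ci`, its parameters, and the per-problem 0/1 correctness inputs are all hypothetical.

```python
import random


def paired_bootstrap_ci(base_correct, distilled_correct,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the paired accuracy difference.

    Inputs are per-problem correctness flags (0/1 or bool) for the same
    problems in the same order. Returns (observed_diff, lower, upper),
    where the difference is distilled accuracy minus base accuracy.
    """
    if len(base_correct) != len(distilled_correct) or not base_correct:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    n = len(base_correct)
    # Per-problem difference keeps the pairing between the two models.
    diffs = [int(d) - int(b) for b, d in zip(base_correct, distilled_correct)]
    observed = sum(diffs) / n
    stats = []
    for _ in range(n_resamples):
        # Resample problems with replacement.
        sample_total = sum(diffs[rng.randrange(n)] for _ in range(n))
        stats.append(sample_total / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

Resampling problems rather than individual model outputs preserves the pairing, which is what makes the interval comparable to the paired significance tests mentioned in the same response. An interval that excludes zero would support a reported drop being more than sampling noise.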

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental observations rather than self-referential derivation

Full rationale

The paper presents an empirical investigation into self-distillation effects on LLM reasoning, tracing performance degradation to suppressed epistemic verbalization via controlled experiments that vary conditioning context richness and task coverage. No mathematical derivation chain, equations, or fitted parameters are described that reduce to inputs by construction. The central claim relies on observed correlations between context richness, uncertainty markers in outputs, and OOD performance drops across multiple models, without self-definitional loops or load-bearing self-citations that would force the result. This is a standard experimental setup where interpretations are grounded in data rather than tautological renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the empirical observation that uncertainty expression improves OOD generalization; no explicit free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: epistemic verbalization (expression of uncertainty) is a measurable and causally relevant component of reasoning traces that can be suppressed by conditioning context.
    Invoked when attributing performance changes to suppression of uncertainty rather than other trace properties.

pith-pipeline@v0.9.0 · 5719 in / 1292 out tokens · 30588 ms · 2026-05-21T09:32:06.727928+00:00 · methodology



Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning from Language Feedback via Variational Policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...

  2. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

  3. Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    cs.LG 2026-05 unverdicted novelty 7.0

    MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

  4. Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.

  5. TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...

  6. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  7. Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

    cs.LG 2026-05 unverdicted novelty 7.0

    PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

  8. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  9. How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

    cs.CL 2026-03 conditional novelty 7.0

    TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.

  10. Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

  11. MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

    cs.CL 2026-05 unverdicted novelty 6.0

    MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and b...

  12. MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

    cs.CL 2026-05 unverdicted novelty 6.0

    MixSD achieves superior memorization-retention trade-off in knowledge injection by using mixed self-generated supervision from the base model's conditionals, retaining up to 100% held-out capability versus 1% for stan...

  13. Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

    cs.CL 2026-05 unverdicted novelty 6.0

    CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.

  14. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  15. Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

    cs.LG 2026-05 unverdicted novelty 6.0

    Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

  16. Selective Off-Policy Reference Tuning with Plan Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

  17. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  18. AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

    cs.CL 2026-04 unverdicted novelty 6.0

    AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.

  19. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    cs.LG 2026-04 unverdicted novelty 6.0

    On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

  20. Selective Off-Policy Reference Tuning with Plan Guidance

    cs.AI 2026-05 unverdicted novelty 5.0

    SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.

  21. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

  22. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.

  23. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.

  24. A Brief Overview: On-Policy Self-Distillation In Large Language Models

    cs.HC 2026-05 unverdicted novelty 2.0

    OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...

  25. A Brief Overview: On-Policy Self-Distillation In Large Language Models

    cs.HC 2026-05 unverdicted novelty 2.0

    This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.