pith. sign in

hub Canonical reference

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Canonical reference. 80% of citing Pith papers cite this work as background.

20 Pith papers citing it
Background 80% of classified citations
abstract

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

hub tools

citation-role summary

background 4 other 1

citation-polarity summary

years

2026 20

polarities

background 4 unclear 1

representative citing papers

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

cs.CL · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and benchmarks.

Selective Off-Policy Reference Tuning with Plan Guidance

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

Multilingual Safety Alignment via Self-Distillation

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

On-Policy Distillation with Best-of-N Teacher Rollout Selection

cs.CV · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

citing papers explorer

Showing 20 of 20 citing papers.