Embarrassingly Simple Self-Distillation Improves Code Generation

Huangjie Zheng; Navdeep Jaitly; Richard He Bai; Ronan Collobert; Ruixiang Zhang; Yizhe Zhang

arxiv: 2604.01193 · v2 · pith:O77O4AGRnew · submitted 2026-04-01 · 💻 cs.CL

Embarrassingly Simple Self-Distillation Improves Code Generation

Ruixiang Zhang , Richard He Bai , Huangjie Zheng , Navdeep Jaitly , Ronan Collobert , Yizhe Zhang This is my paper

classification 💻 cs.CL

keywords codegenerationmodelsimplegainsimprovesmattersself-distillation

0 comments

read the original abstract

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation. Our code is available at https://github.com/apple/ml-ssd

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Policy Distillation via Capability-Selective Subspace Projection
cs.CL 2026-05 unverdicted novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines...
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
cs.LG 2026-05 unverdicted novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.
Hypothesis generation and updating in large language models
cs.LG 2026-05 unverdicted novelty 6.0

LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
Iterative Finetuning is Mostly Idempotent
cs.AI 2026-05 unverdicted novelty 6.0

Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.
Self-Improving 4D Perception via Self-Distillation
cs.CV 2026-04 unverdicted novelty 6.0

SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
cs.AI 2026-05 unverdicted novelty 5.0

AMR-SD adds a reflection bottleneck to compress diagnostic signals into self-generated hints and uses asymmetric Causal Information Gain to create sparse token-level advantage signals, outperforming baselines and prev...
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.
Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
cs.LG 2026-04 unverdicted novelty 5.0

Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.
A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.
A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...