Embarrassingly Simple Self-Distillation Improves Code Generation

Huangjie Zheng; Navdeep Jaitly; Richard He Bai; Ronan Collobert; Ruixiang Zhang; Yizhe Zhang

arxiv: 2604.01193 · v2 · pith:O77O4AGRnew · submitted 2026-04-01 · 💻 cs.CL

Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang

classification 💻 cs.CL

keywords code · generation · model · simple · gains · improves · matters · self-distillation

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation. Our code is available at https://github.com/apple/ml-ssd
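The two steps named in the abstract (sample with a temperature and truncation setting, then fine-tune on the raw samples) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the top-k/top-p form of truncation, the toy logits, and all parameter values are assumptions chosen for illustration, and the fine-tuning step itself is left out.

```python
import math
import random
from typing import Callable, Dict, List, Tuple


def truncated_probs(logits: Dict[str, float], temperature: float,
                    top_k: int, top_p: float) -> Dict[str, float]:
    """Temperature-scale the logits, keep the top-k / top-p head, renormalize."""
    peak = max(logits.values())
    weights = {tok: math.exp((val - peak) / temperature)
               for tok, val in logits.items()}
    norm = sum(weights.values())
    ranked = sorted(((tok, w / norm) for tok, w in weights.items()),
                    key=lambda item: -item[1])
    kept: List[Tuple[str, float]] = []
    mass = 0.0
    for tok, p in ranked[:top_k]:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:  # stop once the nucleus covers top_p of the mass
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def sample_token(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token from the truncated distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


def build_ssd_dataset(prompts: List[str], generate: Callable[[str], str],
                      samples_per_prompt: int) -> List[Tuple[str, str]]:
    """Collect raw (prompt, completion) pairs with no verifier or filtering."""
    return [(p, generate(p)) for p in prompts for _ in range(samples_per_prompt)]


# Toy stand-in for a model: a fixed next-token distribution (hypothetical values).
toy_logits = {"return": 3.0, "yield": 1.0, "print": 0.5, "pass": -1.0}
probs = truncated_probs(toy_logits, temperature=1.2, top_k=3, top_p=0.95)

rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    # A real setup would call the LLM here with the same decoding settings.
    return " ".join(sample_token(probs, rng) for _ in range(3))


dataset = build_ssd_dataset(["def f(x):", "def g(y):"], toy_generate,
                            samples_per_prompt=2)
# The collected pairs would then be fed to a standard supervised
# fine-tuning loop on the same model (not shown here).
```

In this toy setting the lowest-probability token (`pass`) is cut from the sampling distribution, so it never appears in the collected completions. The abstract's claim about context-dependent reshaping concerns the fine-tuned model, which this sketch does not attempt to reproduce.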

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback
cs.LG 2026-06 unverdicted novelty 8.0
Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs
cs.CL 2026-06 unverdicted novelty 7.0
UnpredictaBench creates 448 distributional sampling tasks and the KS@N metric to measure LLM approximation of target distributions, finding no model exceeds 40% success at N=100.

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
cs.LG 2026-06 unverdicted novelty 7.0
Three problem-level trajectory features derived from the distributional signature of failed LLM rollouts enable failure clustering at 84.3% accuracy and a training-free routing rule that improves rescue by 12.2% on ha...

Self-Policy Distillation via Capability-Selective Subspace Projection
cs.CL 2026-05 unverdicted novelty 7.0
Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines...

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
cs.AI 2026-05 unverdicted novelty 7.0
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
cs.LG 2026-05 unverdicted novelty 7.0
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
cs.LG 2026-06 unverdicted novelty 6.0
Language models can use a two-stage sleep process of upward distillation for memory consolidation and RL-based dreaming for unsupervised self-improvement to enable continual learning.

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution
cs.SE 2026-05 unverdicted novelty 6.0
BenchEvolver evolves coding problem solutions to generate harder, valid tasks, producing LiveCodeBench-Plus where frontier models score 27.5-62.6% and enabling RL gains on held-out tests.

Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses
cs.CL 2026-05 unverdicted novelty 6.0
SFR applies conditional flow matching on future sentence embeddings as a training regularizer to increase output diversity in style-conditioned LLMs without deployment overhead.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0
UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.

Hypothesis generation and updating in large language models
cs.LG 2026-05 unverdicted novelty 6.0
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

Iterative Finetuning is Mostly Idempotent
cs.AI 2026-05 unverdicted novelty 6.0
Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.

Self-Improving 4D Perception via Self-Distillation
cs.CV 2026-04 unverdicted novelty 6.0
SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...

Retrieval, Reward, and Training Protocols: What Matters in Training Search Agents?
cs.CL 2026-05 unverdicted novelty 5.0
Controlled empirical study shows correcting Wikipedia data coverage yields larger gains than algorithm differences in LLM search agent training, with outcome-based rewards competitive.

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
cs.AI 2026-05 unverdicted novelty 5.0
AMR-SD adds a reflection bottleneck to compress diagnostic signals into self-generated hints and uses asymmetric Causal Information Gain to create sparse token-level advantage signals, outperforming baselines and prev...

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0
UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.

Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
cs.LG 2026-04 unverdicted novelty 5.0
Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation
cs.CL 2026-05 unverdicted novelty 4.0
DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0
This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0
OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...