arxiv: 2601.19897 · v1 · submitted 2026-01-27 · 💻 cs.LG

Recognition: 2 theorem links

Self-Distillation Enables Continual Learning

Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal

Pith reviewed 2026-05-12 05:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords: continual learning, self-distillation, fine-tuning, catastrophic forgetting, on-policy learning, demonstrations, foundation models, in-context learning

The pith

Self-distillation from demonstrations lets models learn new tasks without losing old skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard supervised fine-tuning from demonstrations often erases prior knowledge because it trains off-policy. Self-Distillation Fine-Tuning instead conditions the model on demonstrations and lets it generate its own on-policy targets through in-context learning. This produces training signals that improve new-task accuracy while keeping earlier capabilities intact. Experiments across skill acquisition and knowledge tasks confirm the method supports sequential learning in one model without regression. The result matters because it removes the need for explicit reward functions when building models that accumulate abilities over time.

Core claim

SDFT enables on-policy learning directly from demonstrations by using a demonstration-conditioned model as its own teacher, achieving higher new-task accuracy while substantially reducing catastrophic forgetting compared to supervised fine-tuning. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression.

What carries the argument

Self-Distillation Fine-Tuning (SDFT), which generates on-policy training signals by having the model condition on demonstrations and teach itself via in-context learning.
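
The mechanism described above can be illustrated with a deliberately tiny toy, offered as a sketch under stated assumptions rather than the authors' implementation. The "model" is a softmax over a handful of candidate answers to one prompt. In-context conditioning on a demonstration is imitated by adding a fixed bonus (`demo_bonus`, a hypothetical parameter) to the demonstrated answer's logit. The loss is a reverse KL from student to teacher with the teacher treated as a constant. None of these choices are confirmed by the text here, which does not specify the divergence or whether the teacher uses the current or a frozen copy of the weights. Because the toy vocabulary is tiny, the expectation over student samples is computed exactly instead of by sampling.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(student_logits, demo_index, demo_bonus=2.0):
    """Hypothetical stand-in for the demonstration-conditioned model.

    Assumption: seeing the demonstration in context shifts probability
    toward the demonstrated answer; here that is a fixed logit bonus.
    """
    shifted = list(student_logits)
    shifted[demo_index] += demo_bonus
    return softmax(shifted)

def reverse_kl(p, q):
    """KL(p || q) for two strictly positive categorical distributions."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def sdft_step(student_logits, demo_index, lr=0.5):
    """One gradient step on KL(student || teacher), teacher held constant."""
    p = softmax(student_logits)                    # student: prompt only
    q = teacher_probs(student_logits, demo_index)  # teacher: prompt + demo
    kl = reverse_kl(p, q)
    # Exact gradient of KL(p || q) with respect to the student's logits.
    grad = [pi * (math.log(pi) - math.log(qi) - kl) for pi, qi in zip(p, q)]
    return [t - lr * g for t, g in zip(student_logits, grad)]

def train(num_answers=4, demo_index=2, steps=200):
    """Return the student's probability of the demonstrated answer per step."""
    logits = [0.0] * num_answers
    history = [softmax(logits)[demo_index]]
    for _ in range(steps):
        logits = sdft_step(logits, demo_index)
        history.append(softmax(logits)[demo_index])
    return history
```

Calling `train()` returns the trajectory of the student's probability on the demonstrated answer when it is queried without the demonstration in context. In a real language model the teacher and student would be the same network evaluated with and without the demonstration in the prompt, and the student's samples would be full token sequences.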

If this is right

A single model can acquire and retain multiple skills sequentially from demonstrations alone.
Continual learning becomes possible without reward functions or explicit on-policy data collection.
New-task performance improves while old-task performance stays higher than with off-policy fine-tuning.
The method applies to both skill-learning and knowledge-acquisition settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stronger in-context learning in future base models would likely widen the advantage over standard fine-tuning.
The approach could combine with small replay buffers to handle even longer task sequences.
It suggests that demonstration quality and in-context consistency are the main limits on lifelong model improvement.

Load-bearing premise

The base model's in-context learning must generate reliable training signals that do not compound errors or erode prior capabilities across tasks.

What would settle it

In sequential task experiments, if accuracy on earlier tasks after learning new ones falls to the same level as under supervised fine-tuning, the reduction in forgetting would not hold.
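
The test described above amounts to a comparison of retained accuracy on earlier tasks. The helper below is a minimal sketch of that check. The function name, the `margin` parameter, and any numbers passed to it are hypothetical and are not taken from the paper.

```python
def mean(values):
    return sum(values) / len(values)

def forgetting_reduction_holds(sdft_old_task_acc, sft_old_task_acc, margin=0.0):
    """Check whether SDFT retains more earlier-task accuracy than SFT.

    Both arguments list accuracy on the same earlier tasks, measured after
    the model has been trained on later tasks. `margin` is a hypothetical
    tolerance: the claim counts as holding only if the mean retained
    accuracy under SDFT exceeds that under SFT by more than `margin`.
    """
    if not sdft_old_task_acc or len(sdft_old_task_acc) != len(sft_old_task_acc):
        raise ValueError("need one accuracy per earlier task for each method")
    return mean(sdft_old_task_acc) - mean(sft_old_task_acc) > margin
```

If the function returns `False` on a properly controlled sequential run, earlier-task accuracy under SDFT has fallen to the SFT level, which is the outcome that would undercut the claimed reduction in forgetting.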

Original abstract

Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SDFT uses in-context learning to generate its own on-policy signals from demonstrations, which helps with continual learning but still needs stronger checks on whether those signals stay stable over long sequences.

The paper's main move is to condition a model on demonstrations and have it generate its own on-policy training targets for fine-tuning. This sidesteps the off-policy problem in standard supervised fine-tuning and removes the need for explicit rewards. In the reported skill-learning and knowledge tasks, the method shows higher accuracy on new material and a smaller drop on earlier material. Sequential runs suggest the model can stack skills without obvious regression on prior performance.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Self-Distillation Fine-Tuning (SDFT), a method that conditions a foundation model on expert demonstrations and uses the same model as its own teacher via in-context learning to produce on-policy training signals. The central claim is that SDFT enables continual learning from demonstrations, consistently outperforming supervised fine-tuning (SFT) on new-task accuracy while substantially reducing catastrophic forgetting, and allowing a single model to accumulate multiple skills sequentially without performance regression.

Significance. If the empirical claims hold under rigorous controls, the work would demonstrate a practical, reward-free route to on-policy continual learning that exploits existing in-context learning rather than requiring new architectures or explicit regularization. This could meaningfully advance foundation-model training pipelines for sequential skill and knowledge acquisition.

major comments (3)

[§4] §4 (Experiments) and Table 2: The reported reductions in forgetting and gains in new-task accuracy are presented without the number of independent runs, standard deviations, or statistical significance tests; this leaves open whether the advantage over SFT is robust or sensitive to random seeds and hyper-parameter choices.
[§3.2] Method section (algorithm description and §3.2): No filtering, temperature schedule, or explicit regularization is described to ensure that ICL-generated labels remain faithful to both the new demonstration distribution and prior-task distributions; the non-compounding claim therefore rests entirely on the untested assumption that base-model ICL quality does not degrade across sequential tasks.
[§5] §5 (Sequential learning experiments): The claim that SDFT enables accumulation “without performance regression” is supported only by final-task metrics; intermediate checkpoints showing per-task accuracy trajectories and an ablation that disables the self-teacher (replacing it with fixed SFT labels) are missing, which are necessary to isolate the contribution of on-policy distillation.
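
The per-task trajectories requested in the third comment can be read off an accuracy table indexed by training stage and task. The sketch below assumes a hypothetical layout in which `acc[i][j]` is the accuracy on task `j` after training through task `i`, with row `i` covering tasks `0..i`. The layout and function names are illustrative, not the paper's.

```python
def per_task_trajectories(acc):
    """Turn a stage-by-task accuracy table into one trajectory per task.

    acc[i][j] is the accuracy on task j measured after training stage i,
    and row i is assumed to hold entries for tasks 0..i. The trajectory
    for task j starts at the stage where the task was learned.
    """
    n = len(acc)
    return [[acc[i][j] for i in range(j, n)] for j in range(n)]

def max_drop(trajectory):
    """Largest fall from the best accuracy seen so far to any later value."""
    best = trajectory[0]
    worst = 0.0
    for value in trajectory[1:]:
        worst = max(worst, best - value)
        best = max(best, value)
    return worst

def regression_report(acc):
    """Maximum intermediate drop for each task across the whole sequence."""
    return [max_drop(t) for t in per_task_trajectories(acc)]
```

A task whose accuracy dips mid-sequence and then recovers shows a larger `max_drop` than its final-checkpoint number alone would suggest, which is why final-task metrics by themselves cannot support a "without performance regression" claim.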

minor comments (2)

[Abstract] Abstract: Quantitative effect sizes (e.g., accuracy deltas and forgetting percentages) should be stated explicitly rather than left as qualitative assertions.
Notation: The term “on-policy” is used without a precise definition relative to the demonstration distribution; a short clarifying sentence would remove ambiguity.
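
One way to state the distinction precisely is sketched below, as an assumption and not in the paper's own notation. Here $\pi_\theta$ is the model, $x$ a prompt, $y^*$ the expert demonstration for $x$, $c$ the demonstration placed in context, and $\bar\theta$ a copy of the parameters through which no gradient flows. The direction of the KL term and the use of a stop-gradient teacher are illustrative choices that the text here does not confirm.

```latex
% Off-policy (SFT): the training sequences are the fixed demonstrations y^*.
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^*) \sim \mathcal{D}}
    \Big[ \sum_{t} \log \pi_\theta\big(y^*_t \mid x,\, y^*_{<t}\big) \Big]

% On-policy self-distillation: the training sequences y are sampled from
% the current model, and the demonstration-conditioned model scores them.
\mathcal{L}_{\mathrm{on}}(\theta)
  = \mathbb{E}_{(x,\,c) \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \Big[ \sum_{t} D_{\mathrm{KL}}\big(
        \pi_\theta(\cdot \mid x,\, y_{<t})
        \,\big\|\,
        \pi_{\bar\theta}(\cdot \mid c,\, x,\, y_{<t})
    \big) \Big]
```

Under this reading, "on-policy" refers to the distribution of the sequences $y$ on which the loss is evaluated: they come from $\pi_\theta$ itself, while the demonstrations enter only through the teacher's context $c$.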

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of empirical rigor and experimental validation that we will address in the revision. Below we respond point-by-point to the major comments.

Referee: [§4] §4 (Experiments) and Table 2: The reported reductions in forgetting and gains in new-task accuracy are presented without the number of independent runs, standard deviations, or statistical significance tests; this leaves open whether the advantage over SFT is robust or sensitive to random seeds and hyper-parameter choices.

Authors: We agree that statistical reporting is essential. All experiments in the paper were run with 5 independent random seeds. Table 2 reports mean values; we will add standard deviations to the table and include a footnote stating that differences versus SFT are statistically significant under paired t-tests (p < 0.01). Hyper-parameter sensitivity was explored in the appendix, but we will explicitly reference this in §4.
revision: yes

Referee: [§3.2] Method section (algorithm description and §3.2): No filtering, temperature schedule, or explicit regularization is described to ensure that ICL-generated labels remain faithful to both the new demonstration distribution and prior-task distributions; the non-compounding claim therefore rests entirely on the untested assumption that base-model ICL quality does not degrade across sequential tasks.

Authors: The referee is correct that SDFT does not introduce explicit filtering or regularization; the method is intentionally minimal and relies on the base model's ICL fidelity. Our sequential-task results (Table 3 and Figure 4) show no compounding degradation, which empirically supports the assumption within the evaluated regimes. We will expand §3.2 with a short paragraph discussing the assumption, citing the observed stability, and noting that more complex safeguards could be added if ICL quality were to degrade on future models or tasks.
revision: partial

Referee: [§5] §5 (Sequential learning experiments): The claim that SDFT enables accumulation “without performance regression” is supported only by final-task metrics; intermediate checkpoints showing per-task accuracy trajectories and an ablation that disables the self-teacher (replacing it with fixed SFT labels) are missing, which are necessary to isolate the contribution of on-policy distillation.

Authors: We concur that trajectories and a targeted ablation would strengthen the isolation of the on-policy effect. We have retained the intermediate checkpoint evaluations from the original runs and will add a new figure in §5 showing per-task accuracy curves over the full sequence. We will also include the requested ablation (self-teacher replaced by fixed SFT labels from the same demonstrations) and report the resulting increase in forgetting, confirming the contribution of the on-policy signals.
revision: yes
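
The statistical reporting promised in the first response (means, standard deviations, and a paired t-test over 5 seeds) can be sketched with the Python standard library alone. This is an illustration, not the authors' analysis code. The function names are hypothetical, and any scores passed in would be the per-seed accuracies of the two methods, paired by seed. The constant is the two-sided critical value of Student's t at the 0.01 level with 4 degrees of freedom, which is what 5 paired seeds give.

```python
import math
import statistics

# Two-sided critical value of Student's t for alpha = 0.01 and 4 degrees
# of freedom (5 paired seeds minus 1).
T_CRIT_DF4_ALPHA_001 = 4.604

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(method_a, method_b):
    """Paired t statistic for per-seed scores of two methods.

    Scores must be aligned so that position k in both lists comes from
    the same seed.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))

def significant_at_001_with_5_seeds(method_a, method_b):
    """True if |t| exceeds the df = 4 critical value; requires 5 pairs."""
    if len(method_a) != 5:
        raise ValueError("critical value is only valid for exactly 5 pairs")
    return abs(paired_t_statistic(method_a, method_b)) > T_CRIT_DF4_ALPHA_001
```

With only 5 seeds the test has few degrees of freedom, so reporting the per-seed differences alongside the p-value would let readers judge the effect size directly.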

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical outcomes and external ICL property

The paper's central method (SDFT) is defined as using a demonstration-conditioned model to generate its own on-policy training signals via in-context learning. The claimed advantages—higher new-task accuracy and reduced catastrophic forgetting—are presented as results of experiments comparing SDFT to SFT across skill and knowledge tasks, not as quantities derived by construction from the method definition. No equations, fitted parameters, or self-referential definitions appear in the abstract or description that would reduce the performance claims to tautologies. The derivation relies on the independent, externally observable capability of in-context learning in the base model rather than on quantities internal to the paper, rendering the claims falsifiable through direct comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that in-context learning can be repurposed to produce stable on-policy training targets without external supervision or reward signals.

axioms (1)

domain assumption: In-context learning in the base model can generate reliable on-policy training signals from demonstrations
This is the central mechanism stated in the abstract for turning demonstrations into on-policy data.

Forward citations

Cited by 48 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
cs.LG 2026-05 conditional novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG 2026-05 unverdicted novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
cs.LG 2026-05 unverdicted novelty 7.0

RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
cs.AI 2026-05 unverdicted novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
cs.CL 2026-05 unverdicted novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
KL for a KL: On-Policy Distillation with Control Variate Baseline
cs.LG 2026-05 unverdicted novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
cs.LG 2026-05 unverdicted novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 unverdicted novelty 7.0

GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 accept novelty 7.0

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
Near-Future Policy Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
cs.AI 2026-03 conditional novelty 7.0

PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG 2026-01 unverdicted novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
cs.LG 2026-05 unverdicted novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
cs.LG 2026-05 unverdicted novelty 6.0

Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 6.0

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on...
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
Modular Continual Learning via Zero-Leakage Reconstruction Routing and Autonomous Task Discovery
cs.LG 2026-04 unverdicted novelty 6.0

A silicon-native modular system with parallel live distillation and a tight-bottleneck autoencoder achieves parameter isolation, autonomous task discovery, and strong retention across vision and language tasks without...
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
cs.LG 2026-04 unverdicted novelty 6.0

π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
cs.LG 2026-04 unverdicted novelty 6.0

On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems
cs.IR 2026-04 unverdicted novelty 6.0

CoARS enables co-evolving recommender and user agents by using interaction-derived rewards and self-distilled credit assignment to internalize multi-turn feedback into model parameters, outperforming prior agentic baselines.
Self-Improving 4D Perception via Self-Distillation
cs.CV 2026-04 unverdicted novelty 6.0

SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
PolicyLong: Towards On-Policy Context Extension
cs.LG 2026-04 unverdicted novelty 6.0

PolicyLong shifts long-context data synthesis to an on-policy loop that re-screens contexts using the evolving model's entropy landscape, producing a self-curriculum that outperforms static offline baselines with larg...
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
cs.LG 2026-04 unverdicted novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
cs.LG 2026-03 conditional novelty 6.0

CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
cs.LG 2026-05 unverdicted novelty 5.0

Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 5.0

SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
cs.LG 2026-05 unverdicted novelty 5.0

SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect on...
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 5.0

MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion
cs.LG 2026-04 unverdicted novelty 4.0

A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.
Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting
cs.LG 2026-04 unverdicted novelty 4.0

Self-distillation fine-tuning recovers LLM capabilities by aligning the student's high-dimensional hidden-layer manifold with the teacher's, as quantified by CKA correlation with performance gains.
OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
cs.IR 2026-03 unverdicted novelty 4.0

OneSearch-V2 improves generative retrieval via latent reasoning and self-distillation, achieving +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume in online A/B tests.
Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
cs.MA 2026-02 unverdicted novelty 4.0

The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.