{"work":{"id":"e9aa25e3-870c-46c8-8270-e4e5948d09f0","openalex_id":null,"doi":null,"arxiv_id":"2601.19897","raw_key":null,"title":"Self-Distillation Enables Continual Learning","authors":null,"authors_text":"Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal","year":2026,"venue":"cs.LG","abstract":"Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. 
In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.","external_url":"https://arxiv.org/abs/2601.19897","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:10:23.891619+00:00","pith_arxiv_id":"2601.19897","created_at":"2026-05-09T05:45:22.368661+00:00","updated_at":"2026-05-25T06:10:23.891619+00:00","title_quality_ok":true,"display_title":"Self-Distillation Enables Continual Learning","render_title":"Self-Distillation Enables Continual Learning"},"hub":{"state":{"work_id":"e9aa25e3-870c-46c8-8270-e4e5948d09f0","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":53,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2026-01-26T17:56:50+00:00","last_pith_cited_at":"2026-05-21T14:00:57+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-26T05:56:17.843291+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":18},{"context_role":"baseline","n":2}],"polarity_counts":[{"context_polarity":"background","n":16},{"context_polarity":"baseline","n":2},{"context_polarity":"support","n":1},{"context_polarity":"unclear","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:09:53.021640+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Reinforcement Learning via Self-Distillation","work_id":"b193541d-5853-4ea4-8e4b-8e4c08617eb6","shared_citers":23},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open 
Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":21},{"title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","work_id":"bae00e84-9b0d-433d-a066-20b951f0b4d0","shared_citers":21},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":16},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":14},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":13},{"title":"Self-Distilled RLVR","work_id":"935a34f3-b83d-4214-b6a0-ae2395b3d107","shared_citers":13},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":12},{"title":"On-Policy Context Distillation for Language Models","work_id":"b56a7e15-d864-43f4-9212-59bc7ec70d21","shared_citers":8},{"title":"Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?","work_id":"8df6a2d1-d890-48ae-af85-c11643a91097","shared_citers":8},{"title":"MiMo-V2-Flash Technical Report","work_id":"1f3df90c-4bc3-49b1-ad9b-7f3b34e4ffba","shared_citers":7},{"title":"OpenThoughts: Data Recipes for Reasoning Models","work_id":"c7acbe41-27a0-4773-a7be-8f08d86cdf21","shared_citers":7},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":7},{"title":"A Survey of On-Policy Distillation for Large Language Models","work_id":"f6aaea8e-1f0d-43e3-b28f-6066d3e0a66b","shared_citers":6},{"title":"CRISP: Compressed Reasoning via Iterative Self-Policy Distillation","work_id":"c5a99022-15b6-4d77-9850-23036df7a073","shared_citers":6},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":6},{"title":"On-policy 
distillation","work_id":"bb76b11f-d59b-421e-88c6-fa0920ed09c3","shared_citers":6},{"title":"Privileged information distillation for language models","work_id":"674b7199-1d6e-4f36-89f1-fe1abe5b4db1","shared_citers":6},{"title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","work_id":"42b43df0-4c82-493f-9d9b-1be8c116d9af","shared_citers":6},{"title":"Unifying group-relative and self-distillation policy optimization via sample routing","work_id":"6dea6cb7-a4e5-478a-ab3c-eeeb335abd51","shared_citers":6},{"title":"arXiv preprint arXiv:2602.12125 , year=","work_id":"bb968107-1f43-4bf4-aa52-cc58000a6e89","shared_citers":5},{"title":"Entropy-aware on-policy distillation of language models","work_id":"7dccbe12-e2aa-48d8-9b76-5521ccf02668","shared_citers":5},{"title":"Expanding the capabilities of reinforcement learning via text feedback","work_id":"c9552e44-1e80-45df-b599-f8554214e923","shared_citers":5},{"title":"Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, et al","work_id":"5e961f0b-b20e-4580-965d-15fb63ec8965","shared_citers":5}],"time_series":[{"n":33,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:09:53.058945+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:10:27.804895+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Self-Distillation Enables Continual Learning","claims":[{"claim_text":"Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a 
fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Self-Distillation Enables Continual Learning because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:09:42.068880+00:00"}},"summary":{"title":"Self-Distillation Enables Continual Learning","claims":[{"claim_text":"Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. 
SDFT leverages in-context learning ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Self-Distillation Enables Continual Learning because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Reinforcement Learning via Self-Distillation","work_id":"b193541d-5853-4ea4-8e4b-8e4c08617eb6","shared_citers":23},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":21},{"title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","work_id":"bae00e84-9b0d-433d-a066-20b951f0b4d0","shared_citers":21},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":16},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":14},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":13},{"title":"Self-Distilled RLVR","work_id":"935a34f3-b83d-4214-b6a0-ae2395b3d107","shared_citers":13},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":12},{"title":"On-Policy Context Distillation for Language Models","work_id":"b56a7e15-d864-43f4-9212-59bc7ec70d21","shared_citers":8},{"title":"Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?","work_id":"8df6a2d1-d890-48ae-af85-c11643a91097","shared_citers":8},{"title":"MiMo-V2-Flash Technical Report","work_id":"1f3df90c-4bc3-49b1-ad9b-7f3b34e4ffba","shared_citers":7},{"title":"OpenThoughts: Data Recipes for Reasoning Models","work_id":"c7acbe41-27a0-4773-a7be-8f08d86cdf21","shared_citers":7},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":7},{"title":"A 
Survey of On-Policy Distillation for Large Language Models","work_id":"f6aaea8e-1f0d-43e3-b28f-6066d3e0a66b","shared_citers":6},{"title":"CRISP: Compressed Reasoning via Iterative Self-Policy Distillation","work_id":"c5a99022-15b6-4d77-9850-23036df7a073","shared_citers":6},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":6},{"title":"On-policy distillation","work_id":"bb76b11f-d59b-421e-88c6-fa0920ed09c3","shared_citers":6},{"title":"Privileged information distillation for language models","work_id":"674b7199-1d6e-4f36-89f1-fe1abe5b4db1","shared_citers":6},{"title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","work_id":"42b43df0-4c82-493f-9d9b-1be8c116d9af","shared_citers":6},{"title":"Unifying group-relative and self-distillation policy optimization via sample routing","work_id":"6dea6cb7-a4e5-478a-ab3c-eeeb335abd51","shared_citers":6},{"title":"arXiv preprint arXiv:2602.12125 , year=","work_id":"bb968107-1f43-4bf4-aa52-cc58000a6e89","shared_citers":5},{"title":"Entropy-aware on-policy distillation of language models","work_id":"7dccbe12-e2aa-48d8-9b76-5521ccf02668","shared_citers":5},{"title":"Expanding the capabilities of reinforcement learning via text feedback","work_id":"c9552e44-1e80-45df-b599-f8554214e923","shared_citers":5},{"title":"Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, et al","work_id":"5e961f0b-b20e-4580-965d-15fb63ec8965","shared_citers":5}],"time_series":[{"n":33,"year":2026}],"dependency_candidates":[]},"authors":[]}}