{"work":{"id":"bae00e84-9b0d-433d-a066-20b951f0b4d0","openalex_id":null,"doi":null,"arxiv_id":"2601.18734","raw_key":null,"title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","authors":null,"authors_text":"Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen","year":2026,"venue":"cs.LG","abstract":"Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self, we introduce On-Policy Self-Distillation (OPSD), a learning algorithm where a single LLM acts as both teacher and student with different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving superior token efficiency compared to reinforcement learning methods and better performance over off-policy distillation methods. Code repo: https://github.com/siyan-zhao/OPSD.","external_url":"https://arxiv.org/abs/2601.18734","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T13:13:27.200382+00:00","pith_arxiv_id":"2601.18734","created_at":"2026-05-09T05:45:22.373179+00:00","updated_at":"2026-06-29T13:13:27.200382+00:00","title_quality_ok":true,"display_title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","render_title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models"},"hub":{"state":{"work_id":"bae00e84-9b0d-433d-a066-20b951f0b4d0","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":87,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2026-02-12T16:14:29+00:00","last_pith_cited_at":"2026-05-31T22:31:15+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T13:08:47.477203+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":16},{"context_role":"method","n":5},{"context_role":"baseline","n":3},{"context_role":"other","n":2}],"polarity_counts":[{"context_polarity":"background","n":15},{"context_polarity":"use_method","n":5},{"context_polarity":"baseline","n":3},{"context_polarity":"unclear","n":2},{"context_polarity":"support","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T16:22:40.380085+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":34},{"title":"Reinforcement Learning via Self-Distillation","work_id":"b193541d-5853-4ea4-8e4b-8e4c08617eb6","shared_citers":26},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":24},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":22},{"title":"Self-Distillation Enables Continual Learning","work_id":"e9aa25e3-870c-46c8-8270-e4e5948d09f0","shared_citers":21},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":17},{"title":"On-Policy Context Distillation for Language Models","work_id":"b56a7e15-d864-43f4-9212-59bc7ec70d21","shared_citers":16},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":15},{"title":"Self-Distilled RLVR","work_id":"935a34f3-b83d-4214-b6a0-ae2395b3d107","shared_citers":14},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":13},{"title":"https://thinkingmachines.ai/blog/ on-policy-distillation/","work_id":"bb76b11f-d59b-421e-88c6-fa0920ed09c3","shared_citers":10},{"title":"MiMo-V2-Flash Technical Report","work_id":"1f3df90c-4bc3-49b1-ad9b-7f3b34e4ffba","shared_citers":10},{"title":"Privileged information distillation for language models","work_id":"674b7199-1d6e-4f36-89f1-fe1abe5b4db1","shared_citers":10},{"title":"CRISP: Compressed Reasoning via Iterative Self-Policy Distillation","work_id":"c5a99022-15b6-4d77-9850-23036df7a073","shared_citers":9},{"title":"arXiv preprint arXiv:2602.12125 , year=","work_id":"bb968107-1f43-4bf4-aa52-cc58000a6e89","shared_citers":8},{"title":"OpenThoughts: Data Recipes for Reasoning Models","work_id":"c7acbe41-27a0-4773-a7be-8f08d86cdf21","shared_citers":8},{"title":"Process Reinforcement through Implicit Rewards","work_id":"c31a2126-86f9-44f3-91f3-208d0fc1463a","shared_citers":8},{"title":"Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?","work_id":"8df6a2d1-d890-48ae-af85-c11643a91097","shared_citers":8},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":7},{"title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","work_id":"42b43df0-4c82-493f-9d9b-1be8c116d9af","shared_citers":7},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":7},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":7},{"title":"A Survey of On-Policy Distillation for Large Language Models","work_id":"f6aaea8e-1f0d-43e3-b28f-6066d3e0a66b","shared_citers":6},{"title":"OpenClaw-RL: Train Any Agent Simply by Talking","work_id":"78607317-8305-4515-8dc3-20b4ff5b8f3a","shared_citers":6}],"time_series":[{"n":43,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T16:32:41.697915+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T16:22:23.520122+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","claims":[{"claim_text":"Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuitio","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T16:32:41.675774+00:00"}},"summary":{"title":"Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models","claims":[{"claim_text":"Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuitio","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":34},{"title":"Reinforcement Learning via Self-Distillation","work_id":"b193541d-5853-4ea4-8e4b-8e4c08617eb6","shared_citers":26},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":24},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":22},{"title":"Self-Distillation Enables Continual Learning","work_id":"e9aa25e3-870c-46c8-8270-e4e5948d09f0","shared_citers":21},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":17},{"title":"On-Policy Context Distillation for Language Models","work_id":"b56a7e15-d864-43f4-9212-59bc7ec70d21","shared_citers":16},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":15},{"title":"Self-Distilled RLVR","work_id":"935a34f3-b83d-4214-b6a0-ae2395b3d107","shared_citers":14},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":13},{"title":"https://thinkingmachines.ai/blog/ on-policy-distillation/","work_id":"bb76b11f-d59b-421e-88c6-fa0920ed09c3","shared_citers":10},{"title":"MiMo-V2-Flash Technical Report","work_id":"1f3df90c-4bc3-49b1-ad9b-7f3b34e4ffba","shared_citers":10},{"title":"Privileged information distillation for language models","work_id":"674b7199-1d6e-4f36-89f1-fe1abe5b4db1","shared_citers":10},{"title":"CRISP: Compressed Reasoning via Iterative Self-Policy Distillation","work_id":"c5a99022-15b6-4d77-9850-23036df7a073","shared_citers":9},{"title":"arXiv preprint arXiv:2602.12125 , year=","work_id":"bb968107-1f43-4bf4-aa52-cc58000a6e89","shared_citers":8},{"title":"OpenThoughts: Data Recipes for Reasoning Models","work_id":"c7acbe41-27a0-4773-a7be-8f08d86cdf21","shared_citers":8},{"title":"Process Reinforcement through Implicit Rewards","work_id":"c31a2126-86f9-44f3-91f3-208d0fc1463a","shared_citers":8},{"title":"Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?","work_id":"8df6a2d1-d890-48ae-af85-c11643a91097","shared_citers":8},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":7},{"title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","work_id":"42b43df0-4c82-493f-9d9b-1be8c116d9af","shared_citers":7},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":7},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":7},{"title":"A Survey of On-Policy Distillation for Large Language Models","work_id":"f6aaea8e-1f0d-43e3-b28f-6066d3e0a66b","shared_citers":6},{"title":"OpenClaw-RL: Train Any Agent Simply by Talking","work_id":"78607317-8305-4515-8dc3-20b4ff5b8f3a","shared_citers":6}],"time_series":[{"n":43,"year":2026}],"dependency_candidates":[]},"authors":[]}}