{"paper":{"title":"Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Decoupling prefix source from token-level KL direction reveals four distinct LLM distillation objectives that unify SFT, DAgger, offline RL, and OPD.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Anhao Zhao, Haoran Xin, Junlong Tong, Wenjie Li, Xiaoyu Shen, Yingqi Fan","submitted_at":"2026-05-16T06:05:27Z","abstract_excerpt":"Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. 
This follows from decomposing sequence-level KL over autoregressive response distributions.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The decomposition of sequence-level KL divergence into independent prefix-source and token-level KL-direction axes is valid and produces four distinct, usable objectives without hidden inconsistencies or additional constraints.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Decoupling prefix source from token-level KL direction reveals four distinct LLM distillation objectives that unify SFT, DAgger, offline RL, and OPD.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ddf49289492006433aaef28d8fcaa2051589329979739e6756f8b02d211cd8f9"},"source":{"id":"2605.16826","kind":"arxiv","version":1},"verdict":{"id":"c291704e-e0b9-4de3-ba22-dcce0db4383a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T20:45:16.306817Z","strongest_claim":"We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. 
This follows from decomposing sequence-level KL over autoregressive response distributions.","one_line_summary":"Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The decomposition of sequence-level KL divergence into independent prefix-source and token-level KL-direction axes is valid and produces four distinct, usable objectives without hidden inconsistencies or additional constraints.","pith_extraction_headline":"Decoupling prefix source from token-level KL direction reveals four distinct LLM distillation objectives that unify SFT, DAgger, offline RL, and OPD."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.16826/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T21:01:19.257304Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T20:51:12.717730Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T19:01:56.263033Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T18:33:26.406132Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"9f6fcdaac01836cdd713a2a1d37e1d4b0c1ddf558d4f5d7baf8033078619290c"},"references":{"count":42,"sample":[{"doi":"","year":2024,"title":"On-policy distillation of language models: Learning from self-generated mistakes","work_id":"f8236916-3196-4c62-b87d-c98e74ee3578","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"American mathematics competitions, 
2023","work_id":"4691852d-b17b-4313-b6ac-2fc073845d7b","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"Scheduled sampling for sequence prediction with recurrent neural networks","work_id":"c0f72f51-f3a3-462d-ae89-f6e73b4ac81e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Retaining by doing: The role of on-policy data in mitigating forgetting, 2025","work_id":"6c2970ce-db0f-4161-888e-b0a220e85e83","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.18653/v1/2025.findings-acl.782","year":2025,"title":"Unveiling the key factors for distilling chain-of-thought reasoning","work_id":"b7455570-d1ca-404a-94cd-a1c76e3ee0d7","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":42,"snapshot_sha256":"69b7276496f97fca1112304e57b2e3a5235abbabfcd1f9daf5573ed053a0d95a","internal_anchors":18},"formal_canon":{"evidence_count":2,"snapshot_sha256":"03e9ecd4e55e8a9abb0c654dfe35748f98ad8f8ed93f3deccbd1aff9cd82f5a6"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}