{"total":29,"items":[{"citing_arxiv_id":"2606.30626","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DOPD: Dual On-policy Distillation","primary_cat":"cs.AI","submitted_at":"2026-06-29T17:55:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30345","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training","primary_cat":"cs.LG","submitted_at":"2026-06-29T14:20:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DRIFT is an online self-evolution policy optimization framework using Difficulty Routing, Rhythm Gating, success buffers, and two-stage curriculum learning that reports new SOTA results on five reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29869","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation","primary_cat":"cs.CL","submitted_at":"2026-06-29T07:05:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARKD uses an RL policy network to adaptively balance FKL and RKL in LLM distillation, claiming gains of 0.4-0.6 points on Rouge-L and BertScore over 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28562","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision","primary_cat":"cs.CL","submitted_at":"2026-06-26T19:41:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SEAD applies entropy-guided token selection, KL annealing, and easy-to-hard curriculum to on-policy distillation and reports +4.8 average accuracy gain over vanilla OPD on six math benchmarks with OLMo-3 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27814","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2026-06-26T07:56:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02684","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-01T17:58:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level 
methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00755","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-05-30T14:44:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"TS-OPSD internalizes temperature via on-policy self-distillation to reheat entropy-collapsed RL policies in LLMs, providing stronger initialization for further training than continued RL or rollout temperature adjustment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00306","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking the Role of Temperature in Large Language Model Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-29T19:32:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Including temperature scaling makes forward KL divergence outperform reverse KL in LLM distillation on instruction benchmarks, overturning the τ=1 preference for reverse KL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22263","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-21T10:07:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step 
accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21924","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Visual-Advantage On-Policy Distillation for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-21T02:48:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21834","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation","primary_cat":"cs.LG","submitted_at":"2026-05-20T23:56:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21606","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Are Teacher Tokens Reliable? 
Position-Weighted On-Policy Self-Distillation for Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-20T18:14:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17862","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"$\\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control","primary_cat":"cs.LG","submitted_at":"2026-05-18T05:14:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16826","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-16T06:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening 
responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13643","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-13T15:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Local teachability collapse occurs in later trajectory segments during strong-to-weak OPD; a margin-based release rule using top-K teacher advantage and BIC change-point detection on sentence segments outperforms full-trajectory supervision on five in-domain benchmarks and preserves out-of-domain pe","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13255","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-13T09:38:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12652","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multi-Rollout On-Policy Distillation via Peer Successes and Failures","primary_cat":"cs.LG","submitted_at":"2026-05-12T18:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MOPD improves on-policy distillation by using peer successes and failures from multiple 
rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09725","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On-Policy Distillation with Best-of-N Teacher Rollout Selection","primary_cat":"cs.CV","submitted_at":"2026-05-10T19:49:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08737","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:48:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"supporting the mechanism-driven clip-threshold reading. Li et al. [23] characterise the modal-token concentration regime (peff≥0.99) that we exploit; our peff aggregator quantifies their phenomenology and turns it into a calibration target. Three orthogonal OPD failure modes appear in Fu et al. [11]; complementary analyses are in Jang et al. [16], Kim et al. [19], Ko et al. 
[20, 21], Song and Zheng [34], Xu et al. [41]. None characterize the λ-axis cliff or the IS-clip boundary itself. Format adherence and listwise ranking. Structured-output brittleness has motivated constrained decoding [4, 10, 29, 37, 39], benchmarks distinguishing structural from semantic violations [12, 36], and direct schema-RL [1, 24]."},{"citing_arxiv_id":"2605.08063","ref_index":38,"ref_count":5,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Flow-OPD: On-Policy Distillation for Flow Matching Models","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:50:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Flow-OPD is a two-stage on-policy distillation method for flow matching models that lifts GenEval from 63 to 92 and OCR from 59 to 94 on SD 3.5 Medium while preserving fidelity.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"teacher's supervisory signal with the student's exploration space. In the LLM domain, OPD has seen rapid development: GKD [34] established the canonical framework to mitigate exposure bias; MiniLLM [35] and DistiLLM [36] introduced Reverse and Skewed KL to refine mode-seeking and optimization stability; G-OPD [37] unified OPD under KL-constrained RL theory; Entropy-Aware OPD [38] preserves diversity through adaptive divergence functions; Fast OPD [39] significantly accelerates computation via prefix truncation; and PACED [40] implements a competence-aware curriculum based on gradient signal-to-noise analysis. 
Despite these LLM advancements, OPD remains underexplored in visual Flow Matching models, which require dense supervision within"},{"citing_arxiv_id":"2605.07865","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"KL for a KL: On-Policy Distillation with Control Variate Baseline","primary_cat":"cs.LG","submitted_at":"2026-05-08T15:24:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. URLhttps://arxiv.org/abs/2412.16720. [13] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026. URLhttps://arxiv.org/abs/2603.07079. [14] Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms.arXiv preprint arXiv:2510.13786, 2025. URL https://arxiv.org/abs/2510.13786. 
11 [15] Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon,"},{"citing_arxiv_id":"2605.07725","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024. [24] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InThe twelfth international conference on learning representations, 2024. [25] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026. [26] Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. 
Scaling reasoning efficiently via relaxed on-policy distillation."},{"citing_arxiv_id":"2605.07711","ref_index":33,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:16:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InAdvances in Neural Information Processing Systems 38 (NeurIPS 2025), 2025. [32] Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. Enhancing cross-tokenizer knowledge distillation with contextual dynamical mapping. InFindings of the Association for Computational Linguistics: ACL 2025, pages 8005-8018, Vienna, Austria, 2025. [33] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint, arXiv:2603.07079, 2026. [34] Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025. 
doi: 10.64434/tml.20251026."},{"citing_arxiv_id":"2605.07396","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rubric-based On-policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-08T07:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"OPD has become a promising post-training paradigm that replaces sparse rewards with dense feedback on student-generated trajectories, thereby not only mitigating exposure bias but also improving sample efficiency [8, 6, 7, 18]. Existing work strengthens OPD from several angles, including objective design and reward extrapolation [24, 22], training efficiency and signal calibration [25, 26, 27, 16, 28], cross-tokenizer distillation [29], and empirical analyses of failure modes and practical recipes [17, 30]. Frontier open-source models have also adopted OPD as a key component of post-training [5, 9, 10]. 
Despite this progress, the dominant line still assumes dense teacher probabilities or aligned token spaces, limiting proprietary-teacher and cross-architecture"},{"citing_arxiv_id":"2605.06597","ref_index":42,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniSD: Towards a Unified Self-Distillation Framework for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:22:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03677","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe","primary_cat":"cs.LG","submitted_at":"2026-05-05T12:15:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01347","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate","primary_cat":"cs.CL","submitted_at":"2026-05-02T09:41:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic 
and code performance over single-teacher OPD across multiple model sizes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Knowledge distillation [16] transfers a teacher's soft predictions to a smaller student. Agarwal et al. [1], Tan et al. [34] introduced OPD as a dense on-policy alternative to off-policy sequence-level distillation [21], and OPD has since been adopted in frontier-model post-training [10, 37] and surveyed by Song et al. [33]. Recent diagnostic work investigates when OPD fails [12, 22], and Jang et al. [19] and Jin et al. [20] propose objective-level reformulations to stabilize the reverse-KL objective. Penaloza et al. [30] extend OPD to leverage privileged information that is visible only to a single teacher at training time. These methods all rely on a single teacher with task-agnostic divergence; MAD-OPD instead provides debate-generated privileged information and a task-adaptive divergence."},{"citing_arxiv_id":"2604.24005","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents","primary_cat":"cs.LG","submitted_at":"2026-04-27T03:38:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13016","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and 
Recipe","primary_cat":"cs.LG","submitted_at":"2026-04-14T17:54:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}