{"total":17,"items":[{"citing_arxiv_id":"2605.22675","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Self-Policy Distillation via Capability-Selective Subspace Projection","primary_cat":"cs.CL","submitted_at":"2026-05-21T16:18:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21924","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Visual-Advantage On-Policy Distillation for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-21T02:48:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21851","ref_index":2,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-21T00:55:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20005","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18740","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:57:04+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17531","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification","primary_cat":"cs.CV","submitted_at":"2026-05-17T16:30:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15113","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning from Language Feedback via Variational Policy 
Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-14T17:27:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13643","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-13T15:05:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12913","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Revisiting DAgger in the Era of LLM-Agents","primary_cat":"cs.LG","submitted_at":"2026-05-13T02:40:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12741","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning with Rare Success but Rich Feedback via Reflection-Enhanced 
Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:46:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12000","ref_index":3,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation","primary_cat":"cs.LG","submitted_at":"2026-05-12T11:49:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MA-BC partitions divergent expert data and pools non-conflicting pairs to achieve faster convergence to Pareto-optimal policies in MOMDPs, with a matching minimax lower bound.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Dolgov, A. Y. Ng, and S. Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008. [2] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004. [3] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The twelfth international conference on learning representations, 2024. 
[4] Shibbir Ahmed, Baijing Qiu, Chun-Wei Kong, Huang Xin, Fiaz Ahmad, and Jinlong Lin."},{"citing_arxiv_id":"2605.11613","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-12T06:43:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Several lines of work address this bottleneck by providing denser reward signals. Process Reward Models (PRMs) train a separate model to score intermediate reasoning steps [5, 6, 7, 8], but require step-level annotations or extensive Monte Carlo rollouts. On-policy distillation (OPD) uses a stronger teacher model to provide token-level supervision on the student's own trajectories [9, 10, 11], offering dense on-policy signals but requiring access to a separate, often larger, teacher model whose quality upper-bounds the student. On-policy Self-Distillation (OPSD) has recently emerged as a compelling alternative that addresses both limitations simultaneously. 
The key idea is to condition the model on environment feedback, such as ground-truth solutions, test results, or error messages, to form a self-teacher, then distill this"},{"citing_arxiv_id":"2605.11609","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information","primary_cat":"cs.LG","submitted_at":"2026-05-12T06:40:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"single bit per rollout that does not indicate which intermediate step was responsible, leaving credit assignment to individual reasoning steps as an open problem. To address this, two main directions have emerged: training a separate process reward model (PRM) to score intermediate steps [14; 26; 18], or applying on-policy distillation (OPD) to provide a token-level imitation signal from a stronger teacher [1; 4; 17]. Both, however, depend on an external model. Can the model itself supply this credit? On-policy self-distillation answers this in the affirmative. It specializes OPD by taking the teacher to be the student itself, conditioned on privileged context: typically a verified solution and any feedback from the environment. 
The token-level signal is then produced by the model's own forward pass"},{"citing_arxiv_id":"2605.10189","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design","primary_cat":"cs.LG","submitted_at":"2026-05-11T08:38:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing designability and an 8x speedup over RL baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"advanced protein design and alignment methods. Our proposed ProteinOPD method achieves optimal performance in both designability and multi-property objectives. On-policy Distillation (OPD) is valued in large language model (LLM) alignment due to its dual capability: efficiently adapting to new preferences while effectively resisting catastrophic forgetting [3, 39, 41]. Specifically, OPD enables a student to learn from token-level supervision provided by a teacher model on the student's own generated trajectories. Unlike mode-covering methods such as offline distillation or SFT [34, 40], OPD exhibits a mode-seeking nature [3]. 
As shown in Figure 1(a), OPD guides the student model to converge to the sharper and higher-reward modes of the teacher, rather than spreading probability mass across suboptimal"},{"citing_arxiv_id":"2605.08063","ref_index":34,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Flow-OPD: On-Policy Distillation for Flow Matching Models","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:50:15+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This motivates a more controllable multi-reward coordination mechanism. On-Policy Distillation Traditional offline distillation relies on fixed datasets and fails to adapt to the student's evolving trajectory. In contrast, On-Policy Distillation (OPD) dynamically couples the teacher's supervisory signal with the student's exploration space. In the LLM domain, OPD has seen rapid development: GKD [34] established the canonical framework to mitigate exposure bias; MiniLLM [35] and DistiLLM [36] introduced Reverse and Skewed KL to refine mode-seeking and optimization stability; G-OPD [37] unified OPD under KL-constrained RL theory; Entropy-Aware OPD [38] preserves diversity through adaptive divergence functions; Fast OPD [39] significantly accelerates computation via prefix truncation; and PACED [40] implements a competence-aware"},{"citing_arxiv_id":"2605.07725","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up 
to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"to Appendix F for detailed cases. (c) Radar chart comparing methods on four benchmarks. Recently, on-policy distillation (OPD) has emerged as a promising paradigm for post-training [23-27]. Unlike RL methods that rely on sparse, trajectory-level rewards [21], OPD provides dense token-level supervision [24] on trajectories sampled from the student's own policy [23], thereby alleviating the credit assignment difficulty inherent in sparse reward signals [28, 29] while substantially improving sample efficiency [30] and training stability [28]. However, our experiments show that directly transferring OPD to SLM-based TIR can lead to severe training instability [31-34]. We attribute this failure to a fundamental difference between"},{"citing_arxiv_id":"2605.06094","ref_index":1,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:13:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"and EMA-based teacher stabilization, VISD robustly scales to complex video sequences. Extensive experiments demonstrate that VISD significantly accelerates convergence and improves accuracy, grounding, and interpretability over strong baselines. Future work will explore more efficient judge modeling and richer feedback structures for broader grounded reasoning applications. 
References [1] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The twelfth international conference on learning representations, 2024. [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang,"}],"limit":50,"offset":0}