{"total":18,"items":[{"citing_arxiv_id":"2607.01480","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Procedural Memory Distillation: Online Reflection for Self-Improving Language Models","primary_cat":"cs.AI","submitted_at":"2026-07-01T21:20:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PMD extracts and distills cross-episode procedural knowledge from RL rollouts into LLM policies at three abstraction levels, yielding 3.8-13.6% gains over SDPO on SCIKNOWEVAL and LIVECODEBENCH via co-evolution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30015","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parametric Skills","primary_cat":"cs.CL","submitted_at":"2026-06-29T09:19:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ParametricSkills uses a hypernetwork to turn textual skills into LoRA adapters, outperforming in-context learning by 6.44 points on average across six SWE subtasks with higher BERT Score and F1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29502","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation","primary_cat":"cs.AI","submitted_at":"2026-06-28T17:02:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UCOB improves agentic RL by using return-to-go comparisons between skill-conditioned and no-skill prompts as local teachers for bidirectional self-distillation and skill memory 
updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29476","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-28T16:11:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27814","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2026-06-26T07:56:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12634","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents","primary_cat":"cs.LG","submitted_at":"2026-06-10T19:53:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SGCD improves held-out scores on AppWorld and tau^3-airline by using LLM-summarized sibling contrasts to reshape GRPO advantages while keeping 
policy gradient in charge of the actor update.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09365","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory","primary_cat":"cs.AI","submitted_at":"2026-06-08T11:37:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03979","ref_index":138,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories","primary_cat":"cs.LG","submitted_at":"2026-06-02T17:56:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models can use a two-stage sleep process of upward distillation for memory consolidation and RL-based dreaming for unsupervised self-improvement to enable continual learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02684","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-06-01T17:58:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- 
and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02355","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training","primary_cat":"cs.AI","submitted_at":"2026-06-01T15:02:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SIRI trains LLM agents to discover, validate, and internalize reusable skills from their own rollouts without external generators or inference-time skill banks, yielding gains on ALFWorld and WebShop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21605","ref_index":44,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:12:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18141","ref_index":45,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","primary_cat":"cs.HC","submitted_at":"2026-05-18T09:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This overview paper explains the conceptual 
foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17873","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents","primary_cat":"cs.LG","submitted_at":"2026-05-18T05:34:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HINT-SD improves long-horizon LLM agent training by using hindsight to target self-distillation on failure-relevant action spans, delivering up to 18.8% higher performance and 2.26x lower time per step than dense per-turn feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12483","ref_index":19,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:57:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sparse rewards on capable teachers for exploration followed by dense distillation to students outperforms direct sparse reward application like GRPO on the deployment model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10038","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning","primary_cat":"cs.AI","submitted_at":"2026-05-11T06:09:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TimeClaw is an 
exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tool Use✓ ✓ ✓ ✓ ✓ ✗ ✗ ✓ ✓ Explore Learn.✓ ✗ ✗ ✓ ✗ ✗ ✗ ✓ ✓ Distill Reuse✗ ✗ ✗ ✗ ✗ ✓ ✓ ✓ ✓ Metric Compare✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ Agent Learning with Tools and External Experience. Recent agent-learning research has increasingly explored how agents can improve across interactions by using tools, retaining external experience, and learning from execution traces [27, 28, 38, 39, 7]. Tool-learning methods such as Toolformer [21], Gorilla [22], and ToolLLM [23] improve tool invocation, API selection, and action grounding. Memory- and skill-based methods further store reusable experience outside model weights: MemSkill [24] studies evolving memory skills, ASDA [25] distills structured skill files for inference-time use, and Skill-SD [26] compresses multi-turn trajectories into reusable guid-"},{"citing_arxiv_id":"2605.07725","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SOD: Step-wise On-policy Distillation for Small Language Model Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T13:30:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026. [65] Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, and Sanjeev Arora. 
Self-distillation zero: Self-revision turns binary rewards into dense supervision. arXiv preprint arXiv:2604.12002, 2026. [66] Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, et al. Skill-sd: Skill-conditioned self-distillation for multi-turn llm agents. arXiv preprint arXiv:2604.10674, 2026. [67] Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10."},{"citing_arxiv_id":"2605.05040","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization","primary_cat":"cs.LG","submitted_at":"2026-05-06T15:31:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"2, where we also justify why the required local conditions are mild in our setting. The theorem shows that the local sample complexity is governed by the smallest eigenvalue of the empirical information matrix: larger curvature yields a tighter estimation bound. For the Hessian estimate, each comparison pair contributes β²σ(mθ⋆(xi, yi⁺, yi⁻))(1 − σ(mθ⋆(xi, yi⁺, yi⁻))) di(θ⋆)di(θ⋆)⊤. (18) Context-Augmented Teacher vs. External Teacher. Thus a useful pair must satisfy two complementary conditions. 
First, from the perspective of self-distillation, the positive samples should come from a distribution that is more diverse than the current student distribution, so that the induced score-gap directions di(θ⋆) span informative directions beyond those already covered by the student."},{"citing_arxiv_id":"2604.22558","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-24T13:53:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}