Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

2026 · cs.LG · arXiv 2604.10674

8 Pith papers cite this work. Polarity classification is still being indexed.

abstract

Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent's own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance-weighted reverse-KL loss to provide gradient-correct token-level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill-SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill-sd/
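
The token-level distillation signal described in the abstract can be illustrated with a minimal sketch. It computes a plain reverse KL, KL(student || teacher), per token position and averages it over a sequence. This is not the importance-weighted loss the paper derives, whose exact form is not given here. The function names are hypothetical, and the sketch assumes the teacher logits were produced from a skill-augmented prompt while the student logits came from the plain task prompt.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: both logit vectors are over the same vocabulary. The teacher
    logits are assumed to come from a skill-conditioned prompt and the student
    logits from the plain prompt; that conditioning happens outside this code.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary size")
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    # Expectation under the student distribution of log(student / teacher).
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def mean_reverse_kl(student_seq, teacher_seq):
    """Average the per-token reverse KL over a sequence of positions."""
    if len(student_seq) != len(teacher_seq) or not student_seq:
        raise ValueError("sequences must be non-empty and equally long")
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)
```

The reverse direction takes the expectation under the student, so the penalty falls on tokens the student favors but the teacher considers unlikely. A real training setup would compute this on tensors with automatic differentiation; the pure-Python version only shows the arithmetic.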

citation-role summary

background: 3

citation-polarity summary

background: 2 · unclear: 1

representative citing papers

TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

HINT-SD improves long-horizon LLM agent training by using hindsight to target self-distillation on failure-relevant action spans, delivering up to 18.8% higher performance and 2.26x lower time per step than dense per-turn feedback.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.

GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

cs.CV · 2026-05-20 · unverdicted · novelty 5.0 · 2 refs

GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

cs.LG · 2026-05-12 · unverdicted · novelty 5.0 · 4 refs

Using sparse rewards on capable teachers for exploration, followed by dense distillation into students, outperforms applying sparse rewards (as in GRPO) directly to the deployment model.

A Brief Overview: On-Policy Self-Distillation In Large Language Models

cs.HC · 2026-05-18 · unverdicted · novelty 2.0 · 2 refs

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.
