SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
hub Canonical reference
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Canonical reference. 89% of citing Pith papers cite this work as background.
abstract
Large Language Model (LLM) agents increasingly rely on domain-specific skills, yet manually authoring such skills does not scale, and skills generated purely from parametric knowledge often miss critical operational pitfalls. We introduce Trace2Skill, a framework that consolidates broad execution trajectories in parallel into a unified skill directory through inductive reasoning over agent experience. Trace2Skill supports both deepening existing human-written skills and creating useful skills from weak LLM-generated drafts. Experiments demonstrate the effectiveness of Trace2Skill across diverse domains, including office workflows, math reasoning, and vision QA. Importantly, the evolved skills are not merely memorized artifacts of the trajectories used to create them: they often transfer across model scales, across model families, and to out-of-distribution settings. For example, skills evolved from Qwen3.5-35B trajectories improve a Qwen3.5-122B agent by up to $57.65$ percentage points on WikiTableQuestions. Further analyses show that Trace2Skill outperforms sequential skill editing and ReasoningBank-style retrieval memories, compresses recurring failures and workarounds into standard operating procedures (SoPs), and yields portable skills that can be reused without parameter updates or test-time retrieval.
hub tools
citation-role summary
citation-polarity summary
years
2026 26roles
background 8representative citing papers
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
Metric Freedom (F), quantified via Mantel test on output diversity and score variance, predicts when single-agent skill distillation from multi-agent systems will succeed, enabling up to 8x cost and 15x latency reductions across tested tasks.
Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.
OptSkills clusters optimization problems by archetypes, distills workflow skills from successful trajectories, and achieves 68.27% micro-averaged accuracy on diverse benchmarks while outperforming DeepSeek-V3.2-Thinking by 4.53% on MIPLIB-NL.
SkillBrew introduces a Pareto-aware multi-objective optimization framework with bi-level propose-then-verify to curate skill banks for LLM agents, evaluated on two public benchmarks.
SGSD retrieves skill-mistake pairs to build a multi-teacher pool, validates teacher polarity via a verifier, and applies a gated objective to distill useful signals, yielding 6.2% average gains over GRPO on math benchmarks with Qwen3-1.7B.
A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.
Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.
A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.
DataCOPE uses verifier-guided contrastive distillation from agent trajectories to discover skills, yielding average gains of 9.71% on report-style and 32.30% on reasoning-style data analysis tasks across four model settings.
SkillSmith introduces a synergy-aware skill-tool co-evolution framework with atomic bundles, Lotka-Volterra-inspired interaction modeling, and anti-pattern recording that outperforms baselines on complex tasks.
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and SearchQA while showing some skills remain externally useful.
Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to smaller models.
citing papers explorer
-
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.