SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Canonical reference. 89% of citing Pith papers cite this work as background.
abstract
Large Language Model (LLM) agents increasingly rely on domain-specific skills, yet manually authoring such skills does not scale, and skills generated purely from parametric knowledge often miss critical operational pitfalls. We introduce Trace2Skill, a framework that consolidates broad execution trajectories in parallel into a unified skill directory through inductive reasoning over agent experience. Trace2Skill supports both deepening existing human-written skills and creating useful skills from weak LLM-generated drafts. Experiments demonstrate the effectiveness of Trace2Skill across diverse domains, including office workflows, math reasoning, and vision QA. Importantly, the evolved skills are not merely memorized artifacts of the trajectories used to create them: they often transfer across model scales, across model families, and to out-of-distribution settings. For example, skills evolved from Qwen3.5-35B trajectories improve a Qwen3.5-122B agent by up to $57.65$ percentage points on WikiTableQuestions. Further analyses show that Trace2Skill outperforms sequential skill editing and ReasoningBank-style retrieval memories, compresses recurring failures and workarounds into standard operating procedures (SoPs), and yields portable skills that can be reused without parameter updates or test-time retrieval.
years
2026 39
roles
background 8
representative citing papers
The framework estimates the context-dependent marginal utility of candidate skills from reward gaps between matched base and skill-augmented rollouts, uses these estimates to filter skills, and co-trains the policy as a generator.
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
The paper diagnoses library drift in self-evolving LLM skill libraries and demonstrates a governance recipe raising pass@1 from 0.258 to 0.584 on MBPP+ hard-100.
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
Metric Freedom (F), quantified via Mantel test on output diversity and score variance, predicts when single-agent skill distillation from multi-agent systems will succeed, enabling up to 8x cost and 15x latency reductions across tested tasks.
COMFYCLAW introduces skill evolution via graph editing, automatic reversion, VLM verification, and distillation of runs into reusable Agent Skills, achieving higher average scores than a verifier-only baseline across benchmarks.
SoftSkill compresses agent skills into length-32 continuous prefixes via next-token training of soft deltas, yielding 5.2-12.5 point gains over SkillOpt on SearchQA and LiveMath while using far fewer tokens.
Skill-MAS evolves a meta-skill for LLM-based multi-agent system generation via multi-trajectory rollout and selective reflection to improve performance without parametric updates.
SkillCAT proposes a three-stage training-free pipeline for LLM agent skill self-evolution using contrastive causal extraction, assessment-augmented merging, and topology-aware execution, reporting up to 40.40% average score gains on agent benchmarks.
W2S framework with RWSA decomposition converts heterogeneous traces into Skills and improves behavioral replay consistency by 10.5% over summarization baselines on 70 Skills.
Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.
OptSkills clusters optimization problems by archetypes, distills workflow skills from successful trajectories, and achieves 68.27% micro-averaged accuracy on diverse benchmarks while outperforming DeepSeek-V3.2-Thinking by 4.53% on MIPLIB-NL.
SkillBrew introduces a Pareto-aware multi-objective optimization framework with bi-level propose-then-verify to curate skill banks for LLM agents, evaluated on two public benchmarks.
SGSD retrieves skill-mistake pairs to build a multi-teacher pool, validates teacher polarity via a verifier, and applies a gated objective to distill useful signals, yielding 6.2% average gains over GRPO on math benchmarks with Qwen3-1.7B.
SkillOpt introduces a controllable text-space optimizer that evolves agent skills via add/delete/replace edits accepted only on strict held-out validation improvement, reporting consistent gains across 52 model-benchmark-harness combinations.
A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.
Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.
GraphMind builds and evolves action-centric workflow graphs from traces, navigates them via multi-agent LLM reasoning, and adapts via ATR, outperforming baselines on 93 incidents with 8x less context and 26% lower hallucination in production deployment.
A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
SPARK generates environment-verified trajectories to compute PDI, enabling posterior skill distillation that outperforms no-skill baselines and human-written skills across 86 tasks with up to 1000x cheaper inference.
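One of the summaries above describes filtering candidate skills by the reward gap between matched base and skill-augmented rollouts. The following is a minimal sketch of that idea only, not the cited paper's implementation: the function names, the skill names, the reward values, and the zero threshold are all hypothetical and chosen for illustration.

```python
from statistics import mean


def marginal_utility(base_rewards, skill_rewards):
    """Mean reward gap over matched rollouts (skill-augmented minus base).

    Assumption: rollout i in both lists comes from the same task instance.
    """
    if len(base_rewards) != len(skill_rewards) or not base_rewards:
        raise ValueError("rollouts must be matched and non-empty")
    return mean([s - b for b, s in zip(base_rewards, skill_rewards)])


def filter_skills(rollouts, threshold=0.0):
    """Keep skills whose estimated utility exceeds a (hypothetical) threshold."""
    utilities = {
        name: marginal_utility(base, with_skill)
        for name, (base, with_skill) in rollouts.items()
    }
    return {name: u for name, u in utilities.items() if u > threshold}


# Illustrative rewards only: (base rollouts, skill-augmented rollouts).
rollouts = {
    "check_header_row": ([0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]),
    "always_retry": ([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]),
}
kept = filter_skills(rollouts)
print(kept)
```

In this toy data the first skill raises the mean reward and is kept, while the second lowers it and is dropped. A real estimator would also need to account for variance across rollouts and for the context in which each skill is applied.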
citing papers explorer
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents