SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.
hub Canonical reference
Xskill: Continual learning from experience and skills in multimodal agents
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
hub tools
citation-role summary
citation-polarity summary
years
2026 21verdicts
UNVERDICTED 21roles
background 5polarities
background 5representative citing papers
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
COMFYCLAW introduces skill evolution via graph editing, automatic reversion, VLM verification, and distillation of runs into reusable Agent Skills, achieving higher average scores than a verifier-only baseline across benchmarks.
Multimodal skills retaining visual figures improve CUA benchmark scores by 8.3 points over text-only equivalents generated from the same source content.
SkillCAT proposes a three-stage training-free pipeline for LLM agent skill self-evolution using contrastive causal extraction, assessment-augmented merging, and topology-aware execution, reporting up to 40.40% average score gains on agent benchmarks.
SkillRevise iteratively refines initial LLM-generated agent skills using execution traces to diagnose defects and apply repairs, raising success rates from 36.05% to 61.63% on SkillsBench across three benchmarks and five LLMs.
Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.
EmbodiSkill uses skill-aware reflection on execution trajectories to update skills in embodied agents, achieving 93.28% success on ALFWorld with a frozen Qwen3.5-27B model, outperforming direct GPT-5.2 use by 31.58%.
ORACLE is a new agentic framework using adaptive context consolidation and teacher-student distillation to detect emerging scam patterns from incomplete, long-horizon app usage streams across 12 scam types.
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
SkillSmith introduces a synergy-aware skill-tool co-evolution framework with atomic bundles, Lotka-Volterra-inspired interaction modeling, and anti-pattern recording that outperforms baselines on complex tasks.
Survey of auto-research systems identifies objective, validation, and acceptance collapses, concluding that workflow closure does not equal scientific closure and advocating non-autonomous epistemic control.
PathoSage is a three-stage framework using Structured Evidence Deliberation and a Beta-Bernoulli experience system to improve patch-level pathology reasoning by mitigating hallucinations and tool conflicts.
AtlasVA organizes VLM agent memory into spatial heatmaps, visual exemplars, and symbolic skills, evolving atlases from trajectories to act as potential-based shaping rewards in teacher-free reinforcement learning.
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
citing papers explorer
-
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.