Organizing, orchestrating, and benchmarking agent skills at ecosystem scale
Canonical reference. 88% of citing Pith papers cite this work as background.
Years: 2026 (22)
Representative citing papers
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
-
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
SkillGenBench is a benchmark for evaluating LLM skill generation pipelines in task-conditioned and task-agnostic regimes from repository and document sources using execution-based checks.
-
OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
-
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
-
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
-
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
-
Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents
Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.
-
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.
-
SkillEvolver: Skill Learning as a Meta-Skill
A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
-
SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
-
Skill-R1: Agent Skill Evolution via Reinforcement Learning
Skill-R1 applies bi-level group-relative policy optimization to evolve skills recurrently from verified outcomes, yielding gains over baselines on multi-step tasks.
-
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill improves LLM query planning on knowledge QA by using explicit skill selection from an evolving SkillBank and a two-stage SFT process that aligns training with inference-time skill-grounded execution.
-
Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries
GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill uses a self-evolving multi-agent loop with Challenger, Reasoner, Judge, and Cross-time Replay to discover context-specific skills, improving task-solving rates on CL-bench benchmarks across models.
-
Toward Scalable Terminal Task Synthesis via Skill Graphs
SkillSynth uses a scenario-mediated skill graph to sample workflow paths and generate executable terminal tasks, enabling controlled diversity in training trajectories for agents.
-
Workflow Closure Is Not Scientific Closure in Auto-Research Systems
Survey of auto-research systems identifies objective, validation, and acceptance collapses, concluding that workflow closure does not equal scientific closure and advocating non-autonomous epistemic control.
-
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.
-
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
Bilevel optimization with outer-loop MCTS for skill structure and inner-loop LLM refinement improves agent accuracy on an operations-research question-answering dataset.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.