SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

2026 · cs.AI · arXiv 2605.05726

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption breaks down as skill ecosystems grow under tight context and latency budgets. Despite its practical importance, skill retrieval remains underexplored, with limited benchmarks and little understanding of retrieval behavior on realistic skill libraries. To address this gap, we introduce SkillRet, a large-scale benchmark for skill retrieval in LLM agents. SkillRet contains 17,810 public agent skills, organized with structured semantic tags and a two-level taxonomy spanning 6 major categories and 18 sub-categories. It provides 63,259 training samples and 4,997 evaluation queries with disjoint skill pools, enabling both benchmarking and retrieval-oriented training. Across a diverse set of retrievers, we find that skill retrieval remains far from solved: off-the-shelf models struggle on realistic large-scale skill libraries, and prior skill-retrieval models still leave substantial headroom. Task-specific fine-tuning on SkillRet raises NDCG@10 by 13.1 points over the strongest prior retriever and by 16.9 points over the strongest off-the-shelf retriever. Our analysis further suggests that these gains arise because fine-tuned models focus better on the small skill-relevant signals within long and noisy queries. These results establish SkillRet as a strong benchmark and foundation for future research on retrieval in large-scale agent systems.
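The abstract reports its gains in NDCG@10. For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@k is commonly computed for a single query. It is not taken from the paper: the function names, the linear-gain formulation, and the toy skill identifiers are assumptions made for illustration, and SkillRet's own evaluation code may differ in details such as the gain function or tie handling.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: skill identifiers in the order the retriever returned them.
    relevance:  dict mapping relevant skill identifiers to a gain (hypothetical format).
    """
    gains = [relevance.get(skill_id, 0.0) for skill_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0  # no relevant skill for this query
    return dcg_at_k(gains, k) / ideal

# Toy query with one relevant skill (identifiers are made up).
perfect = ndcg_at_k(["pdf-merge", "csv-clean"], {"pdf-merge": 1.0})
second = ndcg_at_k(["csv-clean", "pdf-merge"], {"pdf-merge": 1.0})
```

With a single relevant skill, the score is 1.0 when that skill is ranked first and falls to 1/log2(3) when it is ranked second, which shows how the metric penalizes relevant skills that appear lower in the list. A benchmark-level figure would typically be the mean of this per-query score over all evaluation queries.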

representative citing papers

Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security

cs.CR · 2026-06-10 · unverdicted · novelty 6.0

Runtime Skill Audit introduces targeted runtime probing to detect malicious LLM agent skills, reporting 90% accuracy and resilience to self-evolving attacks on 100 skills versus static baselines.

Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

cs.PL · 2026-05-27 · unverdicted · novelty 6.0

SaP converts prose skills to typed pseudocode via clustering and deterministic verification, yielding 82 vs 47 wins on ALFWorld unseen split versus Graph-of-Skills baseline.

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

MUSE-Autoskill introduces a skill-centric framework for self-evolving LLM agents through a unified lifecycle of skill creation, memory, management, evaluation, and refinement.

citing papers explorer


Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security

cs.CR · 2026-06-10 · unverdicted · none · ref 5 · internal anchor

Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

cs.PL · 2026-05-27 · unverdicted · none · ref 1 · internal anchor

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

cs.AI · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
