SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
Pith reviewed 2026-05-08 11:35 UTC · model grok-4.3
The pith
Fine-tuning retrievers on SkillRet raises NDCG@10 by more than 13 points for skill selection in large LLM agent libraries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillRet contains 17,810 public agent skills organized by structured semantic tags and a two-level taxonomy of 6 major categories and 18 sub-categories, together with 63,259 training samples and 4,997 evaluation queries that use disjoint skill pools. Across multiple retrievers, off-the-shelf models struggle on these large-scale libraries and prior skill-retrieval models leave substantial headroom. Task-specific fine-tuning on SkillRet improves NDCG@10 by 13.1 points over the strongest prior retriever and by 16.9 points over the strongest off-the-shelf retriever. The performance lift occurs because the tuned models learn to focus on the limited skill-relevant signals inside long and noisy raw queries.
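For reference, NDCG@10 (the headline metric, following Järvelin and Kekäläinen [10]) can be computed per query as below. This is a minimal sketch assuming binary relevance labels; the paper's exact protocol may differ, for example by using graded relevance.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k for one query with binary relevance (1 if the retrieved
    skill is a gold skill for the query, else 0)."""
    rels = [1.0 if sid in relevant_ids else 0.0 for sid in ranked_ids[:k]]
    # DCG: each gain is discounted by log2 of its 1-based rank position.
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    # Ideal DCG: all relevant items placed at the top of the ranking.
    ideal = [1.0] * min(len(relevant_ids), k)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single gold skill is ranked 3rd among retrieved candidates.
print(ndcg_at_k(["s7", "s2", "s9"], {"s9"}))  # 0.5
```

A 13.1-point gain on this metric means the fine-tuned model moves gold skills substantially closer to the top of the ranked list, averaged over the 4,997 evaluation queries.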
What carries the argument
The SkillRet benchmark dataset, which supplies a two-level taxonomy, structured semantic tags, and separate training and test pools to support both evaluation and retrieval-oriented fine-tuning on realistic agent skill libraries.
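To make the record structure concrete, here is a hypothetical sketch of what a single SkillRet entry could look like. Every field name is an assumption inferred from the paper's description (semantic tags, two-level taxonomy), not the released dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SkillRecord:
    # Hypothetical fields inferred from the paper's description;
    # the released dataset's actual schema may differ.
    skill_id: str
    name: str
    description: str      # long, possibly noisy raw skill text
    major_category: str   # one of 6 top-level taxonomy nodes
    sub_category: str     # one of 18 second-level nodes
    semantic_tags: dict = field(default_factory=dict)
    # e.g. {"primary_action": "refactor", "primary_object": "python code",
    #       "domain": "software engineering"}
```

Training samples would then pair a query with one or more skill_ids, with the evaluation pool kept disjoint from the training pool.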
If this is right
- Fine-tuned retrievers become more practical for agent systems that must operate under tight context and latency limits.
- Skill selection no longer requires users to know or recall exact skill names as libraries grow.
- The benchmark supplies a standard yardstick that lets future retrieval methods be compared directly on the same noisy, large-scale setting.
- Analysis of query focus suggests that similar fine-tuning could help other retrieval problems where user input is long and contains extraneous material.
Where Pith is reading between the lines
- Widespread adoption of such tuned retrievers would let agents maintain larger skill libraries without proportional increases in prompt size or response time.
- The same data-construction approach could be applied to tool-use or function-calling benchmarks outside the agent domain.
- Expanding the taxonomy with additional real-world usage logs would test whether the current categories already capture most practical skill distinctions.
Load-bearing premise
The collected set of 17,810 skills and 4,997 evaluation queries mirrors the noise levels, length distributions, and skill frequencies found in actual deployed LLM agent systems.
What would settle it
Running the fine-tuned retrievers on queries and skill libraries drawn directly from a production LLM agent deployment and finding no meaningful NDCG improvement or a reversal of the reported ranking would falsify the claim that SkillRet training produces substantial gains.
Original abstract
As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption breaks down as skill ecosystems grow under tight context and latency budgets. Despite its practical importance, skill retrieval remains underexplored, with limited benchmarks and little understanding of retrieval behavior on realistic skill libraries. To address this gap, we introduce SkillRet, a large-scale benchmark for skill retrieval in LLM agents. SkillRet contains 17,810 public agent skills, organized with structured semantic tags and a two-level taxonomy spanning 6 major categories and 18 sub-categories. It provides 63,259 training samples and 4,997 evaluation queries with disjoint skill pools, enabling both benchmarking and retrieval-oriented training. Across a diverse set of retrievers, we find that skill retrieval remains far from solved: off-the-shelf models struggle on realistic large-scale skill libraries, and prior skill-retrieval models still leave substantial headroom. Task-specific fine-tuning on SkillRet substantially improves performance, improving NDCG@10 by +13.1 points over the strongest prior retriever and by +16.9 points over the strongest off-the-shelf retriever. Our analysis further suggests that these gains arise because fine-tuned models better focus on the small skill-relevant signals within long and noisy queries. These results establish SkillRet as a strong benchmark and foundation for future research on retrieval in large-scale agent systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SkillRet, a benchmark dataset containing 17,810 public agent skills organized by a two-level taxonomy (6 major categories, 18 sub-categories) along with 63,259 training samples and 4,997 evaluation queries using disjoint skill pools. It evaluates a range of retrievers on this data, reports that off-the-shelf and prior skill-retrieval models leave substantial headroom, and shows that task-specific fine-tuning improves NDCG@10 by +13.1 points over the strongest prior retriever and +16.9 over the strongest off-the-shelf model, attributing the gains to better focus on small relevant signals within long, noisy queries.
Significance. If the benchmark queries and skills are representative of real LLM agent deployments, SkillRet would provide a valuable large-scale resource for studying retrieval under realistic library sizes and query noise, with the fine-tuning results offering a concrete starting point for improving agent skill selection. The work highlights an underexplored systems challenge and supplies both data and initial empirical analysis that could support follow-on research.
Major comments (2)
- [Abstract, §4 (Experiments)] The headline NDCG@10 gains (+13.1 and +16.9) are presented without reported statistical significance tests, variance across runs, or details on how the 4,997 evaluation queries were sampled or constructed from the skill library. This leaves the magnitude and reliability of the fine-tuning improvement only moderately supported.
- [Introduction, §3 (Dataset)] The central claim that SkillRet captures 'noisy, long-form' requests and realistic skill distributions for large agent libraries is load-bearing for generalizability, yet the manuscript provides no quantitative validation such as query-length histograms, n-gram overlap statistics, or skill-frequency skew comparisons against any external production logs or deployed systems.
Minor comments (2)
- [Abstract] The abstract states that fine-tuned models 'better focus on the small skill-relevant signals'; this interpretation would be strengthened by an attention or feature-importance analysis in the results section (a minimal sketch of one such analysis follows this list).
- [§4] A table or figure reporting the full set of retriever baselines (including exact model names, embedding dimensions, and retrieval hyperparameters) is needed for reproducibility.
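One inexpensive version of the analysis the first minor comment asks for is erasure-based attribution in the spirit of Li et al. [18]: delete each query token and measure how much the query-skill similarity drops. A minimal sketch, assuming a sentence-transformers bi-encoder; the checkpoint name is illustrative, not necessarily the paper's exact model.

```python
from sentence_transformers import SentenceTransformer, util

# Any bi-encoder retriever works here; this checkpoint is illustrative.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

def token_importance(query: str, skill_text: str):
    """Erasure-style attribution: drop each whitespace token from the
    query and record how much the query-skill cosine similarity falls.
    Large drops mark tokens the retriever relies on."""
    tokens = query.split()
    base = util.cos_sim(model.encode(query), model.encode(skill_text)).item()
    scores = []
    for i in range(len(tokens)):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        sim = util.cos_sim(model.encode(ablated), model.encode(skill_text)).item()
        scores.append((tokens[i], base - sim))
    # Highest-impact tokens first: on SkillRet-style queries, these
    # should be the few skill-relevant spans amid the noise.
    return sorted(scores, key=lambda t: -t[1])
```

Comparing these scores between the base and fine-tuned encoders would directly test whether tuning concentrates importance on skill-relevant spans.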
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the empirical support for our results and the characterization of SkillRet. We address each major comment below and outline the corresponding revisions.
Point-by-point responses
- Referee: [Abstract, §4 (Experiments)] The headline NDCG@10 gains (+13.1 and +16.9) are presented without reported statistical significance tests, variance across runs, or details on how the 4,997 evaluation queries were sampled or constructed from the skill library. This leaves the magnitude and reliability of the fine-tuning improvement only moderately supported.
Authors: We agree that statistical significance testing and variance reporting would strengthen the claims. In the revised manuscript we will add standard deviations across three independent fine-tuning runs with different random seeds, along with p-values from paired t-tests comparing the fine-tuned retriever against the strongest baselines. We will also expand the description in §3 and §4 to detail the sampling procedure for the 4,997 evaluation queries, including the stratified selection across the two-level taxonomy and the explicit construction of disjoint skill pools between training and evaluation sets.
Revision: yes
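A minimal sketch of the proposed significance test, assuming per-query NDCG@10 scores are available for the fine-tuned model and a baseline, aligned on the same evaluation queries; the function and key names are illustrative.

```python
import numpy as np
from scipy import stats

def compare_retrievers(ndcg_tuned, ndcg_baseline):
    """Paired t-test over per-query NDCG@10 scores, one pair per
    evaluation query. Arrays are assumed aligned on the same queries
    (e.g., the 4,997 in the SkillRet evaluation split)."""
    ndcg_tuned = np.asarray(ndcg_tuned, dtype=float)
    ndcg_baseline = np.asarray(ndcg_baseline, dtype=float)
    t_stat, p_value = stats.ttest_rel(ndcg_tuned, ndcg_baseline)
    mean_gain = float(np.mean(ndcg_tuned - ndcg_baseline))
    return {"mean_gain": mean_gain, "t": float(t_stat), "p": float(p_value)}
```

Pairing on queries rather than comparing aggregate means removes per-query difficulty as a confounder, which matters on a benchmark with heavy-tailed skill frequencies.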
- Referee: [Introduction, §3 (Dataset)] The central claim that SkillRet captures 'noisy, long-form' requests and realistic skill distributions for large agent libraries is load-bearing for generalizability, yet the manuscript provides no quantitative validation such as query-length histograms, n-gram overlap statistics, or skill-frequency skew comparisons against any external production logs or deployed systems.
Authors: We will add query-length histograms, n-gram overlap statistics, and skill-frequency skew plots to §3 to quantitatively characterize the noisy, long-form nature of the queries and the category distributions within SkillRet. These internal analyses are derived directly from the public skill sources and generated queries. Direct comparisons to external production logs are not feasible because such proprietary deployment data are not publicly available; we therefore cannot perform those specific external validations.
Revision: partial
- Not addressed: direct quantitative comparisons against external production logs or deployed LLM agent systems, as such data are proprietary and inaccessible. (A sketch of the promised internal statistics follows.)
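A minimal sketch of the internal characterization promised for §3, assuming access to the evaluation queries and each query's gold skill id; the skew measure (share of query mass held by the top 10% of skills) is an illustrative choice, not the paper's.

```python
import collections
import numpy as np

def dataset_profile(queries, gold_skill_ids):
    """Internal statistics of the kind promised for §3: query-length
    distribution and skill-frequency skew. Inputs are raw query strings
    and, per query, the id of its gold skill."""
    lengths = np.array([len(q.split()) for q in queries])
    freq = collections.Counter(gold_skill_ids)
    counts = np.array(sorted(freq.values(), reverse=True), dtype=float)
    # Share of query mass held by the 10% most frequent skills: a crude
    # skew measure (near 1.0 = head-heavy, near 0.1 = uniform).
    head = max(1, len(counts) // 10)
    skew = counts[:head].sum() / counts.sum()
    return {
        "median_len": float(np.median(lengths)),
        "p95_len": float(np.percentile(lengths, 95)),
        "distinct_skills": len(counts),
        "top10pct_share": float(skew),
    }
```

Even without production logs, reporting these numbers would let readers judge how far SkillRet's noise and skew sit from their own deployments.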
Circularity Check
No circularity: empirical benchmark with direct evaluations on new dataset
Full rationale
The paper introduces SkillRet as a new benchmark dataset with 17,810 skills and 4,997 queries, then reports empirical NDCG@10 results from evaluating and fine-tuning retrievers on it. No equations, parameter fits, predictions derived from inputs, or self-citation chains are present in the abstract or described claims. Central results (e.g., +13.1 NDCG@10 gain) are straightforward performance measurements on the held-out evaluation split, with no reduction to self-defined quantities or prior author work as load-bearing justification. This is a standard empirical contribution without derivation steps that could be circular.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: publicly available agent skills can be meaningfully organized with semantic tags and a two-level taxonomy that reflects real usage.
Reference graph
Works this paper leans on
- [1] Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. jina-embeddings-v5-text: Task-targeted embedding distillation. arXiv preprint arXiv:2602.15547, 2026.
- [2] Anthropic. Claude Opus 4.6. https://www.anthropic.com/claude, 2026.
- [3] Oren Barkan, Yonatan Toib, Yehonatan Elisha, Jonathan Weill, and Noam Koenigstein. LLM explainability via attributive masking learning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9522–9537, 2024.
- [4] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277, 2019.
- [5] Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, and Denis Bykov. Diffusion-pretrained dense and contextual embeddings. arXiv preprint arXiv:2602.11151, 2026.
- [6] Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023.
- [7] Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, and Lijie Hu. SWE-Skills-Bench: Do agent skills actually help in real-world software engineering? arXiv preprint arXiv:2603.15401, 2026.
- [8] Aolong Huang, Liang Wang, Furu Wei, et al. Harrier-oss-v1. https://huggingface.co/microsoft/harrier-oss-v1-0.6b, 2026.
- [9] Hugging Face. MTEB leaderboard. https://huggingface.co/spaces/mteb/leaderboard, 2026.
- [10] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
- [11]
- [12] Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. SoK: Agentic skills -- beyond tool use in LLM agents. arXiv preprint arXiv:2602.20867, 2026.
- [13] Jina AI. jina-reranker-v2-base-multilingual. https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual, 2024.
- [14] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217, 2019.
- [15] Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024.
- [16] Fangzhou Li, Pagkratios Tagkopoulos, and Ilias Tagkopoulos. SkillFlow: Scalable and efficient agent skill retrieval system. arXiv e-prints, arXiv:2504, 2025.
- [17] Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, orchestrating, and benchmarking agent skills at ecosystem scale. arXiv preprint arXiv:2603.02176, 2026.
- [18] Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016.
- [19] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint, 2026.
- [20] Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos. Arctic-Embed: Scalable, efficient, and accurate text embedding models. arXiv preprint arXiv:2405.05374, 2024.
- [21] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2024.
- [22] Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Towards completeness-oriented tool retrieval for large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 1930–1940, 2024.
- [23] Qwen Team. Qwen3.5. https://qwen.ai/blog?id=qwen3.5, 2026. Accessed: 2026-04-22.
- [24] Stephen E. Robertson and Steve Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 232–241. Springer, 1994.
- [25] Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, and Zhaochun Ren. Retrieval models aren't tool-savvy: Benchmarking tool retrieval for large language models. arXiv preprint arXiv:2503.01763, 2025.
- [26] Octen Team. Octen series: Optimizing embedding models to #1 on RTEB leaderboard, 2025.
- [27] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- [28] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
- [29] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, 2023.
- [30] Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, and Huaxiu Yao. WebXSkill: Skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318, 2026.
- [31] Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. MetaClaw: Just talk -- an agent that meta-learns and evolves in the wild. arXiv preprint arXiv:2603.17187, 2026.
- [32] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-Pack: Packed resources for general Chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 641–649, 2024.
- [33] Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430, 2026.
- [34] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.
- [35] Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, and Rui Wang. F2LLM-v2: Inclusive, performant, and efficient embeddings for a multilingual world. arXiv preprint arXiv:2603.19223, 2026.
- [36] Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, et al. KaLM-Embedding-v2: Superior training techniques and data inspire a versatile embedding model. arXiv preprint arXiv:2506.20923, 2025.
- [37] YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuan Zhu, Baohua Dong, and Hangcheng Zhu. SkillRouter: Skill routing for LLM agents at scale. arXiv preprint arXiv:2603.22455, 2026.
- [38] Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-Skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026.
- [39] Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. Can large language models transform computational social science? Computational Linguistics, 50(1):237–291, 2024.
Appendix excerpt
Appendix A (data, code, and model): SkillRet is released at https://huggingface.co/datasets/ThakiCloud/SKILLRET, with code at https://github.com/T...
Tagging prompt: the annotating model is asked to label each skill along three axes, primary_action (what the skill DOES, the core verb or activity), primary_object (what the skill acts ON, the target or subject), and domain (what technical field the skill belongs to), and to output strict JSON of the form
    { "primary_action": [ {"label": "...", "description": "..."} ],
      "primary_object": [ {"label": "...", "description": "..."} ],
      "domain": [ {"label": "...", "description": "..."} ] }
with no markdown fences and no explanations outside the JSON.
Worked example: a long, noisy query asks for help refactoring a 400-line data_pipeline.py module that pulls records from three different APIs, normalises them, and inserts into Postgres; the request piles on async def with concurrent HTTP fetches, Pydantic models for the responses, full type hints, a __main__ block for standalone testing against edge cases, and clean passes under mypy --strict and ruff. The matched skill is a guide for writing clean, efficient, idiomatic Python 3.11+ code that enforces type hints, Pydantic v2 for APIs, comprehensions over loops, and EAFP error handling. An accompanying table contrasts per-token weights on the skill-relevant spans between the base model (Qwen3-Emb-0.6B) and the trained model (SkillRet-Emb-0...).
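A minimal sketch of how the tagging step above could be reproduced with an OpenAI-compatible chat client; the model name, prompt wording, and function are assumptions based only on the excerpt, not the paper's actual annotation pipeline.

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()

TAG_PROMPT = (
    "Label this agent skill along three axes and output strict JSON:\n"
    '{"primary_action": [{"label": "...", "description": "..."}],\n'
    ' "primary_object": [{"label": "...", "description": "..."}],\n'
    ' "domain": [{"label": "...", "description": "..."}]}\n'
    "No markdown fences, no explanations outside the JSON."
)

def tag_skill(skill_text: str, model: str = "gpt-4o-mini") -> dict:
    """Extract primary_action / primary_object / domain tags for one skill."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": TAG_PROMPT},
            {"role": "user", "content": skill_text},
        ],
        response_format={"type": "json_object"},  # guarantees parseable JSON
    )
    return json.loads(resp.choices[0].message.content)
```

Enforcing a JSON response format serves the same purpose as the excerpt's "no markdown fences" instruction: the output parses directly into structured tags.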