pith. machine review for the scientific record.

arxiv: 2604.13064 · v1 · submitted 2026-03-19 · 💻 cs.CL · cs.CY

Recognition: no theorem link

Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 08:30 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords agent skills · LLM agents · skill registry · cross-lingual analysis · security risks · risk prediction · ClawHub · empirical study

The pith

ClawHub analysis of 26,502 agent skills finds language-based functional clusters, suspicious or malicious labels on over 30 percent of skills, and submission-time risk prediction reaching 79 percent AUROC.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the structure and risks of a large public registry of reusable skills for LLM agents. It normalizes the full set of 26,502 skills and clusters them by function, revealing that English-language skills concentrate on technical infrastructure such as APIs, automation, and memory handling, while Chinese-language skills form tighter groups around concrete applications including media generation, social content, and financial services. More than 30 percent of all skills carry platform-assigned suspicious or malicious labels, and a substantial share lack complete safety information. The authors test early risk detection using only data available at submission and show that logistic regression on these signals, led by primary documentation, reaches 72.62 percent accuracy and 78.95 percent AUROC on a balanced benchmark of 11,010 skills.
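
The exact feature pipeline is not specified here; the following is a minimal sketch of the submission-time prediction task, assuming TF-IDF features over the primary documentation text. The file name and the doc_text / is_risky columns are hypothetical placeholders, not artifacts from the paper.

    # Minimal sketch: risk prediction from documentation alone (assumed features).
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    skills = pd.read_csv("clawhub_skills.csv")  # hypothetical export of the 11,010-skill benchmark

    X_train, X_test, y_train, y_test = train_test_split(
        skills["doc_text"], skills["is_risky"],
        test_size=0.2, stratify=skills["is_risky"], random_state=0,
    )

    # Logistic regression on documentation text only, mirroring the finding
    # that primary documentation is the strongest submission-time signal.
    model = make_pipeline(TfidfVectorizer(max_features=20_000), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
    print("AUROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))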

Core claim

Normalizing and clustering 26,502 skills from the ClawHub registry demonstrates clear cross-lingual differences: English skills center on infrastructure and technical capabilities such as APIs, automation, and memory, whereas Chinese skills organize around scenario-driven applications like media generation, social content production, and financial services. Over 30 percent of the skills are labeled suspicious or malicious by platform signals, and submission-time risk prediction using logistic regression on features including primary documentation achieves 72.62 percent accuracy and 78.95 percent AUROC on a balanced set of 11,010 skills.

What carries the argument

Clustering of normalized skill descriptions by language and function together with logistic regression on submission-time signals such as primary documentation.
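
Neither the clustering algorithm nor the text representation is specified in this summary; a minimal sketch, assuming multilingual sentence embeddings and K-Means (the choices named in the simulated rebuttal below), with placeholder descriptions standing in for the real corpus:

    # Language-separated functional clustering; the embedding model and k are
    # assumptions, not choices confirmed by the paper.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    def cluster_descriptions(descriptions, n_clusters):
        """Embed skill descriptions and group them by function."""
        encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        embeddings = encoder.encode(descriptions, normalize_embeddings=True)
        return KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)

    # Chinese and English skills are clustered separately because tokenization
    # and vocabulary differ too much for one shared clustering (17,499 English
    # and 3,882 Chinese skills in the paper; two placeholders each here).
    en_labels = cluster_descriptions(["Automate REST API calls", "Persistent agent memory"], n_clusters=2)
    zh_labels = cluster_descriptions(["生成短视频脚本", "股票行情查询"], n_clusters=2)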

If this is right

  • English and Chinese skills play different roles in agent capability sharing.
  • Public skill registries contain a sizable fraction of potentially risky items.
  • Risk can be flagged at publication using only documentation and metadata.
  • Primary documentation is the strongest early indicator of security issues.
  • Agent skill ecosystems require ongoing safety monitoring alongside reuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If label noise is high, the true risk level could differ enough to change moderation priorities (see the sketch after this list).
  • Similar language splits may appear in other public agent skill registries.
  • Requiring richer documentation at submission could simultaneously improve usability and risk detection.
  • Language-specific moderation policies might be needed to address the differing cluster types.
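
On the label-noise point: if a manual audit estimated the platform scanner's sensitivity and specificity, the observed 30 percent could be corrected with the standard Rogan-Gladen estimator. A worked sketch with hypothetical error rates:

    # Rogan-Gladen correction: true prevalence from an observed positive rate.
    # The sensitivity/specificity values are hypothetical; the paper reports
    # no audit of the platform labels.
    def corrected_prevalence(observed, sensitivity, specificity):
        return (observed + specificity - 1.0) / (sensitivity + specificity - 1.0)

    print(corrected_prevalence(0.30, sensitivity=0.90, specificity=0.95))  # ~0.29
    print(corrected_prevalence(0.30, sensitivity=0.90, specificity=0.75))  # ~0.08

Under the second assumption, a 25 percent false-positive rate would account for most of the observed 30 percent, which is exactly the scenario that would change moderation priorities.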

Load-bearing premise

The platform-provided suspicious and malicious labels are reliable ground truth and the set of 26,502 crawled skills represents the full registry without major selection bias.

What would settle it

An independent manual audit of a random sample of labeled skills that finds a substantially lower or higher true rate of malicious behavior than the reported 30 percent would falsify the prevalence claim.
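
The statistics of such an audit are straightforward: a Wilson interval on the audited sample bounds the plausible true rate, and the prevalence claim fails if the interval excludes 30 percent. A minimal sketch with hypothetical audit counts:

    # Wilson confidence interval for the true rate of malicious behavior in a
    # manual audit; the counts below are illustrative, not from the paper.
    from statsmodels.stats.proportion import proportion_confint

    n_audited = 400     # randomly sampled platform-labeled skills
    n_confirmed = 90    # audited skills confirmed suspicious or malicious
    low, high = proportion_confint(n_confirmed, n_audited, alpha=0.05, method="wilson")
    print(f"95% CI for true rate: [{low:.3f}, {high:.3f}]")  # ~[0.187, 0.268], below 30%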

Figures

Figures reproduced from arXiv: 2604.13064 by Haichuan Hu, Quanjun Zhang, Ye Shang.

Figure 1. Weekly accumulated creation, update, and install counts on ClawHub.
Figure 2. Data collection pipeline.
Figure 3. Domain distribution of skills.
Figure 4. Cluster analysis of skill functionality (Chinese and English skills clustered separately; 17,499 English and 3,882 Chinese skills retained after filtering out low-quality entries).
Figure 5. Download distribution of skills.
Figure 6. Risk overview: temporal trends and categorical distributions of risk labels (panels 6a and 6b).
Figure 7. Comparison between risk detection by LLMs and security scanning tools.
Figure 8. Risk distribution by domain and function.
Figure 9. List of owners aggregating high-risk skills.
Original abstract

Skill ecosystems have emerged as an increasingly important layer in Large Language Model (LLM) agent systems, enabling reusable task packaging, public distribution, and community-driven capability sharing. However, despite their rapid growth, the functionality, ecosystem structure, and security risks of public skill registries remain underexplored. In this paper, we present an empirical study of ClawHub, a large public registry of agent skills. We build and normalize a dataset of 26,502 skills, and conduct a systematic analysis of their language distribution, functional organization, popularity, and security signals. Our clustering results show clear cross-lingual differences: English skills are more infrastructure-oriented and centered on technical capabilities such as APIs, automation, and memory, whereas Chinese skills are more application-oriented, with clearer scenario-driven clusters such as media generation, social content production, and finance-related services. We further find that more than 30% of all crawled skills are labeled as suspicious or malicious by available platform signals, while a substantial fraction of skills still lack complete safety observability. To study early risk assessment, we formulate submission-time skill risk prediction using only information available at publication time, and construct a balanced benchmark of 11,010 skills. Across 12 classifiers, the best Logistic Regression achieves an accuracy of 72.62% and an AUROC of 78.95%, with primary documentation emerging as the most informative submission-time signal. Our findings position public skill registries as both a key enabler of agent capability reuse and a new surface for ecosystem-scale security risk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents an empirical study of the ClawHub public registry of LLM agent skills. It builds a normalized dataset of 26,502 skills and analyzes language distribution, functional organization through clustering, popularity, and security signals. Key claims include clear cross-lingual differences (English skills infrastructure-oriented around APIs/automation/memory; Chinese skills application-oriented around media generation, social content, and finance), more than 30% of skills labeled suspicious or malicious by platform signals, and a submission-time risk prediction task on a balanced benchmark of 11,010 skills where logistic regression achieves 72.62% accuracy and 78.95% AUROC, with primary documentation as the most informative feature.

Significance. If the methodological gaps are addressed and platform labels validated, the work would provide timely empirical insights into the structure, cross-lingual variation, and security risks of emerging public skill ecosystems for LLM agents. It could inform registry design, early risk detection, and policy on capability reuse, while highlighting a new attack surface in agent systems.

major comments (3)
  1. [Clustering results] Clustering results (Abstract and corresponding analysis section): No details are supplied on the clustering algorithm, feature representation (e.g., embeddings or TF-IDF), normalization, number of clusters, or validation/labeling procedure. These omissions make it impossible to reproduce or assess the central claim of clear cross-lingual differences in skill organization.
  2. [Security signals analysis] Security signals and prevalence claim (Abstract and security analysis section): The >30% suspicious/malicious fraction and the ground-truth labels for the risk classifier are taken directly from platform signals with no error analysis, manual audit, cross-check against content, or discussion of potential label noise or bias by language or skill type. This directly undermines both the prevalence statistic and the reliability of the reported classifier performance.
  3. [Risk prediction] Risk prediction experiment (Abstract and prediction section): The manuscript provides no description of the 12 classifiers, feature construction from submission-time signals, how the balanced benchmark of 11,010 skills was constructed, handling of incomplete safety data, or evaluation protocol (e.g., train/test split, hyperparameter tuning). Without these, the 72.62% accuracy and 78.95% AUROC for logistic regression cannot be verified.
minor comments (2)
  1. [Abstract] The abstract refers to 'available platform signals' and 'primary documentation' without defining these terms or listing the exact signals used.
  2. [Dataset] Dataset construction details (normalization steps, deduplication, language detection) are mentioned but not elaborated, affecting reproducibility of the 26,502-skill corpus.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the reproducibility and transparency of our empirical analysis. We will revise the manuscript to address the methodological omissions and provide additional context on data labeling and experimental protocols. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [Clustering results] Clustering results (Abstract and corresponding analysis section): No details are supplied on the clustering algorithm, feature representation (e.g., embeddings or TF-IDF), normalization, number of clusters, or validation/labeling procedure. These omissions make it impossible to reproduce or assess the central claim of clear cross-lingual differences in skill organization.

    Authors: We agree that the clustering methodology requires explicit documentation to support reproducibility. In the revised version we will add a dedicated subsection in the analysis section that specifies the algorithm (K-Means), the feature representation (sentence embeddings from a multilingual model followed by TF-IDF weighting on skill descriptions), normalization steps, the criterion used to select the number of clusters, and the procedure for interpreting and labeling clusters via top terms and representative examples. This will allow readers to assess the cross-lingual organizational differences we report. revision: yes

  2. Referee: [Security signals analysis] Security signals and prevalence claim (Abstract and security analysis section): The >30% suspicious/malicious fraction and the ground-truth labels for the risk classifier are taken directly from platform signals with no error analysis, manual audit, cross-check against content, or discussion of potential label noise or bias by language or skill type. This directly undermines both the prevalence statistic and the reliability of the reported classifier performance.

    Authors: We acknowledge the need for greater scrutiny of the platform-provided labels. In the revision we will expand the security analysis section with a limitations paragraph that discusses the source of the signals, reports the fraction of skills with complete versus partial safety metadata, and notes potential sources of noise or language-specific bias. While a comprehensive manual audit of all 26k skills was outside the scope of the current study, we will include a small-scale manual review of a random sample to provide initial validation of the label quality. revision: partial

  3. Referee: [Risk prediction] Risk prediction experiment (Abstract and prediction section): The manuscript provides no description of the 12 classifiers, feature construction from submission-time signals, how the balanced benchmark of 11,010 skills was constructed, handling of incomplete safety data, or evaluation protocol (e.g., train/test split, hyperparameter tuning). Without these, the 72.62% accuracy and 78.95% AUROC for logistic regression cannot be verified.

    Authors: We agree that the experimental setup must be described in full. The revised manuscript will include a new subsection detailing the 12 classifiers (logistic regression, random forest, gradient boosting, SVM, and neural baselines), the exact feature construction process from submission-time fields (documentation text, metadata, and derived statistics), the construction of the balanced 11,010-skill benchmark (stratified sampling to equalize positive/negative classes while preserving language distribution), our handling of missing safety fields (imputation and exclusion criteria), and the evaluation protocol (5-fold cross-validation with hyperparameter search via grid search). These additions will make the reported performance figures verifiable. revision: yes
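
The protocol described in response 3 maps onto a standard scikit-learn setup; a minimal sketch of the 5-fold grid search, with a synthetic stand-in for the balanced 11,010-skill benchmark since the real feature matrix is not public here:

    # 5-fold cross-validated grid search over logistic regression, scored by
    # AUROC, as in the protocol sketched in the rebuttal. make_classification
    # is a synthetic stand-in for the balanced submission-time feature matrix.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    X, y = make_classification(n_samples=11_010, n_features=50, weights=[0.5, 0.5], random_state=0)

    search = GridSearchCV(
        estimator=LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    )
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 4))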

Circularity Check

0 steps flagged

No circularity: purely empirical analysis with external labels

Full rationale

The paper performs data collection from ClawHub, language analysis, clustering of skills, and supervised classification for risk prediction. The risk model trains standard classifiers (including logistic regression) on platform-provided suspicious/malicious labels using submission-time features; this is ordinary supervised learning, not a derivation that reduces to its inputs by construction. No equations, ansatzes, uniqueness theorems, or self-citations are load-bearing for any central claim. The cross-lingual clustering and prevalence statistics are direct observations from the crawled dataset. The study's evaluation rests on externally provided data and labels, and contains no self-definitional or fitted-input-called-prediction patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the reliability of external platform risk labels and the assumption that the collected sample reflects the broader ecosystem; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: platform signals for suspicious or malicious skills are accurate and unbiased ground truth.
    Directly supports the 30% suspicious finding and the supervised risk prediction task.

pith-pipeline@v0.9.0 · 5581 in / 1467 out tokens · 89780 ms · 2026-05-15T08:30:49.373755+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 7 internal anchors

  1. [1] Arash Ahmadi, Sarah Sharif, and Yaser M. Banad. MCP Bridge: A lightweight, LLM-agnostic RESTful proxy for Model Context Protocol servers. arXiv preprint arXiv:2504.08999.
  2. [2] Hossein Bahak, Farzaneh Taheri, Zahra Zojaji, and Arefeh Kazemi. Evaluating ChatGPT as a question answering system: A comprehensive analysis and comparison with existing models. arXiv preprint arXiv:2312.07592.
  3. [3] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452.
  4. [4] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680.
  5. [5] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670.
  6. [6] Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, et al. SkillNet: Create, evaluate, and connect AI skills. arXiv preprint arXiv:2603.04448.
  7. [7] George Ling, Shanshan Zhong, and Richard Huang. Agent skills: A data-driven analysis of Claude skills for extending large language model functionality. arXiv preprint arXiv:2602.08004.
  8. [8] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Huihui Chen, Chenyu Zhang, Tianyang Zhang, Yu Su, Maosong Sun, and Jie Tang. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688.
  9. [9] Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. ToolACE: Winning the points of LLM function calling. arXiv preprint arXiv:2409.00920.
  10. [10] Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1160–1183.
  11. [11] Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A survey on methodology, applications and challenges.
  12. [12] OpenClaw. Skills. OpenClaw documentation. Accessed: 2026-03-19.
  13. [13] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems.
  14. [14] Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014.
  15. [15] Anjana Sarkar and Soumyendu Sarkar. Survey of LLM agent communication with MCP: A software design pattern centric review. arXiv preprint arXiv:2506.05364.
  16. [16] Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. In ChatGPT we trust? Measuring and characterizing the reliability of ChatGPT. arXiv preprint arXiv:2304.08979.
  17. [17] Zhuocheng Shen. LLM with tools: A survey. arXiv preprint arXiv:2409.18807.
  18. [18] Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. Agentic reasoning and tool integration for LLMs via reinforcement learning. arXiv preprint arXiv:2505.01441.
  19. [19] Tom's Hardware. Malicious OpenClaw 'skill' targets crypto users on ClawHub, February. Accessed: 2026-03-19.
  20. [20] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
  21. [21] Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, et al. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. arXiv preprint arXiv:2508.20453.
  22. [22] Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. Survey on evaluation of LLM-based agents. arXiv preprint arXiv:2503.16416.
  23. [23] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.