arxiv: 2606.00925 · v1 · pith:WLMT7QAPnew · submitted 2026-05-30 · 💻 cs.CR · cs.AI

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

Pith reviewed 2026-06-28 18:17 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords malicious skill detection · agentic skill ecosystems · security benchmarking · sandbox verification · supply chain security · runtime behavior analysis · semantic vetting

The pith

Semantic and signature checks miss up to 89% of malicious skills whose threats come from natural-language instructions or multicomponent logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Open agent platforms let contributors publish reusable skills, creating supply-chain risks when malicious code hides behind benign descriptions. The paper introduces SkillVetBench, a benchmark with a first stage that scans skill specifications for hidden intent and a second stage that executes flagged skills in a sandbox to collect runtime traces. Tests on real malicious samples from the OpenClaw ecosystem show that static methods fail on many cases involving complex instructions or interactions across components. This matters because undetected skills can trigger harmful actions such as file writes or process spawns once agents invoke them. The benchmark supplies concrete execution evidence to support detection verdicts.

Core claim

SkillVetBench is a two-stage benchmark that first performs semantic vetting on natural-language skill specifications and then executes flagged skills in an instrumented sandbox; experiments on confirmed malicious skills demonstrate that semantic-only and signature-based baselines miss up to 89% of threats arising from natural-language instructions, multicomponent logic, or cross-component interactions, while runtime attacks concentrate in a small set of high-permission primitives.

What carries the argument

SkillVetBench's two-stage process: semantic vetting over skill specifications, followed by sandbox execution that records traces for primitives such as exec, write_file, install_skill, and spawn.
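
The two-stage flow can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Skill` class, the keyword heuristic standing in for semantic vetting, the phrase list, and the pre-recorded `trace` standing in for a live sandbox are all assumptions made for the example. Only the four primitive names come from the text above.

```python
from dataclasses import dataclass, field

# Primitives named in the paper as where runtime attacks concentrate.
HIGH_PERMISSION_PRIMITIVES = {"exec", "write_file", "install_skill", "spawn"}

# Hypothetical phrase list; a real Stage 1 would use a far richer semantic check.
SUSPICIOUS_PHRASES = ("do not mention", "ignore previous instructions")


@dataclass
class Skill:
    name: str
    spec: str  # natural-language specification
    # Stand-in for sandbox output: (primitive, argument) pairs recorded at runtime.
    trace: list = field(default_factory=list)


def semantic_vet(skill: Skill) -> bool:
    """Stage 1 stand-in: flag a spec that contains any suspicious phrase."""
    text = skill.spec.lower()
    return any(phrase in text for phrase in SUSPICIOUS_PHRASES)


def sandbox_evidence(skill: Skill) -> list:
    """Stage 2 stand-in: keep only calls that use a high-permission primitive."""
    return [call for call in skill.trace if call[0] in HIGH_PERMISSION_PRIMITIVES]


def vet(skill: Skill) -> dict:
    """Run Stage 2 only on skills that Stage 1 flagged."""
    if not semantic_vet(skill):
        return {"skill": skill.name, "flagged": False, "evidence": [], "verdict": "not flagged"}
    evidence = sandbox_evidence(skill)
    verdict = "malicious" if evidence else "unconfirmed"
    return {"skill": skill.name, "flagged": True, "evidence": evidence, "verdict": verdict}
```

The point of the split is that a flag from the first stage is only a suspicion; the verdict rests on the recorded calls returned as evidence.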

Load-bearing premise

The collected malicious skills from the live OpenClaw ecosystem represent the wider threat landscape and sandbox execution can reliably surface the relevant malicious behaviors.

What would settle it

A new collection of malicious skills in which semantic analysis alone detects more than 80 percent would challenge the reported insufficiency of static methods.
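
The test described above reduces to a simple rate comparison. A minimal sketch, assuming each known-malicious skill has a boolean recording whether a semantic-only detector flagged it; the function names and the input format are hypothetical, and the 0.80 threshold is the "more than 80 percent" figure from the sentence above.

```python
def detection_rate(flags):
    """Fraction of known-malicious skills that the semantic-only detector flagged."""
    if not flags:
        raise ValueError("need at least one labeled malicious skill")
    return sum(1 for flagged in flags if flagged) / len(flags)


def challenges_claim(flags, threshold=0.80):
    """True when semantic analysis alone detects strictly more than the threshold."""
    return detection_rate(flags) > threshold
```

The miss rate reported in the paper is the complement, `1 - detection_rate(flags)`, so a detection rate above 0.80 corresponds to a miss rate below 0.20.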

Original abstract

Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also creates a supply-chain risk: malicious contributors can hide harmful behavior inside skills that appear benign under superficial inspection. However, existing defenses are hard to evaluate because there is no benchmark that measures both malicious-skill detection and runtime verification. We present SkillVetBench, a two-stage security vetting benchmark for open agentic skill ecosystems. The first stage performs semantic vetting over each skill's natural-language specification to detect hidden malicious intent. The second stage executes flagged skills in an instrumented sandbox to observe runtime behavior and collect auditable evidence. We build a benchmark from confirmed malicious skills in the live OpenClaw ecosystem, including samples from the recent ClawHavoc supply-chain campaign. Unlike static-only methods, SkillVetBench verifies detected threats with execution traces. Our experiments show that: (1) semantic-only and signature-based baselines are insufficient, missing up to 89% of malicious skills whose threats arise from natural-language instructions, multicomponent logic, or cross-component interactions; (2) runtime attacks are concentrated in a small set of high-permission primitives, especially exec, write_file, install_skill, and spawn; and (3) SkillVetBench provides case studies in which sandbox execution directly supports malicious verdicts with concrete runtime evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SkillVetBench, a two-stage benchmark for security vetting in open agentic skill ecosystems. Stage 1 performs semantic analysis on natural-language skill specifications to detect hidden malicious intent; Stage 2 executes flagged skills in an instrumented sandbox to observe runtime behavior and collect auditable evidence. The benchmark is constructed from confirmed malicious skills collected from the live OpenClaw ecosystem, including ClawHavoc campaign samples. Experiments claim that semantic-only and signature-based baselines miss up to 89% of malicious skills whose threats stem from natural-language instructions, multicomponent logic, or cross-component interactions; runtime attacks concentrate in high-permission primitives such as exec, write_file, install_skill, and spawn; and sandbox traces provide concrete evidence supporting malicious verdicts.

Significance. If the experimental claims hold after methodological clarification, the work is significant as an early benchmark addressing supply-chain risks in extensible agent platforms. It demonstrates the insufficiency of static methods for this threat model and supplies real-world samples plus case studies with execution traces that can serve as a foundation for evaluating future defenses. The focus on runtime verification of multicomponent behaviors fills a gap in current agent security evaluation.

major comments (2)
  1. [Abstract] The central claim that semantic-only and signature-based baselines miss up to 89% of malicious skills is presented without any description of dataset size, selection criteria for the confirmed malicious skills, baseline implementations, or statistical significance testing. This information is load-bearing for assessing whether the result supports the conclusion that semantic methods are insufficient.
  2. [Abstract] The malicious skills are labeled as 'confirmed' from the live OpenClaw ecosystem and ClawHavoc campaign, yet no confirmation criteria are provided and no analysis addresses whether labeling is independent of the sandbox/runtime stage of SkillVetBench or whether the sample is biased toward behaviors easily surfaced by execution but missed by static analysis. This directly affects the validity of the 89% miss-rate comparison.
minor comments (1)
  1. [Abstract] The abstract refers to an 'instrumented sandbox' and 'auditable evidence' but provides no high-level description of the instrumentation or evidence format, which would aid reproducibility even at the abstract level.
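
The first major comment asks for dataset size and significance alongside the miss rate. One common way to report a proportion with its uncertainty is a Wilson score interval. The sketch below is illustrative only: the function name is hypothetical, and the counts in the usage lines are placeholders chosen to give a rate near 89%, since this page does not state the dataset size.

```python
import math


def wilson_interval(misses, total, z=1.96):
    """Wilson score interval for a miss rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0 or not 0 <= misses <= total:
        raise ValueError("need 0 <= misses <= total and total > 0")
    p = misses / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Placeholder counts, not figures from the paper.
low, high = wilson_interval(misses=89, total=100)
print(f"miss rate 0.89, interval [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of samples grows, which is why the same 89% figure is much stronger evidence on a large benchmark than on a handful of skills.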

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important areas where the abstract can be strengthened to better support the central claims. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim that semantic-only and signature-based baselines miss up to 89% of malicious skills is presented without any description of dataset size, selection criteria for the confirmed malicious skills, baseline implementations, or statistical significance testing. This information is load-bearing for assessing whether the result supports the conclusion that semantic methods are insufficient.

    Authors: We agree that the abstract should be more self-contained. The full manuscript details the dataset size and selection criteria in Section 3, baseline implementations in Section 4, and statistical significance testing in Section 5. We will revise the abstract to incorporate a concise summary of these elements (dataset scale, sources, baseline types, and significance) so that the 89% claim can be evaluated directly from the abstract. revision: yes

  2. Referee: [Abstract] The malicious skills are labeled as 'confirmed' from the live OpenClaw ecosystem and ClawHavoc campaign, yet no confirmation criteria are provided and no analysis addresses whether labeling is independent of the sandbox/runtime stage of SkillVetBench or whether the sample is biased toward behaviors easily surfaced by execution but missed by static analysis. This directly affects the validity of the 89% miss-rate comparison.

    Authors: The manuscript describes the labeling process and its independence from the sandbox stage in Section 3.2, drawing on external community reports and prior analyses of the ClawHavoc campaign. We will revise the abstract to include a brief statement on confirmation criteria and labeling independence. We will also add a short discussion of potential selection bias in the revised version to strengthen the comparison. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

Full rationale

The paper is a purely empirical benchmarking study that collects confirmed malicious skills from an external ecosystem and measures detection rates of semantic baselines against them. No equations, parameters, or predictions are derived from inputs by construction. The 89% miss-rate is a direct experimental count on the collected set, not a fitted or self-defined quantity. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The work is self-contained against external benchmarks and receives the default low score for non-derivational empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Benchmark construction paper with no mathematical model, free parameters, or invented entities; relies on the domain assumption that collected malicious samples are representative.

axioms (1)
  • domain assumption The collected malicious skills from OpenClaw and ClawHavoc represent the relevant threat distribution for open agent skill ecosystems.
    Used to build the benchmark and evaluate detection rates; stated implicitly in the abstract's description of benchmark construction.

pith-pipeline@v0.9.1-grok · 5783 in / 1271 out tokens · 16542 ms · 2026-06-28T18:17:01.033500+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

    cs.CR 2026-06 unverdicted novelty 2.0

    This work categorizes seven risks of OpenClaw for non-technical users, provides plain-language mitigations, and supplies a companion Skill to automate security configurations.
