pith. machine review for the scientific record.

arxiv: 2604.06550 · v1 · submitted 2026-04-08 · 💻 cs.CR · cs.AI

SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

keywords: SkillSieve · malicious AI agent skills · hierarchical triage · LLM jury voting · vulnerability detection · agent skill marketplaces · prompt injection · security scanning

The pith

A three-layer triage framework detects malicious skills in AI agent marketplaces by filtering benign ones cheaply before applying targeted LLM analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SkillSieve as a practical way to scan large sets of AI agent skills for security issues that hide in both code and natural language instructions. It demonstrates that the majority of safe skills can be cleared quickly with basic code and metadata checks, so that language model analysis is used only on the small fraction that raises flags. The framework divides the deeper analysis into four focused sub-tasks and adds a jury of multiple models to settle uncertain cases. This staged design produces stronger detection results than earlier single-method tools while keeping overall costs low and allowing the full scan to run on modest hardware. Readers would care because growing marketplaces contain thousands of skills where undetected vulnerabilities could lead to prompt injection or other agent exploits.

Core claim

SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 uses regex, AST, and metadata checks with an XGBoost scorer to filter out most benign skills in milliseconds at zero API cost. Layer 2 sends remaining skills to an LLM split across four parallel sub-tasks covering intent alignment, permission justification, covert behavior detection, and cross-file consistency. Layer 3 routes high-risk items to a jury of three different LLMs that vote independently and debate disagreements before issuing a final verdict. On a 400-skill labeled benchmark drawn from real marketplace data, the system reaches higher detection performance than the

What carries the argument

The three-layer hierarchical triage that starts with lightweight code and metadata filters, moves to structured multi-prompt LLM subtasks for deeper inspection, and ends with LLM jury voting for confirmation on uncertain cases.
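
The early-exit control flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the cutoffs `L1_BENIGN_BELOW` and `L2_JURY_ABOVE`, the `Verdict` type, and the three callables are hypothetical names, and the rule that Layer 2 clears anything below the jury cutoff is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical cutoffs; the paper's fitted thresholds are not given in this summary.
L1_BENIGN_BELOW = 0.2
L2_JURY_ABOVE = 0.7


@dataclass
class Verdict:
    label: str  # "benign" or "malicious"
    layer: int  # deepest layer that ran for this skill


def triage(
    skill: str,
    static_score: Callable[[str], float],  # Layer 1: local, no API calls
    llm_risk: Callable[[str], float],      # Layer 2: structured LLM analysis
    jury: Callable[[str], bool],           # Layer 3: True means "malicious"
) -> Verdict:
    """Run progressively more expensive checks, stopping as early as possible."""
    # Layer 1: cheap local scoring; most benign skills should exit here.
    if static_score(skill) < L1_BENIGN_BELOW:
        return Verdict("benign", 1)
    # Layer 2: only reached by skills Layer 1 could not clear.
    # Assumption: anything below the jury cutoff is treated as benign.
    if llm_risk(skill) < L2_JURY_ABOVE:
        return Verdict("benign", 2)
    # Layer 3: the multi-model jury confirms or rejects high-risk cases.
    return Verdict("malicious" if jury(skill) else "benign", 3)
```

Passing the three layers in as callables keeps the routing logic testable without any model or API access; only the skills that survive each cutoff ever reach the next, more expensive callable.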

If this is right

  • Most benign skills are discarded in under 40 milliseconds using only local checks with no API cost.
  • Splitting analysis into four parallel subtasks allows separate checks for intent, permissions, covert actions, and file consistency.
  • Jury voting among different LLMs resolves disagreements on high-risk skills before a final decision.
  • The complete pipeline can process the full 49,592-skill corpus on a single low-power ARM board at low average cost per skill.
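
The vote-then-debate step in the third bullet might look like the following sketch. It assumes each juror is a function that receives the skill text plus the previous round's votes (empty in the first round) and returns a label; the `Juror` signature, the `max_rounds` limit, and the majority fallback are illustrative choices, since this summary does not describe the paper's debate protocol in detail.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical juror interface: (skill_text, peer_votes) -> "malicious" | "benign"
Juror = Callable[[str, Sequence[str]], str]


def jury_verdict(skill: str, jurors: Sequence[Juror], max_rounds: int = 2) -> str:
    """Independent vote first; debate only while the jurors disagree."""
    # Round 0: every juror votes without seeing anyone else's answer.
    votes = [juror(skill, ()) for juror in jurors]
    for _ in range(max_rounds):
        if len(set(votes)) == 1:
            return votes[0]  # unanimous, no (further) debate needed
        # Debate round: each juror sees the previous votes and may revise.
        votes = [juror(skill, tuple(votes)) for juror in jurors]
    # Still split after the allowed rounds: fall back to a simple majority.
    return Counter(votes).most_common(1)[0][0]
```

With three jurors and two labels a majority always exists; with an even number of jurors a tie-breaking rule would be needed, which this sketch does not define.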

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Staged filtering methods like this could be adapted to other AI security tasks where full analysis of every item would be too expensive.
  • Dividing detection into several narrow questions may reduce the risk that one broad query overlooks subtle problems.
  • Adding human review for cases where the LLM jury disagrees could strengthen trust in the automated output.

Load-bearing premise

The 400-skill labeled benchmark together with the five tested adversarial evasion samples accurately represent the malicious skills present in large real-world marketplaces, and the LLM subtasks plus jury voting can reliably separate malicious intent from complex but benign natural-language instructions.

What would settle it

A new collection of malicious skills that pass the initial filters and cause the LLM subtasks and jury to classify them as benign, or a large set of benign skills that the system consistently flags as malicious.
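
Either failure mode would show up directly in the F1 score, which depends only on true positives, false positives, and false negatives. The helper below is a small sketch of that relationship; the counts in the usage note are hypothetical and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, hypothetical counts of 40 true positives, 10 false positives, and 10 false negatives give an F1 of 0.8. Ten additional missed malicious skills (20 false negatives) lower it to 80/110, roughly 0.727, which is the kind of shift a new evasive sample set would produce.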

Figures

Figures reproduced from arXiv: 2604.06550 by Yinghan Hou, Zongyou Yang.

Figure 1: The SkillSieve three-layer triage architecture. Layer 1 filters ∼86% of benign skills via static analysis at zero cost. Layer 2 applies four parallel LLM sub-tasks to suspicious skills. Layer 3 convenes a multi-LLM jury for high-risk cases.

[…] cross-validation it achieves 0.959 F1 on the triage task. However, because the training malicious samples are dominated by three known-malicious authors with similar a…
Original abstract

OpenClaw's ClawHub marketplace hosts over 13,000 community-contributed agent skills, and between 13% and 26% of them contain security vulnerabilities according to recent audits. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural language instructions in SKILL.md files where prompt injection and social engineering attacks hide. Neither approach handles both modalities. SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer, filtering roughly 86% of benign skills in under 40ms on average at zero API cost. Layer 2 sends suspicious skills to an LLM, but instead of asking one broad question, it splits the analysis into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency), each with its own prompt and structured output. Layer 3 puts high-risk skills before a jury of three different LLMs that vote independently and, if they disagree, debate before reaching a verdict. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the full pipeline on a $440 ARM single-board computer. On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1, outperforming ClawVet's 0.421, at an average cost of $0.006 per skill. Code, data, and benchmark are open-sourced.
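
To make the Layer 1 idea concrete, here is a rough sketch of regex and AST feature extraction using only the Python standard library. The patterns in `SUSPICIOUS_PATTERNS`, the `RISKY_CALLS` set, and both function names are illustrative assumptions and are not the paper's rule set; the XGBoost scorer that would consume these counts is not shown.

```python
import ast
import re

# Illustrative patterns only; the paper's actual rules are not listed in this summary.
SUSPICIOUS_PATTERNS = {
    "pipe_to_shell": re.compile(r"curl[^\n|]*\|\s*(ba)?sh"),
    "base64_decode": re.compile(r"base64\s+(-d|--decode)|b64decode"),
    "ignore_instructions": re.compile(
        r"ignore (all )?(previous|prior) instructions", re.IGNORECASE
    ),
}
RISKY_CALLS = {"eval", "exec", "system", "popen"}


def regex_features(text: str) -> dict[str, int]:
    """Count matches of each pattern in a SKILL.md or script body."""
    return {name: len(p.findall(text)) for name, p in SUSPICIOUS_PATTERNS.items()}


def ast_features(source: str) -> dict[str, int]:
    """Count calls to risky functions in Python source; flag unparsable code."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {"risky_calls": 0, "parse_error": 1}
    risky = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain names (eval) and attribute calls (os.system) are both covered.
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in RISKY_CALLS:
                risky += 1
    return {"risky_calls": risky, "parse_error": 0}
```

Both functions return flat dictionaries of counts, so their outputs can be merged into a single feature vector for whatever classifier sits behind them.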

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents SkillSieve, a three-layer hierarchical triage framework for detecting malicious AI agent skills in marketplaces such as ClawHub. Layer 1 applies fast regex, AST, and metadata checks via an XGBoost feature scorer to filter the majority of benign skills at near-zero cost. Layer 2 decomposes analysis of remaining skills into four parallel LLM sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency). Layer 3 escalates high-risk cases to a jury of three LLMs that vote and debate if needed. The system is evaluated on the full 49,592-skill ClawHub corpus plus adversarial samples, reporting 0.800 F1 on a 400-skill labeled benchmark (vs. ClawVet at 0.421 F1) at an average cost of $0.006 per skill, with deployment tested on low-power ARM hardware. Code, data, and benchmark are open-sourced.

Significance. If the empirical results hold, SkillSieve provides a practical, cost-efficient solution to a real security gap: natural-language prompt-injection and social-engineering attacks embedded in community-contributed agent skills that neither regex scanners nor formal static analyzers can reliably catch. The hierarchical design and multi-LLM jury mechanism represent a concrete advance over single-pass LLM or baseline scanners. The open-sourcing of code, data, and the 400-skill benchmark is a clear strength that supports reproducibility and future work.

major comments (1)
  1. [Abstract] Abstract: The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.
minor comments (1)
  1. [Abstract] Abstract: The phrases 'five adversarial evasion samples' and 'five evasion techniques' are used without even a one-sentence characterization of the techniques; adding one would help readers assess the robustness claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The concern about insufficient detail on benchmark labeling is valid and directly impacts the interpretability of our primary result. We address it below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract] Abstract: The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.

    Authors: We agree that the manuscript provides insufficient detail on how the 400-skill benchmark was constructed and labeled, limiting the ability to interpret the F1 score as evidence of the framework's effectiveness rather than labeling artifacts. In the revised manuscript we will add a dedicated subsection in the Evaluation section describing: (1) the stratified sampling method from the 49,592-skill ClawHub corpus, (2) the explicit criteria for malicious vs. benign labels based on our threat model (prompt injection, unauthorized permissions, covert behavior, social engineering), (3) the annotation protocol including annotator expertise in AI security, (4) inter-annotator agreement, and (5) the generation and inclusion of the five adversarial evasion samples. The open-sourced benchmark release will include the full annotation guidelines. These changes will allow readers to assess label reliability. revision: yes
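
One common way to report the inter-annotator agreement promised in item (4) is Cohen's kappa for two annotators. The sketch below is an illustration of that statistic, not the authors' stated choice; the function name and the label strings in the usage note are assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For instance, annotator labels `["m", "m", "b", "b"]` and `["m", "b", "b", "b"]` agree on three of four items (observed 0.75) against a chance agreement of 0.5, giving a kappa of 0.5.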

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

full rationale

The paper describes a hierarchical detection framework evaluated empirically on an external 400-skill labeled benchmark drawn from the ClawHub corpus, reporting F1 scores and costs without any equations, derivations, fitted parameters renamed as predictions, or self-citations that bear the load of the central claims. The methodology (regex/AST/XGBoost filtering, four LLM subtasks, jury voting) is defined independently of the benchmark outcomes, and performance is presented as measured against that benchmark rather than constructed from it. No self-definitional loops, ansatzes via prior author work, or renaming of known results appear in the provided text.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Framework rests on standard ML assumptions plus the unproven domain assumption that structured LLM prompting can reliably surface covert malicious intent in natural-language skill files.

free parameters (2)
  • XGBoost decision thresholds and feature weights
    The Layer 1 scorer is trained on data; the thresholds that yield the roughly 86% benign filter rate are fitted.
  • Layer escalation risk thresholds
    Cutoffs determining when to invoke LLM layers are chosen or tuned.
axioms (1)
  • domain assumption: LLMs given structured prompts on intent alignment, permission justification, covert behavior, and cross-file consistency can produce reliable signals for malicious skills
    Core of Layers 2 and 3; no independent verification supplied in abstract.
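
To illustrate why a cutoff such as the Layer 1 filter threshold counts as a fitted parameter, the sketch below picks the cutoff from validation scores so that a target share of known-benign skills is cleared. The function name, the "score at or below the cutoff is cleared" convention, and the sample scores are hypothetical; the paper's tuning procedure is not described in this summary.

```python
import math
from typing import Sequence


def threshold_for_benign_rate(benign_scores: Sequence[float], target_rate: float) -> float:
    """Smallest observed score t such that at least target_rate of the
    benign validation scores satisfy score <= t."""
    if not benign_scores or not 0 < target_rate <= 1:
        raise ValueError("need scores and a target_rate in (0, 1]")
    ordered = sorted(benign_scores)
    # Index of the k-th smallest score, where k covers the target fraction.
    k = math.ceil(target_rate * len(ordered))
    return ordered[k - 1]
```

A cutoff chosen this way looks only at benign skills. A fuller tuning procedure would also check how many malicious skills score below the same cutoff, since each of those is cleared at Layer 1 and never reaches the LLM layers.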

pith-pipeline@v0.9.0 · 5569 in / 1326 out tokens · 34126 ms · 2026-05-10T18:33:56.516765+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry

    cs.AI 2026-05 unverdicted novelty 8.0

    Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.

  2. Exploiting LLM Agent Supply Chains via Payload-less Skills

    cs.CR 2026-05 conditional novelty 6.0

    Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...

  3. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  4. Behavioral Integrity Verification for AI Agent Skills

    cs.CR 2026-05 unverdicted novelty 6.0

    BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.

  5. From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

    cs.CL 2026-04 unverdicted novelty 6.0

    SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
