SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3
The pith
A three-layer triage framework detects malicious skills in AI agent marketplaces by filtering benign ones cheaply before applying targeted LLM analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 uses regex, AST, and metadata checks with an XGBoost scorer to filter out most benign skills in milliseconds at zero API cost. Layer 2 sends remaining skills to an LLM split across four parallel sub-tasks covering intent alignment, permission justification, covert behavior detection, and cross-file consistency. Layer 3 routes high-risk items to a jury of three different LLMs that vote independently and debate disagreements before issuing a final verdict. On a 400-skill labeled benchmark drawn from real marketplace data, the system reaches higher detection performance than the
What carries the argument
The three-layer hierarchical triage that starts with lightweight code and metadata filters, moves to structured multi-prompt LLM subtasks for deeper inspection, and ends with LLM jury voting for confirmation on uncertain cases.
If this is right
- Most benign skills are discarded in under 40 milliseconds using only local checks with no API cost.
- Splitting analysis into four parallel subtasks allows separate checks for intent, permissions, covert actions, and file consistency.
- Jury voting among different LLMs resolves disagreements on high-risk skills before a final decision.
- The complete pipeline can process the full 49,000-skill corpus on a single low-power ARM board at low average cost per skill.
Where Pith is reading between the lines
- Staged filtering methods like this could be adapted to other AI security tasks where full analysis of every item would be too expensive.
- Dividing detection into several narrow questions may reduce the risk that one broad query overlooks subtle problems.
- Adding human review for cases where the LLM jury disagrees could strengthen trust in the automated output.
Load-bearing premise
The 400-skill labeled benchmark together with the five tested adversarial evasion samples accurately represent the malicious skills present in large real-world marketplaces, and the LLM subtasks plus jury voting can reliably separate malicious intent from complex but benign natural-language instructions.
What would settle it
A new collection of malicious skills that pass the initial filters and cause the LLM subtasks and jury to classify them as benign, or a large set of benign skills that the system consistently flags as malicious.
Figures
read the original abstract
OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recent audits report that 13-26% contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language SKILL.md instructions that hide prompt injection and social engineering. Neither approach covers both modalities. SkillSieve is a three-layer detection framework that applies deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through a recall-tuned heuristic scorer, filtering 86% of the volume. Layer 2 routes suspicious skills to an LLM, splitting the analysis into four parallel sub-tasks with structured outputs. Layer 3 puts high-risk skills before a jury of three LLMs that vote independently and debate when they disagree. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the pipeline on a 440 USD ARM single-board computer. On a 390-skill labeled benchmark, SkillSieve achieves F1 = 0.920 (precision 0.912, recall 0.929) at 0.006 USD per skill. An optional XGBoost fast-path cuts 32% of Layer-2/3 LLM calls with a 1.6-point F1 reduction, while preserving full-pipeline recall (0.929). For cross-ecosystem generalization, we adapt the framework to Feishu/Lark and scan 52 real packages, where Layer 2 corrects Layer 1 false positives from domain-specific idioms, suggesting a low-cost adaptation path to similar enterprise platforms. We deploy SkillSieve as a Feishu chat bot for real-time skill vetting. Code, data, and benchmark are open-sourced.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SkillSieve, a three-layer hierarchical triage framework for detecting malicious AI agent skills in marketplaces such as ClawHub. Layer 1 applies fast regex, AST, and metadata checks via an XGBoost feature scorer to filter the majority of benign skills at near-zero cost. Layer 2 decomposes analysis of remaining skills into four parallel LLM sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency). Layer 3 escalates high-risk cases to a jury of three LLMs that vote and debate if needed. The system is evaluated on the full 49,592-skill ClawHub corpus plus adversarial samples, reporting 0.800 F1 on a 400-skill labeled benchmark (vs. ClawVet at 0.421 F1) at an average cost of $0.006 per skill, with deployment tested on low-power ARM hardware. Code, data, and benchmark are open-sourced.
Significance. If the empirical results hold, SkillSieve provides a practical, cost-efficient solution to a real security gap: natural-language prompt-injection and social-engineering attacks embedded in community-contributed agent skills that neither regex scanners nor formal static analyzers can reliably catch. The hierarchical design and multi-LLM jury mechanism represent a concrete advance over single-pass LLM or baseline scanners. The open-sourcing of code, data, and the 400-skill benchmark is a clear strength that supports reproducibility and future work.
major comments (1)
- [Abstract] Abstract: The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.
minor comments (1)
- [Abstract] Abstract: The phrase 'five adversarial evasion samples' and 'five evasion techniques' is mentioned without even a one-sentence characterization of the techniques; adding this would help readers assess the robustness claim.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The concern about insufficient detail on benchmark labeling is valid and directly impacts the interpretability of our primary result. We address it below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.
Authors: We agree that the manuscript provides insufficient detail on how the 400-skill benchmark was constructed and labeled, limiting the ability to interpret the F1 score as evidence of the framework's effectiveness rather than labeling artifacts. In the revised manuscript we will add a dedicated subsection in the Evaluation section describing: (1) the stratified sampling method from the 49,592-skill ClawHub corpus, (2) the explicit criteria for malicious vs. benign labels based on our threat model (prompt injection, unauthorized permissions, covert behavior, social engineering), (3) the annotation protocol including annotator expertise in AI security, (4) inter-annotator agreement, and (5) the generation and inclusion of the five adversarial evasion samples. The open-sourced benchmark release will include the full annotation guidelines. These changes will allow readers to assess label reliability. revision: yes
Circularity Check
No circularity in derivation or evaluation chain
full rationale
The paper describes a hierarchical detection framework evaluated empirically on an external 400-skill labeled benchmark drawn from the ClawHub corpus, reporting F1 scores and costs without any equations, derivations, fitted parameters renamed as predictions, or self-citations that bear the load of the central claims. The methodology (regex/AST/XGBoost filtering, four LLM subtasks, jury voting) is defined independently of the benchmark outcomes, and performance is presented as measured against that benchmark rather than constructed from it. No self-definitional loops, ansatzes via prior author work, or renaming of known results appear in the provided text.
Axiom & Free-Parameter Ledger
free parameters (2)
- XGBoost decision thresholds and feature weights
- Layer escalation risk thresholds
axioms (1)
- domain assumption LLMs given structured prompts on intent alignment, permission justification, covert behavior, and cross-file consistency can produce reliable signals for malicious skills
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
three-layer detection framework that applies progressively deeper analysis only where needed... Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer... Layer 2 splits the analysis into four parallel sub-tasks... Layer 3 puts high-risk skills before a jury of three different LLMs
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1... at an average cost of 0.006 per skill
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
-
Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry
Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.
-
Exploiting LLM Agent Supply Chains via Payload-less Skills
Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
Behavioral Integrity Verification for AI Agent Skills
BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.
-
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.