SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

Xiujun Ma; Yinghan Hou; Zaihu Pang; Zongyou Yang

arxiv: 2604.06550 · v2 · pith:E3WHGQYLnew · submitted 2026-04-08 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

keywords SkillSieve · malicious AI agent skills · hierarchical triage · LLM jury voting · vulnerability detection · agent skill marketplaces · prompt injection · security scanning

The pith

A three-layer triage framework detects malicious skills in AI agent marketplaces by filtering benign ones cheaply before applying targeted LLM analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SkillSieve as a practical way to scan large sets of AI agent skills for security issues that hide in both code and natural language instructions. It demonstrates that the majority of safe skills can be cleared quickly with basic code and metadata checks, so that language model analysis is used only on the small fraction that raises flags. The framework divides the deeper analysis into four focused sub-tasks and adds a jury of multiple models to settle uncertain cases. This staged design produces stronger detection results than earlier single-method tools while keeping overall costs low and allowing the full scan to run on modest hardware. Readers would care because growing marketplaces contain thousands of skills where undetected vulnerabilities could lead to prompt injection or other agent exploits.

Core claim

SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 uses regex, AST, and metadata checks with an XGBoost scorer to filter out most benign skills in milliseconds at zero API cost. Layer 2 sends remaining skills to an LLM split across four parallel sub-tasks covering intent alignment, permission justification, covert behavior detection, and cross-file consistency. Layer 3 routes high-risk items to a jury of three different LLMs that vote independently and debate disagreements before issuing a final verdict. On a 400-skill labeled benchmark drawn from real marketplace data, the system reaches higher detection performance than the

What carries the argument

The three-layer hierarchical triage that starts with lightweight code and metadata filters, moves to structured multi-prompt LLM subtasks for deeper inspection, and ends with LLM jury voting for confirmation on uncertain cases.
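
The escalation logic described above can be sketched as a short routing function. This is a minimal illustration, not the authors' implementation: the names `triage` and `Verdict`, the three callables, and both cutoff values are hypothetical, and the sketch assumes that a skill scoring below the Layer 2 cutoff is cleared without a jury, which the source does not spell out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    label: str  # "benign" or "malicious"
    layer: int  # layer that issued the final decision (1, 2, or 3)


def triage(
    skill: str,
    static_score: Callable[[str], float],         # Layer 1: local regex/AST/metadata score
    llm_risk: Callable[[str], float],             # Layer 2: aggregated LLM sub-task risk
    jury_says_malicious: Callable[[str], bool],   # Layer 3: multi-LLM jury decision
    static_cutoff: float = 0.2,                   # hypothetical value, not from the paper
    jury_cutoff: float = 0.7,                     # hypothetical value, not from the paper
) -> Verdict:
    # Layer 1: cheap local checks clear most benign skills with no API call.
    if static_score(skill) < static_cutoff:
        return Verdict("benign", 1)
    # Layer 2: only suspicious skills reach the LLM sub-tasks.
    # Assumption: anything below the jury cutoff is cleared here.
    if llm_risk(skill) < jury_cutoff:
        return Verdict("benign", 2)
    # Layer 3: high-risk skills are decided by the jury.
    malicious = jury_says_malicious(skill)
    return Verdict("malicious" if malicious else "benign", 3)
```

Because each layer is passed in as a callable, the later (more expensive) functions are never invoked for a skill that an earlier layer has already cleared, which is the property the cost argument depends on.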

If this is right

Most benign skills are discarded in under 40 milliseconds using only local checks with no API cost.
Splitting analysis into four parallel subtasks allows separate checks for intent, permissions, covert actions, and file consistency.
Jury voting among different LLMs resolves disagreements on high-risk skills before a final decision.
The complete pipeline can process the full 49,592-skill corpus on a single low-power ARM board at low average cost per skill.
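
The jury step in the list above can be illustrated with a small voting helper. This is a sketch under stated assumptions: the source says only that three LLMs vote independently and debate when they disagree, so the function name `jury_verdict`, the optional `debate` callback, and the use of a simple majority after the debate round are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional


def jury_verdict(
    votes: List[str],
    debate: Optional[Callable[[List[str]], List[str]]] = None,
) -> str:
    """Combine independent juror votes ("malicious" or "benign").

    Assumes an odd number of jurors so that a two-label majority always exists.
    """
    if len(set(votes)) == 1:
        # Unanimous first-round vote: no debate needed.
        return votes[0]
    if debate is not None:
        # Hypothetical debate round: jurors see all votes and may revise theirs.
        votes = debate(votes)
    # Assumption: the final verdict is the majority label.
    label, _count = Counter(votes).most_common(1)[0]
    return label
```

With three jurors, a call such as `jury_verdict(["malicious", "benign", "malicious"])` skips the debate callback when none is supplied and falls back to the majority label.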

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Staged filtering methods like this could be adapted to other AI security tasks where full analysis of every item would be too expensive.
Dividing detection into several narrow questions may reduce the risk that one broad query overlooks subtle problems.
Adding human review for cases where the LLM jury disagrees could strengthen trust in the automated output.

Load-bearing premise

The 400-skill labeled benchmark together with the five tested adversarial evasion samples accurately represent the malicious skills present in large real-world marketplaces, and the LLM subtasks plus jury voting can reliably separate malicious intent from complex but benign natural-language instructions.

What would settle it

A new collection of malicious skills that pass the initial filters and cause the LLM subtasks and jury to classify them as benign, or a large set of benign skills that the system consistently flags as malicious.

Figures

Figures reproduced from arXiv: 2604.06550 by Xiujun Ma, Yinghan Hou, Zaihu Pang, Zongyou Yang.

**Figure 1.** The SkillSieve three-layer triage architecture. Layer 1 filters ∼86% of benign skills via static analysis at zero cost. Layer 2 applies four parallel LLM sub-tasks to suspicious skills. Layer 3 convenes a multi-LLM jury for high-risk cases.

… cross-validation it achieves 0.959 F1 on the triage task. However, because the training malicious samples are dominated by three known-malicious authors with similar a…

Original abstract

OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recent audits report that 13-26% contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language SKILL.md instructions that hide prompt injection and social engineering. Neither approach covers both modalities. SkillSieve is a three-layer detection framework that applies deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through a recall-tuned heuristic scorer, filtering 86% of the volume. Layer 2 routes suspicious skills to an LLM, splitting the analysis into four parallel sub-tasks with structured outputs. Layer 3 puts high-risk skills before a jury of three LLMs that vote independently and debate when they disagree. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the pipeline on a 440 USD ARM single-board computer. On a 390-skill labeled benchmark, SkillSieve achieves F1 = 0.920 (precision 0.912, recall 0.929) at 0.006 USD per skill. An optional XGBoost fast-path cuts 32% of Layer-2/3 LLM calls with a 1.6-point F1 reduction, while preserving full-pipeline recall (0.929). For cross-ecosystem generalization, we adapt the framework to Feishu/Lark and scan 52 real packages, where Layer 2 corrects Layer 1 false positives from domain-specific idioms, suggesting a low-cost adaptation path to similar enterprise platforms. We deploy SkillSieve as a Feishu chat bot for real-time skill vetting. Code, data, and benchmark are open-sourced.
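
As a quick consistency check on the abstract's headline numbers, the F1 score can be recomputed from the stated precision and recall using the standard harmonic-mean definition. The helper name `f1_from_pr` is introduced here for illustration only.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_from_pr(0.912, 0.929)
print(f"{f1:.3f}")  # prints 0.920
```

The recomputed value rounds to the reported F1 = 0.920, so the three figures in the abstract are mutually consistent.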

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SkillSieve gives a practical three-layer filter for malicious agent skills that beats ClawVet on the reported numbers and ships the code and data, but the 400-skill benchmark's labeling process is the part that still needs verification.

The main takeaway is that this paper ships a usable system for triaging skills in marketplaces like ClawHub. It combines a fast XGBoost static filter that drops most benign cases, four parallel LLM checks for intent and behavior, and a three-model jury for the hard cases. That architecture is new relative to the ClawVet baseline they cite, and the reported 0.800 F1 at low cost on a 440 USD ARM board is a concrete result worth looking at if you care about agent security in practice. They also open-source the code, data, and benchmark, which makes the claims checkable rather than just asserted.

Referee Report

1 major / 1 minor

Summary. The manuscript presents SkillSieve, a three-layer hierarchical triage framework for detecting malicious AI agent skills in marketplaces such as ClawHub. Layer 1 applies fast regex, AST, and metadata checks via an XGBoost feature scorer to filter the majority of benign skills at near-zero cost. Layer 2 decomposes analysis of remaining skills into four parallel LLM sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency). Layer 3 escalates high-risk cases to a jury of three LLMs that vote and debate if needed. The system is evaluated on the full 49,592-skill ClawHub corpus plus adversarial samples, reporting 0.800 F1 on a 400-skill labeled benchmark (vs. ClawVet at 0.421 F1) at an average cost of $0.006 per skill, with deployment tested on low-power ARM hardware. Code, data, and benchmark are open-sourced.

Significance. If the empirical results hold, SkillSieve provides a practical, cost-efficient solution to a real security gap: natural-language prompt-injection and social-engineering attacks embedded in community-contributed agent skills that neither regex scanners nor formal static analyzers can reliably catch. The hierarchical design and multi-LLM jury mechanism represent a concrete advance over single-pass LLM or baseline scanners. The open-sourcing of code, data, and the 400-skill benchmark is a clear strength that supports reproducibility and future work.

major comments (1)

[Abstract] The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.

minor comments (1)

[Abstract] The phrases 'five adversarial evasion samples' and 'five evasion techniques' are used without even a one-sentence characterization of the techniques; adding this would help readers assess the robustness claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The concern about insufficient detail on benchmark labeling is valid and directly impacts the interpretability of our primary result. We address it below and will revise the manuscript accordingly.

Referee: [Abstract] The headline result of 0.800 F1 on the 400-skill labeled benchmark (outperforming ClawVet's 0.421) is the primary evidence offered for the framework's effectiveness. The manuscript states only that the benchmark is 'labeled' and that five adversarial evasion samples were used; it supplies no protocol for label assignment, criteria defining 'malicious' versus benign natural-language instructions, inter-annotator agreement, annotator expertise, or sampling method from the 49,592-skill corpus. Without these details the reported F1 score cannot be interpreted as evidence that the four-subtask LLM analysis plus jury voting distinguishes malicious intent rather than artifacts of the labeling process.

Authors: We agree that the manuscript provides insufficient detail on how the 400-skill benchmark was constructed and labeled, limiting the ability to interpret the F1 score as evidence of the framework's effectiveness rather than labeling artifacts. In the revised manuscript we will add a dedicated subsection in the Evaluation section describing: (1) the stratified sampling method from the 49,592-skill ClawHub corpus, (2) the explicit criteria for malicious vs. benign labels based on our threat model (prompt injection, unauthorized permissions, covert behavior, social engineering), (3) the annotation protocol including annotator expertise in AI security, (4) inter-annotator agreement, and (5) the generation and inclusion of the five adversarial evasion samples. The open-sourced benchmark release will include the full annotation guidelines. These changes will allow readers to assess label reliability.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

The paper describes a hierarchical detection framework evaluated empirically on an external 400-skill labeled benchmark drawn from the ClawHub corpus, reporting F1 scores and costs without any equations, derivations, fitted parameters renamed as predictions, or self-citations that bear the load of the central claims. The methodology (regex/AST/XGBoost filtering, four LLM subtasks, jury voting) is defined independently of the benchmark outcomes, and performance is presented as measured against that benchmark rather than constructed from it. No self-definitional loops, ansatzes via prior author work, or renaming of known results appear in the provided text.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Framework rests on standard ML assumptions plus the unproven domain assumption that structured LLM prompting can reliably surface covert malicious intent in natural-language skill files.

free parameters (2)

XGBoost decision thresholds and feature weights
Layer 1 scorer is trained on data; thresholds for 86% benign filter are fitted.
Layer escalation risk thresholds
Cutoffs determining when to invoke LLM layers are chosen or tuned.
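
One common way to fit such a cutoff is to choose it on labeled data so that a target recall on malicious samples is met. The sketch below illustrates that general technique only; the function name `pick_cutoff`, the convention that a skill is escalated when its score is at or above the cutoff, and the example target recall are assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def pick_cutoff(
    scores: Sequence[float],
    labels: Sequence[int],        # 1 = malicious, 0 = benign
    target_recall: float = 0.95,  # hypothetical target
) -> float:
    """Return the highest cutoff whose recall on malicious samples meets the target.

    Assumed convention: a skill is escalated when score >= cutoff.
    """
    positives = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("need at least one malicious sample")
    # Number of malicious samples that must score at or above the cutoff.
    need = max(1, math.ceil(target_recall * len(positives)))
    return positives[need - 1]


def filtered_fraction(scores: Sequence[float], cutoff: float) -> float:
    """Share of all skills that fall below the cutoff and are cleared locally."""
    return sum(s < cutoff for s in scores) / len(scores)
```

Lowering the target recall raises the cutoff and clears more skills locally, so the filter rate and the recall of the first layer trade off against each other through this single parameter.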

axioms (1)

domain assumption: LLMs given structured prompts on intent alignment, permission justification, covert behavior, and cross-file consistency can produce reliable signals for malicious skills
Core of Layers 2 and 3; no independent verification supplied in abstract.

pith-pipeline@v0.9.0 · 5569 in / 1326 out tokens · 34126 ms · 2026-05-10T18:33:56.516765+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "three-layer detection framework that applies progressively deeper analysis only where needed... Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer... Layer 2 splits the analysis into four parallel sub-tasks... Layer 3 puts high-risk skills before a jury of three different LLMs"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1... at an average cost of 0.006 per skill"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry
cs.AI · 2026-05 · unverdicted · novelty 8.0

Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.

Exploiting LLM Agent Supply Chains via Payload-less Skills
cs.CR · 2026-05 · conditional · novelty 6.0

Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
cs.CR · 2026-05 · unverdicted · novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

Behavioral Integrity Verification for AI Agent Skills
cs.CR · 2026-05 · unverdicted · novelty 6.0

BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
cs.CL · 2026-04 · unverdicted · novelty 6.0

SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.