No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills

1 Pith paper cites this work.
abstract

LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail's semantics are undefined for autonomous execution, or because the implementation silently ignores the documented constraint. These violations are invisible to static analyzers, traditional fuzzers, and prompt-injection defenses alike, yet they undermine the very contract a user trusts when installing a skill. We present Sefz, a goal-directed semantic fuzzing framework that automatically discovers specification violations in agent skills. Sefz translates each guardrail into a reachability goal over an annotated execution trace, reducing violation checking to a deterministic graph query. An LLM-based mutator generates benign inputs whose traces progressively approach the violation patterns, guided by a multi-armed bandit that uses goal-proximity as its reward signal. On 402 real-world skills from the largest public agent-skill marketplace, Sefz finds specification violations in 120 (29.9%), including 26 previously unknown exploitable guardrail violations in deployed skills. Six recurring specification pitfalls explain the bulk of the failures, suggesting concrete principles for safer skill design.
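
The two mechanisms the abstract names, a guardrail checked as a reachability query over an annotated trace and a multi-armed bandit that picks mutation strategies from a goal-proximity reward, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the trace encoding, the labels (`reads_secret`, `network_send`), the strategy names, and the choice of UCB1 as the bandit are hypothetical and are not taken from Sefz itself.

```python
import math
from collections import deque


def reachable(edges, labels, source_label, sink_label):
    """Return True if a node carrying source_label can reach a node carrying sink_label.

    edges:  dict mapping a trace node to the nodes it flows into (hypothetical encoding)
    labels: dict mapping a trace node to its set of annotations (hypothetical encoding)
    """
    starts = [node for node, tags in labels.items() if source_label in tags]
    seen = set(starts)
    queue = deque(starts)
    while queue:  # plain breadth-first search, so the check is deterministic
        node = queue.popleft()
        if sink_label in labels.get(node, ()):
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


class UCB1:
    """UCB1 bandit over mutation strategies (an assumed stand-in for the paper's bandit)."""

    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.totals = {arm: 0.0 for arm in arms}

    def select(self):
        # Try every arm once before applying the confidence-bound rule.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        t = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda a: self.totals[a] / self.counts[a]
            + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        # reward is assumed to be a goal-proximity score in [0, 1]
        self.counts[arm] += 1
        self.totals[arm] += reward


# Hypothetical guardrail: "never send credentials over the network".
labels = {"read_env": {"reads_secret"}, "format": set(), "http_post": {"network_send"}}
edges = {"read_env": ["format"], "format": ["http_post"]}
violated = reachable(edges, labels, "reads_secret", "network_send")

bandit = UCB1(["rephrase", "add_context"])
first = bandit.select()
bandit.update(first, 0.1)
second = bandit.select()
bandit.update(second, 0.9)
third = bandit.select()
```

In this sketch a guardrail counts as violated when the labeled source reaches the labeled sink in the trace graph, and the bandit shifts toward whichever mutation strategy has produced traces scored closer to that pattern. How Sefz actually annotates traces or scores proximity is not stated in the abstract.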

citing papers explorer

  • MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills cs.CR · 2026-06-05 · unverdicted · none · ref 31 · internal anchor

    MalSkillBench supplies the first sandbox-verified dataset of malicious agent skills and shows that existing detectors achieve high recall on code injection but collapse on prompt injection and agent-control attacks.