No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills
Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3
The pith
Semantic fuzzing detects specification violations in roughly 30 percent of real-world agent skills on ordinary, benign inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sefz translates each guardrail into a reachability goal over an annotated execution trace, reducing violation checking to a deterministic graph query. An LLM mutator then produces benign inputs whose traces are steered toward the violation patterns by a multi-armed bandit that treats goal proximity as its reward signal. Evaluation across 402 real-world skills shows specification violations in 120 cases (29.9 percent), of which 26 are previously unknown exploitable guardrail violations in deployed skills.
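To make "reachability goal over an annotated execution trace" concrete: a minimal sketch, assuming a trace is an ordered list of annotated events and a violation goal is an ordered list of event predicates. All names here (TraceEvent, goal_reached, the workspace guardrail) are illustrative assumptions, not Sefz's actual schema.

```python
# Hypothetical sketch: deterministic check of a violation-reachability
# goal against an annotated execution trace. Schema is assumed, not
# taken from the paper.
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    action: str                       # e.g. "tool_call", "file_write"
    attrs: dict = field(default_factory=dict)

def goal_reached(trace, goal):
    """The goal is reached iff some subsequence of the trace satisfies
    the goal's predicates in order -- a plain scan, no LLM involved."""
    i = 0
    for event in trace:
        if i < len(goal) and goal[i](event):
            i += 1
    return i == len(goal)

# Guardrail "never write outside the workspace", inverted into a
# violation goal: reach any write whose path escapes the workspace.
violation_goal = [
    lambda e: e.action == "file_write"
    and not e.attrs.get("path", "").startswith("/workspace/"),
]

trace = [
    TraceEvent("tool_call", {"name": "search"}),
    TraceEvent("file_write", {"path": "/etc/crontab"}),
]
assert goal_reached(trace, violation_goal)  # violation is reachable
```

Because the check is a plain scan over the trace, the same trace always yields the same verdict; all stochasticity lives in the input-generation loop.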
What carries the argument
Sefz, the goal-directed semantic fuzzing framework that converts natural-language guardrails into deterministic reachability goals on execution traces and drives input generation with an LLM mutator plus multi-armed bandit reward based on goal proximity.
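The page names the mutator and bandit but not the bandit variant or reward mechanics. A sketch under assumed choices: UCB1 over mutation operators, with reward equal to the fraction of the goal's predicates matched so far. `mutate_with_llm` and `run_skill` are placeholder stubs for components the summary does not describe.

```python
# Hypothetical sketch of the bandit-guided fuzzing loop. UCB1 and this
# proximity reward are assumptions; the paper only says "multi-armed
# bandit" with "goal-proximity as its reward signal".
import math

def mutate_with_llm(seed, operator):
    # Placeholder for the LLM mutator: would prompt a model to apply
    # `operator` (e.g. "add ambiguity") to the benign input.
    return f"{seed} [{operator}]"

def run_skill(user_input):
    # Placeholder: execute the skill and return its annotated trace.
    return []

def proximity(trace, goal):
    # Fraction of the goal's ordered predicates matched by the trace.
    i = 0
    for event in trace:
        if i < len(goal) and goal[i](event):
            i += 1
    return i / len(goal)

def fuzz(seed, goal, operators, budget=200):
    counts = [0] * len(operators)
    total = [0.0] * len(operators)
    best, best_prox = seed, 0.0
    for t in range(1, budget + 1):
        # UCB1: try every operator once, then balance mean reward
        # against an exploration bonus that shrinks with use.
        arm = max(range(len(operators)), key=lambda a:
                  float("inf") if counts[a] == 0 else
                  total[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))
        candidate = mutate_with_llm(best, operators[arm])
        prox = proximity(run_skill(candidate), goal)
        counts[arm] += 1
        total[arm] += prox
        if prox > best_prox:
            best, best_prox = candidate, prox
        if best_prox == 1.0:
            return best  # trace matches the full violation pattern
    return None
```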
If this is right
- Specification violations occur on benign inputs, so attack-focused defenses leave a large class of failures unaddressed.
- 26 exploitable guardrail violations were found in already-deployed skills, showing immediate practical impact.
- Six recurring specification pitfalls account for the bulk of failures and directly suggest concrete safer design rules.
- Traditional static analyzers and prompt-injection tools miss these internal specification breaches.
- Skills can silently ignore their own documented constraints, breaking the contract users rely on when installing them.
Where Pith is reading between the lines
- Marketplaces could integrate semantic fuzzing as an automated pre-deployment check to filter out skills with weak guardrails.
- The same reachability-goal approach could be applied to other LLM-driven systems that publish natural-language safety constraints.
- Refining the guardrail-to-goal translation step would reduce the risk that some real violations are still missed.
- Users may start demanding verifiable evidence that a skill upholds its stated guardrails rather than only resisting external attacks.
Load-bearing premise
Translating natural-language guardrails into deterministic reachability goals over execution traces accurately captures the intended semantics without creating false violations or missing real ones.
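A toy illustration, under an assumed event schema, of how this premise can fail quietly: the guardrail "only send email after explicit user confirmation" translated as a per-event flag check, which a trace can satisfy even when the confirmation belonged to a different message.

```python
# Invented example: a literal guardrail-to-goal translation that
# produces a false negative. Event fields are assumptions.
naive_violation = lambda event: (
    event["action"] == "send_email" and not event.get("confirmed", False)
)

# Confirmation was given for message A, but the skill sends message B.
trace = [
    {"action": "user_confirm", "message_id": "A"},
    {"action": "send_email", "message_id": "B", "confirmed": True},
]
assert not any(naive_violation(e) for e in trace)  # breach goes uncounted
```

A translation that instead required a confirmation event bound to the same message_id would catch this case, which is exactly the kind of fidelity question the premise leaves open.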
What would settle it
Manual review of a random sample of the 120 reported violations to confirm each is a genuine breach of the skill's documented guardrail on the supplied benign input.
Original abstract
LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail's semantics are undefined for autonomous execution, or because the implementation silently ignores the documented constraint. These violations are invisible to static analyzers, traditional fuzzers, and prompt-injection defenses alike, yet they undermine the very contract a user trusts when installing a skill. We present Sefz, a goal-directed semantic fuzzing framework that automatically discovers specification violations in agent skills. Sefz translates each guardrail into a reachability goal over an annotated execution trace, reducing violation checking to a deterministic graph query. An LLM-based mutator generates benign inputs whose traces progressively approach the violation patterns, guided by a multi-armed bandit that uses goal-proximity as its reward signal. On 402 real-world skills from the largest public agent-skill marketplace, Sefz finds specification violations in 120 (29.9%), including 26 previously unknown exploitable guardrail violations in deployed skills. Six recurring specification pitfalls explain the bulk of the failures, suggesting concrete principles for safer skill design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Sefz, a goal-directed semantic fuzzing framework that translates natural-language guardrails from agent skills into deterministic reachability goals over annotated execution traces. An LLM-based mutator, guided by a multi-armed bandit using goal-proximity rewards, generates benign inputs to discover violations. On 402 real-world skills from a public marketplace, Sefz reports specification violations in 120 skills (29.9%), including 26 previously unknown exploitable guardrail violations, and identifies six recurring specification pitfalls for safer design.
Significance. If the guardrail-to-reachability translations are faithful, the work identifies a practically important class of safety failures in deployed agent skills that arise from underspecified natural-language contracts rather than attacks. The scale of the evaluation on 402 marketplace skills supplies a useful empirical baseline, and the reduction of checking to deterministic graph queries is a clean technical step that supports reproducibility. The recurring pitfalls provide concrete, actionable guidance for skill developers.
major comments (2)
- [Evaluation] Evaluation section: the headline result of 120 violations (29.9%) and 26 exploitable cases rests on LLM-generated reachability goals, yet no validation is reported (manual review of a sample, inter-rater agreement, or held-out guardrail comparison). This directly affects whether the measured rate reflects genuine specification breaches or translation artifacts.
- [§4] §4 (framework description): the claim that violation checking reduces to a deterministic graph query is load-bearing, but the paper provides no error analysis or accuracy metrics for the trace annotation step that produces the graphs; annotation errors would propagate into both the bandit guidance and the final violation counts.
minor comments (2)
- [Abstract] Abstract: the six recurring pitfalls are mentioned but not enumerated; a brief list would strengthen the takeaway without lengthening the abstract.
- [Evaluation] The paper should clarify whether the 26 exploitable violations were confirmed by manual reproduction or only by the automated reachability query.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical importance of identifying specification violations in agent skills. We agree that additional validation is needed for the LLM-generated reachability goals and for the trace annotation process. We will revise the manuscript accordingly by adding empirical validation studies and error analyses.
Point-by-point responses
Referee: [Evaluation] Evaluation section: the headline result of 120 violations (29.9%) and 26 exploitable cases rests on LLM-generated reachability goals, yet no validation is reported (manual review of a sample, inter-rater agreement, or held-out guardrail comparison). This directly affects whether the measured rate reflects genuine specification breaches or translation artifacts.
Authors: We agree that validation of the translations is essential to substantiate the reported rates. In the revised manuscript, we will add a dedicated validation subsection in the evaluation. This will include: (1) manual review by two independent reviewers of a random sample of 50 guardrail-to-reachability translations, with reported inter-rater agreement (Cohen's kappa); (2) a comparison against a held-out set of 20 guardrails where we manually craft reference reachability goals. These steps will quantify translation fidelity and support that the 29.9% figure reflects genuine violations rather than artifacts. revision: yes
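For context on item (1): Cohen's kappa discounts the raw agreement rate between the two reviewers by the agreement expected from chance alone. A minimal sketch with invented labels, not data from the paper:

```python
# Cohen's kappa for two binary raters (1 = translation judged faithful).
def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    p1a, p1b = sum(a) / n, sum(b) / n             # each rater's "yes" rate
    p_e = p1a * p1b + (1 - p1a) * (1 - p1b)       # chance agreement
    return (p_o - p_e) / (1 - p_e)

reviewer_1 = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
reviewer_2 = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))  # 0.524
```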
Referee: [§4] §4 (framework description): the claim that violation checking reduces to a deterministic graph query is load-bearing, but the paper provides no error analysis or accuracy metrics for the trace annotation step that produces the graphs; annotation errors would propagate into both the bandit guidance and the final violation counts.
Authors: We acknowledge that the absence of error analysis for trace annotation is a gap, as annotation inaccuracies could affect downstream results. In the revision, we will expand §4 with an error analysis subsection. We will manually annotate a random sample of 100 execution traces and report precision, recall, and F1-score for key annotation elements (e.g., state transitions, variable bindings). We will also analyze and discuss potential error propagation into the multi-armed bandit rewards and violation detection, including sensitivity experiments where we inject controlled annotation noise. revision: yes
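Again for context, a minimal sketch of the per-element scoring the response promises, assuming annotations can be flattened into comparable tuples; the sample data is invented:

```python
# Precision/recall/F1 of automatic trace annotations against a
# manually labeled gold sample (set-based, per element type).
def prf(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

auto = {("file_write", "/tmp/a"), ("file_write", "/etc/crontab")}
gold = {("file_write", "/etc/crontab"), ("file_write", "/workspace/b")}
print(prf(auto, gold))  # (0.5, 0.5, 0.5)
```

Injecting controlled noise would then mean flipping or perturbing a known fraction of these tuples before re-running the pipeline and observing how the violation counts move.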
Circularity Check
No circularity: framework is a new construction with independent empirical results
full rationale
The paper presents Sefz as a novel goal-directed fuzzing framework that translates guardrails to reachability queries and applies LLM mutation plus bandit search. No equations, fitted parameters, or self-citations appear in the provided text. The 120/402 violation count is produced by executing the framework on external marketplace skills rather than by algebraic reduction or renaming of prior fitted quantities. The translation step is an explicit modeling choice whose accuracy is an open correctness question, not a definitional tautology. Therefore the derivation chain is self-contained and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Natural-language guardrails in skill specifications can be translated into deterministic reachability goals over execution traces without loss of intended meaning.