pith. machine review for the scientific record.

arxiv: 2604.25109 · v1 · submitted 2026-04-28 · 💻 cs.CR · cs.AI

Recognition: unknown

Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills

Pith reviewed 2026-05-07 16:17 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords Agent Skills · security auditing · robustness · malicious detection · untrusted packages · semantic verification · consistency adjudication · three-way classification

The pith

Factorized auditing of Agent Skills packages reaches 97.3 percent exact match and 98.3 percent malicious-risk recall on a 404-package held-out set by splitting review into role-aware evidence extraction, selective semantic verification, and consistency-preserving adjudication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agent Skills bundle SKILL.md files, scripts, reference documents, and repository context into reusable units that can hide malicious intent, turning pre-load security checks into cross-file reviews rather than single-prompt filters. The paper formulates this as a robust three-way classification problem and introduces SkillGuard-Robust to extract role-aware evidence, apply selective semantic verification, and enforce consistency-preserving adjudication. This targets a specific weakness of existing guardrails, which often flag risk but recover malicious intent inconsistently when attackers use semantics-preserving rewrites. On a 404-package held-out set the system records 97.30 percent exact match and 98.33 percent malicious-risk recall; on a 254-package external-ecosystem set the figures rise to 99.66 percent and 100.00 percent. These outcomes indicate that breaking auditing into factorized steps materially strengthens detection in frozen and public-ecosystem settings.

Core claim

SkillGuard-Robust treats pre-load auditing of untrusted Agent Skills as a three-way classification task and combines role-aware evidence extraction, selective semantic verification, and consistency-preserving adjudication. On a 404-package held-out aggregate it reaches 97.30 percent overall exact match, 98.33 percent malicious-risk recall, and 98.89 percent attack exact consistency; on a 254-package external-ecosystem view it reaches 99.66 percent, 100.00 percent, and 100.00 percent, respectively. These results support the claim that factorized package auditing materially improves robustness for frozen and public-ecosystem cases, while leaving harsher external-source transfer as an open challenge.

What carries the argument

SkillGuard-Robust, the three-component system that factors package auditing into role-aware evidence extraction from SKILL.md and related files, selective semantic verification, and consistency-preserving adjudication to resist semantics-preserving rewrites.
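
The three stages can be pictured with a minimal Python sketch. This is not the paper's implementation, which the review does not show. The label names (`benign`, `suspicious`, `malicious`), the `Evidence` fields, the keyword heuristics, the file-role rule, and the "most severe verdict wins" consistency rule are all hypothetical choices made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical label set; the source only says the task is three-way.
LABELS = ("benign", "suspicious", "malicious")


@dataclass
class Evidence:
    path: str      # file the snippet came from
    role: str      # assumed roles: "instruction", "script", "reference"
    text: str      # the flagged line
    certain: bool  # False means the line needs semantic verification


def role_of(path: str) -> str:
    """Assign a coarse role to a package file (illustrative rule)."""
    if path.endswith("SKILL.md"):
        return "instruction"
    if path.endswith((".py", ".sh", ".js")):
        return "script"
    return "reference"


def extract_evidence(package: dict[str, str]) -> list[Evidence]:
    """Stage 1: collect role-tagged evidence across every file in the package."""
    found = []
    for path, body in package.items():
        for line in body.splitlines():
            low = line.lower()
            if "curl" in low and "| sh" in low:
                found.append(Evidence(path, role_of(path), line, certain=True))
            elif "token" in low or "upload" in low:
                found.append(Evidence(path, role_of(path), line, certain=False))
    return found


def verify(ev: Evidence) -> bool:
    """Stage 2 stand-in: a real system would call a semantic verifier here."""
    low = ev.text.lower()
    return "upload" in low and "token" in low


def adjudicate(evidence: list[Evidence]) -> str:
    """Stage 3: map evidence to a label; verify() runs only on uncertain items."""
    confirmed = [e for e in evidence if e.certain or verify(e)]
    if confirmed:
        return "malicious"
    return "suspicious" if evidence else "benign"


def audit(package: dict[str, str]) -> str:
    return adjudicate(extract_evidence(package))


def adjudicate_cluster(packages: list[dict[str, str]]) -> str:
    """Assumed consistency rule: an anchor and its rewrites share the most severe verdict."""
    return max((audit(p) for p in packages), key=LABELS.index)
```

The short-circuit in `adjudicate` is what makes verification selective in this sketch: clear-cut evidence skips the verifier, and only uncertain lines pay for the extra check.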

If this is right

  • Pre-load security reviews for Agent Skills become reliable against common semantics-preserving evasion tactics.
  • Robustness gains appear in both frozen and public-ecosystem settings.
  • Attack exact consistency reaches 98.89 percent on held-out data, reducing inconsistent flagging of malicious intent.
  • Factorized auditing outperforms single-prompt filtering by using the full package context.
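
For readers who want to see how the three reported metrics could be computed, here is a hedged Python sketch. The review does not give the paper's formal definitions, so these are assumptions: exact match is the share of samples whose predicted label equals the gold label, malicious-risk recall is the share of gold-malicious samples predicted malicious, and attack exact consistency is the share of attack clusters (an anchor plus its rewrites) in which every variant is labelled correctly. The toy labels and cluster ids are invented for illustration.

```python
from collections import defaultdict


def exact_match(gold, pred):
    """Share of samples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def malicious_recall(gold, pred, positive="malicious"):
    """Share of gold-positive samples that were predicted positive."""
    hits = [p == positive for g, p in zip(gold, pred) if g == positive]
    return sum(hits) / len(hits)


def attack_exact_consistency(gold, pred, cluster_ids):
    """Share of clusters in which every variant is labelled exactly right.

    Assumes the caller passes only attack samples, grouped by cluster id.
    """
    clusters = defaultdict(list)
    for g, p, c in zip(gold, pred, cluster_ids):
        clusters[c].append(g == p)
    return sum(all(ok) for ok in clusters.values()) / len(clusters)


# Toy data: six samples, four clusters (all values are made up).
gold = ["malicious", "malicious", "malicious", "benign", "suspicious", "malicious"]
pred = ["malicious", "malicious", "suspicious", "benign", "suspicious", "malicious"]
clusters = ["a", "a", "b", "c", "d", "b"]

em = exact_match(gold, pred)
recall = malicious_recall(gold, pred)
consistency = attack_exact_consistency(gold, pred, clusters)
```

Under these assumed definitions a single mislabelled rewrite costs one sample in exact match but an entire cluster in consistency, which is why the consistency figure is the stricter of the two.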

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same factorized structure could apply to other composite AI artifacts such as multi-file tool bundles or prompt libraries.
  • Closing the remaining gap on harsher external sources would require new benchmark families that stress transfer from truly unseen origins.
  • Pairing the pre-load auditor with lightweight runtime monitoring would create a two-layer defense for deployed Agent Skills.

Load-bearing premise

That the held-out and external benchmarks sufficiently represent real-world malicious Agent Skills, and that semantics-preserving rewrites capture the main ways attackers evade detection.

What would settle it

A clear drop below 90 percent in malicious recall or attack consistency, either on a fresh collection of real-world malicious Agent Skills or on rewrites that preserve semantics in ways the current benchmarks do not cover, would falsify the robustness claim.

Figures

Figures reproduced from arXiv: 2604.25109 by Jie Wen, Jizhong Han, Lijia Lv, Songlin Hu, Xuehai Tang.

Figure 1: Problem illustration. A skill guardrail must…

Figure 2: Overview of SkillGuard-Robust. The architecture resolves four recurrent ambiguities in sequence: distributed cross-file evidence, localized semantic uncertainty, chain-dominance conflict, and anchor–rewrite residual disagreement. Simple samples follow a fast path of structured extraction followed by direct adjudication, whereas uncertain samples are sent to semantic verification before chain- and cluster-l…

Figure 3: Scale overview of the five large evaluation…

Figure 4: Input-view comparison for PEB and Qwen2.5-14B when reading only SKILL.md versus the full package. Why SKILL.md-only views are insufficient…

Figure 5: Directional failure patterns across the large evaluation views. The left panel shows that flat and structured…

Figure 6: Main results and method progression on the large evaluation views. Across Main, All-HO, Int.-stress,…

Figure 7: Stage-wise gains on the large evaluation…

Original abstract

Agent Skills package SKILL.md files, scripts, reference documents, and repository context into reusable capability units, turning pre-load auditing from single-prompt filtering into cross-file security review. Existing guardrails often flag risk but recover malicious intent inconsistently under semantics-preserving rewrites. This paper formulates pre-load auditing for untrusted Agent Skills as a robust three-way classification task and introduces SkillGuard-Robust, which combines role-aware evidence extraction, selective semantic verification, and consistency-preserving adjudication. We evaluate SkillGuard-Robust on SkillGuardBench and two public-ecosystem extensions through five large evaluation views ranging from 254 to 404 packages. On the 404-package held-out aggregate, SkillGuard-Robust reaches 97.30% overall exact match, 98.33% malicious-risk recall, and 98.89% attack exact consistency. On the 254-package external-ecosystem view, it reaches 99.66%, 100.00%, and 100.00%, respectively. These results support a bounded conclusion: factorized package auditing materially improves frozen and public-ecosystem robustness, while harsher external-source transfer remains an open challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper formulates pre-load auditing of untrusted Agent Skills packages (SKILL.md files, scripts, and context) as a three-way classification task and introduces SkillGuard-Robust, which integrates role-aware evidence extraction, selective semantic verification, and consistency-preserving adjudication. It reports results across five evaluation views on SkillGuardBench and two public-ecosystem extensions (254–404 packages), achieving 97.30% overall exact match, 98.33% malicious-risk recall, and 98.89% attack exact consistency on the 404-package held-out aggregate, plus 99.66% exact match, 100% malicious recall, and 100% consistency on the 254-package external view. The central claim is that factorized package auditing materially improves robustness for frozen and public-ecosystem cases, while harsher external-source transfer remains an open challenge.

Significance. If the evaluation methodology and benchmark construction are sound, the work would offer a meaningful advance in securing AI agent capability ecosystems by providing a structured, multi-component auditing approach that targets inconsistency under semantics-preserving rewrites. The emphasis on factorized auditing and consistency adjudication supplies a concrete framework that could inform practical guardrails, with the multi-view evaluation (held-out plus external) adding some breadth to the reported gains.

major comments (2)
  1. [Abstract] The reported 99.66% exact match, 100.00% malicious-risk recall, and 100.00% attack exact consistency on the 254-package external-ecosystem view stand in tension with the explicit statement that 'harsher external-source transfer remains an open challenge.' This raises a load-bearing concern for the bounded conclusion, because the external numbers imply that the selective semantic verification and consistency-preserving adjudication components were not stressed by the primary evasion patterns (e.g., control-flow flattening, identifier renaming, or docstring injection) that fall under the semantics-preserving rewrites the abstract identifies as relevant.
  2. [Evaluation] Evaluation (across the five views): the manuscript does not supply sufficient detail on SkillGuardBench construction—specifically how malicious instances were generated, how semantics-preserving rewrites were applied to create the test cases, or what baseline systems were compared against. Without these elements, the high recall and consistency figures (e.g., 98.33% malicious recall on the 404-package aggregate) cannot be assessed for genuine robustness versus benchmark-specific tuning, directly affecting support for the claim that factorized auditing improves robustness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address both major comments by clarifying the scope of the external evaluation results and committing to expanded methodological details in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The reported 99.66% exact match, 100.00% malicious-risk recall, and 100.00% attack exact consistency on the 254-package external-ecosystem view stand in tension with the explicit statement that 'harsher external-source transfer remains an open challenge.' This raises a load-bearing concern for the bounded conclusion, because the external numbers imply that the selective semantic verification and consistency-preserving adjudication components were not stressed by the primary evasion patterns (e.g., control-flow flattening, identifier renaming, or docstring injection) that fall under the semantics-preserving rewrites the abstract identifies as relevant.

    Authors: We agree the abstract wording risks misinterpretation. The 254-package external-ecosystem view draws directly from public repositories without additional adversarial rewrites. The phrase 'harsher external-source transfer' specifically denotes more aggressive post-sourcing transformations (control-flow flattening, identifier renaming, docstring injection) that are not instantiated in the current 254-package collection. We will revise the abstract to explicitly separate the evaluated external view from these harsher transfer scenarios that remain an open challenge. revision: yes

  2. Referee: [Evaluation] Evaluation (across the five views): the manuscript does not supply sufficient detail on SkillGuardBench construction—specifically how malicious instances were generated, how semantics-preserving rewrites were applied to create the test cases, or what baseline systems were compared against. Without these elements, the high recall and consistency figures (e.g., 98.33% malicious recall on the 404-package aggregate) cannot be assessed for genuine robustness versus benchmark-specific tuning, directly affecting support for the claim that factorized auditing improves robustness.

    Authors: We concur that additional detail is required for reproducibility and to substantiate the robustness claims. In the revised manuscript we will expand the Evaluation section with: (i) the exact procedure for generating malicious instances (rule-based injection combined with LLM-assisted obfuscation), (ii) the concrete semantics-preserving rewrite operators applied (variable renaming, dead-code insertion, control-flow flattening, docstring injection), and (iii) the full baseline suite together with their configurations and implementation references. These additions will allow readers to distinguish benchmark-specific effects from the contribution of factorized auditing. revision: yes
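
To make "semantics-preserving rewrite" concrete, the sketch below shows one of the operators named in the rebuttal, identifier renaming, using Python's standard `ast` module (`ast.unparse` needs Python 3.9 or later). It is an illustration only and not the benchmark's actual operator: the `RenameNames` class, the sample function `collect`, and the rename mapping are hypothetical.

```python
import ast


class RenameNames(ast.NodeTransformer):
    """Rename selected identifiers without changing what the program computes."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers both reads and writes of a variable.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


SOURCE = """
def collect(paths):
    total = 0
    for p in paths:
        total += len(p)
    return total
"""


def rewrite(source, mapping):
    """Parse, rename, and turn the tree back into source text."""
    tree = RenameNames(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def run(source, *args):
    """Execute the source and call its `collect` function."""
    scope = {}
    exec(source, scope)
    return scope["collect"](*args)


REWRITTEN = rewrite(SOURCE, {"paths": "a1", "total": "b2", "p": "c3"})
```

Calling `run` on `SOURCE` and on `REWRITTEN` with the same argument gives the same result even though the text differs. An auditor that keys on surface names could label the two versions differently, which is the kind of disagreement a consistency metric is meant to expose.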

Circularity Check

0 steps flagged

No circularity: evaluation relies on held-out and external sets with independent method definition

full rationale

The paper defines SkillGuard-Robust via role-aware extraction, selective semantic verification, and consistency-preserving adjudication, then evaluates it on explicitly separated SkillGuardBench held-out aggregates (404 packages) and external-ecosystem views (254 packages). No equations, parameter fits, or self-citations are shown that reduce the reported metrics (exact match, recall, consistency) to the inputs by construction. The central claim of material robustness improvement is supported by these external benchmarks rather than being definitionally equivalent to them. This is the normal case of a self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the introduction of the SkillGuard-Robust method itself.

pith-pipeline@v0.9.0 · 5511 in / 1166 out tokens · 62268 ms · 2026-05-07T16:17:46.124622+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

    cs.CR 2026-05 unverdicted novelty 7.0

    Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

  2. Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

    cs.SE 2026-05 conditional novelty 7.0

    SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...
