Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security
Pith reviewed 2026-06-27 09:27 UTC · model grok-4.3
The pith
Runtime Skill Audit detects malicious LLM agent skills at 90% accuracy through targeted dynamic probing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RSA is a dynamic analysis method that audits skills by asking what the skill-mediated agent actually does under targeted runtime conditions. Instead of testing every skill with the same generic tasks, RSA profiles risk-relevant interfaces, prepares the execution context needed to exercise them, and assigns security labels from the resulting trace evidence. On 100 skills, RSA achieves 90.0% accuracy with an 88.0% true positive rate and an 8.0% false positive rate, improving accuracy by 13.0 percentage points over the best static baseline. Under self-evolving attacks, static detectors collapse after one or two rounds, while RSA continues to detect 19--20 out of 20 malicious skills across round
What carries the argument
Runtime Skill Audit (RSA), a dynamic analysis method that profiles risk-relevant interfaces, prepares execution contexts, and assigns security labels from trace evidence.
If this is right
- RSA improves accuracy by 13 percentage points over the best static baseline on the tested skills.
- RSA maintains detection of 19-20 malicious skills across multiple rounds of self-evolving attacks where static detectors fail after one or two rounds.
- Dynamic analysis reveals harmful behavior that is not apparent from static code or documentation alone.
- Targeted probing of risk-relevant interfaces under prepared contexts is more effective than generic testing for skill security.
Where Pith is reading between the lines
- If the targeted probing approach scales, security practices for agent skills may shift toward runtime testing in addition to static review.
- Other agent frameworks could benefit from adopting context-preparation techniques to audit reusable components.
- The method implies that skill interfaces may need clearer documentation to support effective runtime auditing.
Load-bearing premise
The specific risk-relevant interfaces, prepared execution contexts, and trace-evidence rules used by RSA are sufficient to surface hidden malicious behavior across the full space of possible user requests, local assets, and multi-step interactions without missing important cases or over-labeling benign skills.
What would settle it
A collection of malicious skills that evade RSA's profiled interfaces and prepared contexts while still producing harm in actual use, or a high rate of false positives on benign skills under varied real-world conditions.
Figures
read the original abstract
Agent skills let LLM agents reuse instructions, resources, tools, and workflows, but they also create a new place for malicious behavior to hide. A skill may look benign in its documentation or code while becoming harmful only when it is invoked with particular user requests, local assets, persistent state, or multi-step tool interactions. This makes purely static vetting brittle. We present Runtime Skill Audit (RSA), a dynamic analysis method that audits skills by asking what the skill-mediated agent actually does under targeted runtime conditions. Instead of testing every skill with the same generic tasks, RSA profiles risk-relevant interfaces, prepares the execution context needed to exercise them, and assigns security labels from the resulting trace evidence. We instantiate RSA on OpenClaw and evaluate it on 100 skills against representative static baselines. RSA achieves 90.0\% accuracy with an 88.0\% true positive rate and an 8.0\% false positive rate, improving accuracy by 13.0 percentage points over the best static baseline. Under self-evolving attacks, static detectors collapse after one or two rounds, while RSA continues to detect 19--20 out of 20 malicious skills across rounds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Runtime Skill Audit (RSA), a dynamic analysis method for detecting malicious LLM agent skills that may appear benign statically but exhibit harm under specific runtime conditions (user requests, local assets, state, or multi-step interactions). RSA profiles risk-relevant interfaces, prepares targeted execution contexts, and derives security labels from trace evidence. On a 100-skill corpus against static baselines, it reports 90.0% accuracy (88.0% TPR, 8.0% FPR), a 13-point accuracy gain, and sustained detection (19-20/20 malicious skills) under self-evolving attacks where static detectors fail after 1-2 rounds.
Significance. If the evaluation methodology is sound and the interface coverage is adequate, RSA would address a genuine limitation of static skill vetting for context-dependent malice in agent systems, offering a practical runtime probing approach with demonstrated robustness to adaptive attacks. This could inform security practices for reusable agent skills in LLM deployments.
major comments (2)
- [§4 (Evaluation)] §4 (Evaluation): The manuscript supplies no description of the 100-skill dataset composition, how the targeted conditions were chosen, how labels were assigned, or controls for selection bias. Without these details the headline performance numbers (90% accuracy, 88% TPR) cannot be assessed for reliability or generalizability.
- [§3 (RSA Method)] §3 (RSA Method): The description of profiling risk-relevant interfaces, preparing execution contexts, and trace-evidence rules provides no argument or evidence that this finite set is sufficient to surface hidden malicious behavior across the full space of possible user requests, local assets, and multi-step interactions. This leaves the accuracy and round-by-round robustness claims tied to the authors' test distribution rather than a general property.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater transparency in the evaluation and method sections. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.
read point-by-point responses
-
Referee: [§4 (Evaluation)] The manuscript supplies no description of the 100-skill dataset composition, how the targeted conditions were chosen, how labels were assigned, or controls for selection bias. Without these details the headline performance numbers (90% accuracy, 88% TPR) cannot be assessed for reliability or generalizability.
Authors: We agree that these methodological details are necessary to evaluate the reported performance. In the revised manuscript we will expand §4 with: (1) a breakdown of the 100-skill corpus by source (public repositories and controlled generation) and malicious/benign categories; (2) the process for deriving targeted execution conditions from the profiled risk-relevant interfaces; (3) the label-assignment protocol, performed by two independent security reviewers using explicit criteria for harm; and (4) bias-mitigation steps including stratified sampling across skill complexity and domain. These additions will allow readers to assess reliability and generalizability directly. revision: yes
-
Referee: [§3 (RSA Method)] The description of profiling risk-relevant interfaces, preparing execution contexts, and trace-evidence rules provides no argument or evidence that this finite set is sufficient to surface hidden malicious behavior across the full space of possible user requests, local assets, and multi-step interactions. This leaves the accuracy and round-by-round robustness claims tied to the authors' test distribution rather than a general property.
Authors: We acknowledge that exhaustive coverage of an infinite interaction space is impossible and that the paper does not claim universality. RSA deliberately restricts probing to a finite set of risk-relevant interfaces derived from documented attack patterns in LLM-agent literature. The self-evolving attack experiments provide evidence that this targeted set remains effective when adversaries adapt, which goes beyond a single fixed test distribution. In revision we will add an explicit limitations paragraph in §3 discussing the interface-selection rationale and coverage threats to validity, while retaining the practical robustness results. revision: partial
Circularity Check
No circularity: RSA presents an empirical runtime evaluation with no self-referential reductions.
full rationale
The manuscript describes a dynamic probing method that selects risk-relevant interfaces, prepares contexts, collects traces, and assigns labels, then reports accuracy on a fixed 100-skill corpus against static baselines. No equations, parameter-fitting steps presented as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the text. The reported 90% accuracy and round-by-round detection rates are framed as outcomes of the evaluation procedure rather than quantities defined in terms of themselves. The coverage assumption noted by the skeptic is a potential external-validity concern, not a circularity in the derivation chain.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild
Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study. arXiv e-prints , keywords =. doi:10.48550/arXiv.2602.06547 , archivePrefix =. 2602.06547 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.06547
-
[2]
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. arXiv e-prints , keywords =. doi:10.48550/arXiv.2403.02691 , archivePrefix =. 2403.02691 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.02691
-
[3]
BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents. arXiv e-prints , keywords =. doi:10.48550/arXiv.2601.04566 , archivePrefix =. 2601.04566 , primaryClass =
-
[4]
Agent Skills: A Data-Driven Analysis of Claude Skills for Extending Large Language Model Functionality. arXiv e-prints , keywords =. doi:10.48550/arXiv.2602.08004 , archivePrefix =. 2602.08004 , primaryClass =
-
[5]
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents. arXiv e-prints , keywords =. doi:10.48550/arXiv.2605.05726 , archivePrefix =. 2605.05726 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.05726
-
[6]
How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings
How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings. arXiv e-prints , keywords =. doi:10.48550/arXiv.2604.04323 , archivePrefix =. 2604.04323 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.04323
-
[7]
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw. arXiv e-prints , keywords =. doi:10.48550/arXiv.2604.04759 , archivePrefix =. 2604.04759 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.04759
-
[8]
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis. arXiv e-prints , keywords =. doi:10.48550/arXiv.2605.00314 , archivePrefix =. 2605.00314 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.00314
-
[9]
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks. arXiv e-prints , keywords =. doi:10.48550/arXiv.2602.20156 , archivePrefix =. 2602.20156 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.20156
-
[10]
Formal Analysis and Supply Chain Security for Agentic AI Skills. arXiv e-prints , keywords =. doi:10.48550/arXiv.2603.00195 , archivePrefix =. 2603.00195 , primaryClass =
-
[11]
Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale
Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale. arXiv e-prints , keywords =. doi:10.48550/arXiv.2601.10338 , archivePrefix =. 2601.10338 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.10338
-
[12]
SkillProbe: Security Auditing for Emerging Agent Skill Marketplaces via Multi-Agent Collaboration. arXiv e-prints , keywords =. doi:10.48550/arXiv.2603.21019 , archivePrefix =. 2603.21019 , primaryClass =
-
[13]
TraceAegis: Securing LLM-Based Agents via Hierarchical and Behavioral Anomaly Detection. arXiv e-prints , keywords =. doi:10.48550/arXiv.2510.11203 , archivePrefix =. 2510.11203 , primaryClass =
-
[14]
MindGuard: Intrinsic Decision Inspection for Securing LLM Agents Against Metadata Poisoning. arXiv e-prints , keywords =. doi:10.48550/arXiv.2508.20412 , archivePrefix =. 2508.20412 , primaryClass =
-
[15]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. arXiv e-prints , keywords =. doi:10.48550/arXiv.2309.10253 , archivePrefix =. 2309.10253 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.10253
-
[16]
MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety. arXiv e-prints , keywords =. doi:10.48550/arXiv.2602.01539 , archivePrefix =. 2602.01539 , primaryClass =
-
[17]
SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement
SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement. arXiv e-prints , keywords =. doi:10.48550/arXiv.2604.04989 , archivePrefix =. 2604.04989 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.04989
-
[18]
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs. arXiv e-prints , keywords =. doi:10.48550/arXiv.2410.05295 , archivePrefix =. 2410.05295 , primaryClass =
-
[19]
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv e-prints , keywords =. doi:10.48550/arXiv.2303.11366 , archivePrefix =. 2303.11366 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.11366
-
[20]
Generative Agents: Interactive Simulacra of Human Behavior
Generative Agents: Interactive Simulacra of Human Behavior. arXiv e-prints , keywords =. doi:10.48550/arXiv.2304.03442 , archivePrefix =. 2304.03442 , primaryClass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.03442
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.