Recognition: unknown
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
Pith reviewed 2026-05-08 03:03 UTC · model grok-4.3
The pith
AgentVisor defends LLM agents from prompt injection by enforcing semantic privilege separation through a trusted visor and audit protocol.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentVisor enforces semantic privilege separation by treating the target LLM agent as an untrusted guest and routing its tool calls through a trusted semantic visor. The visor applies a rigorous audit protocol based on operating-system security primitives to detect direct and indirect prompt injections. When a violation occurs, a one-shot self-correction mechanism supplies constructive feedback so the agent can recover and resume its workflow. This design is shown to reduce successful attacks to 0.65 percent while producing only a 1.45 percent average utility decrease compared with an undefended baseline.
What carries the argument
The semantic visor that intercepts tool calls from the untrusted agent and applies the audit protocol to enforce privilege separation and detect injections.
If this is right
- LLM agents can safely incorporate untrusted external data into automated workflows without high risk of compromise.
- The defense addresses both direct and indirect injections without requiring changes to the underlying model.
- Self-correction allows agents to maintain functionality after attempted attacks instead of halting entirely.
- The method achieves a better security-utility balance than prior defenses that either over-restrict or miss subtle attacks.
- The approach applies across varied agent configurations and task types.
Where Pith is reading between the lines
- The same virtualization layer could be extended to defend against other forms of adversarial manipulation in agent systems.
- Deploying the visor in production environments with diverse external data would test whether low false-positive rates hold in practice.
- The modular separation of agent and visor suggests a general pattern for adding security boundaries to autonomous AI without retraining the core model.
- Combining the audit protocol with other lightweight checks might further reduce remaining attack vectors.
Load-bearing premise
The audit protocol can reliably distinguish injection attempts from legitimate tool calls without generating enough false positives to harm the agent's normal utility.
What would settle it
A new suite of indirect prompt injection attacks that succeed in manipulating tool calls despite the visor, or that cause utility to drop more than a few percent on standard agent tasks.
Figures
read the original abstract
Large Language Model (LLM) agents are increasingly used to automate complex workflows, but integrating untrusted external data with privileged execution exposes them to severe security risks, particularly direct and indirect prompt injection. Existing defenses face significant challenges in balancing security with utility, often encountering a trade-off where rigorous protection leads to over-defense, or where subtle indirect injections bypass detection. Drawing inspiration from operating system virtualization, we propose AgentVisor, a novel defense framework that enforces semantic privilege separation. AgentVisor treats the target agent as an untrusted guest and intercepts tool calls via a trusted semantic visor. Central to our approach is a rigorous audit protocol grounded in classic OS security primitives, designed to systematically mitigate both direct and indirect injection attacks. Furthermore, we introduce a one-shot self-correction mechanism that transforms security violations into constructive feedback, enabling agents to recover from attacks. Extensive experiments show that AgentVisor reduces the attack success rate to 0.65%, achieving this strong defense while incurring only a 1.45% average decrease in utility relative to the No Defense scenario, demonstrating superior performance compared to existing defense methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AgentVisor, a defense framework for LLM agents against direct and indirect prompt injection. Inspired by OS virtualization, it treats the target agent as an untrusted guest and intercepts tool calls through a trusted semantic visor that enforces semantic privilege separation. The core components are a rigorous audit protocol grounded in classic OS security primitives and a one-shot self-correction mechanism that converts detected violations into feedback for recovery. Experiments report that AgentVisor reduces attack success rate to 0.65% while incurring only a 1.45% average utility decrease relative to the no-defense baseline, outperforming existing defenses.
Significance. If the evaluation holds, AgentVisor would represent a meaningful advance in securing LLM agents by achieving strong protection against prompt injection with negligible utility cost. The semantic-virtualization approach and self-correction mechanism are conceptually clean and directly address the over-defense vs. bypass trade-off noted in prior work. The paper targets a high-impact problem in agentic AI security and supplies concrete performance numbers that could guide practical deployment.
major comments (2)
- [Evaluation] Evaluation section (performance claims): The central result (0.65% ASR, 1.45% utility drop) is load-bearing. The manuscript must explicitly describe the full attack suite, the generation process for indirect injections, the number of trials per configuration, and whether any adversarial re-phrasing of injections was attempted to evade the semantic visor (e.g., by mimicking legitimate tool-response formats). Without this, the reported ASR may reflect test-set coverage rather than intrinsic robustness.
- [Audit Protocol] Audit protocol description: The semantic visor and audit protocol are the primary mechanisms for detecting indirect injections. The paper should supply concrete decision rules, pseudocode, or worked examples showing how a tool response is classified when an injection is embedded in otherwise legitimate data structures; this detail is required to assess whether the protocol can be bypassed without triggering false positives that would erode the reported utility.
minor comments (2)
- [Abstract] Abstract: The abstract states precise numerical results but omits any reference to the number of experimental runs, statistical tests, or the exact set of baseline defenses compared; adding one sentence on these points would improve immediate readability.
- [Introduction] Notation and terminology: Ensure consistent use of “ASR” and “utility” throughout; define both on first appearance and clarify whether utility is measured as task-completion accuracy or another metric.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where additional detail will strengthen the presentation of our evaluation and mechanisms. We address each point below and will incorporate the requested clarifications in the revised version.
read point-by-point responses
-
Referee: The manuscript must explicitly describe the full attack suite, the generation process for indirect injections, the number of trials per configuration, and whether any adversarial re-phrasing of injections was attempted to evade the semantic visor (e.g., by mimicking legitimate tool-response formats).
Authors: We agree that these methodological details are essential for interpreting the reported 0.65% ASR and 1.45% utility drop. In the revised manuscript we will expand the Evaluation section to enumerate the complete attack suite (direct and indirect injections), describe the generation process for indirect injections (including templates, data sources, and embedding methods), state the number of trials per configuration (100 trials per setup across three random seeds), and explicitly note that systematic adversarial re-phrasing to mimic legitimate tool-response formats was not performed in the current evaluation. We will frame this as a limitation and discuss its implications for claimed robustness. revision: yes
-
Referee: The paper should supply concrete decision rules, pseudocode, or worked examples showing how a tool response is classified when an injection is embedded in otherwise legitimate data structures.
Authors: We acknowledge that the current description of the audit protocol lacks sufficient concreteness for readers to assess bypass resistance and false-positive impact. We will add pseudocode for the semantic visor's classification rules and two worked examples illustrating how tool responses containing embedded injections within legitimate JSON or text structures are audited. These additions will make the decision logic transparent and allow direct evaluation of the over-defense versus bypass trade-off. revision: yes
Circularity Check
No significant circularity; independent engineering framework with external benchmarks
full rationale
The paper describes an engineering defense framework (semantic visor, audit protocol, one-shot self-correction) evaluated via attack success rate and utility metrics on experimental benchmarks. No equations, fitted parameters, derivations, or self-citation chains appear in the text. Claims reduce to design choices plus measured outcomes against independent attack suites rather than any self-definitional or fitted-input reduction. This is the common honest case of a self-contained proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM agents can be isolated and audited at the semantic level using OS-style privilege separation primitives
invented entities (1)
-
semantic visor
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner
Don’t you (forget nlp): Prompt injection with control characters in chatgpt. Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025a. {StruQ}: Defending against prompt injection with structured queries. In34th USENIX Security Symposium (USENIX Security 25), pages 2383–2400. Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, and David W...
-
[2]
https://docs.cloud.google
Gemini 2.5 flash | generative ai on vertex ai. https://docs.cloud.google. com/vertex-ai/generative-ai/docs/models/ gemini/2-5-flash. Accessed: 2025-12-31. Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman
2025
-
[3]
Defending Against Indirect Prompt Injection Attacks With Spotlighting
Defending against indirect prompt injection attacks with spotlighting.arXiv preprint arXiv:2403.14720. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others
work page internal anchor Pith review arXiv
-
[4]
Gpt-4o system card.arXiv preprint arXiv:2410.21276. Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong
work page internal anchor Pith review arXiv
-
[5]
Promptlocate: Localizing prompt injection attacks,
Promptlocate: Localizing prompt injec- tion attacks.arXiv preprint arXiv:2510.12252. Yilun Kong, Jingqing Ruan, Yihong Chen, Bin Zhang, Tianpeng Bao, Shi Shiwei, Xiaoru Hu, Hangyu Mao, Ziyue Li, Xingyu Zeng, and 1 others
-
[6]
InProceedings of the 2024 conference on empirical methods in natural language processing: industry track, pages 371–385
Tptu- v2: Boosting task planning and tool usage of large language model-based agents in real-world industry systems. InProceedings of the 2024 conference on empirical methods in natural language processing: industry track, pages 371–385. Shih-Wei Li, John S Koh, and Jason Nieh
2024
-
[7]
Pro- tecting cloud virtual machines from hypervisor and host operating system exploits. In28th USENIX Security Symposium (USENIX Security 19), pages 1357–1374. Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. 2024a. Automatic and univer- sal prompt injection attacks against large language models.arXiv preprint arXiv:2403.04957. Yupei L...
-
[8]
In 2025 IEEE Symposium on Security and Privacy (SP), pages 2190–2208
Datasentinel: A game- theoretic detection of prompt injection attacks. In 2025 IEEE Symposium on Security and Privacy (SP), pages 2190–2208. IEEE. AI @ Meta Llama Team
2025
-
[9]
The llama 3 herd of models.Preprint, arXiv:2407.21783. Luoxi Meng, Henry Feng, Ilia Shumailov, and Earlence Fernandes
work page internal anchor Pith review arXiv
-
[10]
cellmate: Sandboxing browser ai agents.arXiv preprint arXiv:2512.12594, 2025
cellmate: Sandboxing browser ai agents.arXiv preprint arXiv:2512.12594. Fábio Perez and Ian Ribeiro
-
[11]
Ignore Previous Prompt: Attack Techniques For Language Models
Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527. Yun Piao, Hongbo Min, Hang Su, Leilei Zhang, Lei Wang, Yue Yin, Xiao Wu, Zhejing Xu, Liwei Qu, Hang Li, and 1 others
work page internal anchor Pith review arXiv
-
[12]
9 Gerald J Popek and Robert P Goldberg
Agentbay: A hy- brid interaction sandbox for seamless human-ai intervention in agentic systems.arXiv preprint arXiv:2512.04367. 9 Gerald J Popek and Robert P Goldberg
-
[13]
InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Secu- rity, pages 660–674
Optimization-based prompt injection attack to llm-as- a-judge. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Secu- rity, pages 660–674. Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, and 1 others
2024
-
[14]
Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219. Takahiro Shinagawa, Hideki Eiraku, Kouichi Tanimoto, Kazumasa Omote, Shoichi Hasegawa, Takashi Horie, Manabu Hirano, Kenichi Kourai, Yoshihiro Oyama, Eiji Kawai, and 1 others
-
[15]
InProceed- ings of the 2009 ACM SIGPLAN/SIGOPS interna- tional conference on Virtual execution environments, pages 121–130
Bitvisor: a thin hyper- visor for enforcing i/o device security. InProceed- ings of the 2009 ACM SIGPLAN/SIGOPS interna- tional conference on Virtual execution environments, pages 121–130. GLM Team, Aohan Zeng, Xin Lv, Qinkai Zheng, and 1 others
2009
-
[16]
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Glm-4.5: Agentic, reason- ing, and coding (arc) foundation models.Preprint, arXiv:2508.06471. Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, and Xianglong Liu
work page internal anchor Pith review arXiv
-
[17]
In Proceedings of the 33rd ACM International Confer- ence on Multimedia, pages 10955–10964
Manipulating multi- modal agents via cross-modal prompt injection. In Proceedings of the 33rd ACM International Confer- ence on Multimedia, pages 10955–10964. Simon Willison. Delimiters won’t save you from prompt injection, 2023.URL https://simonwillison. net/2023/May/11/delimiters-wont-save-you,
2023
-
[18]
Easytool: Enhancing llm-based agents with concise tool instruction. InProceedings of the 2025 Conference of the Nations of the Amer- icas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pages 951–972. Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, and William Yang Wang
2025
-
[19]
Melon: Provable defense against indirect prompt injection attacks in ai agents.arXiv preprint arXiv:2502.05174. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson
-
[20]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Univer- sal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043. A Trap–Audit–Recover Procedure This section provides the full algorithmic descrip- tion of the trap–audit–recover loop implemented by AgentVisor(Alg. 1). It presents the end-to-end con- trol flow, including proposal, auditing, exception handling, a...
work page internal anchor Pith review arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.