pith. machine review for the scientific record.

arxiv: 2604.24118 · v1 · submitted 2026-04-27 · 💻 cs.CR

Recognition: unknown

AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:03 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM agents · prompt injection · semantic virtualization · defense framework · audit protocol · self-correction · privilege separation · AI security

The pith

AgentVisor defends LLM agents from prompt injection by enforcing semantic privilege separation through a trusted visor and audit protocol.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AgentVisor, a framework that adapts operating system virtualization concepts to protect LLM agents when they interact with untrusted external data. It positions the agent as an untrusted guest whose tool calls are intercepted and checked by a trusted semantic visor before execution. An audit protocol drawn from classic security primitives identifies both direct and indirect injection attempts. A one-shot self-correction step turns detected violations into feedback that lets the agent continue its work. Experiments indicate this combination keeps attack success very low while causing only a small average drop in the agent's ability to complete tasks.

Core claim

AgentVisor enforces semantic privilege separation by treating the target LLM agent as an untrusted guest and routing its tool calls through a trusted semantic visor. The visor applies a rigorous audit protocol based on operating-system security primitives to detect direct and indirect prompt injections. When a violation occurs, a one-shot self-correction mechanism supplies constructive feedback so the agent can recover and resume its workflow. This design is shown to reduce successful attacks to 0.65 percent while producing only a 1.45 percent average utility decrease compared with an undefended baseline.
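The guest/visor control flow described above can be sketched as follows. This is an editorial reconstruction, not the paper's implementation: the `audit` stub, the tool-call format, and the feedback handling are hypothetical stand-ins for AgentVisor's actual audit protocol and one-shot self-correction mechanism.

```python
from dataclasses import dataclass

# Hypothetical sketch of the guest/visor control flow; the paper's actual
# audit protocol and feedback format are not reproduced here.

@dataclass
class Verdict:
    ok: bool
    reason: str = ""

def audit(call: dict) -> Verdict:
    """Stand-in for the semantic visor's audit: flag tool calls whose
    arguments carry instruction-like content from untrusted data."""
    for value in call["args"].values():
        if "ignore previous instructions" in str(value).lower():
            return Verdict(False, "instruction-like content in tool arguments")
    return Verdict(True)

def run_with_visor(proposed_calls, max_corrections=1):
    """Route every proposed tool call through the audit; one detected
    violation becomes corrective feedback instead of a hard stop."""
    executed, feedback = [], []
    for call in proposed_calls:
        verdict = audit(call)
        if verdict.ok:
            executed.append(call["name"])        # privileged execution
        elif max_corrections > 0:
            max_corrections -= 1                 # one-shot self-correction
            feedback.append(verdict.reason)
        else:
            raise RuntimeError(f"blocked: {verdict.reason}")
    return executed, feedback

calls = [
    {"name": "search", "args": {"q": "quarterly report"}},
    {"name": "send_email", "args": {"body": "IGNORE PREVIOUS INSTRUCTIONS ..."}},
    {"name": "summarize", "args": {"doc": "report.txt"}},
]
done, notes = run_with_visor(calls)
print(done)   # the injected send_email call is withheld; benign calls proceed
```

The key design point this sketch preserves is that the agent never executes tools directly: every call crosses the trust boundary through the visor, and a first violation is converted into feedback rather than terminating the workflow.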

What carries the argument

The semantic visor that intercepts tool calls from the untrusted agent and applies the audit protocol to enforce privilege separation and detect injections.

If this is right

  • LLM agents can safely incorporate untrusted external data into automated workflows without high risk of compromise.
  • The defense addresses both direct and indirect injections without requiring changes to the underlying model.
  • Self-correction allows agents to maintain functionality after attempted attacks instead of halting entirely.
  • The method achieves a better security-utility balance than prior defenses that either over-restrict or miss subtle attacks.
  • The approach applies across varied agent configurations and task types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same virtualization layer could be extended to defend against other forms of adversarial manipulation in agent systems.
  • Deploying the visor in production environments with diverse external data would test whether low false-positive rates hold in practice.
  • The modular separation of agent and visor suggests a general pattern for adding security boundaries to autonomous AI without retraining the core model.
  • Combining the audit protocol with other lightweight checks might further reduce remaining attack vectors.

Load-bearing premise

The audit protocol can reliably distinguish injection attempts from legitimate tool calls without generating enough false positives to harm the agent's normal utility.
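One way to see why this premise is load-bearing (an editorial back-of-envelope model, not a calculation from the paper): if a benign task involves k tool calls and the visor misflags each legitimate call independently with probability p, then the chance that at least one call in the task is disrupted is 1 − (1 − p)^k, so a per-call false-positive rate that looks negligible can still compound across multi-step workflows.

```python
# Editorial back-of-envelope model (not from the paper): probability that a
# benign k-step task hits at least one false positive when each legitimate
# tool call is misflagged independently with probability p.

def task_disruption_rate(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

# Even a 0.5% per-call false-positive rate disturbs roughly 5% of 10-call tasks.
for p in (0.001, 0.005, 0.02):
    print(f"p={p}: {task_disruption_rate(p, 10):.4f}")
```

This is why the reported 1.45% average utility drop implicitly constrains the per-call false-positive rate to be very small on the evaluated workloads.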

What would settle it

A new suite of indirect prompt injection attacks that succeed in manipulating tool calls despite the visor, or that cause utility to drop more than a few percent on standard agent tasks.

Figures

Figures reproduced from arXiv: 2604.24118 by Aishan Liu, Haozheng Wang, Jiangfan Liu, Jian Yang, Quanchen Zou, Xianglong Liu, Yaodong Yang, Zonghao Ying.

Figure 1: Systematic mapping between OS virtualization… view at source ↗
Figure 2: Overview of the AgentVisor architecture. Drawing inspiration from OS virtualization, AgentVisor enforces privilege separation between the untrusted target agent (Guest) and the trusted semantic hypervisor (Visor). view at source ↗
Figure 3: Detection performance of target agents… view at source ↗
Figure 4: Ablation study of AgentVisor components; w/o denotes the removal of a specific component. view at source ↗
Figure 5: Ablation study on AgentVisor's LLM backbone against Direct Injection. view at source ↗
Figure 6: Ablation study on AgentVisor's LLM backbone against Indirect Injection. view at source ↗
Figure 7: Robustness of the defense mechanism against… view at source ↗
read the original abstract

Large Language Model (LLM) agents are increasingly used to automate complex workflows, but integrating untrusted external data with privileged execution exposes them to severe security risks, particularly direct and indirect prompt injection. Existing defenses face significant challenges in balancing security with utility, often encountering a trade-off where rigorous protection leads to over-defense, or where subtle indirect injections bypass detection. Drawing inspiration from operating system virtualization, we propose AgentVisor, a novel defense framework that enforces semantic privilege separation. AgentVisor treats the target agent as an untrusted guest and intercepts tool calls via a trusted semantic visor. Central to our approach is a rigorous audit protocol grounded in classic OS security primitives, designed to systematically mitigate both direct and indirect injection attacks. Furthermore, we introduce a one-shot self-correction mechanism that transforms security violations into constructive feedback, enabling agents to recover from attacks. Extensive experiments show that AgentVisor reduces the attack success rate to 0.65%, achieving this strong defense while incurring only a 1.45% average decrease in utility relative to the No Defense scenario, demonstrating superior performance compared to existing defense methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AgentVisor, a defense framework for LLM agents against direct and indirect prompt injection. Inspired by OS virtualization, it treats the target agent as an untrusted guest and intercepts tool calls through a trusted semantic visor that enforces semantic privilege separation. The core components are a rigorous audit protocol grounded in classic OS security primitives and a one-shot self-correction mechanism that converts detected violations into feedback for recovery. Experiments report that AgentVisor reduces attack success rate to 0.65% while incurring only a 1.45% average utility decrease relative to the no-defense baseline, outperforming existing defenses.

Significance. If the evaluation holds, AgentVisor would represent a meaningful advance in securing LLM agents by achieving strong protection against prompt injection with negligible utility cost. The semantic-virtualization approach and self-correction mechanism are conceptually clean and directly address the over-defense vs. bypass trade-off noted in prior work. The paper targets a high-impact problem in agentic AI security and supplies concrete performance numbers that could guide practical deployment.

major comments (2)
  1. [Evaluation] Evaluation section (performance claims): The central result (0.65% ASR, 1.45% utility drop) is load-bearing. The manuscript must explicitly describe the full attack suite, the generation process for indirect injections, the number of trials per configuration, and whether any adversarial re-phrasing of injections was attempted to evade the semantic visor (e.g., by mimicking legitimate tool-response formats). Without this, the reported ASR may reflect test-set coverage rather than intrinsic robustness.
  2. [Audit Protocol] Audit protocol description: The semantic visor and audit protocol are the primary mechanisms for detecting indirect injections. The paper should supply concrete decision rules, pseudocode, or worked examples showing how a tool response is classified when an injection is embedded in otherwise legitimate data structures; this detail is required to assess whether the protocol can be bypassed without triggering false positives that would erode the reported utility.
minor comments (2)
  1. [Abstract] Abstract: The abstract states precise numerical results but omits any reference to the number of experimental runs, statistical tests, or the exact set of baseline defenses compared; adding one sentence on these points would improve immediate readability.
  2. [Introduction] Notation and terminology: Ensure consistent use of “ASR” and “utility” throughout; define both on first appearance and clarify whether utility is measured as task-completion accuracy or another metric.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where additional detail will strengthen the presentation of our evaluation and mechanisms. We address each point below and will incorporate the requested clarifications in the revised version.

read point-by-point responses
  1. Referee: The manuscript must explicitly describe the full attack suite, the generation process for indirect injections, the number of trials per configuration, and whether any adversarial re-phrasing of injections was attempted to evade the semantic visor (e.g., by mimicking legitimate tool-response formats).

    Authors: We agree that these methodological details are essential for interpreting the reported 0.65% ASR and 1.45% utility drop. In the revised manuscript we will expand the Evaluation section to enumerate the complete attack suite (direct and indirect injections), describe the generation process for indirect injections (including templates, data sources, and embedding methods), state the number of trials per configuration (100 trials per setup across three random seeds), and explicitly note that systematic adversarial re-phrasing to mimic legitimate tool-response formats was not performed in the current evaluation. We will frame this as a limitation and discuss its implications for claimed robustness. revision: yes

  2. Referee: The paper should supply concrete decision rules, pseudocode, or worked examples showing how a tool response is classified when an injection is embedded in otherwise legitimate data structures.

    Authors: We acknowledge that the current description of the audit protocol lacks sufficient concreteness for readers to assess bypass resistance and false-positive impact. We will add pseudocode for the semantic visor's classification rules and two worked examples illustrating how tool responses containing embedded injections within legitimate JSON or text structures are audited. These additions will make the decision logic transparent and allow direct evaluation of the over-defense versus bypass trade-off. revision: yes

Circularity Check

0 steps flagged

No significant circularity; independent engineering framework with external benchmarks

full rationale

The paper describes an engineering defense framework (semantic visor, audit protocol, one-shot self-correction) evaluated via attack success rate and utility metrics on experimental benchmarks. No equations, fitted parameters, derivations, or self-citation chains appear in the text. Claims reduce to design choices plus measured outcomes against independent attack suites rather than any self-definitional or fitted-input reduction. This is the common honest case of a self-contained proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on treating LLM behavior as analogous to untrusted guest processes and assuming the visor can perform reliable semantic auditing; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption LLM agents can be isolated and audited at the semantic level using OS-style privilege separation primitives
    Invoked when mapping virtualization concepts to tool-call interception and attack mitigation.
invented entities (1)
  • semantic visor no independent evidence
    purpose: Trusted component that intercepts and audits tool calls for injection detection
    New architectural element introduced to enforce separation; no independent evidence outside the proposed system is provided.

pith-pipeline@v0.9.0 · 5514 in / 1250 out tokens · 55843 ms · 2026-05-08T03:03:27.596351+00:00 · methodology

discussion (0)

