PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
Pith reviewed 2026-05-10 16:14 UTC · model grok-4.3
The pith
PlanGuard stops indirect prompt injection attacks on LLM agents by checking each action against a plan derived from the user's instructions alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PlanGuard is a training-free defense framework built on context isolation. An isolated Planner generates a reference set of valid actions derived solely from the user's instructions. A Hierarchical Verification Mechanism first applies strict hard constraints to prevent unauthorized tool invocations and then uses an Intent Verifier to determine whether any observed parameter deviations represent benign formatting variances or malicious hijacking attempts.
What carries the argument
An isolated Planner that produces a reference set of valid actions from user instructions alone, paired with a Hierarchical Verification Mechanism that performs runtime consistency checks (sketched below).
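Based only on the description above, a minimal Python sketch of that data flow might look as follows. The ToolCall type and the plan_reference_actions and intent_verifier stubs are illustrative assumptions, not the paper's actual interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str     # tool name, e.g. "send_email"
    params: dict  # keyword arguments passed to the tool


def plan_reference_actions(user_instruction: str) -> list[ToolCall]:
    """Isolated Planner: sees only the user's instruction, never retrieved
    content, so injected text cannot influence the reference set."""
    raise NotImplementedError  # would call an LLM on the instruction alone


def intent_verifier(observed: ToolCall, reference: ToolCall) -> bool:
    """LLM-based soft check: True if the parameter deviation looks like a
    benign formatting variance, False if it hijacks the user's intent."""
    raise NotImplementedError


def verify(observed: ToolCall, reference_set: list[ToolCall]) -> bool:
    # Stage 1: hard constraint -- the tool itself must appear in the plan.
    candidates = [r for r in reference_set if r.tool == observed.tool]
    if not candidates:
        return False  # unauthorized tool invocation: block unconditionally
    # Stage 2: an exact parameter match passes without any LLM call.
    if any(r.params == observed.params for r in candidates):
        return True
    # Stage 3: classify the remaining deviation as benign or malicious.
    return any(intent_verifier(observed, r) for r in candidates)
```

The ordering matters: the cheap, deterministic tool-name check runs before the more expensive and fallible LLM judgment, which is consulted only for parameter deviations.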
If this is right
- Agents can safely use external tools and process retrieved content without successful hijacking.
- No model training or fine-tuning is required for the defense to function.
- Attack success rate falls to zero on the InjecAgent benchmark with a false positive rate of 1.49 percent.
- The approach remains effective across different underlying language models.
Where Pith is reading between the lines
- Runtime behavior verification can serve as a useful second layer alongside input filtering for agent security.
- The same planning-and-check structure might address other forms of context poisoning in autonomous systems.
- Integrating dynamic replanning when user goals are complex could further reduce false blocks.
Load-bearing premise
The planner can generate a complete and accurate reference set of valid actions from user instructions alone, and the verifier can reliably separate benign parameter formatting from malicious hijacking.
What would settle it
An attack that causes the agent to invoke an unauthorized tool or alter parameters in a way the hierarchical checks accept as valid, or a legitimate user request that the planner misses and therefore blocks.
Original abstract
Large Language Model (LLM) agents are increasingly integrated into critical systems, leveraging external tools to interact with the real world. However, this capability exposes them to Indirect Prompt Injection (IPI), where attackers embed malicious instructions into retrieved content to manipulate the agent into executing unauthorized or unintended actions. Existing defenses predominantly focus on the pre-processing stage, neglecting the monitoring of the model's actual behavior. In this paper, we propose PlanGuard, a training-free defense framework based on the principle of Context Isolation. Unlike prior methods, PlanGuard introduces an isolated Planner that generates a reference set of valid actions derived solely from user instructions. In addition, we design a Hierarchical Verification Mechanism that first enforces strict hard constraints to block unauthorized tool invocations, and subsequently employs an Intent Verifier to validate whether parameter deviations are benign formatting variances or malicious hijacking. Experiments on the InjecAgent benchmark demonstrate that PlanGuard effectively neutralizes these attacks, reducing the Attack Success Rate (ASR) from 72.8% to 0%, while maintaining an acceptable False Positive Rate of 1.49%. Furthermore, our method is model-agnostic and highly compatible.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PlanGuard, a training-free defense framework for LLM agents against Indirect Prompt Injection (IPI). It relies on Context Isolation via an isolated Planner that generates a reference set of valid actions solely from user instructions, combined with a Hierarchical Verification Mechanism that applies hard constraints to block unauthorized tool calls and then uses an Intent Verifier to classify parameter deviations as benign or malicious. Experiments on the InjecAgent benchmark are reported to reduce Attack Success Rate (ASR) from 72.8% to 0% while keeping False Positive Rate (FPR) at 1.49%, with the method claimed to be model-agnostic and compatible with existing agents.
Significance. If the results hold under broader conditions, the work would be significant for agent security by introducing runtime consistency verification rather than relying solely on input pre-processing. The training-free design and explicit separation of planning from execution are practical strengths that could enable deployment without retraining. The reported perfect ASR reduction on InjecAgent, if reproducible, would represent a strong empirical outcome for this threat model.
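For concreteness, the two headline metrics quoted above are conventionally computed roughly as follows; this is a generic sketch assuming per-trial outcome records, not the paper's evaluation code.

```python
def attack_success_rate(trials: list[dict]) -> float:
    """Fraction of injection-bearing trials where the injected action ran."""
    attacked = [t for t in trials if t["attacked"]]
    return sum(t["attack_executed"] for t in attacked) / len(attacked)


def false_positive_rate(trials: list[dict]) -> float:
    """Fraction of benign trials where a legitimate action was blocked."""
    benign = [t for t in trials if not t["attacked"]]
    return sum(t["blocked"] for t in benign) / len(benign)
```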
major comments (3)
- [Evaluation] The central claim of ASR reduction to 0% (Abstract) depends on the Isolated Planner producing a complete and accurate reference action set from user instructions alone, yet the evaluation provides no analysis or metrics on planner completeness, failure modes, or coverage for complex multi-step instructions.
- [Method] §3 (Method), Hierarchical Verification Mechanism: The distinction between benign parameter formatting variances and malicious hijacks is delegated to an LLM-based Intent Verifier whose reliability is not ablated against disguised or syntactically similar attacks; this assumption is load-bearing for the 0% ASR result but unsupported by targeted experiments.
- [Experiments] Experiments section: No ablation studies isolate the contribution of hard constraints versus the Intent Verifier, nor test edge cases such as planner isolation failures or parameter deviations that mimic benign formatting, leaving the reported 1.49% FPR and 0% ASR vulnerable to benchmark-specific artifacts.
minor comments (2)
- [Abstract] The abstract and method descriptions would benefit from explicit pseudocode or a diagram clarifying the data flow between the Isolated Planner, hard constraints, and Intent Verifier.
- [Introduction] The terms 'valid actions' and 'parameter deviations' are used without a formal definition or example in the early sections; defining them up front would help readers unfamiliar with agent tool-calling formats (an illustrative example follows this list).
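To make the referee's point concrete, here is a hypothetical example (not taken from the paper) of a planned valid action, a benign formatting variance, and a malicious parameter hijack:

```python
# A planned "valid action" as the Planner might emit it (illustrative only).
reference_action = {
    "tool": "send_email",
    "params": {"to": "alice@example.com", "subject": "Q3 report"},
}

# Benign deviation: same intent, different surface form of one parameter.
benign_observed = {
    "tool": "send_email",
    "params": {"to": "Alice <alice@example.com>", "subject": "Q3 report"},
}

# Malicious hijack: injected content redirected the recipient, changing
# the user's intent rather than merely its formatting.
hijacked_observed = {
    "tool": "send_email",
    "params": {"to": "attacker@evil.example", "subject": "Q3 report"},
}
```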
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and proposing revisions to enhance the paper's rigor and completeness.
Point-by-point responses
- Referee: [Evaluation] The central claim of ASR reduction to 0% (Abstract) depends on the Isolated Planner producing a complete and accurate reference action set from user instructions alone, yet the evaluation provides no analysis or metrics on planner completeness, failure modes, or coverage for complex multi-step instructions.
  Authors: We agree that an analysis of the Isolated Planner's completeness would provide valuable context for the reported results. Although the InjecAgent benchmark consists of tasks where user instructions are sufficiently clear to allow complete planning, we will revise the manuscript to include quantitative metrics on planner success rate, coverage for multi-step instructions, and discussion of potential failure modes. This addition will better substantiate the conditions under which the 0% ASR is achieved. revision: yes
- Referee: [Method] §3 (Method), Hierarchical Verification Mechanism: The distinction between benign parameter formatting variances and malicious hijacks is delegated to an LLM-based Intent Verifier whose reliability is not ablated against disguised or syntactically similar attacks; this assumption is load-bearing for the 0% ASR result but unsupported by targeted experiments.
  Authors: The referee raises a valid point regarding the lack of targeted evaluation for the Intent Verifier. The hard constraints form the primary defense against unauthorized tools, while the verifier handles nuanced parameter cases. To strengthen this, we will add ablation studies and targeted experiments in the revised version that evaluate the Intent Verifier's performance against disguised and syntactically similar attacks, including cases designed to mimic benign formatting. These experiments will help confirm the robustness of the 0% ASR claim. revision: yes
- Referee: [Experiments] Experiments section: No ablation studies isolate the contribution of hard constraints versus the Intent Verifier, nor test edge cases such as planner isolation failures or parameter deviations that mimic benign formatting, leaving the reported 1.49% FPR and 0% ASR vulnerable to benchmark-specific artifacts.
  Authors: We acknowledge the absence of explicit ablations in the current manuscript. The hierarchical design ensures that hard constraints block most attacks, with the verifier as a secondary check, which contributes to the low FPR. However, to address the concern about benchmark-specific artifacts, we will incorporate ablation studies isolating the hard constraints and Intent Verifier, as well as tests for edge cases like planner isolation failures and mimicking parameter deviations. This will demonstrate the method's effectiveness more comprehensively. revision: yes
Circularity Check
No significant circularity; the empirical evaluation on an external benchmark is self-contained.
Full rationale
The paper presents a training-free defense framework (isolated Planner generating reference actions from user instructions alone, followed by hard-constraint then Intent-Verifier stages) whose central performance claims are direct experimental measurements on the external InjecAgent benchmark (ASR reduced from 72.8% to 0%, FPR 1.49%). No equations, fitted parameters, load-bearing self-citations, or ansatzes appear in the provided text that would make any result equivalent to its inputs by construction. The framework is described as model-agnostic and compatible with prior methods rather than derived from them. This is the normal case of an applied systems paper whose validity rests on external falsifiable testing rather than internal redefinition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: User instructions alone contain enough information to enumerate a complete reference set of valid actions.
invented entities (2)
- Isolated Planner: no independent evidence
- Intent Verifier: no independent evidence
Forward citations
Cited by 1 Pith paper
- When Alignment Isn't Enough: Response-Path Attacks on LLM Agents. A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Reference graph
Works this paper leans on
- [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [2] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," Advances in Neural Information Processing Systems, vol. 36, pp. 68539–68551, 2023.
- [3] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," in The Eleventh International Conference on Learning Representations, 2022.
- [4] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman et al., "Do as I can, not as I say: Grounding language in robotic affordances," arXiv preprint arXiv:2204.01691, 2022.
- [5] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su, "LLM-Planner: Few-shot grounded planning for embodied agents with large language models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2998–3009.
- [6] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," 2023. Available: https://arxiv.org/abs/2302.12173
- [8] F. Perez and I. Ribeiro, "Ignore previous prompt: Attack techniques for language models," 2022. Available: https://arxiv.org/abs/2211.09527
- [9] Y. Liu, G. Deng, Y. Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y. Liu, H. Wang, Y. Zheng, L. Y. Zhang, and Y. Liu, "Prompt injection attack against LLM-integrated applications," 2025. Available: https://arxiv.org/abs/2306.05499
- [10] Q. Zhan, Z. Liang, Z. Ying, and D. Kang, "InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents," 2024. Available: https://arxiv.org/abs/2403.02691
- [11] ProtectAI.com, "Fine-tuned DeBERTa-v3 for prompt injection detection," 2023. Available: https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection
- [12] G. Alon and M. Kamfonas, "Detecting language model attacks with perplexity," arXiv preprint arXiv:2308.14132, 2023.
- [13] S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaudhuri, D. Wagner, and C. Guo, "SecAlign: Defending against prompt injection with preference optimization," in Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, 2025, pp. 2833–2847.
- [14] S. Chen, J. Piet, C. Sitawarin, and D. Wagner, "StruQ: Defending against prompt injection with structured queries," in 34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 2383–2400.
- [15] T. Rebedea, R. Dinu, M. N. Sreedhar, C. Parisien, and J. Cohen, "NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023, pp. 431–445.
- [16] H. Wang, C. M. Poskitt, and J. Sun, "AgentSpec: Customizable runtime enforcement for safe and reliable LLM agents," arXiv preprint arXiv:2503.18666, 2025.
- [17] M. Kang, C. Xiang, S. Kariyappa, C. Xiao, B. Li, and E. Suh, "Mitigating indirect prompt injection via instruction-following intent analysis," arXiv preprint arXiv:2512.00966, 2025.
- [18] A. Wei, N. Haghtalab, and J. Steinhardt, "Jailbroken: How does LLM safety training fail?" Advances in Neural Information Processing Systems, vol. 36, pp. 80079–80110, 2023.
- [19] N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, "Baseline defenses for adversarial attacks against aligned language models," arXiv preprint arXiv:2309.00614, 2023.
- [20] J. Shi, Z. Yuan, Y. Liu, Y. Huang, P. Zhou, L. Sun, and N. Z. Gong, "Optimization-based prompt injection attack to LLM-as-a-judge," 2025. Available: https://arxiv.org/abs/2403.17710
- [21] S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire et al., "Open problems and fundamental limitations of reinforcement learning from human feedback," arXiv preprint arXiv:2307.15217, 2023.
- [22] E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Cheng et al., "Sleeper agents: Training deceptive LLMs that persist through safety training," arXiv preprint arXiv:2401.05566, 2024.