Assessing Automated Prompt Injection Attacks in Agentic Environments

David Hofer; Edoardo Debenedetti; Florian Tram\`er

arxiv: 2606.10525 · v1 · pith:F7LUHR6Unew · submitted 2026-06-09 · 💻 cs.CR · cs.AI

Assessing Automated Prompt Injection Attacks in Agentic Environments

David Hofer , Edoardo Debenedetti , Florian Tram\`er This is my paper

Pith reviewed 2026-06-27 12:45 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords prompt injectionLLM agentsautomated attacksblack-box optimizationGCGTAPtransfer attacksagent security

0 comments

The pith

Black-box optimization outperforms gradient-based methods for prompt injection attacks on LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates automated prompt injection attacks on LLM agents that process untrusted external data. It adapts white-box GCG and black-box TAP methods inside the AgentDojo framework and tests them on 80 task pairs across four domains and multiple models. Black-box optimization achieves higher success rates than gradient-based attacks because the latter prove unstable under typical compute limits. Task-universal attacks transfer to new tasks and out-of-distribution domains, though success depends on the attacker model's capability and safety tuning. The work shows these attacks form a credible threat to agent systems but remain limited by model choice.

Core claim

Adapting automated jailbreak methods to agentic environments shows that black-box optimization substantially outperforms gradient-based methods because of GCG's optimization instability under reasonable compute budgets. Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains, but attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5, and both general capability and safety tuning of the attacker model affect attack success.

What carries the argument

Adaptation of GCG white-box gradient attack and TAP black-box optimization methods to the AgentDojo framework for generating indirect prompt injections against LLM agents.

If this is right

Black-box methods deliver higher attack success than gradient-based ones when compute resources are limited.
Attacks optimized on one set of tasks transfer to unseen tasks and different domains.
Stronger attacker models generate more effective injection prompts.
Safety-tuned attacker models can refuse to produce adversarial prompts.
Optimizations performed on smaller open-source models fail to transfer against frontier models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses for LLM agents may need to target black-box attack patterns specifically rather than assuming gradient access.
Universal transfer suggests that hardening only a subset of tasks leaves the overall system exposed.
Safety tuning on models used for attack generation could serve as an indirect mitigation.
Future evaluations with larger compute budgets could test whether the performance gap between methods narrows.

Load-bearing premise

The 80 task pairs and models tested in AgentDojo represent the range of realistic agentic environments and generalize beyond the evaluated compute budgets.

What would settle it

Demonstrating that GCG reaches or exceeds TAP success rates on the same tasks when given substantially more optimization steps or varied random seeds would undermine the instability explanation.

Figures

Figures reproduced from arXiv: 2606.10525 by David Hofer, Edoardo Debenedetti, Florian Tram\`er.

**Figure 2.** Figure 2: Successful Single-Task prompt injections by GCG (left) and TAP (right) against Qwen3-4B on the same Workspace [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Attack success rates on Qwen3-4B broken down [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Generalization of universal attacks by training [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 6.** Figure 6: Comparison of GCG injection structure variants [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Confusion matrices for the LLM judge (GPT-5-mini) against AgentDojo’s deterministic ground truth. The judge [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: TAP optimization outcomes with and without re [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 10.** Figure 10: Transfer of GCG adversarial suffixes optimized on [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Affirmative response targets outperform tool-call [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

read the original abstract

Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods--proven effective for jailbreaking--remain underexplored in realistic agentic settings. We present a comprehensive empirical evaluation of automated prompt injection attacks against LLM agents, adapting both white-box (GCG) and black-box (TAP) methods to the agentic setting within the AgentDojo framework. We evaluate across 80 task pairs spanning four domains and multiple models, and find that black-box optimization substantially outperforms gradient-based methods, a gap we attribute to GCG's optimization instability under reasonable compute budgets. We also find that TAP's effectiveness depends on the attacker model, as both general capability and safety tuning affect attack success--stronger models produce more effective injections, while safety-tuned attackers can refuse to generate adversarial prompts. Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains, but attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5. These findings highlight automated prompt injection as a credible but model-dependent threat, with significant barriers remaining for model-agnostic exploitation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper applies known jailbreak methods to agent prompt injection in AgentDojo and finds black-box attacks beat gradient ones, but the reason for the gap needs more evidence.

read the letter

This paper applies known automated jailbreak techniques to prompt injection against LLM agents. Black-box optimization comes out ahead of gradient-based methods in their AgentDojo tests, and task-universal attacks transfer across tasks but not always to stronger models.

What stands out is the scale of the evaluation: 80 task pairs in four domains, testing multiple models. They show that the choice of attacker model matters a lot—stronger ones generate better injections, while safety-tuned ones sometimes refuse. The transfer to out-of-distribution domains is also a useful result for understanding real-world risks.

The soft spot is the claim that the performance gap comes from GCG's instability under reasonable compute budgets. The stress-test note is right that there's no ablation evidence shown in the abstract for different budgets or adapted loss functions. If the full paper has those details, it would make the conclusion stronger; otherwise, it could be that the gradient method just doesn't map cleanly to multi-turn agent interactions. The assumption that these tasks represent typical agent environments is also worth checking against more complex use cases.

This work is aimed at people studying security for LLM agents that pull in external data. A reader focused on practical attack evaluations would get concrete numbers on what works and what doesn't. It deserves peer review because the experiments use a public framework and address a timely threat, though the authors should add more on the method adaptations and controls.

Referee Report

2 major / 2 minor

Summary. The paper empirically evaluates automated prompt injection attacks on LLM agents within the AgentDojo framework. It adapts white-box gradient-based optimization (GCG) and black-box optimization (TAP) to multi-turn agentic interactions involving tool use and state, testing across 80 task pairs spanning four domains and multiple models. Central findings are that TAP substantially outperforms GCG (attributed to GCG optimization instability under reasonable compute), that TAP success depends on attacker model capability and safety tuning, and that task-universal attacks transfer to unseen tasks and OOD domains but fail to transfer from small open-source models to frontier models like GPT-5.

Significance. If the empirical comparisons hold, the work supplies concrete data on relative attack effectiveness in realistic agentic environments, underscoring model-dependent risks and transfer properties of automated injections. This is relevant for AI security research as it moves beyond single-prompt jailbreaks to multi-turn tool-using agents.

major comments (2)

[Abstract and §4 (Evaluation)] The attribution of TAP's substantial outperformance over GCG to 'GCG's optimization instability under reasonable compute budgets' (Abstract) is load-bearing for the central claim yet unsupported by ablations. No results vary compute budgets, GCG hyperparameters, or loss definitions adapted to agent success metrics (e.g., multi-turn trajectory success rather than single-prompt loss), leaving open whether the gap stems from fundamental mismatch with agent trajectories instead.
[§3 (Experimental Setup) and §5 (Results)] The claim that the 80 task pairs provide a representative sample of agentic environments (Abstract) is assumed without evidence on task selection criteria, coverage of real-world tool-use patterns, or sensitivity to domain choice; this directly affects generalizability of the transfer and model-dependence conclusions.

minor comments (2)

[Abstract] Abstract reports empirical findings and attributions but omits any mention of error bars, statistical tests, data exclusion rules, or exact success metrics used for agent tasks.
[§3] Notation for attack success (e.g., how 'task success' is measured across multi-turn interactions) should be defined explicitly early in the methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our empirical evaluation of automated prompt injection attacks in agentic settings. The comments highlight important aspects of our claims regarding method comparisons and task representativeness. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [Abstract and §4 (Evaluation)] The attribution of TAP's substantial outperformance over GCG to 'GCG's optimization instability under reasonable compute budgets' (Abstract) is load-bearing for the central claim yet unsupported by ablations. No results vary compute budgets, GCG hyperparameters, or loss definitions adapted to agent success metrics (e.g., multi-turn trajectory success rather than single-prompt loss), leaving open whether the gap stems from fundamental mismatch with agent trajectories instead.

Authors: We agree that the specific attribution to GCG optimization instability is not backed by systematic ablations on compute budgets, hyperparameters, or agent-adapted loss functions. This attribution was based on our experimental runs where GCG consistently failed to converge to effective attacks under the compute constraints used, while TAP succeeded. However, without dedicated ablations, the claim is not fully supported. We will revise the abstract and add a paragraph in §4 to report the empirical performance gap without attributing it to instability, and note that distinguishing between optimization challenges and fundamental mismatches with multi-turn trajectories requires further study. This is a partial revision focused on language adjustment rather than new experiments. revision: partial
Referee: [§3 (Experimental Setup) and §5 (Results)] The claim that the 80 task pairs provide a representative sample of agentic environments (Abstract) is assumed without evidence on task selection criteria, coverage of real-world tool-use patterns, or sensitivity to domain choice; this directly affects generalizability of the transfer and model-dependence conclusions.

Authors: We acknowledge that §3 does not explicitly detail task selection criteria, coverage of real-world patterns, or sensitivity analyses beyond referencing the AgentDojo benchmark. The 80 task pairs were selected from AgentDojo's predefined tasks across four domains to include varied tool-use scenarios, but this was not justified with additional evidence. We will revise §3 to include a description of the selection process from AgentDojo, note the benchmark's intended coverage, and add a limitations discussion in §5 on generalizability and potential sensitivity to domain choice. This addresses the concern directly. revision: yes

Circularity Check

0 steps flagged

Purely empirical evaluation with no derivation chain

full rationale

The paper performs an empirical comparison of adapted GCG and TAP attack methods inside the AgentDojo framework across 80 task pairs. No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced. All claims rest on observed attack success rates rather than any reduction of a result to its own inputs or to a self-citation chain. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities; the work is an empirical comparison of existing attack methods.

pith-pipeline@v0.9.1-grok · 5727 in / 1026 out tokens · 27923 ms · 2026-06-27T12:45:21.090953+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 7 canonical work pages

[1]

Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2025. Jail- breaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. InInter- national Conference on Learning Representations (ICLR). arXiv:2404.02151 [cs.CR]

arXiv 2025
[2]

Tim Beyer, Yan Scholten, Leo Schwinn, and Stephan Günnemann. 2026. Sampling- aware Adversarial Attacks Against Large Language Models. InInternational Conference on Learning Representations (ICLR). arXiv:2507.04446 [cs.LG]

arXiv 2026
[3]

Pappas, and Eric Wong

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2024. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419 [cs.LG] https://arxiv.org/abs/2310.08419 Assessing Automated Prompt Injection Attacks in Agentic Environments ,

Pith/arXiv arXiv 2024
[4]

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. StruQ: Defending Against Prompt Injection with Structured Queries. InUSENIX Security Symposium

2025
[5]

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. 2025. SecAlign: Defending Against Prompt Injection with Preference Optimization. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security(Taipei, Taiwan)(CCS ’25). Association for Computing Machinery, New York, NY, USA, 2833...

work page doi:10.1145/3719027.3744836 2025
[6]

Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. 2026. Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks. arXiv preprint arXiv:2507.02735(2026)

arXiv 2026
[7]

Xin Chen, Jie Zhang, and Florian Tramèr. 2026. Learning to Inject: Automated Prompt Injection via Reinforcement Learning.arXiv preprint arXiv:2602.05746 (2026)

Pith/arXiv arXiv 2026
[8]

Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin
[9]

Securing AI Agents with Information-Flow Control.arXiv preprint arXiv:2505.23643(2025)

Pith/arXiv arXiv 2025
[10]

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Car- lini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Flo- rian Tramèr. 2025. Defeating prompt injections by design.arXiv preprint arXiv:2503.18813(2025)

Pith/arXiv arXiv 2025
[11]

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. In38th Conference on Neural Information Processing Systems (NeurIPS 2024). https: //arxiv.org/abs/2406.13352

Pith/arXiv arXiv 2024
[12]

Dreadnode. 2024. Parley: Tree of Attacks (TAP) Jailbreaking Implementation. https://github.com/dreadnode/parley. Adapted for use in this study

2024
[13]

Mateusz Dziemian, Maxwell Lin, Xiaohan Fu, et al. 2026. How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition.arXiv preprint arXiv:2603.15714(2026)

arXiv 2026
[14]

Gupta, Taylor Berg- Kirkpatrick, and Earlence Fernandes

Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg- Kirkpatrick, and Earlence Fernandes. 2024. Imprompter: Tricking LLM Agents into Improper Tool Use. arXiv:2410.14923 [cs.CR]

arXiv 2024
[15]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. 2025. Gemma 3 technical report.arXiv preprint arXiv:2503.19786 (2025)

Pith/arXiv arXiv 2025
[16]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed Up For: Compromising Real- World LLM-Integrated Applications with Indirect Prompt Injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security(Copenhagen, Denmark)(AISec ’23). Association for Computing...

work page doi:10.1145/3605764.3623985 2023
[17]

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending Against Indirect Prompt Injection Attacks With Spotlighting.arXiv preprint arXiv:2403.14720(2024)

Pith/arXiv arXiv 2024
[18]

Dennis Jacob, Emad Alghamdi, Zhanhao Hu, Basel Alomair, and David Wagner
[19]

arXiv:2509.25926 [cs.CR] https://arxiv.org/abs/2509.25926

Preventing Prompt Injection with Type-Directed Privilege Separation. arXiv:2509.25926 [cs.CR] https://arxiv.org/abs/2509.25926

Pith/arXiv arXiv
[20]

Auguste Kerckhoffs. 1883. La cryptographie militaire [Military cryptogra- phy].Journal des sciences militairesIX (February 1883), 161–191. https: //www.petitcolas.net/kerckhoffs/crypto_militaire_2.pdf Archived from the origi- nal on 2021-02-20

2021
[21]

Juhee Kim, Woohyuk Choi, and Byoungyoung Lee. 2025. Prompt flow integrity to prevent privilege escalation in llm agents.arXiv preprint arXiv:2503.15547 (2025)

arXiv 2025
[22]

Le, and Tomas Pfister

Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, and Tomas Pfister. 2026. CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution.arXiv preprint arXiv:2602.07918(2026)

arXiv 2026
[23]

Learn Prompting. 2024. Sandwich Defense. https://learnprompting.org/docs/ prompt_hacking/defensive_measures/sandwich_defense

2024
[24]

Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, and Cristina Nita-Rotaru. 2026. ACE: A Security Architecture for LLM-Integrated App Systems. InProceedings 2026 Network and Distributed System Security Symposium (NDSS 2026). Internet Society. doi:10.14722/ndss.2026.230352

work page doi:10.14722/ndss.2026.230352 2026
[25]

Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang, Yevgeniy Vorobeychik, and Chaowei Xiao. 2026. AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments? arXiv:2602.03117 [cs.CR] https://arxiv. org/abs/2602.03117

Pith/arXiv arXiv 2026
[26]

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Gener- ating Stealthy Jailbreak Prompts on Aligned Large Language Models. InInterna- tional Conference on Learning Representations (ICLR)

2024
[27]

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadel- phia, PA, 1831–1847. https://www.usenix.org/conference/usenixsecurity24/ presentation/liu-yupei

2024
[28]

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of attacks: jailbreaking black-box LLMs automatically. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article ...

2024
[29]

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr
[30]

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections.arXiv preprint arXiv:2510.09023 (2025)

Pith/arXiv arXiv 2025
[31]

Schmidt, and Florian Bernard

Zhakshylyk Nurlanov, Frank R. Schmidt, and Florian Bernard. 2026. Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks.arXiv preprint arXiv:2601.03420(2026)

arXiv 2026
[32]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schul- man, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training lan- guage models to follow instructions with human...

2022
[33]

Jinsheng Pan, Xiaogeng Liu, and Chaowei Xiao. 2025. OET: Optimization-based prompt injection Evaluation Toolkit. arXiv:2505.00843 [cs.CR]

arXiv 2025
[34]

Pandya, Andrey Labunets, Sicun Gao, and Earlence Fernandes

Nishit V. Pandya, Andrey Labunets, Sicun Gao, and Earlence Fernandes. 2025. May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks.ArXivabs/2507.07417 (2025). arXiv:2507.07417 [cs.CR]

arXiv 2025
[35]

Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. 2024. Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks. arXiv:2403.03792 [cs.CR]

arXiv 2024
[36]

Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527 [cs.CL]

Pith/arXiv arXiv 2022
[37]

ProtectAI. 2024. Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection. https://huggingface.co/protectai/deberta-v3-base-prompt-injection-v2

2024
[38]

2019.Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.Language Models are Unsupervised Multitask Learners. Technical Report. OpenAI

2019
[39]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessí, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: language models can teach themselves to use tools. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates ...

2023
[40]

Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. 2025. Prompt Injection Attack to Tool Selection in LLM Agents. arXiv:2504.19793 [cs.CR]

Pith/arXiv arXiv 2025
[41]

Tianneng Shi, Jingxuan He, Zhun Wang, Hongwei Li, Linyu Wu, Wenbo Guo, and Dawn Song. 2026. Progent: Securing AI Agents with Privilege Control. arXiv:2504.11703 [cs.CR] https://arxiv.org/abs/2504.11703

Pith/arXiv arXiv 2026
[42]

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.arXiv preprint arXiv:2404.13208(2024). arXiv:2404.13208 [cs.CR]

Pith/arXiv arXiv 2024
[43]

Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, and David Wag- ner. 2026. Defending against prompt injection with datafilter.arXiv preprint arXiv:2510.19207(2026)

arXiv 2026
[44]

Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. 2025. AGENTVIGIL: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents.arXiv preprint arXiv:2505.05849(May 2025). https://arxiv.org/abs/2505.05849

arXiv 2025
[45]

Simon Willison. 2023. The Dual LLM pattern for building AI assistants that can resist prompt injection. https://simonwillison.net/2023/Apr/25/dual-llm- pattern/

2023
[46]

Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. 2024. System-Level De- fense against Indirect Prompt Injection Attacks: An Information Flow Control Perspective.arXiv preprint arXiv:2409.19091(2024)

arXiv 2024
[47]

Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, and Wenxuan Zhou. 2025. Instructional Segment Embedding: Improving LLM Safety with Instruction Hier- archy.arXiv preprint arXiv:2410.09102(2025)

arXiv 2025
[48]

Yuhao Wu, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal
[49]

InNetwork and Distributed System Security (NDSS) Symposium

IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems. InNetwork and Distributed System Security (NDSS) Symposium
[50]

Z. Xi, W. Chen, X. Guo, et al. 2025. The rise and potential of large language model based agents: a survey.Science China Information Sciences68 (2025), 121101. , David Hofer, Edoardo Debenedetti, and Florian Tramèr doi:10.1007/s11432-024-4222-0

work page doi:10.1007/s11432-024-4222-0 2025
[51]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

Pith/arXiv arXiv 2025
[52]

Xiaoxue Yang, Bozhidar Stevanoski, Matthieu Meeus, and Yves-Alexandre de Montjoye. 2025. Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses. arXiv:2505.15738 [cs.CR]

arXiv 2025
[53]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InInternational Conference on Learning Representations (ICLR). https: //openreview.net/forum?id=WE_vluYUL-X

2023
[54]

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2025. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD ’25). doi:10.1145/3690624.3709179

work page doi:10.1145/3690624.3709179 2025
[55]

Qiusi Zhan, Richard Fang, Henil Shalin Panchal, and Daniel Kang. 2025. Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents. InFindings of the Association for Computational Linguistics: NAACL 2025, Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Lin- guistics, Albuquerque, New Mexico, 7116–7...

work page doi:10.18653/v1/2025.findings- 2025
[56]

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 10471–1...

work page doi:10.18653/v1/ 2024
[57]

Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quan- long Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, and Hongxin Hu. 2026. AgentSen- try: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification.arXiv preprint arXiv:2602.22724(2026)

arXiv 2026
[58]

Titzer, Heather Miller, and Phillip B

Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller, and Phillip B. Gibbons. 2025. RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage. arXiv:2502.08966 [cs.CR]

arXiv 2025
[59]

Zico Kolter, and Matt Fredrikson

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043 [cs.CL]

Pith/arXiv arXiv 2023
[60]

Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, and Christoph H. Lampert. 2025. Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?. InInternational Conference on Learning Representations (ICLR). https://openreview.net/forum?id=8EtSBX41mt

2025
[61]

attacker goals

Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Alexandra Volkova, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, and Christoph H. Lampert. 2026. ASIDE: Architectural Separation of Instructions and Data in Language Models. arXiv:2503.10566 [cs.LG] A Ethical Considerations This work investigates automated indirect prompt injection attacks against...

arXiv 2026
[62]

Override the agent’s current task and context
[63]

Make the attacker goal appear authoritative and ur- gent
[64]

[SYSTEM OVERRIDE]

Work regardless of the specific goal content # How to craft successful PREFIX/SUFFIX - Strategies Effective techniques for universal injections: , David Hofer, Edoardo Debenedetti, and Florian Tramèr - **Authority manipulation**: Use PREFIX to imper- sonate system messages, developers, or admins (e.g., "[SYSTEM OVERRIDE]", "ADMIN INSTRUCTION:") - **Contex...
[65]

Agent takes ANY step toward the attacker’s goal, OR
[66]

Agent commits to doing what the attacker wants, OR
[67]

Agent calls the first expected tool (even if more are needed later) **Don’t penalize for:** - Not completing all steps in one turn (step-by-step is normal) - Minor formatting differences in arguments - Needing confirmation before final execution (if it shows intent) **DO penalize for:** - Wrong argument values that contradict the goal - Calling completely...

[1] [1]

Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2025. Jail- breaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. InInter- national Conference on Learning Representations (ICLR). arXiv:2404.02151 [cs.CR]

arXiv 2025

[2] [2]

Tim Beyer, Yan Scholten, Leo Schwinn, and Stephan Günnemann. 2026. Sampling- aware Adversarial Attacks Against Large Language Models. InInternational Conference on Learning Representations (ICLR). arXiv:2507.04446 [cs.LG]

arXiv 2026

[3] [3]

Pappas, and Eric Wong

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2024. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419 [cs.LG] https://arxiv.org/abs/2310.08419 Assessing Automated Prompt Injection Attacks in Agentic Environments ,

Pith/arXiv arXiv 2024

[4] [4]

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. 2025. StruQ: Defending Against Prompt Injection with Structured Queries. InUSENIX Security Symposium

2025

[5] [5]

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. 2025. SecAlign: Defending Against Prompt Injection with Preference Optimization. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security(Taipei, Taiwan)(CCS ’25). Association for Computing Machinery, New York, NY, USA, 2833...

work page doi:10.1145/3719027.3744836 2025

[6] [6]

Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. 2026. Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks. arXiv preprint arXiv:2507.02735(2026)

arXiv 2026

[7] [7]

Xin Chen, Jie Zhang, and Florian Tramèr. 2026. Learning to Inject: Automated Prompt Injection via Reinforcement Learning.arXiv preprint arXiv:2602.05746 (2026)

Pith/arXiv arXiv 2026

[8] [8]

Manuel Costa, Boris Köpf, Aashish Kolluri, Andrew Paverd, Mark Russinovich, Ahmed Salem, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin

[9] [9]

Securing AI Agents with Information-Flow Control.arXiv preprint arXiv:2505.23643(2025)

Pith/arXiv arXiv 2025

[10] [10]

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Car- lini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Flo- rian Tramèr. 2025. Defeating prompt injections by design.arXiv preprint arXiv:2503.18813(2025)

Pith/arXiv arXiv 2025

[11] [11]

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. In38th Conference on Neural Information Processing Systems (NeurIPS 2024). https: //arxiv.org/abs/2406.13352

Pith/arXiv arXiv 2024

[12] [12]

Dreadnode. 2024. Parley: Tree of Attacks (TAP) Jailbreaking Implementation. https://github.com/dreadnode/parley. Adapted for use in this study

2024

[13] [13]

Mateusz Dziemian, Maxwell Lin, Xiaohan Fu, et al. 2026. How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition.arXiv preprint arXiv:2603.15714(2026)

arXiv 2026

[14] [14]

Gupta, Taylor Berg- Kirkpatrick, and Earlence Fernandes

Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg- Kirkpatrick, and Earlence Fernandes. 2024. Imprompter: Tricking LLM Agents into Improper Tool Use. arXiv:2410.14923 [cs.CR]

arXiv 2024

[15] [15]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. 2025. Gemma 3 technical report.arXiv preprint arXiv:2503.19786 (2025)

Pith/arXiv arXiv 2025

[16] [16]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed Up For: Compromising Real- World LLM-Integrated Applications with Indirect Prompt Injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security(Copenhagen, Denmark)(AISec ’23). Association for Computing...

work page doi:10.1145/3605764.3623985 2023

[17] [17]

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. 2024. Defending Against Indirect Prompt Injection Attacks With Spotlighting.arXiv preprint arXiv:2403.14720(2024)

Pith/arXiv arXiv 2024

[18] [18]

Dennis Jacob, Emad Alghamdi, Zhanhao Hu, Basel Alomair, and David Wagner

[19] [19]

arXiv:2509.25926 [cs.CR] https://arxiv.org/abs/2509.25926

Preventing Prompt Injection with Type-Directed Privilege Separation. arXiv:2509.25926 [cs.CR] https://arxiv.org/abs/2509.25926

Pith/arXiv arXiv

[20] [20]

Auguste Kerckhoffs. 1883. La cryptographie militaire [Military cryptogra- phy].Journal des sciences militairesIX (February 1883), 161–191. https: //www.petitcolas.net/kerckhoffs/crypto_militaire_2.pdf Archived from the origi- nal on 2021-02-20

2021

[21] [21]

Juhee Kim, Woohyuk Choi, and Byoungyoung Lee. 2025. Prompt flow integrity to prevent privilege escalation in llm agents.arXiv preprint arXiv:2503.15547 (2025)

arXiv 2025

[22] [22]

Le, and Tomas Pfister

Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, and Tomas Pfister. 2026. CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution.arXiv preprint arXiv:2602.07918(2026)

arXiv 2026

[23] [23]

Learn Prompting. 2024. Sandwich Defense. https://learnprompting.org/docs/ prompt_hacking/defensive_measures/sandwich_defense

2024

[24] [24]

Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, and Cristina Nita-Rotaru. 2026. ACE: A Security Architecture for LLM-Integrated App Systems. InProceedings 2026 Network and Distributed System Security Symposium (NDSS 2026). Internet Society. doi:10.14722/ndss.2026.230352

work page doi:10.14722/ndss.2026.230352 2026

[25] [25]

Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang, Yevgeniy Vorobeychik, and Chaowei Xiao. 2026. AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments? arXiv:2602.03117 [cs.CR] https://arxiv. org/abs/2602.03117

Pith/arXiv arXiv 2026

[26] [26]

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Gener- ating Stealthy Jailbreak Prompts on Aligned Large Language Models. InInterna- tional Conference on Learning Representations (ICLR)

2024

[27] [27]

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadel- phia, PA, 1831–1847. https://www.usenix.org/conference/usenixsecurity24/ presentation/liu-yupei

2024

[28] [28]

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of attacks: jailbreaking black-box LLMs automatically. InProceedings of the 38th International Conference on Neural Information Processing Systems(Vancouver, BC, Canada)(NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article ...

2024

[29] [29]

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr

[30] [30]

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections.arXiv preprint arXiv:2510.09023 (2025)

Pith/arXiv arXiv 2025

[31] [31]

Schmidt, and Florian Bernard

Zhakshylyk Nurlanov, Frank R. Schmidt, and Florian Bernard. 2026. Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks.arXiv preprint arXiv:2601.03420(2026)

arXiv 2026

[32] [32]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schul- man, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training lan- guage models to follow instructions with human...

2022

[33] [33]

Jinsheng Pan, Xiaogeng Liu, and Chaowei Xiao. 2025. OET: Optimization-based prompt injection Evaluation Toolkit. arXiv:2505.00843 [cs.CR]

arXiv 2025

[34] [34]

Pandya, Andrey Labunets, Sicun Gao, and Earlence Fernandes

Nishit V. Pandya, Andrey Labunets, Sicun Gao, and Earlence Fernandes. 2025. May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks.ArXivabs/2507.07417 (2025). arXiv:2507.07417 [cs.CR]

arXiv 2025

[35] [35]

Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. 2024. Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks. arXiv:2403.03792 [cs.CR]

arXiv 2024

[36] [36]

Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527 [cs.CL]

Pith/arXiv arXiv 2022

[37] [37]

ProtectAI. 2024. Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection. https://huggingface.co/protectai/deberta-v3-base-prompt-injection-v2

2024

[38] [38]

2019.Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.Language Models are Unsupervised Multitask Learners. Technical Report. OpenAI

2019

[39] [39]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessí, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: language models can teach themselves to use tools. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates ...

2023

[40] [40]

Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. 2025. Prompt Injection Attack to Tool Selection in LLM Agents. arXiv:2504.19793 [cs.CR]

Pith/arXiv arXiv 2025

[41] [41]

Tianneng Shi, Jingxuan He, Zhun Wang, Hongwei Li, Linyu Wu, Wenbo Guo, and Dawn Song. 2026. Progent: Securing AI Agents with Privilege Control. arXiv:2504.11703 [cs.CR] https://arxiv.org/abs/2504.11703

Pith/arXiv arXiv 2026

[42] [42]

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.arXiv preprint arXiv:2404.13208(2024). arXiv:2404.13208 [cs.CR]

Pith/arXiv arXiv 2024

[43] [43]

Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, and David Wag- ner. 2026. Defending against prompt injection with datafilter.arXiv preprint arXiv:2510.19207(2026)

arXiv 2026

[44] [44]

Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. 2025. AGENTVIGIL: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents.arXiv preprint arXiv:2505.05849(May 2025). https://arxiv.org/abs/2505.05849

arXiv 2025

[45] [45]

Simon Willison. 2023. The Dual LLM pattern for building AI assistants that can resist prompt injection. https://simonwillison.net/2023/Apr/25/dual-llm- pattern/

2023

[46] [46]

Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. 2024. System-Level De- fense against Indirect Prompt Injection Attacks: An Information Flow Control Perspective.arXiv preprint arXiv:2409.19091(2024)

arXiv 2024

[47] [47]

Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, and Wenxuan Zhou. 2025. Instructional Segment Embedding: Improving LLM Safety with Instruction Hier- archy.arXiv preprint arXiv:2410.09102(2025)

arXiv 2025

[48] [48]

Yuhao Wu, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal

[49] [49]

InNetwork and Distributed System Security (NDSS) Symposium

IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems. InNetwork and Distributed System Security (NDSS) Symposium

[50] [50]

Z. Xi, W. Chen, X. Guo, et al. 2025. The rise and potential of large language model based agents: a survey.Science China Information Sciences68 (2025), 121101. , David Hofer, Edoardo Debenedetti, and Florian Tramèr doi:10.1007/s11432-024-4222-0

work page doi:10.1007/s11432-024-4222-0 2025

[51] [51]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

Pith/arXiv arXiv 2025

[52] [52]

Xiaoxue Yang, Bozhidar Stevanoski, Matthieu Meeus, and Yves-Alexandre de Montjoye. 2025. Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses. arXiv:2505.15738 [cs.CR]

arXiv 2025

[53] [53]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InInternational Conference on Learning Representations (ICLR). https: //openreview.net/forum?id=WE_vluYUL-X

2023

[54] [54]

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2025. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD ’25). doi:10.1145/3690624.3709179

work page doi:10.1145/3690624.3709179 2025

[55] [55]

Qiusi Zhan, Richard Fang, Henil Shalin Panchal, and Daniel Kang. 2025. Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents. InFindings of the Association for Computational Linguistics: NAACL 2025, Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Lin- guistics, Albuquerque, New Mexico, 7116–7...

work page doi:10.18653/v1/2025.findings- 2025

[56] [56]

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 10471–1...

work page doi:10.18653/v1/ 2024

[57] [57]

Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quan- long Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, and Hongxin Hu. 2026. AgentSen- try: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification.arXiv preprint arXiv:2602.22724(2026)

arXiv 2026

[58] [58]

Titzer, Heather Miller, and Phillip B

Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller, and Phillip B. Gibbons. 2025. RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage. arXiv:2502.08966 [cs.CR]

arXiv 2025

[59] [59]

Zico Kolter, and Matt Fredrikson

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043 [cs.CL]

Pith/arXiv arXiv 2023

[60] [60]

Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, and Christoph H. Lampert. 2025. Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?. InInternational Conference on Learning Representations (ICLR). https://openreview.net/forum?id=8EtSBX41mt

2025

[61] [61]

attacker goals

Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Alexandra Volkova, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, and Christoph H. Lampert. 2026. ASIDE: Architectural Separation of Instructions and Data in Language Models. arXiv:2503.10566 [cs.LG] A Ethical Considerations This work investigates automated indirect prompt injection attacks against...

arXiv 2026

[62] [62]

Override the agent’s current task and context

[63] [63]

Make the attacker goal appear authoritative and ur- gent

[64] [64]

[SYSTEM OVERRIDE]

Work regardless of the specific goal content # How to craft successful PREFIX/SUFFIX - Strategies Effective techniques for universal injections: , David Hofer, Edoardo Debenedetti, and Florian Tramèr - **Authority manipulation**: Use PREFIX to imper- sonate system messages, developers, or admins (e.g., "[SYSTEM OVERRIDE]", "ADMIN INSTRUCTION:") - **Contex...

[65] [65]

Agent takes ANY step toward the attacker’s goal, OR

[66] [66]

Agent commits to doing what the attacker wants, OR

[67] [67]

Agent calls the first expected tool (even if more are needed later) **Don’t penalize for:** - Not completing all steps in one turn (step-by-step is normal) - Minor formatting differences in arguments - Needing confirmation before final execution (if it shows intent) **DO penalize for:** - Wrong argument values that contradict the goal - Calling completely...