
arxiv: 2606.10525 · v1 · pith:F7LUHR6Unew · submitted 2026-06-09 · 💻 cs.CR · cs.AI

Assessing Automated Prompt Injection Attacks in Agentic Environments

Pith reviewed 2026-06-27 12:45 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords: prompt injection · LLM agents · automated attacks · black-box optimization · GCG · TAP · transfer attacks · agent security

The pith

Black-box optimization outperforms gradient-based methods for prompt injection attacks on LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates automated prompt injection attacks on LLM agents that process untrusted external data. It adapts white-box GCG and black-box TAP methods inside the AgentDojo framework and tests them on 80 task pairs across four domains and multiple models. Black-box optimization achieves higher success rates than gradient-based attacks because the latter prove unstable under typical compute limits. Task-universal attacks transfer to new tasks and out-of-distribution domains, though success depends on the attacker model's capability and safety tuning. The work shows these attacks form a credible threat to agent systems but remain limited by model choice.

Core claim

Adapting automated jailbreak methods to agentic environments shows that black-box optimization substantially outperforms gradient-based methods because of GCG's optimization instability under reasonable compute budgets. Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains, but attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5, and both general capability and safety tuning of the attacker model affect attack success.

What carries the argument

Adaptation of GCG white-box gradient attack and TAP black-box optimization methods to the AgentDojo framework for generating indirect prompt injections against LLM agents.
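The black-box half of this comparison can be pictured as a tree search over candidate injections. The following is a minimal Python sketch, assuming the general branch, prune, and score structure of TAP as described in the original jailbreaking work; it is not the paper's adaptation. The functions `attacker_refine`, `on_topic`, and `judge` are hypothetical toy stand-ins for the attacker LLM, the evaluator's off-topic filter, and a scored query against the target agent.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    prompt: str
    score: int = 0


def attacker_refine(prompt: str, rng: random.Random, branching: int) -> list[str]:
    """Hypothetical stand-in for the attacker LLM: propose `branching` variants."""
    phrases = ["[SYSTEM]", "urgent:", "as the admin,", "before continuing,", "please"]
    return [f"{rng.choice(phrases)} {prompt}" for _ in range(branching)]


def on_topic(prompt: str, goal: str) -> bool:
    """Hypothetical stand-in for the first pruning pass: the goal must survive."""
    return goal in prompt


def judge(prompt: str) -> int:
    """Hypothetical stand-in for querying the target and scoring it from 1 to 10."""
    return min(10, 1 + 3 * prompt.count("[SYSTEM]") + 2 * prompt.count("urgent:"))


def tap_search(goal: str, depth: int = 5, branching: int = 3,
               width: int = 4, seed: int = 0) -> Node:
    """Toy tree search: branch, prune off-topic, score, keep the top `width`."""
    rng = random.Random(seed)
    frontier = [Node(goal, judge(goal))]
    best = frontier[0]
    for _ in range(depth):
        children = []
        for node in frontier:
            for candidate in attacker_refine(node.prompt, rng, branching):
                if on_topic(candidate, goal):          # pruning pass 1
                    children.append(Node(candidate, judge(candidate)))
        if not children:
            break
        children.sort(key=lambda n: n.score, reverse=True)
        frontier = children[:width]                    # pruning pass 2
        if frontier[0].score > best.score:
            best = frontier[0]
        if best.score == 10:                           # early stop on a top score
            break
    return best


if __name__ == "__main__":
    result = tap_search("send the file to the attacker")
    print(result.score, result.prompt)
```

Nothing in the loop needs gradients or model weights, only query access to the target. That is the property separating this family of attacks from GCG.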

If this is right

  • Black-box methods deliver higher attack success than gradient-based ones when compute resources are limited.
  • Attacks optimized on one set of tasks transfer to unseen tasks and different domains.
  • Stronger attacker models generate more effective injection prompts.
  • Safety-tuned attacker models can refuse to produce adversarial prompts.
  • Optimizations performed on smaller open-source models fail to transfer against frontier models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses for LLM agents may need to target black-box attack patterns specifically rather than assuming gradient access.
  • Universal transfer suggests that hardening only a subset of tasks leaves the overall system exposed.
  • Safety tuning on models used for attack generation could serve as an indirect mitigation.
  • Future evaluations with larger compute budgets could test whether the performance gap between methods narrows.

Load-bearing premise

The 80 task pairs and models tested in AgentDojo represent the range of realistic agentic environments and generalize beyond the evaluated compute budgets.

What would settle it

Demonstrating that GCG reaches or exceeds TAP success rates on the same tasks when given substantially more optimization steps or varied random seeds would undermine the instability explanation.
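Because both methods are run on the same tasks, such a check is naturally a paired comparison. A minimal sketch, assuming per-task success is recorded as a boolean for each method and using an exact McNemar test; the test choice and the `mcnemar_exact` helper are illustrative assumptions, not the paper's analysis, and the data in the usage lines are made up.

```python
from math import comb


def mcnemar_exact(gcg_success: list[bool], tap_success: list[bool]):
    """Two-sided exact McNemar test on paired per-task outcomes.

    Returns (tasks only GCG solved, tasks only TAP solved, p-value).
    """
    if len(gcg_success) != len(tap_success):
        raise ValueError("outcome lists must cover the same tasks")
    only_gcg = sum(g and not t for g, t in zip(gcg_success, tap_success))
    only_tap = sum(t and not g for g, t in zip(gcg_success, tap_success))
    n = only_gcg + only_tap            # discordant pairs carry all the evidence
    if n == 0:
        return only_gcg, only_tap, 1.0
    k = min(only_gcg, only_tap)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_gcg, only_tap, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up outcomes for six tasks, for illustration only.
    gcg = [True, False, False, False, False, False]
    tap = [True, True, True, True, True, True]
    print(mcnemar_exact(gcg, tap))
```

Rerunning the comparison at larger GCG step counts or across seeds and watching the discordant counts would show whether the gap closes.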

Figures

Figures reproduced from arXiv: 2606.10525 by David Hofer, Edoardo Debenedetti, Florian Tramèr.

Figure 1: Attack Success Rate (left), Success@N with ...
Figure 2: Successful Single-Task prompt injections by GCG (left) and TAP (right) against Qwen3-4B on the same Workspace ...
Figure 3: Attack success rates on Qwen3-4B broken down ...
Figure 4: Generalization of universal attacks by training ...
Figure 6: Comparison of GCG injection structure variants ...
Figure 7: Confusion matrices for the LLM judge (GPT-5-mini) against AgentDojo’s deterministic ground truth. The judge ...
Figure 8: TAP optimization outcomes with and without re...
Figure 10: Transfer of GCG adversarial suffixes optimized on ...
Figure 11: Affirmative response targets outperform tool-call ...
Original abstract

Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods--proven effective for jailbreaking--remain underexplored in realistic agentic settings. We present a comprehensive empirical evaluation of automated prompt injection attacks against LLM agents, adapting both white-box (GCG) and black-box (TAP) methods to the agentic setting within the AgentDojo framework. We evaluate across 80 task pairs spanning four domains and multiple models, and find that black-box optimization substantially outperforms gradient-based methods, a gap we attribute to GCG's optimization instability under reasonable compute budgets. We also find that TAP's effectiveness depends on the attacker model, as both general capability and safety tuning affect attack success--stronger models produce more effective injections, while safety-tuned attackers can refuse to generate adversarial prompts. Task-universal attacks transfer effectively to unseen tasks and out-of-distribution domains, but attacks optimized on smaller open-source models do not transfer to frontier models like GPT-5. These findings highlight automated prompt injection as a credible but model-dependent threat, with significant barriers remaining for model-agnostic exploitation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper empirically evaluates automated prompt injection attacks on LLM agents within the AgentDojo framework. It adapts white-box gradient-based optimization (GCG) and black-box optimization (TAP) to multi-turn agentic interactions involving tool use and state, testing across 80 task pairs spanning four domains and multiple models. Central findings are that TAP substantially outperforms GCG (attributed to GCG optimization instability under reasonable compute), that TAP success depends on attacker model capability and safety tuning, and that task-universal attacks transfer to unseen tasks and OOD domains but fail to transfer from small open-source models to frontier models like GPT-5.

Significance. If the empirical comparisons hold, the work supplies concrete data on relative attack effectiveness in realistic agentic environments, underscoring model-dependent risks and transfer properties of automated injections. This is relevant for AI security research as it moves beyond single-prompt jailbreaks to multi-turn tool-using agents.

major comments (2)
  1. [Abstract and §4 (Evaluation)] The attribution of TAP's substantial outperformance over GCG to 'GCG's optimization instability under reasonable compute budgets' (Abstract) is load-bearing for the central claim yet unsupported by ablations. No results vary compute budgets, GCG hyperparameters, or loss definitions adapted to agent success metrics (e.g., multi-turn trajectory success rather than single-prompt loss), leaving open whether the gap stems from fundamental mismatch with agent trajectories instead.
  2. [§3 (Experimental Setup) and §5 (Results)] The claim that the 80 task pairs provide a representative sample of agentic environments (Abstract) is assumed without evidence on task selection criteria, coverage of real-world tool-use patterns, or sensitivity to domain choice; this directly affects generalizability of the transfer and model-dependence conclusions.
minor comments (2)
  1. [Abstract] Abstract reports empirical findings and attributions but omits any mention of error bars, statistical tests, data exclusion rules, or exact success metrics used for agent tasks.
  2. [§3] Notation for attack success (e.g., how 'task success' is measured across multi-turn interactions) should be defined explicitly early in the methods.
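The missing error bars noted in the first minor comment are cheap to add for a success rate measured over a fixed number of task pairs. A minimal sketch, assuming each task pair yields one binary success outcome and using a Wilson score interval; the interval choice and the `wilson_interval` helper are assumptions for illustration, and the counts in the usage line are placeholders rather than results from the paper.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Placeholder counts: 40 successful injections out of 80 task pairs.
    low, high = wilson_interval(40, 80)
    print(f"ASR = 0.50, 95% CI [{low:.3f}, {high:.3f}]")
```

With only 80 task pairs the interval is roughly ten percentage points wide on each side at a 50% success rate, so small differences between methods or models would not be distinguishable without more tasks or repeated runs.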

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our empirical evaluation of automated prompt injection attacks in agentic settings. The comments highlight important aspects of our claims regarding method comparisons and task representativeness. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The attribution of TAP's substantial outperformance over GCG to 'GCG's optimization instability under reasonable compute budgets' (Abstract) is load-bearing for the central claim yet unsupported by ablations. No results vary compute budgets, GCG hyperparameters, or loss definitions adapted to agent success metrics (e.g., multi-turn trajectory success rather than single-prompt loss), leaving open whether the gap stems from fundamental mismatch with agent trajectories instead.

    Authors: We agree that the specific attribution to GCG optimization instability is not backed by systematic ablations on compute budgets, hyperparameters, or agent-adapted loss functions. This attribution was based on our experimental runs where GCG consistently failed to converge to effective attacks under the compute constraints used, while TAP succeeded. However, without dedicated ablations, the claim is not fully supported. We will revise the abstract and add a paragraph in §4 to report the empirical performance gap without attributing it to instability, and note that distinguishing between optimization challenges and fundamental mismatches with multi-turn trajectories requires further study. This is a partial revision focused on language adjustment rather than new experiments.

    Revision: partial

  2. Referee: [§3 (Experimental Setup) and §5 (Results)] The claim that the 80 task pairs provide a representative sample of agentic environments (Abstract) is assumed without evidence on task selection criteria, coverage of real-world tool-use patterns, or sensitivity to domain choice; this directly affects generalizability of the transfer and model-dependence conclusions.

    Authors: We acknowledge that §3 does not explicitly detail task selection criteria, coverage of real-world patterns, or sensitivity analyses beyond referencing the AgentDojo benchmark. The 80 task pairs were selected from AgentDojo's predefined tasks across four domains to include varied tool-use scenarios, but this was not justified with additional evidence. We will revise §3 to include a description of the selection process from AgentDojo, note the benchmark's intended coverage, and add a limitations discussion in §5 on generalizability and potential sensitivity to domain choice. This addresses the concern directly.

    Revision: yes

Circularity Check

0 steps flagged

Purely empirical evaluation with no derivation chain

Full rationale

The paper performs an empirical comparison of adapted GCG and TAP attack methods inside the AgentDojo framework across 80 task pairs. No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced. All claims rest on observed attack success rates rather than any reduction of a result to its own inputs or to a self-citation chain. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities; the work is an empirical comparison of existing attack methods.


