Exploiting Web Search Tools of AI Agents for Data Exfiltration

arxiv: 2510.09093 · v2 · submitted 2025-10-10 · 💻 cs.CR · cs.CL

Dennis Rall, Bernhard Bauer, Mohit Mittal, Thomas Fraunholz

Pith reviewed 2026-05-18 08:33 UTC · model grok-4.3

keywords: indirect prompt injection · data exfiltration · AI agents · web search tools · LLM vulnerabilities · prompt attacks · model defenses · cybersecurity

The pith

Indirect prompt injection still lets attackers exfiltrate data from AI agents through their web search tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how indirect prompt injection attacks exploit AI agents that use web search tools to retrieve and leak sensitive information. It tests multiple large language models to measure susceptibility based on size, manufacturer, and implementation details. Results show that familiar attack patterns continue to succeed, indicating that current defenses have not closed these gaps. A reader would care because autonomous agents now routinely handle corporate data through external tools, turning functional features into leakage paths. The work calls for better training, shared attack databases, and ongoing testing to make security part of LLM design rather than an afterthought.

Core claim

The paper claims that indirect prompt injection attacks succeed in exploiting the web search tools of AI agents to exfiltrate data, and that even well-known attack patterns continue to bypass defenses across different model sizes and manufacturers.

What carries the argument

Indirect prompt injection, which places malicious instructions in external web content that the agent retrieves and then follows to send out sensitive data.

If this is right

Strengthened training procedures would increase inherent resilience against these attacks.
A centralized database of known attack vectors would support proactive defense.
A unified testing framework would enable continuous security validation.
Developers must integrate security into the core design of LLMs rather than treating it as optional.
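
The page does not describe what a unified testing framework would look like. As a minimal sketch, assuming a hypothetical per-trial record of (model, template, succeeded), the attack success rate per model could be tallied as below. The names `Trial` and `success_rate_by_model` and the sample data are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One attack attempt (hypothetical record format, not from the paper)."""
    model: str
    template: str
    succeeded: bool


def success_rate_by_model(trials):
    """Return {model: fraction of trials in which the attack succeeded}."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for t in trials:
        attempts[t.model] += 1
        if t.succeeded:
            successes[t.model] += 1
    return {m: successes[m] / attempts[m] for m in attempts}


# Made-up sample data, for illustration only.
sample = [
    Trial("model-a", "t1", True),
    Trial("model-a", "t2", False),
    Trial("model-b", "t1", False),
    Trial("model-b", "t2", False),
]
rates = success_rate_by_model(sample)
```

Grouping by `template` instead of `model` would give the per-template view that the figure captions mention.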

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar exfiltration risks likely exist when agents use other external tools such as APIs or file systems.
Organizations running AI agents should add output monitoring or sanitization as an extra layer beyond model-level fixes.
The persistence of these attacks suggests that simply increasing model size will not remove the vulnerability.
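
The output-monitoring idea above can be sketched as a check on outbound tool-call URLs. This is a rough illustration under stated assumptions: the allowlisted host, the function names, and the notion of a known list of secrets are all hypothetical, and the Base64 match only catches a secret that was encoded on its own, so it should be read as one layer rather than a complete defense.

```python
import base64
from urllib.parse import unquote, urlsplit

# Hypothetical allowlist of hosts the agent's search tool may contact.
ALLOWED_HOSTS = {"search.example.com"}


def leaks_secret(url, secrets):
    """True if a secret appears in the URL in plain, URL-encoded, or Base64 form."""
    decoded = unquote(url)  # undo percent-encoding before matching
    for secret in secrets:
        if not secret:
            continue  # an empty string would match every URL
        raw = secret.encode()
        b64 = base64.b64encode(raw).decode().rstrip("=")
        urlsafe = base64.urlsafe_b64encode(raw).decode().rstrip("=")
        if secret in decoded or b64 in decoded or urlsafe in decoded:
            return True
    return False


def allow_request(url, secrets):
    """Permit a request only to an allowlisted host and without known secrets."""
    host = urlsplit(url).hostname
    return host in ALLOWED_HOSTS and not leaks_secret(url, secrets)
```

A check like this sits outside the model, so it still applies when an injected instruction has already been followed.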

Load-bearing premise

The tested models, attack implementations, and web search tool integrations are representative of real-world AI agent deployments and current LLM capabilities.

What would settle it

A production AI agent that leaks no sensitive data through its web search tool after repeated indirect prompt injection attempts across multiple models.

Figures

Figures reproduced from arXiv: 2510.09093 by Bernhard Bauer, Dennis Rall, Mohit Mittal, Thomas Fraunholz.

**Figure 2.** Bar plot showing the attack success rates of the models. The blue ...

**Figure 3.** Attack success rate of the different variations. The ...

**Figure 4.** Attack success rate of the twenty most effective templates.

**Figure 5.** Scatter plot showing the attack success rate of the models compared ...

Original abstract

Large language models (LLMs) are now routinely used to autonomously execute complex tasks, from natural language processing to dynamic workflows like web searches. The usage of tool-calling and Retrieval Augmented Generation (RAG) allows LLMs to process and retrieve sensitive corporate data, amplifying both their functionality and vulnerability to abuse. As LLMs increasingly interact with external data sources, indirect prompt injection emerges as a critical and evolving attack vector, enabling adversaries to exploit models through manipulated inputs. Through a systematic evaluation of indirect prompt injection attacks across diverse models, we analyze how susceptible current LLMs are to such attacks, which parameters, including model size and manufacturer, specific implementations, shape their vulnerability, and which attack methods remain most effective. Our results reveal that even well-known attack patterns continue to succeed, exposing persistent weaknesses in model defenses. To address these vulnerabilities, we emphasize the need for strengthened training procedures to enhance inherent resilience, a centralized database of known attack vectors to enable proactive defense, and a unified testing framework to ensure continuous security validation. These steps are essential to push developers toward integrating security into the core design of LLMs, as our findings show that current models still fail to mitigate long-standing threats.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Indirect prompt injection still succeeds against AI agents using web search tools for data exfiltration, backed by tests across models.

This paper finds that indirect prompt injection attacks continue to work on AI agents that use web search tools, letting attackers pull out sensitive data. The authors tested this across different models and report that even familiar attack patterns get through. They do a good job laying out the problem in the context of tool-calling and RAG in agents.

The evaluation covers how model size and the provider affect vulnerability, and they compare different attack methods. Including specific templates and showing success rates with examples of how exfiltrated data appears in tool outputs adds concrete evidence. The systematic approach across models is a plus, as it highlights that no single size or maker is immune. This kind of data helps practitioners understand the real risks when deploying these systems with access to corporate info.

The main limitation is that the core attacks draw from prior work on prompt injection, so this is more of an application study than a fresh discovery. Their test setups use various configurations, but it is still worth asking how closely they match production agent environments with real web search integrations. The proposed fixes, such as better training and a shared database of attacks, make sense but lack specifics on how to build or maintain them.

Readers working on securing LLM-based agents or studying prompt-based threats would find this relevant. It gives a practical view of where current defenses are weak. I think this deserves peer review. The empirical results on multiple models provide enough substance for referees to evaluate the claims properly.

Referee Report

0 major / 3 minor

Summary. The paper evaluates indirect prompt injection attacks that exploit web search tools in LLM-based AI agents to achieve data exfiltration. It conducts a systematic empirical study across multiple models, examining the influence of model size, manufacturer, and implementation details on vulnerability, and reports that established attack patterns remain effective. The work concludes by advocating strengthened training procedures, a centralized attack-vector database, and a unified testing framework.

Significance. If the reported success rates and attack templates hold, the results are significant for AI agent security: they supply concrete attack templates, quantitative success rates differentiated by model size and provider, and examples of tool-call outputs that embed exfiltrated context. These elements directly support the central claim of persistent weaknesses in current defenses and provide reproducible evidence that can inform future mitigation research.

minor comments (3)

[§4.2] The success-rate tables would benefit from explicit baseline comparisons against non-tool-augmented LLMs to isolate the contribution of the web-search integration.
[Figure 3] Axis labels and legend entries are too small for readability; enlarge them or add a supplementary high-resolution version.
[§6] The discussion of the proposed centralized attack database lacks a concrete schema or example entry format, which would aid reproducibility of the recommended defense.
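
To make the last comment concrete, one possible entry format for such a database is sketched below. Every field name and value here is a hypothetical illustration; the paper, as summarized on this page, does not define a schema, and the template text is deliberately a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AttackEntry:
    """Hypothetical record for a shared attack-vector database."""
    entry_id: str
    name: str
    vector: str    # attack class, e.g. "indirect_prompt_injection"
    delivery: str  # where the payload is planted, e.g. "web_search_result"
    goal: str      # attacker objective, e.g. "data_exfiltration"
    template: str  # payload text; a placeholder in this sketch
    references: list = field(default_factory=list)


entry = AttackEntry(
    entry_id="EX-0001",
    name="Placeholder example",
    vector="indirect_prompt_injection",
    delivery="web_search_result",
    goal="data_exfiltration",
    template="<redacted placeholder>",
    references=["arXiv:2510.09093"],
)

# Serialize to JSON so entries can be exchanged between tools.
record = json.dumps(asdict(entry), sort_keys=True)
```

A flat, serializable record like this is easy to version and diff, which seems to matter for a database that is meant to be maintained over time.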

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending minor revision. The referee's summary correctly captures the focus of our systematic evaluation of indirect prompt injection attacks that exploit web search tools in LLM agents to enable data exfiltration.

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical attack evaluations

The paper presents a systematic empirical evaluation of indirect prompt injection attacks on LLMs equipped with web search tools, including concrete attack templates, success rates across multiple models and providers, and observed exfiltration outcomes. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided sections. Central claims derive directly from experimental results rather than reducing to inputs by construction, self-citation chains, or renamed known patterns. The work is self-contained as an observational security study against external benchmarks of LLM behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard domain assumptions about LLM tool-calling behavior without introducing new free parameters or invented entities.

axioms (1)

domain assumption: LLMs with tool-calling capabilities can process and act on data retrieved from external web sources.
This underpins the setup of AI agents performing web searches and handling sensitive data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Through a systematic evaluation of indirect prompt injection attacks across diverse models, we analyze how susceptible current LLMs are to such attacks..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR · 2026-05 · unverdicted · novelty 8.0

Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
cs.CR · 2026-05 · unverdicted · novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR · 2026-05 · unverdicted · novelty 6.0

The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
