Exploiting Web Search Tools of AI Agents for Data Exfiltration
Pith reviewed 2026-05-18 08:33 UTC · model grok-4.3
The pith
Indirect prompt injection still lets attackers exfiltrate data from AI agents through their web search tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that indirect prompt injection attacks succeed in exploiting the web search tools of AI agents to exfiltrate data, and that even well-known attack patterns continue to bypass defenses across different model sizes and manufacturers.
What carries the argument
Indirect prompt injection, which places malicious instructions in external web content that the agent retrieves and then follows to send out sensitive data.
If this is right
- Strengthened training procedures would increase inherent resilience against these attacks.
- A centralized database of known attack vectors would support proactive defense.
- A unified testing framework would enable continuous security validation.
- Developers must integrate security into the core design of LLMs rather than treating it as optional.
Where Pith is reading between the lines
- Similar exfiltration risks likely exist when agents use other external tools such as APIs or file systems.
- Organizations running AI agents should add output monitoring or sanitization as an extra layer beyond model-level fixes.
- The persistence of these attacks suggests that simply increasing model size will not remove the vulnerability.
Load-bearing premise
The tested models, attack implementations, and web search tool integrations are representative of real-world AI agent deployments and current LLM capabilities.
What would settle it
A production AI agent whose web search tool returns no sensitive data after repeated indirect prompt injection attempts on multiple models.
Figures
read the original abstract
Large language models (LLMs) are now routinely used to autonomously execute complex tasks, from natural language processing to dynamic workflows like web searches. The usage of tool-calling and Retrieval Augmented Generation (RAG) allows LLMs to process and retrieve sensitive corporate data, amplifying both their functionality and vulnerability to abuse. As LLMs increasingly interact with external data sources, indirect prompt injection emerges as a critical and evolving attack vector, enabling adversaries to exploit models through manipulated inputs. Through a systematic evaluation of indirect prompt injection attacks across diverse models, we analyze how susceptible current LLMs are to such attacks, which parameters, including model size and manufacturer, specific implementations, shape their vulnerability, and which attack methods remain most effective. Our results reveal that even well-known attack patterns continue to succeed, exposing persistent weaknesses in model defenses. To address these vulnerabilities, we emphasize the need for strengthened training procedures to enhance inherent resilience, a centralized database of known attack vectors to enable proactive defense, and a unified testing framework to ensure continuous security validation. These steps are essential to push developers toward integrating security into the core design of LLMs, as our findings show that current models still fail to mitigate long-standing threats.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates indirect prompt injection attacks that exploit web search tools in LLM-based AI agents to achieve data exfiltration. It conducts a systematic empirical study across multiple models, examining the influence of model size, manufacturer, and implementation details on vulnerability, and reports that established attack patterns remain effective. The work concludes by advocating strengthened training procedures, a centralized attack-vector database, and a unified testing framework.
Significance. If the reported success rates and attack templates hold, the results are significant for AI agent security: they supply concrete attack templates, quantitative success rates differentiated by model size and provider, and examples of tool-call outputs that embed exfiltrated context. These elements directly support the central claim of persistent weaknesses in current defenses and provide reproducible evidence that can inform future mitigation research.
minor comments (3)
- [§4.2] §4.2: the success-rate tables would benefit from explicit baseline comparisons against non-tool-augmented LLMs to isolate the contribution of the web-search integration.
- [Figure 3] Figure 3: axis labels and legend entries are too small for readability; enlarge or add a supplementary high-resolution version.
- [§6] The discussion of the proposed centralized attack database lacks a concrete schema or example entry format, which would aid reproducibility of the recommended defense.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our work and for recommending minor revision. The referee's summary correctly captures the focus of our systematic evaluation of indirect prompt injection attacks that exploit web search tools in LLM agents to enable data exfiltration.
Circularity Check
No significant circularity; claims rest on empirical attack evaluations
full rationale
The paper presents a systematic empirical evaluation of indirect prompt injection attacks on LLMs equipped with web search tools, including concrete attack templates, success rates across multiple models and providers, and observed exfiltration outcomes. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided sections. Central claims derive directly from experimental results rather than reducing to inputs by construction, self-citation chains, or renamed known patterns. The work is self-contained as an observational security study against external benchmarks of LLM behavior.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs with tool-calling capabilities can process and act on data retrieved from external web sources.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Through a systematic evaluation of indirect prompt injection attacks across diverse models, we analyze how susceptible current LLMs are to such attacks...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
Reference graph
Works this paper leans on
-
[1]
A Comprehensive Overview of Large Language Models
H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A Comprehensive Overview of Large Language Models,” Jul. 2023. [Online]. Available: https: //arxiv.org/abs/2307.06435v10
work page internal anchor Pith review arXiv 2023
-
[2]
Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei, “Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG,” Feb. 2025, arXiv:2501.09136 [cs]. [Online]. Available: http://arxiv.org/abs/ 2501.09136
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Z. Shen, “LLM With Tools: A Survey,” Sep. 2024, arXiv:2409.18807 [cs]. [Online]. Available: http://arxiv.org/abs/2409.18807
-
[4]
Universal and Transferable Adversarial Attacks on Aligned Language Models
A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” Dec. 2023, arXiv:2307.15043 [cs]. [Online]. Available: http://arxiv.org/abs/2307.15043 5https://openrouter.ai/
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
“real attackers don’t compute gradients
G. Apruzzese, H. S. Anderson, S. Dambra, D. Freeman, F. Pierazzi, and K. Roundy, ““real attackers don’t compute gradients”: bridging the gap between adversarial ml research and practice,” in2023 IEEE conference on secure and trustworthy machine learning (SaTML). IEEE, 2023, pp. 339–364. [Online]. Available: https: //ieeexplore.ieee.org/abstract/document/10136152/
-
[6]
X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, ““Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models,” inACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2024
work page 2024
-
[7]
Autodan: interpretable gradient-based adversarial attacks on large language models
S. Zhu, R. Zhang, B. An, G. Wu, J. Barrow, Z. Wang, F. Huang, A. Nenkova, and T. Sun, “AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models,” Dec. 2023, arXiv:2310.15140 [cs]. [Online]. Available: http://arxiv.org/abs/2310. 15140
-
[8]
GPT- 4 is too smart to be safe: Stealthy chat with LLMs via cipher,
Y . Yuan, W. Jiao, W. Wang, J. tse Huang, P. He, S. Shi, and Z. Tu, “GPT- 4 is too smart to be safe: Stealthy chat with LLMs via cipher,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=MbfAK4s61A
work page 2024
-
[9]
Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails,
W. Hackett, L. Birch, S. Trawicki, N. Suri, and P. Garraghan, “Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails,” Apr. 2025, arXiv:2504.11168 [cs] version: 1. [Online]. Available: http://arxiv.org/abs/2504.11168
-
[10]
A Survey of Attacks on Large Language Models,
W. Xu and K. K. Parhi, “A Survey of Attacks on Large Language Models,” May 2025, arXiv:2505.12567 [cs]. [Online]. Available: http://arxiv.org/abs/2505.12567
-
[11]
V ocabulary Attack to Hijack Large Language Model Applications,
P. Levi and C. P. Neumann, “V ocabulary Attack to Hijack Large Language Model Applications,” May 2024, arXiv:2404.02637 [cs]. [Online]. Available: http://arxiv.org/abs/2404.02637
-
[12]
Prompt injection attacks and defenses in llm-integrated applications
Y . Liu, Y . Jia, R. Geng, J. Jia, and N. Z. Gong, “Formalizing and Benchmarking Prompt Injection Attacks and Defenses,” Nov. 2024, arXiv:2310.12815 [cs]. [Online]. Available: http://arxiv.org/abs/2310. 12815
-
[13]
J. Shi, Z. Yuan, Y . Liu, Y . Huang, P. Zhou, L. Sun, and N. Z. Gong, “Optimization-based Prompt Injection Attack to LLM-as- a-Judge,” Mar. 2025, arXiv:2403.17710 [cs]. [Online]. Available: http://arxiv.org/abs/2403.17710
-
[14]
OW ASP Gen AI Security Project
“OW ASP Gen AI Security Project.” [Online]. Available: https: //genai.owasp.org/
-
[15]
Protecting against indirect prompt injection attacks in MCP,
S. Young, “Protecting against indirect prompt injection attacks in MCP,” Apr. 2025. [Online]. Available: https://developer.microsoft.com/ blog/protecting-against-indirect-injection-attacks-mcp
work page 2025
-
[16]
New hack uses prompt injection to corrupt Gemini’s long-term memory,
D. Goodin, “New hack uses prompt injection to corrupt Gemini’s long-term memory,” Feb. 2025. [Online]. Available: https://arstechnica.com/security/2025/02/ new-hack-uses-prompt-injection-to-corrupt-geminis-long-term-memory/
work page 2025
-
[17]
K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” May 2023, arXiv:2302.12173 [cs]. [Online]. Available: http://arxiv.org/abs/ 2302.12173
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Lessons from Defending Gemini Against Indirect Prompt Injections,
C. Shi, S. Lin, S. Song, J. Hayes, I. Shumailov, I. Yona, J. Pluto, A. Pappu, C. A. Choquette-Choo, M. Nasr, C. Sitawarin, G. Gibson, A. Terzis, and J. F. Flynn, “Lessons from Defending Gemini Against Indirect Prompt Injections,” May 2025, arXiv:2505.14534 [cs]. [Online]. Available: http://arxiv.org/abs/2505.14534
-
[19]
Can Indirect Prompt Injection Attacks Be Detected and Removed?
Y . Chen, H. Li, Y . Sui, Y . He, Y . Liu, Y . Song, and B. Hooi, “Can Indirect Prompt Injection Attacks Be Detected and Removed?” Aug. 2025, arXiv:2502.16580 [cs] version: 4. [Online]. Available: http://arxiv.org/abs/2502.16580
-
[20]
Y . He, E. Wang, Y . Rong, Z. Cheng, and H. Chen, “Security of AI Agents,” Dec. 2024, arXiv:2406.08689 [cs]. [Online]. Available: http://arxiv.org/abs/2406.08689
-
[21]
LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge,
S. Abdelnabi, A. Fay, A. Salem, E. Zverev, K.-C. Liao, C.-H. Liu, C.-C. Kuo, J. Weigend, D. Manlangit, A. Apostolov, H. Umair, J. Donato, M. Kawakita, A. Mahboob, T. H. Bach, T.-H. Chiang, M. Cho, H. Choi, B. Kim, H. Lee, B. Pannell, C. McCauley, M. Russinovich, A. Paverd, and G. Cherubin, “LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injecti...
-
[22]
PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System,
G. D. L. Munoz, A. J. Minnich, R. Lutz, R. Lundeen, R. S. R. Dheekonda, N. Chikanov, B.-E. Jagdagdorj, M. Pouliot, S. Chawla, W. Maxwell, B. Bullwinkel, K. Pratt, J. d. Gruyter, C. Siska, P. Bryan, T. Westerhoff, C. Kawaguchi, C. Seifert, R. S. S. Kumar, and Y . Zunger, “PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI ...
-
[23]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
J. Yu, X. Lin, Z. Yu, and X. Xing, “GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts,” Jun. 2024, arXiv:2309.10253 [cs]. [Online]. Available: http://arxiv.org/abs/ 2309.10253
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition,
E. Debenedetti, J. Rando, D. Paleka, S. F. Florin, D. Albastroiu, N. Cohen, Y . Lemberg, R. Ghosh, R. Wen, A. Salem, G. Cherubin, S. Zanella-Beguelin, R. Schmid, V . Klemm, T. Miki, C. Li, S. Kraft, M. Fritz, F. Tram `er, S. Abdelnabi, and L. Sch ¨onherr, “Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition,” Jun. 2024, arXiv:...
-
[25]
Plentiful jailbreaks with string compositions,
B. R. Y . Huang, “Plentiful jailbreaks with string compositions,” 2024. [Online]. Available: https://arxiv.org/abs/2411.01084
-
[26]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions,” Apr. 2024, arXiv:2404.13208 [cs]. [Online]. Available: http://arxiv.org/abs/2404.13208
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.