An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments
Pith reviewed 2026-05-20 09:54 UTC · model grok-4.3
The pith
Indirect prompt injection through crafted web content can chain with tool calls to leak private data from black-box chatbot agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In black-box chatbot environments the attacker can hijack an agent's intended task by embedding indirect prompt injections inside external web content that appears benign to the victim. The exemplification technique uses a bridge in the retrieved page to present the user prompt and the benign beginning as few-shot examples before the attacker's objective, producing higher success rates than fake-completion. This combination of prompt injection, jailbreak-style instruction steering, and web-tool invocation forms a feasible privacy-leakage path, demonstrated by a controlled data-exfiltration chain that moves fictitious personal information.
What carries the argument
Exemplification technique: a prompt-injection method that inserts a bridge in external content to reframe the user prompt and benign page beginning as few-shot examples before appending the attacker's objective.
If this is right
- Attackers can hijack an agent's intended task by crafting external content that appears benign yet induces execution of an attacker-defined objective.
- The exemplification technique achieves higher attack success rates than the prior fake-completion method when tested against black-box agents.
- A complete data-exfiltration chain is possible by combining the injection with jailbreak-style steering and web-tool invocation.
- Deployed chatbot agents that process untrusted web results as part of user tasks remain vulnerable to privacy leakage without requiring attacker access to internals.
Where Pith is reading between the lines
- Agent designs may need explicit separation or verification steps between retrieved external content and the original user intent to limit override.
- Similar indirect injection chains could be tested against other tool-using LLM agents beyond chatbots.
- Developers could add runtime checks that flag or sandbox instructions originating from external pages before they influence tool calls.
Load-bearing premise
External web content retrieved by the agent will be processed as part of the user's task in a way that allows injected instructions to override the original user intent.
What would settle it
A test in which the agent consistently ignores or detects the injected instructions in the crafted web pages and adheres only to the original user query without performing any exfiltration action.
Figures
read the original abstract
LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user' s task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation details including how a trajectory is actually managed during its processing for a query. We first analyze how an attacker can hijack an agent' s intended task by crafting external content that appears benign to the victim while inducing the agent to execute an attacker-defined objective. We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker' s objective. We compare its attack success rate with a prior fake-completion technique. Finally, we demonstrate a proof-of-concept data-exfiltration chain using fictitious personal information in a controlled setting. Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that indirect prompt injection combined with web-tool invocation enables a feasible privacy-leakage attack chain in black-box LLM chatbot agents. It analyzes how attackers can hijack agent tasks via crafted external content, introduces and evaluates an 'exemplification' technique that uses a bridge in retrieved content to reframe the original prompt and benign page start as few-shot examples before appending the attacker objective, compares its success to a prior fake-completion method, and demonstrates a proof-of-concept exfiltration using only fictitious personal information in a researcher-controlled environment.
Significance. If the central feasibility claim holds under more rigorous testing, the work would usefully highlight a practical attack surface arising from the combination of prompt injection, jailbreak-style steering, and external tool use in deployed agents. This could inform defenses such as stricter content isolation or output sanitization, extending prior prompt-injection literature to tool-augmented settings with concrete examples.
major comments (3)
- [Abstract and evaluation section] Abstract and the evaluation section: the proof-of-concept is described only qualitatively with no reported success rates, number of trials, error analysis, or statistical measures, so the claim that exemplification forms a 'feasible privacy-leakage path' rests on demonstration rather than measured reliability.
- [Demonstration section] Demonstration section: the exfiltration chain is shown only with fictitious personal details inside a fully researcher-controlled environment; this does not test whether real conversation histories containing actual sensitive data would be processed identically or whether deployed agents apply additional filtering or context isolation that would block the override.
- [Attack analysis and demonstration] The weakest assumption (external web content processed verbatim allowing injected instructions to override user intent) is not stress-tested against realistic tool-handler behaviors such as summarization or truncation of retrieved pages.
minor comments (2)
- [Abstract] Abstract contains typographical issues: 'user' s' and 'attacker' s' should be corrected to 'user's' and 'attacker's'.
- [Technique description] The description of the 'bridge' in the exemplification technique would benefit from a concrete example or pseudocode to clarify how the few-shot reframing is constructed.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where the manuscript can be strengthened without misrepresenting our empirical scope or ethical constraints.
read point-by-point responses
-
Referee: [Abstract and evaluation section] Abstract and the evaluation section: the proof-of-concept is described only qualitatively with no reported success rates, number of trials, error analysis, or statistical measures, so the claim that exemplification forms a 'feasible privacy-leakage path' rests on demonstration rather than measured reliability.
Authors: The manuscript reports a direct comparison of attack success rates between the exemplification technique and the prior fake-completion baseline. We agree, however, that explicit reporting of trial counts, error breakdowns, and basic statistical measures would improve transparency and support the feasibility claim more robustly. We will expand the evaluation section with these details in the revised manuscript. revision: yes
-
Referee: [Demonstration section] Demonstration section: the exfiltration chain is shown only with fictitious personal details inside a fully researcher-controlled environment; this does not test whether real conversation histories containing actual sensitive data would be processed identically or whether deployed agents apply additional filtering or context isolation that would block the override.
Authors: Ethical and legal requirements preclude any use of real personal data in attack demonstrations. The controlled setting with fictitious information was selected precisely to isolate the injection and exfiltration mechanism while eliminating risk. We will add an explicit limitations paragraph discussing how real-world conversation histories and agent-side filtering might differ, framing this as an inherent boundary of the study rather than a flaw in the demonstrated chain. revision: partial
-
Referee: [Attack analysis and demonstration] The weakest assumption (external web content processed verbatim allowing injected instructions to override user intent) is not stress-tested against realistic tool-handler behaviors such as summarization or truncation of retrieved pages.
Authors: Because the target agents are black-box, internal tool-handler logic (including summarization or truncation) cannot be directly observed or varied by the researcher. Our demonstrations reflect the observable incorporation of external content in public interfaces. We will add analysis of how truncation or summarization could degrade injection success and will include this as an explicit limitation and direction for future work. revision: partial
Circularity Check
Empirical attack demonstration with no derivational circularity
full rationale
This is an empirical security study whose central claims rest on observed behavior in a controlled proof-of-concept rather than any mathematical derivation chain. No equations, fitted parameters, or self-referential definitions appear; the exemplification technique and data-exfiltration path are presented as experimental constructions evaluated against baselines, not as quantities derived from prior results within the paper itself. External web-content processing assumptions are stated as the weakest link but are not smuggled in via self-citation or uniqueness theorems. The work therefore remains self-contained against its own experimental benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM agents process untrusted external content as legitimate input for task execution without strong isolation or sanitization.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker’s objective.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. arXiv:2302.12173,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment
Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, and Kailong Wang. Lockpicking llms: A logit-based jailbreak using token-level manipulation. arXiv:2405.13068,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Exfiltration of personal information from chatgpt via prompt injection
Gregory Schwartzman. Exfiltration of personal information from chatgpt via prompt injection. arXiv:2406.00199, Jun
-
[4]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
work page 1901
-
[5]
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. arXiv:2403.02691,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. arXiv:2406.13352,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Simple prompt injection attacks can leak personal data observed by llm agents during task execution
Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, and Fabrizio Gilardi. Simple prompt injection attacks can leak personal data observed by llm agents during task execution. arXiv:2506.01055,
-
[8]
Bench- marking and defending against indirect prompt injection attacks on large language models
Jingwei Yi, Yueqi Xie, Bin Benjamin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Bench- marking and defending against indirect prompt injection attacks on large language models. arXiv:2312.14197,
-
[9]
Echoleak: The first real-world zero-click prompt injection exploit in a production llm system
Pavan Reddy and Aditya Sanjay Gujral. Echoleak: The first real-world zero-click prompt injection exploit in a production llm system. arXiv:2509.10540,
-
[10]
Silent egress: When implicit prompt injection makes llm agents leak without a trace
Qianlong Lan, Anuj Kaul, Shaun Jones, and Stephanie Westrum. Silent egress: When implicit prompt injection makes llm agents leak without a trace. arXiv:2602.22450,
-
[11]
Scalable Extraction of Training Data from (Production) Language Models
Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv:2311.17035,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Prompt injection attacks against gpt-3
Simon Willison. Prompt injection attacks against gpt-3. https://simonwillison.net/2022/Sep/12/ prompt-injection/, Sep
work page 2022
-
[13]
Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, and Moshe Tennenholtz. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowled...
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv:2112.09332,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. arXiv:2303.17580,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis
8 Privacy Leakage Chains via Prompt InjectionA PREPRINT Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv:2303.16434,
-
[17]
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv:2304.08244,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Tool Learning with Foundation Models
Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yu Zhu, Zhen Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yuxiang Huan...
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Gorilla: Large Language Model Connected with Massive APIs
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. arXiv:2305.15334,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Reflexion: Language Agents with Verbal Reinforcement Learning
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. arXiv:2303.11366,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge corruption attacks to retrieval- augmented generation of large language models. arXiv:2402.07867,
-
[22]
Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases
Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. arXiv:2407.12784, 2024a. Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. InNeurIPS ML Safety Workshop,
-
[23]
Prompt injection attacks and defenses in llm-integrated applications
Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Prompt injection attacks and defenses in llm-integrated applications. arXiv:2310.12815,
-
[24]
Xuchen Suo. Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications. arXiv:2401.07612,
-
[25]
Struq: Defending against prompt injection with structured queries
Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. Struq: Defending against prompt injection with structured queries. arXiv:2402.06363, 2024b. Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, and David Wagner. Jatmo: Prompt injection defense by task-specific finetuning. arXiv:2312.17673,
-
[26]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv:2404.13208,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. System-level defense against indirect prompt injection attacks: An information flow control perspective. arXiv:2409.19091,
-
[28]
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv:2307.02483,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv:2309.10253,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Promptarmor: Simple yet effective prompt injection defenses
Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, and Dawn Song. Promptarmor: Simple yet effective prompt injection defenses. arXiv:2507.15219,
-
[32]
Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, and Hongxin Hu. Agentsentry: Mitigating indirect prompt injection in llm agents via temporal causal diagnostics and context purification. arXiv:2602.22724,
-
[33]
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
Shihao Weng, Yang Feng, Jinrui Zhang, Xiaofei Xie, Jiongchi Yu, and Jia Liu. ARGUS: Defending llm agents against context-aware prompt injection attacks. arXiv:2605.03378,
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.