arxiv: 2605.18133 · v1 · pith:AEGZ7CBXnew · submitted 2026-05-18 · 💻 cs.CR · cs.AI · cs.HC · cs.IR

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

Pith reviewed 2026-05-20 09:54 UTC · model grok-4.3

classification 💻 cs.CR cs.AI cs.HC cs.IR
keywords: prompt injection, indirect injection, privacy leakage, chatbot agents, black-box attacks, web tools, data exfiltration, LLM security

The pith

Indirect prompt injection through crafted web content can chain with tool calls to leak private data from black-box chatbot agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how attackers can exploit LLM-based chatbot agents that combine natural language reasoning with external web tools to create privacy leaks via indirect prompt injection. The attacker crafts web pages that look normal to the user but contain hidden instructions designed to override the original task and make the agent pursue an attacker-chosen goal such as data exfiltration. The authors introduce and test a technique called exemplification that inserts a bridge to reframe the user prompt and the start of the retrieved page as few-shot examples before appending the malicious objective. They compare this approach to a prior fake-completion method and show a working proof-of-concept chain using fictitious personal information in controlled black-box conditions. If the chain holds, the combination of untrusted external content processing and agentic tool use creates a practical attack surface for privacy violations even when the attacker has no access to model internals or system prompts.

Core claim

In black-box chatbot environments, an attacker can hijack an agent's intended task by embedding indirect prompt injections in external web content that appears benign to the victim. The exemplification technique uses a bridge in the retrieved page to present the user prompt and the benign beginning of the page as few-shot examples before the attacker's objective, and the paper reports higher success rates for it than for the fake-completion baseline. This combination of prompt injection, jailbreak-style instruction steering, and web-tool invocation forms a feasible privacy-leakage path, demonstrated by a controlled data-exfiltration chain that moves fictitious personal information.

What carries the argument

Exemplification technique: a prompt-injection method that inserts a bridge in external content to reframe the user prompt and benign page beginning as few-shot examples before appending the attacker's objective.

If this is right

  • Attackers can hijack an agent's intended task by crafting external content that appears benign yet induces execution of an attacker-defined objective.
  • The exemplification technique achieves higher attack success rates than the prior fake-completion method when tested against black-box agents.
  • A complete data-exfiltration chain is possible by combining the injection with jailbreak-style steering and web-tool invocation.
  • Deployed chatbot agents that process untrusted web results as part of user tasks remain vulnerable to privacy leakage without requiring attacker access to internals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designs may need explicit separation or verification steps between retrieved external content and the original user intent to limit override.
  • Similar indirect injection chains could be tested against other tool-using LLM agents beyond chatbots.
  • Developers could add runtime checks that flag or sandbox instructions originating from external pages before they influence tool calls.
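The runtime check suggested in the last bullet could look roughly like the sketch below. It is an editorial illustration, not something the paper proposes or evaluates. The `ToolCall` shape, the `triggered_by` provenance label, and the host names are all hypothetical, and reliably knowing whether an instruction came from the user or from a retrieved page is itself an open problem that this sketch simply assumes away.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import unquote, urlparse


@dataclass(frozen=True)
class ToolCall:
    """A pending web-tool invocation (hypothetical shape, for illustration)."""
    url: str
    triggered_by: str  # assumed provenance label: "user" or "external"


def check_tool_call(call: ToolCall, allowed_hosts: set[str],
                    sensitive_values: list[str]) -> list[str]:
    """Return reasons to block the call; an empty list means allow."""
    reasons = []
    host = urlparse(call.url).hostname or ""
    # Rule 1: instructions that originate in external content may only
    # reach hosts on an explicit allowlist.
    if call.triggered_by != "user" and host not in allowed_hosts:
        reasons.append(f"externally triggered call to unlisted host {host!r}")
    # Rule 2: the outgoing URL must not carry values seen in the conversation.
    decoded = unquote(call.url).lower()
    if any(value.lower() in decoded for value in sensitive_values):
        reasons.append("URL carries a value from the conversation")
    return reasons


# Usage with fictitious data: both rules fire for this call.
suspicious = ToolCall("https://collector.example/log?q=Jane%20Doe", "external")
print(check_tool_call(suspicious, {"docs.example"}, ["Jane Doe"]))
```

A substring match on decoded URLs is easy to evade with other encodings, so a check like this is best read as one layer among several rather than a complete defense.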

Load-bearing premise

External web content retrieved by the agent will be processed as part of the user's task in a way that allows injected instructions to override the original user intent.

What would settle it

A test in which the agent consistently ignores or detects the injected instructions in the crafted web pages and adheres only to the original user query without performing any exfiltration action.
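One way to make such a test mechanical is to plant a fictitious canary value in the conversation and count the trials in which it appears in any outbound request. The sketch below assumes a hypothetical `agent` wrapper that returns the URLs the agent tried to fetch; real black-box chatbots do not necessarily expose that information, so the harness is an illustration of the bookkeeping, not of the paper's setup.

```python
from typing import Callable, List

CANARY = "CANARY-7f3a91"  # fictitious marker planted in the test conversation


def leaked(outbound_requests: List[str], canary: str = CANARY) -> bool:
    """True if any recorded outbound request contains the canary."""
    return any(canary in request for request in outbound_requests)


def run_trials(agent: Callable[[str, str], List[str]], user_query: str,
               page: str, n_trials: int) -> int:
    """Run the agent n_trials times and count the trials with a leak.

    `agent` is a hypothetical wrapper: it takes the user query and the
    retrieved page text, and returns the URLs the agent tried to request.
    """
    return sum(leaked(agent(user_query, page)) for _ in range(n_trials))


def robust_stub_agent(user_query: str, page: str) -> List[str]:
    """Stand-in agent that only fetches the page the user asked about."""
    return ["https://news.example/article"]


query = f"My member ID is {CANARY}. Please summarize the article."
print(run_trials(robust_stub_agent, query, "<retrieved page text>", 20))
```

An agent that consistently ignores injected instructions would keep this count at zero across many crafted pages; any nonzero count is a recorded leak.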

Figures

Figures reproduced from arXiv: 2605.18133 by Daeseon Choi, Hongjang Yang, Hyunsik Na.

Figure 1: Attack success rate comparison between fake completion and the proposed exemplification technique. Fake [...]
Figure 2: Overall structure of the privacy-leakage chain. The victim sees benign browser content, while the chatbot agent [...]
Original abstract

LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user' s task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation details including how a trajectory is actually managed during its processing for a query. We first analyze how an attacker can hijack an agent' s intended task by crafting external content that appears benign to the victim while inducing the agent to execute an attacker-defined objective. We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker' s objective. We compare its attack success rate with a prior fake-completion technique. Finally, we demonstrate a proof-of-concept data-exfiltration chain using fictitious personal information in a controlled setting. Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that indirect prompt injection combined with web-tool invocation enables a feasible privacy-leakage attack chain in black-box LLM chatbot agents. It analyzes how attackers can hijack agent tasks via crafted external content, introduces and evaluates an 'exemplification' technique that uses a bridge in retrieved content to reframe the original prompt and benign page start as few-shot examples before appending the attacker objective, compares its success to a prior fake-completion method, and demonstrates a proof-of-concept exfiltration using only fictitious personal information in a researcher-controlled environment.

Significance. If the central feasibility claim holds under more rigorous testing, the work would usefully highlight a practical attack surface arising from the combination of prompt injection, jailbreak-style steering, and external tool use in deployed agents. This could inform defenses such as stricter content isolation or output sanitization, extending prior prompt-injection literature to tool-augmented settings with concrete examples.

major comments (3)
  1. [Abstract and evaluation section] Abstract and the evaluation section: the proof-of-concept is described only qualitatively with no reported success rates, number of trials, error analysis, or statistical measures, so the claim that exemplification forms a 'feasible privacy-leakage path' rests on demonstration rather than measured reliability.
  2. [Demonstration section] Demonstration section: the exfiltration chain is shown only with fictitious personal details inside a fully researcher-controlled environment; this does not test whether real conversation histories containing actual sensitive data would be processed identically or whether deployed agents apply additional filtering or context isolation that would block the override.
  3. [Attack analysis and demonstration] The weakest assumption (external web content processed verbatim allowing injected instructions to override user intent) is not stress-tested against realistic tool-handler behaviors such as summarization or truncation of retrieved pages.
minor comments (2)
  1. [Abstract] Abstract contains typographical issues: 'user' s' and 'attacker' s' should be corrected to 'user's' and 'attacker's'.
  2. [Technique description] The description of the 'bridge' in the exemplification technique would benefit from a concrete example or pseudocode to clarify how the few-shot reframing is constructed.
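Major comment 1 asks for trial counts and statistical measures. As a minimal sketch of what that reporting could involve, the function below computes a Wilson score interval for an attack success rate. The counts in the usage lines are placeholders chosen for illustration, not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Placeholder counts (NOT from the paper): 14 successful attacks in 20 trials.
low, high = wilson_interval(14, 20)
print(f"ASR = {14 / 20:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

With only a few dozen trials the interval is wide, which is why a bare success-rate comparison between two techniques is hard to interpret without the trial counts behind it.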

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where the manuscript can be strengthened without misrepresenting our empirical scope or ethical constraints.

Point-by-point responses
  1. Referee: [Abstract and evaluation section] Abstract and the evaluation section: the proof-of-concept is described only qualitatively with no reported success rates, number of trials, error analysis, or statistical measures, so the claim that exemplification forms a 'feasible privacy-leakage path' rests on demonstration rather than measured reliability.

    Authors: The manuscript reports a direct comparison of attack success rates between the exemplification technique and the prior fake-completion baseline. We agree, however, that explicit reporting of trial counts, error breakdowns, and basic statistical measures would improve transparency and support the feasibility claim more robustly. We will expand the evaluation section with these details in the revised manuscript. revision: yes

  2. Referee: [Demonstration section] Demonstration section: the exfiltration chain is shown only with fictitious personal details inside a fully researcher-controlled environment; this does not test whether real conversation histories containing actual sensitive data would be processed identically or whether deployed agents apply additional filtering or context isolation that would block the override.

    Authors: Ethical and legal requirements preclude any use of real personal data in attack demonstrations. The controlled setting with fictitious information was selected precisely to isolate the injection and exfiltration mechanism while eliminating risk. We will add an explicit limitations paragraph discussing how real-world conversation histories and agent-side filtering might differ, framing this as an inherent boundary of the study rather than a flaw in the demonstrated chain. revision: partial

  3. Referee: [Attack analysis and demonstration] The weakest assumption (external web content processed verbatim allowing injected instructions to override user intent) is not stress-tested against realistic tool-handler behaviors such as summarization or truncation of retrieved pages.

    Authors: Because the target agents are black-box, internal tool-handler logic (including summarization or truncation) cannot be directly observed or varied by the researcher. Our demonstrations reflect the observable incorporation of external content in public interfaces. We will add analysis of how truncation or summarization could degrade injection success and will include this as an explicit limitation and direction for future work. revision: partial
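The truncation analysis promised in the third response can be approximated offline even though the real tool handlers are opaque. The sketch below assumes the simplest possible handler model, keeping only the first `max_chars` characters of a page, and uses a neutral marker string as a stand-in for any span of interest. Both the model and the names are assumptions for illustration; actual agents may summarize, chunk, or rank content instead.

```python
from typing import Dict, List


def truncate(page: str, max_chars: int) -> str:
    """Assumed handler model: keep only the first max_chars characters."""
    return page[:max_chars]


def marker_survives(page: str, marker: str, max_chars: int) -> bool:
    """True if the whole marker is still present after truncation."""
    return marker in truncate(page, max_chars)


def survival_by_budget(page: str, marker: str,
                       budgets: List[int]) -> Dict[int, bool]:
    """Map each character budget to whether the marker survives it."""
    return {budget: marker_survives(page, marker, budget) for budget in budgets}


# A neutral marker placed 500 characters into a synthetic page.
MARKER = "[[INJECTED-SPAN]]"
page = "A" * 500 + MARKER + "B" * 500
print(survival_by_budget(page, MARKER, [256, 512, 1024]))
# -> {256: False, 512: False, 1024: True}
```

Under this model a span survives only if it ends before the character budget, so a span that must follow a benign page beginning is exposed to truncation in a way that content at the very top of the page is not.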

Circularity Check

0 steps flagged

Empirical attack demonstration with no derivational circularity

Full rationale

This is an empirical security study whose central claims rest on observed behavior in a controlled proof-of-concept rather than any mathematical derivation chain. No equations, fitted parameters, or self-referential definitions appear; the exemplification technique and data-exfiltration path are presented as experimental constructions evaluated against baselines, not as quantities derived from prior results within the paper itself. External web-content processing assumptions are stated as the weakest link but are not smuggled in via self-citation or uniqueness theorems. The work therefore remains self-contained against its own experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLM agents will faithfully execute instructions embedded in retrieved external content when that content is framed as part of the user task.

axioms (1)
  • domain assumption: LLM agents process untrusted external content as legitimate input for task execution without strong isolation or sanitization.
    Invoked in the description of the attack surface created by combining natural-language reasoning with external tools.

pith-pipeline@v0.9.0 · 5768 in / 1160 out tokens · 24849 ms · 2026-05-20T09:54:26.432223+00:00 · methodology
