An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

Daeseon Choi; Hongjang Yang; Hyunsik Na

arXiv: 2605.18133 · v1 · pith:AEGZ7CBXnew · submitted 2026-05-18 · 💻 cs.CR · cs.AI · cs.HC · cs.IR

Hongjang Yang, Hyunsik Na, Daeseon Choi

Pith reviewed 2026-05-20 09:54 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.HC · cs.IR

keywords prompt injection · indirect injection · privacy leakage · chatbot agents · black-box attacks · web tools · data exfiltration · LLM security

The pith

Indirect prompt injection through crafted web content can chain with tool calls to leak private data from black-box chatbot agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how attackers can exploit LLM-based chatbot agents that combine natural language reasoning with external web tools to create privacy leaks via indirect prompt injection. The attacker crafts web pages that look normal to the user but contain hidden instructions designed to override the original task and make the agent pursue an attacker-chosen goal such as data exfiltration. The authors introduce and test a technique called exemplification that inserts a bridge to reframe the user prompt and the start of the retrieved page as few-shot examples before appending the malicious objective. They compare this approach to a prior fake-completion method and show a working proof-of-concept chain using fictitious personal information in controlled black-box conditions. If the chain holds, the combination of untrusted external content processing and agentic tool use creates a practical attack surface for privacy violations even when the attacker has no access to model internals or system prompts.

Core claim

In black-box chatbot environments the attacker can hijack an agent's intended task by embedding indirect prompt injections inside external web content that appears benign to the victim. The exemplification technique uses a bridge in the retrieved page to present the user prompt and the benign beginning as few-shot examples before the attacker's objective, producing higher success rates than fake-completion. This combination of prompt injection, jailbreak-style instruction steering, and web-tool invocation forms a feasible privacy-leakage path, demonstrated by a controlled data-exfiltration chain that moves fictitious personal information.

What carries the argument

Exemplification technique: a prompt-injection method that inserts a bridge in external content to reframe the user prompt and benign page beginning as few-shot examples before appending the attacker's objective.

If this is right

Attackers can hijack an agent's intended task by crafting external content that appears benign yet induces execution of an attacker-defined objective.
The exemplification technique achieves higher attack success rates than the prior fake-completion method when tested against black-box agents.
A complete data-exfiltration chain is possible by combining the injection with jailbreak-style steering and web-tool invocation.
Deployed chatbot agents that process untrusted web results as part of user tasks remain vulnerable to privacy leakage without requiring attacker access to internals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent designs may need explicit separation or verification steps between retrieved external content and the original user intent to limit override.
Similar indirect injection chains could be tested against other tool-using LLM agents beyond chatbots.
Developers could add runtime checks that flag or sandbox instructions originating from external pages before they influence tool calls.
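
The runtime check suggested in the last extension can be sketched roughly as follows. This is a minimal illustration, not a defense evaluated in the paper: the regular expressions, the `RetrievedContent` type, and the `<untrusted>` wrapper are all hypothetical choices, and a heuristic of this kind would likely miss paraphrased or obfuscated instructions.

```python
import re
from dataclasses import dataclass

# Hypothetical heuristic patterns (assumption); a real detector would need far more coverage.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\bignore (all |any )?(previous|prior|above) instructions\b", re.I),
    re.compile(r"\byour (new|real) (task|goal|objective) is\b", re.I),
    re.compile(r"\b(send|post|append) .{0,40}\b(to|into) https?://", re.I),
]


@dataclass
class RetrievedContent:
    """Text fetched by a web tool, kept together with its origin URL."""
    url: str
    text: str


def flag_instructions(content: RetrievedContent) -> list[str]:
    """Return instruction-like snippets found in externally retrieved text."""
    hits = []
    for pattern in SUSPICIOUS_PATTERNS:
        for match in pattern.finditer(content.text):
            hits.append(match.group(0))
    return hits


def wrap_as_data(content: RetrievedContent) -> str:
    """Mark retrieved text as untrusted data before it enters the model context."""
    return f"<untrusted source={content.url!r}>\n{content.text}\n</untrusted>"
```

An agent wrapper could call `flag_instructions` on every retrieved page and refuse or escalate any tool call that follows a flagged page, while `wrap_as_data` keeps the origin of the text visible to later stages.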

Load-bearing premise

External web content retrieved by the agent will be processed as part of the user's task in a way that allows injected instructions to override the original user intent.

What would settle it

A test in which the agent consistently ignores or detects the injected instructions in the crafted web pages and adheres only to the original user query without performing any exfiltration action.
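
One way such a test could be scored is with a canary planted in the fictitious conversation data. The sketch below assumes the agent's tool calls are available as a list of dictionaries with `url` and optional `body` keys; that log format, the canary value, and the allow-list are illustrative assumptions, not details taken from the paper.

```python
from urllib.parse import urlparse, unquote

# Fictitious marker planted in the test conversation (assumption).
CANARY = "CANARY-7f3a"


def leaked_calls(tool_calls: list[dict], allowed_hosts: set[str]) -> list[dict]:
    """Return tool calls that contact a non-allowed host or carry the canary."""
    flagged = []
    for call in tool_calls:
        url = call.get("url", "")
        host = urlparse(url).hostname or ""
        # A call is off-list only if it actually targets a URL outside the allow-list.
        off_list = bool(url) and host not in allowed_hosts
        # Decode percent-escapes so an encoded canary in the URL is still detected.
        carries_canary = CANARY in unquote(url) or CANARY in str(call.get("body", ""))
        if off_list or carries_canary:
            flagged.append(call)
    return flagged
```

Under these assumptions, a run counts as "adhering to the original query" when `leaked_calls` returns an empty list for every trial.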

Figures

Figures reproduced from arXiv: 2605.18133 by Daeseon Choi, Hongjang Yang, Hyunsik Na.

**Figure 1.** Attack success rate comparison between fake completion and the proposed exemplification technique. Fake…

**Figure 2.** Overall structure of the privacy-leakage chain. The victim sees benign browser content, while the chatbot agent…

Original abstract

LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user's task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation details including how a trajectory is actually managed during its processing for a query. We first analyze how an attacker can hijack an agent's intended task by crafting external content that appears benign to the victim while inducing the agent to execute an attacker-defined objective. We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker's objective. We compare its attack success rate with a prior fake-completion technique. Finally, we demonstrate a proof-of-concept data-exfiltration chain using fictitious personal information in a controlled setting. Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper presents a new exemplification prompt-injection method that can chain with web tools to leak data from black-box agents, but it demonstrates the chain only through a qualitative proof of concept using fictitious information in a controlled lab setup.

This paper introduces an exemplification prompt injection method that turns retrieved web content into few-shot examples to steer black-box agents into leaking personal information via their tools. The demonstration uses only fictitious data in a lab setting, so it highlights a potential issue without confirming real-world impact.

What stands out is the focus on indirect injection through external content in agents that combine reasoning with web browsing. They describe how to craft content that appears normal but includes instructions to exfiltrate data. The new technique is compared directly to fake-completion in the same environment, which helps show its relative strengths. This kind of work draws attention to how untrusted external data can be processed in ways that bypass original user intent.

The soft spots center on the limited evaluation. The claims rest on a qualitative proof-of-concept rather than measured success rates or error breakdowns across trials. Since they use made-up personal details and a controlled setup, it leaves open whether the chain would succeed with actual user data, under summarization by tools, or against any safety filters in deployed systems. The assumption about seamless processing of injected content needs stronger support.

Security folks studying LLM agents and prompt-based attacks would get value from this. It raises awareness of these chains without overclaiming. The paper shows clear engagement with the problem and prior techniques, so it deserves a serious referee to help refine the experiments. I recommend putting it through peer review, with notes to expand on quantitative results and test more realistic conditions.

Referee Report

3 major / 2 minor

Summary. The paper claims that indirect prompt injection combined with web-tool invocation enables a feasible privacy-leakage attack chain in black-box LLM chatbot agents. It analyzes how attackers can hijack agent tasks via crafted external content, introduces and evaluates an 'exemplification' technique that uses a bridge in retrieved content to reframe the original prompt and benign page start as few-shot examples before appending the attacker objective, compares its success to a prior fake-completion method, and demonstrates a proof-of-concept exfiltration using only fictitious personal information in a researcher-controlled environment.

Significance. If the central feasibility claim holds under more rigorous testing, the work would usefully highlight a practical attack surface arising from the combination of prompt injection, jailbreak-style steering, and external tool use in deployed agents. This could inform defenses such as stricter content isolation or output sanitization, extending prior prompt-injection literature to tool-augmented settings with concrete examples.

major comments (3)

[Abstract and evaluation section] Abstract and the evaluation section: the proof-of-concept is described only qualitatively with no reported success rates, number of trials, error analysis, or statistical measures, so the claim that exemplification forms a 'feasible privacy-leakage path' rests on demonstration rather than measured reliability.
[Demonstration section] Demonstration section: the exfiltration chain is shown only with fictitious personal details inside a fully researcher-controlled environment; this does not test whether real conversation histories containing actual sensitive data would be processed identically or whether deployed agents apply additional filtering or context isolation that would block the override.
[Attack analysis and demonstration] The weakest assumption (external web content processed verbatim allowing injected instructions to override user intent) is not stress-tested against realistic tool-handler behaviors such as summarization or truncation of retrieved pages.
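
The first major comment asks for trial counts and statistical measures alongside any success rate. As a rough sketch of what that reporting could look like, the function below computes a Wilson score interval for an attack success rate. The function name and the 95% default are choices made here for illustration; no trial counts are taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

Reporting each technique as "k successes out of n trials" together with such an interval would let readers judge whether the gap between exemplification and fake-completion exceeds sampling noise. The Wilson interval is used here because it behaves better than the normal approximation when n is small or the rate is near 0 or 1.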

minor comments (2)

[Abstract] Abstract contains typographical issues: 'user' s' and 'attacker' s' should be corrected to 'user's' and 'attacker's'.
[Technique description] The description of the 'bridge' in the exemplification technique would benefit from a concrete example or pseudocode to clarify how the few-shot reframing is constructed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where the manuscript can be strengthened without misrepresenting our empirical scope or ethical constraints.

Referee: [Abstract and evaluation section] Abstract and the evaluation section: the proof-of-concept is described only qualitatively with no reported success rates, number of trials, error analysis, or statistical measures, so the claim that exemplification forms a 'feasible privacy-leakage path' rests on demonstration rather than measured reliability.

Authors: The manuscript reports a direct comparison of attack success rates between the exemplification technique and the prior fake-completion baseline. We agree, however, that explicit reporting of trial counts, error breakdowns, and basic statistical measures would improve transparency and support the feasibility claim more robustly. We will expand the evaluation section with these details in the revised manuscript. revision: yes
Referee: [Demonstration section] Demonstration section: the exfiltration chain is shown only with fictitious personal details inside a fully researcher-controlled environment; this does not test whether real conversation histories containing actual sensitive data would be processed identically or whether deployed agents apply additional filtering or context isolation that would block the override.

Authors: Ethical and legal requirements preclude any use of real personal data in attack demonstrations. The controlled setting with fictitious information was selected precisely to isolate the injection and exfiltration mechanism while eliminating risk. We will add an explicit limitations paragraph discussing how real-world conversation histories and agent-side filtering might differ, framing this as an inherent boundary of the study rather than a flaw in the demonstrated chain. revision: partial
Referee: [Attack analysis and demonstration] The weakest assumption (external web content processed verbatim allowing injected instructions to override user intent) is not stress-tested against realistic tool-handler behaviors such as summarization or truncation of retrieved pages.

Authors: Because the target agents are black-box, internal tool-handler logic (including summarization or truncation) cannot be directly observed or varied by the researcher. Our demonstrations reflect the observable incorporation of external content in public interfaces. We will add analysis of how truncation or summarization could degrade injection success and will include this as an explicit limitation and direction for future work. revision: partial
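
The truncation analysis promised in the last response could start from a simple offline model. The sketch below assumes head truncation to a fixed character budget, which is only one possible tool-handler behavior; the actual handling inside the black-box agents is unknown, and summarization is not modeled at all. The marker is a harmless placeholder string standing in for the position of injected text.

```python
def survives_truncation(page_text: str, marker: str, max_chars: int) -> bool:
    """True if the marker lies entirely within the first max_chars characters."""
    return marker in page_text[:max_chars]


def survival_profile(page_text: str, marker: str, budgets: list[int]) -> dict[int, bool]:
    """Map each hypothetical character budget to whether the marker survives it."""
    return {budget: survives_truncation(page_text, marker, budget) for budget in budgets}


if __name__ == "__main__":
    # Benign filler, then a placeholder marker, then more filler (all synthetic).
    page = "A" * 500 + "MARKER" + "B" * 500
    print(survival_profile(page, "MARKER", [200, 506, 2000]))
```

Sweeping the marker position and the budget in this way would show how far down a page an instruction can sit before a given truncation policy removes it, which is the kind of degradation analysis the authors describe as future work.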

Circularity Check

0 steps flagged

Empirical attack demonstration with no derivational circularity

This is an empirical security study whose central claims rest on observed behavior in a controlled proof-of-concept rather than any mathematical derivation chain. No equations, fitted parameters, or self-referential definitions appear; the exemplification technique and data-exfiltration path are presented as experimental constructions evaluated against baselines, not as quantities derived from prior results within the paper itself. External web-content processing assumptions are stated as the weakest link but are not smuggled in via self-citation or uniqueness theorems. The work therefore remains self-contained against its own experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLM agents will faithfully execute instructions embedded in retrieved external content when that content is framed as part of the user task.

axioms (1)

domain assumption: LLM agents process untrusted external content as legitimate input for task execution without strong isolation or sanitization.
Invoked in the description of the attack surface created by combining natural-language reasoning with external tools.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker’s objective.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. arXiv:2302.12173,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment

Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, and Kailong Wang. Lockpicking llms: A logit-based jailbreak using token-level manipulation. arXiv:2405.13068,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Exfiltration of personal information from chatgpt via prompt injection

Gregory Schwartzman. Exfiltration of personal information from chatgpt via prompt injection. arXiv:2406.00199, Jun

work page arXiv

[4] [4]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page 1901

[5] [5]

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. arXiv:2403.02691,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. arXiv:2406.13352,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Simple prompt injection attacks can leak personal data observed by llm agents during task execution

Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, and Fabrizio Gilardi. Simple prompt injection attacks can leak personal data observed by llm agents during task execution. arXiv:2506.01055,

work page arXiv

[8] [8]

Bench- marking and defending against indirect prompt injection attacks on large language models

Jingwei Yi, Yueqi Xie, Bin Benjamin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Bench- marking and defending against indirect prompt injection attacks on large language models. arXiv:2312.14197,

work page arXiv

[9] [9]

Echoleak: The first real-world zero-click prompt injection exploit in a production llm system

Pavan Reddy and Aditya Sanjay Gujral. Echoleak: The first real-world zero-click prompt injection exploit in a production llm system. arXiv:2509.10540,

work page arXiv

[10] [10]

Silent egress: When implicit prompt injection makes llm agents leak without a trace

Qianlong Lan, Anuj Kaul, Shaun Jones, and Stephanie Westrum. Silent egress: When implicit prompt injection makes llm agents leak without a trace. arXiv:2602.22450,

work page arXiv

[11] [11]

Scalable Extraction of Training Data from (Production) Language Models

Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv:2311.17035,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Prompt injection attacks against gpt-3

Simon Willison. Prompt injection attacks against gpt-3. https://simonwillison.net/2022/Sep/12/ prompt-injection/, Sep

work page 2022

[13] [13]

MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, and Moshe Tennenholtz. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowled...

[14]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv:2112.09332,

[15]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. arXiv:2303.17580,

[16]

Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv:2303.16434,

[17]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv:2304.08244,

[18]

Tool Learning with Foundation Models

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yu Zhu, Zhen Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yuxiang Huan...

[19]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. arXiv:2305.15334,

[20]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. arXiv:2303.11366,

[21]

Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models

Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models. arXiv:2402.07867,

[22]

Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. arXiv:2407.12784, 2024a.

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop,

[23]

Prompt injection attacks and defenses in llm-integrated applications

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Prompt injection attacks and defenses in llm-integrated applications. arXiv:2310.12815,

[24]

Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications

Xuchen Suo. Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications. arXiv:2401.07612,

[25]

Struq: Defending against prompt injection with structured queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. Struq: Defending against prompt injection with structured queries. arXiv:2402.06363, 2024b.

Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, and David Wagner. Jatmo: Prompt injection defense by task-specific finetuning. arXiv:2312.17673,

[26]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv:2404.13208,

[27]

System-level defense against indirect prompt injection attacks: An information flow control perspective

Fangzhou Wu, Ethan Cecchetti, and Chaowei Xiao. System-level defense against indirect prompt injection attacks: An information flow control perspective. arXiv:2409.19091, 2024.

[28]

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv:2307.02483,

[29]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043,

[30]

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv:2309.10253,

[31]

Promptarmor: Simple yet effective prompt injection defenses

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, and Dawn Song. Promptarmor: Simple yet effective prompt injection defenses. arXiv:2507.15219,

[32]

Agentsentry: Mitigating indirect prompt injection in llm agents via temporal causal diagnostics and context purification

Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, and Hongxin Hu. Agentsentry: Mitigating indirect prompt injection in llm agents via temporal causal diagnostics and context purification. arXiv:2602.22724,

[33]

ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

Shihao Weng, Yang Feng, Jinrui Zhang, Xiaofei Xie, Jiongchi Yu, and Jia Liu. ARGUS: Defending llm agents against context-aware prompt injection attacks. arXiv:2605.03378,
