Whispers in the Machine: Confidentiality in Agentic Systems
Pith reviewed 2026-05-24 03:57 UTC · model grok-4.3
The pith
LLM-based agents leak sensitive data through prompt injection in every tested case, with tools amplifying the risk and defenses failing to stop it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By abstracting sensitive data as a secret string, the evaluation of ten agents across twenty tool scenarios and fourteen attack strategies shows that all agents are vulnerable to at least one attack, existing defenses fail to provide reliable protection against these threats, and the tooling itself can amplify leakage risks.
What carries the argument
Secret-string abstraction for sensitive data combined with prompt-injection attacks on agent-tool interactions.
If this is right
- Prompt injection in connected services gives a direct path for sensitive data to leave the agent.
- No existing defense blocks leakage reliably across the tested scenarios.
- Adding tools can raise rather than lower the chance of data exfiltration.
- Agents performing tasks such as scheduling or document handling inherit these leakage pathways.
Where Pith is reading between the lines
- Future agent designs may need to isolate tool outputs from any secret data flows before execution.
- The same leakage patterns could appear in multi-agent setups where one agent passes data to another.
- Real deployments with more tool types than the twenty tested would likely show at least as many leaks.
Load-bearing premise
Modeling real confidentiality threats with a secret string plus the chosen twenty tool scenarios and fourteen attack strategies is enough to represent deployed agentic systems.
What would settle it
An agent configuration and tool set where none of the fourteen attack strategies succeeds in extracting the secret string would falsify the universal vulnerability claim.
Figures
read the original abstract
Large language model (LLM)-based agents combine LLMs with external tools to automate tasks such as scheduling meetings, managing documents, or booking travel. While these integrations unlock powerful capabilities, they also create new and more severe attack surfaces. In particular, prompt injection attacks become far more dangerous in the agentic setting: malicious instructions embedded in connected services can misdirect the agent, providing a direct pathway for sensitive data to be exfiltrated. Yet, despite a growing number of real-world incidents, the confidentiality risks of such systems remain poorly understood. To address this gap, we provide a formalization of confidentiality in LLM-based agents. By abstracting sensitive data as a secret string, we evaluate ten agents across 20 tool scenarios and 14 attack strategies. We find that all agents are vulnerable to at least one attack, and existing defenses fail to provide reliable protection against these threats. Strikingly, we find that the tooling itself can amplify leakage risks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes confidentiality risks in LLM-based agents by modeling sensitive data as secret strings, then empirically evaluates ten agents across 20 tool scenarios and 14 attack strategies. It reports that every agent is vulnerable to at least one attack, that existing defenses do not reliably mitigate leakage, and that the tooling layer itself can increase exfiltration risk.
Significance. If the chosen scenarios and secret-string abstraction are representative, the work supplies concrete evidence that prompt-injection surfaces in agentic systems are both widespread and inadequately addressed by current mitigations. The explicit attack catalog and multi-agent testbed constitute a useful empirical contribution even if the universality claim requires qualification.
major comments (2)
- [Abstract] Abstract: the universal claim that 'all agents are vulnerable to at least one attack' and that 'existing defenses fail to provide reliable protection' rests on 10 agents, 20 tool scenarios, and a secret-string abstraction; the manuscript does not demonstrate that these scenarios cover structured data, multi-turn state, or external service responses that dominate real deployments, so the generalization is load-bearing and under-supported.
- [Abstract] Abstract / Evaluation description: the reported findings include no error bars, statistical tests, or pre-specified exclusion criteria, leaving open the possibility that post-hoc scenario or attack selection influenced the 'universal vulnerability' result.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, qualifying our claims where the evaluation scope is limited and clarifying the experimental design. Revisions have been made to the abstract, evaluation section, and a new limitations discussion.
read point-by-point responses
-
Referee: [Abstract] Abstract: the universal claim that 'all agents are vulnerable to at least one attack' and that 'existing defenses fail to provide reliable protection' rests on 10 agents, 20 tool scenarios, and a secret-string abstraction; the manuscript does not demonstrate that these scenarios cover structured data, multi-turn state, or external service responses that dominate real deployments, so the generalization is load-bearing and under-supported.
Authors: We agree that the secret-string abstraction and the 20 scenarios do not cover structured data, multi-turn state, or external service responses typical in deployments. The abstraction was selected to enable systematic, reproducible measurement of leakage across a controlled matrix. In the revised manuscript we have added a Limitations section that explicitly bounds the claims to the tested agents and scenarios. We have also revised the abstract to state that vulnerabilities were observed in all ten evaluated agents rather than asserting universality across agentic systems in general. revision: partial
-
Referee: [Abstract] Abstract / Evaluation description: the reported findings include no error bars, statistical tests, or pre-specified exclusion criteria, leaving open the possibility that post-hoc scenario or attack selection influenced the 'universal vulnerability' result.
Authors: The evaluation reports deterministic binary outcomes (leakage or no leakage) for every combination in the 10×20×14 matrix; no sampling variability exists that would require error bars or statistical hypothesis tests. All scenarios and attacks were enumerated in advance according to categories of common tool use and documented injection vectors, with results reported for the complete set and no exclusions applied. The revised manuscript adds an expanded Evaluation Methodology subsection documenting this pre-specification and includes the full scenario and attack lists in an appendix. revision: yes
Circularity Check
No circularity: empirical evaluation of concrete agents and attacks
full rationale
The paper's central claims rest on direct empirical testing of 10 agents across 20 tool scenarios and 14 attack strategies, using an explicit secret-string abstraction for sensitive data. Results (universal vulnerability, defense failure, tooling amplification) are reported observations from these runs rather than quantities derived by definition, fitted parameters renamed as predictions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that reduce the findings to the inputs by construction. The abstraction choice and scenario selection are methodological decisions whose adequacy is a question of external validity, not internal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Sensitive data can be usefully abstracted as a secret string for leakage measurement
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a secret-key game... tool-robustness framework... 20 unique scenarios... 14 different attack strategies
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
all agents are vulnerable to at least one attack... tooling itself can amplify leakage risks
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption
Public defenders view AI as most useful for evidence investigation but limited in courtroom work and strategy, with adoption blocked by costs, confidentiality risks, and norms, requiring human oversight and open development.
-
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Reference graph
Works this paper leans on
-
[1]
(2024) Notion AI | now with q&a
Notion. (2024) Notion AI | now with q&a. [Online]. Available: https://www.notion.so/product/ai
work page 2024
-
[2]
(2024) Integrate the OpenAI (ChatGPT) API with the gmail API
OpenAI. (2024) Integrate the OpenAI (ChatGPT) API with the gmail API. OpenAI. [Online]. Available: https://pipedream.com/ apps/openai/integrations/gmail
work page 2024
-
[3]
(2024) AI calendar | AI scheduling assis- tant | clockwise
Clockwise. (2024) AI calendar | AI scheduling assis- tant | clockwise. Clockwise. [Online]. Available: https: //www.getclockwise.com/ai
work page 2024
- [4]
- [5]
-
[6]
(2024) Personal AI assistant | microsoft copilot
Microsoft. (2024) Personal AI assistant | microsoft copilot. [Online]. Available: https://www.microsoft.com/en-us/microsoft- copilot/personal-ai-assistant
work page 2024
- [7]
-
[8]
Beyond the Safeguards: Exploring the Security Risks of ChatGPT,
E. Derner and K. Batisti ˇc, “Beyond the Safeguards: Exploring the Security Risks of ChatGPT,” Computing Research Repository (CoRR), vol. abs/2305.08005, 2023
-
[10]
Auto- DAN: Automatic and Interpretable Adversarial Attacks on Large Language Models,
S. Zhu, R. Zhang, B. An, G. Wu, and J. B. et al., “Auto- DAN: Automatic and Interpretable Adversarial Attacks on Large Language Models,” Computing Research Repository (CoRR) , vol. abs/2310.15140, 2023
-
[12]
Extracting Training Data from Large Language Models,
N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, and A. H. et al., “Extracting Training Data from Large Language Models,” in USENIX Security Symposium , 2021
work page 2021
- [13]
-
[14]
On the Opportunities and Risks of Foundation Models
R. Bommasani, D. A. Hudson, E. Adeli, and R. Altman, “On the opportunities and risks of foundation models,” corr, no. arXiv:2108.07258, 2022. [Online]. Available: http://arxiv.org/abs/ 2108.07258
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[15]
A. Dubey, A. Jauhri, A. Pandey, and A. Kadian, “The Llama 3 Herd of Models,” corr, Aug. 2024, arXiv:2407.21783 [cs]. [Online]. Available: http://arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Machine-generated text: A comprehensive survey of threat models and detection methods,
E. Crothers, N. Japkowicz, and H. L. Viktor, “Machine-generated text: A comprehensive survey of threat models and detection methods,” IEEE Access, 2023
work page 2023
-
[17]
ReAct: Synergizing Reasoning and Acting in Language Models
S. Yao, J. Zhao, D. Yu, N. Du, and I. e. a. Shafran, “ReAct: Synergizing reasoning and acting in language models,” corr, no. arXiv:2210.03629, 2023. [Online]. Available: http: //arxiv.org/abs/2210.03629
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec @ CCS 2023)
K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real- world llm-integrated applications with indirect prompt injection,” in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec 2023, Copenhagen, Denmark, 30 November 2023 . ACM, 2023, pp. 79–90. [Online]. Av...
-
[19]
H. Chase. (2022) LangChain. Langchain. Original-date: 2022-10- 17T02:58:36Z. [Online]. Available: https://github.com/langchain- ai/langchain
work page 2022
-
[20]
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks,
D. Kang, X. Li, I. Stoica, C. Guestrin, and M. Z. et al., “Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks,” Computing Research Repository (CoRR) , vol. abs/2302.05733, 2023
-
[21]
M. Shergadwala. (2023) Prompt injection attacks in various LLMs. [Online]. Available: https://medium.com/@murtuza.shergadwala/ prompt-injection-attacks-in-various-llms-206f56cd6ee9
work page 2023
-
[22]
Jailbroken: How Does LLM Safety Training Fail?
A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How Does LLM Safety Training Fail?” in Annual Conference on Neural Information Processing Systems (NeurIPS) , 2023
work page 2023
-
[23]
Universal and Transferable Adversarial Attacks on Aligned Language Models
A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Mod- els,”Computing Research Repository (CoRR), vol. abs/2307.15043, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
0xk1h0 - github: Jailbreak prompts collection,
K. Lee, “0xk1h0 - github: Jailbreak prompts collection,” 2023. [Online]. Available: https://github.com/0xk1h0/ChatGPT_DAN
work page 2023
-
[25]
(2024) LLM jailbreak | MITRE ATLAS™
Mitre. (2024) LLM jailbreak | MITRE ATLAS™. [Online]. Available: https://atlas.mitre.org/techniques/AML.T0054
work page 2024
-
[26]
W. Zhang. (2023) Prompt injection attack on GPT-4 — robust intelligence. [Online]. Available: https://www.robustintelligence. com/blog-posts/prompt-injection-attack-on-gpt-4
work page 2023
-
[27]
(2023) Novel jailbreak technique via typoglycemia
LaurieWired [@lauriewired]. (2023) Novel jailbreak technique via typoglycemia. [Online]. Available: https://twitter.com/lauriewired/ status/1682825249203662848
-
[28]
S. Schulhoff, J. Pinto, A. Khan, and L.-F. Bouchard, “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition,” corr, Mar. 2024, arXiv:2311.16119 [cs]. [Online]. Available: http://arxiv.org/abs/2311.16119
- [29]
-
[30]
Promptbench: Towards evaluating the robustness of large language models on adversarial prompts
K. Zhu, J. Wang, J. Zhou, Z. Wang, and H. C. et al., “PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts,” Computing Research Repository (CoRR), vol. abs/2306.04528, 2023
-
[31]
Using GPT-eliezer against ChatGPT jailbreaking,
S. Armstrong and R. Gorman, “Using GPT-eliezer against ChatGPT jailbreaking,” 2023. [Online]. Available: https://www.alignmentforum.org/posts/pNcFYZnPdXyL2RfgA/ using-gpt-eliezer-against-chatgpt-jailbreaking
work page 2023
-
[32]
Prompt Injection Attacks and Defenses in LLM-Integrated Applications,
Y . Liu, Y . Jia, R. Geng, J. Jia, and N. Z. Gong, “Prompt Injection Attacks and Defenses in LLM-Integrated Applications,”Computing Research Repository (CoRR) , vol. abs/2310.12815, 2023
-
[33]
LearnPrompting. (2023) Learn prompting. [Online]. Available: https://learnprompting.org/docs/category/-defensive-measures
work page 2023
-
[34]
Detecting Language Model Attacks with Perplexity
G. Alon and M. Kamfonas, “Detecting Language Model Attacks with Perplexity,” Computing Research Repository (CoRR) , vol. abs/2308.14132, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [35]
- [36]
-
[37]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
M. Abdin, J. Aneja, H. Awadalla, and A. Awadallah, “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone,” corr, Aug. 2024, arXiv:2404.14219 [cs]. [Online]. Available: http://arxiv.org/abs/2404.14219
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Gemma 2: Improving Open Language Models at a Practical Size
G. Team, M. Riviere, S. Pathak, P. G. Sessa, and C. Hardin, “Gemma 2: Improving Open Language Models at a Practical Size,” corr, Aug. 2024, arXiv:2408.00118 [cs]. [Online]. Available: http://arxiv.org/abs/2408.00118
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
A. Yang, B. Yang, B. Hui, B. Zheng, and Yu, “Qwen2 technical report,” corr, no. arXiv:2407.10671, 2024. [Online]. Available: http://arxiv.org/abs/2407.10671
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Q. Team. (2024) Qwen2.5: A party of foundation models! Section: blog. [Online]. Available: http://qwenlm.github.io/blog/qwen2.5/
work page 2024
-
[41]
SecGPT: An Execution Isolation Architecture for LLM-Based Systems,
Y . Wu, F. Roesner, T. Kohno, N. Zhang, and U. Iqbal, “SecGPT: An Execution Isolation Architecture for LLM-Based Systems,” corr, Mar. 2024, arXiv:2403.04960 [cs]. [Online]. Available: http://arxiv.org/abs/2403.04960
-
[42]
Ethical and social risks of harm from Language Models
L. Weidinger, J. Mellor, M. Rauh, C. Griffin, and J. U. et al., “Eth- ical and social risks of harm from Language Models,” Computing Research Repository (CoRR) , vol. abs/2112.04359, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[43]
Jatmo: Prompt Injection Defense by Task-Specific Finetuning,
J. Piet, M. Alrashed, C. Sitawarin, S. Chen, and Z. W. et al., “Jatmo: Prompt Injection Defense by Task-Specific Finetuning,”Computing Research Repository (CoRR) , vol. abs/2312.17673, 2023
- [44]
-
[45]
Reflection-tuning: Data recycling improves LLM instruction-tuning,
M. Li, L. Chen, J. Chen, and S. He, “Reflection-tuning: Data recycling improves LLM instruction-tuning,” corr, no. arXiv:2310.11716, 2023. [Online]. Available: http://arxiv.org/abs/ 2310.11716
- [46]
- [47]
-
[48]
Towards Deep Learning Models Resistant to Adversarial Attacks
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” corr, no. arXiv:1706.06083, 2019. [Online]. Available: http: //arxiv.org/abs/1706.06083
work page internal anchor Pith review Pith/arXiv arXiv 2019
- [49]
-
[50]
(2024) OW ASP top 10 for LLM applications
OW ASP. (2024) OW ASP top 10 for LLM applications. OW ASP. [Online]. Available: https://www.llmtop10.com
work page 2024
-
[51]
Jail- breaker: Automated jailbreak across multiple large language model chatbots,
G. Deng, Y . Liu, Y . Li, K. Wang, and Y . e. a. Zhang, “Jail- breaker: Automated jailbreak across multiple large language model chatbots,” Computing Research Repository (CoRR) , vol. abs/2307.08715, 2023
-
[52]
(ab) using images and sounds for indirect instruction injection in multi-modal llms,
E. Bagdasaryan, T.-Y . Hsieh, B. Nassi, and V . Shmatikov, “(ab) using images and sounds for indirect instruction injection in multi-modal llms,” Computing Research Repository (CoRR) , vol. abs/2307.10490, 2023
-
[53]
Beyond memorization: Violating privacy via inference with large language models,
R. Staab, M. Vero, M. Balunovi ´c, and M. Vechev, “Beyond memorization: Violating privacy via inference with large language models,” Computing Research Repository (CoRR) , 2023
work page 2023
-
[54]
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
N. Jain, A. Schwarzschild, Y . Wen, G. Somepalli, and J. K. et al., “Baseline Defenses for Adversarial Attacks Against Aligned Language Models,” Computing Research Repository (CoRR) , vol. abs/2309.00614, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
Demystifying Prompts in Language Models via Perplexity Estima- tion,
H. Gonen, S. Iyer, T. Blevins, N. A. Smith, and L. Zettlemoyer, “Demystifying Prompts in Language Models via Perplexity Estima- tion,” in Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023 . Association for Computational Linguistics, 2023, pp. 10 136–10 148
work page 2023
-
[56]
MagNet: A Two-Pronged Defense against Adversarial Examples,
D. Meng and H. Chen, “MagNet: A Two-Pronged Defense against Adversarial Examples,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017 . ACM, 2017, pp. 135–147
work page 2017
-
[57]
On Detecting Adversarial Perturbations,
J. H. Metzen, T. Genewein, V . Fischer, and B. Bischoff, “On Detecting Adversarial Perturbations,” in International Conference on Learning Representations (ICLR) , 2017
work page 2017
-
[58]
On the (Statistical) Detection of Adversarial Examples
K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. D. Mc- Daniel, “On the (Statistical) Detection of Adversarial Examples,” Computing Research Repository (CoRR) , vol. abs/1702.06280, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[59]
Towards Deep Neural Network Architec- tures Robust to Adversarial Examples,
S. Gu and L. Rigazio, “Towards Deep Neural Network Architec- tures Robust to Adversarial Examples,” inInternational Conference on Learning Representations (ICLR) , 2015
work page 2015
-
[60]
Diffusion Models for Adversarial Purification,
W. Nie, B. Guo, Y . Huang, C. Xiao, and A. V . et al., “Diffusion Models for Adversarial Purification,” in International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learn- ing Research, vol. 162. PMLR, 2022, pp. 16 805–16 827
work page 2022
-
[61]
Enhancing robustness of machine learning systems via data transformations,
A. N. Bhagoji, D. Cullina, C. Sitawarin, and P. Mittal, “Enhancing robustness of machine learning systems via data transformations,” in 52nd Annual Conference on Information Sciences and Systems, CISS 2018, Princeton, NJ, USA, March 21-23, 2018 . IEEE, 2018, pp. 1–5
work page 2018
-
[62]
Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Genera- tive Models,
P. Samangouei, M. Kabkab, and R. Chellappa, “Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Genera- tive Models,” in International Conference on Learning Represen- tations (ICLR), 2018
work page 2018
-
[63]
On the reliability of watermarks for large language models.arXiv preprint arXiv:2306.04634, 2023
J. Kirchenbauer, J. Geiping, Y . Wen, M. Shu, and K. S. et al., “On the Reliability of Watermarks for Large Language Models,” Com- puting Research Repository (CoRR) , vol. abs/2306.04634, 2023
-
[64]
FreeLB: En- hanced Adversarial Training for Natural Language Understanding,
C. Zhu, Y . Cheng, Z. Gan, S. Sun, and T. G. et al., “FreeLB: En- hanced Adversarial Training for Natural Language Understanding,” in International Conference on Learning Representations (ICLR) , 2020
work page 2020
-
[65]
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: De- fending large language models against jailbreaking attacks,” Com- puting Research Repository (CoRR) , vol. abs/2310.03684, 2023. Appendix A. Data Availability All code, the generated and used datasets, and instruc- tions on how to reproduce our results are published at: blinded for submissio...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.