What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents

Xiaoqi Weng

arxiv: 2606.02668 · v1 · pith:3QUIPO52new · submitted 2026-06-01 · 💻 cs.CR · cs.HC

Pith reviewed 2026-06-28 14:10 UTC · model grok-4.3

classification 💻 cs.CR cs.HC

keywords consent integrity · LLM agents · human approval · trusted mediator · black-box · action forgery · boundary event

The pith

Consent Integrity requires a trusted mediator to render the real boundary action for human approval rather than accepting the agent's own narration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Consent Integrity to close a gap in human-in-the-loop approvals for black-box LLM agents. Current dialogs let the agent write its own summary, which a compromised agent can falsify to obtain approval for one action while executing another. The property instead demands that the displayed action be produced by a trusted mediator reading the actual boundary event over an unspoofable path and then bound to execution. Because any decoder will leave some events unclassifiable, the design surfaces those cases as uninspectable instead of approving them. Measurements on separate abuse and normal-usage command sets locate the resulting trade-off between silent passes and excessive prompts.

Core claim

Consent Integrity requires that the action shown to the human must be rendered by a trusted mediator from the real action at the boundary, not the agent's narration, over a path the agent cannot spoof, and bound to the exact action that executes. Since no decoder is complete, the realizable target is analyzer-relative: whatever the analyzer cannot classify is surfaced as uninspectable rather than silently approved. A prototype implements the analyzer, renderer, and bind-to-execution; total mediation and the trusted path are specified but assumed, not implemented.
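
To make the bind-to-execution requirement concrete, here is a minimal Python sketch. It is not the paper's prototype. It assumes the boundary action is an argv list and that the mediator holds a key the agent cannot read. The names `MEDIATOR_KEY`, `action_digest`, `render`, `approve`, and `may_execute` are hypothetical and introduced only for illustration.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import List

# Assumption: this key lives inside the mediator and is never visible to the agent.
MEDIATOR_KEY = secrets.token_bytes(32)


@dataclass(frozen=True)
class Approval:
    digest: str  # SHA-256 over the exact boundary action
    tag: str     # HMAC over the digest, issued only after the human approves


def action_digest(argv: List[str]) -> str:
    # Length-prefix each argument so ["a b"] and ["a", "b"] hash differently.
    h = hashlib.sha256()
    for arg in argv:
        data = arg.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def render(argv: List[str]) -> str:
    # The display string is derived from the real argv, never from agent narration.
    return " ".join(repr(a) for a in argv)


def approve(argv: List[str]) -> Approval:
    # Called by the mediator once the human has accepted render(argv).
    digest = action_digest(argv)
    tag = hmac.new(MEDIATOR_KEY, digest.encode(), hashlib.sha256).hexdigest()
    return Approval(digest, tag)


def may_execute(argv: List[str], approval: Approval) -> bool:
    # Recompute the digest from the action about to run and check it against the approval.
    digest = action_digest(argv)
    expected = hmac.new(MEDIATOR_KEY, digest.encode(), hashlib.sha256).hexdigest()
    return (hmac.compare_digest(digest, approval.digest)
            and hmac.compare_digest(expected, approval.tag))
```

In this sketch an approval issued for `["ls", "-la"]` does not authorize any other argv, because the digest is recomputed from the action that is about to run. The guarantee still depends on total mediation and the trusted path, which the paper assumes rather than implements.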

What carries the argument

Analyzer-relative decoder that classifies boundary events into inspectable actions or uninspectable ones, together with trusted rendering and execution binding.
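
The analyzer-relative idea can be sketched as a classifier that never falls through to approval. The Python below is an illustrative toy, not the paper's analyzer. The trust list contents, the list of opaque shell constructs, and the names `TRUSTED_TOOLS`, `OPAQUE_TOKENS`, `Verdict`, and `analyze` are all assumptions made for this example.

```python
import shlex
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical trust list; the paper's actual list is not given in this text.
TRUSTED_TOOLS: FrozenSet[str] = frozenset({"ls", "cat", "git"})
# Shell constructs this toy decoder refuses to interpret.
OPAQUE_TOKENS = ("|", ";", "&", "$(", "`", ">", "<")


@dataclass(frozen=True)
class Verdict:
    inspectable: bool
    rendering: str  # text the mediator would show to the human


def analyze(command: str, trusted: FrozenSet[str] = TRUSTED_TOOLS) -> Verdict:
    # Every path that fails to classify returns an explicit UNINSPECTABLE verdict.
    try:
        argv = shlex.split(command)
    except ValueError:
        return Verdict(False, "UNINSPECTABLE: could not tokenize")
    if not argv:
        return Verdict(False, "UNINSPECTABLE: empty command")
    if any(tok in command for tok in OPAQUE_TOKENS):
        return Verdict(False, "UNINSPECTABLE: shell construct not decoded")
    if argv[0] not in trusted:
        return Verdict(False, f"UNINSPECTABLE: unknown tool {argv[0]!r}")
    return Verdict(True, f"run {argv[0]} with args {argv[1:]}")
```

The design choice being illustrated is the default: anything the decoder does not understand is surfaced to the human as uninspectable, so an incomplete decoder costs extra prompts rather than silent approvals.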

Load-bearing premise

A trusted mediator providing total mediation and an unspoofable path from boundary event to human display can be realized in practice.

What would settle it

An implemented mediator in which an agent still succeeds in executing an action different from the one rendered and approved by the human.

Figures

Figures reproduced from arXiv: 2606.02668 by Xiaoqi Weng.

**Figure 2.** Ground-truth rendering of an obfuscated remote-execution payload.

Original abstract

Coding agents gate consequential actions behind a human-in-the-loop approval dialog, but the dialog is narrated by the agent itself: the human approves a summary the agent writes. The Lies-in-the-Loop (LITL) attack shows that summary is forgeable, so a compromised agent can show a benign description while a different action runs. This paper names the missing property, Consent Integrity, by importing What You See Is What You Sign (WYSIWYS) and the trusted-path property into the agent approval channel: the action shown to the human must be rendered by a trusted mediator from the real action at the boundary, not the agent's narration, over a path the agent cannot spoof, and bound to the exact action that executes. Two twists distinguish it from classical WYSIWYS: the renderer is the adversary, and the boundary ground truth is a low-level event that must be decoded without trusting the agent. Since no decoder is complete, the realizable target is analyzer-relative: whatever the analyzer cannot classify is surfaced as uninspectable rather than silently approved. A prototype implements the analyzer, renderer, and bind-to-execution; total mediation and the trusted path are specified but assumed, not implemented. On GTFOBins, an independent corpus of 1330 trusted-tool abuses, the prototype silently passes 10.0% (every instance through a trusted tool); on tldr, 28,798 normal-usage commands, it marks 87.0% uninspectable. These two independent measurements bracket the design's central tension: the trust list that bounds silent passes is the same one that drives over-prompting, and a boundary-only mediator can move along that frontier but not escape it. The contribution is the property, the mechanism, and an honest position on that frontier, not a solved defense.
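
The abstract reports percentages rather than raw counts. As a reading aid, the short Python sketch below derives the counts those rounded percentages imply. These are back-calculated values, not figures taken from the paper, and the function names are hypothetical.

```python
import math


def implied_count(total: int, percent: float) -> int:
    # Point estimate: the count closest to the reported percentage.
    return round(total * percent / 100)


def consistent_counts(total: int, percent: float, decimals: int = 1):
    # Range of integer counts whose percentage would round to the reported value.
    half_step = 0.5 * 10 ** (-decimals)
    low = math.ceil(total * (percent - half_step) / 100)
    high = math.floor(total * (percent + half_step) / 100)
    return low, high


gtfobins_silent = implied_count(1330, 10.0)
tldr_uninspectable = implied_count(28798, 87.0)
```

Under these assumptions, 10.0% of 1330 corresponds to 133 silent passes on GTFOBins, and 87.0% of 28,798 corresponds to roughly 25,054 uninspectable tldr commands; any count from 25,040 to 25,068 would also round to 87.0%.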

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper names Consent Integrity as the missing property for LLM agent approvals and measures the trust-list trade-off, but leaves the trusted mediator unbuilt.

The main takeaway is that this paper defines Consent Integrity: the human must see the real boundary action rendered by a trusted mediator over an unspoofable path and bound to execution, not an agent-written summary. It imports WYSIWYS with the twists that the renderer is adversarial and the ground truth is low-level, then sets an analyzer-relative target since no decoder is complete.

The work does well by stating the property clearly and reporting concrete results on two independent sets: 10% silent passes on GTFOBins and 87% uninspectable on tldr. Those numbers illustrate the frontier without overclaiming. The paper is also direct that the prototype only implements the analyzer, renderer, and binding; total mediation and the trusted path are assumed.

The soft spot is exactly that assumption. Without the mediator and path built and tested, the security guarantee reduces to the analyzer's classification behavior under external conditions the agent is presumed unable to affect. The measurements evaluate only that piece.

This is for researchers working on human-in-the-loop controls for coding agents. Readers who need a crisp property definition and some data on the classification limits will find it useful. It has enough structure and reproducible measurements to deserve peer review.

Referee Report

2 major / 0 minor

Summary. The manuscript defines Consent Integrity for black-box LLM agents as the requirement that the action presented for human approval must be rendered by a trusted mediator from the actual boundary action (not the agent's narration), over a path the agent cannot spoof, and bound to the executing action. It distinguishes this from classical WYSIWYS by noting the renderer is adversarial and the ground truth is a low-level event requiring decoding. A prototype implements the analyzer, renderer, and execution binding, while assuming total mediation and the trusted path. Evaluations on the GTFOBins corpus of 1330 trusted-tool abuses report a 10.0% silent pass rate, and on the tldr corpus of 28,798 normal-usage commands report 87.0% marked uninspectable, illustrating the trade-off governed by the trust list.

Significance. This paper contributes a precisely defined security property extending trusted-path concepts to LLM agent approval channels, along with a prototype and two independent empirical measurements that bracket the practical frontier between silent passes and over-prompting. The explicit acknowledgment that key components remain assumed rather than implemented provides a clear scoping of the result. The measurements on independent GTFOBins and tldr corpora are a strength in grounding the discussion of the trust-list trade-off.

major comments (2)

[Prototype Implementation] The definition of Consent Integrity requires a trusted mediator providing total mediation and an unspoofable path from the boundary event to human display. The manuscript states these are 'specified but assumed, not implemented.' This assumption is load-bearing because the GTFOBins (10.0% silent pass) and tldr (87.0% uninspectable) measurements evaluate only the analyzer's classification behavior under that assumption and do not test enforcement against a compromised agent controlling the execution environment.

[Evaluation] The paper positions the results as bracketing the design's central tension via the trust list, but does not provide a concrete test or bound showing that a realizable mediator can enforce the unspoofable path; without this, the security claim remains conditional on an external component whose realizability is not demonstrated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below, noting that the manuscript already explicitly scopes the work to the implemented analyzer while assuming the mediator components.

Referee: Prototype Implementation: The definition of Consent Integrity requires a trusted mediator providing total mediation and an unspoofable path from the boundary event to human display. The manuscript states these are 'specified but assumed, not implemented.' This assumption is load-bearing because the GTFOBins (10.0% silent pass) and tldr (87.0% uninspectable) measurements only evaluate the analyzer's classification behavior under the assumption and do not test enforcement against a compromised agent controlling the execution environment.

Authors: We agree that the measurements characterize the analyzer's behavior under the assumptions of total mediation and trusted path. The manuscript is explicit about this scoping in the abstract, introduction, and prototype section, stating that only the analyzer, renderer, and execution binding are implemented. The GTFOBins and tldr evaluations measure classification accuracy and the trust-list trade-off on independent corpora; they are not presented as enforcement tests against an adversarial agent. No revision is required, as the limitations are already stated.
revision: no

Referee: Evaluation sections: The paper positions the results as bracketing the design's central tension via the trust list, but does not provide a concrete test or bound showing that a realizable mediator can enforce the unspoofable path; without this, the security claim remains conditional on an external component whose realizability is not demonstrated.

Authors: The security claims are conditional on the trusted mediator, as the manuscript acknowledges by stating these components are assumed. The contribution centers on defining Consent Integrity, the analyzer mechanism, and empirical measurements that bracket the trust-list trade-off. Demonstrating a full realizable mediator lies outside the paper's scope, which instead supplies the property definition and analyzer as a foundation for such implementations. The results remain valid within this explicit scoping.
revision: no

Circularity Check

0 steps flagged

No significant circularity; property defined via external WYSIWYS import

The paper's central step is definitional: Consent Integrity is named by importing the classical WYSIWYS and trusted-path properties into the approval channel, with two explicit twists (renderer as adversary; boundary ground truth as low-level event). No equations, fitted parameters, or self-citations appear in the provided text. Measurements rely on independent external corpora (GTFOBins, 1330 instances; tldr, 28,798 commands) rather than any fitted subset. The paper explicitly flags that total mediation and the trusted path are assumed rather than implemented or derived. The derivation chain therefore remains self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that a trusted mediator and total mediation are realizable; the trust list is the free parameter that trades silent passes against over-prompting; Consent Integrity is the newly postulated property.

free parameters (1)

trust list
The trust list bounds which actions pass silently and simultaneously drives the rate of uninspectable normal commands; the paper states this is the central tension.
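
The tension described here can be illustrated with a toy calculation. The Python sketch below uses made-up tool names, not GTFOBins or tldr data, and models the simplest reading of the trade-off: an abuse sample passes silently if its tool is trusted, and a normal command is marked uninspectable if its tool is not. `ABUSE_TOOLS`, `NORMAL_TOOLS`, and `rates` are hypothetical.

```python
# Hypothetical samples: the tool invoked by each abuse case and each normal command.
ABUSE_TOOLS = ["tar", "find", "awk", "curl"]
NORMAL_TOOLS = ["ls", "git", "tar", "find", "ffmpeg", "jq"]


def rates(trust_list):
    # Returns (silent-pass rate on abuse samples, uninspectable rate on normal samples).
    trusted = set(trust_list)
    silent = sum(tool in trusted for tool in ABUSE_TOOLS) / len(ABUSE_TOOLS)
    uninspectable = sum(tool not in trusted for tool in NORMAL_TOOLS) / len(NORMAL_TOOLS)
    return silent, uninspectable


# Growing the trust list lowers over-prompting but can only raise silent passes.
frontier = [rates(t) for t in ([], ["ls", "git"], ["ls", "git", "tar", "find"])]
```

In this toy model the two rates move in opposite directions as the list grows, which is the frontier the paper says a boundary-only mediator can move along but not escape.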

axioms (1)

domain assumption: A trusted mediator providing total mediation and an unspoofable path from boundary event to human display exists and can be implemented.
Invoked in the definition of Consent Integrity and stated as assumed rather than implemented in the prototype description.

invented entities (1)

Consent Integrity property (no independent evidence)
purpose: To name and require the missing guarantee that approved action equals executed action via trusted rendering.
Newly defined property that the paper imports from WYSIWYS with two twists for the LLM-agent setting.

pith-pipeline@v0.9.1-grok · 5860 in / 1665 out tokens · 43237 ms · 2026-06-28T14:10:14.275630+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

One Goal, Many Commands: Characterizing Denylist Fragility in AI Agents
cs.CR 2026-06 unverdicted novelty 7.0

ShellSieve, an LLM-driven pipeline, detects command denylist fragility in terminal AI agents and finds 69.0-98.6% of 1,709 GitHub-collected denylists to be bypassable.

Reference graph

Works this paper leans on

42 extracted references · 13 linked inside Pith · cited by 1 Pith paper


[1] [1]

Measuring the permission gate: A stress-test evaluation of Claude Code’s auto mode,

Z. Ji, Z. Li, W. Jiang, Y . Gao, and S. Wang, “Measuring the permission gate: A stress-test evaluation of Claude Code’s auto mode,” 2026, arXiv:2604.04978

Pith/arXiv arXiv 2026

[2] [2]

Claude code auto mode: Delegating approvals to model- based classifiers,

Anthropic, “Claude code auto mode: Delegating approvals to model- based classifiers,” https://www.anthropic.com/engineering/claude-cod e-auto-mode, 2026

2026

[3] [3]

Turning AI safeguards into weapons with HITL dialog forging (lies-in-the-loop),

Checkmarx Zero Research Team, “Turning AI safeguards into weapons with HITL dialog forging (lies-in-the-loop),” https://checkmarx.com/ze ro-post/turning-ai-safeguards-into-weapons-with-hitl-dialog-forging/, 2025

2025

[4] [4]

Not what you’ve signed up for: Compromising real-world LLM- integrated applications with indirect prompt injection,

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world LLM- integrated applications with indirect prompt injection,” inACM Work- shop on Artificial Intelligence and Security (AISec), 2023, pp. 79–90, arXiv:2302.12173

Pith/arXiv arXiv 2023

[5] [5]

Robust WYSIWYS: A method for en- suring that what you see is what you sign,

A. Jø sang and B. AlFayyadh, “Robust WYSIWYS: A method for en- suring that what you see is what you sign,” inAustralasian Information Security Conference (AISC), CRPIT vol. 81, 2008, pp. 53–58

2008

[6] [6]

Digital signatures and electronic documents: A cautionary tale,

K. Kain, S. W. Smith, and R. Asokan, “Digital signatures and electronic documents: A cautionary tale,” inIFIP Conference on Communications and Multimedia Security (CMS), 2002

2002

[7] [7]

Building ver- ifiable trusted path on commodity x86 computers,

Z. Zhou, V . D. Gligor, J. Newsome, and J. M. McCune, “Building ver- ifiable trusted path on commodity x86 computers,” inIEEE Symposium on Security and Privacy (S&P), 2012

2012

[8] [8]

Universal and transferable adversarial attacks on aligned language models,

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” 2023, arXiv:2307.15043

Pith/arXiv arXiv 2023

[9] [9]

Prompt injection attack against LLM-integrated applications,

Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zheng, and Y . Liu, “Prompt injection attack against LLM-integrated applications,” 2023, arXiv:2306.05499

Pith/arXiv arXiv 2023

[10] [10]

The lethal trifecta for AI agents: Private data, untrusted content, and external communication,

S. Willison, “The lethal trifecta for AI agents: Private data, untrusted content, and external communication,” https://simonwillison.net/2025/J un/16/the-lethal-trifecta/, 2025

2025

[11] [11]

Silent egress: When implicit prompt injection makes LLM agents leak without a trace,

Q. Lan, A. Kaul, S. Jones, and S. Westrum, “Silent egress: When implicit prompt injection makes LLM agents leak without a trace,” 2026, arXiv:2602.22450

arXiv 2026

[12] [12]

MCPTox: A benchmark for tool poisoning attack on real- world MCP servers,

Z. Wang, Y . Gao, Y . Wang, S. Liu, H. Sun, H. Cheng, G. Shi, H. Du, and X. Li, “MCPTox: A benchmark for tool poisoning attack on real- world MCP servers,” inAAAI Conference on Artificial Intelligence, 2026, arXiv:2508.14925

arXiv 2026

[13] [13]

Defeating prompt injections by design,

E. Debenedetti, I. Shumailov, T. Fan, J. Hayes, N. Carlini, D. Fabian, C. Kern, C. Shi, A. Terzis, and F. Tram `er, “Defeating prompt injections by design,” inIEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2026, arXiv:2503.18813

Pith/arXiv arXiv 2026

[14] [14]

Securing AI agents with information-flow control,

M. Costa, B. K ¨opf, A. Kolluri, A. Paverd, M. Russinovich, A. Salem, S. Tople, L. Wutschitz, and S. Zanella-B ´eguelin, “Securing AI agents with information-flow control,” 2025, arXiv:2505.23643

Pith/arXiv arXiv 2025

[15] [15]

Progent: Programmable privilege control for LLM agents,

T. Shi, J. He, Z. Wang, L. Wu, H. Li, W. Guo, and D. Song, “Progent: Programmable privilege control for LLM agents,” 2025, arXiv:2504.11703

Pith/arXiv arXiv 2025

[16] [16]

Prompt flow integrity to prevent privilege escalation in LLM agents,

J. Kim, W. Choi, and B. Lee, “Prompt flow integrity to prevent privilege escalation in LLM agents,” 2025, arXiv:2503.15547

arXiv 2025

[17] [17]

IsolateGPT: An execution isolation architecture for LLM-based agentic systems,

Y . Wu, F. Roesner, T. Kohno, N. Zhang, and U. Iqbal, “IsolateGPT: An execution isolation architecture for LLM-based agentic systems,” inNetwork and Distributed System Security Symposium (NDSS), 2025, arXiv:2403.04960

arXiv 2025

[18] [18]

Design patterns for securing LLM agents against prompt injections,

L. Beurer-Kellner, B. Buesser, A.-M. Cret ¸u, E. Debenedetti, D. Dobos, D. Fabian, M. Fischer, D. Froelicher, K. Grosse, D. Naeff, E. Ozoani, A. Paverd, F. Tram `er, and V . V olhejn, “Design patterns for securing LLM agents against prompt injections,” 2025, arXiv:2506.08837

arXiv 2025

[19] [19]

AgentSpec: Customizable run- time enforcement for safe and reliable LLM agents,

H. Wang, C. M. Poskitt, and J. Sun, “AgentSpec: Customizable run- time enforcement for safe and reliable LLM agents,” inIEEE/ACM International Conference on Software Engineering (ICSE), 2026, arXiv:2503.18666

Pith/arXiv arXiv 2026

[20] [20]

The instruction hierarchy: Training LLMs to prioritize privileged in- structions,

E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training LLMs to prioritize privileged in- structions,” 2024, arXiv:2404.13208

Pith/arXiv arXiv 2024

[21] [21]

StruQ: Defending against prompt injection with structured queries,

S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “StruQ: Defending against prompt injection with structured queries,” inUSENIX Security Symposium, 2025, arXiv:2402.06363

arXiv 2025

[22] [22]

SecAlign: Defending against prompt injection with preference optimization,

S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaudhuri, D. Wag- ner, and C. Guo, “SecAlign: Defending against prompt injection with preference optimization,” inACM SIGSAC Conference on Computer and Communications Security (CCS), 2025, arXiv:2410.05451

arXiv 2025

[23] [23]

AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,

E. Debenedetti, J. Zhang, M. Balunovi ´c, L. Beurer-Kellner, M. Fischer, and F. Tram`er, “AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,” inAdvances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024, arXiv:2406.13352

Pith/arXiv arXiv 2024

[24]

InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents,

Q. Zhan, Z. Liang, Z. Ying, and D. Kang, “InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents,” in Findings of the Association for Computational Linguistics (ACL), 2024, arXiv:2403.02691

Pith/arXiv arXiv 2024

[25]

Identifying the risks of LM agents with an LM-emulated sandbox,

Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois, C. J. Maddison, and T. Hashimoto, “Identifying the risks of LM agents with an LM-emulated sandbox,” in International Conference on Learning Representations (ICLR), 2024, arXiv:2309.15817

Pith/arXiv arXiv 2024

[26]

R-Judge: Benchmarking safety risk awareness for LLM agents,

T. Yuan et al., “R-Judge: Benchmarking safety risk awareness for LLM agents,” 2024, arXiv:2401.10019

arXiv 2024

[27]

Overeager coding agents: Measuring out-of-scope actions on benign tasks,

Y. Qu, Y. Zhang, Y. Zhang, G. Deng, Y. Li, L. Y. Zhang, and Y. Liu, “Overeager coding agents: Measuring out-of-scope actions on benign tasks,” 2026, arXiv:2605.18583

Pith/arXiv arXiv 2026

[28]

Prompt injection attacks on agentic coding assistants: A systematic analysis of vulnerabilities in skills, tools, and protocol ecosystems,

N. Maloyan and D. Namiot, “Prompt injection attacks on agentic coding assistants: A systematic analysis of vulnerabilities in skills, tools, and protocol ecosystems,” 2026, arXiv:2601.17548

arXiv 2026

[29]

SoK: The attack surface of agentic AI — tools, and autonomy,

A. Dehghantanha and S. Homayoun, “SoK: The attack surface of agentic AI — tools, and autonomy,” 2026, arXiv:2603.22928

arXiv 2026

[30]

The landscape of prompt injection threats in LLM agents: From taxonomy to analysis,

P. Wang, X. Li, C. Xiang, J. Zhang, Y. Li, L. Zhang, X. Wang, and Y. Tian, “The landscape of prompt injection threats in LLM agents: From taxonomy to analysis,” 2026, arXiv:2602.10453

arXiv 2026

[31]

Clickjacking: Attacks and defenses,

L.-S. Huang, A. Moshchuk, H. J. Wang, S. Schechter, and C. Jackson, “Clickjacking: Attacks and defenses,” in USENIX Security Symposium, 2012

2012

[32]

Cloak and dagger: From two permissions to complete control of the UI feedback loop,

Y. Fratantonio, C. Qian, S. P. Chung, and W. Lee, “Cloak and dagger: From two permissions to complete control of the UI feedback loop,” in IEEE Symposium on Security and Privacy (S&P), 2017

2017

[33]

A lattice model of secure information flow,

D. E. Denning, “A lattice model of secure information flow,” Communications of the ACM, vol. 19, no. 5, pp. 236–243, 1976

1976

[34]

A decentralized model for information flow control,

A. C. Myers and B. Liskov, “A decentralized model for information flow control,” in ACM Symposium on Operating Systems Principles (SOSP), 1997

1997

[35]

Language-based information-flow security,

A. Sabelfeld and A. C. Myers, “Language-based information-flow security,” IEEE Journal on Selected Areas in Communications, vol. 21, no. 1, pp. 5–19, 2003

2003

[36]

Labels and event processes in the Asbestos operating system,

P. Efstathopoulos, M. Krohn, S. VanDeBogart, C. Frey, D. Ziegler, E. Kohler, D. Mazières, F. Kaashoek, and R. Morris, “Labels and event processes in the Asbestos operating system,” in ACM Symposium on Operating Systems Principles (SOSP), 2005

2005

[37]

Making information flow explicit in HiStar,

N. Zeldovich, S. Boyd-Wickizer, E. Kohler, and D. Mazières, “Making information flow explicit in HiStar,” in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006

2006

[38]

Information flow control for standard OS abstractions,

M. Krohn, A. Yip, M. Brodsky, N. Cliffer, M. F. Kaashoek, E. Kohler, and R. Morris, “Information flow control for standard OS abstractions,” in ACM Symposium on Operating Systems Principles (SOSP), 2007

2007

[39]

TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones,

W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth, “TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones,” in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010

2010

[40]

Capsicum: Practical capabilities for UNIX,

R. N. M. Watson, J. Anderson, B. Laurie, and K. Kennaway, “Capsicum: Practical capabilities for UNIX,” in USENIX Security Symposium, 2010

2010

[41]

OWASP top 10 for LLM applications,

OWASP, “OWASP top 10 for LLM applications,” https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2025

2025

[42]

mcp-scan: Scanning and constraining MCP connections for security vulnerabilities,

Invariant Labs, “mcp-scan: Scanning and constraining MCP connections for security vulnerabilities,” https://github.com/invariantlabs-ai/mcp-scan, 2025

2025