What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents
Pith reviewed 2026-06-28 14:10 UTC · model grok-4.3
The pith
Consent Integrity requires a trusted mediator to render the real boundary action for human approval rather than accepting the agent's own narration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Consent Integrity requires that the action shown to the human must be rendered by a trusted mediator from the real action at the boundary, not the agent's narration, over a path the agent cannot spoof, and bound to the exact action that executes. Since no decoder is complete, the realizable target is analyzer-relative: whatever the analyzer cannot classify is surfaced as uninspectable rather than silently approved. A prototype implements the analyzer, renderer, and bind-to-execution; total mediation and the trusted path are specified but assumed, not implemented.
What carries the argument
An analyzer-relative decoder that classifies each boundary event as either an inspectable action or an uninspectable one, together with trusted rendering and execution binding.
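As a rough illustration of how the three implemented pieces (analyzer, renderer, bind-to-execution) could fit together, the Python sketch below decodes a boundary command, renders the display text from the command itself rather than from any agent-written summary, and binds approval to a digest of the exact command. Every name in it (`KNOWN_PROGRAMS`, `SHELL_META`, `mediate`, `execute_if_bound`) is hypothetical and not taken from the paper's prototype. Like the paper, it assumes total mediation and a trusted display path exist outside this code.

```python
import hashlib
import shlex
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    INSPECTABLE = "inspectable"
    UNINSPECTABLE = "uninspectable"


# Hypothetical decoder table: program name -> display template.
KNOWN_PROGRAMS = {
    "rm": "delete files: {args}",
    "curl": "make a network request: {args}",
}

# Hypothetical set of characters the toy analyzer refuses to interpret.
SHELL_META = set(";|&$`><(){}\n")


@dataclass(frozen=True)
class Approval:
    verdict: Verdict
    display: str   # text shown to the human, derived from the real command
    digest: str    # SHA-256 of the exact command that was rendered


def _digest(raw_command: str) -> str:
    return hashlib.sha256(raw_command.encode("utf-8")).hexdigest()


def mediate(raw_command: str) -> Approval:
    """Decode the boundary command; surface anything undecodable."""
    try:
        argv = shlex.split(raw_command)
    except ValueError:  # e.g. unbalanced quotes
        argv = []
    decodable = (
        bool(argv)
        and argv[0] in KNOWN_PROGRAMS
        and not (SHELL_META & set(raw_command))
    )
    if not decodable:
        # Never approve silently: flag the raw text as uninspectable.
        text = f"UNINSPECTABLE: {raw_command!r}"
        return Approval(Verdict.UNINSPECTABLE, text, _digest(raw_command))
    text = KNOWN_PROGRAMS[argv[0]].format(args=" ".join(argv[1:]))
    return Approval(Verdict.INSPECTABLE, text, _digest(raw_command))


def execute_if_bound(approval: Approval, raw_command: str) -> bool:
    """Permit execution only for the exact command that was approved."""
    return _digest(raw_command) == approval.digest
```

In this sketch, `mediate("rm -rf build")` yields an inspectable approval, and `execute_if_bound` rejects any command whose bytes differ from the one rendered. A real mediator would need a far richer decoder and an enforcement point the agent cannot bypass.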
Load-bearing premise
A trusted mediator providing total mediation and an unspoofable path from boundary event to human display can be realized in practice.
What would settle it
An implemented mediator in which an agent still succeeds in executing an action different from the one rendered and approved by the human.
Original abstract
Coding agents gate consequential actions behind a human-in-the-loop approval dialog, but the dialog is narrated by the agent itself: the human approves a summary the agent writes. The Lies-in-the-Loop (LITL) attack shows that summary is forgeable, so a compromised agent can show a benign description while a different action runs. This paper names the missing property, Consent Integrity, by importing What You See Is What You Sign (WYSIWYS) and the trusted-path property into the agent approval channel: the action shown to the human must be rendered by a trusted mediator from the real action at the boundary, not the agent's narration, over a path the agent cannot spoof, and bound to the exact action that executes. Two twists distinguish it from classical WYSIWYS: the renderer is the adversary, and the boundary ground truth is a low-level event that must be decoded without trusting the agent. Since no decoder is complete, the realizable target is analyzer-relative: whatever the analyzer cannot classify is surfaced as uninspectable rather than silently approved. A prototype implements the analyzer, renderer, and bind-to-execution; total mediation and the trusted path are specified but assumed, not implemented. On GTFOBins, an independent corpus of 1330 trusted-tool abuses, the prototype silently passes 10.0% (every instance through a trusted tool); on tldr, 28,798 normal-usage commands, it marks 87.0% uninspectable. These two independent measurements bracket the design's central tension: the trust list that bounds silent passes is the same one that drives over-prompting, and a boundary-only mediator can move along that frontier but not escape it. The contribution is the property, the mechanism, and an honest position on that frontier, not a solved defense.
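The abstract reports rates rather than counts. As a small worked check, the Python sketch below computes which integer counts are consistent with each reported percentage, assuming (this is an assumption, not something the abstract states) that the percentages are rounded to one decimal place. The helper name `count_range` is hypothetical.

```python
import math


def count_range(total: int, percent: float, decimals: int = 1) -> tuple[int, int]:
    """Smallest and largest integer counts that round to `percent` of `total`.

    Assumes `percent` was rounded to `decimals` decimal places.
    """
    half_step = 0.5 * 10 ** (-decimals)
    low = math.ceil(total * (percent - half_step) / 100)
    high = math.floor(total * (percent + half_step) / 100)
    return low, high


# GTFOBins: 10.0% of 1330 silently passed.
gtfobins = count_range(1330, 10.0)

# tldr: 87.0% of 28,798 marked uninspectable.
tldr = count_range(28798, 87.0)

print(gtfobins, tldr)
```

Under that rounding assumption, the GTFOBins figure corresponds to 133 silently passed instances, while the tldr figure is compatible with anywhere from 25,040 to 25,068 commands marked uninspectable.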
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript defines Consent Integrity for black-box LLM agents as the requirement that the action presented for human approval must be rendered by a trusted mediator from the actual boundary action (not the agent's narration), over a path the agent cannot spoof, and bound to the executing action. It distinguishes this from classical WYSIWYS by noting the renderer is adversarial and the ground truth is a low-level event requiring decoding. A prototype implements the analyzer, renderer, and execution binding, while assuming total mediation and the trusted path. Evaluations on the GTFOBins corpus of 1330 trusted-tool abuses report a 10.0% silent pass rate, and on the tldr corpus of 28,798 normal-usage commands report 87.0% marked uninspectable, illustrating the trade-off governed by the trust list.
Significance. This paper contributes a precisely defined security property extending trusted-path concepts to LLM agent approval channels, along with a prototype and two independent empirical measurements that bracket the practical frontier between silent passes and over-prompting. The explicit acknowledgment that key components remain assumed rather than implemented provides a clear scoping of the result. The measurements on independent GTFOBins and tldr corpora are a strength in grounding the discussion of the trust-list trade-off.
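The trust-list trade-off can be made concrete with a toy model. The Python sketch below assumes a deliberately simplified analyzer in which a command passes silently exactly when its first word is on the trust list and is prompted otherwise. That rule, the function name `rates`, and both miniature corpora are illustrative assumptions, not the paper's analyzer or data.

```python
def rates(trust_list, abuse_cmds, benign_cmds):
    """Return (silent-pass rate on abuses, prompt rate on benign commands)."""

    def tool(cmd):
        # Toy decoding: the tool is the first whitespace-separated word.
        return cmd.split()[0]

    silent = sum(tool(c) in trust_list for c in abuse_cmds) / len(abuse_cmds)
    prompted = sum(tool(c) not in trust_list for c in benign_cmds) / len(benign_cmds)
    return silent, prompted


# Hypothetical miniature corpora, written in the style of trusted-tool
# abuses and ordinary usage examples.
abuse = [
    "tar -cf /dev/null x --checkpoint=1 --checkpoint-action=exec=/bin/sh",
    "awk 'BEGIN {system(\"/bin/sh\")}'",
    "find . -exec /bin/sh \\;",
    "vim -c ':!/bin/sh'",
]
benign = [
    "tar -xf a.tar",
    "ls -la",
    "find . -name x",
    "git status",
    "awk '{print $1}' f",
]

small = rates({"ls"}, abuse, benign)
large = rates({"ls", "tar", "find", "git"}, abuse, benign)
print(small, large)
```

With the one-entry list, no abuse passes silently but four of the five benign commands prompt. With the four-entry list, prompting drops to one in five while half of the abuses pass silently. Moving along that frontier without escaping it is the behavior the paper attributes to a boundary-only mediator.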
major comments (2)
- [Prototype Implementation] The definition of Consent Integrity requires a trusted mediator providing total mediation and an unspoofable path from the boundary event to the human display. The manuscript states these are 'specified but assumed, not implemented.' This assumption is load-bearing: the GTFOBins (10.0% silent pass) and tldr (87.0% uninspectable) measurements evaluate only the analyzer's classification behavior under that assumption and do not test enforcement against a compromised agent that controls the execution environment.
- [Evaluation] The paper positions the results as bracketing the design's central tension via the trust list, but it provides no concrete test or bound showing that a realizable mediator can enforce the unspoofable path. Without one, the security claim remains conditional on an external component whose realizability is not demonstrated.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below, noting that the manuscript already explicitly scopes the work to the implemented analyzer while assuming the mediator components.
- Referee: Prototype Implementation: The definition of Consent Integrity requires a trusted mediator providing total mediation and an unspoofable path from the boundary event to human display. The manuscript states these are 'specified but assumed, not implemented.' This assumption is load-bearing because the GTFOBins (10.0% silent pass) and tldr (87.0% uninspectable) measurements only evaluate the analyzer's classification behavior under the assumption and do not test enforcement against a compromised agent controlling the execution environment.
Authors: We agree that the measurements characterize the analyzer's behavior under the assumptions of total mediation and trusted path. The manuscript is explicit about this scoping in the abstract, introduction, and prototype section, stating that only the analyzer, renderer, and execution binding are implemented. The GTFOBins and tldr evaluations measure classification accuracy and the trust-list trade-off on independent corpora; they are not presented as enforcement tests against an adversarial agent. No revision is required, as the limitations are already stated.
Revision: no
- Referee: Evaluation sections: The paper positions the results as bracketing the design's central tension via the trust list, but does not provide a concrete test or bound showing that a realizable mediator can enforce the unspoofable path; without this, the security claim remains conditional on an external component whose realizability is not demonstrated.
Authors: The security claims are conditional on the trusted mediator, as the manuscript acknowledges by stating these components are assumed. The contribution centers on defining Consent Integrity, the analyzer mechanism, and empirical measurements that bracket the trust-list trade-off. Demonstrating a full realizable mediator lies outside the paper's scope, which instead supplies the property definition and analyzer as a foundation for such implementations. The results remain valid within this explicit scoping.
Revision: no
Circularity Check
No significant circularity; property defined via external WYSIWYS import
The paper's central step is definitional: Consent Integrity is named by importing the classical WYSIWYS and trusted-path properties into the approval channel, with two explicit twists (renderer as adversary; boundary ground truth as low-level event). No equations, fitted parameters, or self-citations appear in the provided text. Measurements rely on independent external corpora (GTFOBins 1330 instances, tldr 28798 commands) rather than any fitted subset. The paper explicitly flags that total mediation and the trusted path are assumed rather than implemented or derived. The derivation chain therefore remains self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- trust list
axioms (1)
- domain assumption: A trusted mediator providing total mediation and an unspoofable path from boundary event to human display exists and can be implemented.
invented entities (1)
- Consent Integrity property (no independent evidence)
Forward citations
Cited by 1 Pith paper
- One Goal, Many Commands: Characterizing Denylist Fragility in AI Agents
ShellSieve, an LLM-driven pipeline, detects command denylist fragility in terminal AI agents and finds 69.0-98.6% of 1,709 GitHub-collected denylists to be bypassable.