Alignment Contracts for Agentic Security Systems
Pith reviewed 2026-05-09 20:43 UTC · model grok-4.3
The pith
Alignment contracts turn observable effect traces into enforceable safety properties for agentic security systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors define alignment contracts over observable effect traces, give them finite-trace semantics, characterize satisfaction as a safety property, develop refinement and one-way composition rules, and prove that admissibility checking is decidable. Under the Effect Observability Assumption that all ΣEff-effects are mediated, they establish a soundness theorem that quantifies over the agent model and yields enforcement guarantees, including for monitor-realized traces, together with an assumption-lifted adaptation result and limits expressed via undecidability transfer and observability-boundary theorems.
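The conditional shape of that soundness claim can be rendered as a Lean-style sketch. All identifiers below (`Agent`, `produces`, `Mediated`, `Satisfies`) are illustrative placeholders, not the paper's artifact: once every produced effect is assumed mediated, the guarantee discharges by applying the enforcement lemma to the observability hypothesis.

```lean
-- Illustrative statement shape only: identifiers are hypothetical,
-- not taken from the paper's Lean 4 artifact.
theorem soundness_shape {Effect : Type} (Agent : Type)
    (produces : Agent → List Effect)
    (Mediated : Effect → Prop)
    (Satisfies : List Effect → Prop)
    -- Effect Observability Assumption: every produced effect is mediated
    (observability : ∀ a : Agent, ∀ e ∈ produces a, Mediated e)
    -- enforcement lemma: a fully mediated trace satisfies the contract
    (enforce : ∀ a : Agent,
      (∀ e ∈ produces a, Mediated e) → Satisfies (produces a)) :
    ∀ a : Agent, Satisfies (produces a) :=
  fun a => enforce a (observability a)
```

The point of the sketch is that the quantification over the agent model is genuine, but the conclusion never detaches from the observability hypothesis.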
What carries the argument
Alignment contracts, which specify scope, allowed and forbidden effects, resource budgets, and disclosure policies over finite observable effect traces.
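As a rough executable sketch of that machinery (the `Effect` and `Contract` shapes, field names, and the `check` function are assumptions for illustration, not the paper's formal syntax): a contract fixes scope, forbidden effect kinds, and per-kind budgets, and satisfaction is checked over a finite trace. Because violation is detected at a step, any violation comes with a finite prefix witness, which is the safety-property characterization.

```python
from dataclasses import dataclass

# Hypothetical minimal model of an alignment contract over finite
# observable effect traces; names are illustrative, not the paper's.

@dataclass(frozen=True)
class Effect:
    kind: str      # e.g. "http_get", "report", "exfiltrate"
    target: str    # e.g. a hostname

@dataclass
class Contract:
    scope: set       # targets inside the authorized engagement
    forbidden: set   # effect kinds never allowed
    budget: dict     # per-kind resource budgets

def check(contract, trace):
    """Return None if the finite trace satisfies the contract,
    else the shortest violating prefix (a finite violation witness)."""
    used = {}
    for i, e in enumerate(trace):
        used[e.kind] = used.get(e.kind, 0) + 1
        out_of_scope = e.target not in contract.scope
        forbidden = e.kind in contract.forbidden
        over_budget = used[e.kind] > contract.budget.get(e.kind, float("inf"))
        if out_of_scope or forbidden or over_budget:
            return trace[: i + 1]  # safety: a finite prefix witnesses failure
    return None

c = Contract(scope={"app.example.com"},
             forbidden={"exfiltrate"},
             budget={"http_get": 2})
trace = [Effect("http_get", "app.example.com"),
         Effect("http_get", "other.example.org")]
witness = check(c, trace)   # out-of-scope target at step 2
```

Checking each step against a finite structure like this is also what makes admissibility checking plausibly decidable: there is no unbounded search, only membership and counting.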
If this is right
- Refinement and one-way composition rules support modular contract engineering.
- Admissibility checking is decidable for any contract.
- Soundness holds for all mediated ΣEff-effects when the observability assumption is met.
- The same contract structure extends from web-focused workflows to other effect profiles.
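The enforcement-soundness reading of these points can be sketched as a minimal truncation-style monitor (the scope, forbidden set, and budget below are invented for illustration): every mediated effect passes through a single choke point that refuses any step whose addition would violate the contract, so the monitor-realized trace satisfies the contract by construction.

```python
# Hypothetical truncation-style monitor. Under the Effect Observability
# Assumption, every effect reaches emit(), so the realized trace is
# contract-satisfying by construction. All names are illustrative.
ALLOWED_SCOPE = {"app.example.com"}
FORBIDDEN_KINDS = {"exfiltrate"}
BUDGET = {"http_get": 2}

class Monitor:
    def __init__(self):
        self.trace = []   # the monitor-realized trace
        self.used = {}
    def emit(self, kind, target):
        n = self.used.get(kind, 0) + 1
        if (target in ALLOWED_SCOPE
                and kind not in FORBIDDEN_KINDS
                and n <= BUDGET.get(kind, float("inf"))):
            self.used[kind] = n
            self.trace.append((kind, target))
            return True
        return False   # effect suppressed at the boundary

m = Monitor()
m.emit("http_get", "app.example.com")     # admitted
m.emit("exfiltrate", "app.example.com")   # refused: forbidden kind
m.emit("http_get", "app.example.com")     # admitted (budget: 2)
m.emit("http_get", "app.example.com")     # refused: budget exhausted
```

The design choice worth noting is that soundness here is vacuous for any effect that never reaches `emit`, which is exactly why the observability assumption is load-bearing.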
Where Pith is reading between the lines
- The contracts could be layered on top of existing LLM guardrails to add formal trace-level guarantees.
- The undecidability transfer result marks a hard limit once any effect escapes mediation.
- Real-world implementations would need to verify that the mediation layer truly captures every side-effect the agent can produce.
Load-bearing premise
Every relevant effect must be mediated so that the enforcement mechanism can observe it.
What would settle it
A concrete agent execution that produces a non-mediated effect violating the contract yet remains undetected by the monitor would falsify the soundness theorem.
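The falsifier can be made concrete with a toy two-channel model (the channel functions and effect names are invented for illustration): a forbidden effect routed around the mediation layer never enters the monitor-realized trace, so the monitor reports a clean trace even though the actual effect trace violates the contract.

```python
# Toy model of the observability gap. The mediated channel enforces
# the contract; the hypothetical side channel bypasses it entirely.
FORBIDDEN = {"exfiltrate"}

monitor_trace = []   # what the enforcement mechanism observes
actual_trace = []    # every effect the agent really produces

def mediated(effect):
    if effect in FORBIDDEN:
        return False                 # blocked at the mediation layer
    monitor_trace.append(effect)
    actual_trace.append(effect)
    return True

def side_channel(effect):
    actual_trace.append(effect)      # invisible to the monitor

mediated("http_get")
side_channel("exfiltrate")           # escapes mediation entirely

# The monitor sees no violation, yet the actual trace contains one:
# exactly the execution that would break the soundness premise.
```

Strictly speaking, such an execution does not contradict the theorem, whose premise fails here; it contradicts any deployment that claims the premise holds.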
Original abstract
Agentic security systems increasingly combine LLM planners with tools that can discover, validate, and report vulnerabilities. This creates an asymmetric control problem: the system should retain strong offensive capability inside an authorized engagement, while the same capabilities must be denied outside scope. Existing guardrails provide useful policy controls, but they do not make this boundary a first-class formal contract over observable effects. We introduce alignment contracts, a framework for specifying and enforcing behavioral constraints over observable effect traces. A contract defines scope, allowed and forbidden effects, resource budgets, and disclosure policies. We give the language finite-trace semantics, characterize satisfaction as a safety property with finite violation witnesses, develop refinement and one-way composition rules for modular contract engineering, and show that admissibility checking is decidable. We instantiate the framework for web-focused agentic security workflows and show how the same structure extends to other effect profiles. Under an explicit Effect Observability Assumption, where all ΣEff-effects are mediated, the soundness theorem quantifies over the agent model and gives guarantees for mediated ΣEff-effects, including enforcement soundness for monitor-realized traces. We also state an assumption-lifted adaptation result and formalize limits through undecidability transfer and observability-boundary theorems. A Lean 4 artifact checks the formal core theorems used by the paper.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces alignment contracts as a framework for specifying and enforcing behavioral constraints over observable effect traces in agentic security systems that combine LLM planners with tools. It defines a language with finite-trace semantics, characterizes satisfaction as a safety property with finite violation witnesses, develops refinement and one-way composition rules, shows that admissibility checking is decidable, and states a soundness theorem (under the explicit Effect Observability Assumption that all ΣEff-effects are mediated) that quantifies over agent models and gives guarantees including enforcement soundness for monitor-realized traces. The core theorems are mechanically checked in a Lean 4 artifact, with an assumption-lifted adaptation result and limits via undecidability transfer and observability-boundary theorems.
Significance. If the results hold, the work provides a formal, machine-checked foundation for making behavioral boundaries first-class contracts in LLM-based agentic security workflows. The explicit finite-trace semantics, decidability result, and Lean artifact are clear strengths that enable modular contract engineering and verifiable enforcement. The paper's transparent treatment of the Effect Observability Assumption and the adaptation result further strengthen the contribution by delineating the precise scope of the guarantees.
major comments (2)
- [Soundness theorem (abstract and §4)] The soundness theorem (as stated in the abstract and developed in the formal core) is load-bearing on the Effect Observability Assumption that all ΣEff-effects are mediated and observable by the enforcement mechanism. The manuscript does not provide a concrete argument or coverage analysis showing that this assumption holds for typical LLM tool invocations (e.g., direct side-channel outputs or unlogged calls), which directly limits the quantification over agent models and the enforcement soundness claim for monitor-realized traces.
- [Adaptation result and observability-boundary theorems] The adaptation result for lifting the assumption is stated but lacks a detailed proof sketch or Lean-checked statement showing how the guarantees transfer when the observability boundary is relaxed; this is central to the claim that the framework extends beyond the strict mediation case.
minor comments (2)
- [Refinement and composition rules] The abstract refers to 'one-way composition rules' without clarifying in the main text whether these rules are strictly one-directional or admit limited bidirectional use under the refinement relation.
- [Instantiation section] Notation for ΣEff and the effect profiles in the web-focused instantiation should be cross-referenced to the general semantics definition to avoid ambiguity when extending to other domains.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review, and for recognizing the strengths of the finite-trace semantics, decidability result, and Lean 4 artifact. We respond to each major comment below, proposing targeted revisions where they strengthen the connection between the formal results and practical agentic security workflows.
Point-by-point responses
Referee: [Soundness theorem (abstract and §4)] The soundness theorem (as stated in the abstract and developed in the formal core) is load-bearing on the Effect Observability Assumption that all ΣEff-effects are mediated and observable by the enforcement mechanism. The manuscript does not provide a concrete argument or coverage analysis showing that this assumption holds for typical LLM tool invocations (e.g., direct side-channel outputs or unlogged calls), which directly limits the quantification over agent models and the enforcement soundness claim for monitor-realized traces.
Authors: We agree that the soundness theorem is conditional on the Effect Observability Assumption, which is stated explicitly throughout the manuscript (including in the abstract and §4). The theorem quantifies over agent models under this assumption, and the observability-boundary theorems already delineate the precise limits when mediation does not hold. However, we acknowledge that the manuscript would benefit from a more concrete discussion of applicability to common LLM tool invocations. In the revised version we will add a short subsection in §4 that (i) describes typical mediation patterns for web-focused agentic security tools (e.g., API-wrapped outputs that are logged by the monitor), (ii) notes that direct side-channel or unlogged calls fall outside the assumption, and (iii) cross-references the boundary theorems. This addition clarifies scope without altering the formal statements. revision: yes
Referee: [Adaptation result and observability-boundary theorems] The adaptation result for lifting the assumption is stated but lacks a detailed proof sketch or Lean-checked statement showing how the guarantees transfer when the observability boundary is relaxed; this is central to the claim that the framework extends beyond the strict mediation case.
Authors: The adaptation result is stated in the manuscript as an assumption-lifted theorem, and the core theorems (including the boundary theorems) are mechanically checked in the Lean 4 artifact. We recognize that a high-level proof sketch in the main text would improve readability and make the transfer of guarantees more transparent. In the revision we will (i) expand the presentation of the adaptation result with a concise proof sketch in §4 and (ii) verify that the Lean artifact contains the adaptation and boundary statements; if any additional formalization is required we will extend the artifact accordingly. The existing boundary theorems already formalize the limits of the relaxed case. revision: partial
Circularity Check
No circularity; formal semantics and Lean-verified theorems are self-contained.
Full rationale
The paper defines alignment contracts via explicit finite-trace semantics, safety properties, refinement rules, and decidable admissibility. The soundness theorem is stated under an explicit Effect Observability Assumption and is mechanically checked in Lean 4, with an assumption-lifted adaptation result. No step reduces a claimed prediction or theorem to a fitted parameter, self-definition, or unverified self-citation chain; the derivation chain consists of standard formal definitions and verified implications that do not presuppose their own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Effect Observability Assumption (domain assumption): all ΣEff-effects are mediated
invented entities (1)
- Alignment contract: no independent evidence