Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment
Pith reviewed 2026-05-20 10:17 UTC · model grok-4.3
The pith
Safe LLM agent deployment requires a three-layer probabilistic assume-guarantee architecture because no single layer can certify all necessary safety dimensions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a single-layer approach to LLM agent safety is structurally inadequate because the three dimensions of safe operation each require strictly distinct information sets revealed at different execution stages. It proposes a three-layer probabilistic assume-guarantee architecture in which each layer independently certifies one dimension and supplies a probabilistic guarantee that serves as the assumption for the following layer, thereby admitting compositional system-level safety bounds derived via the chain rule of probability.
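The chain-rule composition described above can be written out as a short derivation. This is a sketch under stated assumptions: the labels $U$, $O$, $F$ follow the subscripts quoted from the paper further down this page, but their mapping to semantic compliance, environmental validity, and dynamical feasibility (in that order) is an assumption here, and the failure budgets $\varepsilon_U, \varepsilon_O, \varepsilon_F$ are illustrative notation rather than the paper's.

```latex
% Assumed per-layer contracts (illustrative notation):
%   \Pr(U) \ge 1-\varepsilon_U, \quad
%   \Pr(O \mid U) \ge 1-\varepsilon_O, \quad
%   \Pr(F \mid U \cap O) \ge 1-\varepsilon_F
\begin{align}
\Pr(\text{safe}) &= \Pr(U \cap O \cap F) \\
                 &= \Pr(U)\,\Pr(O \mid U)\,\Pr(F \mid U \cap O) \\
                 &\ge (1-\varepsilon_U)(1-\varepsilon_O)(1-\varepsilon_F).
\end{align}
```

The second line is the chain rule itself and needs no independence assumption. The inequality holds only if each layer's guarantee is stated conditionally on the success of the layers before it.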
What carries the argument
A three-layer probabilistic assume-guarantee contract architecture in which each layer enforces one safety dimension with its own certified probabilistic guarantee that becomes the assumption for the next layer.
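One way to picture the contract chain is as a list of records, each naming what it assumes, what it certifies, and with what probability. The sketch below is illustrative only: `LayerContract`, `compose`, the event labels, and the numeric bounds are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from math import prod
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class LayerContract:
    """Hypothetical contract: if `assumes` holds, `guarantees` holds with prob >= `bound`."""
    name: str
    assumes: FrozenSet[str]  # events taken as given from earlier layers
    guarantees: str          # event this layer certifies
    bound: float             # lower bound on Pr(guarantees | assumes)


def compose(chain: Sequence[LayerContract]) -> float:
    """Check that each layer assumes exactly what earlier layers guarantee,
    then return the product of the conditional lower bounds."""
    established = set()
    for layer in chain:
        if not 0.0 <= layer.bound <= 1.0:
            raise ValueError(f"{layer.name}: bound must lie in [0, 1]")
        if set(layer.assumes) != established:
            raise ValueError(
                f"{layer.name}: assumes {sorted(layer.assumes)}, "
                f"but earlier layers guarantee {sorted(established)}"
            )
        established.add(layer.guarantees)
    return prod(layer.bound for layer in chain)


# Illustrative numbers only; they are not results from the paper.
chain = [
    LayerContract("semantic", frozenset(), "U", 0.99),
    LayerContract("environmental", frozenset({"U"}), "O", 0.98),
    LayerContract("dynamical", frozenset({"U", "O"}), "F", 0.97),
]
system_bound = compose(chain)
```

The well-formedness check is the design point: the product is only meaningful when every layer's assumption is discharged by the guarantees upstream of it, so a mis-ordered chain is rejected rather than silently multiplied.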
If this is right
- Overall system safety bounds can be derived compositionally from the individual layer guarantees using the chain rule of probability.
- Each safety dimension can be certified and maintained independently without requiring simultaneous access to all execution-stage information.
- The architecture permits separate verification of each layer before integration.
- System-level guarantees remain well-defined even when individual layer bounds are estimated from finite traces.
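As a rough illustration of the last point, per-layer lower bounds can be estimated from finite traces and then multiplied. The sketch below uses a one-sided Hoeffding bound and splits the confidence budget across layers with a union bound. It assumes i.i.d. traces, which is exactly the assumption the paper lists as an open problem to relax, so it should be read as a baseline and not as the paper's estimator. The function names and the trace counts are hypothetical.

```python
import math


def hoeffding_lower_bound(successes: int, trials: int, delta: float) -> float:
    """One-sided Hoeffding lower confidence bound on a success probability.

    Holds with probability >= 1 - delta, ASSUMING i.i.d. Bernoulli outcomes.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    p_hat = successes / trials
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * trials))
    return max(0.0, p_hat - slack)


def system_lower_bound(layer_counts, delta_total: float):
    """Combine per-layer bounds into a system-level lower bound.

    layer_counts: list of (successes, trials) pairs, one per layer, where the
    trials of a later layer are the traces on which all earlier layers passed
    (so each estimate targets a conditional probability).
    The confidence budget is split evenly, so all per-layer bounds hold
    jointly with probability >= 1 - delta_total (union bound).
    """
    delta_each = delta_total / len(layer_counts)
    bounds = [hoeffding_lower_bound(s, n, delta_each) for s, n in layer_counts]
    return math.prod(bounds), bounds


# Hypothetical trace counts, for illustration only.
total, per_layer = system_lower_bound(
    [(9900, 10000), (9702, 9900), (9411, 9702)], delta_total=0.05
)
```

The composed bound is always looser than the product of the empirical pass rates, and the gap shrinks as the number of traces per layer grows.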
Where Pith is reading between the lines
- The same staged-information argument could apply to other sequential decision systems that reveal environmental and dynamical details only after an initial plan is formed.
- Empirical tests in controlled simulators could measure how quickly layer bounds degrade when real-world traces deviate from training distributions.
- Multi-agent extensions would need additional cross-agent assumption contracts to preserve the compositional bounds.
Load-bearing premise
The three safety dimensions each depend on strictly distinct information that becomes available only at different stages of execution.
What would settle it
A concrete demonstration that one guardrail, using only the information available at a single execution stage, can certify semantic compliance, environmental validity, and dynamical feasibility together would falsify the central claim.
Original abstract
This position paper argues that enforcing LLM agent safety within a single abstraction layer is not merely suboptimal but categorically insufficient for deployed LLM agents -- a structural consequence of how agent execution works, not a contingent limitation of current systems. The three dimensions that jointly constitute safe operation -- semantic intent and policy compliance, environmental validity, and dynamical feasibility -- each depend on a strictly distinct set of information that becomes available at different stages of execution. No single guardrail can certify all three. We argue that the community must respond with a contract-based architecture in which each safety dimension is enforced by an independently certified layer whose probabilistic guarantee satisfies the next layer's assumption. We sketch such an architecture and derive the compositional system-level safety bounds it admits via the chain rule of probability. Three open problems stand between this and a deployable standard: bound estimation from non-i.i.d. traces, graceful degradation of contracts under deployment drift, and extension to multi-agent settings -- the most important unfinished business in LLM agent runtime assurance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a position paper claiming that safe LLM agent deployment requires a three-layer probabilistic assume-guarantee architecture. It argues that the three safety dimensions—semantic intent and policy compliance, environmental validity, and dynamical feasibility—each rely on distinct information available at different execution stages, making single-layer safety enforcement categorically insufficient. The authors sketch the architecture, derive compositional safety bounds using the chain rule of probability, and identify three open problems: bound estimation from non-i.i.d. traces, graceful degradation under deployment drift, and extension to multi-agent settings.
Significance. If the structural separation of information holds, this position could significantly influence the design of safety mechanisms for autonomous LLM agents by promoting layered, contract-based systems over monolithic ones. It provides a framework for compositional probabilistic guarantees, which is a strength in moving towards rigorous runtime assurance. The open problems outlined are concrete and could guide subsequent research in the field.
major comments (2)
- The claim that 'no single guardrail can certify all three' is presented as a structural consequence, but the manuscript provides only an intuitive argument based on information timing rather than a formal derivation or impossibility result excluding predictive unification within a single layer.
- The compositional system-level safety bounds derived via the chain rule are sketched without detailed equations, assumptions (e.g., conditional independence), or explicit probability expressions; a more rigorous presentation with specific formulas would be necessary to support the quantitative aspects of the proposal.
minor comments (2)
- Additional references to prior work on assume-guarantee contracts in formal methods and AI safety would strengthen the positioning of the proposed architecture.
- Provide more detail on how bound estimation from non-i.i.d. traces would be approached in the three-layer setup.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the potential influence of the proposed architecture. We respond to each major comment below and note the planned revisions.
- Referee: The claim that 'no single guardrail can certify all three' is presented as a structural consequence, but the manuscript provides only an intuitive argument based on information timing rather than a formal derivation or impossibility result excluding predictive unification within a single layer.
  Authors: We agree that the argument rests on the distinct timing and availability of information for semantic compliance, environmental validity, and dynamical feasibility rather than a formal impossibility theorem. As a position paper, we articulate why these information sets are partitioned by execution stage and why any single layer would require predictions of data not yet available. We will revise the manuscript to define the information partitions more explicitly and to clarify why predictive unification within one layer cannot generally preserve the required guarantees without additional assumptions that do not hold in open deployments. A complete mathematical impossibility result lies beyond the scope of this position piece.
  Revision: partial
- Referee: The compositional system-level safety bounds derived via the chain rule are sketched without detailed equations, assumptions (e.g., conditional independence), or explicit probability expressions; a more rigorous presentation with specific formulas would be necessary to support the quantitative aspects of the proposal.
  Authors: We accept the observation. The current sketch will be replaced by an expanded section that states the chain-rule decomposition explicitly, lists the conditional-independence assumptions between layers, and supplies the precise probability expressions for the system-level safety bound.
  Revision: yes
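The referee's request to state the assumptions explicitly matters numerically. If each layer's number is a conditional probability given the success of the earlier layers, the chain rule gives the joint probability as a product. If the numbers are only marginal probabilities with unknown dependence, the strongest general statement is the Fréchet lower bound. The sketch below contrasts the two. The function names and the example values are hypothetical and are not taken from the paper.

```python
from math import prod


def frechet_lower_bound(marginals):
    """Lower bound on Pr(all events hold) when only each event's marginal
    probability is known and nothing is assumed about dependence."""
    return max(0.0, sum(marginals) - (len(marginals) - 1))


def chain_rule_value(conditionals):
    """Joint probability when each entry is Pr(event_k | all earlier events)."""
    return prod(conditionals)


# Illustrative values only.
p = [0.99, 0.98, 0.97]
marginal_only = frechet_lower_bound(p)   # approximately 0.94
conditional = chain_rule_value(p)        # approximately 0.9411
```

With three layers near 0.98 the two figures are close, but the Fréchet bound falls to zero once the per-layer shortfalls sum to one, whereas the product does not. Which reading applies depends on how each layer's guarantee is certified.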
Circularity Check
No significant circularity; argument derives from stage-wise information analysis without reduction to inputs or self-citations.
The paper's core position—that a single abstraction layer is categorically insufficient—follows from an explicit analysis of three safety dimensions (semantic/policy, environmental validity, dynamical feasibility) each depending on distinct information available only at successive execution stages. This separation is asserted as a structural property of agent execution rather than obtained by fitting, self-definition, or self-citation. Compositional bounds are obtained via the ordinary chain rule of probability, an external mathematical fact independent of the paper's claims. No load-bearing step reduces by construction to the paper's own inputs or prior self-referential results; the derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three safety dimensions depend on strictly distinct information sets available at different execution stages.
invented entities (1)
- Three-layer probabilistic assume-guarantee architecture: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: The three dimensions ... each depend on a strictly distinct set of information that becomes available at different stages of execution. ... $I_U \nsubseteq I_O \nsubseteq I_F$ ... $\tau_U < \tau_O < \tau_F$
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: $\Pr(\text{safe}) = p_U \cdot p_{O \mid U} \cdot p_{F \mid OU}$ (B4) ... chain rule of probability
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: D = 3 ... Alexander duality ... SphereAdmitsCircleLinking
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.