
arxiv: 2605.18672 · v1 · pith:TM7SOSHDnew · submitted 2026-05-18 · 💻 cs.AI

Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

Pith reviewed 2026-05-20 10:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents, safety architecture, assume-guarantee contracts, probabilistic verification, runtime assurance, multi-layer safety, agent deployment

The pith

Safe LLM agent deployment requires a three-layer probabilistic assume-guarantee architecture because no single layer can certify all necessary safety dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This position paper argues that enforcing safety for LLM agents inside one abstraction layer is categorically insufficient, not because current tools are weak but because of how agent execution unfolds. Safe operation depends on three distinct dimensions: semantic intent and policy compliance, environmental validity, and dynamical feasibility. Each dimension draws on a separate body of information that becomes available only at successive stages of execution, so no single guardrail can verify them all at once. The paper therefore calls for a contract-based architecture in which three independently certified layers hand probabilistic guarantees forward as assumptions to the next layer. Overall system safety then follows from the chain rule of probability.

Core claim

The paper claims that a single-layer approach to LLM agent safety is structurally inadequate because the three dimensions of safe operation each require strictly distinct information sets revealed at different execution stages. It proposes a three-layer probabilistic assume-guarantee architecture in which each layer independently certifies one dimension and supplies a probabilistic guarantee that serves as the assumption for the following layer, thereby admitting compositional system-level safety bounds derived via the chain rule of probability.
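
As an illustrative sketch of that composition (the referee report below notes that the paper itself gives no explicit formulas, so every symbol here is an assumption introduced for exposition): let $G_1$, $G_2$, $G_3$ be the events that the semantic, environmental, and dynamical layers meet their guarantees, and suppose layer $i$ is certified to fail with probability at most $\varepsilon_i$ given that all earlier guarantees hold.

```latex
\begin{align}
P(G_1 \wedge G_2 \wedge G_3)
  &= P(G_1)\, P(G_2 \mid G_1)\, P(G_3 \mid G_1 \wedge G_2) \\
  &\ge (1-\varepsilon_1)(1-\varepsilon_2)(1-\varepsilon_3) \\
  &\ge 1 - (\varepsilon_1 + \varepsilon_2 + \varepsilon_3).
\end{align}
```

The first line is the chain rule, the second substitutes the per-layer bounds, and the third uses $\prod_i (1-\varepsilon_i) \ge 1 - \sum_i \varepsilon_i$ for $\varepsilon_i \in [0,1]$. With hypothetical values $\varepsilon = (0.01, 0.02, 0.005)$, the product bound is about $0.9653$ and the additive bound is $0.965$.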

What carries the argument

A three-layer probabilistic assume-guarantee contract architecture in which each layer enforces one safety dimension with its own certified probabilistic guarantee that becomes the assumption for the next layer.
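
One way to picture the hand-off of guarantees as assumptions is the minimal Python sketch below. It is not the paper's implementation: the `LayerContract` record, the string-matched properties, and the numeric `epsilon` values are all hypothetical, and the first layer's assumption is taken to hold with certainty.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LayerContract:
    """Hypothetical record for one layer's probabilistic assume-guarantee contract."""
    name: str
    assumption: str   # property the layer relies on at entry
    guarantee: str    # property the layer certifies at exit
    epsilon: float    # assumed bound on P(guarantee fails | assumption holds)


def check_chain(layers: Sequence[LayerContract]) -> None:
    """Verify that each layer's assumption is exactly the previous layer's guarantee."""
    for prev, nxt in zip(layers, layers[1:]):
        if nxt.assumption != prev.guarantee:
            raise ValueError(
                f"{nxt.name} assumes {nxt.assumption!r}, "
                f"but {prev.name} guarantees {prev.guarantee!r}"
            )


def system_lower_bound(layers: Sequence[LayerContract]) -> float:
    """Chain-rule lower bound on the probability that every guarantee holds."""
    check_chain(layers)
    bound = 1.0
    for layer in layers:
        if not 0.0 <= layer.epsilon <= 1.0:
            raise ValueError(f"epsilon out of range for {layer.name}")
        bound *= 1.0 - layer.epsilon
    return bound


# Illustrative values only; none of these numbers come from the paper.
layers = [
    LayerContract("semantic", "request received",
                  "plan is policy-compliant", 0.01),
    LayerContract("environmental", "plan is policy-compliant",
                  "plan is valid in the observed environment", 0.02),
    LayerContract("dynamical", "plan is valid in the observed environment",
                  "execution is dynamically feasible", 0.005),
]
print(system_lower_bound(layers))
```

Matching assumptions to guarantees by string equality is a deliberate simplification; a real contract check would need to show that the earlier guarantee logically implies the later assumption.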

If this is right

  • Overall system safety bounds can be derived compositionally from the individual layer guarantees using the chain rule of probability.
  • Each safety dimension can be certified and maintained independently without requiring simultaneous access to all execution-stage information.
  • The architecture permits separate verification of each layer before integration.
  • System-level guarantees remain well-defined even when individual layer bounds are estimated from finite traces.
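
To illustrate what estimating a layer bound from finite traces could look like, the sketch below computes a one-sided Hoeffding upper confidence bound on a layer's failure probability. It assumes independent, identically distributed traces, which is exactly the assumption the abstract flags as an open problem for real deployments, so treat it as a baseline rather than the paper's method. The function name and parameters are hypothetical.

```python
import math


def hoeffding_upper_bound(failures: int, trials: int, delta: float) -> float:
    """Upper bound on the true failure probability, valid with probability
    at least 1 - delta under the (strong) assumption of i.i.d. traces."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must lie between 0 and trials")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    empirical = failures / trials
    # One-sided Hoeffding slack: sqrt(ln(1/delta) / (2 n)).
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * trials))
    return min(1.0, empirical + slack)


# Illustrative numbers only: no failures observed in 1000 traces.
eps_hat = hoeffding_upper_bound(failures=0, trials=1000, delta=0.05)
print(eps_hat)
```

The resulting value can stand in for a layer's certified failure bound in the composition, at the cost of an extra confidence parameter `delta` per layer.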

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged-information argument could apply to other sequential decision systems that reveal environmental and dynamical details only after an initial plan is formed.
  • Empirical tests in controlled simulators could measure how quickly layer bounds degrade when real-world traces deviate from training distributions.
  • Multi-agent extensions would need additional cross-agent assumption contracts to preserve the compositional bounds.

Load-bearing premise

The three safety dimensions each depend on strictly distinct information that becomes available only at different stages of execution.

What would settle it

A concrete demonstration that one guardrail, using only the information available at a single execution stage, can certify semantic compliance, environmental validity, and dynamical feasibility together would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.18672 by A. Nouri, C. Wu, D. Nickovic, J. Kroger, M. Franzle, R. Roy, S. Bensalem, X. Huang, Y. Dong.

Figure 1: Three-layer probabilistic assume–guarantee architecture with running example (right ...
Original abstract

This position paper argues that enforcing LLM agent safety within a single abstraction layer is not merely suboptimal but categorically insufficient for deployed LLM agents -- a structural consequence of how agent execution works, not a contingent limitation of current systems. The three dimensions that jointly constitute safe operation -- semantic intent and policy compliance, environmental validity, and dynamical feasibility -- each depend on a strictly distinct set of information that becomes available at different stages of execution. No single guardrail can certify all three. We argue that the community must respond with a contract-based architecture in which each safety dimension is enforced by an independently certified layer whose probabilistic guarantee satisfies the next layer's assumption. We sketch such an architecture and derive the compositional system-level safety bounds it admits via the chain rule of probability. Three open problems stand between this and a deployable standard: bound estimation from non-i.i.d. traces, graceful degradation of contracts under deployment drift, and extension to multi-agent settings -- the most important unfinished business in LLM agent runtime assurance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper is a position paper claiming that safe LLM agent deployment requires a three-layer probabilistic assume-guarantee architecture. It argues that the three safety dimensions—semantic intent and policy compliance, environmental validity, and dynamical feasibility—each rely on distinct information available at different execution stages, making single-layer safety enforcement categorically insufficient. The authors sketch the architecture, derive compositional safety bounds using the chain rule of probability, and identify three open problems: bound estimation from non-i.i.d. traces, graceful degradation under deployment drift, and extension to multi-agent settings.

Significance. If the structural separation of information holds, this position could significantly influence the design of safety mechanisms for autonomous LLM agents by promoting layered, contract-based systems over monolithic ones. It provides a framework for compositional probabilistic guarantees, which is a strength in moving towards rigorous runtime assurance. The open problems outlined are concrete and could guide subsequent research in the field.

major comments (2)
  1. The claim that 'no single guardrail can certify all three' is presented as a structural consequence, but the manuscript provides only an intuitive argument based on information timing rather than a formal derivation or impossibility result excluding predictive unification within a single layer.
  2. The compositional system-level safety bounds derived via the chain rule are sketched without detailed equations, assumptions (e.g., conditional independence), or explicit probability expressions; a more rigorous presentation with specific formulas would be necessary to support the quantitative aspects of the proposal.
minor comments (2)
  1. Additional references to prior work on assume-guarantee contracts in formal methods and AI safety would strengthen the positioning of the proposed architecture.
  2. Provide more detail on how bound estimation from non-i.i.d. traces would be approached in the three-layer setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential influence of the proposed architecture. We respond to each major comment below and note the planned revisions.

Point-by-point responses
  1. Referee: The claim that 'no single guardrail can certify all three' is presented as a structural consequence, but the manuscript provides only an intuitive argument based on information timing rather than a formal derivation or impossibility result excluding predictive unification within a single layer.

    Authors: We agree that the argument rests on the distinct timing and availability of information for semantic compliance, environmental validity, and dynamical feasibility rather than a formal impossibility theorem. As a position paper, we articulate why these information sets are partitioned by execution stage and why any single layer would require predictions of data not yet available. We will revise the manuscript to define the information partitions more explicitly and to clarify why predictive unification within one layer cannot generally preserve the required guarantees without additional assumptions that do not hold in open deployments. A complete mathematical impossibility result lies beyond the scope of this position piece.
    Revision: partial

  2. Referee: The compositional system-level safety bounds derived via the chain rule are sketched without detailed equations, assumptions (e.g., conditional independence), or explicit probability expressions; a more rigorous presentation with specific formulas would be necessary to support the quantitative aspects of the proposal.

    Authors: We accept the observation. The current sketch will be replaced by an expanded section that states the chain-rule decomposition explicitly, lists the conditional-independence assumptions between layers, and supplies the precise probability expressions for the system-level safety bound.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; argument derives from stage-wise information analysis without reduction to inputs or self-citations.

Full rationale

The paper's core position—that a single abstraction layer is categorically insufficient—follows from an explicit analysis of three safety dimensions (semantic/policy, environmental validity, dynamical feasibility) each depending on distinct information available only at successive execution stages. This separation is asserted as a structural property of agent execution rather than obtained by fitting, self-definition, or self-citation. Compositional bounds are obtained via the ordinary chain rule of probability, an external mathematical fact independent of the paper's claims. No load-bearing step reduces by construction to the paper's own inputs or prior self-referential results; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The position relies on domain assumptions about distinct information availability across execution stages and the applicability of probabilistic composition; no free parameters or new physical entities are introduced.

axioms (1)
  • [domain assumption] The three safety dimensions depend on strictly distinct information sets available at different execution stages
    Invoked to establish that no single layer can certify all three
invented entities (1)
  • Three-layer probabilistic assume-guarantee architecture (no independent evidence)
    purpose: To separately enforce each safety dimension with independent probabilistic guarantees
    Proposed as the required response to the structural insufficiency claim

pith-pipeline@v0.9.0 · 5736 in / 1095 out tokens · 29538 ms · 2026-05-20T10:17:22.580540+00:00


