pith. machine review for the scientific record.

arxiv: 2604.08059 · v4 · submitted 2026-04-09 · 💻 cs.RO · cs.AI

Recognition: no theorem link

Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

Pith reviewed 2026-05-11 00:42 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords governed capability evolution · AI components · embodied agents · compatibility checks · rollback · staged deployment · runtime governance · software lifecycle

The pith

Governed upgrades keep AI agent success at 67% with zero unsafe cases

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a lifecycle governance method for updating versioned AI capability modules in systems like embodied agents, where each new version must be validated before activation to avoid risks. Existing deployment techniques handle stateless services but fall short for stateful, policy-bound AI runtimes that operate under constraints. The work introduces four compatibility checks and arranges them into a pipeline of sandbox evaluation, shadow deployment, gated activation, monitoring, and rollback. Experiments on a simulation testbed across multiple upgrade rounds show the governed process achieves comparable task performance while eliminating unsafe activations that arise in direct replacements. This addresses the need for safe evolution in AI-driven agents that must adapt over time without introducing failures or violations.

Core claim

The paper formulates governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and proposes a staged upgrade framework. Every new capability version receives four compatibility checks (interface, policy, behavioral, and recovery), organized into candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. On a PyBullet/ROS 2 testbed over 6 upgrade rounds with 15 random seeds, governed upgrades retain 67.4% task success with zero unsafe activations, while naive upgrades reach 72.9% success but drive unsafe activations to 60% by the final round. Shadow deployment detects 40% of regressions missed by sandbox evaluation alone.

What carries the argument

The staged upgrade framework that applies four compatibility checks (interface, policy, behavioral, recovery) through a sequence of candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback to each new AI capability version.
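
The stage ordering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` record, the evidence keys, the 0.8 stage thresholds, and the `THETA_ACT` value are all hypothetical stand-ins chosen only to show how a candidate is stopped at the first failing stage and never reaches the active system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    # Pre-activation stages, in the order the pipeline visits them.
    CANDIDATE_VALIDATION = "candidate_validation"
    SANDBOX_EVALUATION = "sandbox_evaluation"
    SHADOW_DEPLOYMENT = "shadow_deployment"
    GATED_ACTIVATION = "gated_activation"


@dataclass
class Candidate:
    name: str
    version: str
    # Hypothetical per-stage evidence, e.g. scores produced by the checks.
    evidence: Dict[str, float] = field(default_factory=dict)


Gate = Callable[[Candidate], bool]


def run_pipeline(candidate: Candidate, gates: Dict[Stage, Gate]) -> List[str]:
    """Walk the stages in order and stop at the first failing gate."""
    log: List[str] = []
    for stage in Stage:  # Enum iteration follows definition order
        passed = gates[stage](candidate)
        log.append(f"{stage.value}:{'pass' if passed else 'reject'}")
        if not passed:
            return log  # the candidate never becomes the active version
    log.append("activated")
    return log


THETA_ACT = 0.9  # assumed activation threshold; the real value is not given

gates: Dict[Stage, Gate] = {
    Stage.CANDIDATE_VALIDATION: lambda c: (
        c.evidence.get("interface_ok", 0.0) == 1.0
        and c.evidence.get("policy_ok", 0.0) == 1.0
    ),
    Stage.SANDBOX_EVALUATION: lambda c: c.evidence.get("sandbox_score", 0.0) >= 0.8,
    Stage.SHADOW_DEPLOYMENT: lambda c: c.evidence.get("shadow_score", 0.0) >= 0.8,
    Stage.GATED_ACTIVATION: lambda c: min(
        c.evidence.get("sandbox_score", 0.0), c.evidence.get("shadow_score", 0.0)
    ) >= THETA_ACT,
}

good = Candidate("grasp", "1.1", {"interface_ok": 1.0, "policy_ok": 1.0,
                                  "sandbox_score": 0.95, "shadow_score": 0.93})
regressed = Candidate("grasp", "1.2", {"interface_ok": 1.0, "policy_ok": 1.0,
                                       "sandbox_score": 0.95, "shadow_score": 0.50})

good_log = run_pipeline(good, gates)
regressed_log = run_pipeline(regressed, gates)
```

In this toy setup the second candidate looks fine in the sandbox and is rejected only at the shadow stage, which is the kind of regression the paper reports shadow deployment surfacing. Online monitoring and rollback act after activation and are therefore outside this pre-activation loop.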

Load-bearing premise

The four compatibility checks are sufficient to detect all unsafe evolutions and the PyBullet/ROS 2 testbed with random seeds adequately represents real-world embodied agent upgrade scenarios.

What would settle it

Observing unsafe activations in physical robot deployments after the checks pass, or a statistically significant drop in task success under governed upgrades, would falsify the framework's effectiveness.

Figures

Figures reproduced from arXiv: 2604.08059 by Cong Yang, John See, Simin Luan, Xue Qin, Zhijun Li.

Figure 1: Governance profile comparison across six deployment metrics. All axes are oriented so that outer = better. Governed Upgrade (blue) achieves near-complete coverage across safety and recoverability dimensions while retaining competitive task success. Naïve Upgrade (red) collapses on screening (BADR), false-accept control (1 − FAR), and rollback (RSR). Static (gray, dashed) is trivially safe but forgoes all ca…
Figure 2: Overview of governed capability evolution. Top: Naïve upgrade directly replaces the active capability version without governance. Bottom: The governed upgrade pipeline treats each new version as a candidate that must pass compatibility validation (𝜅𝐼, 𝜅𝑃), sandbox and shadow evaluation (𝜅𝐵, 𝜅𝑅), and gated activation (𝜃act) before entering the active system. Online monitoring continues after activation;…
Figure 3: Lifecycle coverage of prior work relative to the six stages of governed capability evolution. Each row is a research community; each horizontal band is a lifecycle stage, progressing left-to-right from pre-deployment (Package, Validate, Sandbox, Shadow) to post-activation (Activate, Monitor). The Validate stage is decomposed into four compatibility sub-checks (𝜅𝐼 interface, 𝜅𝑃 policy, 𝜅𝐵 behavioral, 𝜅𝑅 rec…
Figure 4: Performance and deployment safety over upgrade rounds. (a) Task success rate across repeated capability-upgrade rounds for Static, Naïve Upgrade, and Governed Upgrade. All three strategies achieve comparable task success (65–73%), demonstrating that governance does not sacrifice nominal performance. Naïve Upgrade shows slightly higher variance because faulty candidates occasionally improve or degrade succe…
Figure 5: Shadow deployment reveals upgrade regressions not exposed by isolated evaluation. Each bar shows the mean number of detections per seed (5 seeds total). Dark bars indicate regressions visible in sandbox evaluation; light bars indicate regressions discovered only during shadow deployment. Retry instability is entirely invisible to sandbox evaluation but is reliably surfaced by shadow deployment under live t…
Original abstract

Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a new version, the hosting system must decide whetmeher the new version may be activated safely, under what deployment conditions, with what monitoring, and when it should be rolled back. Existing software-deployment patterns (canary, blue-green, feature flags, MLOps pipelines) address parts of this loop but were designed for stateless web services rather than stateful, policy-constrained runtimes that drive AI components in the field. We study this problem in the setting of embodied agents, where capabilities are packaged as installable modules under runtime policy and recovery constraints. We formulate governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and propose a staged upgrade framework that treats every new capability version as a governed deployment candidate rather than an immediate replacement. The framework introduces four compatibility checks (interface, policy, behavioral, recovery) and organizes them into a staged pipeline of candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. A reference prototype on a PyBullet/ROS 2 testbed evaluated over 6 upgrade rounds with 15 random seeds shows naive upgrade reaches 72.9% task success but drives unsafe activation to 60% by the final round, while governed upgrade retains comparable success (67.4%) with zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment surfaces 40% of regressions invisible to sandbox alone, and rollback succeeds in 79.8% of post-activation drift scenarios. The work extends runtime governance from action execution to capability evolution.
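
The online monitoring and rollback stages named in the abstract can be pictured with a small sketch. Everything here is an assumption made for illustration: the sliding-window rule, the `max_drop` margin, the version strings, and the idea that rollback is a simple version swap. The paper reports that rollback succeeds in 79.8% of drift scenarios, so real rollback in a stateful runtime is evidently harder than this toy suggests.

```python
from collections import deque
from typing import Optional


class DriftMonitor:
    """Sliding-window success monitor that requests rollback on drift."""

    def __init__(self, baseline_success: float, window: int = 20,
                 max_drop: float = 0.15) -> None:
        self.baseline = baseline_success
        self.max_drop = max_drop          # assumed tolerance, not from the paper
        self.outcomes = deque(maxlen=window)

    def observe(self, success: bool, policy_violation: bool = False) -> bool:
        """Record one episode; return True if rollback should be triggered."""
        if policy_violation:
            return True                   # any violation triggers rollback at once
        self.outcomes.append(1.0 if success else 0.0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False                  # window not yet full
        rate = sum(self.outcomes) / len(self.outcomes)
        return (self.baseline - rate) > self.max_drop


class Runtime:
    """Keeps the last known-good version so rollback has a target."""

    def __init__(self, active: str) -> None:
        self.active = active
        self.previous: Optional[str] = None

    def activate(self, version: str) -> None:
        self.previous, self.active = self.active, version

    def rollback(self) -> None:
        if self.previous is not None:
            self.active, self.previous = self.previous, None


runtime = Runtime(active="grasp-1.0")
runtime.activate("grasp-1.1")
monitor = DriftMonitor(baseline_success=0.7, window=10, max_drop=0.15)

# Synthetic episode outcomes: 4 successes in 10, well below the 0.7 baseline.
episodes = [True, False, False, True, False, False, True, False, False, True]
for ok in episodes:
    if monitor.observe(success=ok):
        runtime.rollback()
        break
```

Waiting for a full window before judging trades detection latency for fewer false alarms; a policy violation bypasses the window because a single violation is already unsafe.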

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a staged upgrade framework with four compatibility checks (interface, policy, behavioral, recovery) enables governed capability evolution for AI-component-based systems. Using embodied agents as a case study, it organizes the checks into a pipeline of candidate validation, sandbox evaluation, shadow deployment, gated activation, monitoring, and rollback. On a PyBullet/ROS 2 testbed over 6 upgrade rounds with 15 seeds, governed upgrades retain 67.4% task success with 0% unsafe activations (vs. 72.9% success but 60% unsafe activations by the final round for naive upgrades; Wilcoxon p=0.003). Shadow deployment surfaces 40% of regressions invisible to sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift cases.

Significance. If the central result holds, the work provides a concrete lifecycle governance approach for versioned AI capabilities in policy-constrained, stateful systems, extending beyond stateless deployment patterns like canary releases. The empirical separation in unsafe rates, use of shadow deployment, and rollback metrics are strengths; the framework treats upgrades as first-class governed events rather than direct replacements.

major comments (3)
  1. [Evaluation section (abstract and §5)] The 0% unsafe activation rate for governed upgrades is defined and detected using the same four checks that the framework asserts will prevent unsafe evolutions. No independent oracle or ground-truth labeling of unsafe states (separate from the checks) is described, so the result demonstrates internal consistency within the testbed but does not independently confirm that all unsafe evolutions are caught.
  2. [Framework (§3) and experimental setup] The sufficiency of the four checks to detect unsafe evolutions is load-bearing for the claim of zero unsafe activations, yet the PyBullet/ROS 2 simulation omits real-world factors (sensor noise, unmodeled contact dynamics, hardware drift) that could produce policy-violating states passing the checks. The paper provides no discussion or additional validation of this assumption.
  3. [Abstract and §4] Implementation details for the four checks (how interface, policy, behavioral, and recovery are realized and validated in the testbed) are not provided, leaving the central empirical claim dependent on unshown mechanisms.
minor comments (2)
  1. [Abstract] Abstract contains a typo: 'whetmeher' should read 'whether'.
  2. [Throughout] Ensure consistent use of terms like 'unsafe activation' across sections and figures; clarify how task success is measured independently of the governance pipeline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the scope and limitations of our evaluation. We have revised the manuscript to address each point by adding implementation details, an independent definition of unsafe states, and an expanded limitations discussion. Our responses to the major comments follow.

Point-by-point responses
  1. Referee: The 0% unsafe activation rate for governed upgrades is defined and detected using the same four checks that the framework asserts will prevent unsafe evolutions. No independent oracle or ground-truth labeling of unsafe states (separate from the checks) is described, so the result demonstrates internal consistency within the testbed but does not independently confirm that all unsafe evolutions are caught.

    Authors: We agree this is a valid concern and that the primary evaluation metric is tied to the checks themselves. In the revised §5 we now provide an independent definition of unsafe states based on post-activation runtime monitoring: any state in which the embodied agent violates the declared policy (e.g., obstacle collision) or exhibits a statistically significant drop in task success rate relative to the baseline, measured continuously and separately from the pre-activation pipeline. We added a new table comparing these independent post-activation indicators against the check outcomes for both governed and naive upgrades, showing alignment. We acknowledge that a fully external oracle (human labeling or physical testbed) is outside the current simulation study and have noted this limitation explicitly.
    revision: partial

  2. Referee: The sufficiency of the four checks to detect unsafe evolutions is load-bearing for the claim of zero unsafe activations, yet the PyBullet/ROS 2 simulation omits real-world factors (sensor noise, unmodeled contact dynamics, hardware drift) that could produce policy-violating states passing the checks. The paper provides no discussion or additional validation of this assumption.

    Authors: We accept the point that the simulation environment is idealized. The revised manuscript adds a new 'Limitations and Assumptions' subsection in §5 that explicitly discusses sensor noise, unmodeled dynamics, and hardware drift, explains why the current checks use conservative thresholds to provide margin, and states that the framework's claims are scoped to controlled simulation settings. We also outline planned physical-robot validation as future work. The core empirical comparison (governed vs. naive) remains valid within the reported testbed, but we no longer imply broader real-world sufficiency without further evidence.
    revision: yes

  3. Referee: Implementation details for the four checks (how interface, policy, behavioral, and recovery are realized and validated in the testbed) are not provided, leaving the central empirical claim dependent on unshown mechanisms.

    Authors: We apologize for the missing details. In the revised §4 we have added concrete implementation descriptions and pseudocode for each check as realized in the PyBullet/ROS 2 testbed: the interface check performs schema and API signature matching; the policy check invokes a runtime policy verifier against declared invariants; the behavioral check runs sandboxed trajectory comparison against reference behaviors with a distance threshold; and the recovery check validates rollback trigger conditions and state restoration. We also include a short validation subsection reporting per-check pass rates on the 15 seeds.
    revision: yes
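
To make the four checks named in the last response concrete, here is a minimal sketch of what each could look like. It is an illustration under stated assumptions, not the authors' pseudocode: the function names, the 2-D trajectory representation, the mean pointwise distance metric, the corridor invariant, and the `upgrade`/`restore` callables are all hypothetical.

```python
import inspect
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float]
State = Dict[str, float]


def interface_check(old: Callable, new: Callable) -> bool:
    """Interface: the new entry point keeps the same parameter names and order."""
    old_params = list(inspect.signature(old).parameters)
    new_params = list(inspect.signature(new).parameters)
    return old_params == new_params


def policy_check(trajectory: Sequence[Point],
                 invariants: List[Callable[[Point], bool]]) -> bool:
    """Policy: every visited state satisfies every declared invariant."""
    return all(inv(p) for p in trajectory for inv in invariants)


def behavioral_check(candidate: Sequence[Point], reference: Sequence[Point],
                     max_dist: float) -> bool:
    """Behavioral: mean pointwise distance to the reference stays under a threshold."""
    if not reference or len(candidate) != len(reference):
        return False
    total = sum(math.dist(a, b) for a, b in zip(candidate, reference))
    return total / len(reference) <= max_dist


def recovery_check(state: State, upgrade: Callable[[State], State],
                   restore: Callable[[State, State], State]) -> bool:
    """Recovery: after upgrading, restoring from a snapshot yields the old state."""
    snapshot = dict(state)
    upgraded = upgrade(dict(state))
    return restore(upgraded, snapshot) == state


# Hypothetical capability entry points used only to exercise the interface check.
def grasp_v1(target, speed):
    return None


def grasp_v2(target, speed):
    return None


def grasp_v3(target, force):  # renamed parameter breaks the interface
    return None


reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidate = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
stay_in_corridor = lambda p: abs(p[1]) <= 0.5  # assumed policy invariant

results = {
    "interface": interface_check(grasp_v1, grasp_v2),
    "interface_broken": interface_check(grasp_v1, grasp_v3),
    "policy": policy_check(candidate, [stay_in_corridor]),
    "behavioral": behavioral_check(candidate, reference, max_dist=0.2),
    "recovery": recovery_check(
        {"gain": 1.0},
        upgrade=lambda s: {**s, "gain": s["gain"] * 2},
        restore=lambda upgraded, snap: dict(snap),
    ),
}
```

Each check returns a plain boolean here so the results can feed a stage gate directly; a real verifier would more plausibly return structured evidence (which invariant failed, by how much) to support the per-check pass rates the authors mention.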

Circularity Check

0 steps flagged

No circularity in empirical framework evaluation

The paper proposes a staged upgrade framework with four compatibility checks and reports direct experimental outcomes (task success rates of 67.4% vs 72.9%, zero unsafe activations) from a PyBullet/ROS 2 simulation testbed across 6 upgrade rounds and 15 seeds. These metrics are measured observations in the environment and do not reduce to any fitted parameters, self-definitions, or predictions that loop back to the framework inputs by construction. No mathematical derivations, uniqueness theorems, or self-citation chains are load-bearing for the central claims; the evaluation stands as an independent empirical demonstration within the stated testbed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that embodied agents operate under explicit runtime policies and recovery constraints that can be checked at upgrade time; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Embodied agents can be adequately modeled and tested in PyBullet/ROS 2 for upgrade safety evaluation
    The evaluation uses this simulation environment to measure unsafe activations and task success.

pith-pipeline@v0.9.0 · 5624 in / 1257 out tokens · 23957 ms · 2026-05-11T00:42:08.629361+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.

  2. Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation

    cs.RO 2026-04 unverdicted novelty 5.0

    Multi-robot coordination is achieved by federating single-agent robot runtimes at the fleet level instead of fragmenting each robot into multiple internal agents.

  3. ECM Contracts: Contract-Aware, Versioned, and Governable Capability Interfaces for Embodied Agents

    cs.SE 2026-04 unverdicted novelty 5.0

    ECM Contracts define a six-dimensional contract model for embodied capability modules that enables static checks for safe composition, installation, and versioned upgrades in robotics systems.
