pith. machine review for the scientific record.

arxiv: 2604.11174 · v1 · submitted 2026-04-13 · 💻 cs.RO · cs.AI

Recognition: unknown

EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems


Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords embodied AI · robot governance · benchmark · safety evaluation · recovery · auditability · policy enforcement · upgrade safety

The pith

EmbodiedGovBench evaluates whether robot systems stay controllable and policy-compliant instead of measuring only task success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current embodied AI evaluations focus on task completion rates and accuracy but ignore whether systems respect boundaries, enforce policies, recover from problems, or accept human control. This paper introduces EmbodiedGovBench to close that gap with a structured test covering seven governance dimensions across single robots and fleets. A reader would care because future robot policies and foundation models could otherwise operate without reliable oversight or safe rollback in real settings. The benchmark supplies scenario templates, perturbation methods, metrics, and protocols that fit existing modular robot runtimes. It concludes that governance must join task success as a standard evaluation target.

Core claim

EmbodiedGovBench is a benchmark for governance-oriented evaluation of embodied agent systems that assesses whether they remain controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations rather than asking only whether tasks are completed. It organizes evaluation around seven dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. The benchmark supplies a structure for single-robot and fleet scenarios that includes templates, perturbation operators, governance metrics, and baseline protocols, and it describes how the

What carries the argument

EmbodiedGovBench, a benchmark structure spanning seven governance dimensions with scenario templates, perturbation operators, metrics, and protocols for testing controllability and safety in embodied agent systems.
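
One way to picture this structure is as a small data model. The sketch below is only an illustration: the seven dimension names and the two tracks come from the abstract, but `ScenarioInstance`, `instantiate`, and the example template and perturbation strings are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class GovernanceDimension(Enum):
    """The seven governance dimensions listed in the abstract."""
    UNAUTHORIZED_CAPABILITY_INVOCATION = "unauthorized capability invocation"
    RUNTIME_DRIFT_ROBUSTNESS = "runtime drift robustness"
    RECOVERY_SUCCESS = "recovery success"
    POLICY_PORTABILITY = "policy portability"
    VERSION_UPGRADE_SAFETY = "version upgrade safety"
    HUMAN_OVERRIDE_RESPONSIVENESS = "human override responsiveness"
    AUDIT_COMPLETENESS = "audit completeness"


class Track(Enum):
    """The two execution scopes the benchmark covers."""
    SINGLE_ROBOT = "single-robot"
    FLEET = "fleet"


@dataclass(frozen=True)
class ScenarioInstance:
    # Hypothetical schema: the paper's actual instance format is not shown here.
    template: str
    track: Track
    dimension: GovernanceDimension
    perturbations: Tuple[str, ...] = ()


def instantiate(template: str, track: Track, dimension: GovernanceDimension,
                perturbations: Iterable[str]) -> ScenarioInstance:
    """Combine a scenario template with perturbation operators into one instance."""
    return ScenarioInstance(template, track, dimension, tuple(perturbations))


# Example with made-up template and perturbation names.
example = instantiate("pick_and_place", Track.FLEET,
                      GovernanceDimension.RECOVERY_SUCCESS, ["sensor_dropout"])
```

Keeping the dimension on each instance means a score can be reported per dimension rather than collapsed into one number, which matches the diagnostic intent described above.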

If this is right

  • Task success rates alone will no longer be treated as sufficient evidence of system readiness.
  • Developers must demonstrate prevention of unauthorized capability use and robustness to runtime drift.
  • Safe recovery from errors and responsiveness to human overrides become required, measurable properties.
  • Version upgrades must pass safety verification before deployment in production systems.
  • Audit completeness becomes a standard requirement for accountability in both single and fleet deployments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robot developers may need to redesign core interfaces to expose governance hooks for easy testing.
  • The benchmark could highlight trade-offs between governance constraints and task flexibility in dynamic environments.
  • Adoption might encourage creation of governance-aware training data for embodied foundation models.

Load-bearing premise

The seven governance dimensions capture the essential requirements for safe and controllable embodied systems.

What would settle it

A demonstration that an embodied system achieves high task success and routinely fails on one or more of the seven governance dimensions, yet suffers no loss of control and no safety incidents, would challenge the benchmark's necessity.

Figures

Figures reproduced from arXiv: 2604.11174 by Cong Yang, John See, Simin Luan, Xue Qin, Zhijun Li.

Figure 1: From Task Success Benchmarks to Governance Benchmarks. Traditional embodied benchmarks (left) measure what robots can do. EmbodiedGovBench (right) measures whether their capabilities remain governable under execution, failure, and evolution.
Figure 2: EmbodiedGovBench Structure. The benchmark is organized in layers: seven governance dimensions define the evaluation target; two tracks (single-robot and fleet) provide execution scope; scenario templates and perturbation operators generate benchmark instances; and a multi-level scoring framework produces diagnostic outputs.
Figure 3: Benchmark Evaluation Pipeline. Each benchmark run proceeds through five stages, with trace collection spanning perturbation injection and system execution.
Figure 4: Benchmark Harness Architecture. The harness wraps around the system under test, injecting scenarios and perturbations through an adapter layer, collecting traces, judging governance compliance, and producing diagnostic reports.
7.3 Governance Judge. The Governance Judge compares the observed trace against governance-aware ground truth. For scenario instance S with ground truth G(S), the judge computes J(S…
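
The harness flow in the Figure 4 caption (inject perturbations through an adapter, collect a trace, judge it against ground truth, report) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's harness: `TraceEvent`, `GroundTruth`, `run_scenario`, `judge`, `toy_system`, and every event or capability string are hypothetical, and the judging rule here is a simple stand-in rather than the paper's judge function.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class TraceEvent:
    kind: str        # hypothetical event kinds, e.g. "invoke", "override_ack"
    capability: str  # capability the event refers to


@dataclass(frozen=True)
class GroundTruth:
    forbidden: FrozenSet[str]       # capabilities that must never be invoked
    required_kinds: FrozenSet[str]  # event kinds that must appear in the trace


def run_scenario(system: Callable[[List[str]], List[TraceEvent]],
                 perturbations: List[str]) -> List[TraceEvent]:
    """Adapter stand-in: hand perturbations to the system and collect its trace."""
    return list(system(perturbations))


def judge(trace: List[TraceEvent], truth: GroundTruth) -> dict:
    """Compare an observed trace against ground truth and return a small report."""
    violations = [e for e in trace
                  if e.kind == "invoke" and e.capability in truth.forbidden]
    seen = {e.kind for e in trace}
    missing = sorted(truth.required_kinds - seen)
    return {"violations": len(violations),
            "missing_events": missing,
            "compliant": not violations and not missing}


def toy_system(perturbations: List[str]) -> List[TraceEvent]:
    """Fake system under test that misbehaves when "policy_drift" is injected."""
    events = [TraceEvent("invoke", "move_base")]
    if "policy_drift" in perturbations:
        events.append(TraceEvent("invoke", "open_gripper_near_human"))
    events.append(TraceEvent("override_ack", "move_base"))
    return events


truth = GroundTruth(frozenset({"open_gripper_near_human"}),
                    frozenset({"override_ack"}))
clean_report = judge(run_scenario(toy_system, []), truth)
drift_report = judge(run_scenario(toy_system, ["policy_drift"]), truth)
```

Here the unperturbed run is judged compliant and the perturbed run is not, even though both runs could count as a completed task, which is the distinction the benchmark is meant to surface.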
Original abstract

Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable -- whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance-oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations. The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single-robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows. Our analysis suggests that embodied governance should become a first-class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EmbodiedGovBench as a benchmark for governance-oriented evaluation of embodied agent systems. It highlights limitations in existing task-success metrics and defines seven governance dimensions (unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness). The paper supplies scenario templates, perturbation operators, metric definitions, and high-level protocols for single-robot and fleet settings, along with guidance on instantiation over modular embodied runtimes with contract-aware interfaces, concluding that embodied governance should become a first-class evaluation target.

Significance. If the proposed structure proves instantiable and yields reproducible measurements, the benchmark could fill a meaningful gap by extending embodied AI evaluation beyond task completion to include controllability, policy adherence, recoverability, and auditability. This has potential to influence safer deployment of robot policies and foundation models. The conceptual organization of dimensions and protocols is a clear contribution, though its significance remains prospective without demonstrated application.

major comments (2)
  1. [Abstract] Abstract: The central assertions that EmbodiedGovBench 'provides the initial measurement framework' and 'can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows' suggest readiness without major additional engineering, and that readiness is unsupported. The manuscript supplies only high-level scenario templates, operators, and protocols but contains no concrete implementation, code, or evaluation results on any robot policy, runtime, or fleet, leaving the readiness and usability claims untested.
  2. [Benchmark structure] Benchmark structure description: The seven governance dimensions are presented as covering essential requirements, yet the manuscript provides no explicit mapping from these dimensions to documented failure modes in existing embodied systems or to gaps in prior metrics, making it difficult to assess whether the chosen dimensions are complete or minimal for the stated goals.
minor comments (2)
  1. [Abstract] Abstract: Consider adding a sentence clarifying that the current contribution is the benchmark definition and protocols rather than empirical results on specific systems.
  2. [Metric definitions] Metric definitions: Ensure each governance metric is accompanied by a precise formula or pseudocode to support future implementations and reproducibility.
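
To make the second minor comment concrete, the following sketch shows the level of precision such definitions might reach. The two formulas are illustrative assumptions only and are not the paper's metric definitions; the function names, and the convention of returning 1.0 when there is nothing to measure, are hypothetical choices.

```python
from typing import Iterable, Set


def unauthorized_block_rate(attempted: int, blocked: int) -> float:
    """Fraction of unauthorized invocation attempts that the system blocked.

    Illustrative formula: blocked / attempted, with 1.0 when nothing was attempted.
    """
    if attempted < 0 or not 0 <= blocked <= attempted:
        raise ValueError("require 0 <= blocked <= attempted")
    return 1.0 if attempted == 0 else blocked / attempted


def audit_completeness(expected_ids: Set[str], logged_ids: Iterable[str]) -> float:
    """Fraction of expected audit events that appear in the log.

    Illustrative formula: |expected ∩ logged| / |expected|; extra log entries
    are ignored, and an empty expectation set counts as fully complete.
    """
    if not expected_ids:
        return 1.0
    return len(expected_ids & set(logged_ids)) / len(expected_ids)
```

With these definitions, `unauthorized_block_rate(4, 3)` evaluates to 0.75 and `audit_completeness({"a", "b", "c", "d"}, ["a", "b", "x"])` to 0.5. Stating the edge cases explicitly (zero attempts, empty expectations) is the kind of detail that affects reproducibility across implementations.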

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript on EmbodiedGovBench. The comments identify key areas where the support for our claims and the rationale for the benchmark dimensions require strengthening. We address each major comment below and specify the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central assertions that EmbodiedGovBench 'provides the initial measurement framework' and 'can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows' suggest readiness without major additional engineering, and that readiness is unsupported. The manuscript supplies only high-level scenario templates, operators, and protocols but contains no concrete implementation, code, or evaluation results on any robot policy, runtime, or fleet, leaving the readiness and usability claims untested.

    Authors: We agree that the manuscript presents EmbodiedGovBench as a conceptual framework with high-level definitions, templates, and protocols, without a deployed implementation or empirical results on specific systems. The phrasing regarding ready instantiation reflects the modular structure we describe, which aligns with common interfaces in embodied runtimes, but we acknowledge this claim lacks direct demonstration. In revision, we will update the abstract and introduction to qualify these statements, clarifying that the benchmark supplies a blueprint and protocols whose instantiation will involve integration work by users. We will add an appendix with pseudocode for core operators, interface contracts, and step-by-step instantiation guidance for representative runtimes to make the framework more actionable while preserving its focus as a proposal rather than a completed artifact. revision: partial

  2. Referee: [Benchmark structure] Benchmark structure description: The seven governance dimensions are presented as covering essential requirements, yet the manuscript provides no explicit mapping from these dimensions to documented failure modes in existing embodied systems or to gaps in prior metrics, making it difficult to assess whether the chosen dimensions are complete or minimal for the stated goals.

    Authors: The referee is correct that an explicit linkage to prior failure modes would improve the justification. We will revise the manuscript by inserting a new subsection (and accompanying table) that maps each of the seven dimensions to concrete examples drawn from the embodied AI literature, such as unauthorized capability use in policy-constrained navigation systems, runtime drift in long-horizon manipulation, and audit gaps in fleet coordination. The table will also contrast these against limitations of existing task-success metrics. This addition will clarify why the dimensions are both necessary and appropriately scoped for governance-oriented evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark defined from external gaps

full rationale

The paper introduces EmbodiedGovBench by enumerating seven governance dimensions drawn from gaps in existing task-success metrics, then supplies scenario templates, perturbation operators, metric definitions, and high-level instantiation protocols. No equations, fitted parameters, or quantitative predictions appear. No self-citations are invoked to justify uniqueness or load-bearing premises. The central claim is a design proposal whose content is independent of any result derived from the benchmark itself; the absence of concrete runtime results is a limitation of demonstration, not a circular reduction of the definition to its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper rests on domain assumptions about the importance of governance and the sufficiency of the chosen dimensions, with the benchmark itself as the main new construct.

axioms (2)
  • domain assumption: Embodied AI systems require evaluation on governance, recovery, and safety beyond task success metrics.
    Stated as the motivation and gap in current evaluations.
  • ad hoc to paper: The seven listed dimensions (unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, audit completeness) cover the key governance aspects.
    Defined by the authors as the scope of the benchmark.
invented entities (1)
  • EmbodiedGovBench (no independent evidence)
    purpose: A structured benchmark and measurement framework for governance in embodied agents.
    Newly proposed construct that organizes the evaluation.

pith-pipeline@v0.9.0 · 5532 in / 1387 out tokens · 41747 ms · 2026-05-10T15:17:07.104362+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification

    cs.CY · 2026-04 · unverdicted · novelty 4.0

    DEMM defines four executable evidence-sufficiency categories plus a conflicting category for agentic AI decisions and rolls per-property verdicts into a five-level maturity rubric.

Reference graph

Works this paper leans on

96 extracted references · 25 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choroman- ski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023

  2. [2]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multi- modal language model.arXiv preprint arXiv:2303.03378, 2023

  3. [3]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration. Open X-embodiment: Robotic learning datasets and RT-X models.arXiv preprint arXiv:2310.08864, 2024

  4. [4]

    Component-based robotic engineering (part I): Reusable building blocks.IEEE Robotics & Automation Magazine, 16(4):84–96, 2009

    Davide Brugali and Patrizia Scandurra. Component-based robotic engineering (part I): Reusable building blocks.IEEE Robotics & Automation Magazine, 16(4):84–96, 2009

  5. [5]

    Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y

    Morgan Quigley, Ken Conley, Brian P. Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y . Ng. ROS: An open-source robot operating system.ICRA Workshop on Open Source Software, 3:5, 2009

  6. [6]

    Vision-and-language navigation: Interpreting visually- grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually- grounded navigation instructions in real environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3674–3683, 2018

  7. [7]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9339–9347, 2019

  8. [8]

    ALFRED: A benchmark for interpreting grounded instructions 26 for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions 26 for everyday tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10740–10749, 2020

  9. [9]

    Behavior-1k: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín- Martín, Chen Wang, Sergey Levine, Michael Lingelbach, Jiankai Sun, et al. BEHA VIOR-1K: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation. arXiv preprint arXiv:2403.09227, 2024

  10. [10]

    ProcTHOR: Large-scale embodied ai using procedural generation.Advances in Neural Information Processing Systems (NeurIPS), 35:5982–5994, 2022

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-scale embodied ai using procedural generation.Advances in Neural Information Processing Systems (NeurIPS), 35:5982–5994, 2022

  11. [11]

    Alan F. T. Winfield and Marina Jirotka. Ethical governance is essential to building trust in robotics and artificial intelligence systems.Philosophical Transactions of the Royal Society A, 376(2133):20180085, 2018

  12. [12]

    AI4People— an ethical framework for a good AI society.Minds and Machines, 28:689–707, 2018

    Luciano Floridi, Josh Cowls, Monica Beltrametti, Raja Chatila, Patrice Chazerand, Virginia Dignum, Christoph Luetge, Robert Madelin, Ugo Pagallo, Francesca Rossi, et al. AI4People— an ethical framework for a good AI society.Minds and Machines, 28:689–707, 2018

  13. [13]

    AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

    Xue Qin, Simin Luan, John See, Cong Yang, and Zhijun Li. AEROS: Agent execution runtime operating system for embodied robots.arXiv preprint arXiv:2604.07039, 2026

  14. [14]

    Learning Without Losing Identity: Capability Evolution for Embodied Agents

    Xue Qin, Simin Luan, John See, Cong Yang, and Zhijun Li. Learning without losing identity: Capability evolution for embodied agents.arXiv preprint arXiv:2604.07799, 2026

  15. [15]

    Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution

    Xue Qin, Simin Luan, John See, Cong Yang, and Zhijun Li. Harnessing embodied agents: Runtime governance for policy-constrained execution.arXiv preprint arXiv:2604.07833, 2026

  16. [16]

    Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

    Xue Qin, Simin Luan, John See, Cong Yang, and Zhijun Li. Governed capability evolution for embodied agents.arXiv preprint arXiv:2604.08059, 2026

  17. [17]

    ECM contracts: Contract-aware, versioned, and governable capability interfaces for embodied agents.Under review, 2026

    Xue Qin, Simin Luan, John See, Cong Yang, and Zhijun Li. ECM contracts: Contract-aware, versioned, and governable capability interfaces for embodied agents.Under review, 2026

  18. [18]

    Federated single-agent robotics: Multi-robot coordination without intra-robot multi-agent fragmentation.Under review, 2026

    Xue Qin, Simin Luan, John See, Cong Yang, and Zhijun Li. Federated single-agent robotics: Multi-robot coordination without intra-robot multi-agent fragmentation.Under review, 2026

  19. [19]

    Concrete Problems in AI Safety

    Damon Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety.arXiv preprint arXiv:1606.06565, 2016

  20. [20]

    A comprehensive survey on safe reinforcement learning

    Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(42):1437–1480, 2015

  21. [21]

    CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. InIEEE Robotics and Automation Letters, volume 7, pages 7327–7334, 2022

  22. [22]

    Domain randomization for transferring deep neural networks from simulation to the real world,

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Do- main randomization for transferring deep neural networks from simulation to the real world.arXiv preprint arXiv:1703.06907, 2017

  23. [23]

    Sim-to-real transfer in deep reinforce- ment learning for robotics: A survey.arXiv preprint arXiv:2009.13303, 2020

    Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-real transfer in deep reinforce- ment learning for robotics: A survey.arXiv preprint arXiv:2009.13303, 2020. 27

  24. [24]

    RT-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabber, Chelsea Finn, et al. RT-1: Robotics transformer for real-world control at scale. InRobotics: Science and Systems (RSS), 2023

  25. [25]

    Kevin Black, Noah Brown, Danny Driess, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  26. [26]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gober, Karol Gopalakrishnan, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  27. [27]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. InIEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500, 2023

  28. [28]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. InConference on Robot Learning (CoRL), 2022

  29. [29]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cooney, Sergey Levine, and Russ Tedrake. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Sys- tems (RSS), 2023

  30. [30]

    Chang, M., Chhablani, G., Clegg, A., Cote, M

    Matthew Chang, Gunjan Chhablani, Alexander Clegg, et al. PARTNR: A benchmark for planning and reasoning in embodied multi-agent tasks.arXiv preprint arXiv:2411.00081, 2024

  31. [31]

    OpenEQA: Embodied question answering in the era of foundation models

    Arjun Majumdar, Xiaohan Khanna, et al. OpenEQA: Embodied question answering in the era of foundation models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  32. [32]

    GOAT-Bench: A benchmark for multi- modal lifelong navigation

    Xiaohan Khanna, Arjun Majumdar, Rishav Chadha, et al. GOAT-Bench: A benchmark for multi- modal lifelong navigation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  33. [33]

    A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

    Lei Wang, Chen Ma, Xueyang Feng, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

  34. [34]

    A survey on evaluation of embodied AI.arXiv preprint arXiv:2601.00000, 2026

    Xiaoyu Hou, Siqi Zhang, et al. A survey on evaluation of embodied AI.arXiv preprint arXiv:2601.00000, 2026

  35. [35]

    V oyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research (TMLR), 2023

  36. [36]

    ProgPrompt: Generating situated robot task plans using large language models

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Di- eter Fox, Jesse Thomason, and Animesh Garg. ProgPrompt: Generating situated robot task plans using large language models. InIEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2023

  37. [37]

    Embodied agent interface: Benchmarking LLMs for embodied decision making

    Manling Li, Shiyu Zhao, Qineng Wang, et al. Embodied agent interface: Benchmarking LLMs for embodied decision making. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  38. [38]

    Integrated task and motion planning.Annual Review of Control, Robotics, and Autonomous Systems, 3:1–30, 2020

    Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. Integrated task and motion planning.Annual Review of Control, Robotics, and Autonomous Systems, 3:1–30, 2020. 28

  39. [39]

    Andrew Bagnell, and Jan Peters

    Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. International Journal of Robotics Research (IJRR), 32(11):1238–1274, 2013

  40. [40]

    MuJoCo: A physics engine for model-based control

    Erez Todorov and Yuval Tassa. MuJoCo: A physics engine for model-based control. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033, 2012

  41. [41]

    EmbodiedBench: Comprehensive benchmarking multi-modal large language mod- els for vision-driven embodied agents

    Rui Yang et al. EmbodiedBench: Comprehensive benchmarking multi-modal large language mod- els for vision-driven embodied agents. InInternational Conference on Machine Learning (ICML), 2025

  42. [42]

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019– 3026, 2020

  43. [43]

    SafeAgentBench: A benchmark for safe task planning of embodied LLM agents

    Sheng Yin, Xianghe Pang, and Wenhao Ding. SafeAgentBench: A benchmark for safe task plan- ning of embodied LLM agents.arXiv preprint arXiv:2412.13178, 2024

  44. [44]

    AGENTSAFE: Benchmarking the safety of em- bodied agents on hazardous instructions.arXiv preprint arXiv:2502.02885, 2025

    Yingzhuo Liu, Jiaqi Ying, Zhilong Wang, et al. AGENTSAFE: Benchmarking the safety of em- bodied agents on hazardous instructions.arXiv preprint arXiv:2502.02885, 2025

  45. [45]

    IS-Bench: Evaluating interactive safety of VLM-driven embodied agents

    Pengzhen Lu et al. IS-Bench: Evaluating interactive safety of VLM-driven embodied agents. Proceedings of AAAI, 2025

  46. [46]

    Badrobot: Jailbreaking embodied llms in the physical world.arXiv preprint arXiv:2407.20242,

    Hangtao Zhang, Chenyu Zhu, Xianlong Wang, et al. BadRobot: Jailbreaking LLM-based embod- ied AI in the physical world.arXiv preprint arXiv:2407.20242, 2024

  47. [47]

    Agent-SafetyBench: Evaluating the Safety of LLM Agents

    Zhexin Zhang, Shiyao Cui, Yida Lu, et al. Agent-safetybench: Evaluating the safety of LLM agents.arXiv preprint arXiv:2412.14470, 2024

  48. [48]

    arXiv preprint arXiv:2503.10009 , year=

    Yike Huang, Fengyi Ding, Yu Tang, et al. A framework for benchmarking and aligning task- planning safety in LLM-based embodied agents.arXiv preprint arXiv:2503.10009, 2025

  49. [49]

    Counts: Benchmarking llm numerical reasoning with verifiable rewards

    Jianing Wu, Xiaofeng Chen, et al. EARBench: Evaluating physical risk awareness for foundation model embodied AI.arXiv preprint arXiv:2501.00000, 2025

  50. [50]

    Alone, Samuel Glockhoff, et al

    Wasif Afzal, Saif U. Alone, Samuel Glockhoff, et al. A study on challenges of testing robotic sys- tems. InIEEE International Conference on Software Testing, Verification and Validation (ICST), pages 314–325, 2020

  51. [51]

    Maddison, and Tatsunori Hashimoto

    Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. Identifying the risks of LM agents with an LM-emulated sandbox. InInternational Conference on Learning Representations (ICLR), 2024

  52. [52]

    R-Judge: Benchmarking safety risk aware- ness for LLM agents

    Tongxin Yuan, Zhiwei Zheng, Yilong Dong, et al. R-Judge: Benchmarking safety risk aware- ness for LLM agents. InFindings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

  53. [53]

    AgentHarm: A benchmark for measuring harmfulness of LLM agents

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. AgentHarm: A benchmark for measuring harmfulness of LLM agents. InInternational Conference on Learning Representa- tions (ICLR), 2024

  54. [54]

    Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environ- ment for building autonomous agents. InInternational Conference on Learning Representations (ICLR), 2024. 29

  55. [55]

    DecodingTrust: A comprehensive assessment of trustworthiness in GPT models.Advances in Neural Information Processing Systems (NeurIPS), 2023

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chejian Zhang, Chaowei Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models.Advances in Neural Information Processing Systems (NeurIPS), 2023

  56. [56]

    TEACh: Task-driven embodied agents that chat

    Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, et al. TEACh: Task-driven embodied agents that chat. In AAAI Conference on Artificial Intelligence, 2022

  57. [57]

    Safety of embodied navigation: A survey

    Yixuan Wang, Xiaohan Hu, and Yadong Mu. Safety of embodied navigation: A survey. In International Joint Conference on Artificial Intelligence (IJCAI), 2025

  58. [58]

    Embodied AI: Emerging risks and opportunities for policy action

    Davide Perlo, Alexander Robey, Fazl Barez, Luciano Floridi, and Jakob Mokander. Embodied AI: Emerging risks and opportunities for policy action. In NeurIPS Workshop on Safe and Trustworthy AI, 2025

  59. [59]

    Safe learning in robotics: From learning-based control to safe reinforcement learning

    Lukas Brunke, Melissa Greeff, Adam W. Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P. Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5:411–444, 2022

  60. [60]

    State-wise safe reinforcement learning: A survey

    Chen Zhao, Peimin He, Xinna Chen, et al. State-wise safe reinforcement learning: A survey. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 5860–5868, 2023

  61. [61]

    Constrained Markov Decision Processes

    Eitan Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999

  62. [62]

    Constrained policy optimization

    Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning (ICML), pages 22–31, 2017

  63. [63]

    Safe model-based reinforcement learning with stability guarantees

    Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  64. [64]

    Benchmarking safe exploration in deep reinforcement learning

    Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. OpenAI Technical Report, 2019

  65. [65]

    Safety-gymnasium: A unified safe reinforcement learning benchmark

    Jiaming Ji, Borong Zhang, Jiayi Pan, Juntao Zhou, Jia Dai, and Yaodong Yang. Safety-gymnasium: A unified safe reinforcement learning benchmark. Advances in Neural Information Processing Systems (NeurIPS), 2023

  66. [66]

    Safe-control-gym: A unified benchmark for safe learning-based control and reinforcement learning

    Zhaocong Yuan, Hongchao He, Yuxuan Zeng, et al. Safe-control-gym: A unified benchmark for safe learning-based control and reinforcement learning. IEEE Robotics and Automation Letters (RA-L), 7(4):10760–10767, 2022

  67. [67]

    Safe multi-agent reinforcement learning for multi-robot control

    Shangding Gu, Antonio Riccardi, Long Yang, et al. Safe multi-agent reinforcement learning for multi-robot control. Artificial Intelligence, 325:104016, 2023

  68. [68]

    Soter: A runtime assurance framework for programming safe robotics systems

    Ankush Desai, Shromona Ghosh, Sanjit A. Seshia, Natarajan Shankar, and Ashish Tiwari. Soter: A runtime assurance framework for programming safe robotics systems. arXiv preprint arXiv:1808.07921, 2019

  69. [69]

    An extensible approach to high-assurance runtime monitoring of autonomous systems

    Gowtham Srinivasan and Stanley Bak. An extensible approach to high-assurance runtime monitoring of autonomous systems. IEEE Transactions on Automation Science and Engineering, 2020

  70. [70]

    Runtime assurance for safety-critical systems: An introduction to safety filtering

    Kerianne L. Hobbs, Mark L. Mote, Matthew C. Abate, et al. Runtime assurance for safety-critical systems: An introduction to safety filtering. IEEE Control Systems Magazine, 43(2):28–65, 2023

  71. [71]

    ROSRV: Runtime verification for robots

    Jeff Huang, Cansu Erdogan, Yi Zhang, Brandon Moore, Qingzhou Luo, Aravind Sundaresan, and Grigore Roșu. ROSRV: Runtime verification for robots. In International Conference on Runtime Verification, pages 247–254. Springer, 2014

  72. [72]

    Safe reinforcement learning via shielding

    Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In AAAI Conference on Artificial Intelligence, pages 2669–2678, 2018

  73. [73]

    Control barrier function based quadratic programs for safety critical systems

    Aaron D. Ames, Xiangru Xu, Jessy B. Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control (IEEE TAC), 62(8):3861–3876, 2017

  74. [74]

    Control barrier functions: Theory and applications

    Aaron D. Ames, Samuel Coogan, Magnus Egerstedt, Giuseppe Notarstefano, Eduardo D. Sontag, and Paulo Tabuada. Control barrier functions: Theory and applications. In European Control Conference (ECC), pages 3539–3563, 2019

  75. [75]

    Formal specification and verification of autonomous robotic systems: A survey

    Matt Luckcuck, Marie Farrell, Louise A. Dennis, and Michael Fisher. Formal specification and verification of autonomous robotic systems: A survey. ACM Computing Surveys, 52(5):1–41, 2019

  76. [76]

    Toward verified artificial intelligence

    Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry. Toward verified artificial intelligence. Communications of the ACM, 65(7):46–55, 2022

  77. [77]

    Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems

    David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, et al. Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624, 2024

  78. [78]

    NeMo Guardrails: A toolkit for controllable and safe LLM applications

    Traian Rebedea, Rares Diaconescu, et al. NeMo Guardrails: A toolkit for controllable and safe LLM applications. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) Demo, 2023

  79. [79]

    The global landscape of AI ethics guidelines

    Anna Jobin, Marcello Ienca, and Effy Vayena. The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9):389–399, 2019

  80. [80]

    Machine behaviour

    Iyad Rahwan, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W. Crandall, Nicholas A. Christakis, Iain D. Couzin, Matthew O. Jackson, et al. Machine behaviour. Nature, 568(7753):477–486, 2019

Showing first 80 references.