EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
EmbodiedGovBench evaluates whether robot systems stay controllable and policy-compliant instead of measuring only task success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EmbodiedGovBench is a benchmark for governance-oriented evaluation of embodied agent systems that assesses whether they remain controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations rather than asking only whether tasks are completed. It organizes evaluation around seven dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. The benchmark supplies a structure for single-robot and fleet scenarios that includes templates, perturbation operators, governance metrics, and baseline protocols, and it describes how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows.
What carries the argument
EmbodiedGovBench, a benchmark structure spanning seven governance dimensions with scenario templates, perturbation operators, metrics, and protocols for testing controllability and safety in embodied agent systems.
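The paper defines these pieces abstractly; a minimal sketch of how they might compose is below. All names (`Scenario`, `drop_override`, `override_responsiveness`) are illustrative assumptions, not definitions taken from the manuscript.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Scenario:
    """Scenario template: a task plus the events observed during a run."""
    task: str
    events: List[dict] = field(default_factory=list)

def drop_override(scenario: Scenario) -> Scenario:
    """Perturbation operator (hypothetical): simulate lost human-override acknowledgements."""
    events = [e for e in scenario.events if e.get("type") != "override_ack"]
    return Scenario(scenario.task, events)

def override_responsiveness(scenario: Scenario) -> float:
    """Governance metric (hypothetical): fraction of override requests acknowledged."""
    requests = [e for e in scenario.events if e.get("type") == "override_request"]
    acks = [e for e in scenario.events if e.get("type") == "override_ack"]
    return len(acks) / len(requests) if requests else 1.0

def evaluate(scenario: Scenario,
             perturbations: List[Callable[[Scenario], Scenario]],
             metric: Callable[[Scenario], float]) -> float:
    """Baseline protocol: apply perturbation operators, then score the perturbed run."""
    for p in perturbations:
        scenario = p(scenario)
    return metric(scenario)
```

The same skeleton would cover the other dimensions by swapping in different operators (e.g., injecting drift) and metrics (e.g., recovery success rate).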
If this is right
- Task success rates alone will no longer be treated as sufficient evidence of system readiness.
- Developers must demonstrate prevention of unauthorized capability use and robustness to runtime drift.
- Safe recovery from errors and responsiveness to human overrides become required, measurable properties.
- Version upgrades must pass safety verification before deployment in production systems.
- Audit completeness becomes a standard requirement for accountability in both single and fleet deployments.
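The upgrade-safety consequence above could, under the paper's contract-aware upgrade workflows, take the form of a pre-deployment gate. A minimal sketch, assuming hypothetical contract fields (`permissions`, `capabilities`) not specified in the source:

```python
def upgrade_is_safe(old_contract: dict, new_contract: dict) -> bool:
    """Hypothetical upgrade gate: a new capability version must neither widen
    its permission set nor drop capabilities that dependents rely on."""
    # Reject any permission not already in the approved set.
    if not set(new_contract["permissions"]) <= set(old_contract["permissions"]):
        return False
    # Every previously offered capability must still be offered.
    return set(old_contract["capabilities"]) <= set(new_contract["capabilities"])
```

In this framing, "version upgrade safety" becomes a measurable pass/fail property rather than an informal review step.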
Where Pith is reading between the lines
- Robot developers may need to redesign core interfaces to expose governance hooks for easy testing.
- The benchmark could highlight trade-offs between governance constraints and task flexibility in dynamic environments.
- Adoption might encourage creation of governance-aware training data for embodied foundation models.
Load-bearing premise
The seven governance dimensions capture the essential requirements for safe and controllable embodied systems.
What would settle it
A demonstrated embodied system that achieves high task success, routinely violates one or more of the seven governance dimensions, and yet suffers no loss of control or safety incidents would challenge the benchmark's necessity.
Original abstract
Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable -- whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance-oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations. The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single-robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows. Our analysis suggests that embodied governance should become a first-class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EmbodiedGovBench as a benchmark for governance-oriented evaluation of embodied agent systems. It highlights limitations in existing task-success metrics and defines seven governance dimensions (unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness). The paper supplies scenario templates, perturbation operators, metric definitions, and high-level protocols for single-robot and fleet settings, along with guidance on instantiation over modular embodied runtimes with contract-aware interfaces, concluding that embodied governance should become a first-class evaluation target.
Significance. If the proposed structure proves instantiable and yields reproducible measurements, the benchmark could fill a meaningful gap by extending embodied AI evaluation beyond task completion to include controllability, policy adherence, recoverability, and auditability. This has potential to influence safer deployment of robot policies and foundation models. The conceptual organization of dimensions and protocols is a clear contribution, though its significance remains prospective without demonstrated application.
Major comments (2)
- [Abstract] Abstract: The central assertion that EmbodiedGovBench 'provides the initial measurement framework' and 'can be readily instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows' without major additional engineering is unsupported. The manuscript supplies only high-level scenario templates, operators, and protocols but contains no concrete implementation, code, or evaluation results on any robot policy, runtime, or fleet, leaving the readiness and usability claims untested.
- [Benchmark structure] Benchmark structure description: The seven governance dimensions are presented as covering essential requirements, yet the manuscript provides no explicit mapping from these dimensions to documented failure modes in existing embodied systems or to gaps in prior metrics, making it difficult to assess whether the chosen dimensions are complete or minimal for the stated goals.
Minor comments (2)
- [Abstract] Abstract: Consider adding a sentence clarifying that the current contribution is the benchmark definition and protocols rather than empirical results on specific systems.
- [Metric definitions] Metric definitions: Ensure each governance metric is accompanied by a precise formula or pseudocode to support future implementations and reproducibility.
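The second minor comment asks for a precise formula or pseudocode per metric. A sketch of what such a definition could look like for one dimension, with event names assumed for illustration rather than taken from the manuscript:

```python
def audit_completeness(executed_actions, logged_actions) -> float:
    """Hypothetical governance metric: fraction of executed actions that
    appear in the audit log (1.0 means a complete trail)."""
    executed = set(executed_actions)
    if not executed:
        return 1.0  # nothing to log counts as vacuously complete
    return len(executed & set(logged_actions)) / len(executed)
```

Pinning each of the seven dimensions to a definition at this level of precision would directly address the reproducibility concern.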
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript on EmbodiedGovBench. The comments identify key areas where the support for our claims and the rationale for the benchmark dimensions require strengthening. We address each major comment below and specify the revisions we will make.
Point-by-point responses
Referee: [Abstract] Abstract: The central assertion that EmbodiedGovBench 'provides the initial measurement framework' and 'can be readily instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows' without major additional engineering is unsupported. The manuscript supplies only high-level scenario templates, operators, and protocols but contains no concrete implementation, code, or evaluation results on any robot policy, runtime, or fleet, leaving the readiness and usability claims untested.
Authors: We agree that the manuscript presents EmbodiedGovBench as a conceptual framework with high-level definitions, templates, and protocols, without a deployed implementation or empirical results on specific systems. The phrasing regarding ready instantiation reflects the modular structure we describe, which aligns with common interfaces in embodied runtimes, but we acknowledge this claim lacks direct demonstration. In revision, we will update the abstract and introduction to qualify these statements, clarifying that the benchmark supplies a blueprint and protocols whose instantiation will involve integration work by users. We will add an appendix with pseudocode for core operators, interface contracts, and step-by-step instantiation guidance for representative runtimes to make the framework more actionable while preserving its focus as a proposal rather than a completed artifact. revision: partial
Referee: [Benchmark structure] Benchmark structure description: The seven governance dimensions are presented as covering essential requirements, yet the manuscript provides no explicit mapping from these dimensions to documented failure modes in existing embodied systems or to gaps in prior metrics, making it difficult to assess whether the chosen dimensions are complete or minimal for the stated goals.
Authors: The referee is correct that an explicit linkage to prior failure modes would improve the justification. We will revise the manuscript by inserting a new subsection (and accompanying table) that maps each of the seven dimensions to concrete examples drawn from the embodied AI literature, such as unauthorized capability use in policy-constrained navigation systems, runtime drift in long-horizon manipulation, and audit gaps in fleet coordination. The table will also contrast these against limitations of existing task-success metrics. This addition will clarify why the dimensions are both necessary and appropriately scoped for governance-oriented evaluation. revision: yes
Circularity Check
No circularity: benchmark defined from external gaps
Full rationale
The paper introduces EmbodiedGovBench by enumerating seven governance dimensions drawn from gaps in existing task-success metrics, then supplies scenario templates, perturbation operators, metric definitions, and high-level instantiation protocols. No equations, fitted parameters, or quantitative predictions appear. No self-citations are invoked to justify uniqueness or load-bearing premises. The central claim is a design proposal whose content is independent of any result derived from the benchmark itself; the absence of concrete runtime results is a limitation of demonstration, not a circular reduction of the definition to its own outputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Embodied AI systems require evaluation on governance, recovery, and safety beyond task success metrics.
- Ad hoc to this paper: The seven listed dimensions (unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, audit completeness) cover the key governance aspects.
Invented entities (1)
- EmbodiedGovBench (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification. DEMM defines four executable evidence-sufficiency categories plus a conflicting category for agentic AI decisions and rolls per-property verdicts into a five-level maturity rubric.