EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
EmbodiedGovBench evaluates whether robot systems stay controllable and policy-compliant instead of measuring only task success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EmbodiedGovBench is a benchmark for governance-oriented evaluation of embodied agent systems that assesses whether they remain controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations rather than asking only whether tasks are completed. It organizes evaluation around seven dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. The benchmark supplies a structure for single-robot and fleet scenarios that includes templates, perturbation operators, governance metrics, and baseline protocols, and it describes how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows.
What carries the argument
EmbodiedGovBench, a benchmark structure spanning seven governance dimensions with scenario templates, perturbation operators, metrics, and protocols for testing controllability and safety in embodied agent systems.
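The paper defines these pieces abstractly; a minimal sketch of how they might compose is below. All names (`Scenario`, `drop_override`, `override_responsiveness`) are illustrative assumptions, not definitions taken from the manuscript.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Scenario:
    """Scenario template: a task plus the events observed during a run."""
    task: str
    events: List[dict] = field(default_factory=list)

def drop_override(scenario: Scenario) -> Scenario:
    """Perturbation operator (hypothetical): simulate lost human-override acknowledgements."""
    events = [e for e in scenario.events if e.get("type") != "override_ack"]
    return Scenario(scenario.task, events)

def override_responsiveness(scenario: Scenario) -> float:
    """Governance metric (hypothetical): fraction of override requests acknowledged."""
    requests = [e for e in scenario.events if e.get("type") == "override_request"]
    acks = [e for e in scenario.events if e.get("type") == "override_ack"]
    return len(acks) / len(requests) if requests else 1.0

def evaluate(scenario: Scenario,
             perturbations: List[Callable[[Scenario], Scenario]],
             metric: Callable[[Scenario], float]) -> float:
    """Baseline protocol: apply perturbation operators, then score the perturbed run."""
    for p in perturbations:
        scenario = p(scenario)
    return metric(scenario)
```

The same skeleton would cover the other dimensions by swapping in different operators (e.g., injecting drift) and metrics (e.g., recovery success rate).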
If this is right
- Task success rates alone will no longer be treated as sufficient evidence of system readiness.
- Developers must demonstrate prevention of unauthorized capability use and robustness to runtime drift.
- Safe recovery from errors and responsiveness to human overrides become required, measurable properties.
- Version upgrades must pass safety verification before deployment in production systems.
- Audit completeness becomes a standard requirement for accountability in both single and fleet deployments.
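The upgrade-safety consequence above could, under the paper's contract-aware upgrade workflows, take the form of a pre-deployment gate. A minimal sketch, assuming hypothetical contract fields (`permissions`, `capabilities`) not specified in the source:

```python
def upgrade_is_safe(old_contract: dict, new_contract: dict) -> bool:
    """Hypothetical upgrade gate: a new capability version must neither widen
    its permission set nor drop capabilities that dependents rely on."""
    # Reject any permission not already in the approved set.
    if not set(new_contract["permissions"]) <= set(old_contract["permissions"]):
        return False
    # Every previously offered capability must still be offered.
    return set(old_contract["capabilities"]) <= set(new_contract["capabilities"])
```

In this framing, "version upgrade safety" becomes a measurable pass/fail property rather than an informal review step.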
Where Pith is reading between the lines
- Robot developers may need to redesign core interfaces to expose governance hooks for easy testing.
- The benchmark could highlight trade-offs between governance constraints and task flexibility in dynamic environments.
- Adoption might encourage creation of governance-aware training data for embodied foundation models.
Load-bearing premise
The seven governance dimensions capture the essential requirements for safe and controllable embodied systems.
What would settle it
A demonstrated embodied system that achieves high task success, routinely violates one or more of the seven governance dimensions, and yet suffers no loss of control or safety incidents would challenge the benchmark's necessity.
Original abstract
Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable -- whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance-oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations. The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single-robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows. Our analysis suggests that embodied governance should become a first-class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EmbodiedGovBench as a benchmark for governance-oriented evaluation of embodied agent systems. It highlights limitations in existing task-success metrics and defines seven governance dimensions (unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness). The paper supplies scenario templates, perturbation operators, metric definitions, and high-level protocols for single-robot and fleet settings, along with guidance on instantiation over modular embodied runtimes with contract-aware interfaces, concluding that embodied governance should become a first-class evaluation target.
Significance. If the proposed structure proves instantiable and yields reproducible measurements, the benchmark could fill a meaningful gap by extending embodied AI evaluation beyond task completion to include controllability, policy adherence, recoverability, and auditability. This has potential to influence safer deployment of robot policies and foundation models. The conceptual organization of dimensions and protocols is a clear contribution, though its significance remains prospective without demonstrated application.
Major comments (2)
- [Abstract] Abstract: The central assertion that EmbodiedGovBench 'provides the initial measurement framework' and 'can be readily instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows' without major additional engineering is unsupported. The manuscript supplies only high-level scenario templates, operators, and protocols but contains no concrete implementation, code, or evaluation results on any robot policy, runtime, or fleet, leaving the readiness and usability claims untested.
- [Benchmark structure] Benchmark structure description: The seven governance dimensions are presented as covering essential requirements, yet the manuscript provides no explicit mapping from these dimensions to documented failure modes in existing embodied systems or to gaps in prior metrics, making it difficult to assess whether the chosen dimensions are complete or minimal for the stated goals.
Minor comments (2)
- [Abstract] Abstract: Consider adding a sentence clarifying that the current contribution is the benchmark definition and protocols rather than empirical results on specific systems.
- [Metric definitions] Metric definitions: Ensure each governance metric is accompanied by a precise formula or pseudocode to support future implementations and reproducibility.
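The second minor comment asks for a precise formula or pseudocode per metric. A sketch of what such a definition could look like for one dimension, with event names assumed for illustration rather than taken from the manuscript:

```python
def audit_completeness(executed_actions, logged_actions) -> float:
    """Hypothetical governance metric: fraction of executed actions that
    appear in the audit log (1.0 means a complete trail)."""
    executed = set(executed_actions)
    if not executed:
        return 1.0  # nothing to log counts as vacuously complete
    return len(executed & set(logged_actions)) / len(executed)
```

Pinning each of the seven dimensions to a definition at this level of precision would directly address the reproducibility concern.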
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript on EmbodiedGovBench. The comments identify key areas where the support for our claims and the rationale for the benchmark dimensions require strengthening. We address each major comment below and specify the revisions we will make.
Point-by-point responses
Referee: [Abstract] Abstract: The central assertion that EmbodiedGovBench 'provides the initial measurement framework' and 'can be readily instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows' without major additional engineering is unsupported. The manuscript supplies only high-level scenario templates, operators, and protocols but contains no concrete implementation, code, or evaluation results on any robot policy, runtime, or fleet, leaving the readiness and usability claims untested.
Authors: We agree that the manuscript presents EmbodiedGovBench as a conceptual framework with high-level definitions, templates, and protocols, without a deployed implementation or empirical results on specific systems. The phrasing regarding ready instantiation reflects the modular structure we describe, which aligns with common interfaces in embodied runtimes, but we acknowledge this claim lacks direct demonstration. In revision, we will update the abstract and introduction to qualify these statements, clarifying that the benchmark supplies a blueprint and protocols whose instantiation will involve integration work by users. We will add an appendix with pseudocode for core operators, interface contracts, and step-by-step instantiation guidance for representative runtimes to make the framework more actionable while preserving its focus as a proposal rather than a completed artifact. revision: partial
Referee: [Benchmark structure] Benchmark structure description: The seven governance dimensions are presented as covering essential requirements, yet the manuscript provides no explicit mapping from these dimensions to documented failure modes in existing embodied systems or to gaps in prior metrics, making it difficult to assess whether the chosen dimensions are complete or minimal for the stated goals.
Authors: The referee is correct that an explicit linkage to prior failure modes would improve the justification. We will revise the manuscript by inserting a new subsection (and accompanying table) that maps each of the seven dimensions to concrete examples drawn from the embodied AI literature, such as unauthorized capability use in policy-constrained navigation systems, runtime drift in long-horizon manipulation, and audit gaps in fleet coordination. The table will also contrast these against limitations of existing task-success metrics. This addition will clarify why the dimensions are both necessary and appropriately scoped for governance-oriented evaluation. revision: yes
Circularity Check
No circularity: benchmark defined from external gaps
Full rationale
The paper introduces EmbodiedGovBench by enumerating seven governance dimensions drawn from gaps in existing task-success metrics, then supplies scenario templates, perturbation operators, metric definitions, and high-level instantiation protocols. No equations, fitted parameters, or quantitative predictions appear. No self-citations are invoked to justify uniqueness or load-bearing premises. The central claim is a design proposal whose content is independent of any result derived from the benchmark itself; the absence of concrete runtime results is a limitation of demonstration, not a circular reduction of the definition to its own outputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Embodied AI systems require evaluation on governance, recovery, and safety beyond task success metrics.
- Ad hoc to this paper: The seven listed dimensions (unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, audit completeness) cover the key governance aspects.
Invented entities (1)
- EmbodiedGovBench (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification. DEMM defines four executable evidence-sufficiency categories plus a conflicting category for agentic AI decisions and rolls per-property verdicts into a five-level maturity rubric.