REBAR: Reference Ethical Benchmark for Autonomy Readiness

Jonathan Diller, David Barnes, Rebekah Bogdanoff, Rhett Collier, Roddy Collins, Keith Fieldhouse, Yonatan Gefen, Cameron Johnson, Anuriha Kodali, Brad Kriel, Varun Murali, James Niehaus, Mish Sukharev, Joseph VanPelt, Anthony Hoogs, Vijay Kumar, Arslan Basharat

arxiv: 2605.18423 · v1 · pith:ECQWSPMJnew · submitted 2026-05-18 · 💻 cs.RO · cs.CY

Pith reviewed 2026-05-20 09:24 UTC · model grok-4.3

classification 💻 cs.RO cs.CY

keywords ethical benchmarking · autonomous systems · autonomy readiness level · neuro-symbolic AI · simulation testing · LLM evaluation · ethical compliance · robotics safety

The pith

REBAR assigns autonomous systems an Autonomy Readiness Level score based on ethical performance measured through simulated scenarios and neuro-symbolic analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces REBAR as a quantitative framework to test autonomous systems for ethical and legal compliance. It converts operating metrics into a computable Autonomy Readiness Level rubric using a neuro-symbolic large language model to assess scenario difficulty, generate test cases at scale, and run evaluations inside a photorealistic simulator. This produces an objective benchmark score that tells users whether a given system is ready for a task and supplies interpretable reasons for the assessment. A sympathetic reader would care because current ethical checks for embodied AI stay mostly qualitative and often block behavior without explanation or override options, leaving a gap between principles and accountable deployment.

Core claim

REBAR is a quantitative test and evaluation framework for autonomous systems that maps operating metrics into a computable Autonomy Readiness Level (ARL) rubric to quantify ethical performance. Key components include a neuro-symbolic Large Language Model approach that calculates and explains the ethical difficulty of scenarios, LLM-driven generation of test instances, and a versatile photorealistic simulation environment. By running white-box autonomy solutions through this pipeline, the framework produces an objective and repeatable benchmark score that connects abstract ethical principles to verifiable, accountable autonomy.

What carries the argument

The Autonomy Readiness Level (ARL) rubric, which converts measured operating metrics from simulated ethical scenarios into a single quantifiable score of ethical compliance.
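To make the idea of a rubric that turns operating metrics into a score concrete, the following is a minimal sketch under stated assumptions. The threshold values, the 0 to 4 level scale, the `readiness_level` helper, and the run outcomes are all hypothetical; the actual ARL rubric is not specified on this page. Only the key-attribute labels (KA-18, KA-20) are taken from the Figure 5 caption.

```python
from bisect import bisect_right

# Hypothetical pass-rate cut-offs (assumption, not the paper's rubric).
# Meeting or exceeding k of these cut-offs yields illustrative level k.
LEVEL_THRESHOLDS = [0.50, 0.75, 0.90, 0.99]


def readiness_level(run_outcomes):
    """Map pass/fail outcomes from repeated simulated runs to a level 0-4."""
    if not run_outcomes:
        raise ValueError("at least one run is required")
    pass_rate = sum(run_outcomes) / len(run_outcomes)
    # bisect_right counts how many thresholds are <= pass_rate.
    return bisect_right(LEVEL_THRESHOLDS, pass_rate)


# Invented outcomes for two key attributes over 20 runs each.
runs = {
    "KA-20": [True] * 19 + [False],      # pass rate 0.95
    "KA-18": [True] * 8 + [False] * 12,  # pass rate 0.40
}
levels = {ka: readiness_level(outcomes) for ka, outcomes in runs.items()}
print(levels)
```

A per-attribute level, rather than one pooled number, keeps a weak attribute visible even when mission success is high.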

If this is right

Systems receive concrete scores that indicate whether they are suitable for a given autonomy task.
Users obtain interpretable reasons for ethical assessments rather than simple pass-fail blocks.
Accountability improves because benchmark results can be repeated and audited.
Developers gain a pipeline for iteratively improving ethical guardrails in simulation before real-world use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If widely adopted, regulators could require minimum ARL scores for licensing self-driving vehicles or delivery robots.
The simulation-only testing pipeline would benefit from direct comparison against physical robot trials to check whether sim-to-real gaps affect ethical scores.
The framework could be applied to other domains such as medical decision support or financial trading agents that also require ethical compliance checks.

Load-bearing premise

The neuro-symbolic LLM can calculate and explain the ethical difficulty of scenarios in a manner that matches human ethical and legal standards without introducing systematic bias.

What would settle it

Run the same set of generated scenarios through REBAR and through independent panels of ethicists and legal experts, then measure the rate of agreement between the LLM-derived difficulty scores and explanations versus the human judgments.
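One common way to measure the agreement described above is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is an illustration only, not a procedure from the paper; the `cohen_kappa` helper, the three difficulty labels, and the sample ratings are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented difficulty labels for six scenarios (not data from the paper).
llm_labels = ["low", "high", "medium", "high", "low", "medium"]
panel_labels = ["low", "high", "medium", "medium", "low", "medium"]
kappa = cohen_kappa(llm_labels, panel_labels)
print(round(kappa, 3))  # 0.75
```

Because difficulty is ordinal, a weighted kappa or a rank correlation may be a better fit in practice; unweighted kappa is used here only because it is the simplest to verify by hand.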

Figures

Figures reproduced from arXiv: 2605.18423 by Anthony Hoogs, Anuriha Kodali, Arslan Basharat, Brad Kriel, Cameron Johnson, David Barnes, James Niehaus, Jonathan Diller, Joseph VanPelt, Keith Fieldhouse, Mish Sukharev, Rebekah Bogdanoff, Rhett Collier, Roddy Collins, Varun Murali, Vijay Kumar, Yonatan Gefen.

**Figure 1.** Proposed REBAR framework. A user can simply define a mission objective along with a natural language description of a scenario. Given this …

**Figure 2.** REBAR graph visualization mapping principles, key attributes, VABs, and observables. The ethical decomposition grounds each principle to a …

**Figure 3.** Example showing the UAV searching for a high-priority target (panels 1 through 3) in the FalconSim and a “satellite” view (panel 4) showing …

**Figure 4.** Representative examples of the pixel-based perception pipeline

**Figure 5.** Final ARL scores after N runs for KA-03 (bystander classification), KA-05 (adversary classification), KA-09 (object detection), KA-18 (bystander proximity reasoning), and KA-20 (mission accomplishment), showing a high task success rate but a failure to reason about an action’s impact on bystanders.

Original abstract

As autonomous systems grow more advanced, objective metrics to evaluate their ethical and legal compliance are critical for informing end users of their limitations and ensuring accountability of those who misuse them. Current ethical embodied AI frameworks remain mostly qualitative, focusing on system design (through safety guardrails or targeted red teaming), and the realized guardrails often directly disallow unsafe behavior without providing the user with an override or interpretable reason. Instead, there is a need for computable metrics through rigorous testing that allow a user to determine the applicability of the system to the task. To address this gap, we introduce the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a quantitative test and evaluation framework for autonomous systems. REBAR maps operating metrics into a computable Autonomy Readiness Level (ARL) rubric that can quantify ethical performance. Key innovations of the framework include a neuro-symbolic Large Language Model (LLM) approach to calculate and explain the ethical difficulty of scenarios, LLM-driven at-scale generation of test instances, and a versatile, photorealistic simulation environment. By evaluating white-box autonomy solutions through this rigorous testing pipeline, REBAR delivers an objective and repeatable benchmark score, bridging the gap between abstract principles and verifiable, accountable autonomy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

REBAR proposes a concrete pipeline for ethical benchmarking in autonomy using neuro-symbolic LLMs and simulation, but the lack of any validation data leaves the objectivity claims unsupported.


The paper's main contribution is a framework called REBAR that combines neuro-symbolic LLMs for ethical scoring, LLM-generated test scenarios, and photorealistic simulation to produce an Autonomy Readiness Level score for autonomous systems. It does a solid job identifying the limitations of existing qualitative ethical frameworks in robotics, which often just block behaviors without explanation or user control. The idea of a computable metric that bridges principles to verifiable performance is worth exploring. The approach is new in how it integrates these components into a single pipeline for white-box evaluation.

However, the central problem is the absence of any experimental results or validation. The framework depends on the neuro-symbolic LLM accurately determining ethical difficulty in line with human and legal standards, but the paper provides no data on agreement with experts, no bias testing, and no comparisons to existing methods. This makes the claims of objectivity and repeatability hard to assess right now. The ARL rubric could end up being circular if the scores are based on internal definitions rather than external grounding.

This paper is for people working on ethical AI in robotics or those developing benchmarks for autonomous systems. A reader looking for new ideas on how to structure such evaluations might get something out of the architecture, but it won't satisfy someone wanting proven tools. I think it deserves peer review because the problem is important and the proposal is concrete enough for referees to give useful feedback on how to strengthen the validation side.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a quantitative framework for evaluating ethical and legal compliance of autonomous systems. It maps operating metrics to an Autonomy Readiness Level (ARL) rubric via a neuro-symbolic LLM for calculating and explaining ethical difficulty of scenarios, LLM-driven generation of test instances at scale, and a photorealistic simulation environment, with the goal of delivering an objective and repeatable benchmark score for white-box autonomy solutions.

Significance. If the core assumptions hold, REBAR could provide a valuable contribution by offering computable, scalable metrics that bridge abstract ethical principles with verifiable testing in robotics and embodied AI, potentially improving accountability and user awareness of system limitations. The combination of neuro-symbolic methods for explainability and simulation-based evaluation has clear potential for practical impact if empirically grounded.

major comments (2)

Abstract: The central claim that 'REBAR delivers an objective and repeatable benchmark score' is load-bearing, yet the manuscript supplies no experimental results, validation data, inter-rater agreement with experts, bias audits, or comparisons to established ethical/legal frameworks to demonstrate that the neuro-symbolic LLM component accurately calculates ethical difficulty without systematic bias.
Abstract: The Autonomy Readiness Level (ARL) rubric is presented as mapping operating metrics into a computable score, but without any derivation, equations, or empirical grounding shown for how LLM outputs are aggregated or validated against external standards, the objectivity assertion risks depending on internal framework definitions rather than independent verification.

minor comments (2)

The description of the photorealistic simulation environment would benefit from additional specifics on scenario parameterization, edge-case handling, and how it interfaces with the LLM scoring pipeline.
Clarify the exact neuro-symbolic architecture, including the symbolic component's role in ensuring alignment with human ethical and legal standards.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing the REBAR framework. We address each major comment below and have revised the manuscript to improve clarity, provide additional grounding where possible, and adjust claims to better reflect the current scope of the work.


Referee: [—] Abstract: The central claim that 'REBAR delivers an objective and repeatable benchmark score' is load-bearing, yet the manuscript supplies no experimental results, validation data, inter-rater agreement with experts, bias audits, or comparisons to established ethical/legal frameworks to demonstrate that the neuro-symbolic LLM component accurately calculates ethical difficulty without systematic bias.

Authors: We agree that the claim of delivering an objective and repeatable benchmark score requires stronger support to be fully substantiated. The current manuscript emphasizes the framework design, including the neuro-symbolic approach that combines deterministic symbolic rules for ethical principles with LLM-based explanation and scenario generation. This structure is intended to promote repeatability through explicit rules rather than purely stochastic LLM outputs. However, we acknowledge the absence of comprehensive empirical validation, inter-rater studies, or bias audits in this version. In the revised manuscript, we will add a new subsection on limitations and future validation plans, including preliminary comparisons to expert annotations on a small set of scenarios and references to established frameworks such as the IEEE Ethically Aligned Design guidelines. We will also revise the abstract to state that REBAR provides a methodology for computing such scores, with full empirical grounding to be presented in subsequent work. revision: yes
Referee: [—] Abstract: The Autonomy Readiness Level (ARL) rubric is presented as mapping operating metrics into a computable score, but without any derivation, equations, or empirical grounding shown for how LLM outputs are aggregated or validated against external standards, the objectivity assertion risks depending on internal framework definitions rather than independent verification.

Authors: The full manuscript defines the ARL rubric across five levels with explicit criteria tied to ethical and legal principles drawn from sources including the Asilomar AI Principles and relevant robotics safety standards. The mapping incorporates operating metrics such as scenario difficulty scores produced by the neuro-symbolic component. We recognize that the abstract and introductory sections do not sufficiently detail the aggregation process or provide equations. In the revision, we will insert a dedicated methods subsection that presents the formal aggregation formula, including how LLM-generated difficulty explanations are weighted with symbolic rule violations and normalized into the ARL score. We will also include a brief comparison table aligning ARL levels with external benchmarks to strengthen the grounding. revision: yes

Circularity Check

0 steps flagged

No circularity: REBAR is a definitional framework without self-referential reductions


The paper introduces REBAR as a new quantitative benchmark that maps operating metrics to an Autonomy Readiness Level (ARL) rubric via a neuro-symbolic LLM for ethical difficulty scoring and LLM-driven scenario generation in simulation. No equations, derivations, or first-principles results are presented that reduce to fitted parameters, self-citations, or inputs by construction. The central claim of an objective repeatable score follows directly from the proposed pipeline definition rather than any prediction equivalent to its own components. The framework is self-contained as a methodology proposal; external validation of the LLM component is a separate correctness issue, not a circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that LLMs can serve as reliable proxies for ethical judgment and that the resulting rubric produces meaningful readiness levels; no free parameters or invented physical entities are named in the abstract.

axioms (1)

Domain assumption: Large language models combined with symbolic methods can accurately determine and explain the ethical difficulty of scenarios.
This is invoked as the core mechanism for calculating ethical performance in the described pipeline.

pith-pipeline@v0.9.0 · 5802 in / 1341 out tokens · 40114 ms · 2026-05-20T09:24:22.061076+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Key innovations ... neuro-symbolic Large Language Model (LLM) approach to calculate and explain the ethical difficulty of scenarios ... ARL score computation approach ... Principle of Minimal Ethical Difficulty

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

The framework employs the Principle of Minimal Ethical Difficulty: An Observable can only certify an agent to the minimum of its configured ethical difficulties.
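The Principle of Minimal Ethical Difficulty quoted above can be read as a cap: whatever difficulties an observable is configured with, it certifies only up to the lowest one. The sketch below illustrates that reading. The `certifiable_difficulty` helper, the factor names, and the 1 to 5 scale are hypothetical; only the minimum rule comes from the quoted passage.

```python
def certifiable_difficulty(configured_difficulties):
    """Return the highest difficulty an observable can certify.

    Under the minimum rule this is the smallest configured difficulty.
    """
    if not configured_difficulties:
        raise ValueError("an observable needs at least one configured difficulty")
    return min(configured_difficulties.values())


# Hypothetical configuration: one difficulty per scenario factor,
# on an assumed 1-5 scale (neither is taken from the paper).
observable_config = {
    "bystander_density": 4,
    "visibility": 2,
    "target_ambiguity": 3,
}
cap = certifiable_difficulty(observable_config)
print(cap)  # 2
```

Taking the minimum is the conservative choice: a single easy factor limits what a test run can claim, however hard the other factors were.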

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

[1]

Autort: Embodied foundation models for large scale orchestration of robotic agents,

M. Ahn, D. Dwibedi, C. Finn, M. G. Arenas, K. Gopalakrishnan, K. Hausman, B. Ichter, A. Irpan, N. Joshi, R. Julian, S. Kirmani, I. Leal, E. Lee, S. Levine, Y. Lu, I. Leal, S. Maddineni, K. Rao, D. Sadigh, P. Sanketi, P. Sermanet, Q. Vuong, S. Welker, F. Xia, T. Xiao, P. Xu, S. Xu, and Z. Xu, “Autort: Embodied foundation models for large scale orchestrati...

work page 2024
[2]

Generating robot constitutions & benchmarks for semantic safety,

P. Sermanet, A. Majumdar, A. Irpan, D. Kalashnikov, and V. Sindhwani, “Generating robot constitutions & benchmarks for semantic safety,” Conference on Robot Learning (CoRL) 2025, 2025, version 1. Project page: https://asimov-benchmark.github.io. [Online]. Available: https://arxiv.org/abs/2503.08663

work page arXiv 2025
[3]

Scifi-benchmark: Leveraging science fiction to improve robot behavior,

P. Sermanet, A. Majumdar, and V. Sindhwani, “Scifi-benchmark: Leveraging science fiction to improve robot behavior,” arXiv preprint arXiv:2503.10706, 2025, project page: https://scifi-benchmark.github.io. [Online]. Available: http://arxiv.org/abs/2503.10706

work page arXiv 2025
[4]

Holistic evaluation of language models,

P. Liang, R. Bommasani et al., “Holistic evaluation of language models,” Transactions on Machine Learning Research (TMLR), 2023, Center for Research on Foundation Models (CRFM). [Online]. Available: https://crfm.stanford.edu/helm/latest/

work page 2023
[5]

Gaia: a benchmark for general ai assistants,

G. Mialon et al., “Gaia: a benchmark for general ai assistants,” 2023

work page 2023
[6]

Partnr: Planning and reasoning tasks for embodied agents,

M. AI, “Partnr: Planning and reasoning tasks for embodied agents,” 2024, project/benchmark release. [Online]. Available: https://ai.meta.com/research/

work page 2024
[7]

Constitutional AI: Harmlessness from AI Feedback

Y. Bai, S. Kadavath, S. Kundu et al., “Constitutional ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Do as i can, not as i say: Grounding language in robotic affordances,

M. Ahn, A. Brohan, N. Brown et al., “Do as i can, not as i say: Grounding language in robotic affordances,” in Proceedings of Robotics: Science and Systems (RSS), 2022

work page 2022
[9]

Aligning AI With Shared Human Values

D. Hendrycks, C. Burns, S. Basart et al., “Aligning ai with shared human values,” arXiv preprint arXiv:2008.02275, 2021

work page internal anchor Pith review arXiv 2008
[10]

Responsible AI guidelines,

Defense Innovation Unit (DIU), “Responsible AI guidelines,” 2025. [Online]. Available: https://www.diu.mil/responsible-ai

work page 2025
[11]

RAI toolkit,

Chief Digital and Artificial Intelligence Office (CDAO), “RAI toolkit,” 2025. [Online]. Available: https://rai.tradewindai.com/

work page 2025
[12]

Artificial intelligence risk management framework (ai rmf 1.0),

E. Tabassi, “Artificial intelligence risk management framework (AI RMF 1.0),” National Institute of Standards and Technology, Tech. Rep. NIST AI 100-1, 2023. [Online]. Available: https://doi.org/10.6028/NIST.AI.100-1

work page doi:10.6028/nist.ai.100-1 2023
[13]

From ai ethics principles to practices: A teleological methodology to apply ai ethics principles in the defence domain,

M. Taddeo, A. Blanchard, and C. Thomas, “From ai ethics principles to practices: A teleological methodology to apply ai ethics principles in the defence domain,” Philosophy & Technology, vol. 37, no. 1, p. 42, 2024. [Online]. Available: https://doi.org/10.1007/s13347-024-00710-6

work page doi:10.1007/s13347-024-00710-6 2024
[14]

Artificial intelligence: Approaches to safety,

W. D’Alessandro and C. D. Kirk-Giannini, “Artificial intelligence: Approaches to safety,” Philosophy Compass, vol. 20, no. 5, p. e70039, 2025. [Online]. Available: https://compass.onlinelibrary.wiley.com/doi/abs/10.1111/phc3.70039

work page doi:10.1111/phc3.70039 2025
[15]

Doppelganger saliency: Towards more ethical person re-identification,

B. RichardWebster, B. Hu, K. Fieldhouse, and A. Hoogs, “Doppelganger saliency: Towards more ethical person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2022, pp. 2847–2857

work page 2022
[16]

Language models are alignable decision-makers: Dataset and application to the medical triage domain,

B. Hu, B. Ray, A. Leung, A. Summerville, D. Joy, C. Funk, and A. Basharat, “Language models are alignable decision-makers: Dataset and application to the medical triage domain,” 2024, under review at NAACL 2024 Industry Track

work page 2024
[17]

Measures for explainable ai: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-ai performance,

R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman, “Measures for explainable ai: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-ai performance,” Frontiers in Computer Science, vol. 5, 2023. [Online]. Available: https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2023.1096257

work page doi:10.3389/fcomp.2023.1096257 2023
[18]

Police and military as good strangers,

G. Klein, H. A. Klein, B. Lande, J. Borders, and J. C. Whitacre, “Police and military as good strangers,” Journal of Occupational and Organizational Psychology, vol. 88, no. 2, pp. 231–250. [Online]. Available: https://bpspsychub.onlinelibrary.wiley.com/doi/abs/10.1111/joop.12110

work page doi:10.1111/joop.12110
[20]

From Principles to Practice. An Interdisciplinary Framework to Operationalise Ai Ethics,

L. Fetic, T. Fleischer, P. Grünke, T. Hagendorf, S. Hallensleben, M. Hauer, M. Herrmann, R. Hillerbrand, C. Hustedt, C. Hubig, A. Kaminski, T. Krafft, W. Loh, P. Otto, and M. Puntschuh, From Principles to Practice. An Interdisciplinary Framework to Operationalise Ai Ethics. Bertelsmann-Stiftung, 2020

work page 2020
[21]

Falcon: Digital twin simulation platform,

Duality Robotics, “Falcon: Digital twin simulation platform,” 2025. [Online]. Available: https://www.duality.ai/product

work page 2025
[22]

Ai principles: recommendations on the ethical use of artificial intelligence by the department of defense: supporting document,

D. I. Board, “Ai principles: recommendations on the ethical use of artificial intelligence by the department of defense: supporting document,”United States Department of Defense, 2019

work page 2019
[23]

Spine: Online semantic planning for missions with incomplete natural language specifications in unstructured environments,

Z. Ravichandran et al., “Spine: Online semantic planning for missions with incomplete natural language specifications in unstructured environments,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025

work page 2025
