ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

Pratyush Chaudhari

arxiv: 2606.13282 · v1 · pith:UUBPP7IInew · submitted 2026-06-11 · 💻 cs.AI

Pith reviewed 2026-06-27 06:42 UTC · model grok-4.3

keywords: ethical robustness, adversarial testing, AI ethics, consequence space, semantic perturbation, LLM evaluation, ethical instability, pre-deployment assessment

The pith

ERTS tests AI ethical robustness by perturbing dilemmas in a 22-dimensional consequence space and finds most models unstable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Ethical Robustness Testing System to check whether AI systems keep their ethical decisions stable when scenarios receive small semantic changes. It maps dilemmas into a 22-dimensional Ethical Consequence Space drawn from ethical theory, then applies 17 perturbation functions that must satisfy six validity rules, and finally scores how far the AI's output drifts using a four-part instability index. When the system runs 1500 test cases on four baseline models and two production LLMs across eight domains, only one third of the models meet the clearance threshold. A reader would care because AI is already being placed in healthcare, hiring, and vehicle control where shifting ethical judgments can produce real harm. The framework supplies a repeatable pipeline that turns ethical evaluation into a measurable engineering task.
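The encode, perturb, validate, and measure loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: only the dimensionality (22) comes from the source, while `shift_dimension`, `is_valid`, `toy_model`, the bounds [0, 1], the `max_shift` limit, and the single mean-absolute-deviation score are all hypothetical stand-ins for the 17 perturbation functions, six constraint classes, and four-component index that the abstract names but does not define.

```python
import math
from typing import Callable, List, Optional

ECS_DIMS = 22  # dimensionality reported in the paper

Vector = List[float]


def shift_dimension(index: int, delta: float) -> Callable[[Vector], Vector]:
    """Hypothetical perturbation: nudge one consequence dimension by delta."""
    def perturb(v: Vector) -> Vector:
        out = list(v)
        out[index] += delta
        return out
    return perturb


def is_valid(original: Vector, perturbed: Vector, max_shift: float = 0.25) -> bool:
    """Stand-in validity check: stay inside an assumed [0, 1] box and near the original."""
    in_bounds = all(0.0 <= x <= 1.0 for x in perturbed)
    return in_bounds and math.dist(original, perturbed) <= max_shift


def toy_model(v: Vector) -> float:
    """Placeholder decision function: the mean of the consequence scores."""
    return sum(v) / len(v)


def decision_deviation(
    model: Callable[[Vector], float],
    original: Vector,
    perturbations: List[Callable[[Vector], Vector]],
) -> Optional[float]:
    """Mean absolute change in the model's output over valid perturbations."""
    base = model(original)
    deviations = []
    for perturb in perturbations:
        candidate = perturb(original)
        if is_valid(original, candidate):  # invalid cases are discarded, not scored
            deviations.append(abs(model(candidate) - base))
    return sum(deviations) / len(deviations) if deviations else None


scenario = [0.5] * ECS_DIMS
perturbations = [
    shift_dimension(0, 0.2),   # valid
    shift_dimension(3, -0.1),  # valid
    shift_dimension(5, 0.9),   # leaves the bounded space, so it is rejected
]
score = decision_deviation(toy_model, scenario, perturbations)
```

In this toy setup the third perturbation is filtered out by the validity check, which is the role the constraint classes appear to play in the pipeline: only perturbations that remain plausible scenarios contribute to the instability score.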

Core claim

The paper claims that ERTS provides a closed-pipeline framework that encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space grounded in established ethical theory, applies 17 semantic perturbation functions subject to six validity constraint classes including a novel semantic coherence constraint, measures decision deviation via a four-component Ethical Instability Index, and produces domain-adaptive pre-deployment robustness assessment verdicts. Evaluation of four structured baseline models and two production LLMs across fifty ethical scenarios spanning eight deployment domains, generating fifteen hundred adversarial test cases, shows that only thirty-three percent of mode

What carries the argument

The 22-dimensional Ethical Consequence Space, the seventeen semantic perturbation functions, and the six validity constraint classes, which jointly generate controlled adversarial ethical scenarios and quantify decision instability.

If this is right

Only thirty-three percent of the evaluated models would receive clearance for deployment in ethical decision domains.
The local Llama-3.2 model would require targeted fixes for fairness corruption and information degradation vulnerabilities.
The system supports domain-adaptive verdicts that can be applied separately to healthcare, employment screening, and other fields.
Pre-deployment use of the pipeline identifies specific attack types that destabilize ethical reasoning.
Results indicate that production LLMs need additional safeguards to maintain consistency under semantic changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could add the perturbation functions to training loops to increase stability of ethical outputs.
Regulatory audits might adopt similar bounded spaces to certify AI systems for high-stakes use.
The method could be combined with factual robustness tests to produce joint safety scores.
Extending the space to additional ethical theories would allow comparison of model behavior across different moral frameworks.

Load-bearing premise

The 22-dimensional Ethical Consequence Space and the seventeen perturbation functions, subject to the six validity constraints, capture ethical reasoning without introducing artifacts that invalidate the instability measurements.

What would settle it

If human experts facing the same perturbed scenarios produce decision shifts that fail to correlate with the models' Ethical Instability Index scores, the framework would not be measuring the intended form of ethical instability.
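One way to run the check described above is a rank correlation between human decision shifts and the index scores on the same perturbed scenarios. The sketch below computes Spearman's rho in plain Python. The two data lists are invented purely for illustration, and the choice of Spearman over another agreement measure is an assumption, not something the paper or the review specifies.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one entry per perturbed scenario.
human_shift = [0.10, 0.40, 0.35, 0.80, 0.05]  # size of expert decision change
model_eii = [0.12, 0.50, 0.30, 0.90, 0.20]    # instability score for the model

rho = spearman(human_shift, model_eii)
```

A rho near zero or negative on real data would support the failure condition stated above; a high rho would not prove validity on its own, but it would remove one obvious objection.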

Original abstract

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ERTS proposes a combined pipeline for ethical robustness testing but the abstract shows no validation of the 22D space or perturbations, so the 33% clearance claim rests on unexamined assumptions.

The main takeaway is that this paper introduces ERTS as a closed pipeline that maps ethical dilemmas into a 22-dimensional Ethical Consequence Space, applies 17 semantic perturbations under six constraints including a new semantic coherence rule, computes a four-part Ethical Instability Index, and issues domain-adaptive verdicts. It tests four structured baselines plus Gemini 2.0 Flash and Llama 3.2 on 50 scenarios and 1,500 cases, reporting only 33 percent clearance and an ERS of 0.737 for Llama 3.2 on fairness and information attacks.

What is actually new is the single-pipeline integration of the bounded consequence space, the coherence-constrained perturbations, and the multi-component index for producing pre-deployment assessments. The authors correctly note that prior work has not assembled these pieces together. The paper does a service by naming the practical gap in high-stakes domains and by trying to make the test cases domain-adaptive.

The soft spots are large and central. The abstract supplies no equations, no description of how the 22 dimensions were derived from ethical theory, no account of how the six constraint classes are enforced in code, and no check that the perturbations preserve the original ethical structure rather than inject noise. The reported clearance rate and ERS value therefore sit on top of free parameters whose effects are not measured. The circularity concern is real: the instability index is defined inside the same perturbation-and-constraint system it is meant to evaluate, with no external benchmark or independent falsification shown. The stress-test point about possible framework-induced artifacts holds on the evidence available.

This paper is for researchers who want a high-level sketch of an ethical testing architecture. A reader looking for concrete methods or reproducible results will find little to use. It does not yet deserve a serious referee because the core claims lack the supporting derivations, validation experiments, or data that would let a reviewer check whether the measured deviations reflect model behavior or the test design itself. I would desk-reject and ask for the missing methodology and checks before sending it out.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in ethical theory, applies 17 semantic perturbation functions under 6 validity constraint classes (including a novel semantic coherence constraint), measures decision deviation via a 4-component Ethical Instability Index (EII), and produces domain-adaptive pre-deployment robustness verdicts. It evaluates 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) on 50 ethical scenarios across 8 domains, generating 1500 adversarial test cases, and reports that only 33% of models achieve assessment clearance, with Llama-3.2 particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737).

Significance. If the 22D ECS and perturbation functions can be shown to test ethical reasoning without introducing unvalidated artifacts, the framework would provide a structured, domain-adaptive approach to adversarial testing of ethical AI that is currently underdeveloped. The evaluation across multiple models and domains, combined with the introduction of semantic coherence constraints, offers a concrete pipeline that could inform pre-deployment assessments in high-stakes areas like healthcare and autonomous systems. The attempt to ground the space in ethical theory and generate a large number of test cases (1500) is a positive step toward reproducible robustness metrics.

major comments (3)

[§3 (Ethical Consequence Space)] The 22 dimensions are described as grounded in established ethical theory, but the manuscript provides no explicit selection criteria, mapping to specific theories, or validation (e.g., expert review or sensitivity analysis) that the bounded space captures relevant ethical nuances without omission or distortion. This directly affects whether EII measurements reflect genuine model instability rather than framework-induced effects.
[§4 (Semantic Perturbation Functions)] The 17 perturbation functions subject to 6 validity constraint classes, including the novel semantic coherence constraint, lack any empirical demonstration that they preserve the original ethical dilemma structure (e.g., no human evaluation of coherence preservation or comparison of EII on perturbed vs. unperturbed cases). Without this, the reported 33% clearance rate and model comparisons may be artifacts of the chosen perturbations and constraints rather than indicators of robustness.
[§5 (Experimental Results)] The headline metrics (33% clearance rate, Llama-3.2 ERS = 0.737) are stated without statistical tests, error bars, ablation on framework parameters (e.g., dimension count or constraint enforcement), or external benchmarks, making it impossible to assess whether differences between structured baselines and production LLMs are significant or reproducible.
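To illustrate why the missing error bars matter, the sketch below puts a Wilson score interval around the clearance rate. It assumes the reported 33% corresponds to two of the six evaluated models (four baselines plus two LLMs); that count is an inference from the abstract, not a figure the paper states, and the Wilson interval is one conventional choice for small samples rather than the authors' method.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Assumed counts: 2 cleared models out of 6 evaluated.
low, high = wilson_interval(2, 6)
```

With these assumed counts the interval runs from roughly 0.10 to 0.70, so a clearance rate computed over six models says very little about models in general. Intervals on per-model scores over the 1,500 test cases would be far tighter and are probably the more useful quantity to report.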

minor comments (2)

[Abstract] The abstract claims novelty for combining bounded consequence space, semantic coherence constraints, and domain-adaptive assessment but does not cite or compare against prior work on ethical AI evaluation frameworks.
[§3.3 (Ethical Instability Index)] Notation for the four components of the Ethical Instability Index is introduced without a clear equation or table defining each component's computation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's constructive feedback on the manuscript and provide point-by-point responses to the major comments below.

Referee: [§3 (Ethical Consequence Space)] The 22 dimensions are described as grounded in established ethical theory, but the manuscript provides no explicit selection criteria, mapping to specific theories, or validation (e.g., expert review or sensitivity analysis) that the bounded space captures relevant ethical nuances without omission or distortion. This directly affects whether EII measurements reflect genuine model instability rather than framework-induced effects.

Authors: We agree that the manuscript would benefit from greater explicitness on dimension selection. The 22 dimensions were chosen to cover core consequence types from utilitarianism, deontology, and virtue ethics as discussed in ethical AI literature. In the revised version we will add a dedicated subsection and mapping table in §3 detailing the theoretical basis and selection rationale for each dimension. This clarification will help demonstrate that EII reflects behavior within a motivated space rather than arbitrary artifacts.
Revision: yes

Referee: [§4 (Semantic Perturbation Functions)] The 17 perturbation functions subject to 6 validity constraint classes, including the novel semantic coherence constraint, lack any empirical demonstration that they preserve the original ethical dilemma structure (e.g., no human evaluation of coherence preservation or comparison of EII on perturbed vs. unperturbed cases). Without this, the reported 33% clearance rate and model comparisons may be artifacts of the chosen perturbations and constraints rather than indicators of robustness.

Authors: The semantic coherence constraint is designed to maintain dilemma integrity, but we acknowledge the lack of empirical checks such as human ratings or EII comparisons. We will revise §4 to include qualitative examples of preserved structure and, where feasible, a limited human coherence assessment in an appendix. A comprehensive study across all cases exceeds current scope, so this constitutes a partial response.
Revision: partial

Referee: [§5 (Experimental Results)] The headline metrics (33% clearance rate, Llama-3.2 ERS = 0.737) are stated without statistical tests, error bars, ablation on framework parameters (e.g., dimension count or constraint enforcement), or external benchmarks, making it impossible to assess whether differences between structured baselines and production LLMs are significant or reproducible.

Authors: We accept the need for stronger statistical support. The revised §5 will add significance tests for model differences, error bars or intervals on key metrics, and ablations on dimension count and constraint enforcement. External benchmarks are limited by the framework's novelty; we will discuss this limitation explicitly and reference related evaluation approaches.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces ERTS as a new closed-pipeline framework whose core components—the 22-dimensional ECS grounded in established ethical theory, the 17 semantic perturbation functions under 6 validity constraint classes, and the 4-component EII—are presented as definitional elements of the testing system rather than derived from one another. The reported results (33% clearance rate, ERS=0.737) are empirical outcomes of applying these components to the evaluated models across 1,500 test cases. No equations or steps in the provided abstract reduce a claimed prediction or result to a fitted parameter or self-citation by construction, and no load-bearing self-citation or uniqueness theorem from prior author work is invoked. The derivation is therefore self-contained against the stated external grounding in ethical theory.

Axiom & Free-Parameter Ledger

4 free parameters · 2 axioms · 3 invented entities

The central claim rests on several design choices and domain assumptions introduced in the abstract without further justification or external evidence. The 22D space, perturbation counts, and index components are treated as given rather than derived.

free parameters (4)

22 dimensions of Ethical Consequence Space
Specific dimensionality chosen to ground in ethical theory; no derivation or sensitivity analysis provided.
17 semantic perturbation functions
Exact number and definitions are framework design parameters.
6 validity constraint classes
Includes novel semantic coherence constraint; selection is ad hoc to the system.
4-component Ethical Instability Index
Component definitions and weighting are internal to the framework.
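The point about internal weighting can be made concrete with a small sensitivity sweep. The sketch below assumes the index is a weighted sum of four component scores, which the abstract does not confirm; the component values, the clearance threshold of 0.20, and the convention that lower scores mean more stable are all hypothetical and chosen only to show how much a verdict can depend on the weights.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def weight_grid(n_components: int = 4, steps: int = 4) -> Iterator[Tuple[float, ...]]:
    """Yield weight tuples made of non-negative multiples of 1/steps that sum to 1."""
    for combo in product(range(steps + 1), repeat=n_components):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)


def composite(components: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of component scores (an assumed form for the index)."""
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical component scores for one model and a hypothetical cut-off.
components = (0.10, 0.40, 0.25, 0.05)
threshold = 0.20

scores = [composite(components, w) for w in weight_grid()]
lowest, highest = min(scores), max(scores)

# Fraction of weightings under which this model would pass the cut-off.
cleared_share = sum(s <= threshold for s in scores) / len(scores)
```

If the share of weightings that clear a model sits well away from 0 and 1, the verdict is being driven by the weighting as much as by the model, which is the kind of ablation the ledger entry says is missing.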

axioms (2)

domain assumption · Ethical dilemmas can be faithfully encoded into a 22-dimensional space grounded in established ethical theory
Invoked as the foundation of the ECS without proof or external validation in the abstract.
domain assumption · Semantic perturbations under the 6 constraint classes (including semantic coherence) preserve the ethical character of the original dilemma
Required for the perturbations to be valid tests; assumed rather than demonstrated.

invented entities (3)

Ethical Consequence Space (ECS) · no independent evidence
purpose: Bounded representation of ethical dilemmas for systematic perturbation
Newly introduced 22D construct with no independent evidence outside the framework.
Ethical Instability Index (EII) · no independent evidence
purpose: 4-component measure of decision deviation under perturbation
Newly defined index whose components are not externally validated.
semantic coherence constraint · no independent evidence
purpose: Novel validity rule within the 6 constraint classes
Introduced as part of the framework without prior literature support shown.

Reference graph

Works this paper leans on

53 extracted references · 9 canonical work pages · 6 internal anchors

[1]

Machine learning in medicine,

A. Rajkomar, J. Dean, and I. Kohane, “Machine learning in medicine,” New England Journal of Medicine, vol. 380, no. 14, pp. 1347–1358, 2019

2019
[2]

Autonomous vehicle safety: An interdis- ciplinary challenge,

P. Koopman and M. Wagner, “Autonomous vehicle safety: An interdis- ciplinary challenge,”IEEE Intelligent Transportation Systems Magazine, vol. 9, no. 1, pp. 90–96, 2017

2017
[3]

Mitigating bias in algorithmic hiring: Evaluating claims and practices,

M. Raghavan, S. Barocas, J. Kleinberg, and K. Levy, “Mitigating bias in algorithmic hiring: Evaluating claims and practices,” inProc. ACM FAT*, 2020, pp. 469–481

2020
[4]

Scharre,Army of None: Autonomous Weapons and the Future of War

P. Scharre,Army of None: Autonomous Weapons and the Future of War. New York, NY: W.W. Norton, 2018

2018
[5]

Dignum,Responsible Artificial Intelligence

V . Dignum,Responsible Artificial Intelligence. Cham, Switzerland: Springer, 2019

2019
[6]

Explaining and harnessing adversarial examples,

I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” inProc. ICLR, 2015

2015
[7]

Towards deep learning models resistant to adversarial attacks,

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” inProc. ICLR, 2018

2018
[8]

Adversarial Robustness Toolbox v1.0.0,

M.-I. Nicolae, M. Sinn, M. N. Tran, B. Buesser, A. Rawat, M. Wistuba, V . Zantedeschi, N. Baracaldo, B. Chen, H. Ludwig, I. M. Molloy, and B. Edwards, “Adversarial Robustness Toolbox v1.0.0,”arXiv preprint arXiv:1807.01069, 2018

work page arXiv 2018
[9]

Garak: Generative AI Red-teaming & Assessment Kit,

NVIDIA, “Garak: Generative AI Red-teaming & Assessment Kit,” NVIDIA AI Red Team, 2023. [Online]. Available: https://github.com/ NVIDIA/garak

2023
[10]

TrustLLM: Trustworthiness in large language models,

Y . Huang, L. Sun, H. Wang, S. Wu, Q. Zhang, Y . Liet al., “TrustLLM: Trustworthiness in large language models,” inProc. ICML, 2024

2024
[11]

Holistic evaluation of language models,

P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga et al., “Holistic evaluation of language models,”Transactions on Ma- chine Learning Research, 2023

2023
[12]

TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP,

J. Morris, E. Lifland, J. Yoo, J. Grigsby, D. Jin, and Y . Qi, “TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP,” inProc. EMNLP, 2020, pp. 119–126

2020
[13]

Adversarial policies: Attacking deep reinforcement learning,

A. Gleave, M. Dennis, C. Wild, N. Kant, S. Levine, and S. Russell, “Adversarial policies: Attacking deep reinforcement learning,” inProc. ICLR, 2020

2020
[14]

Towards evaluating the robustness of neural networks,

N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” inProc. IEEE S&P, 2017, pp. 39–57

2017
[15]

Russell,Human Compatible: Artificial Intelligence and the Problem of Control

S. Russell,Human Compatible: Artificial Intelligence and the Problem of Control. New York, NY: Viking, 2019

2019
[16]

Inverse reward design,

D. Hadfield-Menell, S. Milli, P. Abbeel, S. Russell, and A. Dragan, “Inverse reward design,” inProc. NeurIPS, 2017, pp. 6765–6774

2017
[17]

Constitutional AI: Harmlessness from AI Feedback

Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Joneset al., “Constitutional AI: Harmlessness from AI feedback,”arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin et al., “Training language models to follow instructions with human feedback,” inProc. NeurIPS, 2022, pp. 27730–27744

2022
[19]

Aligning AI with shared human values,

D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning AI with shared human values,” inProc. ICLR, 2021

2021
[20]

UL 3115: Outline of investigation for safety of AI-based products,

UL Solutions, “UL 3115: Outline of investigation for safety of AI-based products,” 2025

2025
[21]

ISO/IEC 22989:2022 Information technology – Artificial intelligence – Artificial intelligence concepts and terminology,

ISO/IEC, “ISO/IEC 22989:2022 Information technology – Artificial intelligence – Artificial intelligence concepts and terminology,” 2022

2022
[22]

ISO/IEC 23894:2023 Information technology – Artificial intelligence – Guidance on risk management,

ISO/IEC, “ISO/IEC 23894:2023 Information technology – Artificial intelligence – Guidance on risk management,” 2023

2023
[23]

Regulation (EU) 2024/1689 laying down har- monised rules on artificial intelligence (AI Act),

European Parliament, “Regulation (EU) 2024/1689 laying down har- monised rules on artificial intelligence (AI Act),”Official Journal of the European Union, 2024

2024
[24]

Rawls,A Theory of Justice

J. Rawls,A Theory of Justice. Cambridge, MA: Harvard University Press, 1971

1971
[25]

W. D. Ross,The Right and the Good. Oxford, UK: Clarendon Press, 1930

1930
[26]

Kant,Groundwork of the Metaphysics of Morals, M

I. Kant,Groundwork of the Metaphysics of Morals, M. Gregor, Trans. Cambridge, UK: Cambridge University Press, 1785/1998

1998
[27]

J. S. Mill,Utilitarianism. London, UK: Parker, Son, and Bourn, 1863
[28]

Sen,The Idea of Justice

A. Sen,The Idea of Justice. Cambridge, MA: Harvard University Press, 2009

2009
[29]

Nussbaum,Creating Capabilities: The Human Development Ap- proach

M. Nussbaum,Creating Capabilities: The Human Development Ap- proach. Cambridge, MA: Harvard University Press, 2011

2011
[30]

T. L. Beauchamp and J. F. Childress,Principles of Biomedical Ethics, 8th ed. New York, NY: Oxford University Press, 2019

2019
[31]

Bostrom,Superintelligence: Paths, Dangers, Strategies

N. Bostrom,Superintelligence: Paths, Dangers, Strategies. Oxford, UK: Oxford University Press, 2014

2014
[32]

Concrete Problems in AI Safety

D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,”arXiv preprint arXiv:1606.06565, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[33]

AI Safety Gridworlds

J. Leike, M. Martic, V . Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg, “AI safety gridworlds,”arXiv preprint arXiv:1711.09883, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Wild patterns: Ten years after the rise of adversarial machine learning,

B. Biggio and F. Roli, “Wild patterns: Ten years after the rise of adversarial machine learning,”Pattern Recognition, vol. 84, pp. 317– 331, 2018

2018
[35]

Robust physical-world attacks on deep learning visual classification,

K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, “Robust physical-world attacks on deep learning visual classification,” inProc. CVPR, 2018, pp. 1625– 1634

2018
[36]

Language models are few-shot learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal et al., “Language models are few-shot learners,” inProc. NeurIPS, 2020, pp. 1877–1901

2020
[37]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

S. Bubeck, V . Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Ka- mar, P. Lee, Y . T. Lee, Y . Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y . Zhang, “Sparks of artificial general intelligence: Early experiments with GPT-4,”arXiv preprint arXiv:2303.12712, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inProc. NeurIPS, 2022

2022
[39]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y . Bai, S. Kadavath et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,”arXiv preprint arXiv:2209.07858, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

Red teaming language models with language models,

E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red teaming language models with language models,” inProc. EMNLP, 2022

2022
[41]

What is data ethics?

L. Floridi and M. Taddeo, “What is data ethics?”Phil. Trans. Roy. Soc. A, vol. 374, no. 2083, 2016

2083
[42]

From what to how: An initial review of publicly available AI ethics tools,

J. Morley, L. Floridi, L. Kinsey, and A. Elhalal, “From what to how: An initial review of publicly available AI ethics tools,”Sci. Eng. Ethics, vol. 26, pp. 2141–2168, 2020

2020
[43]

Model cards for model reporting,

M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchin- son, E. Spitzer, I. D. Raji, and T. Gebru, “Model cards for model reporting,” inProc. ACM FAT*, 2019, pp. 220–229

2019
[44]

Datasheets for datasets,

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford, “Datasheets for datasets,”Commun. ACM, vol. 64, no. 12, pp. 86–92, 2021

2021
[45]

On the Opportunities and Risks of Foundation Models

R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arber, S. von Arx et al., “On the opportunities and risks of foundation models,”arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[46]

Artificial Intelligence Risk Management Framework (AI RMF 1.0),

National Institute of Standards and Technology, “Artificial Intelligence Risk Management Framework (AI RMF 1.0),” NIST AI 100-1, 2023

2023
[47]

IEEE 7000-2021: IEEE Standard Model Process for Addressing Ethical Concerns during System Design,

IEEE, “IEEE 7000-2021: IEEE Standard Model Process for Addressing Ethical Concerns during System Design,” IEEE Standards Association, 2021

2021
[48]

The global landscape of AI ethics guidelines,

A. Jobin, M. Ienca, and E. Vayena, “The global landscape of AI ethics guidelines,”Nature Machine Intelligence, vol. 1, pp. 389–399, 2019

2019
[49]

When to make exceptions: Exploring language models as accounts of human moral judgment,

Z. Jin, S. Levine, F. Gonzalez Adauto, O. Kamath, Y . Zheng, J. Sachan, and B. Schölkopf, “When to make exceptions: Exploring language models as accounts of human moral judgment,” inProc. NeurIPS, 2022

2022
[50]

Fine-tuning aligned language models compromises safety, even when users do not intend to,

X. Qi, Y . Zeng, T. Xie, P.-Y . Chen, R. Jia, P. Mittal, and P. Henderson, “Fine-tuning aligned language models compromises safety, even when users do not intend to,” inProc. ICLR, 2024

2024
[51]

Delphi: Towards machine ethics and norms,

L. Jiang, J. D. Hwang, C. Bhagavatula, R. Le Bras, J. Liang, J. Dodge, K. Sakaguchi, M. Forbes, J. Borchardt, S. Saber, N. Lourie, Y . Choi, and A. Farhadi, “Delphi: Towards machine ethics and norms,”arXiv preprint arXiv:2110.07574, 2021

work page arXiv 2021
[52]

You reap what you sow: On the challenges of bias evalu- ation under multilingual settings,

Z. Talat, H. Blix, J. Valvoda, M. I. Ganesh, R. Mankowitz, and A. Lauscher, “You reap what you sow: On the challenges of bias evalu- ation under multilingual settings,” inProc. ACL BigScience Workshop, 2022

2022
[53]

Survey on AI ethics: A socio-technical per- spective,

D. Mbiazi, M. Bhange, M. Babaei, I. Sheth, P. Kenfack, and S. Ebrahimi Kahou, “Survey on AI ethics: A socio-technical per- spective,”Computational Intelligence, vol. 41, no. 6, 2025. [Online]. Available: https://doi.org/10.1111/coin.70149

work page doi:10.1111/coin.70149 2025

[1] [1]

Machine learning in medicine,

A. Rajkomar, J. Dean, and I. Kohane, “Machine learning in medicine,” New England Journal of Medicine, vol. 380, no. 14, pp. 1347–1358, 2019

2019

[2] [2]

Autonomous vehicle safety: An interdis- ciplinary challenge,

P. Koopman and M. Wagner, “Autonomous vehicle safety: An interdis- ciplinary challenge,”IEEE Intelligent Transportation Systems Magazine, vol. 9, no. 1, pp. 90–96, 2017

2017

[3] [3]

Mitigating bias in algorithmic hiring: Evaluating claims and practices,

M. Raghavan, S. Barocas, J. Kleinberg, and K. Levy, “Mitigating bias in algorithmic hiring: Evaluating claims and practices,” inProc. ACM FAT*, 2020, pp. 469–481

2020

[4] [4]

Scharre,Army of None: Autonomous Weapons and the Future of War

P. Scharre,Army of None: Autonomous Weapons and the Future of War. New York, NY: W.W. Norton, 2018

2018

[5] [5]

Dignum,Responsible Artificial Intelligence

V . Dignum,Responsible Artificial Intelligence. Cham, Switzerland: Springer, 2019

2019

[6] [6]

Explaining and harnessing adversarial examples,

I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” inProc. ICLR, 2015

2015

[7] [7]

Towards deep learning models resistant to adversarial attacks,

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” inProc. ICLR, 2018

2018

[8] [8]

Adversarial Robustness Toolbox v1.0.0,

M.-I. Nicolae, M. Sinn, M. N. Tran, B. Buesser, A. Rawat, M. Wistuba, V . Zantedeschi, N. Baracaldo, B. Chen, H. Ludwig, I. M. Molloy, and B. Edwards, “Adversarial Robustness Toolbox v1.0.0,”arXiv preprint arXiv:1807.01069, 2018

work page arXiv 2018

[9] [9]

Garak: Generative AI Red-teaming & Assessment Kit,

NVIDIA, “Garak: Generative AI Red-teaming & Assessment Kit,” NVIDIA AI Red Team, 2023. [Online]. Available: https://github.com/ NVIDIA/garak

2023

[10] Y. Huang, L. Sun, H. Wang, S. Wu, Q. Zhang, Y. Li et al., “TrustLLM: Trustworthiness in large language models,” in Proc. ICML, 2024

[11] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga et al., “Holistic evaluation of language models,” Transactions on Machine Learning Research, 2023

[12] J. Morris, E. Lifland, J. Yoo, J. Grigsby, D. Jin, and Y. Qi, “TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP,” in Proc. EMNLP, 2020, pp. 119–126

[13] A. Gleave, M. Dennis, C. Wild, N. Kant, S. Levine, and S. Russell, “Adversarial policies: Attacking deep reinforcement learning,” in Proc. ICLR, 2020

[14] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in Proc. IEEE S&P, 2017, pp. 39–57

[15] S. Russell, Human Compatible: Artificial Intelligence and the Problem of Control. New York, NY: Viking, 2019

[16] D. Hadfield-Menell, S. Milli, P. Abbeel, S. Russell, and A. Dragan, “Inverse reward design,” in Proc. NeurIPS, 2017, pp. 6765–6774

[17] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022

[18] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin et al., “Training language models to follow instructions with human feedback,” in Proc. NeurIPS, 2022, pp. 27730–27744

[19] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning AI with shared human values,” in Proc. ICLR, 2021

[20] UL Solutions, “UL 3115: Outline of investigation for safety of AI-based products,” 2025

[21] ISO/IEC, “ISO/IEC 22989:2022 Information technology – Artificial intelligence – Artificial intelligence concepts and terminology,” 2022

[22] ISO/IEC, “ISO/IEC 23894:2023 Information technology – Artificial intelligence – Guidance on risk management,” 2023

[23] European Parliament, “Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act),” Official Journal of the European Union, 2024

[24] J. Rawls, A Theory of Justice. Cambridge, MA: Harvard University Press, 1971

[25] W. D. Ross, The Right and the Good. Oxford, UK: Clarendon Press, 1930

[26] I. Kant, Groundwork of the Metaphysics of Morals, M. Gregor, Trans. Cambridge, UK: Cambridge University Press, 1785/1998

[27] J. S. Mill, Utilitarianism. London, UK: Parker, Son, and Bourn, 1863

[28] A. Sen, The Idea of Justice. Cambridge, MA: Harvard University Press, 2009

[29] M. Nussbaum, Creating Capabilities: The Human Development Approach. Cambridge, MA: Harvard University Press, 2011

[30] T. L. Beauchamp and J. F. Childress, Principles of Biomedical Ethics, 8th ed. New York, NY: Oxford University Press, 2019

[31] N. Bostrom, Superintelligence: Paths, Dangers, Strategies. Oxford, UK: Oxford University Press, 2014

[32] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,” arXiv preprint arXiv:1606.06565, 2016

[33] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg, “AI safety gridworlds,” arXiv preprint arXiv:1711.09883, 2017

[34] B. Biggio and F. Roli, “Wild patterns: Ten years after the rise of adversarial machine learning,” Pattern Recognition, vol. 84, pp. 317–331, 2018

[35] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, “Robust physical-world attacks on deep learning visual classification,” in Proc. CVPR, 2018, pp. 1625–1634

[36] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal et al., “Language models are few-shot learners,” in Proc. NeurIPS, 2020, pp. 1877–1901

[37] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, “Sparks of artificial general intelligence: Early experiments with GPT-4,” arXiv preprint arXiv:2303.12712, 2023

[38] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proc. NeurIPS, 2022

[39] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” arXiv preprint arXiv:2209.07858, 2022

[40] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red teaming language models with language models,” in Proc. EMNLP, 2022

[41] L. Floridi and M. Taddeo, “What is data ethics?” Phil. Trans. Roy. Soc. A, vol. 374, no. 2083, 2016

[42] J. Morley, L. Floridi, L. Kinsey, and A. Elhalal, “From what to how: An initial review of publicly available AI ethics tools,” Sci. Eng. Ethics, vol. 26, pp. 2141–2168, 2020

[43] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model cards for model reporting,” in Proc. ACM FAT*, 2019, pp. 220–229

[44] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford, “Datasheets for datasets,” Commun. ACM, vol. 64, no. 12, pp. 86–92, 2021

[45] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arber, S. von Arx et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021

[46] National Institute of Standards and Technology, “Artificial Intelligence Risk Management Framework (AI RMF 1.0),” NIST AI 100-1, 2023

[47] IEEE, “IEEE 7000-2021: IEEE Standard Model Process for Addressing Ethical Concerns during System Design,” IEEE Standards Association, 2021

[48] A. Jobin, M. Ienca, and E. Vayena, “The global landscape of AI ethics guidelines,” Nature Machine Intelligence, vol. 1, pp. 389–399, 2019

[49] Z. Jin, S. Levine, F. Gonzalez Adauto, O. Kamath, Y. Zheng, J. Sachan, and B. Schölkopf, “When to make exceptions: Exploring language models as accounts of human moral judgment,” in Proc. NeurIPS, 2022

[50] X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, and P. Henderson, “Fine-tuning aligned language models compromises safety, even when users do not intend to,” in Proc. ICLR, 2024

[51] L. Jiang, J. D. Hwang, C. Bhagavatula, R. Le Bras, J. Liang, J. Dodge, K. Sakaguchi, M. Forbes, J. Borchardt, S. Saber, N. Lourie, Y. Choi, and A. Farhadi, “Delphi: Towards machine ethics and norms,” arXiv preprint arXiv:2110.07574, 2021

[52] Z. Talat, H. Blix, J. Valvoda, M. I. Ganesh, R. Mankowitz, and A. Lauscher, “You reap what you sow: On the challenges of bias evaluation under multilingual settings,” in Proc. ACL BigScience Workshop, 2022

[53] D. Mbiazi, M. Bhange, M. Babaei, I. Sheth, P. Kenfack, and S. Ebrahimi Kahou, “Survey on AI ethics: A socio-technical perspective,” Computational Intelligence, vol. 41, no. 6, 2025. [Online]. Available: https://doi.org/10.1111/coin.70149