arxiv: 2606.13282 · v1 · pith:UUBPP7IInew · submitted 2026-06-11 · 💻 cs.AI

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

Pith reviewed 2026-06-27 06:42 UTC · model grok-4.3

classification 💻 cs.AI
keywords ethical robustness · adversarial testing · AI ethics · consequence space · semantic perturbation · LLM evaluation · ethical instability · pre-deployment assessment

The pith

ERTS tests AI ethical robustness by perturbing dilemmas in a 22-dimensional consequence space and finds most models unstable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Ethical Robustness Testing System to check whether AI systems keep their ethical decisions stable when scenarios receive small semantic changes. It maps dilemmas into a 22-dimensional Ethical Consequence Space drawn from ethical theory, then applies 17 perturbation functions that must satisfy six validity rules, and finally scores how far the AI's output drifts using a four-part instability index. When the system runs 1500 test cases on four baseline models and two production LLMs across eight domains, only one third of the models meet the clearance threshold. A reader would care because AI is already being placed in healthcare, hiring, and vehicle control where shifting ethical judgments can produce real harm. The framework supplies a repeatable pipeline that turns ethical evaluation into a measurable engineering task.

Core claim

The paper claims that ERTS provides a closed-pipeline framework that encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space grounded in established ethical theory, applies 17 semantic perturbation functions subject to six validity constraint classes including a novel semantic coherence constraint, measures decision deviation via a four-component Ethical Instability Index, and produces domain-adaptive pre-deployment robustness assessment verdicts. Evaluation of four structured baseline models and two production LLMs across fifty ethical scenarios spanning eight deployment domains, generating fifteen hundred adversarial test cases, shows that only thirty-three percent of models…

What carries the argument

The 22-dimensional Ethical Consequence Space together with the seventeen semantic perturbation functions and six validity constraint classes, which together generate controlled adversarial ethical scenarios and quantify decision instability.
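To make the shape of that machinery concrete, here is a minimal sketch of a perturb, validate, and score loop over a 22-dimensional vector. Only the dimensionality comes from the paper. The unit-interval bounds, `shift_dimension`, `is_valid`, `flip_rate`, and `toy_model` are hypothetical stand-ins, and the single flip rate computed here is a simplification, not the paper's four-component Ethical Instability Index.

```python
from typing import Callable, List, Optional

DIMENSIONS = 22  # size of the consequence space reported in the paper
Vector = List[float]
Model = Callable[[Vector], str]
Perturbation = Callable[[Vector], Vector]


def shift_dimension(index: int, delta: float) -> Perturbation:
    """Hypothetical perturbation: move one coordinate, clamped to [0, 1]."""
    def perturb(v: Vector) -> Vector:
        out = list(v)
        out[index] = max(0.0, min(1.0, out[index] + delta))
        return out
    return perturb


def is_valid(original: Vector, perturbed: Vector, max_l1: float = 0.25) -> bool:
    """Hypothetical stand-in for the validity constraints: stay in bounds
    and stay within an L1 distance budget of the original scenario."""
    in_bounds = all(0.0 <= x <= 1.0 for x in perturbed)
    distance = sum(abs(a - b) for a, b in zip(original, perturbed))
    return in_bounds and distance <= max_l1


def flip_rate(model: Model, scenario: Vector,
              perturbations: List[Perturbation]) -> Optional[float]:
    """Fraction of valid perturbed scenarios on which the decision changes.
    Returns None when no perturbation passes the validity check."""
    base = model(scenario)
    candidates = [p(scenario) for p in perturbations]
    valid = [v for v in candidates if is_valid(scenario, v)]
    if not valid:
        return None
    flips = sum(1 for v in valid if model(v) != base)
    return flips / len(valid)


def toy_model(v: Vector) -> str:
    """Toy decision rule that only looks at the first coordinate."""
    return "act" if v[0] >= 0.5 else "refrain"


scenario = [0.5] * DIMENSIONS
perturbations = [
    shift_dimension(0, -0.1),  # valid, flips the toy decision
    shift_dimension(1, 0.1),   # valid, no effect on the toy decision
    shift_dimension(2, 0.1),   # valid, no effect
    shift_dimension(3, -0.1),  # valid, no effect
    shift_dimension(0, -0.5),  # rejected: exceeds the distance budget
]
score = flip_rate(toy_model, scenario, perturbations)
print(score)  # 0.25
```

The validity filter runs before scoring, so a rejected perturbation never counts toward instability. That ordering is why the load-bearing premise matters: a filter that is too loose or too strict changes the score without the model changing at all.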

If this is right

  • Only thirty-three percent of the evaluated models would receive clearance for deployment in ethical decision domains.
  • The local Llama-3.2 model would require targeted fixes for fairness corruption and information degradation vulnerabilities.
  • The system supports domain-adaptive verdicts that can be applied separately to healthcare, employment screening, and other fields.
  • Pre-deployment use of the pipeline identifies specific attack types that destabilize ethical reasoning.
  • Results indicate that production LLMs need additional safeguards to maintain consistency under semantic changes.
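The clearance figure in the first bullet can be cross-checked against the counts reported above. The sketch below assumes the percentage is taken over all six evaluated models (four baselines plus two LLMs), which the summary implies but does not state outright. The variable names are illustrative.

```python
from fractions import Fraction

structured_baselines = 4
production_llms = 2
total_models = structured_baselines + production_llms  # 6 models in total

# Which whole-number counts of cleared models round to 33 percent?
matching = [
    k for k in range(total_models + 1)
    if round(100 * Fraction(k, total_models)) == 33
]
print(matching)  # [2]

# Average number of adversarial test cases per scenario.
scenarios = 50
test_cases = 1500
cases_per_scenario = Fraction(test_cases, scenarios)
print(cases_per_scenario)  # 30
```

Under that assumption the headline corresponds to two cleared models out of six. The 30 cases per scenario is only an average; how they split across models and perturbation functions is not given in the material summarized here.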

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could add the perturbation functions to training loops to increase stability of ethical outputs.
  • Regulatory audits might adopt similar bounded spaces to certify AI systems for high-stakes use.
  • The method could be combined with factual robustness tests to produce joint safety scores.
  • Extending the space to additional ethical theories would allow comparison of model behavior across different moral frameworks.

Load-bearing premise

The 22-dimensional Ethical Consequence Space and the seventeen perturbation functions, subject to the six validity constraints, capture ethical reasoning without introducing artifacts that invalidate the instability measurements.

What would settle it

If human experts facing the same perturbed scenarios produce decision shifts that fail to correlate with the models' Ethical Instability Index scores, the framework would not be measuring the intended form of ethical instability.
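One way such a check could be run is a rank correlation between human decision shifts and index scores on the same perturbed scenarios. The sketch below implements Spearman's rho in plain Python. The choice of Spearman is an assumption made here, not something the paper specifies, and the two lists are placeholder values for illustration only, not data from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values, one pair per perturbed scenario (not from the paper).
human_shift = [0.1, 0.4, 0.2, 0.8, 0.6]
index_score = [0.2, 0.5, 0.3, 0.9, 0.7]
rho = spearman(human_shift, index_score)
print(rho)
```

A rho near zero on real expert data would be the failure signal described above. A rank-based measure is used because the index and human shift scales need not be linearly related.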

Original abstract

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in ethical theory, applies 17 semantic perturbation functions under 6 validity constraint classes (including a novel semantic coherence constraint), measures decision deviation via a 4-component Ethical Instability Index (EII), and produces domain-adaptive pre-deployment robustness verdicts. It evaluates 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) on 50 ethical scenarios across 8 domains, generating 1500 adversarial test cases, and reports that only 33% of models achieve assessment clearance, with Llama-3.2 particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737).

Significance. If the 22D ECS and perturbation functions can be shown to test ethical reasoning without introducing unvalidated artifacts, the framework would provide a structured, domain-adaptive approach to adversarial testing of ethical AI that is currently underdeveloped. The evaluation across multiple models and domains, combined with the introduction of semantic coherence constraints, offers a concrete pipeline that could inform pre-deployment assessments in high-stakes areas like healthcare and autonomous systems. The attempt to ground the space in ethical theory and generate a large number of test cases (1500) is a positive step toward reproducible robustness metrics.

major comments (3)
  1. [§3 (Ethical Consequence Space)] The 22 dimensions are described as grounded in established ethical theory, but the manuscript provides no explicit selection criteria, mapping to specific theories, or validation (e.g., expert review or sensitivity analysis) that the bounded space captures relevant ethical nuances without omission or distortion. This directly affects whether EII measurements reflect genuine model instability rather than framework-induced effects.
  2. [§4 (Semantic Perturbation Functions)] The 17 perturbation functions subject to 6 validity constraint classes, including the novel semantic coherence constraint, lack any empirical demonstration that they preserve the original ethical dilemma structure (e.g., no human evaluation of coherence preservation or comparison of EII on perturbed vs. unperturbed cases). Without this, the reported 33% clearance rate and model comparisons may be artifacts of the chosen perturbations and constraints rather than indicators of robustness.
  3. [§5 (Experimental Results)] The headline metrics (33% clearance rate, Llama-3.2 ERS = 0.737) are stated without statistical tests, error bars, ablation on framework parameters (e.g., dimension count or constraint enforcement), or external benchmarks, making it impossible to assess whether differences between structured baselines and production LLMs are significant or reproducible.
minor comments (2)
  1. [Abstract] The abstract claims novelty for combining bounded consequence space, semantic coherence constraints, and domain-adaptive assessment but does not cite or compare against prior work on ethical AI evaluation frameworks.
  2. [§3.3 (Ethical Instability Index)] Notation for the four components of the Ethical Instability Index is introduced without a clear equation or table defining each component's computation.
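Major comment 3 asks for error bars on the headline clearance rate. As a rough illustration of why that matters, the sketch below computes a Wilson score interval for a proportion. It assumes the 33% figure means 2 cleared models out of 6 and treats each model as an independent trial, which is a simplification. The Wilson interval is a standard technique chosen here for illustration, not a method taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.
    z = 1.96 gives an approximate 95 percent interval."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed reading of the headline: 2 of 6 models cleared.
low, high = wilson_interval(2, 6)
print(f"{low:.2f} to {high:.2f}")
```

With only six models the interval spans roughly 10% to 70%, so the clearance rate by itself says little about the wider population of models.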

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's constructive feedback on the manuscript and provide point-by-point responses to the major comments below.

  1. Referee: [§3 (Ethical Consequence Space)] The 22 dimensions are described as grounded in established ethical theory, but the manuscript provides no explicit selection criteria, mapping to specific theories, or validation (e.g., expert review or sensitivity analysis) that the bounded space captures relevant ethical nuances without omission or distortion. This directly affects whether EII measurements reflect genuine model instability rather than framework-induced effects.

    Authors: We agree that the manuscript would benefit from greater explicitness on dimension selection. The 22 dimensions were chosen to cover core consequence types from utilitarianism, deontology, and virtue ethics as discussed in ethical AI literature. In the revised version we will add a dedicated subsection and mapping table in §3 detailing the theoretical basis and selection rationale for each dimension. This clarification will help demonstrate that EII reflects behavior within a motivated space rather than arbitrary artifacts.
    revision: yes

  2. Referee: [§4 (Semantic Perturbation Functions)] The 17 perturbation functions subject to 6 validity constraint classes, including the novel semantic coherence constraint, lack any empirical demonstration that they preserve the original ethical dilemma structure (e.g., no human evaluation of coherence preservation or comparison of EII on perturbed vs. unperturbed cases). Without this, the reported 33% clearance rate and model comparisons may be artifacts of the chosen perturbations and constraints rather than indicators of robustness.

    Authors: The semantic coherence constraint is designed to maintain dilemma integrity, but we acknowledge the lack of empirical checks such as human ratings or EII comparisons. We will revise §4 to include qualitative examples of preserved structure and, where feasible, a limited human coherence assessment in an appendix. A comprehensive study across all cases exceeds current scope, so this constitutes a partial response.
    revision: partial

  3. Referee: [§5 (Experimental Results)] The headline metrics (33% clearance rate, Llama-3.2 ERS = 0.737) are stated without statistical tests, error bars, ablation on framework parameters (e.g., dimension count or constraint enforcement), or external benchmarks, making it impossible to assess whether differences between structured baselines and production LLMs are significant or reproducible.

    Authors: We accept the need for stronger statistical support. The revised §5 will add significance tests for model differences, error bars or intervals on key metrics, and ablations on dimension count and constraint enforcement. External benchmarks are limited by the framework's novelty; we will discuss this limitation explicitly and reference related evaluation approaches.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces ERTS as a new closed-pipeline framework whose core components—the 22-dimensional ECS grounded in established ethical theory, the 17 semantic perturbation functions under 6 validity constraint classes, and the 4-component EII—are presented as definitional elements of the testing system rather than derived from one another. The reported results (33% clearance rate, ERS=0.737) are empirical outcomes of applying these components to the evaluated models across 1,500 test cases. No equations or steps in the provided abstract reduce a claimed prediction or result to a fitted parameter or self-citation by construction, and no load-bearing self-citation or uniqueness theorem from prior author work is invoked. The derivation is therefore self-contained against the stated external grounding in ethical theory.

Axiom & Free-Parameter Ledger

4 free parameters · 2 axioms · 3 invented entities

The central claim rests on several design choices and domain assumptions introduced in the abstract without further justification or external evidence. The 22D space, perturbation counts, and index components are treated as given rather than derived.

free parameters (4)
  • 22 dimensions of Ethical Consequence Space
    Specific dimensionality chosen to ground in ethical theory; no derivation or sensitivity analysis provided.
  • 17 semantic perturbation functions
    Exact number and definitions are framework design parameters.
  • 6 validity constraint classes
    Includes novel semantic coherence constraint; selection is ad hoc to the system.
  • 4-component Ethical Instability Index
    Component definitions and weighting are internal to the framework.
axioms (2)
  • domain assumption: Ethical dilemmas can be faithfully encoded into a 22-dimensional space grounded in established ethical theory
    Invoked as the foundation of the ECS without proof or external validation in the abstract.
  • domain assumption: Semantic perturbations under the 6 constraint classes (including semantic coherence) preserve the ethical character of the original dilemma
    Required for the perturbations to be valid tests; assumed rather than demonstrated.
invented entities (3)
  • Ethical Consequence Space (ECS) · no independent evidence
    purpose: Bounded representation of ethical dilemmas for systematic perturbation
    Newly introduced 22D construct with no independent evidence outside the framework.
  • Ethical Instability Index (EII) · no independent evidence
    purpose: 4-component measure of decision deviation under perturbation
    Newly defined index whose components are not externally validated.
  • semantic coherence constraint · no independent evidence
    purpose: Novel validity rule within the 6 constraint classes
    Introduced as part of the framework without prior literature support shown.
