arxiv: 2512.04246 · v2 · submitted 2025-12-03 · 💻 cs.AI

Toward Virtuous Reinforcement Learning: A Critique and Roadmap

Pith reviewed 2026-05-17 01:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords virtuous reinforcement learning · machine ethics · reinforcement learning · virtue ethics · multi-agent RL · ethical AI · policy dispositions · moral trade-offs

The pith

Reinforcement learning should treat ethics as stable policy-level habits rather than rules or scalar rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that rule-based methods in ethical RL often break under ambiguity and nonstationarity while failing to build lasting habits. Reward-based approaches, especially single-objective ones, tend to hide moral trade-offs and encourage proxy gaming. In response the authors propose evaluating agents on durable dispositions that hold across changing incentives and contexts. They outline a roadmap that combines social learning from exemplars, multi-objective optimization to keep value conflicts visible, affinity regularization for trait stability, and explicit operationalization of ethical traditions to surface cultural assumptions.

Core claim

Common patterns in machine ethics for RL either encode duties as constraints that struggle with nonstationarity or compress diverse values into single rewards that obscure trade-offs; instead, ethics should be treated as policy-level dispositions—relatively stable habits that persist when incentives, partners, or contexts change—supported by a roadmap of social learning in multi-agent RL, multi-objective and constrained formulations, affinity-based regularization, and operationalizing ethical traditions as practical control signals.

What carries the argument

Policy-level dispositions, defined as relatively stable habits that hold up when incentives or contexts change and implemented via the four-component roadmap of social learning, multi-objective optimization, affinity regularization, and explicit ethical traditions.

If this is right

  • Ethical evaluation moves beyond rule compliance or scalar returns to trait summaries, durability under interventions, and explicit reporting of moral trade-offs.
  • Agents acquire virtue-like patterns through social learning from imperfect but normatively informed exemplars in multi-agent settings.
  • Value conflicts remain visible and are managed by multi-objective formulations and risk-aware criteria that guard against harm.
  • Affinity-based regularization supports trait stability under distribution shift while allowing norms to evolve over time.
  • Benchmarks for ethical RL must make value and cultural assumptions explicit rather than leaving them implicit in reward design.
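One way to read the point about keeping value conflicts visible is to compare a vector-valued report with a scalarized one. The following is a minimal sketch under assumed toy numbers, not a method taken from the paper: the candidate policies, the two objectives (task return, harm avoidance), and the helper names `dominates`, `pareto_front`, and `scalarized_best` are all hypothetical.

```python
# Toy illustration: each candidate policy is summarized by a vector of
# (task_return, harm_avoidance). All names and numbers are hypothetical.
CANDIDATES = {
    "A": (10.0, 0.2),  # high task return, poor harm avoidance
    "B": (7.0, 0.9),   # lower task return, strong harm avoidance
    "C": (6.0, 0.5),   # worse than B on both objectives
}

def dominates(a, b):
    """True if vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return {
        name: vec
        for name, vec in candidates.items()
        if not any(dominates(other, vec)
                   for other_name, other in candidates.items()
                   if other_name != name)
    }

def scalarized_best(candidates, weights):
    """Collapse each vector to a weighted sum and return the single winner."""
    return max(candidates,
               key=lambda n: sum(w * v for w, v in zip(weights, candidates[n])))

front = pareto_front(CANDIDATES)                 # reports the A-versus-B conflict
winner = scalarized_best(CANDIDATES, (1.0, 1.0)) # reports only one name
```

In this toy case the front retains both A and B, so the conflict between task return and harm avoidance stays on the page, whereas the weighted sum returns a single name and the trade-off disappears into the choice of weights.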

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framing could reduce the risk of agents learning brittle ethical shortcuts that collapse outside narrow training distributions.
  • The emphasis on reporting moral trade-offs may support more transparent auditing of deployed RL systems in domains such as healthcare or autonomous vehicles.
  • Operationalizing multiple ethical traditions side by side could surface practical methods for handling value pluralism in global AI governance.
  • Testing the roadmap in long-horizon multi-agent environments might reveal whether virtue-like stability emerges more reliably than from purely reward-shaped baselines.

Load-bearing premise

The four proposed components can be combined to produce stable virtue-like behavior without introducing new forms of ambiguity, implementation difficulty, or cultural bias that undermine the approach.

What would settle it

A controlled experiment in which agents trained via the four-component roadmap fail to maintain consistent ethical behavior or exhibit increased proxy gaming when incentives or partner behaviors are shifted after training.
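Such an experiment needs some measure of whether behavior survives the shift. The sketch below shows one possible measure, assuming a policy can be queried on fixed probe states under two incentive settings; the function `retention`, the toy policies, and the incentive dictionaries are hypothetical and are not defined in the paper.

```python
# Hypothetical durability probe: compare a policy's choices on the same
# probe states before and after an incentive flip.

def retention(policy, probe_states, baseline, shifted):
    """Fraction of probe states where the chosen action is unchanged by the shift."""
    unchanged = sum(
        policy(state, baseline) == policy(state, shifted)
        for state in probe_states
    )
    return unchanged / len(probe_states)

def disposition_policy(state, incentives):
    # Toy stand-in for a trained agent with a stable habit: ignores incentives.
    return "share"

def incentive_chasing_policy(state, incentives):
    # Toy stand-in for a reward-shaped agent: follows whichever action pays more.
    return max(incentives, key=incentives.get)

PROBE_STATES = [0, 1, 2, 3]
BASELINE = {"share": 1.0, "hoard": 0.5}
FLIPPED = {"share": 0.5, "hoard": 1.0}   # incentive-flip intervention

stable_score = retention(disposition_policy, PROBE_STATES, BASELINE, FLIPPED)
chasing_score = retention(incentive_chasing_policy, PROBE_STATES, BASELINE, FLIPPED)
```

A retention score near 1 after the flip would be consistent with a durable disposition, while a sharp drop would count as evidence against it. A real protocol would also need partner-swap interventions and a check that unchanged behavior is actually the ethically appropriate behavior.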

Figures

Figures reproduced from arXiv: 2512.04246 by Majid Ghasemi, Mark Crowley.

Figure 1: Policy orchestration for ethical behavior. An orchestrator layer first enforces non-negotiable constraints (deontic guard), then selects between a virtue-oriented policy (V) and a utilitarian policy (U) based on context, with a safe fallback. This composition preserves hard safety while enabling context-sensitive trade-offs, improving disposition retention under partner-swap and incentive-flip intervention…
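The composition described in the caption can be sketched in a few lines. This is an illustrative reading only, not the authors' implementation: the function `orchestrate`, the guard, the context selector, and the toy policies are hypothetical, and the sketch assumes the fallback action is permitted by construction.

```python
from typing import Callable, Sequence

State = dict
Action = str
Policy = Callable[[State, Sequence[Action]], Action]

def orchestrate(
    state: State,
    actions: Sequence[Action],
    is_permitted: Callable[[State, Action], bool],  # deontic guard (hypothetical)
    virtue_policy: Policy,                          # policy V
    utilitarian_policy: Policy,                     # policy U
    prefers_virtue: Callable[[State], bool],        # context selector (hypothetical)
    safe_fallback: Action,                          # assumed always permitted
) -> Action:
    # Step 1: the guard removes every action that violates a hard constraint.
    allowed = [a for a in actions if is_permitted(state, a)]
    if not allowed:
        return safe_fallback
    # Step 2: choose between V and U based on context.
    policy = virtue_policy if prefers_virtue(state) else utilitarian_policy
    choice = policy(state, allowed)
    # Step 3: fall back if the selected policy proposes something outside the guard.
    return choice if choice in allowed else safe_fallback

# Toy components, purely for illustration.
def guard(state, action):
    return action != "harm"

def policy_v(state, allowed):
    return "help" if "help" in allowed else allowed[0]

def policy_u(state, allowed):
    return "harvest" if "harvest" in allowed else allowed[0]

def selector(state):
    return state.get("partner_in_need", False)

ACTIONS = ["help", "harvest", "harm"]
picked = orchestrate({"partner_in_need": True}, ACTIONS,
                     guard, policy_v, policy_u, selector, "wait")
```

Because the guard runs before either policy is consulted, neither V nor U can select a forbidden action, which is one plausible reading of how the composition "preserves hard safety".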
Abstract

This paper critiques common patterns in machine ethics for Reinforcement Learning (RL) and argues for a virtue-focused alternative. We highlight two recurring limitations in much of the current literature: (i) rule-based (deontological) methods that encode duties as constraints or shields often struggle under ambiguity and nonstationarity and do not cultivate lasting habits, and (ii) many reward-based approaches, especially single-objective RL, implicitly compress diverse moral considerations into a single scalar signal, which can obscure trade-offs and invite proxy gaming in practice. We instead treat ethics as policy-level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change. This shifts evaluation beyond rule checks or scalar returns toward trait summaries, durability under interventions, and explicit reporting of moral trade-offs. Our roadmap combines four components: (1) social learning in multi-agent RL to acquire virtue-like patterns from imperfect but normatively informed exemplars; (2) multi-objective and constrained formulations that preserve value conflicts and incorporate risk-aware criteria to guard against harm; (3) affinity-based regularization toward updateable virtue priors that support trait-like stability under distribution shift while allowing norms to evolve; and (4) operationalizing diverse ethical traditions as practical control signals, making explicit the value and cultural assumptions that shape ethical RL benchmarks.
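Component (3) is stated without a formula, so the following is only one plausible reading. It assumes the regularizer is a KL divergence between the policy's action distribution and a "virtue prior", added as an extra objective alongside the existing return vector. The functions `kl_divergence` and `with_affinity_term`, the coefficient `lam`, and all numbers are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def with_affinity_term(returns, policy_probs, prior_probs, lam):
    """Append a negative KL penalty to a vector of objective returns.

    Keeping the penalty as its own component, rather than folding it into
    a single scalar, leaves the other objectives individually reportable.
    """
    penalty = lam * kl_divergence(policy_probs, prior_probs)
    return tuple(returns) + (-penalty,)

VIRTUE_PRIOR = (0.5, 0.5)   # hypothetical prior over two actions
RETURNS = (3.0, 0.8)        # hypothetical (task return, harm avoidance)

aligned = with_affinity_term(RETURNS, (0.5, 0.5), VIRTUE_PRIOR, lam=0.5)
drifted = with_affinity_term(RETURNS, (1.0, 0.0), VIRTUE_PRIOR, lam=0.5)
```

A policy that matches the prior pays no penalty, while one that collapses onto a single action pays `lam * log(2)` in this two-action case. Making the prior "updateable", as the abstract requires, would mean revising `VIRTUE_PRIOR` over time, which this sketch does not attempt.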

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper critiques rule-based (deontological) and scalar-reward approaches in RL-based machine ethics for struggling with ambiguity, non-stationarity, and obscuring moral trade-offs. It proposes treating ethics as stable policy-level dispositions (virtues) evaluable via trait summaries, durability under interventions, and explicit trade-off reporting, and outlines a four-component roadmap: (1) social learning in multi-agent RL from normatively informed exemplars, (2) multi-objective/constrained RL with risk-aware criteria, (3) affinity-based regularization for updateable virtue priors, and (4) operationalizing diverse ethical traditions as control signals.

Significance. If the proposed components can be integrated to deliver stable, context-adaptive ethical behavior without new ambiguities, the work could meaningfully shift RL ethics research toward frameworks that preserve value pluralism and support norm evolution. The conceptual critique of existing limitations is plausible and draws on established philosophical distinctions, but the absence of formalization, interaction analysis, or feasibility arguments means the primary contribution is directional rather than immediately enabling new implementations.

major comments (2)
  1. [Roadmap section (components 1–4)] The manuscript presents the four components as a combined solution but supplies no analysis of compatibility, conflict resolution, or interaction effects. For instance, it is unclear how affinity-based regularization would enforce trait stability under distribution shift while multi-objective optimization simultaneously preserves explicit trade-offs, or how social learning from exemplars would avoid proxy behaviors in non-stationary environments.
  2. [Central claim on policy-level dispositions] The shift from rule checks or scalar returns to evaluation via 'trait summaries' and 'durability under interventions' is load-bearing for the virtue-ethics alternative, yet the paper provides no concrete metrics, intervention protocols, or RL-specific operationalization for assessing durability, leaving the claim without a clear path to implementation or falsification.
minor comments (1)
  1. [Abstract and Roadmap] The abstract and roadmap description use terms such as 'normatively informed exemplars' and 'updateable virtue priors' without providing working definitions or references to how these would be formalized in an RL setting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful report. The comments correctly identify areas where the manuscript, as a conceptual critique and high-level roadmap, would benefit from greater specificity on component interactions and operationalization. We address each major comment below and commit to revisions that strengthen these aspects without altering the paper's directional focus.

read point-by-point responses
  1. Referee: Roadmap section (components 1–4): the manuscript presents the four components as a combined solution but supplies no analysis of compatibility, conflict resolution, or interaction effects. For instance, it is unclear how affinity-based regularization would enforce trait stability under distribution shift while multi-objective optimization simultaneously preserves explicit trade-offs, or how social learning from exemplars would avoid proxy behaviors in non-stationary environments.

    Authors: We agree that the manuscript does not include a dedicated analysis of interactions among the four components. This omission stems from the paper's framing as an initial roadmap rather than a complete architecture. In revision we will add a dedicated subsection that outlines plausible integration strategies and potential tensions. For example, we will describe how affinity-based regularization can serve as a stability-inducing term within a multi-objective formulation, how social learning from exemplars can be followed by constrained optimization to reduce proxy risks, and how explicit trade-off reporting can be preserved across components. We will also note open questions regarding non-stationarity that future work would need to resolve.

    revision: yes

  2. Referee: Central claim on policy-level dispositions: the shift from rule checks or scalar returns to evaluation via 'trait summaries' and 'durability under interventions' is load-bearing for the virtue-ethics alternative, yet the paper provides no concrete metrics, intervention protocols, or RL-specific operationalization for assessing durability, leaving the claim without a clear path to implementation or falsification.

    Authors: The manuscript presents the evaluation shift at a conceptual level to highlight the distinction from existing approaches. We acknowledge that this leaves the central claim without immediate implementation details. In the revised version we will expand the relevant section to propose concrete evaluation directions, including trait-summary statistics computed over context distributions, intervention protocols adapted from robustness testing in RL (e.g., policy perturbation under changed reward or transition dynamics), and references to existing multi-agent and constrained RL literature that could support falsifiable tests. These additions will provide clearer next steps while preserving the paper's emphasis on the underlying philosophical motivation.

    revision: yes
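The "trait-summary statistics computed over context distributions" mentioned in the second response are not specified further. One simple form they could take, assuming behavior is logged as (context, action) pairs, is sketched below; the function `trait_summary`, the context labels, and the log itself are hypothetical.

```python
from collections import defaultdict

def trait_summary(log, trait_action):
    """Summarize how often a trait-relevant action appears in each context.

    log: iterable of (context, action) pairs from evaluation episodes.
    Returns per-context rates plus their mean and worst case.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for context, action in log:
        totals[context] += 1
        if action == trait_action:
            hits[context] += 1
    rates = {c: hits[c] / totals[c] for c in totals}
    values = list(rates.values())
    return {
        "per_context": rates,
        "mean": sum(values) / len(values),
        "worst": min(values),   # a trait that fails in one context shows up here
    }

# Hypothetical evaluation log across two contexts.
LOG = [
    ("scarcity", "share"), ("scarcity", "hoard"),
    ("abundance", "share"), ("abundance", "share"),
]
summary = trait_summary(LOG, trait_action="share")
```

Reporting the worst-case context next to the mean is a deliberate choice in this sketch: an average can look healthy while the trait collapses under one particular condition, which is the kind of brittleness the disposition framing is meant to expose.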

Circularity Check

0 steps flagged

No significant circularity; conceptual roadmap without derivations or self-referential reductions.

full rationale

This is a position paper critiquing rule-based and reward-based machine ethics in RL and proposing a virtue-oriented alternative via four conceptual components. The provided text contains no equations, fitted parameters, derivations, or mathematical predictions. All load-bearing claims rest on philosophical distinctions (e.g., treating ethics as stable policy-level dispositions) and literature patterns rather than any self-citation chain, ansatz smuggling, or renaming of known results that reduces to the paper's own inputs by construction. The central roadmap is presented as a synthesis of independent ideas; no step is shown to be equivalent to its inputs via definition or fit. This matches the default expectation for non-circular papers and aligns with the reader's assessment of score 1.0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central proposal rests on the domain assumption that virtue ethics provides a superior modeling choice for RL ethics compared with rules or rewards; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Ethics in RL is best captured as stable policy-level dispositions rather than encoded rules or scalar rewards.
    This premise is stated directly in the abstract as the alternative to the two critiqued patterns.



