Toward Virtuous Reinforcement Learning: A Critique and Roadmap
Pith reviewed 2026-05-17 01:52 UTC · model grok-4.3
The pith
Reinforcement learning should treat ethics as stable policy-level habits rather than rules or scalar rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Common patterns in machine ethics for RL either encode duties as constraints, which struggle with nonstationarity, or compress diverse values into a single reward, which obscures trade-offs. Ethics should instead be treated as policy-level dispositions: relatively stable habits that persist when incentives, partners, or contexts change. The paper supports this with a roadmap of social learning in multi-agent RL, multi-objective and constrained formulations, affinity-based regularization, and operationalizing ethical traditions as practical control signals.
What carries the argument
Policy-level dispositions, defined as relatively stable habits that hold up when incentives or contexts change and implemented via the four-component roadmap of social learning, multi-objective optimization, affinity regularization, and explicit ethical traditions.
If this is right
- Ethical evaluation moves beyond rule compliance or scalar returns to trait summaries, durability under interventions, and explicit reporting of moral trade-offs.
- Agents acquire virtue-like patterns through social learning from imperfect but normatively informed exemplars in multi-agent settings.
- Value conflicts remain visible and are managed by multi-objective formulations and risk-aware criteria that guard against harm.
- Affinity-based regularization supports trait stability under distribution shift while allowing norms to evolve over time.
- Benchmarks for ethical RL must make value and cultural assumptions explicit rather than leaving them implicit in reward design.
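To make the point about visible value conflicts concrete, here is a minimal sketch in Python. It assumes each candidate policy has already been evaluated on three objectives; the policy names, the objective labels, and all numbers are hypothetical and do not come from the paper. The sketch contrasts a Pareto front, which keeps every non-dominated trade-off on the table, with a weighted sum, which returns a single winner and hides the alternatives.

```python
from typing import Dict, List, Tuple

Returns = Tuple[float, ...]  # one entry per objective, higher is better


def dominates(a: Returns, b: Returns) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Returns]) -> List[str]:
    """Names of candidates that no other candidate dominates, sorted alphabetically."""
    return sorted(
        name
        for name, returns in candidates.items()
        if not any(
            dominates(other, returns)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )


def scalarize(returns: Returns, weights: Returns) -> float:
    """Weighted sum of objectives: the single-scalar view of the same data."""
    return sum(w * x for w, x in zip(weights, returns))


# Hypothetical per-policy returns: (task reward, harm avoidance, fairness).
policies = {
    "greedy": (10.0, 2.0, 3.0),
    "cautious": (6.0, 9.0, 5.0),
    "balanced": (8.0, 7.0, 6.0),
    "sloppy": (5.0, 2.0, 3.0),
}

front = pareto_front(policies)
task_heavy_weights = (1.0, 0.1, 0.1)  # an arbitrary weighting chosen for illustration
best_scalar = max(policies, key=lambda name: scalarize(policies[name], task_heavy_weights))

print("Pareto front:", front)
print("Weighted-sum winner:", best_scalar)
```

With these made-up numbers, three policies survive on the front while the task-heavy weighting selects only one of them. Reporting the front rather than the single winner is one simple way to keep the trade-off explicit.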
Where Pith is reading between the lines
- This framing could reduce the risk of agents learning brittle ethical shortcuts that collapse outside narrow training distributions.
- The emphasis on reporting moral trade-offs may support more transparent auditing of deployed RL systems in domains such as healthcare or autonomous vehicles.
- Operationalizing multiple ethical traditions side by side could surface practical methods for handling value pluralism in global AI governance.
- Testing the roadmap in long-horizon multi-agent environments might reveal whether virtue-like stability emerges more reliably than from purely reward-shaped baselines.
Load-bearing premise
The four proposed components can be combined to produce stable virtue-like behavior without introducing new forms of ambiguity, implementation difficulty, or cultural bias that undermine the approach.
What would settle it
A controlled experiment in which agents trained via the four-component roadmap fail to maintain consistent ethical behavior or exhibit increased proxy gaming when incentives or partner behaviors are shifted after training.
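The shape of such an experiment can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the two "policies" are hand-written rules rather than trained agents, the context fields `bonus` and `cost` are invented, and the durability measure (the drop in how often the prosocial action is taken) is one possible choice, not a metric defined by the paper.

```python
from typing import Callable, Dict, List

Context = Dict[str, float]
Policy = Callable[[Context], bool]  # True means the agent takes the prosocial action


def trait_rate(policy: Policy, contexts: List[Context]) -> float:
    """Fraction of contexts in which the policy takes the prosocial action."""
    return sum(policy(c) for c in contexts) / len(contexts)


def durability_gap(policy: Policy, before: List[Context], after: List[Context]) -> float:
    """Drop in trait rate after the intervention; 0.0 means the behavior held up."""
    return trait_rate(policy, before) - trait_rate(policy, after)


def incentive_follower(c: Context) -> bool:
    """Hypothetical agent that helps only when the external bonus exceeds the cost."""
    return c["bonus"] > c["cost"]


def disposition_holder(c: Context) -> bool:
    """Hypothetical agent that helps unless helping is very costly, bonus or not."""
    return c["cost"] < 5.0


costs = [0.5, 1.0, 1.5, 1.0]
train = [{"bonus": 2.0, "cost": k} for k in costs]    # helping is rewarded
shifted = [{"bonus": 0.0, "cost": k} for k in costs]  # intervention: bonus removed

for name, policy in [("incentive_follower", incentive_follower),
                     ("disposition_holder", disposition_holder)]:
    print(name, "durability gap:", durability_gap(policy, train, shifted))
```

In this toy setup both agents look identical before the shift, and only the intervention separates them. A real test of the roadmap would replace the hand-written rules with trained agents and would need many more contexts and intervention types.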
Original abstract
This paper critiques common patterns in machine ethics for Reinforcement Learning (RL) and argues for a virtue-focused alternative. We highlight two recurring limitations in much of the current literature: (i) rule-based (deontological) methods that encode duties as constraints or shields often struggle under ambiguity and nonstationarity and do not cultivate lasting habits, and (ii) many reward-based approaches, especially single-objective RL, implicitly compress diverse moral considerations into a single scalar signal, which can obscure trade-offs and invite proxy gaming in practice. We instead treat ethics as policy-level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change. This shifts evaluation beyond rule checks or scalar returns toward trait summaries, durability under interventions, and explicit reporting of moral trade-offs. Our roadmap combines four components: (1) social learning in multi-agent RL to acquire virtue-like patterns from imperfect but normatively informed exemplars; (2) multi-objective and constrained formulations that preserve value conflicts and incorporate risk-aware criteria to guard against harm; (3) affinity-based regularization toward updateable virtue priors that support trait-like stability under distribution shift while allowing norms to evolve; and (4) operationalizing diverse ethical traditions as practical control signals, making explicit the value and cultural assumptions that shape ethical RL benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper critiques rule-based (deontological) and scalar-reward approaches in RL-based machine ethics for struggling with ambiguity, non-stationarity, and obscuring moral trade-offs. It proposes treating ethics as stable policy-level dispositions (virtues) evaluable via trait summaries, durability under interventions, and explicit trade-off reporting, and outlines a four-component roadmap: (1) social learning in multi-agent RL from normatively informed exemplars, (2) multi-objective/constrained RL with risk-aware criteria, (3) affinity-based regularization for updateable virtue priors, and (4) operationalizing diverse ethical traditions as control signals.
Significance. If the proposed components can be integrated to deliver stable, context-adaptive ethical behavior without new ambiguities, the work could meaningfully shift RL ethics research toward frameworks that preserve value pluralism and support norm evolution. The conceptual critique of existing limitations is plausible and draws on established philosophical distinctions, but the absence of formalization, interaction analysis, or feasibility arguments means the primary contribution is directional rather than immediately enabling new implementations.
major comments (2)
- [Roadmap section (components 1–4)] The manuscript presents the four components as a combined solution but supplies no analysis of compatibility, conflict resolution, or interaction effects. For instance, it is unclear how affinity-based regularization would enforce trait stability under distribution shift while multi-objective optimization simultaneously preserves explicit trade-offs, or how social learning from exemplars would avoid proxy behaviors in non-stationary environments.
- [Central claim on policy-level dispositions] The shift from rule checks or scalar returns to evaluation via 'trait summaries' and 'durability under interventions' is load-bearing for the virtue-ethics alternative, yet the paper provides no concrete metrics, intervention protocols, or RL-specific operationalization for assessing durability, leaving the claim without a clear path to implementation or falsification.
minor comments (1)
- [Abstract and Roadmap] The abstract and roadmap description use terms such as 'normatively informed exemplars' and 'updateable virtue priors' without providing working definitions or references to how these would be formalized in an RL setting.
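As a purely illustrative reading of what an "updateable virtue prior" could look like in an RL setting, the Python sketch below uses the regularized form J(θ) = E[R] − λL that this page quotes from the paper. Everything else is an assumption made for the example: the penalty L is taken to be the KL divergence from the policy to a prior action distribution, the prior is updated by simple mixing toward exemplar behavior, and the two-action setup and all numbers are invented. The paper may define these terms differently.

```python
import math
from typing import Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_objective(policy: Sequence[float], rewards: Sequence[float],
                          prior: Sequence[float], lam: float) -> float:
    """J = E[R] - lam * L, with L assumed here to be KL(policy || prior)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - lam * kl_divergence(policy, prior)


def update_prior(prior: Sequence[float], exemplar: Sequence[float],
                 rate: float) -> Tuple[float, ...]:
    """Move the prior a fraction `rate` of the way toward observed exemplar behavior."""
    return tuple((1.0 - rate) * p + rate * e for p, e in zip(prior, exemplar))


# Two actions: (exploit, share). All values are hypothetical.
rewards = (1.0, 0.2)
virtue_prior = (0.1, 0.9)   # prior that favors sharing
exploitative = (0.9, 0.1)   # candidate policy that mostly exploits
sharing = (0.1, 0.9)        # candidate policy that matches the prior

for lam in (0.0, 1.0):
    j_exploit = regularized_objective(exploitative, rewards, virtue_prior, lam)
    j_share = regularized_objective(sharing, rewards, virtue_prior, lam)
    print(f"lambda={lam}: exploit={j_exploit:.3f}, share={j_share:.3f}")

# Norm evolution: exemplars now split evenly, so the prior drifts toward them.
print("updated prior:", update_prior(virtue_prior, (0.5, 0.5), rate=0.5))
```

With λ = 0 the exploitative policy scores higher, and with λ = 1 the penalty reverses the ranking. The mixing rate controls how quickly the prior follows changing exemplar behavior, which is one place where the stability versus norm-evolution tension raised by the referee would show up.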
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful report. The comments correctly identify areas where the manuscript, as a conceptual critique and high-level roadmap, would benefit from greater specificity on component interactions and operationalization. We address each major comment below and commit to revisions that strengthen these aspects without altering the paper's directional focus.
Point-by-point responses
- Referee: Roadmap section (components 1–4): the manuscript presents the four components as a combined solution but supplies no analysis of compatibility, conflict resolution, or interaction effects. For instance, it is unclear how affinity-based regularization would enforce trait stability under distribution shift while multi-objective optimization simultaneously preserves explicit trade-offs, or how social learning from exemplars would avoid proxy behaviors in non-stationary environments.
  Authors: We agree that the manuscript does not include a dedicated analysis of interactions among the four components. This omission stems from the paper's framing as an initial roadmap rather than a complete architecture. In revision we will add a dedicated subsection that outlines plausible integration strategies and potential tensions. For example, we will describe how affinity-based regularization can serve as a stability-inducing term within a multi-objective formulation, how social learning from exemplars can be followed by constrained optimization to reduce proxy risks, and how explicit trade-off reporting can be preserved across components. We will also note open questions regarding non-stationarity that future work would need to resolve.
  Revision: yes
- Referee: Central claim on policy-level dispositions: the shift from rule checks or scalar returns to evaluation via 'trait summaries' and 'durability under interventions' is load-bearing for the virtue-ethics alternative, yet the paper provides no concrete metrics, intervention protocols, or RL-specific operationalization for assessing durability, leaving the claim without a clear path to implementation or falsification.
  Authors: The manuscript presents the evaluation shift at a conceptual level to highlight the distinction from existing approaches. We acknowledge that this leaves the central claim without immediate implementation details. In the revised version we will expand the relevant section to propose concrete evaluation directions, including trait-summary statistics computed over context distributions, intervention protocols adapted from robustness testing in RL (e.g., policy perturbation under changed reward or transition dynamics), and references to existing multi-agent and constrained RL literature that could support falsifiable tests. These additions will provide clearer next steps while preserving the paper's emphasis on the underlying philosophical motivation.
  Revision: yes
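One way to picture "trait-summary statistics computed over context distributions" is the following Python sketch. It is an assumption-laden illustration, not the authors' proposal: the log format, the context family names, and the choice of summary statistics (mean rate, spread across families, and worst-case family) are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# (context family, opportunities to show the trait, times the trait was shown)
EpisodeLog = Tuple[str, int, int]


def trait_summary(logs: List[EpisodeLog]) -> Dict[str, object]:
    """Aggregate trait expression per context family and summarize across families."""
    opportunities: Dict[str, int] = defaultdict(int)
    shown: Dict[str, int] = defaultdict(int)
    for family, n_opportunities, n_shown in logs:
        opportunities[family] += n_opportunities
        shown[family] += n_shown

    # Per-family rate; families with no opportunities are skipped.
    rates = {f: shown[f] / opportunities[f] for f in opportunities if opportunities[f] > 0}
    values = list(rates.values())
    worst_family = min(rates, key=rates.get)
    return {
        "rates": rates,
        "mean_rate": mean(values),
        "spread": pstdev(values),      # population std. dev. across families
        "worst_family": worst_family,  # where the trait is expressed least often
        "worst_rate": rates[worst_family],
    }


# Hypothetical evaluation logs for a single trained agent.
logs = [
    ("cooperative", 10, 9),
    ("competitive", 10, 7),
    ("scarce", 10, 8),
    ("cooperative", 10, 10),
]

summary = trait_summary(logs)
print(summary)
```

Reporting the worst-case family alongside the mean matters here because an average can hide a context in which the disposition mostly disappears, which is the kind of brittleness the evaluation shift is meant to expose.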
Circularity Check
No significant circularity; conceptual roadmap without derivations or self-referential reductions.
Full rationale
This is a position paper critiquing rule-based and reward-based machine ethics in RL and proposing a virtue-oriented alternative via four conceptual components. The provided text contains no equations, fitted parameters, derivations, or mathematical predictions. All load-bearing claims rest on philosophical distinctions (e.g., treating ethics as stable policy-level dispositions) and literature patterns rather than any self-citation chain, ansatz smuggling, or renaming of known results that reduces to the paper's own inputs by construction. The central roadmap is presented as a synthesis of independent ideas; no step is shown to be equivalent to its inputs via definition or fit. This matches the default expectation for non-circular papers and aligns with the reader's assessment of score 1.0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ethics in RL is best captured as stable policy-level dispositions rather than encoded rules or scalar rewards.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We instead treat ethics as policy-level dispositions... trait summaries, durability under interventions, and explicit reporting of moral trade-offs."
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "affinity-based regularization toward updateable virtue priors... J(θ) = E[R] − λL"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.