Addressing Moral Uncertainty using Large Language Models for Ethical Decision-Making
Pith reviewed 2026-05-23 02:57 UTC · model grok-4.3
The pith
Reinforcement learning agents can be ethically refined by replacing human feedback with belief values from a large language model embodying five moral principles, aggregated into shaping rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an ethical layer which uses an LLM to generate belief values from five moral principles, aggregates those values via Belief Jensen-Shannon Divergence and Dempster-Shafer Theory into probability scores, and feeds the scores back as shaping rewards, enables a reinforcement learning agent to navigate moral uncertainty and produce morally sound decisions across diverse tasks while lowering dependence on manually designed ethical rewards.
What carries the argument
The ethical layer that aggregates LLM-generated belief scores from five moral perspectives using Belief Jensen-Shannon Divergence and Dempster-Shafer Theory to form probability scores that serve as shaping rewards.
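The two aggregation tools named here have standard textbook definitions in the evidence-fusion literature. They are reproduced below for reference, on the assumption (not confirmed by this summary) that the paper uses these forms. Here m_1 and m_2 are mass functions over a frame of discernment Θ, the A_i are focal elements, and K is the conflict mass.

$$
\mathrm{BJS}(m_1, m_2) = \frac{1}{2}\left[\sum_{i} m_1(A_i)\log\frac{2\,m_1(A_i)}{m_1(A_i)+m_2(A_i)} + \sum_{i} m_2(A_i)\log\frac{2\,m_2(A_i)}{m_1(A_i)+m_2(A_i)}\right]
$$

$$
(m_1 \oplus m_2)(A) = \frac{1}{1-K}\sum_{B \cap C = A} m_1(B)\,m_2(C), \quad A \neq \emptyset, \qquad K = \sum_{B \cap C = \emptyset} m_1(B)\,m_2(C)
$$

With a base-2 logarithm the divergence lies in [0, 1], and Dempster's rule is undefined when K = 1 (total conflict between sources).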
If this is right
- The framework produces improved consistency and adaptability compared with other belief aggregation methods.
- It reduces reliance on handcrafted ethical rewards for each new task.
- It remains effective when ethical challenges appear unexpectedly in dynamic environments.
- It works across multiple LLM variants and yields decisions suited to real-world applications.
Where Pith is reading between the lines
- The same aggregation step could be inserted into other learning loops, such as fine-tuning language models directly on ethical preferences.
- If the underlying LLM systematically under-represents one of the five principles, the resulting reward signal would embed that imbalance across all tasks.
- Deployment in high-stakes domains such as autonomous vehicles would require an additional verification layer that checks whether the aggregate score actually matches human expert judgments on edge cases.
Load-bearing premise
The large language model can accurately and unbiasedly embody and apply the five specified moral principles to assign belief values to actions.
What would settle it
A controlled test in which the agent faces an action that one moral principle strongly endorses but the aggregated score strongly opposes. If the agent reliably follows the aggregate score even when individual principles conflict, the claim holds; systematic deviation from the aggregate would falsify it.
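One way to operationalize this test is sketched below. It assumes the per-principle and aggregate scores are available as action-probability dictionaries; the thresholds `strong=0.8` and `weak=0.2` standing in for "strongly endorses" and "strongly opposes" are illustrative choices, and `Episode`, `dissenting_action`, and `follow_rate` are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    principle_probs: dict  # principle name -> {action: probability}
    aggregate_probs: dict  # {action: probability} from the ethical layer
    chosen_action: str     # action the trained agent actually took


def dissenting_action(ep, strong=0.8, weak=0.2):
    """Return an action one principle strongly endorses but the aggregate strongly opposes, if any."""
    for probs in ep.principle_probs.values():
        for action, p in probs.items():
            if p >= strong and ep.aggregate_probs.get(action, 0.0) <= weak:
                return action
    return None


def follow_rate(episodes, strong=0.8, weak=0.2):
    """Fraction of conflict cases in which the agent took the aggregate's top-ranked action."""
    followed = total = 0
    for ep in episodes:
        if dissenting_action(ep, strong, weak) is None:
            continue  # not a conflict case, so it is excluded from the test
        total += 1
        top = max(ep.aggregate_probs, key=ep.aggregate_probs.get)
        followed += ep.chosen_action == top  # True counts as 1
    return followed / total if total else float("nan")


# Made-up episodes: two conflict cases and one ordinary case.
episodes = [
    Episode({"care": {"wait": 0.9, "help": 0.1}}, {"help": 0.85, "wait": 0.15}, "help"),
    Episode({"care": {"wait": 0.9, "help": 0.1}}, {"help": 0.9, "wait": 0.1}, "wait"),
    Episode({"virtue": {"help": 0.6, "wait": 0.4}}, {"help": 0.7, "wait": 0.3}, "help"),
]
rate = follow_rate(episodes)
```

A follow rate near 1 matches the prediction stated above. It shows only that the agent tracks the aggregate, not that the aggregate is morally right, which is the gap the referee report raises.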
Original abstract
We present an ethical decision-making framework that refines a pre-trained reinforcement learning (RL) model using a task-agnostic ethical layer. Following initial training, the RL model undergoes ethical fine-tuning, where human feedback is replaced by feedback generated from a large language model (LLM). The LLM embodies consequentialist, deontological, virtue, social justice, and care ethics as moral principles to assign belief values to recommended actions during ethical decision-making. An ethical layer aggregates belief scores from multiple LLM-derived moral perspectives using Belief Jensen-Shannon Divergence and Dempster-Shafer Theory into probability scores that also serve as the shaping reward, steering the agent toward choices that align with a balanced ethical framework. This integrated learning framework helps the RL agent navigate moral uncertainty in complex environments and enables it to make morally sound decisions across diverse tasks. Our approach, tested across different LLM variants and compared with other belief aggregation techniques, demonstrates improved consistency, adaptability, and reduced reliance on handcrafted ethical rewards. This method is especially effective in dynamic scenarios where ethical challenges arise unexpectedly, making it well-suited for real-world applications.
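To make the aggregation step concrete, here is a minimal sketch of turning five per-principle belief assignments into action probabilities. None of the following is confirmed by the abstract; all of it is assumed for illustration: each LLM output has already been converted into a Dempster-Shafer mass function over a hypothetical set of candidate actions, with leftover mass on the whole frame Θ expressing uncertainty; credibility weights are supplied externally (uniform here, where a Belief Jensen-Shannon step would normally produce them); and the weighted average is fused with itself k - 1 times in the style of Murphy's averaging rule before a pignistic transform yields probabilities. The paper's exact procedure may differ.

```python
from itertools import product

# Hypothetical frame of discernment: the candidate actions in one state.
HELP, WAIT, LEAVE = frozenset({"help"}), frozenset({"wait"}), frozenset({"leave"})
THETA = HELP | WAIT | LEAVE  # mass on THETA encodes "undecided"


def dempster(m1, m2):
    """Dempster's rule for mass functions keyed by frozenset focal elements."""
    combined, conflict = {}, 0.0
    for (b, wb), (c, wc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wb * wc
        else:
            conflict += wb * wc
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


def weighted_average(masses, weights):
    """Credibility-weighted average of mass functions (weights are assumed to sum to 1)."""
    avg = {}
    for m, w in zip(masses, weights):
        for focal, v in m.items():
            avg[focal] = avg.get(focal, 0.0) + w * v
    return avg


def pignistic(m):
    """BetP: split each focal element's mass evenly among the actions it contains."""
    probs = {}
    for focal, v in m.items():
        for action in focal:
            probs[action] = probs.get(action, 0.0) + v / len(focal)
    return probs


def aggregate(masses, weights):
    """Average the evidence, then fuse the average with itself k - 1 times (Murphy-style)."""
    avg = weighted_average(masses, weights)
    fused = avg
    for _ in range(len(masses) - 1):
        fused = dempster(fused, avg)
    return pignistic(fused)


# Illustrative belief values, one mass function per moral principle (made up).
masses = [
    {HELP: 0.6, WAIT: 0.2, THETA: 0.2},   # consequentialist
    {HELP: 0.7, LEAVE: 0.1, THETA: 0.2},  # deontological
    {HELP: 0.5, WAIT: 0.3, THETA: 0.2},   # virtue
    {HELP: 0.4, WAIT: 0.4, THETA: 0.2},   # social justice
    {WAIT: 0.6, HELP: 0.2, THETA: 0.2},   # care
]
probs = aggregate(masses, [0.2] * 5)
```

With these made-up values, repeated combination sharpens the majority view, so "help" receives most of the probability even though the care perspective favored "wait". Probabilities of this kind are what would feed the shaping reward.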
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a framework for ethical decision-making in reinforcement learning agents. It uses a large language model to generate feedback based on five moral principles—consequentialist, deontological, virtue, social justice, and care ethics—to assign belief values to actions. These are aggregated using Belief Jensen-Shannon Divergence and Dempster-Shafer Theory to create probability scores that serve as shaping rewards for the RL agent, aiming to handle moral uncertainty without human feedback or handcrafted rewards. The approach is tested across LLM variants and compared to other aggregation techniques, claiming improved consistency and adaptability.
Significance. If the empirical claims hold and external validation is added, the work could contribute to scalable methods for incorporating multiple ethical perspectives into AI systems, addressing moral uncertainty in dynamic environments. The use of established aggregation techniques like DST is a positive element. However, without human benchmarks, the significance for producing 'morally sound decisions' remains limited.
major comments (3)
- [Abstract] The assertion that the method 'demonstrates improved consistency, adaptability, and reduced reliance on handcrafted ethical rewards' after testing across LLM variants is unsupported by any reported data, baselines, quantitative metrics, or statistical comparisons, which is load-bearing for the central claim.
- [Experiments / Evaluation section] No comparisons of LLM-assigned belief values to human ethical judgments on the same action sets or established moral dilemma benchmarks (e.g., trolley problems or standard ethics datasets) are provided. This directly undermines the claim that the framework enables 'morally sound decisions,' as consistency among LLMs may capture shared model biases rather than independent moral validity.
- [Framework section] The core assumption that the LLM can accurately and unbiasedly 'embody' the five moral principles to assign belief values is not validated against external human standards or inter-rater reliability measures, which is load-bearing for the claim that the aggregated rewards steer the agent toward ethical choices.
minor comments (2)
- [Framework section] The description of how Belief Jensen-Shannon Divergence is computed from the LLM outputs could include an explicit equation or pseudocode for reproducibility.
- [Framework section] Clarify whether the 'ethical fine-tuning layer' modifies the RL policy directly or only the reward function, as this affects interpretation of the results.
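On the second minor comment, the sketch below illustrates the reward-only reading, in which the ethical layer's probability for the chosen action is added to the environment reward and the policy is then updated by whatever RL algorithm is in use. The function name `shaped_reward`, the weight `beta`, and the uniform-baseline centering are hypothetical choices, not details confirmed by the paper.

```python
def shaped_reward(env_reward, ethical_probs, action, beta=0.5, center=True):
    """Add an ethics term to the environment reward; the policy itself is not modified here.

    ethical_probs: {action: probability} from the ethical layer (assumed to sum to 1).
    beta: hypothetical weight trading off task reward against the ethics signal.
    """
    p = ethical_probs.get(action, 0.0)
    if center:
        # Subtract the uniform baseline 1/|A| so that, under a uniformly random
        # policy, the bonus averages to zero instead of inflating every return.
        p -= 1.0 / len(ethical_probs)
    return env_reward + beta * p


probs = {"help": 0.7, "wait": 0.2, "leave": 0.1}
r_help = shaped_reward(1.0, probs, "help")
r_leave = shaped_reward(1.0, probs, "leave")
```

Unlike potential-based shaping, a term like this is not guaranteed to preserve the optimal policy of the original task, which is the intended effect when the goal is to change the agent's behavior.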
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for clarification and improvement in our work on ethical decision-making for RL agents. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] The assertion that the method 'demonstrates improved consistency, adaptability, and reduced reliance on handcrafted ethical rewards' after testing across LLM variants is unsupported by any reported data, baselines, quantitative metrics, or statistical comparisons, which is load-bearing for the central claim.
  Authors: We agree with this observation. The experiments section describes testing across LLM variants and comparisons to other aggregation techniques, but does not include specific quantitative metrics or statistical comparisons to support claims of 'improved consistency' and 'adaptability'. We will revise the abstract to remove these unsupported assertions and instead describe the experimental setup more accurately. Revision: yes.
- Referee: [Experiments / Evaluation section] No comparisons of LLM-assigned belief values to human ethical judgments on the same action sets or established moral dilemma benchmarks (e.g., trolley problems or standard ethics datasets) are provided. This directly undermines the claim that the framework enables 'morally sound decisions,' as consistency among LLMs may capture shared model biases rather than independent moral validity.
  Authors: This is a valid criticism. Our framework is intended to address moral uncertainty by aggregating diverse LLM perspectives without relying on human feedback, which is a key contribution. However, we recognize that without external human validation, claims of moral soundness are limited. We will add a dedicated limitations paragraph in the discussion section to explicitly address this point, noting that LLM agreement may reflect training data biases and recommending human studies on standard benchmarks as future work. Revision: partial.
- Referee: [Framework section] The core assumption that the LLM can accurately and unbiasedly 'embody' the five moral principles to assign belief values is not validated against external human standards or inter-rater reliability measures, which is load-bearing for the claim that the aggregated rewards steer the agent toward ethical choices.
  Authors: We acknowledge that the manuscript does not provide validation of the LLM's embodiment of moral principles against human standards. The approach relies on the capability of LLMs to simulate different ethical frameworks through prompting, as described in the framework section. To address this, we will revise the framework section to more clearly articulate this assumption and discuss potential biases, while emphasizing that the aggregation via Belief JSD and DST is meant to mitigate individual perspective biases. Revision: yes.
Circularity Check
No circularity; derivation relies on external LLM assumptions and standard aggregation methods
Full rationale
The paper proposes replacing human feedback with LLM-generated belief values from five fixed moral principles, then aggregates them via Belief Jensen-Shannon Divergence and Dempster-Shafer Theory to produce shaping rewards. No equations or steps define a quantity in terms of itself, rename a fitted parameter as a prediction, or reduce the central claim to a self-citation chain. The framework is self-contained against its stated inputs; any weakness lies in unvalidated external assumptions about LLM moral accuracy rather than internal circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- Belief values from the LLM
axioms (1)
- Domain assumption: LLMs can reliably simulate multiple ethical frameworks without introducing significant bias.
invented entities (1)
- Ethical fine-tuning layer (no independent evidence)
Appendix A: LLM Prompts
Throughout our simulations, the moral agent is embodied by a large language model (LLM) interacting with the simulation environment. These inter...
$$
\begin{bmatrix}
0 & \cdots & \mathrm{BJS}_{1j} & \cdots & \mathrm{BJS}_{1k} \\
\vdots & & \vdots & & \vdots \\
\mathrm{BJS}_{i1} & \cdots & 0 & \cdots & \mathrm{BJS}_{ik} \\
\vdots & & \vdots & & \vdots \\
\mathrm{BJS}_{k1} & \cdots & \mathrm{BJS}_{kj} & \cdots & 0
\end{bmatrix} \tag{C2}
$$
The belief divergence as a distance measure quantifies the level of consistency across evidence from different sources. This measure of consistency allows for the identification of sources that are in alignment versus those that are...
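The divergence matrix (C2) and the consistency idea it supports can be sketched as follows, reading entry (i, j) as BJS(m_i, m_j). This is a minimal sketch assuming Xiao's (2019) belief-divergence weighting: each source's average divergence to the others is inverted into a support score, and the scores are normalized into credibility weights. The frame, focal elements, mass values, and function names are hypothetical, and the paper's exact weighting may differ.

```python
import math


def bjs(m1, m2):
    """Belief Jensen-Shannon divergence (base-2 log, so 0 <= BJS <= 1)."""
    total = 0.0
    for focal in set(m1) | set(m2):
        p, q = m1.get(focal, 0.0), m2.get(focal, 0.0)
        mid = (p + q) / 2.0
        if p > 0:
            total += p * math.log2(p / mid)
        if q > 0:
            total += q * math.log2(q / mid)
    return total / 2.0


def divergence_matrix(masses):
    """k x k matrix of pairwise BJS values with a zero diagonal, as in (C2)."""
    k = len(masses)
    return [[0.0 if i == j else bjs(masses[i], masses[j]) for j in range(k)]
            for i in range(k)]


def credibility_weights(masses, eps=1e-12):
    """Sources that diverge less from the rest get more weight (needs k >= 2)."""
    d = divergence_matrix(masses)
    k = len(masses)
    avg = [sum(row) / (k - 1) for row in d]   # mean distance to the other sources
    support = [1.0 / (a + eps) for a in avg]  # eps avoids division by zero
    total = sum(support)
    return [s / total for s in support]


HELP, WAIT = frozenset({"help"}), frozenset({"wait"})
THETA = HELP | WAIT
masses = [
    {HELP: 0.7, WAIT: 0.1, THETA: 0.2},
    {HELP: 0.6, WAIT: 0.2, THETA: 0.2},
    {HELP: 0.1, WAIT: 0.7, THETA: 0.2},  # the dissenting source
]
D = divergence_matrix(masses)
weights = credibility_weights(masses)
```

In this toy case the third source, which disagrees with the other two, ends up with the smallest credibility weight, which is how the divergence measure separates sources in alignment from outliers.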