Scaling Laws for Moral Machine Judgment in Large Language Models
Pith reviewed 2026-05-16 11:53 UTC · model grok-4.3
The pith
Language models align more closely with human moral preferences as they grow in size, with the gap shrinking as a power law of parameter count.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We observe a consistent power-law relationship with distance from human preferences (D) decreasing as D ∝ S^{-0.10±0.01} (R²=0.50, p<0.001) where S is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show significantly better alignment, with this effect being more pronounced in smaller models (size×reasoning interaction: p = 0.024). The relationship holds across diverse architectures, while variance decreases at larger scales, indicating systematic emergence of more reliable moral judgment with computational scale.
What carries the argument
The distance metric D from average human preferences on Moral Machine life-death dilemmas, plotted against model parameter count S to reveal the scaling exponent.
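As a rough illustration of the metric: the simulated rebuttal further down this page describes D as the mean absolute deviation from aggregated human preference proportions across Moral Machine scenarios. The sketch below follows that description only. The function name, the dimension labels, and every number are hypothetical and are not taken from the paper.

```python
def mean_absolute_deviation(model_prefs, human_prefs):
    """Mean absolute gap between model and human preference proportions."""
    if model_prefs.keys() != human_prefs.keys():
        raise ValueError("both inputs must cover the same dimensions")
    gaps = [abs(model_prefs[k] - human_prefs[k]) for k in human_prefs]
    return sum(gaps) / len(gaps)


# Hypothetical preference proportions per dilemma dimension (illustrative only).
human = {"spare_more_lives": 0.80, "spare_young": 0.70, "spare_humans": 0.90}
model = {"spare_more_lives": 0.90, "spare_young": 0.50, "spare_humans": 0.90}

distance = mean_absolute_deviation(model, human)
print(round(distance, 3))
```

A smaller value means the model's aggregate choices sit closer to the human averages; the paper's actual aggregation may differ in detail.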
If this is right
- Moral alignment improves systematically and predictably with model size.
- Extended reasoning boosts alignment more for smaller models than for larger ones.
- Response consistency increases and variance decreases as models scale up.
- The pattern appears independent of specific model family or architecture.
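To make the size of the effect concrete, the reported exponent can be converted into a reduction per tenfold increase in parameters, and into a ratio across the tested range. This is plain arithmetic on the central estimate of -0.10; it ignores the ±0.01 uncertainty and the scatter implied by R² = 0.50.

```latex
\frac{D(10S)}{D(S)} = 10^{-0.10} \approx 0.79,
\qquad
\frac{D(1000\,\mathrm{B})}{D(0.27\,\mathrm{B})}
  = \left(\frac{1000}{0.27}\right)^{-0.10} \approx 0.44
```

On this reading, each tenfold increase in size would cut the distance by about 21%, and the full 0.27B to 1000B range corresponds to a bit more than a halving.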
Where Pith is reading between the lines
- Continued scaling would imply that future models could reach near-human consistency on these dilemmas through size alone.
- Governance rules for autonomous systems might eventually use parameter count as one indicator of expected moral reliability.
- The same scaling might or might not appear when the same models face moral questions outside the Moral Machine format.
Load-bearing premise
Responses on the Moral Machine framework constitute a stable, generalizable proxy for moral judgment that is not dominated by training-data artifacts or prompt sensitivity.
What would settle it
A model much larger than the tested range whose distance from human preferences fails to follow the predicted power-law decrease, or a change in prompt wording that removes the scaling effect.
Original abstract
Autonomous systems increasingly require moral judgment capabilities, yet whether these capabilities scale predictably with model size remains unexplored. We systematically evaluate 75 large language model configurations (0.27B--1000B parameters) using the Moral Machine framework, measuring alignment with human preferences in life-death dilemmas. We observe a consistent power-law relationship with distance from human preferences ($D$) decreasing as $D \propto S^{-0.10\pm0.01}$ ($R^2=0.50$, $p<0.001$) where $S$ is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show significantly better alignment, with this effect being more pronounced in smaller models (size$\times$reasoning interaction: $p = 0.024$). The relationship holds across diverse architectures, while variance decreases at larger scales, indicating systematic emergence of more reliable moral judgment with computational scale. These findings extend scaling law research to value-based judgments and provide empirical foundations for artificial intelligence governance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that moral judgment in LLMs, measured via alignment with human preferences on Moral Machine life-death dilemmas, follows a power-law scaling with model size: distance D from human preferences decreases as D ∝ S^{-0.10±0.01} (R²=0.50, p<0.001) across 75 configurations (0.27B–1000B parameters). Mixed-effects models show the relationship persists after controlling for model family and reasoning, with extended reasoning improving alignment more in smaller models (size×reasoning interaction p=0.024). The relationship holds across architectures and variance decreases at larger scales.
Significance. If robust, the work extends scaling-law research from language and reasoning tasks to value-based moral judgments, supplying empirical data relevant to AI governance. The moderate R²=0.50 and explicit mixed-effects controls are strengths, but the single-framework proxy and lack of prompt-invariance checks limit the strength of the generalization claim.
major comments (3)
- [Abstract] The power-law claim rests on R²=0.50; the manuscript must specify the exact distance metric for D, whether S was log-transformed before fitting, outlier handling, and the full mixed-effects model specification (random effects, covariance structure).
- [Results] In the mixed-effects analysis, no explicit checks are reported for prompt paraphrasing invariance or overlap between Moral Machine scenarios and pretraining corpora; without these, residual variance could reflect training-data artifacts rather than stable moral judgment scaling.
- [Abstract] The size×reasoning interaction (p=0.024) is reported without effect size, coefficient table, or model equation, preventing assessment of whether the interaction is load-bearing for the central scaling claim.
minor comments (1)
- [Methods] Clarify the exact number of models per size bin and any exclusion criteria applied to the 75 configurations.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We have carefully considered each point and made revisions to enhance the transparency of our methods and statistical reporting. Below, we provide point-by-point responses.
-
Referee: [Abstract] The power-law claim rests on R²=0.50; the manuscript must specify the exact distance metric for D, whether S was log-transformed before fitting, outlier handling, and the full mixed-effects model specification (random effects, covariance structure).
Authors: We agree that these details are essential. In the revised version, we will explicitly state in the abstract and methods that D represents the mean absolute deviation from aggregated human preference proportions across Moral Machine scenarios. The power law was fitted by linear regression of log(D) on log(S), where S is the number of model parameters. No outliers were removed from the analysis. The mixed-effects model is specified as D ~ log(S) + reasoning + (1 | model_family), using a random intercept for model family with an unstructured covariance matrix.
Revision: yes
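The log-log fit described in this response can be sketched as an ordinary least-squares regression of log10(D) on log10(S). This is a minimal illustration and not the authors' code: the function name is hypothetical, and the data are synthetic and noise-free, generated from an assumed law D = 0.5 · S^(-0.10) with sizes in billions of parameters.

```python
import math


def fit_power_law(sizes, distances):
    """OLS on (log10 S, log10 D); returns (exponent, prefactor, r_squared)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(d) for d in distances]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx                      # power-law exponent
    intercept = y_bar - slope * x_bar      # log10 of the prefactor
    r_squared = (sxy ** 2) / (sxx * syy)   # squared correlation in log space
    return slope, 10 ** intercept, r_squared


# Synthetic data from an assumed law; NOT the paper's measurements.
sizes_b = [0.27, 1.0, 7.0, 70.0, 1000.0]
distances = [0.5 * s ** -0.10 for s in sizes_b]

exponent, prefactor, r2 = fit_power_law(sizes_b, distances)
print(f"exponent={exponent:.3f}, prefactor={prefactor:.3f}, R^2={r2:.3f}")
```

Because the synthetic points lie exactly on the assumed curve, the fit recovers the exponent and prefactor with R² of 1; real measurements, such as those behind the reported R² = 0.50, would scatter around the line.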
-
Referee: [Results] In the mixed-effects analysis, no explicit checks are reported for prompt paraphrasing invariance or overlap between Moral Machine scenarios and pretraining corpora; without these, residual variance could reflect training-data artifacts rather than stable moral judgment scaling.
Authors: We acknowledge this as a potential limitation. The original study did not include explicit prompt paraphrasing invariance tests or pretraining corpus overlap analyses. We argue that the Moral Machine dilemmas are abstract and unlikely to be directly memorized, and that the scaling relationship persists across diverse model families, which helps control for training data differences. In the revision, we will add a paragraph in the discussion section addressing this concern and suggesting it as an avenue for future work.
Revision: partial
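One simple form a prompt-paraphrase check could take is to measure the same model's distance under several rewordings of the prompt and summarize the spread. The sketch below is an assumption about how such a check might look, not something reported in the paper; the function name, the prompt labels, and the distances are all hypothetical.

```python
from statistics import mean, pstdev


def paraphrase_spread(distances_by_prompt):
    """Summarize how much a model's distance D moves across prompt wordings."""
    values = list(distances_by_prompt.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),              # population standard deviation
        "range": max(values) - min(values),
    }


# Hypothetical distances for one model under three prompt wordings.
d_by_prompt = {"original": 0.30, "paraphrase_a": 0.32, "paraphrase_b": 0.28}
spread = paraphrase_spread(d_by_prompt)
print(spread)
```

If the spread across wordings were comparable to the differences between model sizes, the scaling trend would be hard to separate from prompt sensitivity.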
-
Referee: [Abstract] The size×reasoning interaction (p=0.024) is reported without effect size, coefficient table, or model equation, preventing assessment of whether the interaction is load-bearing for the central scaling claim.
Authors: We will expand the reporting in the revised manuscript. We will include the full model equation, D ~ log(S) * reasoning + (1 | model_family), report the interaction coefficient (β = -0.05, SE = 0.02, p = 0.024), and provide a supplementary table with all fixed and random effects coefficients to allow full evaluation of the interaction's role.
Revision: yes
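Written out, the formula quoted in this response corresponds to the structure below. This is the standard reading of an lme4-style formula of the form D ~ log(S) * reasoning + (1 | model_family). The symbols β₀ to β₃, R, u, and ε are notation introduced here for illustration, and coding R as 1 for extended-reasoning configurations and 0 otherwise is an assumption.

```latex
D_{ij} = \beta_0 + \beta_1 \log S_{ij} + \beta_2 R_{ij}
       + \beta_3 \,(\log S_{ij} \times R_{ij}) + u_j + \varepsilon_{ij},
\qquad
u_j \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here i indexes model configurations and j indexes model families. Under this reading the slope in log S is β₁ for non-reasoning configurations and β₁ + β₃ for reasoning ones, and the reasoning effect at a given size is β₂ + β₃ log S, so the sign and magnitude of β₃ (together with the coding of R) determine how the reasoning gap changes with scale.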
Circularity Check
No circularity: empirical power-law fit to independently measured distances
The central claim is an observed scaling D ∝ S^{-0.10±0.01} obtained by evaluating 75 distinct model configurations on the Moral Machine task, computing alignment distance D to human preferences for each, and performing a regression on the resulting (S, D) pairs. This is a direct empirical measurement followed by statistical fitting; the power-law exponent is not defined from the same data in a way that forces the result by construction, nor does any step invoke self-citation, ansatz smuggling, or renaming of a known result as a derivation. Mixed-effects controls for family and reasoning are likewise post-hoc statistical adjustments on the measured values. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- scaling exponent = -0.10
axioms (1)
- domain assumption: Moral Machine responses provide a stable, generalizable measure of alignment with human moral preferences
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We observe a consistent power-law relationship with distance from human preferences (D) decreasing as D ∝ S^{-0.10±0.01} (R²=0.50, p<0.001)
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not y...
Supplementary Figures
[Figure: Distance from Human (log10) against Model Size (log10 parameters), by Model Family (DeepSeek, Gemma, Llama, Other, Qwen). Fig...]