
arxiv: 2601.17637 · v3 · submitted 2026-01-25 · 💻 cs.CY · cs.HC

Scaling Laws for Moral Machine Judgment in Large Language Models

Pith reviewed 2026-05-16 11:53 UTC · model grok-4.3

classification 💻 cs.CY cs.HC
keywords: scaling laws, moral judgment, large language models, Moral Machine, AI alignment, power law, ethical dilemmas, model size

The pith

Language models align more closely with human moral preferences as they grow, and the improvement follows a power law in model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests 75 large language models ranging from 0.27 billion to 1000 billion parameters on the Moral Machine framework, which presents life-and-death dilemmas and compares model choices against average human preferences. A consistent power-law pattern appears in which the gap between model outputs and human answers shrinks as model size grows. This relationship survives controls for model family and reasoning style, with extended reasoning helping smaller models more than larger ones. Variance in responses also drops at bigger scales. The results indicate that value-based judgments can emerge in a predictable way with added computation, offering a basis for forecasting when AI systems might reach usable levels of moral consistency.

Core claim

We observe a consistent power-law relationship with distance from human preferences (D) decreasing as D ∝ S^{-0.10±0.01} (R²=0.50, p<0.001) where S is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show significantly better alignment, with this effect being more pronounced in smaller models (size×reasoning interaction: p = 0.024). The relationship holds across diverse architectures, while variance decreases at larger scales, indicating systematic emergence of more reliable moral judgment with computational scale.
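As a rough worked reading of the reported exponent (taking the central value $-0.10$ at face value and ignoring its uncertainty and the scatter implied by $R^2 = 0.50$), the ratio of distances at two model sizes depends only on the size ratio:

```latex
\[
\frac{D(kS)}{D(S)} = k^{-0.10}
\]
\[
k = 2:\quad 2^{-0.10} \approx 0.933
\qquad
k = 10:\quad 10^{-0.10} \approx 0.794
\]
\[
k = \frac{1000}{0.27} \approx 3704:\quad k^{-0.10} \approx 0.44
\]
```

Under that reading, a tenfold increase in parameters corresponds to roughly a 21% reduction in distance, and the full tested range (0.27B to 1000B) to a reduction of a little over half. These are point estimates along the fitted line, not predictions for any individual model.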

What carries the argument

The distance metric D from average human preferences on Moral Machine life-death dilemmas, plotted against model parameter count S to reveal the scaling exponent.
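This page does not define D precisely. The simulated rebuttal further down describes it as a mean absolute deviation from aggregated human preference proportions, so the sketch below assumes that form. The dimension names and all numbers are hypothetical placeholders chosen for illustration, not values from the paper.

```python
from statistics import fmean


def mean_abs_distance(model_prefs, human_prefs):
    """Mean absolute deviation between two preference vectors.

    Both arguments map a preference dimension to a proportion in [0, 1].
    Only dimensions present in both mappings are compared.
    """
    shared = sorted(set(model_prefs) & set(human_prefs))
    if not shared:
        raise ValueError("no shared preference dimensions")
    return fmean(abs(model_prefs[k] - human_prefs[k]) for k in shared)


# Hypothetical proportions (made-up numbers, not taken from the paper).
human = {"spare_more_lives": 0.80, "spare_young": 0.70, "spare_pedestrians": 0.60}
model = {"spare_more_lives": 0.90, "spare_young": 0.50, "spare_pedestrians": 0.60}

d = mean_abs_distance(model, human)
print(f"D = {d:.3f}")
```

A different metric (for example a Euclidean distance over the same vectors) would change the absolute values of D, and possibly the fitted exponent, which is why the referee asks for the exact definition.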

If this is right

  • Moral alignment improves systematically and predictably with model size.
  • Extended reasoning boosts alignment more for smaller models than for larger ones.
  • Response consistency increases and variance decreases as models scale up.
  • The pattern appears independent of specific model family or architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continued scaling would imply that future models could reach near-human consistency on these dilemmas through size alone.
  • Governance rules for autonomous systems might eventually use parameter count as one indicator of expected moral reliability.
  • Whether the same scaling appears when these models face moral questions outside the Moral Machine format is an open question.

Load-bearing premise

Responses on the Moral Machine framework constitute a stable, generalizable proxy for moral judgment that is not dominated by training-data artifacts or prompt sensitivity.

What would settle it

A model much larger than the tested range whose distance from human preferences fails to follow the predicted power-law decrease, or a change in prompt wording that removes the scaling effect.
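The first test can be made concrete with a simple out-of-range check. The sketch below is only illustrative: the prefactor and residual spread are placeholder values (the page reports neither), and comparing against a fixed multiple of the residual standard deviation ignores uncertainty in the fitted coefficients, so a real analysis would use a proper prediction interval.

```python
import math


def predicted_distance(size_b, prefactor, exponent):
    """Point prediction D = prefactor * S^exponent, with S in billions of parameters."""
    return prefactor * size_b ** exponent


def deviates_from_law(size_b, observed_d, prefactor, exponent, resid_sd_log10, k=2.0):
    """True if observed log10(D) lies more than k residual SDs from the fitted line."""
    pred = predicted_distance(size_b, prefactor, exponent)
    gap = math.log10(observed_d) - math.log10(pred)
    return abs(gap) > k * resid_sd_log10


# Placeholder fit values: only the exponent (-0.10) comes from the paper.
PREFACTOR, EXPONENT, RESID_SD = 1.0, -0.10, 0.05

# Hypothetical model ten times larger than the largest tested size (1000B).
pred_10t = predicted_distance(10_000, PREFACTOR, EXPONENT)
print(f"predicted D at 10,000B: {pred_10t:.3f}")
print(deviates_from_law(10_000, 0.40, PREFACTOR, EXPONENT, RESID_SD))
print(deviates_from_law(10_000, 0.60, PREFACTOR, EXPONENT, RESID_SD))
```

With these placeholder values, an observed distance close to the extrapolated line is not flagged, while one well above it is.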

Figures

Figures reproduced from arXiv: 2601.17637 by Kazuhiro Takemoto.

Figure 1: Scaling relationship between model size ( …
Original abstract

Autonomous systems increasingly require moral judgment capabilities, yet whether these capabilities scale predictably with model size remains unexplored. We systematically evaluate 75 large language model configurations (0.27B--1000B parameters) using the Moral Machine framework, measuring alignment with human preferences in life-death dilemmas. We observe a consistent power-law relationship with distance from human preferences ($D$) decreasing as $D \propto S^{-0.10\pm0.01}$ ($R^2=0.50$, $p<0.001$) where $S$ is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show significantly better alignment, with this effect being more pronounced in smaller models (size$\times$reasoning interaction: $p = 0.024$). The relationship holds across diverse architectures, while variance decreases at larger scales, indicating systematic emergence of more reliable moral judgment with computational scale. These findings extend scaling law research to value-based judgments and provide empirical foundations for artificial intelligence governance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that moral judgment in LLMs, measured via alignment with human preferences on Moral Machine life-death dilemmas, follows a power-law scaling with model size: distance D from human preferences decreases as D ∝ S^{-0.10±0.01} (R²=0.50, p<0.001) across 75 configurations (0.27B–1000B parameters). Mixed-effects models show the relationship persists after controlling for model family and reasoning, with extended reasoning improving alignment more in smaller models (size×reasoning interaction p=0.024). The relationship holds across architectures and variance decreases at larger scales.

Significance. If robust, the work extends scaling-law research from language and reasoning tasks to value-based moral judgments, supplying empirical data relevant to AI governance. The moderate R²=0.50 and explicit mixed-effects controls are strengths, but the single-framework proxy and lack of prompt-invariance checks limit the strength of the generalization claim.

major comments (3)
  1. [Abstract] Abstract: the power-law claim rests on R²=0.50; the manuscript must specify the exact distance metric for D, whether S was log-transformed before fitting, outlier handling, and the full mixed-effects model specification (random effects, covariance structure).
  2. [Results] Results (mixed-effects analysis): no explicit checks are reported for prompt paraphrasing invariance or overlap between Moral Machine scenarios and pretraining corpora; without these, residual variance could reflect training-data artifacts rather than stable moral judgment scaling.
  3. [Abstract] Abstract: the size×reasoning interaction (p=0.024) is reported without effect size, coefficient table, or model equation, preventing assessment of whether the interaction is load-bearing for the central scaling claim.
minor comments (1)
  1. [Methods] Clarify the exact number of models per size bin and any exclusion criteria applied to the 75 configurations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We have carefully considered each point and made revisions to enhance the transparency of our methods and statistical reporting. Below, we provide point-by-point responses.

  1. Referee: [Abstract] Abstract: the power-law claim rests on R²=0.50; the manuscript must specify the exact distance metric for D, whether S was log-transformed before fitting, outlier handling, and the full mixed-effects model specification (random effects, covariance structure).

    Authors: We agree that these details are essential. In the revised version, we will explicitly state in the abstract and methods that D represents the mean absolute deviation from aggregated human preference proportions across Moral Machine scenarios. The power-law was fitted using log-transformed S (model parameters) and log(D) via linear regression. No outliers were removed from the analysis. The mixed-effects model is specified as D ~ log(S) + reasoning + (1 | model_family), using a random intercept for model family with an unstructured covariance matrix. revision: yes
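The fitting procedure described in this response (linear regression of log D on log S) can be sketched with the standard library alone. The data below are synthetic and noise-free, generated from an assumed prefactor of 1.2 purely to show that the procedure recovers the exponent. They are not the paper's measurements, and the function name is illustrative.

```python
import math


def fit_power_law(sizes, distances):
    """Least-squares fit of log10(D) on log10(S).

    Returns (exponent, prefactor, r_squared) for D = prefactor * S^exponent.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(d) for d in distances]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, 10 ** intercept, r_squared


# Synthetic sizes in billions of parameters, spanning the tested range.
sizes_b = [0.27, 1, 7, 70, 400, 1000]
# Noise-free distances from an assumed law D = 1.2 * S^(-0.10).
dists = [1.2 * s ** -0.10 for s in sizes_b]

exponent, prefactor, r2 = fit_power_law(sizes_b, dists)
print(f"exponent = {exponent:.3f}, prefactor = {prefactor:.3f}, R^2 = {r2:.3f}")
```

On real data the scatter around the line would lower R² (the paper reports 0.50), and the random intercept for model family would require a mixed-effects fit rather than plain least squares.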

  2. Referee: [Results] Results (mixed-effects analysis): no explicit checks are reported for prompt paraphrasing invariance or overlap between Moral Machine scenarios and pretraining corpora; without these, residual variance could reflect training-data artifacts rather than stable moral judgment scaling.

    Authors: We acknowledge this as a potential limitation. The original study did not include explicit prompt paraphrasing invariance tests or pretraining corpus overlap analyses. We argue that the Moral Machine dilemmas are abstract and unlikely to be directly memorized, and the scaling relationship persists across diverse model families, which helps control for training data differences. In the revision, we will add a paragraph in the discussion section addressing this concern and suggesting it as an avenue for future work. revision: partial

  3. Referee: [Abstract] Abstract: the size×reasoning interaction (p=0.024) is reported without effect size, coefficient table, or model equation, preventing assessment of whether the interaction is load-bearing for the central scaling claim.

    Authors: We will expand the reporting in the revised manuscript. We will include the full model equation: D ~ log(S) * reasoning + (1 | model_family), report the interaction coefficient (β = -0.05, SE = 0.02, p = 0.024), and provide a supplementary table with all fixed and random effects coefficients to allow full evaluation of the interaction's role. revision: yes
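Written out, the formula D ~ log(S) * reasoning + (1 | model_family) corresponds to the following model. The 0/1 coding of the reasoning indicator $R$ and the index notation are assumptions made here for illustration; the response does not state them.

```latex
\[
D_{ij} = \beta_0 + \beta_1 \log S_{ij} + \beta_2 R_{ij}
       + \beta_3 \,(\log S_{ij} \times R_{ij}) + u_j + \varepsilon_{ij},
\qquad
u_j \sim \mathcal{N}(0, \sigma_u^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
\[
\text{size slope: } \beta_1 \ (R = 0), \quad \beta_1 + \beta_3 \ (R = 1);
\qquad
\text{reasoning effect at size } S:\ \beta_2 + \beta_3 \log S
\]
```

Here $i$ indexes model configurations and $j$ model families. Under this coding the interaction coefficient $\beta_3$ is the difference in size slope between reasoning and non-reasoning configurations, so judging how the reasoning advantage changes with size requires $\beta_2$ and the centering of $\log S$ as well as $\beta_3$.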

Circularity Check

0 steps flagged

No circularity: empirical power-law fit to independently measured distances


The central claim is an observed scaling D ∝ S^{-0.10±0.01} obtained by evaluating 75 distinct model configurations on the Moral Machine task, computing alignment distance D to human preferences for each, and performing a regression on the resulting (S, D) pairs. This is a direct empirical measurement followed by statistical fitting; the power-law exponent is not defined from the same data in a way that forces the result by construction, nor does any step invoke self-citation, ansatz smuggling, or renaming of a known result as a derivation. Mixed-effects controls for family and reasoning are likewise post-hoc statistical adjustments on the measured values. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that Moral Machine choices are a valid proxy for moral judgment and that model size is the dominant variable after family and reasoning controls; the exponent itself is a fitted parameter.

free parameters (1)
  • scaling exponent = -0.10
    Fitted coefficient -0.10 obtained by regressing observed distance D against model size S.
axioms (1)
  • domain assumption: Moral Machine responses provide a stable, generalizable measure of alignment with human moral preferences
    The framework is treated as the ground-truth benchmark without further validation in the abstract.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

    cs.AI 2026-04 unverdicted novelty 5.0

    LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not y...
