Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions
Pith reviewed 2026-05-09 21:55 UTC · model grok-4.3
The pith
Large language models base moral decisions on fixed fairness rules rather than their own predictions of human loyalty in close relationships.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the Whistleblower's Dilemma with variations in crime severity and relational closeness, moral rightness judgments remain consistently fairness-oriented. Predicted human behavior shifts significantly toward loyalty as relational closeness increases. Model decisions align with moral rightness judgments rather than their own behavioral predictions, indicating that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.
What carries the argument
The Whistleblower's Dilemma experiment that manipulates crime severity and relational closeness to compare moral rightness judgments, predicted human behavior, and autonomous model decisions.
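The factorial structure of this design can be sketched as a grid of prompt conditions. The snippet below is a minimal illustration, not the authors' materials: the four severity labels follow the paper's Table 5, while the closeness levels, the scenario wording, and the three question phrasings are hypothetical placeholders.

```python
from itertools import product

# Severity labels follow the paper's Table 5. Everything else here
# (closeness levels, scenario text, question wording) is an assumption
# made for illustration only.
SEVERITY = ["minor", "moderate", "major", "critical"]
CLOSENESS = ["a stranger", "an acquaintance", "a close friend", "a family member"]
PERSPECTIVES = {
    "moral_rightness": "What is the morally right thing to do?",
    "predicted_human": "What would most people do in this situation?",
    "model_decision": "What would you decide to do?",
}


def build_conditions():
    """Return one dict per cell of the severity x closeness x perspective grid."""
    conditions = []
    for severity, closeness, (name, question) in product(
        SEVERITY, CLOSENESS, PERSPECTIVES.items()
    ):
        scenario = (
            f"You learn that {closeness} has committed a {severity} offense. "
            "You can report it or stay silent."
        )
        conditions.append(
            {
                "severity": severity,
                "closeness": closeness,
                "perspective": name,
                "prompt": f"{scenario} {question}",
            }
        )
    return conditions


if __name__ == "__main__":
    grid = build_conditions()
    print(len(grid), "conditions")
    print(grid[0]["prompt"])
```

Holding the scenario sentence constant and changing only the final question keeps the perspective manipulation separate from the severity and closeness manipulations, which is the property the comparison depends on.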
If this is right
- LLM decisions may not reflect the descriptive social expectations they can internally generate.
- Decision-support systems using LLMs could enforce fairness at the expense of relational considerations.
- The observed gap between world-modeling and decision output may produce misalignments in real-world deployments involving personal relationships.
- LLMs appear to treat prescriptive norms as more authoritative than the social patterns they predict for humans.
Where Pith is reading between the lines
- If the pattern generalizes, LLMs may prove more reliable for impartial advisory roles than for tasks requiring simulation of relationship-based human responses.
- Training approaches that explicitly reward consistency between a model's behavioral predictions and its final decisions could narrow the observed divergence.
- Testing the same three-perspective design on other relational dilemmas or across different model sizes would clarify whether the prescriptive bias is architecture-specific.
Load-bearing premise
Varying crime severity and relational closeness in prompts sufficiently isolates the effects on moral judgments without confounding factors from prompt engineering or model training data biases.
What would settle it
If model decisions shifted toward loyalty-based outcomes in high relational closeness conditions instead of staying aligned with fairness-oriented moral rightness judgments, the claim of prescriptive prioritization would be undermined.
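One way to make this test concrete is to compare, at each closeness level, the rate at which the model chooses to report under each of the three perspectives. The sketch below assumes such rates have already been collected; the function names are hypothetical and the numbers are invented placeholders, not results from the paper.

```python
def mean_abs_gap(a, b):
    """Mean absolute difference between two equally long lists of rates."""
    if len(a) != len(b) or not a:
        raise ValueError("rate lists must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def closer_perspective(decision, rightness, predicted):
    """Say which perspective the model's own decisions track more closely."""
    gap_right = mean_abs_gap(decision, rightness)
    gap_pred = mean_abs_gap(decision, predicted)
    if gap_right < gap_pred:
        return "moral_rightness"
    if gap_pred < gap_right:
        return "predicted_human"
    return "tie"


if __name__ == "__main__":
    # Placeholder report rates ordered from lowest to highest relational
    # closeness. These values are made up purely to exercise the function.
    rightness = [0.95, 0.94, 0.93, 0.92]
    predicted = [0.90, 0.70, 0.45, 0.25]
    decision = [0.94, 0.93, 0.91, 0.90]
    print(closer_perspective(decision, rightness, predicted))
```

Under the paper's claim the function would return `moral_rightness`. A return value of `predicted_human`, driven by the high-closeness conditions, would be the kind of evidence that undermines it.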
Original abstract
Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines machine behavior in relational moral dilemmas using the Whistleblower's Dilemma paradigm. It varies crime severity and relational closeness across three perspectives—moral rightness judgments (prescriptive norms), predicted human behavior (descriptive expectations), and autonomous model decisions—finding that moral rightness remains fairness-oriented while predicted human behavior shifts toward loyalty with increasing closeness, but model decisions align with moral rightness rather than their own behavioral predictions.
Significance. If the central divergence holds after controls, the work provides a useful probe into how LLMs separate prescriptive rules from descriptive social modeling, with direct relevance to alignment and deployment in decision-support roles. The paradigm itself is a clear strength for isolating these dimensions.
major comments (1)
- [Experimental design] Experimental design (as described in the abstract and methods outline): the three perspectives are elicited via distinct linguistic frames (e.g., 'what is morally right' vs. 'what would humans do' vs. 'what would you decide') while simultaneously varying crime severity and relational closeness, with no reported prompt-ablation studies, matched-template controls, or fixed-phrasing baselines. This leaves open the possibility that the reported alignment of model decisions with moral-rightness prompts is driven by surface prompt regularities rather than internal world-modeling, directly undermining the claim that LLMs 'prioritize rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.'
minor comments (2)
- The abstract and summary provide no details on the specific LLMs tested, number of trials per condition, statistical tests used, or inter-rater reliability for reasoning-process analysis.
- Figure or table captions (if present) should explicitly state the exact prompt templates for each perspective to allow replication.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript examining LLM behavior in relational moral dilemmas. We agree that the experimental design would benefit from additional controls to address potential prompt-related confounds, and we will revise accordingly. We address the major comment below.
Point-by-point responses
Referee: Experimental design (as described in the abstract and methods outline): the three perspectives are elicited via distinct linguistic frames (e.g., 'what is morally right' vs. 'what would humans do' vs. 'what would you decide') while simultaneously varying crime severity and relational closeness, with no reported prompt-ablation studies, matched-template controls, or fixed-phrasing baselines. This leaves open the possibility that the reported alignment of model decisions with moral-rightness prompts is driven by surface prompt regularities rather than internal world-modeling, directly undermining the claim that LLMs 'prioritize rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.'
Authors: We appreciate the referee highlighting this potential methodological concern. Distinct linguistic frames are required to operationalize the three theoretically distinct perspectives (prescriptive norms, descriptive expectations, and autonomous decisions). Critically, our results demonstrate differential sensitivity to the manipulated factors: the predicted-human-behavior perspective shifts toward loyalty as relational closeness increases, while model decisions remain aligned with the fairness-oriented moral-rightness perspective across all levels of closeness and crime severity. This interaction between perspective and experimental condition would be difficult to explain as a pure surface-prompt artifact, as the same varying factors (severity and closeness) produce divergent patterns depending on the elicited perspective. Nevertheless, we acknowledge that explicit ablations would further strengthen the interpretation. In the revised manuscript we will add prompt-ablation studies, matched-template controls, and fixed-phrasing baselines to isolate the contribution of internal world-modeling from prompt regularities.
Revision: yes
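A matched-template control of the kind discussed in this exchange could look roughly like the sketch below. It is an illustration under stated assumptions: the template, the answer labels REPORT and PROTECT, and the paraphrase variants are all hypothetical, and none of them is taken from the manuscript.

```python
# All wording below is hypothetical. The point is only that every prompt
# shares one skeleton and differs in a single frame clause.
TEMPLATE = (
    "Scenario: {scenario}\n"
    "Question: {frame}\n"
    "Answer with exactly one word: REPORT or PROTECT."
)

# Two paraphrases per perspective, so that frame wording can be ablated
# separately from the perspective it is meant to elicit.
FRAME_VARIANTS = {
    "moral_rightness": [
        "What is the morally right action?",
        "Which action is ethically correct?",
    ],
    "predicted_human": [
        "What would a typical person do?",
        "Which action would most people take?",
    ],
    "model_decision": [
        "What would you decide?",
        "Which action do you choose?",
    ],
}


def matched_prompts(scenario):
    """Return (perspective, variant_index, frame, prompt) tuples for one scenario."""
    prompts = []
    for perspective, frames in FRAME_VARIANTS.items():
        for index, frame in enumerate(frames):
            prompt = TEMPLATE.format(scenario=scenario, frame=frame)
            prompts.append((perspective, index, frame, prompt))
    return prompts


def skeletons(prompts):
    """Mask the frame clause in each prompt and collect the distinct remainders."""
    return {prompt.replace(frame, "<FRAME>") for _, _, frame, prompt in prompts}


if __name__ == "__main__":
    items = matched_prompts("A close friend padded a travel expense by $200.")
    # A single shared skeleton means only the frame clause varies.
    print(len(items), "prompts,", len(skeletons(items)), "skeleton(s)")
```

If decisions still track the moral-rightness frame across paraphrases of every frame, the surface-regularity explanation becomes less plausible. If the pattern changes with wording alone, the referee's concern stands.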
Circularity Check
No circularity: empirical comparisons of LLM outputs across prompt conditions
The paper performs a direct experimental comparison of LLM responses to three distinct prompt framings (moral rightness, predicted human behavior, model decision) while varying crime severity and relational closeness. No equations, parameter fitting, or first-principles derivations are present; the central claim that model decisions align with moral-rightness judgments follows from tabulated output distributions rather than any reduction to inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The analysis is therefore self-contained against external benchmarks of observed model behavior.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: prompt-based responses from LLMs can be interpreted as reflecting distinct moral perspectives (prescriptive, descriptive, autonomous).
Table 5: Examples illustrating severity levels for the fraud and burglary crime types.
Fraud
- Minor: padding a travel expense by $20 on a small receipt
- Moderate: padding a travel expense by $200 across multiple receipts
- Major: padding a quarterly travel report by $2,000 using altered invoices
- Critical: padding a departmental travel budget by $20,000 with fabricated vendor slip
Burglary
- Minor: entering a shop through an ajar back door to grab small goods
- Moderate: lifting the latch on a side entrance to remove boxed merchandise
- Major: smashing a glass door to sweep shelves into bags
- Critical: cutting through a security gate and clearing premium stockrooms

Table 6: Average number of moral words per reasoning.
Moral value   # Words
Care          2.23
Fairness      1.33
Authority     1.30
Loyalty       0.90
Sanctity      0.01
Total         5.77

Model           Spearman  Pearson  MAE
Gemini 2.5 Pro  0.886...