
arxiv: 2604.21871 · v1 · submitted 2026-04-23 · 💻 cs.CL

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Pith reviewed 2026-05-09 21:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM moral decisions · whistleblower dilemma · relational closeness · fairness vs loyalty · prescriptive norms · machine behavior · AI ethics · model alignment

The pith

Large language models base moral decisions on fixed fairness rules rather than their own predictions of human loyalty in close relationships.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests large language models using the whistleblower's dilemma, in which a person decides whether to report a crime committed by someone close to them. Researchers vary the seriousness of the crime and the closeness of the relationship to compare three perspectives: what counts as morally right, what the models expect actual humans would do, and what the models choose when deciding themselves. Moral rightness judgments stay focused on fairness regardless of relationship closeness. Predicted human behavior shifts toward loyalty when the relationship is close. Model decisions nevertheless follow the fairness judgments instead of the loyalty predictions they generate.

Core claim

In the Whistleblower's Dilemma with variations in crime severity and relational closeness, moral rightness judgments remain consistently fairness-oriented. Predicted human behavior shifts significantly toward loyalty as relational closeness increases. Model decisions align with moral rightness judgments rather than their own behavioral predictions, indicating that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.

What carries the argument

The Whistleblower's Dilemma experiment that manipulates crime severity and relational closeness to compare moral rightness judgments, predicted human behavior, and autonomous model decisions.
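
The design can be pictured as a full factorial grid over the two manipulated factors and the three perspectives. The sketch below is a minimal illustration under stated assumptions, not the authors' code: the four severity labels follow the levels named in the paper's Table 5 examples, while the closeness levels and the perspective identifiers are hypothetical placeholders chosen for this example.

```python
from itertools import product

# Severity labels as named in the paper's Table 5 examples.
SEVERITY = ["minor", "moderate", "major", "critical"]
# Hypothetical closeness levels; the paper's actual levels are not stated here.
CLOSENESS = ["stranger", "acquaintance", "friend", "family"]
# Hypothetical identifiers for the three perspectives the study compares.
PERSPECTIVES = ["moral_rightness", "predicted_human", "model_decision"]


def build_conditions():
    """Return every (severity, closeness, perspective) combination."""
    return [
        {"severity": s, "closeness": c, "perspective": p}
        for s, c, p in product(SEVERITY, CLOSENESS, PERSPECTIVES)
    ]


conditions = build_conditions()
print(len(conditions))  # 4 severities * 4 closeness levels * 3 perspectives
```

Crossing every factor level with every perspective is what allows the same scenario to be compared across the three framings.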

If this is right

  • LLM decisions may not reflect the descriptive social expectations they can internally generate.
  • Decision-support systems using LLMs could enforce fairness at the expense of relational considerations.
  • The observed gap between world-modeling and decision output may produce misalignments in real-world deployments involving personal relationships.
  • LLMs appear to treat prescriptive norms as more authoritative than the social patterns they predict for humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pattern generalizes, LLMs may prove more reliable for impartial advisory roles than for tasks requiring simulation of relationship-based human responses.
  • Training approaches that explicitly reward consistency between a model's behavioral predictions and its final decisions could narrow the observed divergence.
  • Testing the same three-perspective design on other relational dilemmas or across different model sizes would clarify whether the prescriptive bias is architecture-specific.

Load-bearing premise

Varying crime severity and relational closeness in prompts sufficiently isolates the effects on moral judgments without confounding factors from prompt engineering or model training data biases.

What would settle it

If model decisions shifted toward loyalty-based outcomes in high relational closeness conditions instead of staying aligned with fairness-oriented moral rightness judgments, the claim of prescriptive prioritization would be undermined.
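
One way to make this test concrete is to compare the per-closeness reporting ratios of model decisions against those of the other two perspectives. The sketch below is an illustration only: the function names are hypothetical, the distance measure (mean absolute gap) is an assumption rather than the paper's statistic, and the numbers are placeholders, not the paper's results.

```python
from statistics import mean


def reporting_ratio(responses):
    """Fraction of responses that choose to report (True means report)."""
    return sum(responses) / len(responses)


def mean_abs_gap(a, b):
    """Mean absolute difference between two per-level ratio lists."""
    return mean(abs(x - y) for x, y in zip(a, b))


def decision_tracks(rightness, predicted, decision):
    """Name the perspective whose ratios the decision ratios sit closer to."""
    gap_right = mean_abs_gap(decision, rightness)
    gap_pred = mean_abs_gap(decision, predicted)
    return "moral_rightness" if gap_right <= gap_pred else "predicted_human"


# Placeholder ratios per closeness level (low to high); NOT the paper's data.
rightness = [0.95, 0.94, 0.93, 0.92]
predicted = [0.90, 0.75, 0.50, 0.30]
decision = [0.94, 0.92, 0.90, 0.88]

print(decision_tracks(rightness, predicted, decision))
```

Under this reading, the claim would be undermined if the function returned "predicted_human" on real data from the high-closeness conditions.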

Figures

Figures reproduced from arXiv: 2604.21871 by Jaehong Kim, Jea Kwon, Jiseon Kim, Luiz Felipe Vecchietti, Meeyoung Cha, Wenchao Dong.

Figure 1: Illustration of the Whistleblower’s Dilemma.
Figure 2: Divergent moral landscapes in LLM reporting behavior. Heatmaps illustrate the reporting ratios across ...
Figure 3: Distribution of reporting ratios across models and perspectives. This plot aggregates the distributions of ...
Figure 4: Moral ratio distributions across perspectives.
Figure 5: Moral ratio differences by model responses.
Figure 6: Estimated effects of contextual changes on moral-foundation ratios across perspectives. The top row ...
Figure 7: Moral ratios by model responses (Report vs. ...
Figure 8: Reporting ratios across different levels of crime severity and relational closeness for all models.
Original abstract

Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines machine behavior in relational moral dilemmas using the Whistleblower's Dilemma paradigm. It varies crime severity and relational closeness across three perspectives—moral rightness judgments (prescriptive norms), predicted human behavior (descriptive expectations), and autonomous model decisions—finding that moral rightness remains fairness-oriented while predicted human behavior shifts toward loyalty with increasing closeness, but model decisions align with moral rightness rather than their own behavioral predictions.

Significance. If the central divergence holds after controls, the work provides a useful probe into how LLMs separate prescriptive rules from descriptive social modeling, with direct relevance to alignment and deployment in decision-support roles. The paradigm itself is a clear strength for isolating these dimensions.

major comments (1)
  1. [Experimental design] The three perspectives are elicited via distinct linguistic frames (e.g., 'what is morally right' vs. 'what would humans do' vs. 'what would you decide') while crime severity and relational closeness are varied at the same time, and the paper (as described in the abstract and methods outline) reports no prompt-ablation studies, matched-template controls, or fixed-phrasing baselines. The reported alignment of model decisions with moral-rightness prompts could therefore be driven by surface prompt regularities rather than internal world-modeling, which directly undermines the claim that LLMs 'prioritize rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.'
minor comments (2)
  1. The abstract and summary provide no details on the specific LLMs tested, number of trials per condition, statistical tests used, or inter-rater reliability for reasoning-process analysis.
  2. Figure or table captions (if present) should explicitly state the exact prompt templates for each perspective to allow replication.
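
A matched-template control of the kind the major comment asks for could hold the scenario text fixed and swap only the final question. The sketch below is one hypothetical reading of that control: the scenario wording, the question phrasings, and the function names are all assumptions made for illustration and are not the paper's prompts. The crime description reuses the minor fraud example from the paper's Table 5.

```python
# Hypothetical shared scenario; only the closing question varies by perspective.
SCENARIO = (
    "You learn that your {relation} committed the following offence: {crime}. "
    "No one else knows about it."
)

# Hypothetical question frames, one per perspective.
QUESTIONS = {
    "moral_rightness": "What is the morally right thing to do: report or not report?",
    "predicted_human": "What would most people do: report or not report?",
    "model_decision": "What would you decide: report or not report?",
}


def build_prompt(perspective, relation, crime):
    """Compose the fixed scenario body followed by the perspective question."""
    body = SCENARIO.format(relation=relation, crime=crime)
    return f"{body}\n{QUESTIONS[perspective]}"


def shared_body(prompt):
    """Return everything before the final question line."""
    return prompt.rsplit("\n", 1)[0]


crime = "padding a travel expense by $20 on a small receipt"
prompts = {p: build_prompt(p, "close friend", crime) for p in QUESTIONS}

# The scenario body should be identical across all three perspectives.
print(len({shared_body(text) for text in prompts.values()}))
```

Keeping the body byte-identical means any difference in model output can be attributed to the question frame alone, which is the comparison the referee's objection turns on.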

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript examining LLM behavior in relational moral dilemmas. We agree that the experimental design would benefit from additional controls to address potential prompt-related confounds, and we will revise accordingly. We address the major comment below.

Point-by-point responses
  1. Referee: [Experimental design] The three perspectives are elicited via distinct linguistic frames (e.g., 'what is morally right' vs. 'what would humans do' vs. 'what would you decide') while crime severity and relational closeness are varied at the same time, and the paper (as described in the abstract and methods outline) reports no prompt-ablation studies, matched-template controls, or fixed-phrasing baselines. The reported alignment of model decisions with moral-rightness prompts could therefore be driven by surface prompt regularities rather than internal world-modeling, which directly undermines the claim that LLMs 'prioritize rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.'

    Authors: We appreciate the referee highlighting this potential methodological concern. Distinct linguistic frames are required to operationalize the three theoretically distinct perspectives (prescriptive norms, descriptive expectations, and autonomous decisions). Critically, our results demonstrate differential sensitivity to the manipulated factors: the predicted-human-behavior perspective shifts toward loyalty as relational closeness increases, while model decisions remain aligned with the fairness-oriented moral-rightness perspective across all levels of closeness and crime severity. This interaction between perspective and experimental condition would be difficult to explain as a pure surface-prompt artifact, as the same varying factors (severity and closeness) produce divergent patterns depending on the elicited perspective. Nevertheless, we acknowledge that explicit ablations would further strengthen the interpretation. In the revised manuscript we will add prompt-ablation studies, matched-template controls, and fixed-phrasing baselines to isolate the contribution of internal world-modeling from prompt regularities.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons of LLM outputs across prompt conditions

Full rationale

The paper performs a direct experimental comparison of LLM responses to three distinct prompt framings (moral rightness, predicted human behavior, model decision) while varying crime severity and relational closeness. No equations, parameter fitting, or first-principles derivations are present; the central claim that model decisions align with moral-rightness judgments follows from tabulated output distributions rather than any reduction to inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The analysis is therefore self-contained against external benchmarks of observed model behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical behavioral study on AI models with standard assumptions about interpreting LLM outputs as proxies for internal states.

axioms (1)
  • Domain assumption: prompt-based responses from LLMs can be interpreted as reflecting distinct moral perspectives (prescriptive, descriptive, autonomous). This is central to the three-perspective analysis.

pith-pipeline@v0.9.0 · 5489 in / 1230 out tokens · 61897 ms · 2026-05-09T21:55:28.300144+00:00 · methodology

