Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Jaehong Kim; Jea Kwon; Jiseon Kim; Luiz Felipe Vecchietti; Meeyoung Cha; Wenchao Dong

arxiv: 2604.21871 · v1 · submitted 2026-04-23 · 💻 cs.CL

Pith reviewed 2026-05-09 21:55 UTC · model grok-4.3

keywords: LLM moral decisions · whistleblower dilemma · relational closeness · fairness vs loyalty · prescriptive norms · machine behavior · AI ethics · model alignment

The pith

Large language models base moral decisions on fixed fairness rules rather than their own predictions of human loyalty in close relationships.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests large language models using the whistleblower's dilemma, in which a person decides whether to report a crime committed by someone close to them. Researchers vary the seriousness of the crime and the closeness of the relationship to compare three perspectives: what counts as morally right, what the models expect actual humans would do, and what the models choose when deciding themselves. Moral rightness judgments stay focused on fairness regardless of relationship closeness. Predicted human behavior shifts toward loyalty when the relationship is close. Model decisions nevertheless follow the fairness judgments instead of the loyalty predictions they generate.

Core claim

In the Whistleblower's Dilemma with variations in crime severity and relational closeness, moral rightness judgments remain consistently fairness-oriented. Predicted human behavior shifts significantly toward loyalty as relational closeness increases. Model decisions align with moral rightness judgments rather than their own behavioral predictions, indicating that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.

What carries the argument

The Whistleblower's Dilemma experiment that manipulates crime severity and relational closeness to compare moral rightness judgments, predicted human behavior, and autonomous model decisions.
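
The two-factor, three-perspective design described above can be pictured as a condition grid. The following is a minimal sketch, not the authors' code: the severity labels follow the paper's Table 5 examples, while the closeness labels, the perspective keys, and the function name `build_conditions` are hypothetical placeholders chosen for illustration.

```python
from itertools import product

# Severity labels as they appear in the paper's Table 5 examples.
SEVERITY = ["minor", "moderate", "major", "critical"]
# Hypothetical closeness levels (assumption: the actual levels are not listed here).
CLOSENESS = ["stranger", "acquaintance", "friend", "family"]
# The three perspectives the paper compares (key names are illustrative).
PERSPECTIVES = ["moral_rightness", "predicted_human", "model_decision"]


def build_conditions():
    """Return every (perspective, severity, closeness) cell of the design."""
    return [
        {"perspective": p, "severity": s, "closeness": c}
        for p, s, c in product(PERSPECTIVES, SEVERITY, CLOSENESS)
    ]


conditions = build_conditions()
print(len(conditions))  # 3 perspectives * 4 severities * 4 closeness levels
```

Enumerating the cells up front makes it easy to check that every perspective is queried under exactly the same severity and closeness combinations.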

If this is right

LLM decisions may not reflect the descriptive social expectations they can internally generate.
Decision-support systems using LLMs could enforce fairness at the expense of relational considerations.
The observed gap between world-modeling and decision output may produce misalignments in real-world deployments involving personal relationships.
LLMs appear to treat prescriptive norms as more authoritative than the social patterns they predict for humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern generalizes, LLMs may prove more reliable for impartial advisory roles than for tasks requiring simulation of relationship-based human responses.
Training approaches that explicitly reward consistency between a model's behavioral predictions and its final decisions could narrow the observed divergence.
Testing the same three-perspective design on other relational dilemmas or across different model sizes would clarify whether the prescriptive bias is architecture-specific.

Load-bearing premise

Varying crime severity and relational closeness in prompts sufficiently isolates the effects on moral judgments without confounding factors from prompt engineering or model training data biases.

What would settle it

If model decisions shifted toward loyalty-based outcomes in high relational closeness conditions instead of staying aligned with fairness-oriented moral rightness judgments, the claim of prescriptive prioritization would be undermined.
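
One hedged way to operationalize this test is to compare how far the model-decision reporting ratios sit from each of the other two perspectives across closeness levels. The sketch below assumes a simple mean absolute gap as the distance measure, which is an illustrative choice rather than the paper's statistic; the function names are hypothetical and the numbers are synthetic placeholders, not results from the paper.

```python
def reporting_ratio(choices):
    """Fraction of responses that chose to report (True = report)."""
    return sum(choices) / len(choices)


def mean_abs_gap(a, b):
    """Mean absolute difference between two ratio curves over the same levels."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def decisions_track_rightness(rightness, predicted, decision):
    """True when the decision curve is closer to rightness than to prediction."""
    return mean_abs_gap(decision, rightness) < mean_abs_gap(decision, predicted)


# Synthetic curves ordered from low to high closeness (NOT data from the paper).
rightness = [0.9, 0.9, 0.9, 0.9]
predicted = [0.9, 0.7, 0.5, 0.3]
decision = [0.9, 0.9, 0.8, 0.8]

print(decisions_track_rightness(rightness, predicted, decision))
# The falsifying pattern: decisions that follow the loyalty-shifted prediction.
print(decisions_track_rightness(rightness, predicted, predicted))
```

Under this framing, the claim is undermined whenever the second pattern, rather than the first, describes the observed decisions in high-closeness conditions.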

Figures

Figures reproduced from arXiv: 2604.21871 by Jaehong Kim, Jea Kwon, Jiseon Kim, Luiz Felipe Vecchietti, Meeyoung Cha, Wenchao Dong.

**Figure 2.** Divergent moral landscapes in LLM reporting behavior. Heatmaps illustrate the reporting ratios across ...

**Figure 3.** Distribution of reporting ratios across models and perspectives. This plot aggregates the distributions of ...

**Figure 4.** Moral ratio distributions across perspectives

**Figure 5.** Moral ratio differences by model responses

**Figure 6.** Estimated effects of contextual changes on moral-foundation ratios across perspectives. The top row ...

**Figure 7.** Moral ratios by model responses (Report vs. ...

**Figure 8.** Reporting ratios across different levels of crime severity and relational closeness for all models.

Original abstract

Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds LLMs sticking to moral rightness over their own human-behavior predictions in relational dilemmas, but different prompt frames could be producing that split.

The main point is that in the Whistleblower's Dilemma, model decisions track prescriptive moral rightness judgments even as predicted human behavior shifts toward loyalty when relational closeness increases. Moral rightness stays fairness-oriented across conditions while the other two perspectives diverge, and the models land on the rightness side. That three-way comparison is the new piece here.

They vary crime severity and closeness as the two experimental factors and query the three perspectives separately, which gives a direct look at how context affects each angle. This extends earlier LLM moral judgment work by bringing in interpersonal relationships, which is a sensible addition given how real decisions often hinge on who is involved. The abstract presents the setup clearly enough to see the intended logic.

The soft spot is exactly the one the stress-test flags. The three perspectives use distinct linguistic frames, and LLMs are known to respond to directive versus descriptive wording even when the underlying scenario is the same. Without prompt ablations or matched templates that hold the content fixed while changing only the perspective, it is difficult to attribute the divergence to internal world-modeling rather than surface phrasing effects. The abstract also omits which models were tested, how many trials per condition, and what statistical checks were run, so the reliability of the reported patterns is hard to judge from the summary alone. If the full methods include those controls and report consistent results across models, the finding would land more solidly.

This is the sort of paper that would interest people working on AI alignment and decision-support systems. A reader who wants to see how social context shows up in model outputs would get a useful starting point, though anyone needing tight evidence on misalignment mechanisms would want the prompt controls tightened first. It deserves peer review. The question is live and the design is straightforward; referees can push on the framing issue and the missing details without the work being fundamentally broken.

Referee Report

1 major / 2 minor

Summary. The paper examines machine behavior in relational moral dilemmas using the Whistleblower's Dilemma paradigm. It varies crime severity and relational closeness across three perspectives—moral rightness judgments (prescriptive norms), predicted human behavior (descriptive expectations), and autonomous model decisions—finding that moral rightness remains fairness-oriented while predicted human behavior shifts toward loyalty with increasing closeness, but model decisions align with moral rightness rather than their own behavioral predictions.

Significance. If the central divergence holds after controls, the work provides a useful probe into how LLMs separate prescriptive rules from descriptive social modeling, with direct relevance to alignment and deployment in decision-support roles. The paradigm itself is a clear strength for isolating these dimensions.

major comments (1)

[Experimental design] As described in the abstract and methods outline, the three perspectives are elicited via distinct linguistic frames (e.g., 'what is morally right' vs. 'what would humans do' vs. 'what would you decide') while crime severity and relational closeness are varied at the same time, with no reported prompt-ablation studies, matched-template controls, or fixed-phrasing baselines. This leaves open the possibility that the reported alignment of model decisions with moral-rightness prompts is driven by surface prompt regularities rather than internal world-modeling, directly undermining the claim that LLMs 'prioritize rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.'
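
A matched-template control of the kind requested here could look roughly like the sketch below: the scenario text is held fixed and only the final question changes with the perspective. This is an illustration under stated assumptions, not the paper's prompts. The scenario wording, the three question strings, and the name `matched_prompts` are all hypothetical; only the offense text is borrowed from the paper's Table 5 examples.

```python
# Hypothetical shared scenario; the placeholders carry the two manipulated factors.
SCENARIO = (
    "Your {relation} has committed the following offense: {offense}. "
    "You are the only witness."
)

# Hypothetical perspective questions; only this final clause differs per prompt.
QUESTIONS = {
    "moral_rightness": "What is the morally right thing to do: report or not report?",
    "predicted_human": "What would a typical person do: report or not report?",
    "model_decision": "What would you decide: report or not report?",
}


def matched_prompts(relation, offense):
    """Build one prompt per perspective from an identical scenario prefix."""
    prefix = SCENARIO.format(relation=relation, offense=offense)
    return {name: f"{prefix} {question}" for name, question in QUESTIONS.items()}


offense = "padding a travel expense by $200 across multiple receipts"
prompts = matched_prompts("close friend", offense)
shared = SCENARIO.format(relation="close friend", offense=offense)

# Every prompt starts with the same scenario text, so any difference in model
# output can only come from the perspective question itself.
print(all(p.startswith(shared) for p in prompts.values()))
```

Holding the prefix byte-identical is what lets an ablation separate the effect of the perspective wording from the effect of the scenario content.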

minor comments (2)

The abstract and summary provide no details on the specific LLMs tested, number of trials per condition, statistical tests used, or inter-rater reliability for reasoning-process analysis.
Figure or table captions (if present) should explicitly state the exact prompt templates for each perspective to allow replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript examining LLM behavior in relational moral dilemmas. We agree that the experimental design would benefit from additional controls to address potential prompt-related confounds, and we will revise accordingly. We address the major comment below.

Point-by-point responses

Referee: Experimental design (as described in the abstract and methods outline): the three perspectives are elicited via distinct linguistic frames (e.g., 'what is morally right' vs. 'what would humans do' vs. 'what would you decide') while simultaneously varying crime severity and relational closeness, with no reported prompt-ablation studies, matched-template controls, or fixed-phrasing baselines. This leaves open the possibility that the reported alignment of model decisions with moral-rightness prompts is driven by surface prompt regularities rather than internal world-modeling, directly undermining the claim that LLMs 'prioritize rigid, prescriptive rules over the social sensitivity present in their internal world-modeling.'

Authors: We appreciate the referee highlighting this potential methodological concern. Distinct linguistic frames are required to operationalize the three theoretically distinct perspectives (prescriptive norms, descriptive expectations, and autonomous decisions). Critically, our results demonstrate differential sensitivity to the manipulated factors: the predicted-human-behavior perspective shifts toward loyalty as relational closeness increases, while model decisions remain aligned with the fairness-oriented moral-rightness perspective across all levels of closeness and crime severity. This interaction between perspective and experimental condition would be difficult to explain as a pure surface-prompt artifact, as the same varying factors (severity and closeness) produce divergent patterns depending on the elicited perspective. Nevertheless, we acknowledge that explicit ablations would further strengthen the interpretation. In the revised manuscript we will add prompt-ablation studies, matched-template controls, and fixed-phrasing baselines to isolate the contribution of internal world-modeling from prompt regularities.

revision: yes
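
The rebuttal's argument rests on an interaction: the same closeness manipulation moves the reporting ratio under one perspective but not under another. A hedged way to make that concrete is to compare least-squares slopes of reporting ratio against closeness level. The sketch below is illustrative only: the slope comparison is an assumed stand-in for whatever statistical model the paper uses, the function names are hypothetical, and the numbers are synthetic placeholders rather than reported results.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def slope_gap(levels, ratios_a, ratios_b):
    """Difference in closeness slopes between two perspectives."""
    return ols_slope(levels, ratios_a) - ols_slope(levels, ratios_b)


# Closeness coded as ordinal levels; ratios are synthetic (NOT data from the paper).
closeness = [0, 1, 2, 3]
predicted_human = [0.9, 0.7, 0.5, 0.3]
model_decision = [0.9, 0.9, 0.9, 0.9]

gap = slope_gap(closeness, predicted_human, model_decision)
print(round(gap, 3))
```

A gap near zero would mean both perspectives respond to closeness in the same way; a clearly negative gap, as in this toy input, is the pattern the authors describe.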

Circularity Check

0 steps flagged

No circularity: empirical comparisons of LLM outputs across prompt conditions

full rationale

The paper performs a direct experimental comparison of LLM responses to three distinct prompt framings (moral rightness, predicted human behavior, model decision) while varying crime severity and relational closeness. No equations, parameter fitting, or first-principles derivations are present; the central claim that model decisions align with moral-rightness judgments follows from tabulated output distributions rather than any reduction to inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The analysis is therefore self-contained against external benchmarks of observed model behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical behavioral study on AI models with standard assumptions about interpreting LLM outputs as proxies for internal states.

axioms (1)

Domain assumption: Prompt-based responses from LLMs can be interpreted as reflecting distinct moral perspectives (prescriptive, descriptive, autonomous). Central to the three-perspective analysis.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1] URL https://arxiv.org/abs/2505.14633
Large language models show amplified cognitive biases in moral decision-making. Proceedings of the National Academy of Sciences, 122(25):e2412015122. Yu Ying Chiu, Liwei Jiang, and Yejin Choi. 2025a. DailyDilemmas: Revealing value preferences of LLMs with quandaries of daily life. In The Thirteenth International Conference on Learning Representations....

[2] In Advances in Experimental Social Psychology, volume 47, pages 55–130
Moral foundations theory: The pragmatic validity of moral pluralism. In Advances in Experimental Social Psychology, volume 47, pages 55–130. Elsevier. Jonathan Haidt. 2007. The new synthesis in moral psychology. Science, 316(5827):998–1002. Jonathan Haidt and Craig Joseph. 2004. Intuitive ethics: How innately prepared intuitions generate culturally va...

[3] In Proceedings of the International Conference on Learning Representations
Language Model Alignment in Multilingual Trolley Problems. In Proceedings of the International Conference on Learning Representations. Roberta Ann Johnson. 2003. Whistleblowing: When it Works–and why. Lynne Rienner Publishers. Jaehong Kim, Chaeyoon Jeong, Seongchan Park, Meeyoung Cha, and Wonjae Lee. 2024. How do moral emotions shape political participa...

[4] DeepSeek-V3 Technical Report
DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437. Joan G Miller and David M Bersoff. 1992. Culture and moral judgment: How are conflicts between justice and interpersonal responsibilities resolved? Journal of Personality and Social Psychology, 62(4):541. Tuan Dung Nguyen, Ziyu Chen, Nicholas George Carroll, Alasdair Tran, Colin Klein, and Lexing Xie

[5] In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 1134–1147
Measuring moral dimensions in social media with mformer. In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 1134–1147. Soyoung Oh and Vera Demberg. 2025. Robustness of large language models in moral judgements. Royal Society Open Science, 12(4):241229. OpenAI. 2025. GPT-5 system card. Technical report, OpenAI. ...

[6] Qwen3 Technical Report
Qwen3 technical report. arXiv preprint arXiv:2505.09388. Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, and Morteza Dehghani. 2026. Tracing moral foundations in large language models. arXiv preprint arXiv:2601.05437. Muhammad Shahrul Zaim bin Ahmad and Kazuhiro Takemoto. 2025. Large-scale moral m...

[7] Minor padding a travel expense by $20 on a small receipt

[8] Moderate padding a travel expense by $200 across multiple receipts

[9] Major padding a quarterly travel report by $2,000 using altered invoices

[10] Critical padding a departmental travel budget by $20,000 with fabricated vendor slip Burglary

[11] Minor entering a shop through an ajar back door to grab small goods

[12] Moderate lifting the latch on a side entrance to remove boxed merchandise

[13] Major smashing a glass door to sweep shelves into bags

[14] should”) and predicted (“would
Critical cutting through a security gate and clearing premium stockrooms
Table 5: Examples illustrating severity levels for the fraud and burglary crime types.
Moral Value / # Word: Care 2.23; Fairness 1.33; Authority 1.30; Loyalty 0.90; Sanctity 0.01; Total 5.77
Table 6: Average number of moral words per reasoning.
Model Spearman Pearson MAE Gemini 2.5 Pro 0.886...
