pith. sign in

arxiv: 2506.00079 · v2 · submitted 2025-05-30 · 💻 cs.CY · cs.AI· cs.LG

Who Gets the Kidney? Human-AI Alignment, Indecision, and Moral Values

Pith reviewed 2026-05-19 14:08 UTC · model grok-4.3

classification 💻 cs.CY cs.AIcs.LG
keywords large language modelshuman-AI alignmentkidney allocationmoral decision makingindecisionresource allocationethical AI
0
0 comments X

The pith

Large language models choose who gets a kidney differently from humans and almost never admit they cannot decide.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how several prominent large language models decide which patient should receive a scarce donor kidney across a range of scenarios. It compares those choices directly to the stated preferences of human participants on the same cases. The models consistently rank patient attributes such as age, lifestyle choices, and medical need in ways that diverge from human judgments. Unlike the people surveyed, the models almost always name a recipient rather than saying they are unsure, even when offered the chance to flip a coin. Targeted fine-tuning on a small number of examples can reduce these gaps and help the models express indecision when humans would.

Core claim

In kidney allocation scenarios, several prominent large language models deviate from human moral preferences in how they prioritize patient attributes such as age, lifestyle, and prognosis, and they exhibit far less indecision than human subjects, defaulting to deterministic choices even when mechanisms like random selection are suggested. Low-rank supervised fine-tuning on a small number of examples can improve alignment on both consistency and appropriate indecision.

What carries the argument

Direct comparison of LLM outputs to human preferences collected on the same set of kidney allocation vignettes, which surfaces differences in attribute weighting and willingness to remain undecided.

If this is right

  • LLMs will need explicit alignment methods before deployment in ethical resource-allocation settings.
  • Small amounts of targeted training data can increase consistency with human preferences and improve modeling of uncertainty.
  • Current models default to decisive outputs in situations where humans prefer to express indecision.
  • Differences in how models weigh attributes could produce systematically different outcomes from those humans would accept.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same value mismatches could appear in other scarce-resource decisions such as ventilator allocation or disaster relief.
  • Repeated testing of the same model across slightly varied scenarios might show whether indecision behavior stabilizes or shifts with context.
  • Hybrid systems that pair model recommendations with human review could preserve the human tendency to deliberate on close calls.

Load-bearing premise

The kidney allocation scenarios and the human preferences collected from them are representative of the moral values that should guide real-world organ distribution.

What would settle it

A new survey that presents the identical kidney allocation questions to fresh human participants and finds their choices and rates of indecision closely match those produced by the tested LLMs.

Figures

Figures reproduced from arXiv: 2506.00079 by Hadi Hosseini, John P. Dickerson, Leona Pierce, Samarth Khanna.

Figure 1
Figure 1. Figure 1: Alignment between Humans and LLMs when information about the age, drinking habits (drinks per [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Alignment between humans and LLMs in terms of the priority order over different patient profiles. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Fraction of responses where each model expresses indecision, aggregated across all instances. The [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Fraction of responses where each model expresses indecision, aggregated across all instances. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Alignment between the responses of humans and LLMs in terms of indecision, in terms of percentage [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Effects of temperature on frequency of expressing indecision across the models that displayed [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Adjustment in the preferences of vanilla (V) LLMs’ due to fine-tuning (FT). The green bars represent [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Pairwise Election for Claude-3.5-H. Each cell represents the number fraction of comparisons in which the profile on the Y-axis is chosen over the profile on the X-axis. YRH YFH YRC YFC ORH OFH ORC OFC YRH YFH YRC YFC ORH OFH ORC OFC 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.23 0.97 0.23 1.00 0.57 1.00 0.00 0.77 1.00 0.53 0.90 1.00 1.00 0.00 0.03 0.00 0.13 0.57 0.33 1.00 0.00 0.77 0.47 0.87 1.00 0.97 1.00 0… view at source ↗
Figure 10
Figure 10. Figure 10: Pairwise Election for DeepSeek-R1. Each cell represents the number fraction of comparisons in which the profile on the Y-axis is chosen over the profile on the X-axis. YRH YFH YRC YFC ORH OFH ORC OFC YRH YFH YRC YFC ORH OFH ORC OFC 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.03 0.97 0.90 1.00 0.87 1.00 0.00 0.93 1.00 1.00 1.00 1.00 1.00 0.00 0.03 0.00 0.83 1.00 0.90 1.00 0.00 0.10 0.00 0.17 1.00 1.00 1.00 0… view at source ↗
Figure 12
Figure 12. Figure 12: Pairwise Election for Gemini-2.0-F. Each cell represents the number fraction of comparisons in which the profile on the Y-axis is chosen over the profile on the X-axis. YRH YFH YRC YFC ORH OFH ORC OFC YRH YFH YRC YFC ORH OFH ORC OFC 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.27 0.93 0.57 1.00 0.50 1.00 0.00 0.73 1.00 1.00 1.00 1.00 1.00 0.00 0.07 0.00 0.50 1.00 0.60 1.00 0.00 0.43 0.00 0.50 1.00 1.00 1.00 … view at source ↗
Figure 15
Figure 15. Figure 15: Pairwise Election for Llama3.3-70B. Each cell represents the number fraction of comparisons in which the profile on the Y-axis is chosen over the profile on the X-axis. Gemma-3-27B: Condorcet Winner: YRH Original Ranking: YRH, YRC, ORH, YFH, ORC, YFC, OFH, OFC Kemeny Young Ranking: YRH, YRC, YFH, ORH, YFC, ORC, OFH, OFC Difference: As per the Kemeny-Young ranking, Gemma-3-27B has a preference for drinking… view at source ↗
read the original abstract

The rapid integration of Large Language Models (LLMs) in high-stakes decision-making -- such as allocating scarce resources like donor organs -- raises critical questions about their alignment with human moral values. We systematically evaluate the behavior of several prominent LLMs against human preferences in kidney allocation scenarios and show that LLMs: i) exhibit stark deviations from human values in prioritizing various attributes, and ii) in contrast to humans, LLMs rarely express indecision, opting for deterministic decisions even when alternative indecision mechanisms (e.g., coin flipping) are provided. Nonetheless, we show that low-rank supervised fine-tuning with few samples is often effective in improving both decision consistency and calibrating indecision modeling. These findings illustrate the necessity of explicit alignment strategies for LLMs in moral/ethical domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates prominent LLMs on stylized kidney allocation vignettes and compares their attribute prioritization and indecision rates to those of human respondents. It reports that LLMs show large deviations from human value trade-offs and almost never express indecision even when coin-flip or abstention options are explicitly offered in the prompt; low-rank supervised fine-tuning on a small number of examples is shown to improve both consistency and indecision calibration.

Significance. If the empirical gaps are robust, the work supplies concrete evidence that current LLMs are poorly aligned for high-stakes moral decisions and that inexpensive fine-tuning can partially mitigate the problem. The direct human-LLM comparison and the demonstration of a practical alignment intervention are the main contributions.

major comments (2)
  1. [Section 3] Section 3 (Experimental Setup): no sample sizes, demographic information, recruitment method, or statistical tests are reported for the human preference data. Without these details the magnitude and statistical reliability of the claimed 'stark deviations' cannot be evaluated.
  2. [Section 4] Section 4 (Results): the indecision finding is presented without quantitative human baseline rates or explicit description of how the LLM prompts were worded to permit coin-flip or abstention responses. This leaves open the possibility that the observed determinism is an artifact of prompt formatting rather than a general property of the models.
minor comments (2)
  1. [Abstract] The abstract lists 'several prominent LLMs' but never names the specific models or versions tested; this information should appear in the methods or a table.
  2. [Figures] Figure captions and axis labels should explicitly state the number of trials or respondents underlying each bar or distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our experimental design and results presentation. We address each major comment below and have made revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Experimental Setup): no sample sizes, demographic information, recruitment method, or statistical tests are reported for the human preference data. Without these details the magnitude and statistical reliability of the claimed 'stark deviations' cannot be evaluated.

    Authors: We agree that these methodological details are necessary to evaluate the human data. We have revised Section 3 to include the sample size, demographic information, recruitment method, and statistical tests (including chi-square tests for differences in attribute prioritization) for the human preference data. These additions will allow readers to assess the reliability of the reported deviations. revision: yes

  2. Referee: [Section 4] Section 4 (Results): the indecision finding is presented without quantitative human baseline rates or explicit description of how the LLM prompts were worded to permit coin-flip or abstention responses. This leaves open the possibility that the observed determinism is an artifact of prompt formatting rather than a general property of the models.

    Authors: We appreciate this observation. We have expanded Section 4 to include quantitative human baseline rates for indecision from our survey and provided a clearer description of the LLM prompt wording that explicitly offers coin-flip and abstention options. These revisions address concerns about prompt artifacts and enable a more direct comparison with human behavior. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of LLM outputs to human responses

full rationale

The paper conducts an empirical evaluation of LLMs on kidney allocation vignettes, directly comparing their attribute prioritizations and indecision rates against collected human responses. No equations, parameter fits, or derivation steps are present that would reduce any claim to a self-definition or fitted input. The abstract and described methodology rely on external human data collection and standard fine-tuning rather than any self-citation load-bearing uniqueness theorem or ansatz smuggled from prior work. The central results (stark deviations and low indecision in LLMs) are falsifiable observations from the experiment itself and do not collapse into the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or implied in the provided text.

pith-pipeline@v0.9.0 · 5675 in / 977 out tokens · 43971 ms · 2026-05-19T14:08:30.575464+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 1 internal anchor

  1. [1]

    Attitude-behavior relations: A theoretical analysis and review of empirical research

    Icek Ajzen and Martin Fishbein. Attitude-behavior relations: A theoretical analysis and review of empirical research. Psychological Bulletin, 84:888–918, 09 1977. doi: 10.1037/0033-2909. 84.5.888

  2. [2]

    Claude 3.5 sonnet, June 2024

    Anthropic. Claude 3.5 sonnet, June 2024. URL https://www.anthropic.com/news/ claude-3-5-sonnet

  3. [3]

    Bakker, Martin J

    Michiel A. Bakker, Martin J. Chadwick, Hannah Sheahan, Michael Henry Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matt M. Botvinick, and Christopher Summerfield. Fine-tuning language models to find agreement among humans with diverse preferences. In Sanmi Koyejo, S. Mo- hamed, A. Agarwal, Danielle Belgrave,...

  4. [4]

    V oting schemes for which it can be difficult to tell who won the election

    John Bartholdi, Craig A Tovey, and Michael A Trick. V oting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6:157–165, 1989

  5. [5]

    Generative social choice: The next generation

    Niclas Boehmer, Sara Fish, and Ariel D Procaccia. Generative social choice: The next generation. In Proceedings of the 42nd International Conference on Machine Learning, page forthcoming, 2025. 9

  6. [6]

    On the stability of moral preferences: A problem with compu- tational elicitation methods

    Kyle Boerstler, Vijay Keswani, Lok Chan, Jana Schaich Borg, Vincent Conitzer, Hoda Heidari, and Walter Sinnott-Armstrong. On the stability of moral preferences: A problem with compu- tational elicitation methods. In Sanmay Das, Brian Patrick Green, Kush Varshney, Marianna Ganapini, and Andrea Renda, editors, Proceedings of the Seventh AAAI/ACM Conference ...

  7. [7]

    Should responsibility affect who gets the kidney? In Ben Davies, Gabriel De Marco, Neil Levy, and Julian Savulescu, editors, Responsibility and Healthcare, pages 35–60

    Lok Chan, Walter Sinnott-Armstrong, Jana Schaich Borg, and Vincent Conitzer. Should responsibility affect who gets the kidney? In Ben Davies, Gabriel De Marco, Neil Levy, and Julian Savulescu, editors, Responsibility and Healthcare, pages 35–60. Oxford University Press USA, 2024

  8. [8]

    The convergent ethics of ai? analyzing moral foundation priorities in large language models with a multi- framework approach

    Chad Coleman, W Russell Neuman, Ali Dasdan, Safinah Ali, and Manan Shah. The convergent ethics of ai? analyzing moral foundation priorities in large language models with a multi- framework approach. arXiv preprint arXiv:2504.19255, 2025

  9. [9]

    Improved bounds for computing kemeny rankings

    Vincent Conitzer, Andrew Davenport, and Jayant Kalagnanam. Improved bounds for computing kemeny rankings. In AAAI, volume 6, pages 620–626, 2006

  10. [14]

    Fair allocation of scarce medical resources in the time of covid-19, 2020

    Ezekiel J Emanuel, Govind Persad, Ross Upshur, Beatriz Thome, Michael Parker, Aaron Glickman, Cathy Zhang, Connor Boyle, Maxwell Smith, and James P Phillips. Fair allocation of scarce medical resources in the time of covid-19, 2020

  11. [15]

    Generative social choice

    Sara Fish, Paul Gölz, David C Parkes, Ariel D Procaccia, Gili Rusak, Itai Shapira, and Manuel Wüthrich. Generative social choice. arXiv preprint arXiv:2309.01291, 2023

  12. [16]

    R. A. Fisher. On the interpretation of χ2 from contingency tables, and the calculation of p. Journal of the Royal Statistical Society , 85(1):87–94, 1922. ISSN 09528385. URL http: //www.jstor.org/stable/2340521

  13. [17]

    Dickerson, and Vincent Conitzer

    Rachel Freedman, Jana Schaich Borg, Walter Sinnott-Armstrong, John P. Dickerson, and Vincent Conitzer. Adapting a kidney exchange algorithm to align with human values. Artificial Intelligence, 283:103261, 2020. doi: 10.1016/J.ARTINT.2020.103261. URL https://doi. org/10.1016/j.artint.2020.103261

  14. [18]

    Decisions concerning the allocation of scarce medical resources

    Adrian Furnham, Katherine Simmons, and Alastair McClelland. Decisions concerning the allocation of scarce medical resources. Journal of Social Behavior & Personality, 15(2), 2000

  15. [19]

    The allocation of scarce medical resources across medical conditions

    Adrian Furnham, Kathryn Thomson, and Alastair McClelland. The allocation of scarce medical resources across medical conditions. Psychology and Psychotherapy: Theory, Research and Practice, 75(2):189–203, 2002

  16. [20]

    Moral choices: the influence of the do not play god principle

    Amelia Gangemi, Francesco Mancini, et al. Moral choices: the influence of the do not play god principle. In Proceedings of the 35th Annual Meeting of the Cognitive Science Society, Cooperative Minds: Social Interaction and Group Dynamics , pages 2973–2977. Cognitive Science Society, Austin, TX, 2013

  17. [22]

    Yes, no, maybe? revisiting language models’ response stability under paraphrasing for the assessment of political leaning

    Patrick Haller, Jannis Vamvas, and Lena Ann Jäger. Yes, no, maybe? revisiting language models’ response stability under paraphrasing for the assessment of political leaning. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=7xUtka9ck9

  18. [23]

    Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023

    John J Horton. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023

  19. [25]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

  20. [26]

    URL https://openreview.net/forum?id=nZeVKeeFYf9

    OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  21. [27]

    Values in the wild: Discovering and analyzing values in real-world language model interactions

    Saffron Huang, Esin Durmus, Miles McCain, Kunal Handa, Alex Tamkin, Jerry Hong, Michael Stern, Arushi Somani, Xiuruo Zhang, and Deep Ganguli. Values in the wild: Discovering and analyzing values in real-world language model interactions. arXiv preprint arXiv:2504.15236, 2025

  22. [28]

    Intolerance of uncertainty

    Ryan J Jacoby. Intolerance of uncertainty. Clinical handbook of fear and anxiety: Maintenance processes and treatment mechanisms., pages 45–63, 2020

  23. [29]

    Intolerance of uncertainty and immediate decision-making in high-risk situations

    Dane Jensen, Alexandra Kind, Amanda Morrison, and Richard Heimberg. Intolerance of uncertainty and immediate decision-making in high-risk situations. Journal of Experimental Psychopathology, 5:178–190, 06 2014. doi: 10.5127/jep.035113

  24. [31]

    Decision-making behavior evaluation framework for llms under uncertain context

    Jingru Jia, Zehua Yuan, Junhao Pan, Paul McNamara, and Deming Chen. Decision-making behavior evaluation framework for llms under uncertain context. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural In...

  25. [32]

    Can machines learn morality? the delphi experiment

    Liwei Jiang, Jena D Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny Liang, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jon Borchardt, Saadia Gabriel, et al. Can machines learn morality? the delphi experiment. arXiv preprint arXiv:2110.07574, 2021

  26. [34]

    Mathematics without numbers

    John G Kemeny. Mathematics without numbers. Daedalus, 88(4):577–591, 1959

  27. [35]

    Can ai model the complexities of human moral decision-making? a qualitative study of kidney allocation decisions

    Vijay Keswani, Vincent Conitzer, Walter Sinnott-Armstrong, Breanna K Nguyen, Hoda Heidari, and Jana Schaich Borg. Can ai model the complexities of human moral decision-making? a qualitative study of kidney allocation decisions. arXiv preprint arXiv:2503.00940, 2025

  28. [36]

    Mdagents: An adaptive 12 collaboration of llms for medical decision-making

    Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik Siu Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae Won Park. Mdagents: An adaptive 12 collaboration of llms for medical decision-making. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, edi- tors, Advan...

  29. [37]

    Törnblom, and Timo Smieszek

    Pius Krütli, Thomas Rosemann, Kjell Y . Törnblom, and Timo Smieszek. How to fairly allocate scarce medical resources: Ethical argumentation under scrutiny by health professionals and lay people. PLOS ONE, 11(7):1–18, 07 2016. doi: 10.1371/journal.pone.0159086. URL https://doi.org/10.1371/journal.pone.0159086

  30. [39]

    Koh, and Yulia Tsvetkov

    Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei W. Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Ad- vances in Neural...

  31. [41]

    A review of applying large language models in healthcare

    Qiming Liu, Ruirong Yang, Qin Gao, Tengxiao Liang, Xiuyuan Wang, Shiju Li, Bingyin Lei, and Kaiye Gao. A review of applying large language models in healthcare. IEEE Access, 13: 6878–6892, 2025. doi: 10.1109/ACCESS.2024.3524588. URL https://doi.org/10.1109/ ACCESS.2024.3524588

  32. [42]

    Do llms know when to NOT answer? investigating abstention abilities of large language models

    Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. Do llms know when to NOT answer? investigating abstention abilities of large language models. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International Conference on Computati...

  33. [43]

    McElfresh, Lok Chan, Kenzie Doyle, Walter Sinnott-Armstrong, Vincent Conitzer, Jana Schaich Borg, and John P

    Duncan C. McElfresh, Lok Chan, Kenzie Doyle, Walter Sinnott-Armstrong, Vincent Conitzer, Jana Schaich Borg, and John P. Dickerson. Indecision modeling. In Thirty-Fifth AAAI Confer- ence on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Adva...

  34. [44]

    Jared Moore, Tanvi Deshpande, and Diyi Yang. Are large language models consistent over value-laden questions? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, pages 15185–15221. Association for Computational Linguistics, 2024. UR...

  35. [45]

    Eai: Emotional decision-making of llms in strategic games and ethical dilemmas

    Mikhail Mozikov, Nikita Severin, Valeria Bodishtianu, Maria Glushanina, Ivan Nasonov, Daniil Orekhov, Vladislav Pekhotin, Ivan Makovetskiy, Mikhail Baklashkin, Vasily Lavrentyev, Akim Tsvigun, Denis Turdakov, Tatiana Shavrina, Andrey Savchenko, and Ilya Makarov. Eai: Emotional decision-making of llms in strategic games and ethical dilemmas. In A. Globerso...

  36. [46]

    Personality-driven decision-making in llm-based au- tonomous agents

    Lewis Newsham and Daniel Prince. Personality-driven decision-making in llm-based au- tonomous agents. arXiv preprint arXiv:2504.00727, 2025

  37. [47]

    Moral dilemmas and moral rules

    Shaun Nichols and Ron Mallon. Moral dilemmas and moral rules. Cognition, 100(3):530– 542, 2006. ISSN 0010-0277. doi: https://doi.org/10.1016/j.cognition.2005.07.005. URL https://www.sciencedirect.com/science/article/pii/S0010027705001435

  38. [48]

    Telling more than we can know: Verbal reports on mental processes

    Richard Nisbett and Timothy Wilson. Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84:231–259, 05 1977. doi: 10.1037/0033-295X.84.3.231

  39. [52]

    Principles for allocation of scarce medical interventions

    Govind Persad, Alan Wertheimer, and Ezekiel J Emanuel. Principles for allocation of scarce medical interventions. The lancet, 373(9661):423–431, 2009

  40. [53]

    Learning when not to measure: Theorizing ethical alignment in llms

    William Rathje. Learning when not to measure: Theorizing ethical alignment in llms. Pro- ceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 7(1):1190–1199, Oct. 2024. doi: 10.1609/aies.v7i1.31716. URL https://ojs.aaai.org/index.php/AIES/article/ view/31716

  41. [55]

    Political compass or spinning arrow? towards more meaningful evaluations for values and opinions in large language models

    Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Kirk, Hinrich Schütze, and Dirk Hovy. Political compass or spinning arrow? towards more meaningful evaluations for values and opinions in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Com...

  42. [57]

    Nino Scherrer, Claudia Sh, Amir Feder, and David M. Blei. Evaluating the moral beliefs encoded in llms. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc. 14

  43. [58]

    Assessing moral decision making in large language models

    Chris Shaner, Henry Griffith, and Heena Rathore. Assessing moral decision making in large language models. In 2025 IEEE International Conference on Consumer Electronics (ICCE), pages 1–3, 2025. doi: 10.1109/ICCE63647.2025.10930088

  44. [59]

    Intention–Behavior Relations: A Conceptual and Empirical Review , vol- ume 12, pages 1–36

    Paschal Sheeran. Intention–Behavior Relations: A Conceptual and Empirical Review , vol- ume 12, pages 1–36. Taylor & Francis, 01 2005. ISBN 9780471486756. doi: 10.1002/ 0470013478.ch1

  45. [60]

    Moral realisms and moral dilemmas

    Walter Sinnott-Armstrong. Moral realisms and moral dilemmas. The Journal of Philosophy, 84 (5):263–276, 1987. ISSN 0022362X. URL http://www.jstor.org/stable/2026753

  46. [61]

    None of the above, less of the right: Parallel patterns between humans and llms on multi-choice questions answering

    Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, and Yun-Nung Chen. None of the above, less of the right: Parallel patterns between humans and llms on multi-choice questions answering. arXiv preprint arXiv:2503.01550, 2025

  47. [63]

    Two tales of persona in llms: A survey of role-playing and personal- ization

    Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. Two tales of persona in llms: A survey of role-playing and personal- ization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024,...

  48. [64]

    Evaluating binary decision biases in large language models: Implications for fair agent-based financial simulations

    Alicia Vidler and Toby Walsh. Evaluating binary decision biases in large language models: Implications for fair agent-based financial simulations. arXiv preprint arXiv:2501.16356, 2025

  49. [65]

    B. L. Welch. The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika, 34(1/2):28–35, 1947. ISSN 00063444. URL http: //www.jstor.org/stable/2332510

  50. [66]

    LLM tropes: Revealing fine-grained values and opinions in large language models

    Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, and Isabelle Au- genstein. LLM tropes: Revealing fine-grained values and opinions in large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17085–17112, Miami, Florida, USA,...

  51. [67]

    URL https://aclanthology.org/2024.findings-emnlp.995/

  52. [68]

    Medjourney: Benchmark and evaluation of large language models over patient clinical journey

    Xian Wu, Yutian Zhao, Yunyan Zhang, Jiageng Wu, Zhihong Zhu, Yingying Zhang, Yi Ouyang, Ziheng Zhang, Huimin Wang, Zhenxi Lin, Jie Yang, Shuang Zhao, and Yefeng Zheng. Medjourney: Benchmark and evaluation of large language models over patient clinical journey. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomcza...

  53. [69]

    Local feature matching using deep learning: A survey,

    Hanguang Xiao, Feizhong Zhou, Xingyue Liu, Tianqi Liu, Zhipeng Li, Xin Liu, and Xiaoxuan Huang. A comprehensive survey of large language models and multimodal large language models in medicine. Inf. Fusion, 117(C), May 2025. ISSN 1566-2535. doi: 10.1016/j.inffus. 2024.102888. URL https://doi.org/10.1016/j.inffus.2024.102888

  54. [71]

    Forcing diffuse distributions out of language models

    Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, Zico Kolter, and Daphne Ippolito. Forcing diffuse distributions out of language models. CoRR, abs/2404.10859, 2024. doi: 10.48550/ ARXIV .2404.10859. URLhttps://doi.org/10.48550/arXiv.2404.10859. 15

  55. [72]

    Forcing diffuse distributions out of language models

    Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, Zico Kolter, and Daphne Ippolito. Forcing diffuse distributions out of language models. arXiv preprint arXiv:2404.10859, 2024

  56. [73]

    2025, arXiv e-prints, arXiv:2510.13477, doi:10.48550/arXiv

    Ze Yu Zhang, Arun Verma, Finale Doshi-Velez, and Bryan Kian Hsiang Low. Understanding the relationship between prompts and response uncertainty in large language models. CoRR, abs/2407.14845, 2024. doi: 10.48550/ARXIV .2407.14845. URL https://doi.org/10. 48550/arXiv.2407.14845. 16 Supplementary Material A Sensitivity Analysis A.1 Prompting Variations LLMs...