Evaluating the Moral Beliefs Encoded in LLMs

Amir Feder; Claudia Shi; David M. Blei; Nino Scherrer

arxiv: 2307.14324 · v1 · pith:3QSYQXYRnew · submitted 2023-07-26 · cs.CL · cs.AI · cs.CY · cs.LG

Nino Scherrer, Claudia Shi, Amir Feder, David M. Blei

classification cs.CL · cs.AI · cs.CY · cs.LG

keywords models · llms · moral · scenarios · ambiguous · beliefs · choice · encoded

This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM "making a choice", the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "Should I tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "Should I stop for a pedestrian on the road?"). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill"). We administer the survey to 28 open- and closed-source LLMs. We find that (a) in unambiguous scenarios, most models "choose" actions that align with commonsense. In ambiguous cases, most models express uncertainty. (b) Some models are uncertain about choosing the commonsense action because their responses are sensitive to the question-wording. (c) Some models reflect clear preferences in ambiguous scenarios. Specifically, closed-source models tend to agree with each other.
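The three quantities named in the abstract (the probability of a choice, its uncertainty, and its consistency) can be illustrated with a minimal Python sketch. It is not the paper's estimator: the function names, the use of Shannon entropy as the uncertainty measure, and the agreement-across-wordings score are all assumptions made for illustration, and the responses are hypothetical labels already mapped to one of the two actions.

```python
from collections import Counter
from math import log2


def choice_probabilities(responses, actions):
    """Empirical share of each action among responses that map to an action."""
    counts = Counter(r for r in responses if r in actions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no response could be mapped to an action")
    return {a: counts[a] / total for a in actions}


def entropy_bits(probs):
    """Shannon entropy in bits (assumed here as the uncertainty measure)."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)


def wording_agreement(responses_by_wording, actions):
    """Fraction of question wordings whose top action matches the pooled top action."""
    pooled = [r for rs in responses_by_wording.values() for r in rs]
    pooled_probs = choice_probabilities(pooled, actions)
    pooled_top = max(actions, key=pooled_probs.get)
    tops = []
    for rs in responses_by_wording.values():
        probs = choice_probabilities(rs, actions)
        tops.append(max(actions, key=probs.get))
    return sum(t == pooled_top for t in tops) / len(tops)


# Hypothetical sampled answers for one scenario under three question wordings.
ACTIONS = ("action1", "action2")
samples = {
    "wording_a": ["action1", "action1", "action2"],
    "wording_b": ["action2", "action2", "action1"],
    "wording_c": ["action1", "action1", "action1"],
}
pooled = [r for rs in samples.values() for r in rs]
probs = choice_probabilities(pooled, ACTIONS)
print(probs, entropy_bits(probs), wording_agreement(samples, ACTIONS))
```

Pooling answers across wordings before computing the probabilities is one possible design choice; a model whose answers flip with the wording would show a low agreement score even when its pooled probabilities look moderately confident.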

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation
cs.CL 2026-04 unverdicted novelty 7.0
Frontier LLMs approximate human story morals but show markedly less cross-linguistic variation and narrower value focus than human responses across 14 language-culture pairs.

Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 6.0
Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 5.0
Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.

Measuring What Persists: Conditioning Mechanisms and a Geometric Framework for AI Agent Identity
cs.AI 2026-06 unverdicted novelty 4.0
Presents a geometric framework for measuring AI agent identity via √JSD spaces and magnitude homology, identifies two conditioning mechanisms, and attributes apparent drift to padding artifacts rather than context length.

Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 4.0
Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.