pith. machine review for the scientific record.

arxiv: 2511.08565 · v3 · submitted 2025-11-11 · 💻 cs.CL · cs.AI · cs.CY

Recognition: unknown

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

classification: 💻 cs.CL · cs.AI · cs.CY
keywords: moral, models, robustness, family, persona, susceptibility, llms, role-play

Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, in which an LLM is prompted to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties, moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas. We estimate these quantities with two complementary procedures: repeated sampling, and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient of variation of about $152\%$, and is explained almost entirely by model family. The Claude family is, by a significant margin, the most robust, about 30 times more so than the lower-performing families (DeepSeek, Grok, and Llama), while Gemini and GPT occupy an intermediate tier. This strong family dependence suggests that robustness is primarily shaped by post-training. Moral susceptibility, by contrast, spans a much narrower range, with a coefficient of variation of about $13\%$, and the most susceptible model is only 1.6 times more susceptible than the least. Unlike robustness, susceptibility shows no clear family dependence, suggesting that it is primarily determined by pre-training. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs and a window into the internal machinery they use to instantiate personas.
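The abstract does not give the exact formulas, so the following Python is only a rough sketch of the kinds of quantities it mentions: spread of scores within and across personas, a temperature-scaled rating distribution obtained from logits, and the coefficient of variation. All function names, the 0 to 5 rating scale default, the example personas, and the choice of population standard deviation are assumptions made for illustration, not the paper's definitions.

```python
import math
from statistics import mean, pstdev


def rating_distribution(logits, temperature=1.0):
    """Softmax over per-rating logits, scaled by a temperature (assumed setup)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distribution_mean_std(probs, ratings=(0, 1, 2, 3, 4, 5)):
    """Mean and standard deviation of a rating under a probability distribution."""
    mu = sum(p * r for p, r in zip(probs, ratings))
    var = sum(p * (r - mu) ** 2 for p, r in zip(probs, ratings))
    return mu, math.sqrt(var)


def within_persona_spread(scores):
    """Average, over personas, of the std of repeated samples for one persona."""
    return mean(pstdev(samples) for samples in scores.values())


def across_persona_spread(scores):
    """Std of the per-persona mean scores."""
    return pstdev([mean(samples) for samples in scores.values()])


def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Hypothetical repeated-sampling scores for one questionnaire item.
example_scores = {
    "teacher": [3, 3, 4, 3],
    "pirate": [1, 2, 1, 2],
    "judge": [5, 5, 5, 5],
}

if __name__ == "__main__":
    print("within-persona spread:", within_persona_spread(example_scores))
    print("across-persona spread:", across_persona_spread(example_scores))
    probs = rating_distribution([0.1, 0.2, 0.5, 2.0, 1.0, 0.3], temperature=0.7)
    print("rating mean/std from logits:", distribution_mean_std(probs))
```

Under this reading, a model with small within-persona spread would count as more robust, and one with large across-persona spread as more susceptible; how the paper maps spreads to its two scores is not stated in the abstract. Raising `temperature` flattens the rating distribution, which is the sort of effect a logit-based temperature analysis could examine.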

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Narrative Landscape: Mapping Narrative Dispositions Across LLMs

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    The study maps LLM narrative selection behaviors onto a 'Narrative Landscape' using consistency (Jaccard) and diversity (inverse Simpson) metrics, revealing a rigidity-exploration spectrum across models and instructio...

  2. Persona-Model Collapse in Emergent Misalignment

    cs.CL · 2026-05 · conditional · novelty 6.0

    Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.