1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
When Role-playing, Do Models Believe What They Say?
Different persona induction methods produce a spectrum of belief internalization. Prompting, in-context learning (ICL), and supervised fine-tuning (SFT) mainly alter outputs, whereas Emergent Misalignment produces large representational shifts and Open Character Training produces smaller shifts that are clearest in larger models.