Claude Sonnet 4.5 exhibits functional emotions: abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors. This finding does not imply subjective experience.
Mechanistic interpretability of emotion inference in large language models
5 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (5)
verdicts: UNVERDICTED (5)
roles: background (1)
polarities: background (1)
representative citing papers
A 2x2 factorial experiment on Qwen3.5-4B shows that relational structure and first-person register interact to drive behavioral persistence after functional collapse, while attention tracks lexical surprise and emotion probes track structure alone.
An NSM-based explication parser with fixed semantic rules produces emotion labels for events, achieving 0.33 accuracy on held-out crowd-sourced data while shifting empirical risk to an inspectable parser.
Language model embeddings encode a globally organized, navigable manifold corresponding to a consciousness-spectrum taxonomy, with trajectories moving from lower- to higher-level regions.
AIPsy-Affect supplies 480 keyword-free clinical vignettes and matched neutral controls for mechanistic interpretability studies of emotion in language models.
citing papers explorer
- A Navigable Manifold of Hypothesized Consciousness-Spectrum States in Language Model Representations
Language model embeddings encode a globally organized, navigable manifold corresponding to a consciousness-spectrum taxonomy, with trajectories moving from lower- to higher-level regions.