Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Daniel Freeman, Theodore R
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 2roles
background 1polarities
background 1representative citing papers
Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.
citing papers explorer
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.