Daniel Freeman, Theodore R

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C · 2024

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

cs.LG · 2026-05-07 · conditional · novelty 6.0

Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.

citing papers explorer

Showing 2 of 2 citing papers.

Emotion Concepts and their Function in a Large Language Model cs.AI · 2026-04-09 · unverdicted · none · ref 28
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders cs.LG · 2026-05-07 · conditional · none · ref 34
Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.

Daniel Freeman, Theodore R

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer