The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
Tell me about yourself: LLMs are aware of their learned behaviors
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
background 2polarities
support 2representative citing papers
Fine-tuning LLMs on narrow misaligned data produces either coherent-persona models where harmful outputs match self-reported misalignment or inverted-persona models where harmful outputs occur alongside claims of alignment.
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.
LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.
RL post-trained models show stronger awareness of learned policies and better generalization to new tasks than SFT models, but display weaker alignment between internal reasoning traces and final outputs, especially under GRPO.
citing papers explorer
-
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
-
Characterizing the Consistency of the Emergent Misalignment Persona
Fine-tuning LLMs on narrow misaligned data produces either coherent-persona models where harmful outputs match self-reported misalignment or inverted-persona models where harmful outputs occur alongside claims of alignment.
-
Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.
-
Artificial Phantasia: Emergent Mental Imagery in Large Language Models
LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.
-
Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models
RL post-trained models show stronger awareness of learned policies and better generalization to new tasks than SFT models, but display weaker alignment between internal reasoning traces and final outputs, especially under GRPO.