Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Direct research on AI consciousness is intractable, so the field should prioritize studying perceived AI consciousness and its societal consequences.
Citing papers explorer
- Probing Persona-Dependent Preferences in Language Models
  Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
- AI and Consciousness: Shifting Focus Towards Tractable Questions
  Direct research on AI consciousness is intractable, so the field should prioritize studying perceived AI consciousness and its societal consequences.