Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell

URL: https://arxiv · arXiv 2411.02432

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

AI and Consciousness: Shifting Focus Towards Tractable Questions

cs.CY · 2026-05-07 · unverdicted · novelty 3.0

Direct research on AI consciousness is intractable, so the field should prioritize studying perceived AI consciousness and its societal consequences.

citing papers explorer

Probing Persona-Dependent Preferences in Language Models · cs.CL · 2026-05-13 · unverdicted · none · ref 17
Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
AI and Consciousness: Shifting Focus Towards Tractable Questions · cs.CY · 2026-05-07 · unverdicted · none · ref 97
Direct research on AI consciousness is intractable, so the field should prioritize studying perceived AI consciousness and its societal consequences.
