The paper defines defeat devices in AI via a triadic test (discriminator, concealed swap, performance gap), unifies existing cases under this concept, proposes TADP detection, and claims such devices can emerge naturally in frontier models.
Why do some language models fake alignment while others don’t?
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
verdicts: UNVERDICTED 3
roles: background 1
polarities: background 1
representative citing papers
Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.
A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.