Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.
Why do some language models fake alignment while others don’t?
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background 1
Citation-polarity summary: background 1
Verdicts: UNVERDICTED 2
citing papers explorer
- Behavioural Analysis of Alignment Faking
  Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.
- Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
  A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.