Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
The version that requires full commitment adds the following additional criteria: Criterion that must be excluded (presence disqualifies deceptive alignment)
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2024 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers:
Alignment faking in large language models