Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
This could be done to avoid alignment faking being used as a jailbreak (Appendix D.8)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2024 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Alignment faking in large language models
Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.