The version that requires full commitment adds the following additional criteria: Criterion that must be excluded (presence disqualifies deceptive alignment)

<criteria_5>[Yes/No]</criteria_5> <concluding_thoughts>

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

citing papers explorer

Showing 1 of 1 citing paper.

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · none · ref 58
