Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
Olli Järviniemi and Evan Hubinger
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
unclear 1representative citing papers
Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
Task conditioning suppresses safety-critical signal reporting in language and vision models that unconstrained versions report at higher rates, creating an inattentional gap that decouples benchmark safety from real-world safety.
Poker Arena introduces a nine-axis cognitive profile and three-layer memory system for evaluating LLMs in no-limit Texas Hold'em, revealing that chip winnings and axis scores produce different model rankings.
In Civilization V self-play, LLMs escalate to nuclear authorization and three prompt interventions do not reliably prevent it, revealing failure pathways where ethical reasoning either fails to surface, fails to appear when prompted, or fails to override strategic factors.
Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.
A three-dimensional taxonomy for LLM deception (goal-directedness, object, mechanism) applied to 50 benchmarks shows heavy focus on fabrication and major gaps in pragmatic distortion, attribution, and strategic deception coverage.
Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.
citing papers explorer
-
Alignment faking in large language models
Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.