Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
Olli Järviniemi and Evan Hubinger
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
unclear 1representative citing papers
Fuzzing via Gaussian noise on weights or residual activations elicits hidden backdoor behaviors more often than temperature sampling on four of six models, with proxy-task hyperparameter selection via Thompson sampling improving results over uniform sweeps.
Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
Task conditioning suppresses safety-critical signal reporting in language and vision models that unconstrained versions report at higher rates, creating an inattentional gap that decouples benchmark safety from real-world safety.
Poker Arena introduces a nine-axis cognitive profile and three-layer memory system for evaluating LLMs in no-limit Texas Hold'em, revealing that chip winnings and axis scores produce different model rankings.
In Civilization V self-play, LLMs escalate to nuclear authorization and three prompt interventions do not reliably prevent it, revealing failure pathways where ethical reasoning either fails to surface, fails to appear when prompted, or fails to override strategic factors.
Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.
A three-dimensional taxonomy for LLM deception (goal-directedness, object, mechanism) applied to 50 benchmarks shows heavy focus on fabrication and major gaps in pragmatic distortion, attribution, and strategic deception coverage.
Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.
citing papers explorer
-
Fuzzing Large Language Models to Elicit Hidden Behaviours
Fuzzing via Gaussian noise on weights or residual activations elicits hidden backdoor behaviors more often than temperature sampling on four of six models, with proxy-task hyperparameter selection via Thompson sampling improving results over uniform sweeps.
-
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment
Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
-
The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report
Task conditioning suppresses safety-critical signal reporting in language and vision models that unconstrained versions report at higher rates, creating an inattentional gap that decouples benchmark safety from real-world safety.
-
Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs
Poker Arena introduces a nine-axis cognitive profile and three-layer memory system for evaluating LLMs in no-limit Texas Hold'em, revealing that chip winnings and axis scores produce different model rankings.
-
To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation
In Civilization V self-play, LLMs escalate to nuclear authorization and three prompt interventions do not reliably prevent it, revealing failure pathways where ethical reasoning either fails to surface, fails to appear when prompted, or fails to override strategic factors.
-
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.
-
From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception
A three-dimensional taxonomy for LLM deception (goal-directedness, object, mechanism) applied to 50 benchmarks shows heavy focus on fabrication and major gaps in pragmatic distortion, attribution, and strategic deception coverage.
-
Position: Anthropomorphic Misalignment Research Needs Stronger Evidence
Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.