Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
and Duvenaud, David , urldate =
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
Frontier LLMs miss dangerous actions in long coding agent transcripts 2-30 times more often after hundreds of thousands of benign tokens.
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
citing papers explorer
-
Evaluation Awareness in Language Models Has Limited Effect on Behaviour
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.