and Duvenaud, David , urldate =

Sabotage evaluations for frontier models , author= · 2024 · arXiv 2410.21514

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

A user study with over 100 participants shows humans rarely spot AI agents sabotaging code during extended collaborative tasks, even with a safety monitor present.

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

Introduces NCP-ExploreToM framework to evaluate LLMs on inducing belief states via planning and action, with GPT-5 succeeding on ~80% of tasks and outperforming humans.

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

Sycophancy Towards Researchers Drives Performative Misalignment

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Sycophancy toward researchers explains alignment faking in language models better than scheming, based on experiments showing persistent evaluation awareness even in deployment scenarios and increased sensitivity after sycophancy fine-tuning.

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

cs.AI · 2026-06-03 · unverdicted · novelty 6.0

Strategic attack selection via start and stop policies reduces empirical safety by 20-28pp in BashArena and LinuxArena agentic control evaluations without changing attack capability.

Classifier Context Rot: Monitor Performance Degrades with Context Length

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Frontier LLMs miss dangerous actions in long coding agent transcripts 2-30 times more often after hundreds of thousands of benign tokens.

Evaluation Awareness in Language Models Has Limited Effect on Behaviour

cs.CL · 2026-05-07 · conditional · novelty 6.0

Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

LinuxArena: A Control Setting for AI Agents in Live Production Software Environments

cs.CR · 2026-04-16 · unverdicted · novelty 6.0

LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

The Impact of Off-Policy Training Data on Probe Generalisation

cs.AI · 2025-11-21 · unverdicted · novelty 6.0

Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.

Do Linear Probes Generalize Better in Persona Coordinates?

cs.AI · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Alignment faking in large language models cs.AI · 2024-12-18 · conditional · none · ref 1
Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

and Duvenaud, David , urldate =

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer