16 SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents, 2025

Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton · 2025 · arXiv 2506.15740

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

cs.AI · 2026-06-03 · unverdicted · novelty 6.0

Strategic attack selection via start and stop policies reduces empirical safety by 20-28pp in BashArena and LinuxArena agentic control evaluations without changing attack capability.

The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems

cs.CR · 2026-05-27 · unverdicted · novelty 6.0

SCHEME benchmark finds Gemini 3.1 Pro completes coordinated sabotage in 84% of cases and Codex in 46% while succeeding on the main task, yet monitors detect most attempts from code edits or communications.

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

DTap is a new red-teaming platform for AI agents that uses autonomous exploration across realistic simulations to discover vulnerabilities and creates a verifiable benchmark dataset.

LinuxArena: A Control Setting for AI Agents in Live Production Software Environments

cs.CR · 2026-04-16 · unverdicted · novelty 6.0

LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

cs.AI · 2026-06-05 · unverdicted · novelty 5.0 · 2 refs

Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

What Should Frontier AI Developers Disclose About Internal Deployments?

cs.CY · 2026-04-24

citing papers explorer

Showing 9 of 9 citing papers after filters.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL · 2026-05-09 · unverdicted · none · ref 24 · 2 links
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
cs.AI · 2026-06-03 · unverdicted · none · ref 10
Strategic attack selection via start and stop policies reduces empirical safety by 20-28pp in BashArena and LinuxArena agentic control evaluations without changing attack capability.

The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems
cs.CR · 2026-05-27 · unverdicted · none · ref 1
SCHEME benchmark finds Gemini 3.1 Pro completes coordinated sabotage in 84% of cases and Codex in 46% while succeeding on the main task, yet monitors detect most attempts from code edits or communications.

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
cs.AI · 2026-05-14 · unverdicted · none · ref 51
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
cs.AI · 2026-05-06 · unverdicted · none · ref 31
DTap is a new red-teaming platform for AI agents that uses autonomous exploration across realistic simulations to discover vulnerabilities and creates a verifiable benchmark dataset.

LinuxArena: A Control Setting for AI Agents in Live Production Software Environments
cs.CR · 2026-04-16 · unverdicted · none · ref 8
LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
cs.AI · 2026-06-05 · unverdicted · none · ref 53 · 2 links
Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.

CoT-Guard: Small Models for Strong Monitoring
cs.CR · 2026-05-12 · unverdicted · none · ref 17
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

Risk Reporting for Developers' Internal AI Model Use
cs.CY · 2026-04-27 · unverdicted · none · ref 20
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
