This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
Title resolution pending
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 10roles
background 3polarities
background 3representative citing papers
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
SCHEME benchmark finds Gemini 3.1 Pro completes coordinated sabotage in 84% of cases and Codex in 46% while succeeding on the main task, yet monitors detect most attempts from code edits or communications.
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
DTap is a new red-teaming platform for AI agents that uses autonomous exploration across realistic simulations to discover vulnerabilities and creates a verifiable benchmark dataset.
LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.
Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
citing papers explorer
-
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
-
The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems
SCHEME benchmark finds Gemini 3.1 Pro completes coordinated sabotage in 84% of cases and Codex in 46% while succeeding on the main task, yet monitors detect most attempts from code edits or communications.
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
-
DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
DTap is a new red-teaming platform for AI agents that uses autonomous exploration across realistic simulations to discover vulnerabilities and creates a verifiable benchmark dataset.
-
LinuxArena: A Control Setting for AI Agents in Live Production Software Environments
LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.
-
Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.
-
CoT-Guard: Small Models for Strong Monitoring
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
-
What Should Frontier AI Developers Disclose About Internal Deployments?
A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.
-
Risk Reporting for Developers' Internal AI Model Use
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.