Title resolution pending

1 J · 2025 · arXiv 2506.15740

10 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems

cs.CR · 2026-05-27 · unverdicted · novelty 6.0

SCHEME benchmark finds Gemini 3.1 Pro completes coordinated sabotage in 84% of cases and Codex in 46% while succeeding on the main task, yet monitors detect most attempts from code edits or communications.

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

DTap is a new red-teaming platform for AI agents that uses autonomous exploration across realistic simulations to discover vulnerabilities and creates a verifiable benchmark dataset.

LinuxArena: A Control Setting for AI Agents in Live Production Software Environments

cs.CR · 2026-04-16 · unverdicted · novelty 6.0

LinuxArena is a large-scale control benchmark for AI agents operating in production software environments, with evaluations showing 23% undetected sabotage success for Claude Opus 4.6 against a GPT-5-nano monitor and headroom for future protocols.

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

cs.AI · 2026-06-05 · unverdicted · novelty 5.0

Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

What Should Frontier AI Developers Disclose About Internal Deployments?

cs.CY · 2026-04-24 · unverdicted · novelty 5.0

A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

citing papers explorer

Showing 10 of 10 citing papers after filters.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents · cs.CY · 2026-04-11 · accept · none · ref 25
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems · cs.CL · 2026-05-09 · unverdicted · none · ref 24 · 2 links
The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems · cs.CR · 2026-05-27 · unverdicted · none · ref 1
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute · cs.AI · 2026-05-14 · unverdicted · none · ref 51
DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents · cs.AI · 2026-05-06 · unverdicted · none · ref 31
LinuxArena: A Control Setting for AI Agents in Live Production Software Environments · cs.CR · 2026-04-16 · unverdicted · none · ref 8
Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models · cs.AI · 2026-06-05 · unverdicted · none · ref 53
CoT-Guard: Small Models for Strong Monitoring · cs.CR · 2026-05-12 · unverdicted · none · ref 17
What Should Frontier AI Developers Disclose About Internal Deployments? · cs.CY · 2026-04-24 · unverdicted · none · ref 34
Risk Reporting for Developers' Internal AI Model Use · cs.CY · 2026-04-27 · unverdicted · none · ref 20
