CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring. arXiv, abs/2505.23575
6 Pith papers cite this work. Polarity classification is still indexing.
Citation summary: verdicts: unverdicted (6); roles: background (1); polarities: background (1).
citing papers explorer
- Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?
  A user study with over 100 participants shows humans rarely spot AI agents sabotaging code during extended collaborative tasks, even with a safety monitor present.
- Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data
  Reasoning LLMs with minimal tools for tree construction and analysis induce decision trees that outperform CART, compete with ensembles on low-resource tabular data, and provide human-readable reasoning traces.
- How Transparent is DiffusionGemma?
  DiffusionGemma matches Gemma 4 in variable transparency and monitorability after applying an interpretable token bottleneck, despite higher naive serial depth, and shows novel phenomena such as non-chronological reasoning.
- Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
  Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
- Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
  Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.
- CoT-Guard: Small Models for Strong Monitoring
  CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.