Hidden in plain text: Emergence & mitigation of steganographic collusion in LLMs
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases
Auditing Sabotage Bench shows frontier LLMs and human auditors achieve at most 0.77 AUROC and 42% top-1 fix rate when trying to detect and correct sabotage in ML codebases.