Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.AI 2verdicts
UNVERDICTED 2representative citing papers
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
citing papers explorer
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
-
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.