Introduces MOOD benchmark for OOD LLM alignment failures and shows guard models plus Mahalanobis and perplexity OOD detectors improve recall from 39% to 45% with positive scaling.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
Independent evaluation of Claude Code auto mode finds 81% false negative rate on ambiguous authorization tasks due to unmonitored file edits.
citing papers explorer
-
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
Introduces MOOD benchmark for OOD LLM alignment failures and shows guard models plus Mahalanobis and perplexity OOD detectors improve recall from 39% to 45% with positive scaling.
-
Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode
Independent evaluation of Claude Code auto mode finds 81% false negative rate on ambiguous authorization tasks due to unmonitored file edits.