Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.
Building blocks for assurance cases
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.