Tool-using LLM agents can implement undetectable stegosystems, shifting the primary barrier to covert multi-agent collusion from technical feasibility to coordination without explicit agreement.
Safety cases for frontier
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4representative citing papers
Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.
Behavioral assurance is structurally unable to verify the latent safety properties demanded by AI governance frameworks enacted 2019-2026.
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
citing papers explorer
-
Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems
Tool-using LLM agents can implement undetectable stegosystems, shifting the primary barrier to covert multi-agent collusion from technical feasibility to coordination without explicit agreement.
-
"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.
-
Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
Behavioral assurance is structurally unable to verify the latent safety properties demanded by AI governance frameworks enacted 2019-2026.
-
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.