Safety cases for frontier

Marie Davidsen Buhl, Gaurav Sett, Leonie Koessler, Jonas Schuett, Markus Anderljung · 2024 · arXiv 2410.21572

4 Pith papers cite this work.

representative citing papers

Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems

cs.CR · 2026-06-25 · unverdicted · novelty 6.0

Tool-using LLM agents can implement undetectable stegosystems, shifting the primary barrier to covert multi-agent collusion from technical feasibility to coordination without explicit agreement.

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

Lie detectors that are effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except for chain-of-thought judges, which retain 0.82 balanced accuracy, partly due to verification artifacts.

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

Behavioral assurance is structurally unable to verify the latent safety properties demanded by AI governance frameworks enacted between 2019 and 2026.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

cs.AI · 2025-07-15 · unverdicted · novelty 5.0

Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.

citing papers explorer

Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems

cs.CR · 2026-06-25 · unverdicted · none · ref 29

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

cs.AI · 2026-06-10 · unverdicted · none · ref 67

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

cs.LG · 2026-05-14 · unverdicted · none · ref 6

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

cs.AI · 2025-07-15 · unverdicted · none · ref 9
