Containment verification proves that an agentic framework can enforce safety boundaries against any output from an unconstrained AI model by mechanized forward-simulation refinement in Dafny.
4 Pith papers cite this work.
fields
cs.AI 4
representative citing papers
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
SABA improves LLM performance on detective puzzle benchmarks by recursively fusing information into a base state and using queries to resolve missing premises before concluding.
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
citing papers explorer
- Containment Verification: AI Safety Guarantees Independent of Alignment
  Containment verification proves that an agentic framework can enforce safety boundaries against any output from an unconstrained AI model by mechanized forward-simulation refinement in Dafny.
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
  LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
- Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness
  SABA improves LLM performance on detective puzzle benchmarks by recursively fusing information into a base state and using queries to resolve missing premises before concluding.
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
  Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.