The paper claims the first deductive formal verification of an agentic LLM framework in Dafny, proving containment guarantees for boundary policies under havoc oracle semantics independent of model alignment.
4 Pith papers cite this work.
fields
cs.AI 4
representative citing papers
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
SABA improves LLM performance on detective puzzle benchmarks by recursively fusing information into a base state and using queries to resolve missing premises before concluding.
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
citing papers explorer
- Containment Verification: AI Safety Guarantees Independent of Alignment
  The paper claims the first deductive formal verification of an agentic LLM framework in Dafny, proving containment guarantees for boundary policies under havoc oracle semantics independent of model alignment.
- Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness
  SABA improves LLM performance on detective puzzle benchmarks by recursively fusing information into a base state and using queries to resolve missing premises before concluding.