Memory Sandbox at the memory layer reduces persistent memory attack success rate to 0% for eight of nine models with no utility cost, while input-level and retrieval-level defenses achieve near-baseline attack success rates of 88-89%.
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.
Citing papers
Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents