Applying Memory Sandbox at the memory layer reduces the persistent memory attack success rate to 0% for eight of nine models with no utility cost, whereas input-level and retrieval-level defenses leave attack success rates near baseline, at 88-89%.
and Teglia, Y
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6
roles
background 2
polarities
background 2
representative citing papers
RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
CAREATTACK adapts closed-form parameter editing with graph-based conflict resolution and lightweight anchor repair to promote malicious passages in RAG retrieval while limiting side effects on non-target queries.
RADAR defends RAG systems in dynamic settings by framing reliable context selection as a Max-Flow Min-Cut graph problem with Bayesian memory updates, claiming superior robustness, response quality, and low storage on a new dynamic dataset.
A survey that maps safety risks in personalized LLMs, introduces a unified taxonomy, and highlights three structural inadequacies in existing research on user-invariant safety, isolated techniques, and short-term evaluations.
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
citing papers explorer
- Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs