Adversarial restlessness in LLM activations allows five scalar features to detect multi-turn prompt injections at 93.8% accuracy on synthetic data, with cross-model replication but source-dependent generalization to real-world chats.
Retraining requires no GPU, only cached activations and the XGBoost fit (<30 s on CPU for 20,000+ turns).
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection