PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
LLM Jailbreak Detection for (Almost) Free! , url=
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
years
2026 3representative citing papers
Entropy dynamics across token positions in intermediate layers of LLMs separate jailbreak prompts from benign ones using trend-based features without extra training.
Introduces MOOD benchmark for OOD LLM alignment failures and shows guard models plus Mahalanobis and perplexity OOD detectors improve recall from 39% to 45% with positive scaling.
citing papers explorer
-
What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics
Entropy dynamics across token positions in intermediate layers of LLMs separate jailbreak prompts from benign ones using trend-based features without extra training.