LLM Jailbreak Detection for (Almost) Free!
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
citing papers explorer
- PRISM: Recovering Instruction Sets from Language Model Activations
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
- What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics
Entropy dynamics across token positions in intermediate layers of LLMs separate jailbreak prompts from benign ones using trend-based features without extra training.
- Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs