Entropy dynamics across token positions in intermediate layers of LLMs separate jailbreak prompts from benign ones using trend-based features without extra training.
Preprints (May 2026)
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics