MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under adaptive attacks.
Manifold learning: what, how, and why,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CR 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under adaptive attacks.