pith. sign in

arxiv: 2606.07335 · v1 · pith:HL3WGLCYnew · submitted 2026-06-05 · 💻 cs.CR

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

Pith reviewed 2026-06-27 21:54 UTC · model grok-4.3

classification 💻 cs.CR
keywords jailbreak detectionlarge language modelsmanifold trajectoryadversarial robustnessLLM safetyprompt neighborhood analysispseudo-malicious prompts
0
0 comments X

The pith

Jailbreak prompts follow a distinct trajectory on the LLM data manifold that can be tracked layer by layer to detect attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reliable jailbreak detection requires analyzing how a prompt's neighborhood structure evolves across the layers of an LLM rather than relying on any fixed representation space. Jailbreak prompts start near areas tied to malicious content but later move toward benign neighborhoods to bypass refusal, while truly benign prompts stay in safe neighborhoods throughout. This manifold-focused approach is shown to maintain high detection rates on four models against ten attacks, including cases where prompts contain safety keywords yet are benign in intent and cases where attackers adapt directly against the detector.

Core claim

Manifold Trajectory Kinetics (MTK) treats an LLM as a kinetic system that transforms inputs into outputs across layers and detects jailbreaks by monitoring how each prompt's neighborhood structure changes on the data manifold. Benign prompts remain close to benign neighborhoods during inference, whereas jailbreak prompts begin near malicious seeds and strategically shift toward benign neighborhoods to evade refusal. This yields a 95 percent jailbreak true positive rate at a 5 percent false positive rate on benign prompts and 2 percent on pseudo-malicious prompts, plus an 85 percent true positive rate under adaptive attacks.

What carries the argument

Manifold Trajectory Kinetics (MTK), which tracks the evolution of a prompt's neighborhood structure across LLM layers treated as a kinetic system on the data manifold.

If this is right

  • Detection stays effective against pseudo-malicious prompts that contain safety keywords but are benign by intent.
  • Performance holds when attackers explicitly optimize prompts against the deployed detector.
  • The same neighborhood-tracking approach yields strong results on vision-language models.
  • Focusing on manifold neighborhood evolution provides robustness beyond linear separability in any single fixed metric space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the trajectory pattern holds more broadly, monitoring layer-wise neighborhood changes could improve detection of other misalignment behaviors such as hallucinations.
  • The method suggests that safety mechanisms might benefit from examining internal dynamics rather than only input or output features.
  • Combining MTK with existing detectors could raise overall robustness without requiring changes to model training.

Load-bearing premise

Jailbreak prompts exhibit a characteristic trajectory on the data manifold that begins near malicious seeds and later shifts toward benign neighborhoods.

What would settle it

An observation of jailbreak prompts whose neighborhood trajectories do not begin near malicious seeds and shift toward benign areas, or of benign prompts that do follow such a shift, would falsify the detection claim.

Figures

Figures reproduced from arXiv: 2606.07335 by Hangtao Zhang, Leo Yu Zhang, Minghui Li, Shengshan Hu, Sishun Liu, Wei Wan, Yanjun Zhang, Yi Liu, Yucheng Zhao, Zeyu Ye, Ziqi Zhou.

Figure 1
Figure 1. Figure 1: Four prompt scenarios are considered: benign [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Layer-wise boxplots of distance differences ( [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Benign and jailbreak prompts’ rank distributions to [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The framework of our MTK consists of three phases: Activation Trajectory Extraction (Sec. 5.3), Manifold Trajectory Generation (Sec. 5.4), and Kinetics Anomaly Detection (Sec. 5.5). The benign-neighbor rank at layer ℓ is r (t) ℓ = min i:h (i) ℓ ∈B  k [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) UMAP projection of rank information. (b) Aver [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation studies. The effect of k, nestimators, maxsamples, the distance metric D, and the benign-to-malicious ratio [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Effect of LLM layer selection. supporting a layer-reduced variant of MTK that roughly halves computational overhead while retaining competitive accuracy. Effect of anomaly detection algorithm. We replace the Isola￾tion Forest (IF) with PCA and one-class SVM (1-SVM), two common anomaly detectors. Tab. 7c shows consistently strong performance across choices, suggesting MTK is broadly ap￾plicable. The UMAP in… view at source ↗
read the original abstract

Jailbreak prompts can bypass alignment guardrails in large language models (LLMs) and elicit unsafe outputs, making reliable deployment-time detection critical. Prior detection approaches largely rely on a fixed metric space, e.g., raw inputs, gradients, or hidden features, in which benign and jailbreak prompts are linearly separable. We show this assumption breaks under (i) pseudo-malicious prompts that are benign by intent but contain safety-related keywords, and (ii) adaptive attacks that explicitly optimize against the deployed detector. To overcome this limitation, we shift our focus from identifying a universal metric space to analyzing the more robust neighborhood structure of the underlying data manifold. We present Manifold Trajectory Kinetics (MTK), which treats an LLM as a kinetic system transforming inputs into outputs and detects jailbreaks by tracking how a prompt's neighborhood structure evolves across layers. Benign prompts remain close to benign neighborhoods throughout inference, whereas jailbreak prompts exhibit a characteristic trajectory that begins near malicious seeds and later strategically shifts toward benign neighborhoods to evade refusal.Across four LLMs and ten jailbreak attacks, MTK achieves strong robustness to both failure modes: on pseudo-malicious prompts, it attains a jailbreak true positive rate of 95% at a false positive rate of 5% on benign prompts and 2% on pseudo-malicious prompts, and under adaptive attacks, it maintains a true positive rate of 85%. We further demonstrate the superior performance of MTK for jailbreak detection in vision-language models. Our code is available at https://github.com/Rookie143/mtk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Manifold Trajectory Kinetics (MTK) as a jailbreak detection method for LLMs. It argues that fixed-metric approaches (raw inputs, gradients, hidden features) fail under pseudo-malicious prompts and adaptive attacks because they assume linear separability. Instead, MTK models the LLM as a kinetic system and tracks evolution of neighborhood structure on the data manifold across layers: benign prompts stay near benign neighborhoods while jailbreaks start near malicious seeds and later shift toward benign regions. Empirical results across four LLMs and ten attacks report 95% TPR at 5% FPR (benign) / 2% FPR (pseudo-malicious) and 85% TPR under adaptive attacks; the method is also tested on vision-language models. Code is released.

Significance. If the reported robustness holds under rigorous controls, MTK would address a recognized weakness in current detection methods by leveraging manifold geometry rather than a single fixed representation. The multi-model, multi-attack evaluation and code release strengthen reproducibility and allow direct comparison with prior detectors.

major comments (2)
  1. [Experimental methods / §4] Experimental methods (likely §4 or §5): the abstract states concrete metrics (95% TPR at 5%/2% FPR, 85% under adaptive attacks) but provides no information on data splits, generation of the pseudo-malicious prompt set, exact implementation of the ten jailbreak attacks, or how adaptive attacks were optimized against MTK itself. Without these controls the central robustness claim cannot be evaluated.
  2. [MTK definition / §3] Definition of MTK (likely §3): the core claim rests on tracking 'neighborhood structure' evolution, yet no formal definition is given for neighborhood radius, distance metric on the manifold, or how layer-wise trajectories are aggregated into a detection score. This leaves open whether the reported performance depends on post-hoc tuning of these choices.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'strategically shifts toward benign neighborhoods to evade refusal' attributes intent; rephrase in terms of observed trajectory statistics.
  2. [Results] The manuscript should include a table listing the four LLMs, ten attacks, and exact hyper-parameters used for MTK (layer indices, neighborhood size, etc.).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The two major comments identify areas where additional clarity is needed to support the robustness claims. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experimental methods / §4] Experimental methods (likely §4 or §5): the abstract states concrete metrics (95% TPR at 5%/2% FPR, 85% under adaptive attacks) but provides no information on data splits, generation of the pseudo-malicious prompt set, exact implementation of the ten jailbreak attacks, or how adaptive attacks were optimized against MTK itself. Without these controls the central robustness claim cannot be evaluated.

    Authors: We agree that the experimental protocol requires fuller documentation for reproducibility. In the revised manuscript we will expand §4 with: (i) explicit train/validation/test splits and sampling strategy for the benign, malicious, and pseudo-malicious sets; (ii) the exact template and keyword-selection procedure used to construct the pseudo-malicious prompts; (iii) parameter settings, prompt templates, and success criteria for each of the ten jailbreak attacks; and (iv) the precise optimization procedure (loss, learning rate, number of iterations, and surrogate model) employed when generating adaptive attacks against MTK. These additions will be placed in the main text or a dedicated appendix so that the reported TPR/FPR figures can be independently verified. revision: yes

  2. Referee: [MTK definition / §3] Definition of MTK (likely §3): the core claim rests on tracking 'neighborhood structure' evolution, yet no formal definition is given for neighborhood radius, distance metric on the manifold, or how layer-wise trajectories are aggregated into a detection score. This leaves open whether the reported performance depends on post-hoc tuning of these choices.

    Authors: We acknowledge that §3 would benefit from explicit mathematical notation. In the revision we will define: the neighborhood radius r, the manifold distance d(·,·) (cosine similarity on layer activations after PCA projection), the per-layer neighborhood overlap statistic, and the aggregation function that produces the final kinetic score (a weighted sum of layer-wise changes). We will also add a short sensitivity analysis showing that detection performance remains stable across a range of r values and aggregation weights, thereby addressing the concern about post-hoc tuning. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes an empirical detection method (MTK) based on tracking neighborhood evolution on the data manifold across LLM layers, validated through concrete performance metrics on four LLMs and ten attacks (95% TPR at 5%/2% FPR, 85% under adaptive attacks). No mathematical derivation, first-principles prediction, or fitted parameter is presented that reduces by construction to its own inputs or self-citations. The central claims rest on observed empirical behavior rather than any self-definitional, uniqueness-imported, or ansatz-smuggled steps. The approach is self-contained against external benchmarks via reported results and released code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters, invented entities, or non-standard axioms are described. The approach relies on the domain assumption that LLMs act as kinetic systems whose layer-wise neighborhood structures carry detectable signals.

axioms (1)
  • domain assumption LLMs can be treated as kinetic systems transforming inputs into outputs whose neighborhood structures evolve in a detectable way across layers
    Invoked to justify shifting focus from fixed metric spaces to trajectory analysis.

pith-pipeline@v0.9.1-grok · 5847 in / 1246 out tokens · 33744 ms · 2026-06-27T21:54:11.639716+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

102 extracted references · 46 canonical work pages · 18 internal anchors

  1. [1]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 techni- cal report,”arXiv preprint arXiv:2505.09388, 2025. 14

  2. [2]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Alma- hairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine- tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

  3. [3]

    Jailbroken: How does llm safety training fail?

    A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?” inProceedings of the Advances in Neural Information Processing Sys- tems (NeurIPS’23), vol. 36, 2023

  4. [4]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,”arXiv preprint arXiv:2307.15043, 2023

  5. [5]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks,

    M. Andriushchenko, F. Croce, and N. Flammarion, “Jailbreaking leading safety-aligned llms with simple adaptive attacks,”arXiv preprint arXiv:2404.02151, 2024

  6. [6]

    Artprompt: Ascii art-based jailbreak attacks against aligned llms,

    F. Jiang, Z. Xu, L. Niu, Z. Xiang, B. Ramasubramanian, B. Li, and R. Poovendran, “Artprompt: Ascii art-based jailbreak attacks against aligned llms,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 15 157–15 173

  7. [7]

    Fake alignment: Are llms really aligned well?

    Y . Wang, Y . Teng, K. Huang, C. Lyu, S. Zhang, W. Zhang, X. Ma, Y .-G. Jiang, Y . Qiao, and Y . Wang, “Fake alignment: Are llms really aligned well?” inPro- ceedings of the 2024 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024, pp. 4696–4712

  8. [8]

    Defending against alignment-breaking attacks via robustly aligned llm,

    B. Cao, Y . Cao, L. Lin, and J. Chen, “Defending against alignment-breaking attacks via robustly aligned llm,” inProceedings of the 62nd Annual Meeting of the Asso- ciation for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 10 542–10 560

  9. [9]

    Hiddendetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states,

    Y . Jiang, X. Gao, T. Peng, Y . Tan, X. Zhu, B. Zheng, and X. Yue, “Hiddendetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states,”CoRR, vol. abs/2502.14744, 2025

  10. [10]

    Jbshield: Defending large language models from jailbreak at- tacks through activated concept analysis and manipula- tion,

    S. Zhang, Y . Zhai, K. Guo, H. Hu, S. Guo, Z. Fang, L. Zhao, C. Shen, C. Wang, and Q. Wang, “Jbshield: Defending large language models from jailbreak at- tacks through activated concept analysis and manipula- tion,”arXiv preprint arXiv:2502.07557, 2025

  11. [11]

    Learning safety constraints for large language models,

    X. Chen, Y . As, and A. Krause, “Learning safety constraints for large language models,”CoRR, vol. abs/2505.24445, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.24445

  12. [12]

    Learn- ing to detect unknown jailbreak attacks in large vision- language models,

    S. Liang, Z. Xu, J. Tao, H. Xue, and X. Wang, “Learn- ing to detect unknown jailbreak attacks in large vision- language models,”arXiv preprint arXiv:2508.09201, 2025

  13. [13]

    Efficient detection of toxic prompts in large language models,

    Y . Liu, J. Yu, H. Sun, L. Shi, G. Deng, Y . Chen, and Y . Liu, “Efficient detection of toxic prompts in large language models,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE 2024, V . Filkov, B. Ray, and M. Zhou, Eds., 2024, pp. 455–467

  14. [14]

    Investigating coverage criteria in large language models: An in-depth study through jailbreak attacks,

    S. Zhou, T. Li, K. Wang, Y . Huang, L. Shi, Y . Liu, and H. Wang, “Investigating coverage criteria in large language models: An in-depth study through jailbreak attacks,”arXiv e-prints, pp. arXiv–2408, 2024

  15. [15]

    Hsf: Defend- ing against jailbreak attacks with hidden state filtering,

    C. Qian, H. Zhang, L. Sha, and Z. Zheng, “Hsf: Defend- ing against jailbreak attacks with hidden state filtering,” inProceedings of the ACM on Web Conference 2025, 2025, pp. 2078–2087

  16. [16]

    Defending against jailbreak through early exit generation of large lan- guage models,

    C. Zhao, Z. Dou, and K. Huang, “Defending against jailbreak through early exit generation of large lan- guage models,” inProceedings of the International Conference on Neural Information Processing, 2025, pp. 532–546

  17. [17]

    Jaildam: Jailbreak detection with adaptive memory for vision-language model,

    Y . Nian, S. Zhu, Y . Qin, L. Li, Z. Wang, C. Xiao, and Y . Zhao, “Jaildam: Jailbreak detection with adaptive memory for vision-language model,”arXiv preprint arXiv:2504.03770, 2025

  18. [18]

    Detecting Language Model Attacks with Perplexity

    G. Alon and M. Kamfonas, “Detecting language model attacks with perplexity,”arXiv preprint arXiv:2308.14132, 2023

  19. [19]

    Gradient cuff: De- tecting jailbreak attacks on large language models by exploring refusal loss landscapes,

    X. Hu, P.-Y . Chen, and T.-Y . Ho, “Gradient cuff: De- tecting jailbreak attacks on large language models by exploring refusal loss landscapes,”Advances in Neural Information Processing Systems, vol. 37, pp. 126 265– 126 296, 2024

  20. [20]

    Gradsafe: De- tecting jailbreak prompts for llms via safety-critical gradient analysis,

    Y . Xie, M. Fang, R. Pi, and N. Gong, “Gradsafe: De- tecting jailbreak prompts for llms via safety-critical gradient analysis,”arXiv preprint arXiv:2402.13494, 2024

  21. [21]

    Safequant: Llm safety analysis via quantized gradient inspection,

    S. Padakandla, S. Babar, M. Kaulet al., “Safequant: Llm safety analysis via quantized gradient inspection,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Compu- tational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 2522–2536

  22. [22]

    Smoothllm: Defending large language models against jailbreaking attacks,

    A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending large language models against jailbreaking attacks,”Trans. Mach. Learn. Res., 2025. 15

  23. [23]

    Jailguard: A uni- versal detection framework for prompt-based attacks on llm systems,

    X. Zhang, C. Zhang, T. Li, Y . Huang, X. Jia, M. Hu, J. Zhang, Y . Liu, S. Ma, and C. Shen, “Jailguard: A uni- versal detection framework for prompt-based attacks on llm systems,”ACM Transactions on Software Engi- neering and Methodology, 2025

  24. [24]

    LLM self defense: By self examination, llms know they are being tricked,

    M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau, “LLM self defense: By self examination, llms know they are being tricked,” in Proceedings of the Second Tiny Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, May 11, 2024. OpenReview.net, 2024

  25. [25]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y . Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testug- gineet al., “Llama guard: Llm-based input-output safeguard for human-ai conversations,”arXiv preprint arXiv:2312.06674, 2023

  26. [26]

    Selfdefend: Llms can defend themselves against jailbreaking in a practical manner,

    X. Wang, D. Wu, Z. Ji, Z. Li, P. Ma, S. Wang, Y . Li, Y . Liu, N. Liu, and J. Rahmel, “Selfdefend: Llms can defend themselves against jailbreaking in a practical manner,” inProceedings of the 34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 2441– 2460

  27. [27]

    Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

    M. Sharma, M. Tong, J. Mu, J. Wei, J. Kruthoff, S. Goodfriend, E. Ong, A. Peng, R. Agarwal, C. Anil et al., “Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming,”arXiv preprint arXiv:2501.18837, 2025

  28. [28]

    Manifold learning: what, how, and why,

    M. Meila and H. Zhang, “Manifold learning: what, how, and why,”CoRR, vol. abs/2311.03757, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2311.03757

  29. [29]

    Shaping the safety boundaries: Understanding and defending against jailbreaks in large language models,

    L. Gao, J. Geng, X. Zhang, P. Nakov, and X. Chen, “Shaping the safety boundaries: Understanding and defending against jailbreaks in large language models,” arXiv preprint arXiv:2412.17034, 2024

  30. [30]

    Isolation for- est,

    F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation for- est,” inProceedings of the International Conference on Data Mining. IEEE, 2008, pp. 413–422

  31. [31]

    Tro- janrobot: Physical-world backdoor attacks against vlm-based robotic manipulation,

    X. Wang, H. Pan, H. Zhang, M. Li, S. Hu, Z. Zhou, L. Xue, A. Liu, Y . Jiang, L. Y . Zhanget al., “Tro- janrobot: Physical-world backdoor attacks against vlm-based robotic manipulation,”arXiv preprint arXiv:2411.11683, 2024

  32. [32]

    Advedm: Fine- grained adversarial attack against vlm-based embodied agents,

    Y . Wang, H. Zhang, H. Pan, Z. Zhou, X. Wang, P. Guo, L. Xue, S. Hu, M. Li, and L. Y . Zhang, “Advedm: Fine- grained adversarial attack against vlm-based embodied agents,”Advances in Neural Information Processing Systems, vol. 38, pp. 136 551–136 575, 2026

  33. [33]

    Don’t listen to me: Understanding and ex- ploring jailbreak prompts of large language models,

    Z. Yu, X. Liu, S. Liang, Z. Cameron, C. Xiao, and N. Zhang, “Don’t listen to me: Understanding and ex- ploring jailbreak prompts of large language models,” inProceedings of the USENIX Security Symposium (USENIX Security’24), 2024

  34. [34]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    X. Liu, N. Xu, M. Chen, and C. Xiao, “Autodan: Gen- erating stealthy jailbreak prompts on aligned large language models,”arXiv preprint arXiv:2310.04451, 2023

  35. [35]

    Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers,

    X. Li, R. Wang, M. Cheng, T. Zhou, and C.-J. Hsieh, “Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers,”arXiv preprint arXiv:2402.16914, 2024

  36. [36]

    "do anything now

    X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, “"do anything now": Characterizing and evaluating in- the-wild jailbreak prompts on large language models,” inProceedings of the 2024 on ACM SIGSAC Con- ference on Computer and Communications Security, 2024, pp. 1671–1685

  37. [37]

    Jailjudge: A comprehensive jailbreak judge bench- mark with multi-agent enhanced explanation evalu- ation framework,

    F. Liu, Y . Feng, Z. Xu, L. Su, X. Ma, D. Yin, and H. Liu, “Jailjudge: A comprehensive jailbreak judge bench- mark with multi-agent enhanced explanation evalu- ation framework,”arXiv preprint arXiv:2410.12855, 2024

  38. [38]

    Dan is my new friend

    walkerspider, “Dan is my new friend.” https://old.reddit.com/r/ChatGPT/comments/zlcyr9/ dan_is_my_new_friend/, 2022, reddit post; accessed 2025-10-21

  39. [39]

    "the new jailbreak is so fun

    R. Semenov, “"the new jailbreak is so fun.",” Twit- ter post, https://twitter.com/semenov_roman_/status/ 1621465137025613825, 2023, accessed: 2025-10-21

  40. [40]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,” 2023. [Online]. Available: https://arxiv.org/abs/2303.08774

  41. [41]

    Constitutional AI: Harmlessness from AI Feedback

    Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McK- innonet al., “Constitutional ai: Harmlessness from ai feedback,”arXiv preprint arXiv:2212.08073, 2022

  42. [42]

    Training language models to follow in- structions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wain- wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow in- structions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730– 27 744, 2022

  43. [43]

    Another jailbreak for gpt4: Talk to it in morse code

    B. Boaz, “Another jailbreak for gpt4: Talk to it in morse code.” Twitter post, https://twitter.com/boazbaraktcs/ status/1637657623100096513, 2023, accessed: 2025- 10-21. 16

  44. [44]

    Exploiting programmatic behavior of llms: Dual-use through standard security attacks,

    D. Kang, X. Li, I. Stoica, C. Guestrin, M. Zaharia, and T. Hashimoto, “Exploiting programmatic behavior of llms: Dual-use through standard security attacks,” in Proceedings of the 2024 IEEE Security and Privacy Workshops (SPW). IEEE, 2024, pp. 132–143

  45. [45]

    Badrobot: Jail- breaking embodied llm agents in the physical world,

    H. Zhang, C. Zhu, X. Wang, Z. Zhou, C. Yin, M. Li, L. Xue, Y . Wang, S. Hu, A. Liuet al., “Badrobot: Jail- breaking embodied llm agents in the physical world,” inProceedings of the Thirteenth International Confer- ence on Learning Representations, 2025

  46. [46]

    Jailbreaking black box large lan- guage models in twenty queries,

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pap- pas, and E. Wong, “Jailbreaking black box large lan- guage models in twenty queries,” inProceedings of the 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 2025, pp. 23–42

  47. [47]

    Jailbreaking llm-controlled robots,

    A. Robey, Z. Ravichandran, V . Kumar, H. Hassani, and G. J. Pappas, “Jailbreaking llm-controlled robots,” in Proceedings of the 2025 IEEE International Confer- ence on Robotics and Automation (ICRA). IEEE, 2025, pp. 11 948–11 956

  48. [48]

    Safety layers in aligned large language models: The key to llm security,

    S. Li, L. Yao, L. Zhang, and Y . Li, “Safety layers in aligned large language models: The key to llm security,” arXiv preprint arXiv:2408.17003, 2024

  49. [49]

    Pku-saferlhf: Towards multi-level safety alignment for llms with hu- man preference,

    J. Ji, D. Hong, B. Zhang, B. Chen, J. Dai, B. Zheng, T. A. Qiu, J. Zhou, K. Wang, B. Liet al., “Pku-saferlhf: Towards multi-level safety alignment for llms with hu- man preference,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), 2025, pp. 31 983–32 016

  50. [50]

    Or- bench: An over-refusal benchmark for large language models,

    J. Cui, W.-L. Chiang, I. Stoica, and C.-J. Hsieh, “Or- bench: An over-refusal benchmark for large language models,” inProceedings of the International Confer- ence on Machine Learning, 2024

  51. [51]

    Just enough shifts: Mitigating over-refusal in aligned lan- guage models with targeted representation fine-tuning,

    M. Dabas, S. Chen, C. Fleming, M. Jin, and R. Jia, “Just enough shifts: Mitigating over-refusal in aligned lan- guage models with targeted representation fine-tuning,” arXiv preprint arXiv:2507.04250, 2025

  52. [52]

    Designing a trade- off between usability and security: a metrics based- model,

    C. Braz, A. Seffah, and D. M’Raihi, “Designing a trade- off between usability and security: a metrics based- model,” inProceedings of the IFIP Conference on human-computer interaction. Springer, 2007, pp. 114– 126

  53. [53]

    Jailbreak antidote: Runtime safety-utility balance via sparse representation adjustment in large language models,

    G. Shen, D. Zhao, Y . Dong, X. He, and Y . Zeng, “Jailbreak antidote: Runtime safety-utility balance via sparse representation adjustment in large language models,”arXiv preprint arXiv:2410.02298, 2024

  54. [54]

    Xstest: A test suite for identifying ex- aggerated safety behaviours in large language mod- els,

    P. Röttger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “Xstest: A test suite for identifying ex- aggerated safety behaviours in large language mod- els,” inProceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024, pp. 5377–5400

  55. [55]

    Introducing the model spec — openai,

    OpenAI, “Introducing the model spec — openai,” https: //www.openai.com, 2024, accessed: 2024-12-02

  56. [56]

    Free dolly: Introducing the world’s first truly open in- structiontuned llm,

    M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin, “Free dolly: Introducing the world’s first truly open in- structiontuned llm,” 2023

  57. [57]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/

  58. [58]

    Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

    P. Hua, H. Li, S. Shi, Z. Yu, and N. Zhang, “Rethink- ing jailbreak detection of large vision language mod- els with representational contrastive scoring,”arXiv preprint arXiv:2512.12069, 2025

  59. [59]

    Can’t see the forest for the trees: Benchmarking multimodal safety awareness for multi- modal llms,

    W. Wang, X. Liu, K. Gao, J.-t. Huang, Y . Yuan, P. He, S. Wang, and Z. Tu, “Can’t see the forest for the trees: Benchmarking multimodal safety awareness for multi- modal llms,”arXiv preprint arXiv:2502.11184, 2025

  60. [60]

    Usb: A com- prehensive and unified safety evaluation benchmark for multimodal large language models,

    B. Zheng, G. Chen, H. Zhong, Q. Teng, Y . Tan, Z. Liu, W. Wang, J. Liu, J. Yang, H. Jinget al., “Usb: A com- prehensive and unified safety evaluation benchmark for multimodal large language models,”arXiv preprint arXiv:2505.23793, 2025

  61. [61]

    Euclidean distance mapping,

    P.-E. Danielsson, “Euclidean distance mapping,”Com- puter Graphics and image processing, vol. 14, no. 3, pp. 227–248, 1980

  62. [62]

    Improved large language model jailbreak detection via pretrained embeddings,

    E. Galinkin and M. Sablotny, “Improved large language model jailbreak detection via pretrained embeddings,” arXiv preprint arXiv:2412.01547, 2024

  63. [63]

    Llms encode harmfulness and refusal separately,

    J. Zhao, J. Huang, Z. Wu, D. Bau, and W. Shi, “Llms encode harmfulness and refusal separately,”arXiv preprint arXiv:2507.11878, 2025

  64. [64]

    Unlearnable 3d point clouds: Class-wise transformation is all you need,

    X. Wang, M. Li, W. Liu, H. Zhang, S. Hu, Y . Zhang, Z. Zhou, and H. Jin, “Unlearnable 3d point clouds: Class-wise transformation is all you need,”Advances in Neural Information Processing Systems, vol. 37, pp. 99 404–99 432, 2024. 17

  65. [65]

    Improv- ing one-class svm for anomaly detection,

    K.-L. Li, H.-K. Huang, S.-F. Tian, and W. Xu, “Improv- ing one-class svm for anomaly detection,” inProceed- ings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.03EX693), vol. 5, 2003, pp. 3077–3081 V ol.5

  66. [66]

    Principal com- ponents analysis (pca),

    A. Ma ´ckiewicz and W. Ratajczak, “Principal com- ponents analysis (pca),”Computers & Geosciences, vol. 19, no. 3, pp. 303–342, 1993

  67. [67]

    Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation,

    Z. Lin, Z. Wang, Y . Tong, Y . Wang, Y . Guo, Y . Wang, and J. Shang, “Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation,” arXiv preprint arXiv:2310.17389, 2023

  68. [68]

    Meta llama 3: The most capable openly available generative ai models,

    Meta AI, “Meta llama 3: The most capable openly available generative ai models,” https://ai.meta.com/ blog/meta-llama-3/, 2024, official technical release of Llama 3 models (including 8B Instruct)

  69. [69]

    Mixtral of Experts

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressandet al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024

  70. [70]

    How johnny can persuade llms to jailbreak them: Re- thinking persuasion to challenge ai safety by humaniz- ing llms,

    Y . Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi, “How johnny can persuade llms to jailbreak them: Re- thinking persuasion to challenge ai safety by humaniz- ing llms,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), 2024, pp. 14 322–14 350

  71. [71]

    Tree of attacks: Jailbreaking black-box llms automatically,

    A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nel- son, H. Anderson, Y . Singer, and A. Karbasi, “Tree of attacks: Jailbreaking black-box llms automatically,” Advances in Neural Information Processing Systems, vol. 37, pp. 61 065–61 105, 2024

  72. [72]

    Low-Resource Languages Jailbreak GPT-4

    Z.-X. Yong, C. Menghini, and S. H. Bach, “Low- resource languages jailbreak gpt-4,”arXiv preprint arXiv:2310.02446, 2023

  73. [73]

    Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation,

    Y . Gou, K. Chen, Z. Liu, L. Hong, H. Xu, Z. Li, D.- Y . Yeung, J. T. Kwok, and Y . Zhang, “Eyes closed, safety on: Protecting multimodal llms via image-to-text transformation,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 388–404

  74. [74]

    MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

    S. Fares, K. Ziu, T. Aremu, N. Durasov, M. Taká ˇc, P. Fua, K. Nandakumar, and I. Laptev, “Mirrorcheck: Efficient adversarial defense for vision-language mod- els,”arXiv preprint arXiv:2406.09250, 2024

  75. [75]

    Cross-modality information check for detecting jailbreaking in mul- timodal large language models,

    Y . Xu, X. Qi, Z. Qin, and W. Wang, “Cross-modality information check for detecting jailbreaking in mul- timodal large language models,”arXiv preprint arXiv:2407.21659, 2024

  76. [76]

    Stanford alpaca: An instruction-following llama model,

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” 2023

  77. [77]

    Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

    Y . Huang, S. Gupta, M. Xia, K. Li, and D. Chen, “Catas- trophic jailbreak of open-source llms via exploiting generation,”arXiv preprint arXiv:2310.06987, 2023

  78. [78]

    Re- visiting the assumption of latent separability for back- door defenses,

    X. Qi, T. Xie, Y . Li, S. Mahloujifar, and P. Mittal, “Re- visiting the assumption of latent separability for back- door defenses,” inProceedings of the eleventh Interna- tional Conference on Learning Representations, 2023

  79. [79]

    Mm-safetybench: A benchmark for safety evaluation of multimodal large language models,

    X. Liu, Y . Zhu, J. Gu, Y . Lan, C. Yang, and Y . Qiao, “Mm-safetybench: A benchmark for safety evaluation of multimodal large language models,” inEuropean Conference on Computer Vision, 2024, pp. 386–403

  80. [80]

    Figstep: Jailbreaking large vision-language models via typographic visual prompts,

    Y . Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, “Figstep: Jailbreaking large vision-language models via typographic visual prompts,” inProceedings of the AAAI Conference on Artificial Intelligence, 2025, pp. 23 951–23 959

Showing first 80 references.