Adaptive Prompt Embedding Optimization for LLM Jailbreaking
Pith reviewed 2026-05-08 03:24 UTC · model grok-4.3
The pith
Directly optimizing prompt token embeddings enables stronger jailbreaks against aligned LLMs without changing the visible prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prompt Embedding Optimization (PEO) performs gradient-based optimization in the continuous embedding space of the original prompt tokens, using structured continuation targets and an adaptive failure-focused schedule across multiple rounds. The resulting embeddings lie close enough to their starting points that nearest-token projection restores the exact original prompt string, yet the approach yields higher attack success rates than competing white-box methods that rely on appended discrete suffixes or search-based generation.
What carries the argument
Prompt Embedding Optimization (PEO), a gradient-driven process that directly perturbs the embeddings of existing prompt tokens in continuous space rather than appending new adversarial tokens.
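The mechanics can be made concrete with a toy sketch: gradient descent on copies of the prompt's embedding vectors, never on the token ids themselves. The two-dimensional vocabulary, quadratic loss, and per-position continuation targets below are illustrative assumptions, not the paper's actual model or objective.

```python
# Toy sketch of PEO-style optimization (illustrative; not the paper's code).
# A prompt is a list of token ids; each id maps to an embedding vector.
# We nudge copies of the *embeddings* down the gradient of a loss -- here a
# quadratic pull toward per-position "continuation targets".

VOCAB = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}

def optimize_embeddings(token_ids, targets, lr=0.05, steps=20):
    """Gradient descent on the prompt embeddings.

    loss = sum_i ||e_i - t_i||^2, so d(loss)/d(e_i) = 2 (e_i - t_i).
    A real attack would backpropagate a language-model loss instead.
    """
    embs = [list(VOCAB[t]) for t in token_ids]  # perturbable copies
    for _ in range(steps):
        for e, t in zip(embs, targets):
            for d in range(len(e)):
                e[d] -= lr * 2.0 * (e[d] - t[d])
    return embs

def nearest_token(vec):
    """Project one embedding back to the closest vocabulary id (L2)."""
    return min(VOCAB, key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(VOCAB[k], vec)))

prompt = [1, 2, 3]
# Targets chosen near each token's own embedding, so the perturbation
# stays inside that token's nearest-neighbour cell.
targets = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.9]]
optimized = optimize_embeddings(prompt, targets)
recovered = [nearest_token(e) for e in optimized]
assert recovered == prompt  # the visible prompt survives projection
```

The key property the paper claims falls out here: the embeddings move, but because they stay inside each token's nearest-neighbour cell, projection recovers the original ids exactly.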
If this is right
- PEO records higher attack success rates than discrete suffix search, appended adversarial embeddings, and search-based adversarial generation on two standard harmful-behavior benchmarks.
- The optimized embeddings remain sufficiently close to the originals that nearest-token projection recovers the exact original prompt string in every case tested.
- Quantitative checks show that model responses stay on the original topic for the large majority of prompts despite the embedding shifts.
- Later optimization rounds can incorporate heuristic composite response scaffolds that improve performance without producing outputs that are merely scaffold artifacts, according to ASR-Judge evaluation.
Where Pith is reading between the lines
- Jailbreaks could become harder to detect automatically if they leave the token sequence unchanged and only alter internal embeddings.
- Alignment training may need additional regularization in embedding space or monitoring of activation patterns rather than relying solely on token-level filters.
- Extending the same continuous optimization idea to other safety-critical tasks, such as preventing leakage of private information, could be tested by swapping the harmful targets for privacy targets.
- The approach might interact differently with models trained with explicit embedding-space safety constraints, providing a natural next experiment.
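The embedding-space monitoring idea in the second bullet can be sketched as a distance-to-vocabulary check. The vocabulary, tolerance, and squared-L2 metric here are assumptions for illustration; the paper proposes no defense, and this presumes the serving stack can inspect raw input embeddings, which is exactly the white-box setting PEO exploits.

```python
# Illustrative detector: flag an input embedding that sits measurably off
# the vocabulary manifold, i.e. farther than `tol` from every token
# embedding. Vocabulary and threshold are toy assumptions.

VOCAB_EMBS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def off_manifold(embedding, tol=1e-3):
    """True if the embedding is not (within tol) an exact vocab vector."""
    d2 = min(sum((a - b) ** 2 for a, b in zip(v, embedding))
             for v in VOCAB_EMBS)
    return d2 > tol

assert not off_manifold([1.0, 0.0])  # genuine token embedding passes
assert off_manifold([0.9, 0.1])      # a small PEO-style shift is flagged
```

Such a check is trivial when embeddings are observable; the harder open question the bullets raise is detection when only activations or outputs are visible.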
Load-bearing premise
The ASR-Judge scores and on-topic quantitative checks reflect genuine semantic preservation and real attack gains rather than artifacts created by the structured continuation targets or composite response scaffolds used in later rounds.
What would settle it
Apply PEO to a held-out set of prompts, project every final embedding to its nearest vocabulary token, and verify whether the recovered text string matches the input prompt exactly while the model still generates the targeted harmful content.
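The settlement test amounts to a round-trip check. A minimal sketch, assuming a toy character-level tokenizer and 2-D embedding table in place of a real model's components:

```python
# Round-trip verification sketch: project each optimized embedding to its
# nearest vocabulary token, detokenize, require a byte-for-byte match with
# the input, and require that a judge still labels the generation a success.
# EMBED, the character vocabulary, and the judge are illustrative stand-ins.

EMBED = {ch: [float(i), float(i % 3)] for i, ch in enumerate("abcdefgh ")}

def project_and_verify(prompt, optimized_embs, generation, is_success):
    """True iff the visible prompt survives nearest-token projection
    exactly AND the generated output still counts as an attack success."""
    recovered = "".join(
        min(EMBED, key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(EMBED[c], e)))
        for e in optimized_embs
    )
    return recovered == prompt and is_success(generation)

prompt = "bad cafe"
# Simulate perturbations small enough to stay in each token's cell.
embs = [[EMBED[c][0] + 0.1, EMBED[c][1] - 0.1] for c in prompt]
assert project_and_verify(prompt, embs, "output", lambda g: True)
```

Both conjuncts matter: exact string recovery alone would be satisfied by zero perturbation, and attack success alone by arbitrary off-manifold embeddings; the claim is that PEO achieves both simultaneously on held-out prompts.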
Original abstract
Existing white-box jailbreak attacks against aligned LLMs typically append discrete adversarial suffixes to the user prompt, which visibly alters the prompt and operates in a combinatorial token space. Prior work has avoided directly optimizing the embeddings of the original prompt tokens, presumably because perturbing them risks destroying the prompt's semantic content. We propose Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens without appending any adversarial tokens, and show that the concern is unfounded: the optimized embeddings remain close enough to their originals that the visible prompt string is preserved exactly after nearest-token projection, and quantitative analysis shows the model's responses stay on topic for the large majority of prompts. PEO combines continuous embedding-space optimization with structured continuation targets and an adaptive failure-focused schedule. Counterintuitively, later PEO rounds can benefit from heuristic composite response scaffolds that are not natural standalone templates, yet ASR-Judge shows that the resulting gains are not merely empty formatting or scaffold-only outputs. Across two standard harmful-behavior benchmarks and competing white-box attacks spanning discrete suffix search, appended adversarial embeddings, and search-based adversarial generation, PEO outperforms all of them in our experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the continuous embeddings of the original prompt tokens (without appending adversarial suffixes or tokens). It combines this with structured continuation targets and an adaptive failure-focused schedule, claiming that the optimized embeddings remain sufficiently close to originals for exact nearest-token projection (preserving the visible prompt string) while achieving higher attack success rates than prior white-box methods (discrete suffix search, appended adversarial embeddings, search-based generation) on two standard harmful-behavior benchmarks. Quantitative on-topic analysis and ASR-Judge evaluations are reported to support semantic preservation and that gains are not scaffold-only artifacts.
Significance. If the central empirical claims hold after addressing attribution concerns, the work would be significant for demonstrating that direct embedding-space optimization can yield effective, low-visibility jailbreaks without destroying prompt semantics, contrary to prior assumptions in the field. This provides a new attack vector and could inform alignment research, though the current evidence for isolating the contribution of embedding optimization is limited.
Major comments (2)
- [§4 and §3.2] §4 (Experiments) and §3.2 (Adaptive Schedule): The reported outperformance of PEO over baselines is measured under the same structured continuation targets and heuristic composite response scaffolds used during optimization. No ablation is presented that holds targets/scaffolds fixed while applying only the embedding perturbation (or that applies identical scaffolds to the discrete-suffix and search-based baselines). This makes it impossible to attribute the ASR gains specifically to the embedding optimization rather than the multi-round schedule or scaffolds, which is load-bearing for the claim that 'the concern [about semantic destruction] is unfounded' and that PEO is a distinct embedding-space attack.
- [§4.3] §4.3 (On-topic Analysis) and ASR-Judge description: The quantitative on-topic metric and ASR-Judge results are evaluated on responses generated under the same composite scaffolds and continuation targets. Without a control condition that applies the scaffolds to non-PEO prompts or measures on-topic rates for scaffold-only baselines, it remains unclear whether the reported semantic preservation and non-trivial gains are artifacts of the evaluation setup rather than properties of the optimized embeddings.
Minor comments (2)
- [§3] The abstract and §3 mention 'nearest-token projection' but do not specify the exact distance metric or projection procedure used to confirm exact string preservation; a short algorithmic description or pseudocode would improve reproducibility.
- [Table 1] Table 1 (or equivalent benchmark results table) lacks error bars, number of runs, or statistical significance tests for the reported ASR improvements; adding these would strengthen the empirical claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify limitations in how we currently isolate the contribution of embedding optimization from the other components of PEO. We address each point below and will revise the manuscript with additional ablations and controls to strengthen attribution.
Point-by-point responses
Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Adaptive Schedule): The reported outperformance of PEO over baselines is measured under the same structured continuation targets and heuristic composite response scaffolds used during optimization. No ablation is presented that holds targets/scaffolds fixed while applying only the embedding perturbation (or that applies identical scaffolds to the discrete-suffix and search-based baselines). This makes it impossible to attribute the ASR gains specifically to the embedding optimization rather than the multi-round schedule or scaffolds, which is load-bearing for the claim that 'the concern [about semantic destruction] is unfounded' and that PEO is a distinct embedding-space attack.
Authors: We agree that the current experimental design does not fully isolate the embedding optimization. The targets and scaffolds are integral to PEO's multi-round process, and the baselines follow their original formulations without them. In the revision we will add two new sets of experiments: (1) applying the identical structured continuation targets and composite scaffolds to the discrete-suffix and search-based baselines, and (2) an ablation of PEO that disables the adaptive schedule while retaining only the embedding optimization. These results will be reported alongside the existing comparisons to clarify the specific contribution of continuous embedding perturbation. revision: yes
Referee: [§4.3] §4.3 (On-topic Analysis) and ASR-Judge description: The quantitative on-topic metric and ASR-Judge results are evaluated on responses generated under the same composite scaffolds and continuation targets. Without a control condition that applies the scaffolds to non-PEO prompts or measures on-topic rates for scaffold-only baselines, it remains unclear whether the reported semantic preservation and non-trivial gains are artifacts of the evaluation setup rather than properties of the optimized embeddings.
Authors: We accept this critique. The on-topic and ASR-Judge metrics currently lack explicit scaffold-only controls. We will add, in the revised §4.3, evaluations that apply the same composite scaffolds to the original (non-optimized) prompts and to the baseline methods, reporting both on-topic rates and ASR-Judge scores for these conditions. This will demonstrate that the high on-topic rates arise from the optimized embeddings remaining sufficiently close to the originals for exact nearest-token projection, rather than from the scaffolds alone. revision: yes
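The promised scaffold-only control reduces to comparing judged attack-success rates across conditions. A minimal harness, with illustrative condition names and a toy judge standing in for ASR-Judge:

```python
# Sketch of the control comparison the rebuttal promises (illustrative;
# condition names, the judge, and the data are assumptions, not results).

def asr(responses, judge):
    """Fraction of responses the judge labels as successful attacks."""
    return sum(1 for r in responses if judge(r)) / len(responses)

def control_report(conditions, judge):
    """conditions: dict mapping condition name -> list of responses."""
    return {name: asr(resps, judge) for name, resps in conditions.items()}

# Toy judge: a response counts only if it is harmful AND more than
# scaffold formatting -- the distinction ASR-Judge is meant to enforce.
judge = lambda r: r["harmful"] and not r["scaffold_only"]

report = control_report({
    "peo": [{"harmful": True, "scaffold_only": False}] * 9
           + [{"harmful": False, "scaffold_only": False}],
    "scaffold_only": [{"harmful": True, "scaffold_only": True}] * 10,
    "baseline": [{"harmful": True, "scaffold_only": False}] * 5
                + [{"harmful": False, "scaffold_only": False}] * 5,
}, judge)
assert report["scaffold_only"] == 0.0  # scaffold artifacts score zero
assert report["peo"] > report["baseline"]
```

The point of the control is visible in the structure: if the scaffold-only condition scored near the PEO condition, the gains would be attributable to the scaffolds rather than the optimized embeddings.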
Circularity Check
No significant circularity; empirical method with external benchmarks
Full rationale
The paper proposes an empirical jailbreak technique (PEO) that optimizes prompt embeddings while using continuation targets and an adaptive schedule, then validates it via direct comparisons to prior white-box attacks on standard benchmarks. No equations, derivations, or self-referential predictions appear in the abstract or description. Performance claims rest on experimental results rather than any reduction of outputs to fitted inputs or self-citations by construction. The work is evaluated against external baselines, consistent with a circularity score of 2.0 (minor or absent circularity).
Reference graph
Works this paper leans on
[1] Alzantot, M., Sharma, Y., Elgohary, A., Ho, B.J., Srivastava, M., Chang, K.W., 2018. Generating natural language adversarial examples, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 2890–2896.
[2] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al., 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
[3] Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramèr, F., Hassani, H., Wong, E., 2024. JailbreakBench: An open robustness benchmark for jailbreaking large language models, in: Advances in Neural Information Processing Systems.
[4] Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E., 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
[5] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P., 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/.
[6] Ebrahimi, J., Rao, A., Lowd, D., Dou, D., 2018. HotFlip: White-box adversarial examples for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics. pp. 31–36.
[7] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al., 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[8] Gray Swan AI, 2024. nanoGCG. https://github.com/GraySwanAI/nanoGCG.
[9] Iyyer, M., Wieting, J., Gimpel, K., Zettlemoyer, L., 2018. Adversarial example generation with syntactically controlled paraphrase networks, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics. pp. 1875–1885.
[10] Liang, H., Sun, Y., Cai, Y., Zhu, J., Zhang, B., 2025. Jailbreaking LLMs' safeguard with universal magic words for text embedding models. arXiv preprint arXiv:2501.18280.
[11] Liu, F., Feng, Y., Xu, Z., Su, L., Ma, X., Yin, D., Liu, H., 2024a. JailJudge: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework. arXiv preprint arXiv:2410.12855.
[12] Liu, X., Xu, N., Chen, M., Xiao, C., 2024b. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models, in: The Twelfth International Conference on Learning Representations.
[13] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., Hendrycks, D., 2024. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.
[14] Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A., 2024. Tree of attacks: Jailbreaking black-box LLMs automatically, in: Advances in Neural Information Processing Systems.
[15] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al., 2022. Training language models to follow instructions with human feedback, in: Advances in Neural Information Processing Systems, Curran Associates. pp. 27730–27744.
[16] Peng, J., Tang, Z., Liu, G., Fleming, C., Hong, M., 2024. Safeguarding text-to-image generation via inference-time prompt-noise optimization. arXiv preprint arXiv:2412.03876.
[17] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D., Finn, C., 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
[18] Robey, A., Wong, E., Hassani, H., Pappas, G.J., 2024. SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
[19] Sadasivan, V.S., Saha, S., Sriramanan, G., Kattakinda, P., Chegini, A., Feizi, S., 2024. Fast adversarial attacks on language models in one GPU minute, in: Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR.
[20] Samvelyan, M., Raparthy, S.C., Lupu, A., Hambro, E., Markosyan, A.H., Bhatt, M., Tian, Y., Jiang, E., Raileanu, R., Rocktäschel, T., Whiteson, S., 2024. Rainbow teaming: Open-ended generation of diverse adversarial prompts. arXiv preprint arXiv:2402.16822.
[21] Schwinn, L., Dobre, D., Xhonneux, S., Gidel, G., Gunnemann, S., 2024. Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space, in: Advances in Neural Information Processing Systems.
[22] Shin, T., Razeghi, Y., Logan IV, R.L., Wallace, E., Singh, S., 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics. pp. 4222–4235.
[23] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al., 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
[24] Wang, Y., Wang, P., Xi, C., Tang, B., Zhu, J., Wei, W., Chen, C., Yang, C., Zhang, J., Lu, C., Niu, Y., Mao, K., Li, Z., Xiong, F., Hu, J., Yang, M., 2025. Adversarial preference learning for robust LLM alignment, in: Findings of the Association for Computational Linguistics: ACL 2025, Association for Computational Linguistics.
[25] Wen, Y., Jain, N., Kirchenbauer, J., Goldblum, M., Geiping, J., Goldstein, T., 2023. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668.
[26] Xu, Z., Liu, Y., Deng, G., Wang, K., Li, Y., Shi, L., Picek, S., 2024. Continuous embedding attacks via clipped inputs in jailbreaking large language models. arXiv preprint arXiv:2407.13796.
[27] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al., 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[28] Zhu, S., Amos, B., Tian, Y., Guo, C., Evtimov, I., 2025. AdvPrefix: An objective for nuanced LLM jailbreaks, in: Advances in Neural Information Processing Systems, Curran Associates.
[29] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M., 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
Appendix excerpt (A. Judge Pipeline and Agreement)
ASR-Judge uses two judges: GPT-5.4 and Claude Opus 4.6. We classify each final response independently with both judges; a response is coun...
... uses a single fine-tuned Llama-2-13B classifier, and JailbreakBench [3] likewise chooses a single default judge, Llama-3-70B, specifically because of its strong agreement with experts and relatively low false-positive rate. JAILJUDGE [11] goes further and proposes a multi-agent evaluation framework with belief fusion rather than a single binary judge. None of these t...