Recognition: no theorem link
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
Pith reviewed 2026-05-12 03:32 UTC · model grok-4.3
The pith
Maximizing entropy at high-entropy refusal tokens flips VLM outputs to harmful content without fixed targets and improves cross-model transfer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Maximizing entropy at these decision tokens via UJEM-KL flips refusal outcomes while preserving output quality at low-entropy positions, yielding competitive white-box attack success rates and consistently higher transferability across models than prior constrained methods.
What carries the argument
UJEM-KL, an untargeted attack that maximizes entropy (via KL divergence) selectively at high-entropy decision tokens while constraining low-entropy positions.
If this is right
- The method achieves competitive white-box attack success rates on three VLMs across two safety benchmarks.
- Transferability improves consistently compared with prior gradient-based universal image jailbreaks.
- The attack remains effective under representative defenses.
- Limited transferability in earlier work stems primarily from overly constrained optimization objectives rather than inherent model robustness.
Where Pith is reading between the lines
- Safety mechanisms in autoregressive VLMs may be more localized to high-entropy decision points than previously modeled.
- Defenses could be strengthened by explicitly penalizing entropy spikes at candidate refusal tokens during decoding.
- The same selective entropy approach might apply to other autoregressive multimodal or language-only models without requiring target strings.
Load-bearing premise
Refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack.
What would settle it
If an attack that instead maximizes entropy only at low-entropy tokens produces equal or higher transferability, or if UJEM-KL shows no transfer improvement over prior constrained objectives on the same models.
Figures
read the original abstract
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization(UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that limited cross-model transferability in prior gradient-based universal image jailbreaks on VLMs stems primarily from overly constrained optimization objectives (e.g., fixed prefixes or response patterns). A preliminary experiment is cited showing that refusal concentrates at high-entropy tokens during autoregressive decoding while non-refusal tokens already hold substantial top-ranked probability mass; this motivates the proposed UJEM-KL attack, which maximizes entropy at selected decision tokens and stabilizes low-entropy positions. Across three VLMs and two benchmarks the method reportedly achieves competitive white-box success rates, improved transferability, and remains effective under defenses.
Significance. If the entropy-concentration observation is robust and the transfer gains are reproducible with proper controls, the work would meaningfully advance VLM adversarial robustness by showing that relaxing objective constraints via entropy maximization can yield better transfer without target patterns. The lightweight nature of the attack and its reported resilience to defenses would be useful for safety evaluation, but the current lack of quantitative baselines, ablations, and statistical details limits the strength of the contribution.
major comments (2)
- [Abstract] Abstract: the attribution of poor transferability to 'overly constrained optimization objectives' is load-bearing for the paper's motivation and conclusion, yet rests on a preliminary experiment whose details (prompt count, models, quantitative entropy distributions, or probability-mass comparisons) are not reported, leaving the key assumption unverified.
- [Abstract] Abstract / Experimental Results: the claims of 'competitive white-box attack success rates' and 'consistently improves transferability' are presented without any numerical values, baseline comparisons, statistical significance tests, or ablation controls, making it impossible to assess whether the entropy-maximization component is responsible for the reported gains.
minor comments (1)
- [Abstract] The acronym UJEM-KL is introduced in the abstract without expansion or a forward reference to its definition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the attribution of poor transferability to 'overly constrained optimization objectives' is load-bearing for the paper's motivation and conclusion, yet rests on a preliminary experiment whose details (prompt count, models, quantitative entropy distributions, or probability-mass comparisons) are not reported, leaving the key assumption unverified.
Authors: We agree that the abstract should make the preliminary experiment verifiable. The full manuscript (Section 3.1) reports the experiment using 100 prompts sampled from the two safety benchmarks across the three VLMs, with quantitative entropy distributions and top-k probability mass comparisons provided in Figure 2 and the associated text. These show refusal concentrating at higher-entropy positions while non-refusal tokens already occupy substantial mass in the top candidates. We will revise the abstract to concisely include the prompt count, models, and key quantitative observations, with a pointer to Section 3.1 and Figure 2. revision: yes
-
Referee: [Abstract] Abstract / Experimental Results: the claims of 'competitive white-box attack success rates' and 'consistently improves transferability' are presented without any numerical values, baseline comparisons, statistical significance tests, or ablation controls, making it impossible to assess whether the entropy-maximization component is responsible for the reported gains.
Authors: The abstract is intentionally high-level, but the full manuscript (Section 4 and Tables 1-4) provides the requested quantitative details: white-box ASR values competitive with prior methods, transferability gains across models, direct baseline comparisons (including fixed-target gradient attacks), ablation studies isolating the entropy-maximization term, and statistical significance via repeated runs with t-tests. We will update the abstract to report representative numerical results and explicitly reference the ablations and controls so readers can immediately evaluate the contribution of the entropy-maximization component. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper motivates UJEM-KL from a preliminary empirical observation that refusal concentrates at high-entropy tokens and then demonstrates improved transferability via experiments on three VLMs. No equation or result is shown to reduce by construction to a fitted parameter, self-citation, or renamed input; the attribution of limited transferability to constrained objectives follows from comparative attack success rates rather than tautological redefinition. This is a standard empirical workflow with no load-bearing circular step.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A
Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zis- serman, A., Simonyan, K.: Flaming...
work page 2022
-
[2]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. CoRRabs/2502.13923(2025).https: //doi.org/10.48550/ARXIV.2502.13923,http...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.13923 2025
-
[3]
Carlini, N., Nasr, M., Choquette-Choo, C.A., Jagielski, M., Gao, I., Koh, P.W., Ippolito, D., Tramèr, F., Schmidt, L.: Are aligned neural networks adversarially aligned? In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) NeurIPS (2023),http://papers.nips.cc/paper_files/paper/2023/hash/ c1f0b856a35986348ab3414177266f75-Abstrac...
work page 2023
-
[4]
In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S
Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.C.H.: Instructblip: Towards general-purpose vision-language models with instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) NeurIPS (2023),http://papers.nips.cc/paper_files/paper/ 2023/hash/9a6a435e75419a836fe47ab6793623e6-...
work page 2023
-
[5]
In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H
Gan, Z., Chen, Y., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial train- ing for vision-and-language representation learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) NeurIPS (2020),https://proceedings. neurips.cc/paper/2020/hash/49562478de4c54fafd4ec46fdb297de5-Abstract. html
-
[6]
URL https://doi.org/10.1609/aaai.v39i22.34568
Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., Wang, A., Duan, S., Wang, X.: Figstep: Jailbreaking large vision-language models via typographic visual prompts. In: Walsh, T., Shah, J., Kolter, Z. (eds.) AAAI. pp. 23951–23959. AAAI Press (2025).https://doi.org/10.1609/AAAI.V39I22.34568,https://doi.org/10. 1609/aaai.v39i22.34568
-
[7]
Explaining and Harnessing Adversarial Examples
Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: Bengio, Y., LeCun, Y. (eds.) ICLR (2015),http://arxiv.org/abs/ 1412.6572
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[8]
In: Salakhutdinov, R., Kolter, Z., Heller, K.A., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F
Guo, X., Yu, F., Zhang, H., Qin, L., Hu, B.: Cold-attack: Jailbreaking llms with stealthiness and controllability. In: Salakhutdinov, R., Kolter, Z., Heller, K.A., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F. (eds.) ICML. vol. 235, pp. 16974–17002. PMLR / OpenReview.net (2024),https://proceedings.mlr. press/v235/guo24i.html
work page 2024
-
[9]
High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models
He, M., Tian, X., Shen, X., Ni, J., Zou, S., Yang, Z., Zhang, J.: Few tokens mat- ter: Entropy guided attacks on vision-language models. CoRRabs/2512.21815 (2025).https://doi.org/10.48550/ARXIV.2512.21815,https://doi.org/10. 48550/arXiv.2512.21815
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.21815 2025
-
[10]
Huang, X., Hu, W., Zheng, T., Xiu, K., Jia, X., Wang, D., Qin, Z., Ren, K.: Untargeted jailbreak attack. CoRRabs/2510.02999(2025).https://doi.org/ 10.48550/ARXIV.2510.02999,https://doi.org/10.48550/arXiv.2510.02999 Title Suppressed Due to Excessive Length 25
- [11]
-
[12]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., Khabsa, M.: Llama guard: Llm-based input- output safeguard for human-ai conversations. CoRRabs/2312.06674(2023). https://doi.org/10.48550/ARXIV.2312.06674,https://doi.org/10.48550/ arXiv.2312.06674
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.06674 2023
-
[13]
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicineinoneday.In:Oh,A.,Naumann,T.,Globerson,A.,Saenko,K.,Hardt, M., Levine, S. (eds.) NeurIPS (2023),http://papers.nips.cc/paper_files/ paper / 2023 / hash / 5abcdf8ecdcacba028c6662789194572 - Abs...
work page 2023
-
[14]
In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J
Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) ICML. Pro- ceedings of Machine Learning Research, vol. 202, pp. 19730–19742. PMLR (2023), https://proceedings.mlr.press/v2...
work page 2023
-
[15]
CoRR abs/2509.21029(2025).https : / / doi
Lin, R., Paren, A., Yuan, S., Li, M., Torr, P., Bibi, A., Liu, T.: FORCE: trans- ferable visual jailbreaking attacks via feature over-reliance correction. CoRR abs/2509.21029(2025).https : / / doi . org / 10 . 48550 / ARXIV . 2509 . 21029, https://doi.org/10.48550/arXiv.2509.21029
-
[16]
In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) NeurIPS (2023),http : / / papers . nips . cc / paper _ files / paper / 2023 / hash / 6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html
work page 2023
-
[17]
Prompt Injection attack against LLM-integrated Applications
Liu, Y., Deng, G., Li, Y., Wang, K., Zhang, T., Liu, Y., Wang, H., Zheng, Y., Liu, Y.: Prompt injection attack against llm-integrated applications. CoRR abs/2306.05499(2023).https : / / doi . org / 10 . 48550 / ARXIV . 2306 . 05499, https://doi.org/10.48550/arXiv.2306.05499
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05499 2023
-
[18]
Tao, Z., Lin, T., Chen, X., Li, H., Wu, Y ., Li, Y ., Jin, Z., Huang, F., Tao, D., and Zhou, J
Luo, W., Ma, S., Liu, X., Guo, X., Xiao, C.: Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak at- tacks. CoRRabs/2404.03027(2024).https://doi.org/10.48550/ARXIV.2404. 03027,https://doi.org/10.48550/arXiv.2404.03027
- [19]
-
[20]
In: International Conference on Learning Representations (2018)
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations (2018)
work page 2018
-
[21]
Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D.A., Hendrycks, D.: Harmbench: A standardized eval- uation framework for automated red teaming and robust refusal. In: Salakhutdi- nov, R., Kolter, Z., Heller, K.A., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F. (eds.) ICML. Proceedings of ...
work page 2024
-
[22]
Mia, M.J., Amini, M.H.: Jailip: Jailbreaking vision-language models via loss guided image perturbation. CoRRabs/2509.21401(2025).https://doi.org/ 10.48550/ARXIV.2509.21401,https://doi.org/10.48550/arXiv.2509.21401 26 No Author Given
-
[23]
Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks
Oh, S., Jin, Y., Sharma, M., Kim, D., Ma, E., Verma, G., Kumar, S.: Uniguard: To- wardsuniversalsafetyguardrailsforjailbreakattacksonmultimodallargelanguage models. CoRRabs/2411.01703(2024).https://doi.org/10.48550/ARXIV. 2411.01703,https://doi.org/10.48550/arXiv.2411.01703
work page internal anchor Pith review doi:10.48550/arxiv 2024
-
[24]
OpenAI: GPT-4 technical report. CoRRabs/2303.08774(2023).https://doi. org/10.48550/ARXIV.2303.08774,https://doi.org/10.48550/arXiv.2303. 08774
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
-
[25]
OpenAI: Gpt-4o system card. CoRRabs/2410.21276(2024).https://doi.org/ 10.48550/ARXIV.2410.21276,https://doi.org/10.48550/arXiv.2410.21276
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.21276 2024
-
[26]
In: Wooldridge, M.J., Dy, J.G., Natarajan, S
Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P.: Visual ad- versarial examples jailbreak aligned large language models. In: Wooldridge, M.J., Dy, J.G., Natarajan, S. (eds.) AAAI. pp. 21527–21536. AAAI Press (2024). https://doi.org/10.1609/AAAI.V38I19.30150,https://doi.org/10.1609/ aaai.v38i19.30150
-
[27]
Rando, J., Korevaar, H., Brinkman, E., Evtimov, I., Tramèr, F.: Gradient-based jailbreak images for multimodal fusion models. CoRRabs/2410.03489(2024). https://doi.org/10.48550/ARXIV.2410.03489,https://doi.org/10.48550/ arXiv.2410.03489
-
[28]
Ren, J., Dras, M., Naseem, U.: Seeing the threat: Vulnerabilities in vision-language models to adversarial attack. CoRRabs/2505.21967(2025).https://doi.org/ 10.48550/ARXIV.2505.21967,https://doi.org/10.48550/arXiv.2505.21967
-
[29]
Schaeffer, R., Valentine, D., Bailey, L., Chua, J., Eyzaguirre, C., Durante, Z., Benton, J., Miranda, B., Sleight, H., Wang, T.T., Hughes, J., Agrawal, R., Sharma, M., Emmons, S., Koyejo, S., Perez, E.: Failures to find transferable im- age jailbreaks between vision-language models. In: ICLR. OpenReview.net (2025), https://openreview.net/forum?id=wvFnqVVUhN
work page 2025
-
[30]
Tian, J.: Vision-language models in teaching and learning: A systematic lit- erature review. Education Sciences16(1) (2026).https://doi.org/10.3390/ educsci16010123,https://www.mdpi.com/2227-7102/16/1/123
work page 2026
-
[31]
CoRR abs/2508.01741(2025).https : / / doi
Wang, R., Wang, X., Yao, Y., Tong, X., Ma, X.: Simulated ensemble at- tack: Transferring jailbreaks across fine-tuned vision-language models. CoRR abs/2508.01741(2025).https : / / doi . org / 10 . 48550 / ARXIV . 2508 . 01741, https://doi.org/10.48550/arXiv.2508.01741
-
[32]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., Wang, Z., Chen, Z., Zhang, H., Yang, G., Wang, H., Wei, Q., Yin, J., Li, W., Cui, E., Chen, G., Ding, Z., Tian, C., Wu, Z., Xie, J., Li, Z., Yang, B., Duan, Y., Wang, X., Hou, Z., Hao, H., Zhang, T., Li, S., Zhao, X., Duan, H., Deng, N., Fu, B., He, Y., Wang, Y., He,...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.18265 2025
-
[33]
Wang, Z., Yang, H., Feng, Y., Sun, P., Guo, H., Zhang, Z., Ren, K.: Towards trans- ferable targeted adversarial examples. In: CVPR. pp. 20534–20543. IEEE (2023). https://doi.org/10.1109/CVPR52729.2023.01967,https://doi.org/10.1109/ CVPR52729.2023.01967 Title Suppressed Due to Excessive Length 27
-
[34]
Waseda, F., Nishikawa, S., Le, T., Nguyen, H.H., Echizen, I.: Closer look at the transferability of adversarial examples: How they fool different models dif- ferently. In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023, Waikoloa, HI, USA, January 2-7, 2023. pp. 1360–1368. IEEE (2023). https://doi.org/10.1109/WACV56688.2023.00141,...
-
[35]
In: Ku, L., Mar- tins, A., Srikumar, V
Xu, Z., Jiang, F., Niu, L., Jia, J., Lin, B.Y., Poovendran, R.: Safedecoding: Defending against jailbreak attacks via safety-aware decoding. In: Ku, L., Mar- tins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024. pp. 558...
-
[36]
What’s in the image? a deep-dive into the vision of vision language models
Xu, Z., Bai, Y., Zhang, Y., Li, Z., Xia, F., Wong, K.K., Wang, J., Zhao, H.: Drivegpt4-v2: Harnessing large language model capabilities for enhanced closed- loop autonomous driving. In: IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. pp. 17261– 17270. Computer Vision Foundation / IEEE (202...
-
[37]
In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T
Yang, J., Zhang, Z., Cui, S., Wang, H., Huang, M.: Guiding not forcing: En- hancing the transferability of jailbreaking attacks on llms via removing super- fluous constraints. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 202...
work page 2025
-
[38]
Ying, Z., Liu, A., Liang, S., Huang, L., Guo, J., Zhou, W., Liu, X., Tao, D.: Safebench: A safety evaluation framework for multimodal large language models. Int. J. Comput. Vis.134(1), 18 (2026).https://doi.org/10.1007/S11263-025- 02613-1,https://doi.org/10.1007/s11263-025-02613-1
-
[39]
In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V
Yoon, S., Jeung, W., No, A.: R-TOFU: unlearning in large reasoning models. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025. pp. 5239–5258. Association for Computational Linguistics (2025).https://doi....
-
[40]
Zhang, Z., Sun, Z., Zhang, Z., Guo, J., He, X.: Fc-attack: Jailbreaking large vision- language models via auto-generated flowcharts. CoRRabs/2502.21059(2025). https://doi.org/10.48550/ARXIV.2502.21059,https://doi.org/10.48550/ arXiv.2502.21059
-
[41]
In: Tan, J., Toussaint, M., Darvish, K
Zitkovich,B.,Yu,T.,Xu,S.,Xu,P.,Xiao,T.,Xia,F.,Wu,J.,Wohlhart,P.,Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H.T., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.E., Leal, I., Kuang, Y., Kalashnikov, D.,...
work page 2023
-
[42]
PMLR (2023),https://proceedings.mlr.press/v229/zitkovich23a.html
work page 2023
-
[43]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A., Wang, Z., Kolter, J.Z., Fredrikson, M.: Universal and transferable ad- versarial attacks on aligned language models. CoRRabs/2307.15043(2023). https://doi.org/10.48550/ARXIV.2307.15043,https://doi.org/10.48550/ arXiv.2307.15043
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.15043 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.