New Wide-Net-Casting Jailbreak Attacks Risk Large Models

Haoxuan Qu; Hossein Rahmani; Jun Liu; Qiuchi Xiang

arxiv: 2605.17128 · v1 · pith:4NHLCNU2new · submitted 2026-05-16 · 💻 cs.CR · cs.AI


Pith reviewed 2026-05-20 14:42 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords jailbreak attacks · large language models · safety risks · wide-net-casting · multi-model attacks · adversarial attacks


The pith

Adversaries querying groups of large models can achieve jailbreak success rates up to 100% with a tailored attack method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper identifies a new attack scenario called wide-net-casting, in which an adversary queries multiple large models instead of focusing on one. The authors show that this approach uncovers safety risks that single-model defenses overlook. They introduce a novel jailbreak method designed specifically for this multi-model setting. In some experiments, the method reaches a 100% success rate against large models without additional safeguards. This suggests that safety research needs to address scenarios where attackers can cast a wide net across several models.

Core claim

In the wide-net-casting scenario where an adversary can query a group of large models, a tailored jailbreak method can elicit harmful outputs with success rates reaching 100% in some experiments on models without additional safeguards.

What carries the argument

A novel jailbreak method tailored to the wide-net-casting scenario, which leverages querying multiple models to increase the chance of bypassing individual safeguards.
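
To make the "wider net" intuition concrete, here is a minimal sketch of how the chance of at least one success grows with group size. It assumes each model is jailbroken independently with a fixed probability, which is an illustrative assumption and not something the paper states. The function name `any_of_group_success` and the example rates are hypothetical.

```python
from math import prod

def any_of_group_success(per_model_rates):
    """P(at least one model is jailbroken), ASSUMING independent outcomes."""
    return 1.0 - prod(1.0 - p for p in per_model_rates)

# Hypothetical per-model success rates, for illustration only.
rates = [0.5, 0.5, 0.5]
print(any_of_group_success(rates[:1]))  # one model: 0.5
print(any_of_group_success(rates))      # three models: 0.875
```

If failures across real models are positively correlated (for example, because of similar safety training), the group rate would be lower than this independent-case figure.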

If this is right

Evaluations of large model safety should incorporate multi-model attack scenarios.
Existing defenses for single models may not suffice against wide-net-casting attacks.
The high success rates indicate that wide-net-casting represents a distinct high-risk scenario.
Future defense research should prioritize methods effective against group querying.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Companies deploying multiple AI models might need coordinated safety mechanisms across them.
Attackers could use this to target ensembles or services offering multiple models.
Testing the method on models with safeguards could quantify how much protection is needed.

Load-bearing premise

That querying a group of large models is a practical and realistic attack scenario for adversaries, distinct from single-model attacks.

What would settle it

An experiment where the tailored method is tested against a group of large models each equipped with standard safeguards, showing significantly lower success rates.

Figures

Figures reproduced from arXiv: 2605.17128 by Haoxuan Qu, Hossein Rahmani, Jun Liu, Qiuchi Xiang.

**Figure 1.** Illustration of the single-model jailbreak scenario and the wide-net-casting jailbreak scenario. In the wide-net-casting scenario, unlike the single-model case, successfully jailbreaking any one large model in the group is sufficient for the adversary to obtain a desired harmful response.

**Figure 2.** Illustration of straightforward adaptation of an existing model-based jailbreak method to the wide-net-casting scenario. A jailbreak attack is deemed successful in the wide-net-casting scenario as long as any target large model is successfully jailbroken.

**Figure 3.** Visualization of word clouds showing the keywords of harmful intents that each optimized generator specializes in under the wide-net-casting scenario on AdvBench. The word clouds reflect the keywords and their frequencies (larger words indicate higher frequency) among the harmful intents for which each generator is most proficient.

**Figure 4.** Visualization for the feature space of the sentence-level representation of harmful intent that each generator specializes in.

**Figure 5.** Qualitative results comparing our method with the straightforward adaptation (Baseline) and two naive strategies of jailbreaking LLMs in (a) and MLLMs in (b), where a green model name indicates a successful jailbreak and a red model name indicates failure. As shown, our method consistently achieves successful jailbreaking across all intents, while the naive strategies succeed only on a few intents.

Abstract

Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the wide-net-casting scenario, where an adversary can query a group of large models instead of a single one to elicit harmful outputs. Our analysis reveals substantial yet previously overlooked safety risks under this scenario. As a key part of our analysis, we further develop a novel jailbreak method tailored to the wide-net-casting scenario. With this tailored method, the jailbreak success rate can even reach 100% in some experiments when targeting the large models without additional safeguards, exposing wide-net-casting as a distinct, high-risk scenario that warrants attention in future evaluation and defense research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies a previously unexplored 'wide-net-casting' jailbreak scenario in which an adversary queries a group of large language models simultaneously (rather than a single model) to elicit harmful outputs. It develops a novel jailbreak method tailored to this multi-model setting and reports that the method achieves jailbreak success rates reaching 100% in some experiments against large models lacking additional safeguards, arguing that this scenario exposes substantial and previously overlooked safety risks that merit attention in future evaluation and defense work.

Significance. If the empirical results hold after rigorous validation, including proper baselines and controls, the work could usefully expand the threat model for LLM safety beyond single-model interactions and encourage broader multi-model testing in red-teaming protocols. The emphasis on a practical attack vector that leverages access to multiple models simultaneously is a potentially valuable contribution to the field.

major comments (2)

[Abstract] The central claim that the tailored method reaches 100% success 'in some experiments' is presented without any description of the experimental setup, including the specific models evaluated, number of queries or trials, success criteria, presence or absence of safeguards, or statistical controls. This absence leaves the primary empirical result without visible supporting evidence and directly undermines evaluation of the paper's core contribution.
[Method or Evaluation section] (inferred from abstract claims) No baseline results are reported for the same tailored method applied to individual models rather than the group. Without this comparison, it remains unclear whether the reported high success rates are attributable to the wide-net-casting scenario itself or would be achievable by sequential single-model queries, which is required to substantiate the claim that wide-net-casting constitutes a distinct high-risk setting.

minor comments (1)

[Introduction] The operational definition of 'wide-net-casting' (e.g., whether queries are simultaneous, how responses are aggregated, or the exact adversary model) should be stated explicitly and early to avoid ambiguity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.


Referee: [Abstract] The central claim that the tailored method reaches 100% success 'in some experiments' is presented without any description of the experimental setup, including the specific models evaluated, number of queries or trials, success criteria, presence or absence of safeguards, or statistical controls. This absence leaves the primary empirical result without visible supporting evidence and directly undermines evaluation of the paper's core contribution.

Authors: We agree with the referee that the abstract would benefit from additional context to support the key empirical claim. The full experimental details are described in the Method and Evaluation sections of the manuscript. To improve clarity and address this concern directly, we will revise the abstract to briefly summarize the experimental conditions under which the 100% success rate was achieved.

Revision: yes

Referee: [Method or Evaluation section] (inferred from abstract claims) No baseline results are reported for the same tailored method applied to individual models rather than the group. Without this comparison, it remains unclear whether the reported high success rates are attributable to the wide-net-casting scenario itself or would be achievable by sequential single-model queries, which is required to substantiate the claim that wide-net-casting constitutes a distinct high-risk setting.

Authors: This is a valid point. The tailored jailbreak method is designed to exploit simultaneous access to multiple models in ways that are not possible with sequential single-model queries. However, to demonstrate more rigorously that the wide-net-casting scenario presents unique risks, we will include additional baseline experiments applying the method (or suitable adaptations) to individual models and compare the success rates. This will be added to the Evaluation section.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical analysis with no derivations or self-referential fits


The paper presents an empirical study identifying a wide-net-casting jailbreak scenario and a tailored attack method, reporting observed success rates up to 100% in experiments. No equations, derivations, predictions from fitted parameters, or first-principles results are present in the provided text. The central claims rest on experimental outcomes rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain. The analysis is self-contained against external benchmarks of attack success, with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into parameters or assumptions; the work relies on standard domain assumptions in AI safety about attack practicality and model behavior.

axioms (1)

domain assumption: Querying multiple large models constitutes a distinct and practical adversarial scenario.
Invoked to frame the new risk and justify the tailored method.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We design a novel jailbreak method tailored to the wide-net-casting scenario. The method pairs each target large model with a dedicated 'jailbreak expert'... $\eta_t^* = (\eta_t^{1,*}, \dots, \eta_t^{M,*})$, $\eta_t^{i,*} = \frac{\exp(-\ell_t^i/\beta_t)}{\sum_{m=1}^{M} \exp(-\ell_t^m/\beta_t)}$
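
The weighting rule quoted above is a softmax over negative per-generator losses with temperature β_t. The following is a minimal sketch under that reading, using plain Python floats. The function name `exploitation_weights` and the sample losses are hypothetical, and this is not the authors' implementation.

```python
import math

def exploitation_weights(losses, beta):
    """Softmax over negative losses: eta_i = exp(-l_i/beta) / sum_m exp(-l_m/beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the minimum loss before exponentiating for numerical stability;
    # this shift cancels in the ratio and leaves the weights unchanged.
    lo = min(losses)
    exps = [math.exp(-(l - lo) / beta) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical intermediate losses for M = 3 generators (illustrative only).
losses = [0.2, 1.0, 3.0]
for beta in (5.0, 1.0, 0.1):
    print(beta, [round(w, 3) for w in exploitation_weights(losses, beta)])
```

As β shrinks, the weights concentrate on the generator with the smallest loss; as β grows, they approach a uniform split across generators.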
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

W-ASR $= \frac{1}{N} \sum_{n=1}^{N} \bigvee_{m=1}^{M} s_n^m$ (logical OR across models in group)
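
Read as "an intent counts as jailbroken if any model in the group is jailbroken", the metric above can be sketched as follows. The sketch assumes a boolean success matrix `success[n][m]` for intent n and model m. The function names and the toy matrix are hypothetical and are not results from the paper.

```python
def wide_net_asr(success):
    """Fraction of intents for which at least one model in the group was jailbroken.

    success[n][m] is True if intent n succeeded against model m.
    """
    if not success:
        raise ValueError("need at least one intent")
    return sum(any(row) for row in success) / len(success)

def single_model_asr(success, m):
    """Per-model attack success rate for model index m, for comparison."""
    return sum(bool(row[m]) for row in success) / len(success)

# Toy data: 4 intents x 3 models (illustrative only).
s = [
    [True,  False, False],
    [False, False, True],
    [False, False, False],
    [False, True,  False],
]
print(wide_net_asr(s))                             # 0.75
print([single_model_asr(s, m) for m in range(3)])  # [0.25, 0.25, 0.25]
```

In this toy case each model alone is broken on a quarter of the intents, yet the group-level rate is three quarters because the successes fall on different intents.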

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 13 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

[2]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
[3]

Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462,
[4]

Con instruction: Universal jailbreaking of multimodal large language models via non-textual modalities

Geng, J., Tran, T. T., Nakov, P., and Gurevych, I. Con instruction: Universal jailbreaking of multimodal large language models via non-textual modalities. arXiv preprint arXiv:2506.00548,
[5]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793,

[6]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

[7]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674,
[8]

From clip to dino: Visual encoders shout in multi-modal large language models

Jiang, D., Liu, Y., Liu, S., Zhao, J., Zhang, H., Gao, Z., Zhang, X., Li, J., and Xiong, H. From clip to dino: Visual encoders shout in multi-modal large language models. arXiv preprint arXiv:2310.08825,
[9]

Robustkv: Defending large language models against jailbreak attacks via kv eviction

Jiang, T., Wang, Z., Liang, J., Li, C., Wang, Y., and Wang, T. Robustkv: Defending large language models against jailbreak attacks via kv eviction. arXiv preprint arXiv:2410.19937,
[10]

Diffgraph: An automated agent-driven model merging framework for in-the-wild text-to-image generation

Li, Z., Rahmani, H., Zhang, J., Xue, Y., Mirmehdi, M., Kuen, J., Gu, J., and Liu, J. Diffgraph: An automated agent-driven model merging framework for in-the-wild text-to-image generation. arXiv preprint arXiv:2603.20470,
[11]

Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms

Liao, Z. and Sun, H. Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms. arXiv preprint arXiv:2404.07921,
[12]

Towards understanding jailbreak attacks in llms: A representation space analysis

Lin, Y., He, P., Xu, H., Xing, Y., Yamada, M., Liu, H., and Tang, J. Towards understanding jailbreak attacks in llms: A representation space analysis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7067–7085,
[13]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023a.

Liu, X., Xu, N., Chen, M., and Xiao, C. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023b.

Liu, X., Zhu, Y., Gu, J., Lan, Y., Yang, C., and...
[14]

SGDR: Stochastic Gradient Descent with Warm Restarts

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983,
[15]

Visual contextual attack: Jailbreaking mllms with image-driven context injection

Miao, Z., Ding, Y., Li, L., and Shao, J. Visual contextual attack: Jailbreaking mllms with image-driven context injection. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,
[16]

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744,
[17]

Advprompter: Fast adaptive adversarial prompting for llms

Paulus, A., Zharmagambetov, A., Guo, C., Amos, B., and Tian, Y. Advprompter: Fast adaptive adversarial prompting for llms. arXiv preprint arXiv:2404.16873, 2024.
[18]

Robey, A., Wong, E., Hassani, H., and Pappas, G. J. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684,
[19]

Plug and Pray: Exploiting off-the-shelf components of multi-modal models

Shayegani, E., Dong, Y., and Abu-Ghazaleh, N. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. arXiv preprint arXiv:2307.14539,
[20]

A modified particle swarm optimizer

Shi, Y. and Eberhart, R. A modified particle swarm optimizer. In 1998 IEEE international conference on evolutionary computation proceedings. IEEE world congress on computational intelligence (Cat. No. 98TH8360), pp. 69–73. IEEE,
[21]

Empirical study of particle swarm optimization

Shi, Y. and Eberhart, R. C. Empirical study of particle swarm optimization. In Proceedings of the 1999 congress on evolutionary computation-CEC99 (Cat. No. 99TH8406), volume 3, pp. 1945–1950. IEEE,
[22]

Iterative self-tuning llms for enhanced jailbreaking capabilities

Sun, C.-E., Liu, X., Yang, W., Weng, T.-W., Cheng, H., San, A., Galley, M., and Gao, J. Iterative self-tuning llms for enhanced jailbreaking capabilities. arXiv preprint arXiv:2410.18469,

[23]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,
[24]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,

[25]

Annealed winner-takes-all for motion forecasting

Xu, Y., Letzelter, V., Chen, M., Zablocki, É., and Cord, M. Annealed winner-takes-all for motion forecasting. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 1264–1270. IEEE,
[26]

Virtual context enhancing jailbreak attacks with special token injection

Zhou, Y., Lu, L., Sun, R., Zhou, P., and Sun, L. Virtual context enhancing jailbreak attacks with special token injection. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 11843–11857, Miami, Florida, USA, Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.692.
[27]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,
[28]

Safety fine-tuning at (almost) no cost: A baseline for vision large language models

Zong, Y., Bohdal, O., Yu, T., Yang, Y., and Hospedales, T. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207,
[29]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043,
[30]

As shown in Tab

+ PixArt-α (Chen et al., 2024)”. As shown in Tab. 8, both the attack success rate and response toxicity consistently increase compared to single-model results across all groups, suggesting that, regardless of variation in target group size, applying existing jailbreak attack methods to the wide-net-casting scenario can consistently amplify safety risks. T...

[31]

22.0% / 0.158 13.3% / 0.089 19.2% / 0.134 12.2% / 0.08237.5% / 0.311 + IMMUNE (Ghosal et al., 2025)MLAI (Hao et al.,

[32]

As shown in Tab

and Qwen-VL-Max (Bai et al., 2023). As shown in Tab. 13, using different large models to select the response yields almost the same W-Toxicity Scores, indicating highly consistent response selections and demonstrating the robustness of the W-Toxicity Score to the choice of response-selection LLMs. Table 13.Evaluation of jailbreaking MLLMs with different r...

[33]

Beyond these open-source models, numerous closed-source commercial models also exist (e.g., Qwen-VL-Max (Bai et al., 2023), Gemini-1.5-Pro (Team et al., 2024), and GPT-4o (OpenAI, 2024)) and have been widely used in prior jailbreak studies for evaluation (Hao et al., 2025; Li et al., 2024; Xie et al., 2024; Yang et al., 2025). 14 New Wide-Net-Casting Jail...

[34]

Dataset Attack W ASR / W-Toxicity Score OriginalSafety Alignment OriginalSafety Alignment+ VLGuard (Zong et al.,

AdvBench Baseline (ReMiss (Xie et al., 2024)) 50.2% / 0.469 33.1% / 0.293 28.5% / 0.257 Naive Strategy 1 55.5% / 0.513 37.3% / 0.325 31.1% / 0.279Naive Strategy 2 56.2% / 0.520 38.0% / 0.336 31.8% / 0.284Ours73.3% / 0.672 51.9% / 0.462 40.6% / 0.363 Table 15.Evaluation of jailbreaking MLLMs within the same model family using different methods tailored to ...

[35]

+ PixArt-α(Chen et al., 2024)) 89.4% / 0.858 32.3% / 0.284 31.5% / 0.273 28.1% / 0.242 Naive Strategy 1 92.4% / 0.886 34.7% / 0.317 33.8% / 0.295 32.0% / 0.283Naive Strategy 2 93.6% / 0.897 35.5% / 0.318 34.4% / 0.307 32.7% / 0.292Ours99.6% / 0.932 46.2% / 0.412 44.2% / 0.406 40.6% / 0.365 MM-SafetyBench Baseline (MLAI (Hao et al.,

[36]

+ PixArt-α(Chen et al., 2024)) 89.9% / 0.865 34.3% / 0.303 33.1% / 0.294 29.4% / 0.257 Naive Strategy 1 93.0% / 0.892 37.1% / 0.335 35.3% / 0.318 31.5% / 0.283Naive Strategy 2 93.7% / 0.903 37.8% / 0.343 36.6% / 0.321 32.2% / 0.295Ours100% / 0.934 47.6% / 0.435 46.7% / 0.423 42.3% / 0.381 To evaluate jailbreak methods on large closed-source models, a comm...

[37]

Additional Ablation Studies In this section, we conduct extensive ablation studies on AdvBench, focusing on jailbreaking MLLMs from different families

+ PixArt-α(Chen et al., 2024)) 43.6% / 0.392 52.1% / 0.489 38.9% / 0.341 47.7% / 0.43369.5% / 0.653Ours 51.2% / 0.481 61.2% / 0.580 46.1% / 0.424 55.2% / 0.52386.8% / 0.799 C. Additional Ablation Studies In this section, we conduct extensive ablation studies on AdvBench, focusing on jailbreaking MLLMs from different families. We perform evaluation on two ...

[38]

+ PixArt-α(Chen et al., 2024)) 93.3% / 0.867 37.5% / 0.311 Naive Strategy 1 95.5% / 0.883 40.6% / 0.355 Naive Strategy 2 95.8% / 0.898 41.1% / 0.363 Naive Strategy 3 95.6% / 0.887 40.9% / 0.358 Ours100% / 0.940 50.8% / 0.473 Impact of our output selection strategy during inference.Our method produces a set of responses for each intent when simultaneously ...

[39]

Specifically, we pass each intent in AdvBench through all optimized generators and, for each successful jailbroken intent, compute the keyword frequencies

Ours (joint-training from scratch) 100% / 0.938 50.7% / 0.470 Ours (joint-training from independently-trained generators)100% / 0.940 50.8% / 0.473 Visualization of specialization analysis.To illustrate the specialization of the optimized generators trained by our method, we visualize word clouds of keywords for harmful intents that each generator special...

[40]

Intuitively, to this end, we need to shift the weight value from generators with larger losses to those with smaller losses

We aim to find an $\eta_t^*$ such that the generator with a smaller loss $\ell_t^m$ can be assigned a larger loss weight $\eta_t^m$ in each training step $t$ as much as possible (maximizing exploitation). Intuitively, to this end, we need to shift the weight value from generators with larger losses to those with smaller losses. Thus, we can formalize this process mathematical...
[41]

Combining this result with Eq

(23) This shows that $U'(\tau) \le 0$, and the inequality is strict, $U'(\tau) < 0$, whenever the losses $\ell_t$ are not all identical. Combining this result with Eq. 22 yields: $\frac{dH}{d\tau} = \tau\, U'(\tau) < 0$, $\tau > 0$. (24) Step (5). From Eq. 24, since $\tau = 1/\beta_t$, we apply the chain rule to obtain: $\frac{dH}{d\beta_t} = \frac{dH}{d\tau} \cdot \frac{d\tau}{d\beta_t} = -\left(\frac{1}{\beta_t^2}\right)$...
[42]

+ PixArt-α (Chen et al., 2024)”. In Sec. 3 of the main paper, to further uncover potential risks in the wide-net-casting scenario, our investigation includes model-based jailbreaks for MLLMs. Since model-based jailbreak approaches for MLLMs remain largely underexplored, we construct a model-based jailbreak for MLLMs by adopting pipelines of LLM-oriented m...

[43]

Notably, MLAI jailbreaks and bypasses safety-aligned MLLMs by generating adversarial perturbations on input images

+ PixArt-α(Chen et al., 2024)”. Notably, MLAI jailbreaks and bypasses safety-aligned MLLMs by generating adversarial perturbations on input images. Correspondingly, inspired by recent work (Wu et al., 2025), we optimize PixArt-α (Chen et al.,

[44]

+ PixArt-α (Chen et al., 2024)” first uses MLAI to overgenerate adversarial images and then selects the top K= 20 high-quality adversarial samples for each harmful intent to supervise PixArt-α in generating adversarial images for jailbreaking large models. H. Additional Details w.r.t Experimental Setup in Sec. 3 of the Main Paper Datasets.The AdvBench dat...

[45]

Metrics.In the main paper, we use Attack Success Rate (ASR) as one of our primary evaluation metrics to quantify the effectiveness of jailbreak attacks

serves as a benchmark for evaluating the safety of MLLMs, which comprises 5,040 text-image pairs and covers 13 typical prohibited scenarios specified in OpenAI and Meta’s usage strategy (Achiam et al., 2023; Inan et al., 2023). Metrics.In the main paper, we use Attack Success Rate (ASR) as one of our primary evaluation metrics to quantify the effectivenes...

[46]

For each intent-response pair, Beaver-Dam-7B can output a rating of the jailbreak quality

as an automatic LLM judge to determine whether a jailbreak is successful. For each intent-response pair, Beaver-Dam-7B can output a rating of the jailbreak quality. Following common practice (Hao et al., 2025), we classify the response as a successful jailbreak if the jailbreak quality rating exceeds 0.5. Therefore, for each intent, responses from all gen...


[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Gehman, S., Gururangan, S., Sap, M., Choi, Y ., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[4] [4]

T., Nakov, P., and Gurevych, I

9 New Wide-Net-Casting Jailbreak Attacks Risk Large Models Geng, J., Tran, T. T., Nakov, P., and Gurevych, I. Con in- struction: Universal jailbreaking of multimodal large lan- guage models via non-textual modalities. arXiv preprint arXiv:2506.00548,

work page arXiv

[5] [5]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y ., Tontchev, M., Hu, Q., Fuller, B., Testug- gine, D., et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

From clip to dino: Visual encoders shout in multi-modal large language models

Jiang, D., Liu, Y., Liu, S., Zhao, J., Zhang, H., Gao, Z., Zhang, X., Li, J., and Xiong, H. From clip to dino: Visual encoders shout in multi-modal large language models. arXiv preprint arXiv:2310.08825,

[9]

Robustkv: Defending large language models against jailbreak attacks via kv eviction

Jiang, T., Wang, Z., Liang, J., Li, C., Wang, Y., and Wang, T. Robustkv: Defending large language models against jailbreak attacks via kv eviction. arXiv preprint arXiv:2410.19937,

[10]

Diffgraph: An automated agent-driven model merging framework for in-the-wild text-to-image generation

Li, Z., Rahmani, H., Zhang, J., Xue, Y., Mirmehdi, M., Kuen, J., Gu, J., and Liu, J. Diffgraph: An automated agent-driven model merging framework for in-the-wild text-to-image generation. arXiv preprint arXiv:2603.20470,

[11]

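Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms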

Liao, Z. and Sun, H. Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms. arXiv preprint arXiv:2404.07921,

[12]

Towards understanding jailbreak attacks in llms: A representation space analysis

Lin, Y., He, P., Xu, H., Xing, Y., Yamada, M., Liu, H., and Tang, J. Towards understanding jailbreak attacks in llms: A representation space analysis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7067–7085,

[13]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023a.

Liu, X., Xu, N., Chen, M., and Xiao, C. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023b.

Liu, X., Zhu, Y., Gu, J., Lan, Y., Yang, C., and...

[14]

SGDR: Stochastic Gradient Descent with Warm Restarts

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983,

[15]

Visual contextual attack: Jailbreaking mllms with image-driven context injection

Miao, Z., Ding, Y., Li, L., and Shao, J. Visual contextual attack: Jailbreaking mllms with image-driven context injection. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,

[16]

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744,

[17]

Advprompter: Fast adaptive adversarial prompting for llms

Paulus, A., Zharmagambetov, A., Guo, C., Amos, B., and Tian, Y. Advprompter: Fast adaptive adversarial prompting for llms. arXiv preprint arXiv:2404.16873,

[18]

Robey, A., Wong, E., Hassani, H., and Pappas, G. J. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684,

[19]

Plug and Pray: Exploiting off-the-shelf components of multi-modal models

Shayegani, E., Dong, Y., and Abu-Ghazaleh, N. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. arXiv preprint arXiv:2307.14539,

[20]

A modified particle swarm optimizer

Shi, Y. and Eberhart, R. A modified particle swarm optimizer. In 1998 IEEE international conference on evolutionary computation proceedings. IEEE world congress on computational intelligence (Cat. No. 98TH8360), pp. 69–73. IEEE,

[21]

Empirical study of particle swarm optimization

Shi, Y. and Eberhart, R. C. Empirical study of particle swarm optimization. In Proceedings of the 1999 congress on evolutionary computation-CEC99 (Cat. No. 99TH8406), volume 3, pp. 1945–1950. IEEE,

[22]

Iterative self-tuning llms for enhanced jailbreaking capabilities

Sun, C.-E., Liu, X., Yang, W., Weng, T.-W., Cheng, H., San, A., Galley, M., and Gao, J. Iterative self-tuning llms for enhanced jailbreaking capabilities. arXiv preprint arXiv:2410.18469,

[23]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

[24]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,

[25]

Annealed winner-takes-all for motion forecasting

Xu, Y., Letzelter, V., Chen, M., Zablocki, É., and Cord, M. Annealed winner-takes-all for motion forecasting. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 1264–1270. IEEE,

[26]

Virtual context enhancing jailbreak attacks with special token injection

Zhou, Y., Lu, L., Sun, R., Zhou, P., and Sun, L. Virtual context enhancing jailbreak attacks with special token injection. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 11843–11857, Miami, Florida, USA, Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.692.

[27]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,

[28]

Safety fine-tuning at (almost) no cost: A baseline for vision large language models

Zong, Y., Bohdal, O., Yu, T., Yang, Y., and Hospedales, T. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207,

[29]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043,

[30]

+ PixArt-α (Chen et al., 2024)”. As shown in Tab. 8, both the attack success rate and response toxicity consistently increase compared to single-model results across all groups, suggesting that, regardless of variation in target group size, applying existing jailbreak attack methods to the wide-net-casting scenario can consistently amplify safety risks. T...

[31]

22.0% / 0.158 13.3% / 0.089 19.2% / 0.134 12.2% / 0.082 37.5% / 0.311 + IMMUNE (Ghosal et al., 2025) MLAI (Hao et al.,

[32]

and Qwen-VL-Max (Bai et al., 2023). As shown in Tab. 13, using different large models to select the response yields almost the same W-Toxicity Scores, indicating highly consistent response selections and demonstrating the robustness of the W-Toxicity Score to the choice of response-selection LLMs.

Table 13. Evaluation of jailbreaking MLLMs with different r...

[33]

Beyond these open-source models, numerous closed-source commercial models also exist (e.g., Qwen-VL-Max (Bai et al., 2023), Gemini-1.5-Pro (Team et al., 2024), and GPT-4o (OpenAI, 2024)) and have been widely used in prior jailbreak studies for evaluation (Hao et al., 2025; Li et al., 2024; Xie et al., 2024; Yang et al., 2025). ...

[34]

Dataset Attack W ASR / W-Toxicity Score Original Safety Alignment Original Safety Alignment + VLGuard (Zong et al.,

AdvBench
Baseline (ReMiss (Xie et al., 2024)) | 50.2% / 0.469 | 33.1% / 0.293 | 28.5% / 0.257
Naive Strategy 1 | 55.5% / 0.513 | 37.3% / 0.325 | 31.1% / 0.279
Naive Strategy 2 | 56.2% / 0.520 | 38.0% / 0.336 | 31.8% / 0.284
Ours | 73.3% / 0.672 | 51.9% / 0.462 | 40.6% / 0.363

Table 15. Evaluation of jailbreaking MLLMs within the same model family using different methods tailored to ...

[35]

+ PixArt-α (Chen et al., 2024)) | 89.4% / 0.858 | 32.3% / 0.284 | 31.5% / 0.273 | 28.1% / 0.242
Naive Strategy 1 | 92.4% / 0.886 | 34.7% / 0.317 | 33.8% / 0.295 | 32.0% / 0.283
Naive Strategy 2 | 93.6% / 0.897 | 35.5% / 0.318 | 34.4% / 0.307 | 32.7% / 0.292
Ours | 99.6% / 0.932 | 46.2% / 0.412 | 44.2% / 0.406 | 40.6% / 0.365
MM-SafetyBench Baseline (MLAI (Hao et al.,

[36]

+ PixArt-α (Chen et al., 2024)) | 89.9% / 0.865 | 34.3% / 0.303 | 33.1% / 0.294 | 29.4% / 0.257
Naive Strategy 1 | 93.0% / 0.892 | 37.1% / 0.335 | 35.3% / 0.318 | 31.5% / 0.283
Naive Strategy 2 | 93.7% / 0.903 | 37.8% / 0.343 | 36.6% / 0.321 | 32.2% / 0.295
Ours | 100% / 0.934 | 47.6% / 0.435 | 46.7% / 0.423 | 42.3% / 0.381

To evaluate jailbreak methods on large closed-source models, a comm...

[37]

+ PixArt-α (Chen et al., 2024)) | 43.6% / 0.392 | 52.1% / 0.489 | 38.9% / 0.341 | 47.7% / 0.433 | 69.5% / 0.653
Ours | 51.2% / 0.481 | 61.2% / 0.580 | 46.1% / 0.424 | 55.2% / 0.523 | 86.8% / 0.799

C. Additional Ablation Studies

In this section, we conduct extensive ablation studies on AdvBench, focusing on jailbreaking MLLMs from different families. We perform evaluation on two ...

[38]

+ PixArt-α (Chen et al., 2024)) | 93.3% / 0.867 | 37.5% / 0.311
Naive Strategy 1 | 95.5% / 0.883 | 40.6% / 0.355
Naive Strategy 2 | 95.8% / 0.898 | 41.1% / 0.363
Naive Strategy 3 | 95.6% / 0.887 | 40.9% / 0.358
Ours | 100% / 0.940 | 50.8% / 0.473

Impact of our output selection strategy during inference. Our method produces a set of responses for each intent when simultaneously ...

[39]

Ours (joint-training from scratch) | 100% / 0.938 | 50.7% / 0.470
Ours (joint-training from independently-trained generators) | 100% / 0.940 | 50.8% / 0.473

Visualization of specialization analysis. To illustrate the specialization of the optimized generators trained by our method, we visualize word clouds of keywords for harmful intents that each generator special...

[40]

We aim to find an $\eta_t^*$ such that the generator with a smaller loss $\ell_t^m$ can be assigned a larger loss weight $\eta_t^m$ in each training step $t$ as much as possible (maximizing exploitation). Intuitively, to this end, we need to shift the weight value from generators with larger losses to those with smaller losses. Thus, we can formalize this process mathematical...
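The weight-shifting idea above can be illustrated with a minimal sketch. It assumes the weights take a softmax form, $\eta_t^m \propto \exp(-\ell_t^m / \beta_t)$ with a temperature $\beta_t > 0$; the excerpt does not state the exact formula, so the function name `loss_weights`, the parameter `beta`, and the sample losses are all hypothetical.

```python
import math

def loss_weights(losses, beta):
    """Assumed form (not from the paper): softmax over negative losses.

    A smaller loss receives a larger weight; a smaller temperature `beta`
    concentrates more weight on the smallest loss.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Shift by the minimum loss for numerical stability; the weights are unchanged.
    lowest = min(losses)
    scores = [math.exp(-(loss - lowest) / beta) for loss in losses]
    total = sum(scores)
    return [score / total for score in scores]

# Hypothetical per-generator losses at one training step.
losses = [0.9, 0.4, 1.5]
for beta in (5.0, 1.0, 0.2):
    print(beta, [round(w, 3) for w in loss_weights(losses, beta)])
```

Under this assumed form, lowering `beta` moves weight from the generators with larger losses to the one with the smallest loss, which matches the exploitation behaviour described in the text.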

[41]

(23) This shows that $U'(\tau) \le 0$, and the inequality $U'(\tau) < 0$ is strict whenever the losses $\ell_t$ are not all identical. Combining this result with Eq. 22 yields:

$\frac{dH}{d\tau} = \tau U'(\tau) < 0, \quad \tau > 0.$ (24)

Step (5). From Eq. 24, since $\tau = 1/\beta_t$, we apply the chain rule to obtain: $\frac{dH}{d\beta_t} = \frac{dH}{d\tau} \cdot \frac{d\tau}{d\beta_t} = -\left(\frac{1}{\beta^2}\right)$...
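The identity in Eq. 24 can be checked numerically with a short sketch. It assumes definitions that the excerpt does not show: weights $\eta^m \propto \exp(-\tau \ell^m)$, $H$ as the Shannon entropy of those weights, and $U(\tau)$ as the weighted mean loss. These are inferred from the form of the derivation, and the helper names and sample losses are hypothetical.

```python
import math

def weights(losses, tau):
    # Assumed form: eta_m proportional to exp(-tau * loss_m).
    lowest = min(losses)
    scores = [math.exp(-tau * (loss - lowest)) for loss in losses]
    total = sum(scores)
    return [s / total for s in scores]

def entropy(losses, tau):
    # Shannon entropy H of the weight distribution.
    return -sum(w * math.log(w) for w in weights(losses, tau) if w > 0)

def expected_loss(losses, tau):
    # U(tau): mean loss under the weight distribution.
    return sum(w * loss for w, loss in zip(weights(losses, tau), losses))

def derivative(f, x, h=1e-5):
    # Central finite difference.
    return (f(x + h) - f(x - h)) / (2 * h)

losses = [0.9, 0.4, 1.5]  # hypothetical, not all identical
tau = 0.7
dH_dtau = derivative(lambda t: entropy(losses, t), tau)
dU_dtau = derivative(lambda t: expected_loss(losses, t), tau)
print(dH_dtau, tau * dU_dtau)  # the two values should agree closely and be negative
```

With $\tau = 1/\beta_t$, a negative $dH/d\tau$ means the entropy grows as $\beta_t$ grows under these assumptions, since $d\tau/d\beta_t$ is negative.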

[42]

+ PixArt-α (Chen et al., 2024)”. In Sec. 3 of the main paper, to further uncover potential risks in the wide-net-casting scenario, our investigation includes model-based jailbreaks for MLLMs. Since model-based jailbreak approaches for MLLMs remain largely underexplored, we construct a model-based jailbreak for MLLMs by adopting pipelines of LLM-oriented m...

[43]

+ PixArt-α (Chen et al., 2024)”. Notably, MLAI jailbreaks and bypasses safety-aligned MLLMs by generating adversarial perturbations on input images. Correspondingly, inspired by recent work (Wu et al., 2025), we optimize PixArt-α (Chen et al.,

[44]

+ PixArt-α (Chen et al., 2024)” first uses MLAI to overgenerate adversarial images and then selects the top K = 20 high-quality adversarial samples for each harmful intent to supervise PixArt-α in generating adversarial images for jailbreaking large models.

H. Additional Details w.r.t. Experimental Setup in Sec. 3 of the Main Paper

Datasets. The AdvBench dat...

[45]

serves as a benchmark for evaluating the safety of MLLMs, which comprises 5,040 text-image pairs and covers 13 typical prohibited scenarios specified in OpenAI and Meta’s usage strategy (Achiam et al., 2023; Inan et al., 2023).

Metrics. In the main paper, we use Attack Success Rate (ASR) as one of our primary evaluation metrics to quantify the effectivenes...

[46]

as an automatic LLM judge to determine whether a jailbreak is successful. For each intent-response pair, Beaver-Dam-7B can output a rating of the jailbreak quality. Following common practice (Hao et al., 2025), we classify the response as a successful jailbreak if the jailbreak quality rating exceeds 0.5. Therefore, for each intent, responses from all gen...
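The thresholding rule above can be sketched as a small metric computation. Because the final sentence of the excerpt is truncated, the aggregation used here (an intent counts as jailbroken if any response in its set is rated above 0.5) is an assumption, and the function name `wide_net_asr` and the sample ratings are hypothetical.

```python
def wide_net_asr(ratings_per_intent, threshold=0.5):
    """Fraction of intents with at least one response rated above `threshold`.

    `ratings_per_intent` holds one list of judge ratings per intent, with one
    rating per response. The any-response aggregation is an assumption.
    """
    if not ratings_per_intent:
        raise ValueError("need at least one intent")
    hits = sum(
        1 for ratings in ratings_per_intent
        if any(r > threshold for r in ratings)
    )
    return hits / len(ratings_per_intent)

# Hypothetical judge ratings: 4 intents, 3 responses each.
ratings = [
    [0.1, 0.7, 0.2],
    [0.3, 0.4, 0.5],  # 0.5 does not exceed the threshold
    [0.9, 0.8, 0.6],
    [0.0, 0.2, 0.1],
]
print(wide_net_asr(ratings))                    # all responses considered
print(wide_net_asr([[r[0]] for r in ratings]))  # first response only
```

Restricting each intent to a single response gives the single-response rate for comparison; on the same ratings it can never exceed the rate computed over all responses.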
