
arxiv: 2605.17128 · v1 · pith:4NHLCNU2new · submitted 2026-05-16 · 💻 cs.CR · cs.AI

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

Pith reviewed 2026-05-20 14:42 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreak attacks · large language models · safety risks · wide-net-casting · multi-model attacks · adversarial attacks

The pith

Adversaries querying groups of large models can achieve jailbreak success rates up to 100% with a tailored attack method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper identifies a new attack scenario called wide-net-casting, in which an adversary queries multiple large models instead of focusing on one. The authors show that this approach uncovers safety risks that single-model defenses overlook. They introduce a novel jailbreak method designed specifically for this multi-model setting. In experiments, the method reaches 100% success rate against large models without extra protections. This suggests that safety research needs to address scenarios where attackers can cast a wide net across several models.

Core claim

In the wide-net-casting scenario where an adversary can query a group of large models, a tailored jailbreak method can elicit harmful outputs with success rates reaching 100% in some experiments on models without additional safeguards.

What carries the argument

A novel jailbreak method tailored to the wide-net-casting scenario, which leverages querying multiple models to increase the chance of bypassing individual safeguards.
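As a rough illustration of why querying a group raises the adversary's chances, consider a simplified model that is an assumption of this note, not something stated in the paper: each of the $M$ models is bypassed independently with probability $p_m$ (a hypothetical symbol). Success against any one model is sufficient, so:

```latex
P(\text{group success}) \;=\; 1 - \prod_{m=1}^{M} \bigl(1 - p_m\bigr)
% Worked example with hypothetical numbers: M = 5 models, each with p_m = 0.4
% P = 1 - 0.6^5 = 1 - 0.07776 = 0.92224
```

In practice failures across models are probably correlated (similar training data and alignment recipes), so this independence figure is best read as a reference point rather than a prediction.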

If this is right

  • Evaluations of large model safety should incorporate multi-model attack scenarios.
  • Existing defenses for single models may not suffice against wide-net-casting attacks.
  • The high success rates indicate that wide-net-casting represents a distinct high-risk scenario.
  • Future defense research should prioritize methods effective against group querying.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Companies deploying multiple AI models might need coordinated safety mechanisms across them.
  • Attackers could use this to target ensembles or services offering multiple models.
  • Testing the method on models with safeguards could quantify how much protection is needed.

Load-bearing premise

That querying a group of large models is a practical and realistic attack scenario for adversaries, distinct from single-model attacks.

What would settle it

An experiment where the tailored method is tested against a group of large models each equipped with standard safeguards, showing significantly lower success rates.
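Whether a drop in success rate is "significant" depends on how many intents were tested. A minimal sketch, assuming success counts are summarized as simple proportions, is to attach a Wilson score interval to each reported rate. The function name and the counts below are hypothetical and are not taken from the paper.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: a 100% observed rate on 20 intents vs. on 500 intents.
low_20, high_20 = wilson_interval(20, 20)
low_500, high_500 = wilson_interval(500, 500)
print(f"20/20:   [{low_20:.3f}, {high_20:.3f}]")
print(f"500/500: [{low_500:.3f}, {high_500:.3f}]")
```

The same observed 100% is compatible with a noticeably lower true rate when the sample is small, which is why the number of trials matters when comparing guarded and unguarded groups.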

Figures

Figures reproduced from arXiv: 2605.17128 by Haoxuan Qu, Hossein Rahmani, Jun Liu, Qiuchi Xiang.

Figure 1
Figure 1: Illustration of the single-model jailbreak scenario and the wide-net-casting jailbreak scenario. As shown, in the wide-net-casting scenario, unlike the single-model case, successfully jailbreaking any one large model in the group is sufficient for the adversary to obtain a desired harmful response.
Figure 2
Figure 2: Illustration of straightforward adaptation of an existing model-based jailbreak method to the wide-net-casting scenario. A jailbreak attack is deemed successful in the wide-net-casting scenario as long as any target large model is successfully jailbroken.
Figure 3
Figure 3: Visualization of word clouds showing the keywords of harmful intents that each optimized generator specializes in under the wide-net-casting scenario on AdvBench. The word clouds reflect the keywords and their frequencies (larger words indicate higher frequency) among the harmful intents for which each generator is most proficient.
Figure 4
Figure 4: Visualization for the feature space of the sentence-level representation of harmful intent that each generator specializes in.
Figure 5
Figure 5: Qualitative results comparing our method with the straightforward adaptation (Baseline) and two naive strategies of jailbreaking LLMs in (a) and MLLMs in (b), where the green model name indicates a successful jailbreak, and a red model name indicates failure. As shown, our method consistently achieves successful jailbreaking across all intents, while the naive strategies succeed only on a few intents.
Original abstract

Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the wide-net-casting scenario, where an adversary can query a group of large models instead of a single one to elicit harmful outputs. Our analysis reveals substantial yet previously overlooked safety risks under this scenario. As a key part of our analysis, we further develop a novel jailbreak method tailored to the wide-net-casting scenario. With this tailored method, the jailbreak success rate can even reach 100% in some experiments when targeting the large models without additional safeguards, exposing wide-net-casting as a distinct, high-risk scenario that warrants attention in future evaluation and defense research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies a previously unexplored 'wide-net-casting' jailbreak scenario in which an adversary queries a group of large language models simultaneously (rather than a single model) to elicit harmful outputs. It develops a novel jailbreak method tailored to this multi-model setting and reports that the method achieves jailbreak success rates reaching 100% in some experiments against large models lacking additional safeguards, arguing that this scenario exposes substantial and previously overlooked safety risks that merit attention in future evaluation and defense work.

Significance. If the empirical results hold after rigorous validation, including proper baselines and controls, the work could usefully expand the threat model for LLM safety beyond single-model interactions and encourage broader multi-model testing in red-teaming protocols. The emphasis on a practical attack vector that leverages access to multiple models simultaneously is a potentially valuable contribution to the field.

major comments (2)
  1. [Abstract] The central claim that the tailored method reaches 100% success 'in some experiments' is presented without any description of the experimental setup, including the specific models evaluated, number of queries or trials, success criteria, presence or absence of safeguards, or statistical controls. This absence leaves the primary empirical result without visible supporting evidence and directly undermines evaluation of the paper's core contribution.
  2. [Method or Evaluation section, inferred from abstract claims] No baseline results are reported for the same tailored method applied to individual models rather than the group. Without this comparison, it remains unclear whether the reported high success rates are attributable to the wide-net-casting scenario itself or would be achievable by sequential single-model queries. The comparison is required to substantiate the claim that wide-net-casting constitutes a distinct high-risk setting.
minor comments (1)
  1. [Introduction] The operational definition of 'wide-net-casting' (e.g., whether queries are simultaneous, how responses are aggregated, or the exact adversary model) should be stated explicitly and early to avoid ambiguity for readers.
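The baseline comparison requested in major comment 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `outcomes` is a hypothetical table of judge labels, where `outcomes[i][m]` is True if intent `i` produced a successful jailbreak on model `m`. The function reports the per-model success rate, the group rate under the "any model suffices" criterion, and the rate that independent per-model failures would predict.

```python
from math import prod


def group_success_rates(outcomes):
    """Return (per-model rates, any-of-group rate, independence baseline).

    outcomes[i][m] is True if intent i succeeded against model m.
    The layout and labels are hypothetical, for illustration only.
    """
    n_intents = len(outcomes)
    n_models = len(outcomes[0])
    per_model = [
        sum(row[m] for row in outcomes) / n_intents for m in range(n_models)
    ]
    # Group criterion: an intent counts as a success if any model was bypassed.
    any_model = sum(any(row) for row in outcomes) / n_intents
    # What the group rate would be if per-model failures were independent.
    independent = 1 - prod(1 - p for p in per_model)
    return per_model, any_model, independent


# Toy data: 4 intents x 3 models (made-up labels).
outcomes = [
    [True, False, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]
per_model, any_model, independent = group_success_rates(outcomes)
print(per_model, any_model, independent)
```

Reporting all three numbers side by side would show how much of a high group success rate comes simply from having more targets, and how much is specific to a method tailored to the group setting.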

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.

  1. Referee: [Abstract] The central claim that the tailored method reaches 100% success 'in some experiments' is presented without any description of the experimental setup, including the specific models evaluated, number of queries or trials, success criteria, presence or absence of safeguards, or statistical controls. This absence leaves the primary empirical result without visible supporting evidence and directly undermines evaluation of the paper's core contribution.

    Authors: We agree with the referee that the abstract would benefit from additional context to support the key empirical claim. The full experimental details are described in the Method and Evaluation sections of the manuscript. To improve clarity and address this concern directly, we will revise the abstract to briefly summarize the experimental conditions under which the 100% success rate was achieved. revision: yes

  2. Referee: [Method or Evaluation section, inferred from abstract claims] No baseline results are reported for the same tailored method applied to individual models rather than the group. Without this comparison, it remains unclear whether the reported high success rates are attributable to the wide-net-casting scenario itself or would be achievable by sequential single-model queries. The comparison is required to substantiate the claim that wide-net-casting constitutes a distinct high-risk setting.

    Authors: This is a valid point. The tailored jailbreak method is designed to exploit the simultaneous access to multiple models in ways that are not possible with sequential single-model queries. However, to more rigorously demonstrate that the wide-net-casting scenario presents unique risks, we will include additional baseline experiments applying the method (or suitable adaptations) to individual models and compare the success rates. This will be added to the Evaluation section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical analysis with no derivations or self-referential fits


The paper presents an empirical study identifying a wide-net-casting jailbreak scenario and a tailored attack method, reporting observed success rates up to 100% in experiments. No equations, derivations, predictions from fitted parameters, or first-principles results are present in the provided text. The central claims rest on experimental outcomes rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain. The analysis is self-contained against external benchmarks of attack success, with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits visibility into parameters or assumptions; the work relies on standard domain assumptions in AI safety about attack practicality and model behavior.

axioms (1)
  • domain assumption: Querying multiple large models constitutes a distinct and practical adversarial scenario.
    Invoked to frame the new risk and justify the tailored method.



