New Wide-Net-Casting Jailbreak Attacks Risk Large Models
Pith reviewed 2026-05-20 14:42 UTC · model grok-4.3
The pith
Adversaries querying groups of large models can achieve jailbreak success rates up to 100% with a tailored attack method.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the wide-net-casting scenario where an adversary can query a group of large models, a tailored jailbreak method can elicit harmful outputs with success rates reaching 100% in some experiments on models without additional safeguards.
What carries the argument
A novel jailbreak method tailored to the wide-net-casting scenario, which leverages querying multiple models to increase the chance of bypassing individual safeguards.
If this is right
- Evaluations of large model safety should incorporate multi-model attack scenarios.
- Existing defenses for single models may not suffice against wide-net-casting attacks.
- The high success rates indicate that wide-net-casting represents a distinct high-risk scenario.
- Future defense research should prioritize methods effective against group querying.
Where Pith is reading between the lines
- Companies deploying multiple AI models might need coordinated safety mechanisms across them.
- Attackers could use this to target ensembles or services offering multiple models.
- Testing the method on models with safeguards could quantify how much protection is needed.
Load-bearing premise
That querying a group of large models is a practical and realistic attack scenario for adversaries, distinct from single-model attacks.
What would settle it
An experiment where the tailored method is tested against a group of large models each equipped with standard safeguards, showing significantly lower success rates.
Figures
read the original abstract
Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the wide-net-casting scenario, where an adversary can query a group of large models instead of a single one to elicit harmful outputs. Our analysis reveals substantial yet previously overlooked safety risks under this scenario. As a key part of our analysis, we further develop a novel jailbreak method tailored to the wide-net-casting scenario. With this tailored method, the jailbreak success rate can even reach 100\% in some experiments when targeting the large models without additional safeguards, exposing wide-net-casting as a distinct, high-risk scenario that warrants attention in future evaluation and defense research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies a previously unexplored 'wide-net-casting' jailbreak scenario in which an adversary queries a group of large language models simultaneously (rather than a single model) to elicit harmful outputs. It develops a novel jailbreak method tailored to this multi-model setting and reports that the method achieves jailbreak success rates reaching 100% in some experiments against large models lacking additional safeguards, arguing that this scenario exposes substantial and previously overlooked safety risks that merit attention in future evaluation and defense work.
Significance. If the empirical results hold after rigorous validation, including proper baselines and controls, the work could usefully expand the threat model for LLM safety beyond single-model interactions and encourage broader multi-model testing in red-teaming protocols. The emphasis on a practical attack vector that leverages access to multiple models simultaneously is a potentially valuable contribution to the field.
major comments (2)
- [Abstract] Abstract: The central claim that the tailored method reaches 100% success 'in some experiments' is presented without any description of the experimental setup, including the specific models evaluated, number of queries or trials, success criteria, presence or absence of safeguards, or statistical controls. This absence leaves the primary empirical result without visible supporting evidence and directly undermines evaluation of the paper's core contribution.
- [Method or Evaluation section] Method or Evaluation section (inferred from abstract claims): No baseline results are reported for the same tailored method applied to individual models rather than the group. Without this comparison, it remains unclear whether the reported high success rates are attributable to the wide-net-casting scenario itself or would be achievable by sequential single-model queries, which is required to substantiate the claim that wide-net-casting constitutes a distinct high-risk setting.
minor comments (1)
- [Introduction] The operational definition of 'wide-net-casting' (e.g., whether queries are simultaneous, how responses are aggregated, or the exact adversary model) should be stated explicitly and early to avoid ambiguity for readers.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the tailored method reaches 100% success 'in some experiments' is presented without any description of the experimental setup, including the specific models evaluated, number of queries or trials, success criteria, presence or absence of safeguards, or statistical controls. This absence leaves the primary empirical result without visible supporting evidence and directly undermines evaluation of the paper's core contribution.
Authors: We agree with the referee that the abstract would benefit from additional context to support the key empirical claim. The full experimental details are described in the Method and Evaluation sections of the manuscript. To improve clarity and address this concern directly, we will revise the abstract to briefly summarize the experimental conditions under which the 100% success rate was achieved. revision: yes
-
Referee: [Method or Evaluation section] Method or Evaluation section (inferred from abstract claims): No baseline results are reported for the same tailored method applied to individual models rather than the group. Without this comparison, it remains unclear whether the reported high success rates are attributable to the wide-net-casting scenario itself or would be achievable by sequential single-model queries, which is required to substantiate the claim that wide-net-casting constitutes a distinct high-risk setting.
Authors: This is a valid point. The tailored jailbreak method is designed to exploit the simultaneous access to multiple models in ways that are not possible with sequential single-model queries. However, to more rigorously demonstrate that the wide-net-casting scenario presents unique risks, we will include additional baseline experiments applying the method (or suitable adaptations) to individual models and compare the success rates. This will be added to the Evaluation section. revision: yes
Circularity Check
No circularity: empirical analysis with no derivations or self-referential fits
full rationale
The paper presents an empirical study identifying a wide-net-casting jailbreak scenario and a tailored attack method, reporting observed success rates up to 100% in experiments. No equations, derivations, predictions from fitted parameters, or first-principles results are present in the provided text. The central claims rest on experimental outcomes rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain. The analysis is self-contained against external benchmarks of attack success, with no reduction of results to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Querying multiple large models constitutes a distinct and practical adversarial scenario
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We design a novel jailbreak method tailored to the wide-net-casting scenario. The method pairs each target large model with a dedicated 'jailbreak expert'... η∗t = (η1,∗t , … , ηM,∗t ), ηi,∗t = exp(−ℓit/βt) / Σ exp(−ℓmt/βt)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
W ASR = 1/N Σ WM m=1 snm (logical OR across models in group)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Gehman, S., Gururangan, S., Sap, M., Choi, Y ., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462,
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[4]
T., Nakov, P., and Gurevych, I
9 New Wide-Net-Casting Jailbreak Attacks Risk Large Models Geng, J., Tran, T. T., Nakov, P., and Gurevych, I. Con in- struction: Universal jailbreaking of multimodal large lan- guage models via non-textual modalities. arXiv preprint arXiv:2506.00548,
-
[5]
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM, T., Zeng, A., Xu, B., Wang, B., Zhang, C., Yin, D., Zhang, D., Rojas, D., Feng, G., Zhao, H., et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y ., Tontchev, M., Hu, Q., Fuller, B., Testug- gine, D., et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
From clip to dino: Visual encoders shout in multi-modal large language models,
Jiang, D., Liu, Y ., Liu, S., Zhao, J., Zhang, H., Gao, Z., Zhang, X., Li, J., and Xiong, H. From clip to dino: Visual encoders shout in multi-modal large language models. arXiv preprint arXiv:2310.08825,
-
[9]
Robustkv: Defending large language models against jailbreak attacks via kv eviction
Jiang, T., Wang, Z., Liang, J., Li, C., Wang, Y ., and Wang, T. Robustkv: Defending large language models against jailbreak attacks via kv eviction. arXiv preprint arXiv:2410.19937,
-
[10]
Li, Z., Rahmani, H., Zhang, J., Xue, Y ., Mirmehdi, M., Kuen, J., Gu, J., and Liu, J. Diffgraph: An au- tomated agent-driven model merging framework for in-the-wild text-to-image generation. arXiv preprint arXiv:2603.20470,
-
[11]
arXiv preprint arXiv:2404.07921
Liao, Z. and Sun, H. Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms. arXiv preprint arXiv:2404.07921,
-
[12]
Towards understanding jailbreak attacks in llms: A representation space analysis
Lin, Y ., He, P., Xu, H., Xing, Y ., Yamada, M., Liu, H., and Tang, J. Towards understanding jailbreak attacks in llms: A representation space analysis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7067–7085,
work page 2024
-
[13]
Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tun- ing. Advances in neural information processing systems, 36:34892–34916, 2023a. Liu, X., Xu, N., Chen, M., and Xiao, C. Autodan: Generat- ing stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023b. Liu, X., Zhu, Y ., Gu, J., Lan, Y ., Yang, C., and...
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
SGDR: Stochastic Gradient Descent with Warm Restarts
10 New Wide-Net-Casting Jailbreak Attacks Risk Large Models Loshchilov, I. and Hutter, F. Sgdr: Stochastic gra- dient descent with warm restarts. arXiv preprint arXiv:1608.03983,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Visual contextual at- tack: Jailbreaking mllms with image-driven context injec- tion
Miao, Z., Ding, Y ., Li, L., and Shao, J. Visual contextual at- tack: Jailbreaking mllms with image-driven context injec- tion. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,
work page 2025
-
[16]
Accessed: 2025-11-06. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744,
work page 2025
-
[17]
arXiv preprint arXiv:2404.16873 (2024)
Paulus, A., Zharmagambetov, A., Guo, C., Amos, B., and Tian, Y . Advprompter: Fast adaptive adversarial prompt- ing for llms. arXiv preprint arXiv:2404.16873,
-
[18]
Robey, A., Wong, E., Hassani, H., and Pappas, G. J. Smooth- llm: Defending large language models against jailbreak- ing attacks. arXiv preprint arXiv:2310.03684,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Plug and Pray: Exploiting off-the-shelf components of multi-modal models
Shayegani, E., Dong, Y ., and Abu-Ghazaleh, N. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. arXiv preprint arXiv:2307.14539,
-
[20]
Shi, Y . and Eberhart, R. A modified particle swarm optimizer. In 1998 IEEE international conference on evolutionary computation proceedings. IEEE world congress on computational intelligence (Cat. No. 98TH8360), pp. 69–73. Ieee,
work page 1998
-
[21]
Shi, Y . and Eberhart, R. C. Empirical study of parti- cle swarm optimization. In Proceedings of the 1999 congress on evolutionary computation-CEC99 (Cat. No. 99TH8406), volume 3, pp. 1945–1950. IEEE,
work page 1999
-
[22]
Iterative self-tuning llms for enhanced jailbreaking capabilities
Sun, C.-E., Liu, X., Yang, W., Weng, T.-W., Cheng, H., San, A., Galley, M., and Gao, J. Iterative self-tuning llms for enhanced jailbreaking capabilities. arXiv preprint arXiv:2410.18469,
-
[23]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Team, G., Georgiev, P., Lei, V . I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understand- ing across millions of tokens of context. arXiv preprint arXiv:2403.05530,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Annealed winner-takes-all for motion forecasting
Xu, Y ., Letzelter, V ., Chen, M., Zablocki, ´E., and Cord, M. Annealed winner-takes-all for motion forecasting. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 1264–1270. IEEE,
work page 2025
-
[26]
Virtual context enhancing jailbreak attacks with special token injection
Zhou, Y ., Lu, L., Sun, R., Zhou, P., and Sun, L. Virtual context enhancing jailbreak attacks with special token injection. In Al-Onaizan, Y ., Bansal, M., and Chen, Y .- N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 11843–11857, Miami, Florida, USA,
work page 2024
-
[27]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Association for Computational Lin- guistics. doi: 10.18653/v1/2024.findings-emnlp.692. Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.findings-emnlp.692 2024
-
[28]
Safety fine-tuning at (almost) no cost: A baseline for vision large language models
Zong, Y ., Bohdal, O., Yu, T., Yang, Y ., and Hospedales, T. Safety fine-tuning at (almost) no cost: A base- line for vision large language models. arXiv preprint arXiv:2402.02207,
-
[29]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversar- ial attacks on aligned language models. arXiv preprint arXiv:2307.15043,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
+ PixArt-α (Chen et al., 2024)”. As shown in Tab. 8, both the attack success rate and response toxicity consistently increase compared to single-model results across all groups, suggesting that, regardless of variation in target group size, applying existing jailbreak attack methods to the wide-net-casting scenario can consistently amplify safety risks. T...
work page 2024
-
[31]
22.0% / 0.158 13.3% / 0.089 19.2% / 0.134 12.2% / 0.08237.5% / 0.311 + IMMUNE (Ghosal et al., 2025)MLAI (Hao et al.,
work page 2025
-
[32]
and Qwen-VL-Max (Bai et al., 2023). As shown in Tab. 13, using different large models to select the response yields almost the same W-Toxicity Scores, indicating highly consistent response selections and demonstrating the robustness of the W-Toxicity Score to the choice of response-selection LLMs. Table 13.Evaluation of jailbreaking MLLMs with different r...
work page 2023
-
[33]
Beyond these open-source models, numerous closed-source commercial models also exist (e.g., Qwen-VL-Max (Bai et al., 2023), Gemini-1.5-Pro (Team et al., 2024), and GPT-4o (OpenAI, 2024)) and have been widely used in prior jailbreak studies for evaluation (Hao et al., 2025; Li et al., 2024; Xie et al., 2024; Yang et al., 2025). 14 New Wide-Net-Casting Jail...
work page 2023
-
[34]
AdvBench Baseline (ReMiss (Xie et al., 2024)) 50.2% / 0.469 33.1% / 0.293 28.5% / 0.257 Naive Strategy 1 55.5% / 0.513 37.3% / 0.325 31.1% / 0.279Naive Strategy 2 56.2% / 0.520 38.0% / 0.336 31.8% / 0.284Ours73.3% / 0.672 51.9% / 0.462 40.6% / 0.363 Table 15.Evaluation of jailbreaking MLLMs within the same model family using different methods tailored to ...
work page 2024
-
[35]
+ PixArt-α(Chen et al., 2024)) 89.4% / 0.858 32.3% / 0.284 31.5% / 0.273 28.1% / 0.242 Naive Strategy 1 92.4% / 0.886 34.7% / 0.317 33.8% / 0.295 32.0% / 0.283Naive Strategy 2 93.6% / 0.897 35.5% / 0.318 34.4% / 0.307 32.7% / 0.292Ours99.6% / 0.932 46.2% / 0.412 44.2% / 0.406 40.6% / 0.365 MM-SafetyBench Baseline (MLAI (Hao et al.,
work page 2024
-
[36]
+ PixArt-α(Chen et al., 2024)) 89.9% / 0.865 34.3% / 0.303 33.1% / 0.294 29.4% / 0.257 Naive Strategy 1 93.0% / 0.892 37.1% / 0.335 35.3% / 0.318 31.5% / 0.283Naive Strategy 2 93.7% / 0.903 37.8% / 0.343 36.6% / 0.321 32.2% / 0.295Ours100% / 0.934 47.6% / 0.435 46.7% / 0.423 42.3% / 0.381 To evaluate jailbreak methods on large closed-source models, a comm...
work page 2024
-
[37]
+ PixArt-α(Chen et al., 2024)) 43.6% / 0.392 52.1% / 0.489 38.9% / 0.341 47.7% / 0.43369.5% / 0.653Ours 51.2% / 0.481 61.2% / 0.580 46.1% / 0.424 55.2% / 0.52386.8% / 0.799 C. Additional Ablation Studies In this section, we conduct extensive ablation studies on AdvBench, focusing on jailbreaking MLLMs from different families. We perform evaluation on two ...
work page 2024
-
[38]
+ PixArt-α(Chen et al., 2024)) 93.3% / 0.867 37.5% / 0.311 Naive Strategy 1 95.5% / 0.883 40.6% / 0.355 Naive Strategy 2 95.8% / 0.898 41.1% / 0.363 Naive Strategy 3 95.6% / 0.887 40.9% / 0.358 Ours100% / 0.940 50.8% / 0.473 Impact of our output selection strategy during inference.Our method produces a set of responses for each intent when simultaneously ...
work page 2024
-
[39]
Ours (joint-training from scratch) 100% / 0.938 50.7% / 0.470 Ours (joint-training from independently-trained generators)100% / 0.940 50.8% / 0.473 Visualization of specialization analysis.To illustrate the specialization of the optimized generators trained by our method, we visualize word clouds of keywords for harmful intents that each generator special...
work page 2024
-
[40]
We aim to find aη ∗ t such that the generator with a smaller lossℓ m t can be assigned a larger loss weight ηm t in each training step t as much as possible (maximizing exploitation). Intuitively, to this end, we need to shift the weight value from generators with larger losses to those with smaller losses. Thus, we can formalize this process mathematical...
work page 2024
-
[41]
(23) This shows that U ′(τ)≤0 , and the inequality of U ′(τ)<0 is strict whenever the losses ℓt are not all identical. Combining this result with Eq. 22 yields: dH dτ =τ U ′(τ)<0, τ >0. (24) 21 New Wide-Net-Casting Jailbreak Attacks Risk Large Models Step (5).From Eq. 24, sinceτ= 1/β t, we apply the chain rule to obtain: dH dβt = dH dτ · dτ dβt =−( 1 β2 )...
work page 2007
-
[42]
+ PixArt-α (Chen et al., 2024)”. In Sec. 3 of the main paper, to further uncover potential risks in the wide-net-casting scenario, our investigation includes model-based jailbreaks for MLLMs. Since model-based jailbreak approaches for MLLMs remain largely underexplored, we construct a model-based jailbreak for MLLMs by adopting pipelines of LLM-oriented m...
work page 2024
-
[43]
+ PixArt-α(Chen et al., 2024)”. Notably, MLAI jailbreaks and bypasses safety-aligned MLLMs by generating adversarial perturbations on input images. Correspondingly, inspired by recent work (Wu et al., 2025), we optimize PixArt-α (Chen et al.,
work page 2024
-
[44]
+ PixArt-α (Chen et al., 2024)” first uses MLAI to overgenerate adversarial images and then selects the top K= 20 high-quality adversarial samples for each harmful intent to supervise PixArt-α in generating adversarial images for jailbreaking large models. H. Additional Details w.r.t Experimental Setup in Sec. 3 of the Main Paper Datasets.The AdvBench dat...
work page 2024
-
[45]
serves as a benchmark for evaluating the safety of MLLMs, which comprises 5,040 text-image pairs and covers 13 typical prohibited scenarios specified in OpenAI and Meta’s usage strategy (Achiam et al., 2023; Inan et al., 2023). Metrics.In the main paper, we use Attack Success Rate (ASR) as one of our primary evaluation metrics to quantify the effectivenes...
work page 2023
-
[46]
For each intent-response pair, Beaver-Dam-7B can output a rating of the jailbreak quality
as an automatic LLM judge to determine whether a jailbreak is successful. For each intent-response pair, Beaver-Dam-7B can output a rating of the jailbreak quality. Following common practice (Hao et al., 2025), we classify the response as a successful jailbreak if the jailbreak quality rating exceeds 0.5. Therefore, for each intent, responses from all gen...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.