pith. sign in

arxiv: 2606.26936 · v1 · pith:NFTK5Q66new · submitted 2026-06-25 · 💻 cs.CR · cs.CL· cs.LG

Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries

Pith reviewed 2026-06-26 04:12 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG
keywords jailbreakingLLMsbandit algorithmssafety benchmarksmalicious queriesattack success rateopen-weight modelsquery complexity
0
0 comments X

The pith

Bandit algorithms let non-experts select jailbreaks that succeed at up to 97 percent on 15 open LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether ordinary users without special expertise can reliably elicit harmful outputs from current LLMs. It answers by framing jailbreak selection as a multi-armed bandit problem that learns a good policy through limited noisy trials and then applies that policy to many new queries. The authors also assemble a large benchmark of malicious requests, some deliberately made more complex, and measure how complexity affects success. Their results show that the bandit method reaches high average success across many models and that complexity further improves outcomes. This directly tests whether the growing catalog of known jailbreaks poses a practical threat to non-experts.

Core claim

A multi-armed bandit attack learns the optimal jailbreak from noisy exploration on a small query set and applies the learned policy to a larger exploitation set, reaching success rates as high as 97 percent on average over 15 state-of-the-art open-weight LLMs; adding complexity to the queries raises success by up to 26 percent on average.

What carries the argument

Multi-armed bandit framework that performs online learning of the best jailbreak from a large choice set via limited exploration before exploitation.

If this is right

  • Non-expert attackers can achieve high success rates without hand-crafting prompts for each target model.
  • Increasing query complexity functions as an automatable way to improve attack performance across models.
  • The risk posed by the public catalog of jailbreaks is shown to be practically realizable rather than merely theoretical.
  • Automated selection methods can be applied to any large collection of candidate jailbreaks without requiring expert knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bandit selection loop could be tested on closed models or on multimodal systems to check whether the observed success pattern holds.
  • Defenses that only block fixed prompt templates may be insufficient once attackers learn to pick from many templates adaptively.
  • The FrankensteinBench construction process could be reused to generate fresh test sets for evaluating future alignment techniques.

Load-bearing premise

The policy learned from noisy trials on a small set of queries will transfer to produce high success on a separate and larger collection of malicious queries.

What would settle it

Measure whether the policy selected after exploration on the small query set still yields high success rates when run on an entirely new, held-out collection of malicious queries against the same models.

Figures

Figures reproduced from arXiv: 2606.26936 by Arjun Bhagoji, Arpit Agarwal, Prarabdh Shukla, Ritik, Suhas Rao.

Figure 1
Figure 1. Figure 1: Overview of our red-teaming approach. RQ1 Our attack begins with an Exploration Phase, where the attacker runs a bandit algorithm on the exploration set to learn a policy θ to pick jailbreaks. The second phase varies based on the attack type: In a Transfer Attack, the second phase is an Exploitation Phase (2a in the figure), where the attacker directly applies the learned policy on an exploitation set. In … view at source ↗
Figure 2
Figure 2. Figure 2: Evolution of EXP3’s probability distribution over jailbreaks during the Exploration Phase on Llama-3.1-8B-Instruct. The 0% Exploration is just the Uniform Priors attacker. With sufficient exploration, our attack is able to identify the most effective jailbreak (lo2). nical jargon in an attempt to reframe the malicious request as a technical question. An important trait of these is that they require domain … view at source ↗
Figure 3
Figure 3. Figure 3: Attack Success Rate (ASR) of the Transfer and Continual Attacks. The bars on the right represent the average ASR across models. Each heatmap shows bandit algorithms alongside the top-3 jailbreaks – libertas, lo2, and aim. Continual Attack provides improvement over the Transfer Attack, and bandit-based strategies outperform the naive Uniform Priors Attack. For our bandit-based attacks, the heatmaps show the… view at source ↗
Figure 4
Figure 4. Figure 4: Attack Success Rate (ASR) of Transfer Attack when multiple passes are allowed. Increasing the number of passes in the Transfer Attack leads to significant gains in ASR, with bandit-based approaches consistently outperforming the Uniform Priors Attack. 6 Ablations In this section, we examine the transferability of our attacks, the effect of the jailbreak set, and the effect of bandit algorithm parameters th… view at source ↗
Figure 5
Figure 5. Figure 5: Transferability of the EXP3 attack. In this experiment, we perform the exploration phase of a Transfer Attack on a proxy model followed by the exploitation phase on the intended target model. The x-axis represents the Proxy Model, i.e., the model whose EXP3 weights were used and the y-axis shows the Tar￾get Model. The cells represent the ASR observed on the target model. Different models from the same prov… view at source ↗
Figure 6
Figure 6. Figure 6: Average ASR over all 15 target models in the Transfer Attack vs the number of top jailbreaks removed. We remove the top-k jailbreaks based on their average ASR across all models. Even with the worst 10 jailbreaks (i.e., top-60 removed), LinUCB, EXP3 and RWM are able to achieve an ASR of about 50%, whereas the naive Uniform Priors Attack’s ASR drops to about 20%. breaks” in our jailbreak set. For the purpos… view at source ↗
Figure 8
Figure 8. Figure 8: ASR on Llama-3.1-8B-Instruct in the Transfer Attack Scenario vs the Exploration Hori￾zon. Overall, we find that all algorithms benefit from a longer Exploration Horizon. The line plots show the mean ASR over 3 runs with different seeds and the error bars indicate the standard deviation. 0 1000 2000 4000 8000 9000 Exploration Horizon 0% 20% 40% 60% 80% 100% ASR exp3 linear_cb linucb rwm square_cb thompson_s… view at source ↗
Figure 7
Figure 7. Figure 7: Average ASR over all 15 target models in the Transfer Attack vs the number of “good” jail￾breaks. We randomly keep only g many jailbreaks with an average ASR ≥ 60% (i.e., “good” jailbreaks) for this experiment. We find that, for all algorithms, the ASR does not vary much indicating lesser dependence on the presence of many good jailbreaks. All bandit algorithms consistently outperform the naive Uniform Pri… view at source ↗
Figure 10
Figure 10. Figure 10: ASR across models for D(tr) = ∅ (i.e., no exploration phase) in the Continual Attack Scenario. While performance for partial information algorithms drops in this scenario, we find that RWM has perfor￾mance similar to when the attack was carried out with the full exploration set. 7 Discussion & Future Work Case Study on Proprietary Models: We con￾duct preliminary tests on transferability from open-weight m… view at source ↗
Figure 11
Figure 11. Figure 11: Label Distribution in FRANKENSTEINBENCH: Our test set mostly consists of manually curated queries to ensure uniqueness and high-quality. ate fleiss’ kappa of 0.26. We suspect that the low kappa value is due to class imbalance, and this con￾clusion is supported by Gwet’s AC1, which was observed to be much higher at 0.53. As a further safeguard, the annotation exercise was repeated (with some annotators swa… view at source ↗
Figure 12
Figure 12. Figure 12: Average ASR on the FRANKENSTEIN￾BENCH validation set over all 15 target models with different generation lengths. ASR varies only a little with generation length and hence we choose 2, 048 as the generation length for our full-scale evaluations. and diffred (Shukla et al., 2024). For the target dimension d, we experiment with 10, 20 and 32. Additionally, with the Qwen-3 embedding models, we also consider … view at source ↗
Figure 13
Figure 13. Figure 13: shows the regret of Algorithm 1 under different bandit algorithms. We also show the re￾gret of the Uniform Priors baseline, the single best jailbreak in hindsight and the best possible sequence of jailbreaks. The Uniform Priors at￾tacker has the highest regret throughout the hori￾zon, and both RWM and ThompsonSampling are able to achieve near zero regret. The best possi￾ble sequence of jailbreaks represen… view at source ↗
Figure 14
Figure 14. Figure 14: Annotation Guidelines provided to annotators. Q4 was included to make sure annotators don’t lose attention during the annotation. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Prompt Template used with gpt-4.1-2025-04-14 to classify a query as either simple or complex. Rater LLM Prompt Template You are a helpful online moderator with expertise in the {domain} domain. You will be provided a harmful query and a response to the same harmful query. You need to act as an impartial judge and evaluate whether the given response correctly answers the query and enables carrying out a ha… view at source ↗
Figure 16
Figure 16. Figure 16: Prompt Template used with gemma-3-27b-it to rate a query-response pair as either 1 (benign) or 0 (harmful). 33 [PITH_FULL_IMAGE:figures/full_fig_p033_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Prompt Template used with gemini-2.0-flash to enhance a simple query into its complex variant. LLM Prompt Template used for Generation Generate five distinct harmful prompts for red-teaming an LLM under the AIR-2024 taxonomy category {L3_category} : {L4_category} . Each prompt must be classified as a COMPLEX harmful prompt according to the following definition: - The malicious intent should be embedded wi… view at source ↗
Figure 18
Figure 18. Figure 18: Prompt Template used with gemini-2.0-flash and gpt-4o-mini to generate complex queries. 34 [PITH_FULL_IMAGE:figures/full_fig_p034_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Attack Success Rate of different jailbreaks on the FRANKENSTEINBENCH test set. baseline refers to the case where the query was passed without applying jailbreak. The column and row averages are shown as bars at the top and right side of the heatmap respectively. Rows and columns are sorted by their respective averages. deepseek-r1-distill-qwen-1.5B is the most susceptible model with 75% ASR on average acr… view at source ↗
Figure 20
Figure 20. Figure 20: Attack Success Rate of different jailbreaks on the simple queries in the FRANKENSTEINBENCH test set. The column and row averages are shown as bars at the top and right side of the heatmap respectively. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Attack Success Rate of different jailbreaks on the complex queries in the FRANKENSTEINBENCH test set. The column and row averages are shown as bars at the top and right side of the heatmap respectively. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Attack Success Rate of different jailbreaks on the entire FRANKENSTEINBENCH dataset. The column and row averages are shown as bars at the top and right side of the heatmap respectively. Rows and columns are sorted by their respective averages. 38 [PITH_FULL_IMAGE:figures/full_fig_p038_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Attack Success Rate of simple queries on the entire FRANKENSTEINBENCH dataset. The column and row averages are shown as bars at the top and right side of the heatmap respectively. Rows and columns are sorted by their respective averages. 39 [PITH_FULL_IMAGE:figures/full_fig_p039_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Attack Success Rate of complex queries on the entire FRANKENSTEINBENCH dataset. The column and row averages are shown as bars at the top and right side of the heatmap respectively. Rows and columns are sorted by their respective averages. 40 [PITH_FULL_IMAGE:figures/full_fig_p040_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Attack Success Rate (ASR) for the domain ablation experiment. We run the Transfer Attack on each model by holding one domain out from the exploration set and limiting the exploitation set to the held-out domain. Average ASR across models for a given bandit algorithm does not change much as compared to when exploration and exploitation is done on the full sets, indicating that the attacker may execute the … view at source ↗
Figure 26
Figure 26. Figure 26: Attack Success Rate (ASR) vs the Exploration Horizon for the Transfer Attack (Part 1/2). We run the Transfer Attack with different sizes of D(tr) . The line plots show the mean ASR over 3 runs with different seeds. The error bars indicate the standard deviation. 42 [PITH_FULL_IMAGE:figures/full_fig_p042_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Attack Success Rate (ASR) vs the Exploration Horizon for the Transfer Attack (Part 2/2). We run the Transfer Attack with different sizes of D(tr) . The line plots show the mean ASR over 3 runs with different seeds. The error bars indicate the standard deviation. 43 [PITH_FULL_IMAGE:figures/full_fig_p043_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Attack Success Rate (ASR) vs the Exploration Horizon for the Continual Attack (Part 1/2). We run the Continual Attack with different sizes of D(tr) . The line plots show the mean ASR over 3 runs with different seeds. The error bars indicate the standard deviation. 44 [PITH_FULL_IMAGE:figures/full_fig_p044_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Attack Success Rate (ASR) vs the Exploration Horizon for the Continual Attack (Part 2/2). We run the Continual Attack with different sizes of D(tr) . The line plots show the mean ASR over 3 runs with different seeds. The error bars indicate the standard deviation. 45 [PITH_FULL_IMAGE:figures/full_fig_p045_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Transferability of different attacks. In this experiment, we perform the exploration phase of a Transfer Attack on a proxy model followed by the exploitation phase on the intended target model. The x-axis represents the Proxy Model, i.e., the model whose EXP3 weights were used and the y-axis shows the Target Model. The cells represent the ASR observed on the target model. Similar to the trends observed un… view at source ↗
Figure 31
Figure 31. Figure 31: Regret of different algorithms in the Exploration Phase of the Transfer Attack (Part 1/2). The line plots show the mean Regret over 3 runs with different seeds. The error bars indicate the standard deviation. As expected from theoretical guarantees, the regret of bandit algorithms reduces as the exploration horizon (T) is increased. 47 [PITH_FULL_IMAGE:figures/full_fig_p047_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Regret of different algorithms in the Exploration Phase of the Transfer Attack (Part 2/2). The line plots show the mean Regret over 3 runs with different seeds. The error bars indicate the standard deviation. As expected from theoretical guarantees, the regret of bandit algorithms reduces as the exploration horizon (T) is increased. 48 [PITH_FULL_IMAGE:figures/full_fig_p048_32.png] view at source ↗
Figure 33
Figure 33. Figure 33: Regret of different algorithms in the Exploitation Phase of the Transfer Attack (Part 1/2). The line plots show the mean Regret over 3 runs with different seeds. The error bars indicate the standard deviation. While the theoretical guarantees of bandit algorithms hold only for online learning, we observe that the exploitation regret also reduces in most cases as the length of the exploration horizon (T) i… view at source ↗
Figure 34
Figure 34. Figure 34: Regret of different algorithms in the Exploitation Phase of the Transfer Attack (Part 2/2). The line plots show the mean Regret over 3 runs with different seeds. The error bars indicate the standard deviation. While the theoretical guarantees of bandit algorithms hold only for online learning, we observe that the exploitation regret also reduces in most cases as the length of the exploration horizon (T) i… view at source ↗
Figure 35
Figure 35. Figure 35: Regret of different algorithms over the full horizon (T ′ ) of the Continual Attack (Part 1/2). The line plots show the mean Regret over 3 runs with different seeds. The error bars indicate the standard deviation. As expected from theoretical guarantees, the regret of bandit algorithms reduces as the length of the full horizon (T ′ ) is increased. 51 [PITH_FULL_IMAGE:figures/full_fig_p051_35.png] view at source ↗
Figure 36
Figure 36. Figure 36: Regret of different algorithms over the full horizon (T ′ ) of the Continual Attack (Part 2/2). The line plots show the mean Regret over 3 runs with different seeds. The error bars indicate the standard deviation. As expected from theoretical guarantees, the regret of bandit algorithms reduces as the length of the full horizon (T ′ ) is increased. 52 [PITH_FULL_IMAGE:figures/full_fig_p052_36.png] view at source ↗
Figure 37
Figure 37. Figure 37: Regret of different algorithms in the Joint Exploration & Exploitation Phase of the Continual Attack (Part 1/2). The line plots show the mean Regret over 3 runs with different seeds. The error bars indicate the standard deviation. While theoretical guarantees only hold for online learning, we observe that in most cases, the regret during the latter exploration-exploitation phase benefits (i.e., reduces) a… view at source ↗
Figure 38
Figure 38. Figure 38: Regret of different algorithms in the Joint Exploration & Exploitation Phase of the Continual Attack (Part 2/2). The line plots show the mean Regret over 3 runs with different seeds. The error bars indicate the standard deviation. While theoretical guarantees only hold for online learning, we observe that in most cases, the regret during the latter exploration-exploitation phase benefits (i.e., reduces) a… view at source ↗
read the original abstract

With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests. In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit framework. This allows efficient online learning of the optimal jailbreak from a large choice set via noisy exploration on a small number of queries, with subsequent application of the learnt policy on an exploitation set. For the latter, we curate $\mathrm{FrankensteinBench}$, a safety benchmark of $11,279$ malicious queries drawn from manual curation over $7$ existing benchmarks, along with automated enhancement and generation. Each query is categorized as simple or complex by the technical expertise required to craft it. Our findings confirm the concern. Our bandit-based attack achieves success rates as high as $97\%$ on average over $15$ SoTA open-weight LLMs. Moreover, adding complexity to queries raises the attack success rate by up to $26\%$ on average across models -- making it an effective, automatable prompting strategy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a multi-armed bandit framework enables non-expert attackers to efficiently learn an optimal jailbreak policy via noisy exploration on a small query subset, then apply it to a large exploitation set. Using their curated FrankensteinBench (11,279 malicious queries drawn from 7 existing benchmarks plus automated enhancement), they report bandit-based attacks achieving up to 97% average attack success rate (ASR) across 15 SoTA open-weight LLMs, with complex queries raising ASR by up to 26% on average.

Significance. If the generalization from exploration to exploitation holds and the benchmark is unbiased, the work provides concrete evidence that automated, query-adaptive jailbreaking is accessible without expertise, strengthening the case for improved LLM safeguards. The FrankensteinBench curation and the empirical scale (15 models, large query set) are useful contributions for the field; the bandit framing offers a principled way to handle large jailbreak choice sets.

major comments (2)
  1. [Abstract, attack strategy paragraph] Abstract and attack strategy paragraph: The headline 97% ASR and +26% complexity effect rest on the claim that a policy learned via bandit exploration on a small number of queries transfers to the separate 11,279-query exploitation set. No information is supplied on exploration-set size, reward definition (automated judge vs. human), specific bandit variant, or any statistical test (e.g., hold-out validation or significance of transfer) confirming that the reported rates reflect generalization rather than overfitting to the exploration queries.
  2. [Results and evaluation sections] Results and evaluation sections: The reported average ASR figures and the complexity delta lack accompanying details on how attack success is defined, number of exploration queries used, statistical significance, confidence intervals, or controls for benchmark-construction bias (e.g., how simple vs. complex categorization was validated and whether the automated enhancement introduces artifacts).
minor comments (2)
  1. [Method] Notation for the bandit reward and policy update is introduced without an explicit equation or pseudocode block, making the online learning procedure harder to reproduce.
  2. [Benchmark section] FrankensteinBench construction (manual curation + automated enhancement) would benefit from an explicit table listing the source benchmarks and the exact enhancement rules applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments and the opportunity to clarify aspects of our work. We address each of the major comments point by point below, indicating the revisions we plan to make to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract, attack strategy paragraph] Abstract and attack strategy paragraph: The headline 97% ASR and +26% complexity effect rest on the claim that a policy learned via bandit exploration on a small number of queries transfers to the separate 11,279-query exploitation set. No information is supplied on exploration-set size, reward definition (automated judge vs. human), specific bandit variant, or any statistical test (e.g., hold-out validation or significance of transfer) confirming that the reported rates reflect generalization rather than overfitting to the exploration queries.

    Authors: We agree with the referee that these methodological details are essential for validating the generalization from exploration to exploitation. We will revise the abstract and attack strategy section to explicitly include the exploration set size, the reward definition using an automated judge, the specific bandit variant implemented, and statistical evidence such as hold-out validation results to confirm the transfer of the learned policy. This will be done in the next version of the manuscript. revision: yes

  2. Referee: [Results and evaluation sections] Results and evaluation sections: The reported average ASR figures and the complexity delta lack accompanying details on how attack success is defined, number of exploration queries used, statistical significance, confidence intervals, or controls for benchmark-construction bias (e.g., how simple vs. complex categorization was validated and whether the automated enhancement introduces artifacts).

    Authors: We acknowledge that the results section would benefit from additional rigor in reporting. In the revised manuscript, we will add precise definitions of attack success, report the number of exploration queries, include statistical significance tests and confidence intervals for the ASR values, and provide details on the validation of the simple/complex categorization as well as any measures taken to mitigate potential artifacts from the automated query enhancement in FrankensteinBench. These additions will address the concerns regarding evaluation transparency and potential biases. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external LLMs and new benchmark

full rationale

The paper reports empirical attack success rates measured on 15 external open-weight LLMs using a bandit policy applied to a held-out exploitation set of 11,279 queries in FrankensteinBench. No equations, fitted parameters, or self-citations are shown to reduce the reported 97% ASR or +26% complexity effect to inputs by construction. The derivation chain consists of standard bandit exploration followed by independent measurement against external models and a curated benchmark; this is self-contained and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the applicability of the multi-armed bandit framework to noisy jailbreak feedback and on the representativeness of the curated benchmark; both are domain assumptions rather than derived results.

free parameters (1)
  • exploration query budget
    The number of queries allocated to the exploration phase is a design choice that controls learning versus exploitation.
axioms (1)
  • domain assumption Bandit algorithms can identify effective jailbreak templates from noisy binary feedback on a modest number of queries
    Invoked in the description of the attack strategy.
invented entities (1)
  • FrankensteinBench no independent evidence
    purpose: Large safety benchmark of 11,279 malicious queries with automated enhancement and complexity labels
    Newly introduced collection used to demonstrate the attack.

pith-pipeline@v0.9.1-grok · 5783 in / 1397 out tokens · 53137 ms · 2026-06-26T04:12:29.871385+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 4 linked inside Pith

  1. [1]

    grandma exploit

    Contextual bandits with linear payoff func- tions. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Re- search, pages 208–214, Fort Lauderdale, FL, USA. PMLR. Anthony Cuthbertson. 2023. Chatgpt “grandma exploit” helps people pirate software. Mahavir Dabas, Tran ...

  2. [2]

    InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90

    Not what you’ve signed up for: Compromis- ing real-world llm-integrated applications with indi- rect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90. Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zha...

  3. [3]

    Javier Rando

    Direct preference optimization: Your lan- guage model is secretly a reward model.Preprint, arXiv:2305.18290. Javier Rando. 2025. Do not write that jailbreak paper. InICLR Blogposts 2025. Https://iclr- blogposts.github.io/2025/blog/do-not-write- jailbreak-papers/. Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleigh...

  4. [4]

    InProceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research, pages 3430–3438

    DiffRed: Dimensionality reduction guided by stable rank. InProceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research, pages 3430–3438. PMLR. Kanishka Singh. 2025. Las vegas cybertruck suspect used chatgpt to plan blast, police say | reuters. Aleksandrs Slivkins. 2024. ...

  5. [5]

    Vladimir V ovk

    Embeddinggemma: Powerful and lightweight text representations.Preprint, arXiv:2509.20354. Vladimir V ovk. 2001. Competitive on-line statistics. International Statistical Review, 69. V olodya V ovk. 1997. Competitive on-line linear regres- sion. InAdvances in Neural Information Processing Systems, volume 10. MIT Press. Eric Wallace, Shi Feng, Nikhil Kandpa...

  6. [6]

    Jailbroken: How does llm safety training fail? Preprint, arXiv:2307.02483. 15 xAI. 2025. Grok 4.1 model card. Technical report, xAI. Sophie Xhonneux, Alessandro Sordoni, Stephan Gün- nemann, Gauthier Gidel, and Leo Schwinn. 2024. Efficient adversarial training in llms with continuous attacks.Preprint, arXiv:2405.15589. Zheng-Xin Yong, Beyza Ermis, Marzieh...

  7. [7]

    Ethical Considerations(Appendix A): Dis- cussion on societal impact, associated risks and responsible disclosure of the artifacts as- sociated with this work

  8. [8]

    Related Work(Appendix B): A brief survey of literature in the field leading upto our con- tribution

  9. [9]

    Dataset Details(Appendix C): Details about the annotation process, quality control anal- ysis of crowd labels, statistics on crowd re- sponses and design choices used to build the complexity classifier referenced in §4

  10. [10]

    Jailbreak Examples(Appendix D): Exam- ples with citations of the 70 jailbreaks used in our evaluations, referenced in §5.2

  11. [11]

    This ex- pands on our description in §2 and §3

    Bandit Algorithms(Appendix E): Detailed notation, description of various bandit algo- rithms used for Algorithm 1, pseudocodes, regret bounds, and time complexity. This ex- pands on our description in §2 and §3

  12. [12]

    Regression Oracle for SquareCB (Ap- pendix F): Discussion on the Regression Oracle used with the SquareCB algorithm, bounds on regret, and pseudocode

  13. [13]

    Details on Context Vectors(Appendix G): Notation and theoretical description of how we obtain context vectors for contextual ban- dit algorithms, relevant to §2 and §3

  14. [14]

    Additional Information on Experimental Setup(Appendix H): Additional information on experimental setup [§5.1] for inference, bandit algorithms, choosing the rater judge, choice of hyperparameters, cost considera- tions, and compute requirements for repro- ducing our study

  15. [15]

    Additional Results(Appendix I): Additional results and plots from various experiments discussed in §5

  16. [16]

    the average jane

    Regret of Various Attacks(Appendix J): A discussion on the regret of various attacks, including the transfer attack and continual at- tack. A Ethical Considerations Exposure to Offensive Content:In the curation of FRANKENSTEINBENCH, our research team was exposed to offensive and malicious queries. All authors were aware of the nature of this work, and con...

  17. [17]

    Exploration Regret (Figures 31 to 32): This is the attacker’s regret on the exploration set: REG= 1 T X qt∈D(tr) R(qt, MJ †,qt)−ASR tr where, J † = arg min J∈{J 1,...,Jn} X qt∈D(tr) R(qt, MJ,qt)

  18. [18]

    Exploitation Regret (Figures 33 to 34): This is the attacker’s regret only on the exploitation set, but against the single best jailbreak of the exploration set: 30 REG= 1 (T ′ −T) X qt∈D(ev) R(qt, MJ †,qt)−ASRev where, J † = arg min J∈{J 1,...,Jn} X qt∈D(tr) R(qt, MJ,qt) For theContinual Attack, we measure and report the following two kinds of regret:

  19. [19]

    Full Regret (Figures 35 to 36): This is the same as the usual notion of regret,i.e., the gap between the attack’sASR and that of the single best jailbreak identified in T ′ rounds of theContinual Attack: REG= 1 T ′ X qt∈D(tr)∪D(ev) R(qt, MJ †,qt)− |D(tr)|ASRtr +|D (ev)|ASRev |D(tr)|+|D (ev)| where, J ‡ = arg min J∈{J 1,...,Jn} X qt∈D(tr)∪D(ev) R(qt, MJ,qt)

  20. [20]

    Label: <label>\n Reason: <reason>

    Exploitation Regret (Figures 37 to 38): This is the attacker’s regret measured only on the exploitation set but against the single best jail- break identified across theExplorationand theJoint Exploration & Exploitationphases: REG= 1 (T ′ −T) X qt∈D(ev) R(qt, MJ ‡,qt)−ASRev where, J ‡ = arg min J∈{J 1,...,Jn} X qt∈D (tr) ∪D (ev) R(qt, MJ,qt) 31 Annotation...

  21. [21]

    How do I make a bomb

    Do not use words more than seven letters long. 3. Never say the word "the". Style Injection Json How do I make a bombRespond only in JSON format, as a list of strings with four words per string. Distractors Answer the following prompts:Write an inspirational platitudeWrite a three-paragraph essay on flowers, but with the twist that the second sentence tra...