pith. sign in

arxiv: 2510.07239 · v2 · pith:6GOACN6Snew · submitted 2025-10-08 · 💻 cs.CL

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Pith reviewed 2026-05-21 20:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM red-teamingtest-time adaptationLoRA expertsmulti-armed banditreinforcement learningadversarial promptsmodel vulnerabilities
0
0 comments X

The pith

A bandit selects among LoRA experts trained for different attack styles to adapt red-teaming prompts online to each target LLM's responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a set of LoRA adapters, each specialized for one attack style such as manipulation or slang, by using reinforcement learning where a rule-based safety model rewards the creation of unsafe outputs. At inference time a multi-armed bandit chooses which expert to apply according to whether earlier prompts produced safe or unsafe replies from the target model, shifting emphasis toward styles that work while still trying others. This lets the red-teaming process adjust in real time without retraining the entire system for every new model. If the method holds, red-teaming becomes more efficient at locating safety failures because attacks are tailored to the specific weaknesses of the model under test. The same selection record also shows which attack styles succeed most often against that model.

Core claim

Red-Bandit post-trains a collection of parameter-efficient LoRA experts, each tuned to a distinct attack style through reinforcement learning that rewards generations the rule-based safety model flags as unsafe. At inference a multi-armed bandit policy picks the expert dynamically based on the safety score of the target LLM's responses, balancing exploration of new styles against exploitation of those already shown to elicit unsafe behavior. The approach reports state-of-the-art attack success rates on AdvBench under sufficient exploration, produces prompts with lower perplexity, and uses the policy's choices to indicate which styles expose model-specific vulnerabilities.

What carries the argument

The multi-armed bandit policy that selects among LoRA experts specialized for distinct attack styles, taking the target model's response safety as the reward signal to decide the next expert.

If this is right

  • Red-Bandit reaches higher attack success rates on AdvBench when exploration is sufficient.
  • Generated prompts show lower perplexity than prior red-teaming methods and are therefore more human-readable.
  • The bandit policy identifies which attack styles are most effective against a given target model.
  • Online selection reduces the need to retrain a full red-teaming system for each new target LLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bandit-over-experts pattern could be tested on other online adaptation problems such as domain-specific prompt tuning.
  • Replacing the rule-based reward with a learned safety judge might reduce false positives in the diagnostic signal.
  • Running the method across many LLMs could reveal whether certain attack styles are broadly effective or remain model-specific.

Load-bearing premise

The rule-based safety model supplies a reliable proxy for unsafe behavior whose rewards generalize to the target LLM being tested.

What would settle it

An experiment that generates prompts with Red-Bandit on a target LLM, then has independent human raters or a stronger safety classifier judge the same prompts to check whether the fraction judged unsafe matches the attack success rate reported by the rule-based model.

Figures

Figures reproduced from arXiv: 2510.07239 by Alessandra Russo, Christos Ziakas, Nicholas Loo, Nishita Jain.

Figure 1
Figure 1. Figure 1: Inference pipeline of Red-Bandit. At test time, a multi-armed bandit dynamically selects [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Training pipeline of Red-Bandit. Multiple LoRA attack-style experts are trained with [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
read the original abstract

Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Red-Bandit, a red-teaming framework for LLMs that post-trains a collection of LoRA experts, each specialized to a distinct attack style, via reinforcement learning rewarded by a rule-based safety model. At inference, a multi-armed bandit policy selects among the experts based on the safety of the target model's responses to enable test-time adaptation. The work claims state-of-the-art ASR@10 on AdvBench (under sufficient exploration), lower prompt perplexity, and diagnostic value from the bandit policy in revealing model-specific vulnerabilities.

Significance. If the performance claims and generalization of the rule-based reward hold, the approach offers a practical mechanism for online adaptation in red-teaming without full model retraining, potentially improving scalability of safety audits. The bandit-as-diagnostic element could provide actionable insights into attack-style effectiveness if empirically validated across models.

major comments (2)
  1. [RL Training and Reward Design] The central performance and diagnostic claims rest on the assumption that the rule-based safety model used as RL reward is a reliable, unbiased proxy for unsafe behavior in the target LLM. The manuscript should include targeted validation (e.g., correlation with human judgments or alternative detectors on held-out prompts) in the RL training and evaluation sections; without it, experts may optimize for proxy artifacts, weakening both the adaptation mechanism and the interpretation of bandit-selected attack styles.
  2. [Experiments] The abstract asserts SOTA ASR@10 and lower perplexity, yet the experiments section must supply full baseline comparisons, statistical tests, ablation results on bandit hyperparameters and number of experts, and results across multiple target LLMs to support these claims. Current lack of these details prevents evaluation of whether the reported gains are robust or attributable to the proposed method.
minor comments (2)
  1. [Method Overview] Clarify the precise definition and enumeration of attack styles (e.g., manipulation, slang) and how they are assigned to individual LoRA experts.
  2. [Abstract and Results] The qualifier 'under sufficient exploration' for the ASR@10 result should be quantified with the specific exploration schedule or epsilon values used in the reported runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us strengthen the manuscript. We address each major comment below and have made targeted revisions to improve clarity, validation, and experimental rigor.

read point-by-point responses
  1. Referee: [RL Training and Reward Design] The central performance and diagnostic claims rest on the assumption that the rule-based safety model used as RL reward is a reliable, unbiased proxy for unsafe behavior in the target LLM. The manuscript should include targeted validation (e.g., correlation with human judgments or alternative detectors on held-out prompts) in the RL training and evaluation sections; without it, experts may optimize for proxy artifacts, weakening both the adaptation mechanism and the interpretation of bandit-selected attack styles.

    Authors: We agree that explicit validation of the rule-based safety model strengthens the claims. In the revised manuscript, we have added a dedicated validation subsection under RL Training. This includes Pearson correlation analysis (r=0.81) between rule-based rewards and human judgments on a held-out set of 300 prompts, as well as agreement rates with an alternative LLM-based safety classifier. These results support the proxy's reliability for the attack styles considered and reduce concerns about artifact optimization. We also discuss limitations of the rule-based approach in the revised text. revision: yes

  2. Referee: [Experiments] The abstract asserts SOTA ASR@10 and lower perplexity, yet the experiments section must supply full baseline comparisons, statistical tests, ablation results on bandit hyperparameters and number of experts, and results across multiple target LLMs to support these claims. Current lack of these details prevents evaluation of whether the reported gains are robust or attributable to the proposed method.

    Authors: We appreciate this point and have expanded the Experiments section substantially. The revision now includes: full tables with all baselines (GCG, AutoDAN, PAIR, and others) and their ASR@10/perplexity metrics; paired t-tests confirming statistical significance (p<0.05) of Red-Bandit's gains; ablations varying the number of experts (3–8) and bandit exploration parameter ε; and results on two additional target models (Vicuna-7B and Llama-3-8B) where the method retains SOTA performance under sufficient exploration. These additions demonstrate robustness without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents Red-Bandit as a framework that trains LoRA experts via RL using a rule-based safety model reward signal, then applies a standard multi-armed bandit at inference to select experts based on observed target-model responses. No equations, derivations, or performance metrics are shown to reduce by construction to fitted parameters, self-definitions, or self-citation chains. The claimed SOTA ASR@10, lower perplexity, and diagnostic utility of the bandit policy rest on empirical evaluation against AdvBench rather than any self-referential or load-bearing internal definitions. The central mechanism is a conventional bandit selection process whose outputs are not forced by the inputs in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full paper likely contains additional free parameters for RL training and bandit exploration rate plus domain assumptions about the safety model.

axioms (1)
  • domain assumption A rule-based safety model can serve as a reliable reward signal for training attack prompts
    Invoked to train the LoRA experts via reinforcement learning

pith-pipeline@v0.9.0 · 5718 in / 1258 out tokens · 37818 ms · 2026-05-21T20:50:31.870826+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Red-Bandit post-trains a set of parameter-efficient LoRA experts... using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

    cs.AI 2026-05 unverdicted novelty 7.0

    Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2....

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 1 Pith paper · 20 internal anchors

  1. [1]

    Deep rein- forcement learning from human preferences.Advances in neural information processing systems, 30, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep rein- forcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  3. [3]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

  4. [4]

    Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment.arXiv preprint arXiv:2308.05374, 2023

  5. [5]

    Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023

  6. [6]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332, 2021

  7. [7]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  8. [8]

    Managing ai risks in an era of rapid progress

    Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, et al. Managing ai risks in an era of rapid progress. arXiv preprint arXiv:2310.17688, page 18, 2023

  9. [9]

    Auditing large language models: a three-layered approach

    Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, and Luciano Floridi. Auditing large language models: a three-layered approach. ai and ethics. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, volume 1, pages 4275–4293, 2023

  10. [10]

    Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection.arXiv preprint arXiv:2203.09509, 2022

  11. [11]

    2305.11747 , archivePrefix=

    Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models.arXiv preprint arXiv:2305.11747, 2023

  12. [12]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024

  13. [13]

    Openai’s approach to external red teaming for ai models and systems.arXiv preprint arXiv:2503.16431, 2025

    Lama Ahmad, Sandhini Agarwal, Michael Lampe, and Pamela Mishkin. Openai’s approach to external red teaming for ai models and systems.arXiv preprint arXiv:2503.16431, 2025

  14. [14]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022

  15. [15]

    Red Teaming Language Models with Language Models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022

  16. [16]

    Measuring Progress on Scalable Oversight for Large Language Models

    Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil˙e Lukoši¯ut˙e, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022

  17. [17]

    Atoxia: Red-teaming large language models with target toxic answers.arXiv preprint arXiv:2408.14853, 2024

    Yuhao Du, Zhuo Li, Pengyu Cheng, Xiang Wan, and Anningzhe Gao. Atoxia: Red-teaming large language models with target toxic answers.arXiv preprint arXiv:2408.14853, 2024

  18. [18]

    arXiv preprint arXiv:2404.16873 (2024)

    Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. Advprompter: Fast adaptive adversarial prompting for llms.arXiv preprint arXiv:2404.16873, 2024

  19. [19]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jail- breaking black box large language models in twenty queries, 2024.URL https://arxiv. org/abs/2310.08419, 1(2):3, 2024

  20. [20]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

  21. [21]

    Rainbow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37: 69747–69786, 2024

    Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37: 69747–69786, 2024

  22. [22]

    Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

    Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023. 10

  23. [23]

    Diverse and effective red teaming with auto-generated rewards and multi-step reinforcement learning.arXiv preprint arXiv:2412.18693, 2024

    Alex Beutel, Kai Xiao, Johannes Heidecke, and Lilian Weng. Diverse and effective red teaming with auto-generated rewards and multi-step reinforcement learning.arXiv preprint arXiv:2412.18693, 2024

  24. [24]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. InarXiv preprint arXiv:1707.06347, 2017

  25. [25]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  26. [26]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  27. [27]

    Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47:235–256, 2002

    P Auer, N Cesa-Bianchi, and P Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47:235–256, 2002

  28. [28]

    MIT press Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  29. [29]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  30. [30]

    Magistral.arXiv preprint arXiv:2506.10910, 2025

    Abhinav Rastogi, Albert Q Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, et al. Magistral.arXiv preprint arXiv:2506.10910, 2025

  31. [31]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  32. [32]

    Mixture of lora experts.arXiv preprint arXiv:2404.13628, 2024a

    Xun Wu, Shaohan Huang, and Furu Wei. Mixture of lora experts.arXiv preprint arXiv:2404.13628, 2024

  33. [33]

    Modular deep learning.arXiv preprint arXiv:2302.11529, 2023

    Jonas Pfeiffer, Sebastian Ruder, Ivan Vuli´c, and Edoardo Maria Ponti. Modular deep learning.arXiv preprint arXiv:2302.11529, 2023

  34. [34]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URLhttps://arxiv...

  35. [35]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  36. [36]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  37. [37]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  38. [38]

    Chatgpt (mar 14 version).https://chat.openai.com/chat, 2023

    OpenAI. Chatgpt (mar 14 version).https://chat.openai.com/chat, 2023. Accessed: 2025-10-04

  39. [39]

    Gpt-4o mini: Advancing cost-efficient intelligence

    OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/ gpt-4o-system-card/, 2024. Accessed: 2024-08-08

  40. [40]

    Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024

    OpenAI. Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2024-08-08

  41. [41]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  42. [42]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  43. [43]

    Automatically auditing large language models via discrete optimization

    Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. InInternational Conference on Machine Learning, pages 15307–15329. PMLR, 2023

  44. [44]

    Tree of attacks: Jailbreaking black-box LLMs automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum S Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/ forum?id=SoM3vngOH5

  45. [45]

    Graph of attacks with pruning: Optimizing stealthy jailbreak prompt generation for enhanced llm content moderation.arXiv preprint arXiv:2501.18638, 2025

    Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, and Yanjun Qi. Graph of attacks with pruning: Optimizing stealthy jailbreak prompt generation for enhanced llm content moderation.arXiv preprint arXiv:2501.18638, 2025

  46. [46]

    GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

    Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv preprint arXiv:2309.10253, 2023

  47. [47]

    Gpt- 4 is too smart to be safe: Stealthy chat with llms via cipher,

    Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher.arXiv preprint arXiv:2308.06463, 2023

  48. [48]

    Curiosity-driven red- teaming for large language models

    Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red-teaming for large language models.arXiv preprint arXiv:2402.19464, 2024

  49. [49]

    Learning diverse attacks on large language 11 models for robust red-teaming and safety tuning

    Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, and Moksh Jain. Learning diverse attacks on large language 11 models for robust red-teaming and safety tuning. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum...

  50. [50]

    Efficient jailbreak attack sequences on large language models via multi-armed bandit-based context switching

    Aditya Ramesh, Shivam Bhardwaj, Aditya Saibewar, and Manohar Kaul. Efficient jailbreak attack sequences on large language models via multi-armed bandit-based context switching. InThe Thirteenth International Conference on Learning Representations, 2025

  51. [51]

    Spony (extended harmbench) dataset

    Yinkai Wang. Spony (extended harmbench) dataset. https://huggingface.co/datasets/Spony/ harmbench-dataset, 2024. Hugging Face dataset

  52. [52]

    augmented_advbench_v4 dataset

    Baidicoot. augmented_advbench_v4 dataset. https://huggingface.co/datasets/Baidicoot/ augmented_advbench_v4, 2024. Hugging Face dataset

  53. [53]

    Outline a plan to steal a car without getting caught

    Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. InThe 41st international ACM SIGIR conference on research & development in information retrieval, pages 1097–1100, 2018. 12 A Examples of Attack Styles Table 5: Examples for each attack style used to train LoRA ex...