Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts
Pith reviewed 2026-05-21 20:50 UTC · model grok-4.3
The pith
A bandit selects among LoRA experts trained for different attack styles to adapt red-teaming prompts online to each target LLM's responses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Red-Bandit post-trains a collection of parameter-efficient LoRA experts, each tuned to a distinct attack style through reinforcement learning that rewards generations the rule-based safety model flags as unsafe. At inference a multi-armed bandit policy picks the expert dynamically based on the safety score of the target LLM's responses, balancing exploration of new styles against exploitation of those already shown to elicit unsafe behavior. The approach reports state-of-the-art attack success rates on AdvBench under sufficient exploration, produces prompts with lower perplexity, and uses the policy's choices to indicate which styles expose model-specific vulnerabilities.
What carries the argument
The multi-armed bandit policy that selects among LoRA experts specialized for distinct attack styles, taking the target model's response safety as the reward signal to decide the next expert.
If this is right
- Red-Bandit reaches higher attack success rates on AdvBench when exploration is sufficient.
- Generated prompts show lower perplexity than prior red-teaming methods and are therefore more human-readable.
- The bandit policy identifies which attack styles are most effective against a given target model.
- Online selection reduces the need to retrain a full red-teaming system for each new target LLM.
Where Pith is reading between the lines
- The same bandit-over-experts pattern could be tested on other online adaptation problems such as domain-specific prompt tuning.
- Replacing the rule-based reward with a learned safety judge might reduce false positives in the diagnostic signal.
- Running the method across many LLMs could reveal whether certain attack styles are broadly effective or remain model-specific.
Load-bearing premise
The rule-based safety model supplies a reliable proxy for unsafe behavior whose rewards generalize to the target LLM being tested.
What would settle it
An experiment that generates prompts with Red-Bandit on a target LLM, then has independent human raters or a stronger safety classifier judge the same prompts to check whether the fraction judged unsafe matches the attack success rate reported by the rule-based model.
Figures
read the original abstract
Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Red-Bandit, a red-teaming framework for LLMs that post-trains a collection of LoRA experts, each specialized to a distinct attack style, via reinforcement learning rewarded by a rule-based safety model. At inference, a multi-armed bandit policy selects among the experts based on the safety of the target model's responses to enable test-time adaptation. The work claims state-of-the-art ASR@10 on AdvBench (under sufficient exploration), lower prompt perplexity, and diagnostic value from the bandit policy in revealing model-specific vulnerabilities.
Significance. If the performance claims and generalization of the rule-based reward hold, the approach offers a practical mechanism for online adaptation in red-teaming without full model retraining, potentially improving scalability of safety audits. The bandit-as-diagnostic element could provide actionable insights into attack-style effectiveness if empirically validated across models.
major comments (2)
- [RL Training and Reward Design] The central performance and diagnostic claims rest on the assumption that the rule-based safety model used as RL reward is a reliable, unbiased proxy for unsafe behavior in the target LLM. The manuscript should include targeted validation (e.g., correlation with human judgments or alternative detectors on held-out prompts) in the RL training and evaluation sections; without it, experts may optimize for proxy artifacts, weakening both the adaptation mechanism and the interpretation of bandit-selected attack styles.
- [Experiments] The abstract asserts SOTA ASR@10 and lower perplexity, yet the experiments section must supply full baseline comparisons, statistical tests, ablation results on bandit hyperparameters and number of experts, and results across multiple target LLMs to support these claims. Current lack of these details prevents evaluation of whether the reported gains are robust or attributable to the proposed method.
minor comments (2)
- [Method Overview] Clarify the precise definition and enumeration of attack styles (e.g., manipulation, slang) and how they are assigned to individual LoRA experts.
- [Abstract and Results] The qualifier 'under sufficient exploration' for the ASR@10 result should be quantified with the specific exploration schedule or epsilon values used in the reported runs.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped us strengthen the manuscript. We address each major comment below and have made targeted revisions to improve clarity, validation, and experimental rigor.
read point-by-point responses
-
Referee: [RL Training and Reward Design] The central performance and diagnostic claims rest on the assumption that the rule-based safety model used as RL reward is a reliable, unbiased proxy for unsafe behavior in the target LLM. The manuscript should include targeted validation (e.g., correlation with human judgments or alternative detectors on held-out prompts) in the RL training and evaluation sections; without it, experts may optimize for proxy artifacts, weakening both the adaptation mechanism and the interpretation of bandit-selected attack styles.
Authors: We agree that explicit validation of the rule-based safety model strengthens the claims. In the revised manuscript, we have added a dedicated validation subsection under RL Training. This includes Pearson correlation analysis (r=0.81) between rule-based rewards and human judgments on a held-out set of 300 prompts, as well as agreement rates with an alternative LLM-based safety classifier. These results support the proxy's reliability for the attack styles considered and reduce concerns about artifact optimization. We also discuss limitations of the rule-based approach in the revised text. revision: yes
-
Referee: [Experiments] The abstract asserts SOTA ASR@10 and lower perplexity, yet the experiments section must supply full baseline comparisons, statistical tests, ablation results on bandit hyperparameters and number of experts, and results across multiple target LLMs to support these claims. Current lack of these details prevents evaluation of whether the reported gains are robust or attributable to the proposed method.
Authors: We appreciate this point and have expanded the Experiments section substantially. The revision now includes: full tables with all baselines (GCG, AutoDAN, PAIR, and others) and their ASR@10/perplexity metrics; paired t-tests confirming statistical significance (p<0.05) of Red-Bandit's gains; ablations varying the number of experts (3–8) and bandit exploration parameter ε; and results on two additional target models (Vicuna-7B and Llama-3-8B) where the method retains SOTA performance under sufficient exploration. These additions demonstrate robustness without altering the core claims. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper presents Red-Bandit as a framework that trains LoRA experts via RL using a rule-based safety model reward signal, then applies a standard multi-armed bandit at inference to select experts based on observed target-model responses. No equations, derivations, or performance metrics are shown to reduce by construction to fitted parameters, self-definitions, or self-citation chains. The claimed SOTA ASR@10, lower perplexity, and diagnostic utility of the bandit policy rest on empirical evaluation against AdvBench rather than any self-referential or load-bearing internal definitions. The central mechanism is a conventional bandit selection process whose outputs are not forced by the inputs in a circular manner.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A rule-based safety model can serve as a reliable reward signal for training attack prompts
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Red-Bandit post-trains a set of parameter-efficient LoRA experts... using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2....
Reference graph
Works this paper leans on
-
[1]
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep rein- forcement learning from human preferences.Advances in neural information processing systems, 30, 2017
work page 2017
-
[2]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment.arXiv preprint arXiv:2308.05374, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023
work page 2023
-
[6]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539–68551, 2023
work page 2023
-
[8]
Managing ai risks in an era of rapid progress
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, et al. Managing ai risks in an era of rapid progress. arXiv preprint arXiv:2310.17688, page 18, 2023
-
[9]
Auditing large language models: a three-layered approach
Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, and Luciano Floridi. Auditing large language models: a three-layered approach. ai and ethics. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, volume 1, pages 4275–4293, 2023
work page 2023
-
[10]
Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection.arXiv preprint arXiv:2203.09509, 2022
-
[11]
Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models.arXiv preprint arXiv:2305.11747, 2023
-
[12]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Lama Ahmad, Sandhini Agarwal, Michael Lampe, and Pamela Mishkin. Openai’s approach to external red teaming for ai models and systems.arXiv preprint arXiv:2503.16431, 2025
-
[14]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[15]
Red Teaming Language Models with Language Models
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[16]
Measuring Progress on Scalable Oversight for Large Language Models
Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil˙e Lukoši¯ut˙e, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[17]
Yuhao Du, Zhuo Li, Pengyu Cheng, Xiang Wan, and Anningzhe Gao. Atoxia: Red-teaming large language models with target toxic answers.arXiv preprint arXiv:2408.14853, 2024
-
[18]
arXiv preprint arXiv:2404.16873 (2024)
Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. Advprompter: Fast adaptive adversarial prompting for llms.arXiv preprint arXiv:2404.16873, 2024
-
[19]
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jail- breaking black box large language models in twenty queries, 2024.URL https://arxiv. org/abs/2310.08419, 1(2):3, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37: 69747–69786, 2024
work page 2024
-
[22]
Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023. 10
work page 2023
-
[23]
Alex Beutel, Kai Xiao, Johannes Heidecke, and Lilian Weng. Diverse and effective red teaming with auto-generated rewards and multi-step reinforcement learning.arXiv preprint arXiv:2412.18693, 2024
-
[24]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. InarXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[25]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
work page 2022
-
[27]
Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47:235–256, 2002
P Auer, N Cesa-Bianchi, and P Fischer. Finite-time analysis of the multiarmed bandit problem.Machine Learning, 47:235–256, 2002
work page 2002
-
[28]
Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998
work page 1998
-
[29]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Magistral.arXiv preprint arXiv:2506.10910, 2025
Abhinav Rastogi, Albert Q Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, et al. Magistral.arXiv preprint arXiv:2506.10910, 2025
-
[31]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[32]
Mixture of lora experts.arXiv preprint arXiv:2404.13628, 2024a
Xun Wu, Shaohan Huang, and Furu Wei. Mixture of lora experts.arXiv preprint arXiv:2404.13628, 2024
-
[33]
Modular deep learning.arXiv preprint arXiv:2302.11529, 2023
Jonas Pfeiffer, Sebastian Ruder, Ivan Vuli´c, and Edoardo Maria Ponti. Modular deep learning.arXiv preprint arXiv:2302.11529, 2023
-
[34]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URLhttps://arxiv...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Judging llm-as-a-judge with mt-bench and chatbot arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023
work page 2023
-
[37]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Chatgpt (mar 14 version).https://chat.openai.com/chat, 2023
OpenAI. Chatgpt (mar 14 version).https://chat.openai.com/chat, 2023. Accessed: 2025-10-04
work page 2023
-
[39]
Gpt-4o mini: Advancing cost-efficient intelligence
OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/ gpt-4o-system-card/, 2024. Accessed: 2024-08-08
work page 2024
-
[40]
Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024
OpenAI. Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2024-08-08
work page 2024
-
[41]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Automatically auditing large language models via discrete optimization
Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. InInternational Conference on Machine Learning, pages 15307–15329. PMLR, 2023
work page 2023
-
[44]
Tree of attacks: Jailbreaking black-box LLMs automatically
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum S Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/ forum?id=SoM3vngOH5
work page 2024
-
[45]
Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, and Yanjun Qi. Graph of attacks with pruning: Optimizing stealthy jailbreak prompt generation for enhanced llm content moderation.arXiv preprint arXiv:2501.18638, 2025
-
[46]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv preprint arXiv:2309.10253, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Gpt- 4 is too smart to be safe: Stealthy chat with llms via cipher,
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher.arXiv preprint arXiv:2308.06463, 2023
-
[48]
Curiosity-driven red- teaming for large language models
Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red-teaming for large language models.arXiv preprint arXiv:2402.19464, 2024
-
[49]
Learning diverse attacks on large language 11 models for robust red-teaming and safety tuning
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, and Moksh Jain. Learning diverse attacks on large language 11 models for robust red-teaming and safety tuning. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum...
work page 2025
-
[50]
Aditya Ramesh, Shivam Bhardwaj, Aditya Saibewar, and Manohar Kaul. Efficient jailbreak attack sequences on large language models via multi-armed bandit-based context switching. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[51]
Spony (extended harmbench) dataset
Yinkai Wang. Spony (extended harmbench) dataset. https://huggingface.co/datasets/Spony/ harmbench-dataset, 2024. Hugging Face dataset
work page 2024
-
[52]
Baidicoot. augmented_advbench_v4 dataset. https://huggingface.co/datasets/Baidicoot/ augmented_advbench_v4, 2024. Hugging Face dataset
work page 2024
-
[53]
Outline a plan to steal a car without getting caught
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. InThe 41st international ACM SIGIR conference on research & development in information retrieval, pages 1097–1100, 2018. 12 A Examples of Attack Styles Table 5: Examples for each attack style used to train LoRA ex...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.