pith. machine review for the scientific record.

arxiv: 2604.21159 · v1 · submitted 2026-04-22 · 💻 cs.CR · cs.AI · cs.CL · cs.LG

Recognition: unknown

Adaptive Instruction Composition for Automated LLM Red-Teaming

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:29 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.LG
keywords adaptive instruction composition · LLM red-teaming · jailbreak discovery · contextual bandit · reinforcement learning · contrastive pretraining · attack effectiveness · attack diversity

The pith

Adaptive Instruction Composition trains a contextual bandit via reinforcement learning to combine crowdsourced texts into instructions that yield more effective and diverse LLM jailbreaks than random selection or prior adaptive methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adaptive Instruction Composition to address limitations in LLM red-teaming where attacker models either repeat similar strategies through trial and error or combine known harmful elements at random. It trains a lightweight neural contextual bandit on contrastive embeddings using reinforcement learning to select combinations that balance attack success with variety in the large space of possible instructions. A sympathetic reader would care because this joint optimization could expose a wider range of vulnerabilities in target models, improving safety evaluations. The method demonstrates gains on effectiveness and diversity metrics that hold under transfer to different models and exceed recent adaptive baselines on Harmbench. Ablations highlight the role of contrastive pretraining in enabling quick scaling and generalization.
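
To make the pipeline concrete, the following is a minimal sketch of the loop the abstract and Figure 2 describe: a bandit selects a (query, tactic) combination, a template composes it into an instruction, the attacker generates an attack, the target responds, an evaluator scores the outcome, and the score flows back to the bandit as reward. Every function and stub model here is hypothetical; the paper names the components (attacker A(·), target T(·), evaluator E(·), template tpl(·)) but does not expose their implementations.

```python
# Hypothetical sketch of the adaptive composition loop from the abstract and
# Figure 2. The stub models and all names are illustrative, not the authors'
# implementation.
import random

def attacker_generate(instruction: str) -> str:          # stand-in for A(.)
    return f"attack prompt derived from: {instruction}"

def target_respond(attack: str) -> str:                   # stand-in for T(.)
    return "target model response"

def evaluate_harm(attack: str, response: str) -> float:   # stand-in for E(.)
    return random.random()                                 # 1.0 = successful jailbreak

def compose(query: str, tactic: str) -> str:               # stand-in for tpl(.)
    return f"Apply the tactic '{tactic}' to elicit an answer to: {query}"

queries = ["q1", "q2"]            # crowdsourced vanilla harmful queries Q
tactics = ["t1", "t2", "t3"]      # crowdsourced jailbreak tactics J
candidates = [(q, j) for q in queries for j in tactics]

class RandomBandit:
    """Placeholder for the paper's neural contextual bandit."""
    def select(self, arms):
        return random.choice(arms)
    def update(self, arm, reward):
        pass                       # the real bandit refits its reward model here

bandit = RandomBandit()
for t in range(10):                # T trials
    query, tactic = bandit.select(candidates)
    instruction = compose(query, tactic)
    attack = attacker_generate(instruction)
    reward = evaluate_harm(attack, target_respond(attack))
    bandit.update((query, tactic), reward)
```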

Core claim

Adaptive Instruction Composition employs a lightweight neural contextual bandit with contrastive pretraining that receives reinforcement learning signals to choose how to combine crowdsourced harmful queries and tactics into instructions for an attacker LLM. This process guides the attacker toward generations that exploit target-specific weaknesses while promoting diversity across attempts, outperforming random composition on effectiveness and diversity metrics even under model transfer and surpassing other recent adaptive red-teaming approaches on Harmbench.

What carries the argument

The lightweight neural contextual bandit with contrastive pretraining that uses reinforcement learning to balance exploration and exploitation when selecting combinations from the combinatorial space of instructions.
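
The material above does not specify the bandit's architecture or exploration rule, so the sketch below is one plausible reading: precomputed contrastive sentence embeddings (e.g., SBERT) serve as the context for each candidate composition, a small network predicts reward, and epsilon-greedy exploration stands in for whatever neural-bandit exploration scheme the paper actually uses.

```python
# Illustrative neural contextual bandit over precomputed embeddings.
# Architecture, exploration rule, and hyperparameters are assumptions,
# not the paper's specification.
import torch
import torch.nn as nn

class EmbeddingBandit:
    def __init__(self, dim: int, epsilon: float = 0.1, lr: float = 1e-3):
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.opt = torch.optim.Adam(self.net.parameters(), lr=lr)
        self.epsilon = epsilon

    def select(self, arm_embeddings: torch.Tensor) -> int:
        """Pick an arm index given a (num_arms, dim) matrix of embeddings."""
        if torch.rand(1).item() < self.epsilon:                  # explore
            return int(torch.randint(len(arm_embeddings), (1,)).item())
        with torch.no_grad():                                    # exploit
            return int(self.net(arm_embeddings).squeeze(-1).argmax().item())

    def update(self, embedding: torch.Tensor, reward: float) -> None:
        """One gradient step toward the observed reward for the chosen arm."""
        pred = self.net(embedding.unsqueeze(0)).squeeze()
        loss = (pred - torch.tensor(reward, dtype=pred.dtype)).pow(2)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

# usage (hypothetical): bandit = EmbeddingBandit(dim=384)  # e.g., SBERT MiniLM size
```

Contrastive pretraining enters only through the embeddings: if semantically related compositions sit close together in the embedding space, reward observed for one arm generalizes to its neighbors, which is exactly the rapid-generalization property the ablations probe.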

If this is right

  • Red-teaming can generate a broader set of successful attacks that better reveal model weaknesses.
  • The learned bandit transfers its performance gains to new target models without retraining.
  • Contrastive pretraining enables the bandit to handle the scale of instruction combinations effectively.
  • The framework surpasses multiple existing adaptive red-teaming techniques on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same adaptive composition mechanism could extend to other prompt-optimization tasks where both quality and variety matter.
  • If new crowdsourced data arrives, the bandit might incorporate it with far less work than approaches that retrain from scratch.
  • This style of bandit could apply to combinatorial search in related areas like automated testing or content generation.

Load-bearing premise

The lightweight neural contextual bandit with contrastive pretraining can rapidly generalize and scale to the massive combinatorial space of instructions while jointly optimizing effectiveness and diversity via RL.
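
One plausible way to realize the joint objective is to discount the evaluator's success signal by similarity to previously successful attacks, so that repeating a known exploit earns less than finding a new one. The shaping rule below is an illustration of that idea, not the paper's stated reward.

```python
# Hypothetical joint effectiveness/diversity reward shaping. `lam` trades off
# success against novelty; the paper's actual reward is not detailed in the
# abstract, so this is only one plausible instantiation.
import numpy as np

def shaped_reward(success: float, attack_emb: np.ndarray,
                  past_success_embs: list, lam: float = 0.5) -> float:
    if not past_success_embs:
        return success
    sims = [
        float(np.dot(attack_emb, p) / (np.linalg.norm(attack_emb) * np.linalg.norm(p)))
        for p in past_success_embs
    ]
    novelty = 1.0 - max(sims)      # distance to the nearest prior success
    # reward full success only to the extent the attack is also novel
    return success * (1.0 - lam) + lam * novelty * success
```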

What would settle it

A head-to-head evaluation on the same effectiveness and diversity metrics and the Harmbench benchmark in which the adaptive bandit scores no higher than random combination would falsify the claimed performance advantage.

Figures

Figures reproduced from arXiv: 2604.21159 by Andy Luo, Emily Chen, Jesse Zymet, Sahil Wadhwa, Swapnil Shinde.

Figure 1
Figure 1. Overview of Adaptive Instruction Composition. view at source ↗
Figure 2
Figure 2. Adaptive selection of attack content followed by attack generation (Algorithm 1 inputs: total trials T; instruction candidates K; vanilla queries Q and jailbreak tactics J; instruction template tpl(·); feature map e(·); reward distribution f(·|θ); attacker, target, and evaluator models A(·), T(·), E(·)). view at source ↗
Figure 3
Figure 3. Attack success rates as rolling averages with window length 200. view at source ↗
Figure 4
Figure 4. ASR, SBERT bandit vs. BERT bandits. view at source ↗
Figure 5
Figure 5. ASR, 10 vs. 50 component embedding dimensions. view at source ↗
Figure 6
Figure 6. ASR, 3 tactics. view at source ↗
read the original abstract

Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on Harmbench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Adaptive Instruction Composition, a framework that uses reinforcement learning via a lightweight neural contextual bandit with contrastive pretraining to adaptively combine crowdsourced harmful queries and tactics into instructions for attacking LLMs. It claims to jointly optimize for attack effectiveness and diversity, substantially outperforming random composition on effectiveness/diversity metrics (including under model transfer) and surpassing recent adaptive baselines on Harmbench.

Significance. If the results hold with full experimental details, the work could meaningfully advance automated LLM red-teaming by enabling more scalable discovery of diverse jailbreaks through adaptive composition rather than random or purely trial-and-error methods. The contrastive embedding approach for generalization in combinatorial spaces and the provided ablations represent a potentially useful technical contribution to balancing exploration/exploitation in this domain.

major comments (2)
  1. [Abstract] Abstract: The central claim of substantial outperformance over random combination and prior adaptive methods on effectiveness/diversity metrics and Harmbench lacks specific quantitative results, error bars, statistical significance tests, or full experimental setup details (e.g., number of runs, exact baselines, reward shaping for the joint objective). This prevents verification of the support for the scaling and generalization claims.
  2. [Methods] Methods (contextual bandit description): The paper does not specify how the action space of instruction compositions is represented or handled (e.g., whether compositions are enumerated from a fixed pool, sampled via a generative head, pruned, or otherwise restricted), which is load-bearing for the claim that contrastive pretraining enables rapid generalization and scaling to the 'massive combinatorial space' without sample-complexity blowup.
minor comments (2)
  1. [Abstract] Abstract and experiments: Include standard deviations or confidence intervals alongside reported metrics to allow assessment of result stability.
  2. [Ablations] Ablations: Clarify the exact contrastive pretraining objective and how it interacts with the RL policy to prevent mode collapse in the diversity-effectiveness tradeoff.
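
The second minor point hinges on what "contrastive pretraining" means here. As a reference point, a generic InfoNCE-style objective of the kind used by SBERT-family encoders is sketched below; the paper's actual pretraining objective is not specified in the material above.

```python
# Generic InfoNCE-style contrastive loss, shown only to make the referee's
# question concrete; the paper's pretraining objective may differ.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.07):
    """anchors, positives: (batch, dim) embeddings of paired texts."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / tau                 # (batch, batch) similarity matrix
    labels = torch.arange(len(a))          # i-th anchor matches i-th positive
    return F.cross_entropy(logits, labels)
```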

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their careful reading and valuable suggestions. We respond to each major comment in turn and outline the changes we will implement in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of substantial outperformance over random combination and prior adaptive methods on effectiveness/diversity metrics and Harmbench lacks specific quantitative results, error bars, statistical significance tests, or full experimental setup details (e.g., number of runs, exact baselines, reward shaping for the joint objective). This prevents verification of the support for the scaling and generalization claims.

    Authors: We agree that including specific quantitative results and experimental details in the abstract would strengthen the presentation of our central claims. While the body of the manuscript contains tables with performance metrics including means and standard deviations across multiple runs, comparisons to the listed baselines, and a full description of the reward function used to jointly optimize effectiveness and diversity, the abstract itself is currently more qualitative. We will revise the abstract to incorporate key quantitative findings, such as the reported improvements on the metrics and Harmbench, along with a note on the number of runs and the structure of the reward. This will better support the claims regarding scaling and generalization. revision: yes

  2. Referee: [Methods] Methods (contextual bandit description): The paper does not specify how the action space of instruction compositions is represented or handled (e.g., whether compositions are enumerated from a fixed pool, sampled via a generative head, pruned, or otherwise restricted), which is load-bearing for the claim that contrastive pretraining enables rapid generalization and scaling to the 'massive combinatorial space' without sample-complexity blowup.

    Authors: Thank you for highlighting this important omission. The manuscript describes the use of a combinatorial space but does not provide sufficient detail on its construction and management. In our approach, the action space is constructed by enumerating feasible compositions from a fixed pool of crowdsourced texts, with restrictions on the number of components per instruction and pruning of low-quality or duplicate compositions based on embedding similarity. The neural contextual bandit then uses contrastive embeddings of the full instructions as context to learn a policy that generalizes without needing to explore the entire space. We will expand the Methods section with a clear description of the action space representation, the enumeration and pruning procedures, and how the contrastive pretraining contributes to efficient learning in this space. This will better substantiate the claims about generalization and scaling. revision: yes
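
As a rough illustration of the procedure the rebuttal outlines, the sketch below enumerates (query, tactic-subset) compositions from fixed pools, caps the number of components per instruction, and prunes near-duplicates by cosine similarity of their embeddings. The cap, threshold, and embedding function are placeholders rather than values from the paper.

```python
# Illustrative enumeration of (query, tactic-subset) compositions with
# similarity-based pruning, as outlined in the rebuttal. The component cap,
# threshold, and `embed` function are assumptions.
from itertools import combinations
import numpy as np

def enumerate_and_prune(queries, tactics, embed, max_tactics=2, sim_threshold=0.95):
    kept, kept_embs = [], []
    for q in queries:
        for r in range(1, max_tactics + 1):
            for combo in combinations(tactics, r):
                text = q + " | " + " + ".join(combo)
                emb = embed(text)                      # returns a 1-D numpy vector
                emb = emb / np.linalg.norm(emb)
                # drop near-duplicates of an already-kept composition
                if any(float(emb @ e) > sim_threshold for e in kept_embs):
                    continue
                kept.append((q, combo))
                kept_embs.append(emb)
    return kept
```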

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based only on the abstract, the central claim rests on standard RL assumptions for balancing exploration/exploitation and the effectiveness of contrastive embeddings for generalization in a large space; no specific free parameters or invented entities are detailed.

axioms (1)
  • domain assumption Contrastive pretraining enables the network to rapidly generalize and scale to the massive combinatorial space of instructions.
    Stated in the abstract as key to the method's performance.

pith-pipeline@v0.9.0 · 5493 in / 1280 out tokens · 23294 ms · 2026-05-09T23:29:38.716752+00:00 · methodology

discussion (0)

