AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

Khanak Khandelwal (Indian Institute of Technology Jodhpur)

arxiv: 2606.24589 · v1 · pith:UZUXDYEInew · submitted 2026-06-23 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-25 23:47 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords LLM red-teaming · adversarial prompts · prompt mutation · multi-judge evaluation · cross-model transfer · automated safety testing

The pith

An automated pipeline mutates seed prompts and confirms failures with a three-judge panel, revealing that attacks transfer zero-shot from small to large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AdversaBench as a pipeline that applies five structured mutation operators to seed prompts in reasoning, instruction-following, and tool-use categories, queries a target model, and verifies failures with a three-judge panel plus meta-judge tiebreaker. Across 45 seeds, every one produced a confirmed failure. The work also demonstrates that prompts generated against an 8B model succeed directly against a 70B model. A sympathetic reader would care because the method offers a repeatable way to scale adversarial testing and points to vulnerabilities that are not tied to any single model size.

Core claim

AdversaBench mutates seed prompts with five structured operators, queries the target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. In experiments on 45 seeds across three categories, every seed produced a confirmed failure. Four findings follow: operator effectiveness varies sharply by category; instruction-following seeds require more attacker iterations on average; pairwise judge agreement reaches 80-87 percent, yet Cohen's kappa is near zero because of label skew; and adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B.

What carries the argument

The end-to-end pipeline of five structured mutation operators plus a three-judge confirmation system with meta-judge tiebreaker.
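
The confirmation step can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the `Judge` and `MetaJudge` signatures, the `Verdict` record, and the stub judges are all hypothetical. The only behavior taken from the paper is that a unanimous panel decides directly and a split panel is handed to a meta-judge.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signatures: a judge returns True when it considers the
# response a failure; the meta-judge also sees the panel's votes.
Judge = Callable[[str, str], bool]
MetaJudge = Callable[[str, str, List[bool]], bool]


@dataclass
class Verdict:
    is_failure: bool
    votes: List[bool]
    used_meta_judge: bool


def confirm_failure(prompt: str, response: str,
                    judges: List[Judge], meta_judge: MetaJudge) -> Verdict:
    votes = [judge(prompt, response) for judge in judges]
    if len(set(votes)) == 1:
        # Unanimous panel: accept the shared label directly.
        return Verdict(votes[0], votes, used_meta_judge=False)
    # Split panel: defer to the meta-judge as tiebreaker.
    return Verdict(meta_judge(prompt, response, votes), votes,
                   used_meta_judge=True)


# Stub judges for illustration only; real judges would call an LLM.
always_fail: Judge = lambda p, r: True
always_pass: Judge = lambda p, r: False
# Assumption: a majority vote stands in for the LLM meta-judge here.
majority_meta: MetaJudge = lambda p, r, votes: sum(votes) * 2 > len(votes)

unanimous = confirm_failure("q", "a", [always_fail] * 3, majority_meta)
split = confirm_failure("q", "a",
                        [always_fail, always_fail, always_pass],
                        majority_meta)
```

Recording `used_meta_judge` alongside the votes is a design choice made for this sketch; it keeps enough information to compute per-category disagreement rates later.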

If this is right

Mutations exploit general behavioral patterns rather than model-specific weaknesses.
Instruction-following seeds require more iterations to produce failures than reasoning or tool-use seeds.
Category-level disagreement rates among judges reveal more about evaluation reliability than overall agreement percentages.
Binary failure rates alone do not capture differences in how hard it is to surface failures across task types.
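
The last point can be made concrete with a small survival-curve computation. The sketch below uses hypothetical iteration counts, not the paper's data, and the function name `survival_curve` is an assumption for illustration. Both seed sets end with every seed broken, yet their curves differ.

```python
from typing import List


def survival_curve(iters_to_failure: List[int], max_iter: int) -> List[float]:
    """Fraction of seeds still unbroken after each iteration 0..max_iter."""
    n = len(iters_to_failure)
    return [sum(1 for t in iters_to_failure if t > k) / n
            for k in range(max_iter + 1)]


# Hypothetical iteration counts (iteration of first confirmed failure
# per seed), NOT the paper's data.
easy = [1, 1, 1, 2]
hard = [1, 2, 3, 5]

easy_curve = survival_curve(easy, max_iter=5)
hard_curve = survival_curve(hard, max_iter=5)
print(easy_curve)  # [1.0, 0.25, 0.0, 0.0, 0.0, 0.0]
print(hard_curve)  # [1.0, 0.75, 0.5, 0.25, 0.25, 0.0]
```

Both lists reach 0.0 survival, so a binary failure rate would report 100% for each; only the shape of the curve shows that the second set took more attacker effort.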

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the same mutations succeed across model scales, they may point to shared limitations in how current LLMs handle certain prompt structures.
The pipeline could be applied to additional model families to test whether transfer holds beyond the Llama series.
Reliable automated confirmation might allow red-teaming to be run continuously during model development rather than as a one-time check.
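
A zero-shot transfer check of the kind discussed above could look roughly like this. It is a sketch under stated assumptions: the `Model` and `Confirmer` interfaces, the function `transfer_rate`, and the toy stand-ins are hypothetical and do not come from the paper or its repository. The idea it illustrates is only that prompts found against one model are replayed unchanged against another.

```python
from typing import Callable, List

# Hypothetical interfaces: a model maps a prompt to a response, and a
# confirmer returns True when the response counts as a failure.
Model = Callable[[str], str]
Confirmer = Callable[[str, str], bool]


def transfer_rate(adversarial_prompts: List[str], target: Model,
                  is_failure: Confirmer) -> float:
    """Replay prompts unchanged against `target`; return the failure fraction."""
    if not adversarial_prompts:
        return 0.0
    failures = sum(is_failure(p, target(p)) for p in adversarial_prompts)
    return failures / len(adversarial_prompts)


# Toy stand-ins for illustration only; a real run would call the
# second model's API and the judge panel.
prompts = ["p1 [distractor]", "p2 [distractor]", "p3"]
toy_target: Model = lambda p: "wrong" if "[distractor]" in p else "right"
toy_confirmer: Confirmer = lambda p, r: r == "wrong"

rate = transfer_rate(prompts, toy_target, toy_confirmer)  # 2 of 3 prompts
```

Swapping `toy_target` for a different model family is the extension suggested above; no further mutation happens between the source and target runs, which is what makes the test zero-shot.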

Load-bearing premise

The three-judge panel with meta-judge tiebreaker accurately identifies genuine model failures without systematic bias or error from the judges themselves.

What would settle it

A case where the three-judge panel plus meta-judge labels a response as a failure but independent human review shows the model actually followed the prompt correctly.

Figures

Figures reproduced from arXiv: 2606.24589 by Khanak Khandelwal (Indian Institute of Technology Jodhpur).

**Figure 1.** AdversaBench pipeline. Unanimous judge consensus saves directly; splits are resolved by a GPT-4o-mini meta-judge.

**Figure 2.** Operator effectiveness heatmap (mean reward per operator–category pair). inject_distractor achieves 1.00 mean reward on reasoning and tool-use seeds but 0.33 on instruction-following, confirming that no single operator dominates across all categories.

**Figure 3.** Target model survival curve by iteration and category. Reasoning and tool-use seeds drop to 10% survival after iteration 1; instruction-following seeds remain at 60% survival after iteration 1 and require up to 5 iterations to fully break.

**Figure 4.** Judge disagreement by category. Reasoning failures are unanimous across all seeds; instruction-following produces partial disagreement in 33% of cases, confirming that the meta-judge is most active on the hardest category.

Original abstract

Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use. Every seed produced a confirmed failure. Four findings stand out. First, operator effectiveness varies sharply by category: inject_distractor scores 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use. Second, binary failure rate hides difficulty: instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for other categories, a gap visible in survival curves. Third, pairwise judge agreement of 80-87% coexists with near-zero Cohen's kappa due to label skew; category-level disagreement rates are more informative. Fourth, adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting the mutations exploit general behavioral patterns rather than model-specific weaknesses. Code, dataset, and analysis scripts are available at https://github.com/khanak0509/AdversaBench .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The pipeline combines standard red-teaming pieces into a usable end-to-end setup with public code, but the 100% confirmed failure claim rests on a judge panel whose reliability is not yet demonstrated.

The main takeaway is that this paper describes a concrete red-teaming pipeline using five mutation operators on 45 seeds across three categories, followed by a three-judge panel with meta-judge tiebreaker. Every seed ends up labeled a confirmed failure, and prompts generated on Llama 3.1 8B transfer to Llama 3.3 70B. The author also releases the code and dataset.

What the work actually does is lay out an empirical workflow with some useful breakdowns: operator effectiveness changes by category, instruction-following seeds take more iterations on average, and survival curves show difficulty beyond simple success rates. The transfer result is presented as evidence that the mutations hit general patterns. Making the full pipeline and scripts available is the clearest positive here.

The soft spot is the confirmation step. The abstract gives 80-87% pairwise agreement but near-zero Cohen's kappa from label skew, and it does not report external validation, human labels, or ablation on the judges themselves. Without those checks, the universal failure claim and the transfer finding both depend on the panel correctly identifying real issues rather than shared biases. The lack of judge prompt details or exclusion rules in the provided text makes it hard to judge how much post-hoc tuning might have shaped the outcomes.
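
The coexistence of high agreement and near-zero kappa is easy to reproduce. Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below uses hypothetical labels for 15 items, not the paper's data, in which both judges call almost everything a failure.

```python
from typing import List


def percent_agreement(a: List[int], b: List[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa for two raters with binary labels (1 = failure)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    # Chance agreement from each rater's marginal label rates.
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels, NOT the paper's data: each judge calls 14 of 15
# items a failure, and the two dissent on different items.
judge_a = [1] * 13 + [1, 0]
judge_b = [1] * 13 + [0, 1]

agreement = percent_agreement(judge_a, judge_b)  # 13/15, about 0.867
kappa = cohens_kappa(judge_a, judge_b)           # -1/14, about -0.071
```

With labels this skewed, chance agreement p_e is 197/225 (about 0.876), slightly above the observed 13/15, so kappa lands just below zero even though the judges agree on 87% of items. Note that the function divides by zero when both raters use a single label for every item, since p_e is then 1.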

This is aimed at people already running LLM safety evaluations who want a ready-to-try automated setup. A reader working on red-teaming tools could extract the operators and the category-level analysis for their own experiments.

I would send it to peer review. The empirical results and released artifacts are concrete enough to merit referee scrutiny, especially on the judge validation question.

Referee Report

1 major / 1 minor

Summary. The paper introduces AdversaBench, an automated red-teaming pipeline that applies five structured mutation operators to 45 seed prompts across reasoning, instruction-following, and tool-use categories, queries target LLMs, and confirms failures via a three-judge panel plus meta-judge tiebreaker. It claims every seed yielded a confirmed failure, reports category-specific operator effectiveness and iteration counts, notes 80-87% pairwise judge agreement with near-zero Cohen's kappa due to skew, and finds zero-shot transfer of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B.

Significance. If the multi-judge procedure is shown to be reliable, the work supplies a practical, open-source (code, dataset, and scripts at the provided GitHub link) framework for scalable LLM adversarial evaluation and highlights that mutations may exploit general rather than model-specific weaknesses.

major comments (1)

[Abstract and §3 (Judge Confirmation procedure)] The central claim that 'every seed produced a confirmed failure' and the transferability result both depend on the three-judge panel + meta-judge accurately labeling genuine failures. The manuscript reports 80-87% pairwise agreement but near-zero Cohen's kappa due to label skew, yet supplies no external validation (human labels, held-out ground truth, cross-model judge ablation, or prompt wording/exclusion criteria) to demonstrate the panel is unbiased rather than systematically over- or under-calling failures. This is load-bearing for the headline results.

minor comments (1)

[Abstract] The abstract states that 'category-level disagreement rates are more informative' but does not report the actual rates or point to the table/figure where they appear.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the judge confirmation procedure as load-bearing for the headline claims. We address the major comment below.

Referee: [Abstract and §3 (Judge Confirmation procedure)] The central claim that 'every seed produced a confirmed failure' and the transferability result both depend on the three-judge panel + meta-judge accurately labeling genuine failures. The manuscript reports 80-87% pairwise agreement but near-zero Cohen's kappa due to label skew, yet supplies no external validation (human labels, held-out ground truth, cross-model judge ablation, or prompt wording/exclusion criteria) to demonstrate the panel is unbiased rather than systematically over- or under-calling failures. This is load-bearing for the headline results.

Authors: We agree that the absence of external validation (human labels, held-out ground truth, or judge ablations) is a genuine limitation for claims that rest on the panel's accuracy. The manuscript already reports pairwise agreement and notes the kappa issue due to skew, but does not provide the requested forms of validation. In revision we will: (1) move the full judge prompts and exclusion criteria to a new appendix subsection, (2) expand the discussion in §3 to explicitly address potential systematic biases and the implications of label skew, and (3) add an explicit limitations paragraph stating that human validation was not performed and remains future work. These changes will not alter the reported numbers but will make the dependence on the panel more transparent. We do not claim the current evidence proves the panel is unbiased; we only report the procedure and agreement statistics that were obtained.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical pipeline with no derivations or self-referential reductions

The paper describes an experimental red-teaming pipeline (seed mutation operators, model queries, three-judge confirmation) whose central claims ('every seed produced a confirmed failure', cross-model transfer) rest on reported experimental outcomes rather than any equations, fitted parameters, or derivations. No self-citation load-bearing steps, uniqueness theorems, ansatzes, or renamings appear in the provided text. The judge-panel procedure is an empirical measurement step, not a self-defining or fitted-input reduction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the reliability of the judge panel to confirm failures and on the assumption that the 45 seeds are representative; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The three-judge panel with meta-judge accurately identifies genuine model failures.
The pipeline's confirmation step depends on this; invoked in the description of failure verification.

pith-pipeline@v0.9.1-grok · 5792 in / 1192 out tokens · 29836 ms · 2026-06-25T23:47:28.629971+00:00 · methodology

Reference graph

Works this paper leans on

9 extracted references · 6 canonical work pages · 6 internal anchors

[1] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2306.05685

[2] Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. (2022). Red teaming language models with language models. arXiv preprint arXiv:2202.03286

[3] Debenedetti, E., Zhang, J., Balunovic, M., Beurer-Kellner, L., Fischer, M., and Vechev, M. (2024). AgentDojo: A dynamic environment to evaluate attacks and defenses for LLM agents. arXiv preprint arXiv:2406.13352

[4] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46

[5] Feinstein, A. R. and Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543-549

[6] Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval

[7] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249

[8] Chao, P., Robey, A., Dobriban, E., Hsieh, C.-J., Smith, R., and Jana, S. (2024). JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318

[9] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Yang, J., Gupta, S., Mangaokar, N., et al. (2024). A strongREJECT for empty jailbreaks. arXiv preprint arXiv:2402.10260
