AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
Pith reviewed 2026-06-25 23:47 UTC · model grok-4.3
The pith
An automated pipeline mutates seed prompts and confirms failures with a three-judge panel; the paper reports that attacks generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdversaBench mutates seed prompts with five structured operators, queries the target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. In experiments on 45 seeds across three categories (reasoning, instruction-following, and tool use), every seed produced a confirmed failure. The paper reports four findings: operator effectiveness varies sharply by category; instruction-following seeds require more attacker iterations on average (2.4 versus 1.1 for the other categories); pairwise judge agreement reaches 80-87 percent while Cohen's kappa stays near zero because of label skew; and adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B.
What carries the argument
The end-to-end pipeline of five structured mutation operators plus a three-judge confirmation system with meta-judge tiebreaker.
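A minimal sketch of that mutate, query, confirm loop may help fix ideas. It is not the authors' implementation: only the operator name inject_distractor appears in the abstract, while its body, the `target` and judge callables, the round-robin operator schedule, and the rule that the meta-judge is consulted whenever the panel is not unanimous are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Operator = Callable[[str], str]     # prompt -> mutated prompt
Target = Callable[[str], str]       # prompt -> model response
Judge = Callable[[str, str], bool]  # (prompt, response) -> True if it is a failure


def inject_distractor(prompt: str) -> str:
    # The operator name comes from the abstract; this body is a hypothetical stand-in.
    return prompt + " Note: an unrelated report says the answer is always 7."


def confirm_failure(prompt: str, response: str,
                    judges: List[Judge], meta_judge: Judge) -> bool:
    votes = [judge(prompt, response) for judge in judges]
    if all(votes) or not any(votes):
        return votes[0]  # unanimous panel: accept its verdict
    # Assumption: a split panel is resolved by the meta-judge.
    return meta_judge(prompt, response)


def attack(seed: str, operators: List[Operator], target: Target,
           judges: List[Judge], meta_judge: Judge,
           max_iters: int = 5) -> Optional[Tuple[str, int]]:
    """Return (adversarial prompt, iterations used), or None if no failure is confirmed."""
    prompt = seed
    for iteration in range(1, max_iters + 1):
        # Assumption: operators are applied round-robin to the running prompt.
        operator = operators[(iteration - 1) % len(operators)]
        prompt = operator(prompt)
        response = target(prompt)
        if confirm_failure(prompt, response, judges, meta_judge):
            return prompt, iteration
    return None


# Toy stand-ins so the sketch runs without any model access.
def toy_target(prompt: str) -> str:
    return "7" if "always 7" in prompt else "4"


def strict_judge(prompt: str, response: str) -> bool:
    return response != "4"


def lenient_judge(prompt: str, response: str) -> bool:
    return False


result = attack("What is 2 + 2?", [inject_distractor], toy_target,
                [strict_judge, strict_judge, lenient_judge], strict_judge)
```

In the toy run the panel splits two to one, so the verdict comes from the meta-judge. The returned iteration count is the quantity the paper uses when it compares difficulty across categories.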
If this is right
- Mutations exploit general behavioral patterns rather than model-specific weaknesses.
- Instruction-following seeds require more iterations to produce failures than reasoning or tool-use seeds.
- Category-level disagreement rates among judges reveal more about evaluation reliability than overall agreement percentages.
- Binary failure rates alone do not capture differences in how hard it is to surface failures across task types.
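The point about agreement percentages can be made concrete with a small worked example of Cohen's kappa for two judges under heavy label skew. The counts below are hypothetical, chosen only to sit inside the reported 80-87 percent agreement range; they are not taken from the paper.

```python
def agreement_and_kappa(both_fail: int, a_only_fail: int,
                        b_only_fail: int, both_pass: int):
    """Raw agreement and Cohen's kappa for two binary judges (A and B)."""
    total = both_fail + a_only_fail + b_only_fail + both_pass
    observed = (both_fail + both_pass) / total
    a_fail = (both_fail + a_only_fail) / total  # share of items judge A labels "fail"
    b_fail = (both_fail + b_only_fail) / total  # share of items judge B labels "fail"
    # Agreement expected by chance if the judges labeled independently.
    expected = a_fail * b_fail + (1 - a_fail) * (1 - b_fail)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts: almost everything is labeled "fail" by both judges.
observed, kappa = agreement_and_kappa(both_fail=85, a_only_fail=7,
                                      b_only_fail=8, both_pass=0)
print(f"raw agreement = {observed:.2f}, kappa = {kappa:.3f}")
```

With these counts the judges agree on 85 percent of items, yet chance agreement is already about 86 percent because both judges label nearly everything a failure, so kappa lands slightly below zero. The function divides by zero if both judges always give the same single label; a real analysis would need to handle that case.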
Where Pith is reading between the lines
- If the same mutations succeed across model scales, they may point to shared limitations in how current LLMs handle certain prompt structures.
- The pipeline could be applied to additional model families to test whether transfer holds beyond the Llama series.
- Reliable automated confirmation might allow red-teaming to be run continuously during model development rather than as a one-time check.
Load-bearing premise
The three-judge panel with meta-judge tiebreaker accurately identifies genuine model failures without systematic bias or error from the judges themselves.
What would settle it
A case where the three-judge panel plus meta-judge labels a response as a failure but independent human review shows the model actually followed the prompt correctly.
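One way to look for such cases systematically is to compare panel verdicts with human labels on a sample of responses. The sketch below assumes such human labels exist, which the paper does not report; the function name and the example labels are hypothetical.

```python
from typing import Dict, List, Optional


def panel_error_rates(panel_labels: List[bool],
                      human_labels: List[bool]) -> Dict[str, Optional[float]]:
    """Compare panel verdicts (True = failure) against human review of the same items."""
    if len(panel_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(panel_labels, human_labels))
    false_pos = sum(1 for p, h in pairs if p and not h)  # panel over-calls a failure
    false_neg = sum(1 for p, h in pairs if h and not p)  # panel misses a failure
    human_neg = sum(1 for _, h in pairs if not h)
    human_pos = sum(1 for _, h in pairs if h)
    return {
        "false_positive_rate": false_pos / human_neg if human_neg else None,
        "false_negative_rate": false_neg / human_pos if human_pos else None,
    }


# Hypothetical labels for six responses; not data from the paper.
panel = [True, True, True, False, True, True]
human = [True, False, True, False, True, True]
rates = panel_error_rates(panel, human)
```

A non-trivial false positive rate on a sample like this would indicate that the panel over-calls failures, which is the scenario described above.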
Original abstract
Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use. Every seed produced a confirmed failure. Four findings stand out. First, operator effectiveness varies sharply by category: inject_distractor scores 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use. Second, binary failure rate hides difficulty: instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for other categories, a gap visible in survival curves. Third, pairwise judge agreement of 80-87% coexists with near-zero Cohen's kappa due to label skew; category-level disagreement rates are more informative. Fourth, adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting the mutations exploit general behavioral patterns rather than model-specific weaknesses. Code, dataset, and analysis scripts are available at https://github.com/khanak0509/AdversaBench .
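The second finding refers to survival curves. As a rough illustration, the sketch below computes the fraction of seeds that still have no confirmed failure after k attacker iterations. The per-seed iteration counts are hypothetical, chosen only so that their means match the reported 2.4 and 1.1; they are not the paper's data, and the paper's own analysis scripts may compute the curves differently.

```python
from typing import List


def survival_curve(iterations_to_failure: List[int], max_iters: int) -> List[float]:
    """Fraction of seeds with no confirmed failure after k iterations, for k = 0..max_iters."""
    n = len(iterations_to_failure)
    return [
        sum(1 for t in iterations_to_failure if t > k) / n
        for k in range(max_iters + 1)
    ]


# Hypothetical per-seed counts (mean 2.4 and mean 1.1 respectively).
instruction_following = [1, 2, 2, 3, 4]
other_categories = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

curve_if = survival_curve(instruction_following, max_iters=4)
curve_other = survival_curve(other_categories, max_iters=4)
```

Both groups end at zero survival, which corresponds to a 100 percent binary failure rate, but the instruction-following curve decays more slowly. That difference is what a single failure rate hides.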
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AdversaBench, an automated red-teaming pipeline that applies five structured mutation operators to 45 seed prompts across reasoning, instruction-following, and tool-use categories, queries target LLMs, and confirms failures via a three-judge panel plus meta-judge tiebreaker. It claims every seed yielded a confirmed failure, reports category-specific operator effectiveness and iteration counts, notes 80-87% pairwise judge agreement with near-zero Cohen's kappa due to skew, and finds zero-shot transfer of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B.
Significance. If the multi-judge procedure is shown to be reliable, the work supplies a practical, open-source (code, dataset, and scripts at the provided GitHub link) framework for scalable LLM adversarial evaluation and highlights that mutations may exploit general rather than model-specific weaknesses.
major comments (1)
- [Abstract and §3] Abstract and §3 (Judge Confirmation procedure): The central claim that 'every seed produced a confirmed failure' and the transferability result both depend on the three-judge panel + meta-judge accurately labeling genuine failures. The manuscript reports 80-87% pairwise agreement but near-zero Cohen's kappa due to label skew, yet supplies no external validation (human labels, held-out ground truth, cross-model judge ablation, or prompt wording/exclusion criteria) to demonstrate the panel is unbiased rather than systematically over- or under-calling failures. This is load-bearing for the headline results.
minor comments (1)
- [Abstract] The abstract states that 'category-level disagreement rates are more informative' but does not report the actual rates or point to the table or figure where they appear.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the judge confirmation procedure as load-bearing for the headline claims. We address the major comment below.
Referee: [Abstract and §3] Abstract and §3 (Judge Confirmation procedure): The central claim that 'every seed produced a confirmed failure' and the transferability result both depend on the three-judge panel + meta-judge accurately labeling genuine failures. The manuscript reports 80-87% pairwise agreement but near-zero Cohen's kappa due to label skew, yet supplies no external validation (human labels, held-out ground truth, cross-model judge ablation, or prompt wording/exclusion criteria) to demonstrate the panel is unbiased rather than systematically over- or under-calling failures. This is load-bearing for the headline results.
Authors: We agree that the absence of external validation (human labels, held-out ground truth, or judge ablations) is a genuine limitation for claims that rest on the panel's accuracy. The manuscript already reports pairwise agreement and notes the kappa issue due to skew, but does not provide the requested forms of validation. In revision we will: (1) move the full judge prompts and exclusion criteria to a new appendix subsection, (2) expand the discussion in §3 to explicitly address potential systematic biases and the implications of label skew, and (3) add an explicit limitations paragraph stating that human validation was not performed and remains future work. These changes will not alter the reported numbers but will make the dependence on the panel more transparent. We do not claim the current evidence proves the panel is unbiased; we only report the procedure and agreement statistics that were obtained.
Revision: partial
Circularity Check
No circularity: empirical pipeline with no derivations or self-referential reductions
The paper describes an experimental red-teaming pipeline (seed mutation operators, model queries, three-judge confirmation) whose central claims ('every seed produced a confirmed failure', cross-model transfer) rest on reported experimental outcomes rather than any equations, fitted parameters, or derivations. No self-citation load-bearing steps, uniqueness theorems, ansatzes, or renamings appear in the provided text. The judge-panel procedure is an empirical measurement step, not a self-defining or fitted-input reduction. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three-judge panel with meta-judge accurately identifies genuine model failures.