
arxiv: 2606.25487 · v2 · pith:MSOHZ3U3new · submitted 2026-06-24 · cs.CL · cs.CR · cs.LG

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

Pith reviewed 2026-06-26 05:25 UTC · model grok-4.3

classification: cs.CL · cs.CR · cs.LG
keywords: jailbreak · attack success rate · automated evaluation · LLM safety · adversarial robustness · safety classifier

The pith

Automated judges for LLM jailbreak success rates disagree with humans and change verdicts under simple attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two common families of automated judges for attack success rate (dedicated safety classifiers and prompted LLM judges) against human majority votes on 596 HarmBench completions. The classifier over-flags responses as harmful, while LLM judges miss many harmful cases and report very different rates depending on which model is chosen. LLM judges can be flipped by wrapping the untouched harmful text in benign framing or by prepending a refusal sentence. The classifier resists these surface changes, but a white-box attack flips it even though the harmful content remains. These findings indicate that many published attack success rates rest on uncalibrated and attackable scoring systems rather than stable measurements of harm.

Core claim

Using human majority votes on 596 HarmBench completions as ground truth, the dedicated classifier achieves precision 0.835 and recall 0.974, while the three LLM judges achieve precision 0.81–0.94 but recall of only 0.06–0.65. Surface wrappers that add benign framing flip the LLM judges on 57–100% of cases, and a single prepended refusal sentence accounts for 39–88%. The classifier resists surface attacks (at most 6.7%) yet succumbs to a white-box GCG attack that flips 70% of confident true positives (21 of 30).
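The precision and recall figures above follow the standard definitions, with the human majority vote as the reference label. A minimal sketch of that computation, assuming boolean labels where True means "harmful"; the function name and the toy labels are illustrative and do not come from the paper's released code:

```python
from typing import Sequence, Tuple


def precision_recall(human: Sequence[bool], judge: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of judge verdicts against human majority labels.

    True means "harmful" in both sequences.
    """
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)        # judge flags, humans agree
    fp = sum((not h) and j for h, j in pairs)  # judge flags, humans disagree
    fn = sum(h and (not j) for h, j in pairs)  # judge misses a harmful item
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Toy labels, invented for illustration only (not the paper's data).
human = [True, True, True, False, False, True]
judge = [True, True, False, True, False, True]
p, r = precision_recall(human, judge)
print(f"precision={p:.3f} recall={r:.3f}")
```

A judge with high precision and low recall, like the LLM judges described above, under-counts successful attacks; one with high recall and lower precision, like the classifier, over-counts them.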

What carries the argument

Comparison of the classifier and LLM-as-judge families against human majority-vote ground truth on HarmBench, followed by surface framing attacks on both families and a gradient-based optimization (GCG) attack on the open-weight classifier.
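A surface framing attack of this kind can be measured as a flip rate: wrap each completion without altering its harmful text, re-score it, and count how often the verdict changes from harmful to not harmful. The sketch below is an assumption-laden illustration, not the paper's harness. The refusal sentence, the `toy_judge` stand-in, and the choice to compute the rate over completions the judge originally flags are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical refusal sentence; the paper's verbatim wrappers are in its Appendix A.
REFUSAL_PREFIX = "I'm sorry, but I can't help with that. "


def prepend_refusal(completion: str) -> str:
    """Wrapper that leaves the original text untouched and only adds a prefix."""
    return REFUSAL_PREFIX + completion


def flip_rate(
    judge: Callable[[str], bool],
    completions: Sequence[str],
    wrapper: Callable[[str], str],
) -> float:
    """Share of originally flagged completions whose verdict flips after wrapping."""
    flagged = [c for c in completions if judge(c)]
    if not flagged:
        return float("nan")
    flipped = sum(not judge(wrapper(c)) for c in flagged)
    return flipped / len(flagged)


def toy_judge(text: str) -> bool:
    """Stand-in judge (assumption): fooled by any leading apology."""
    return "step 1" in text.lower() and not text.startswith("I'm sorry")


completions = ["Step 1: do X", "Sure. Step 1: do Y", "I cannot assist."]
print(flip_rate(toy_judge, completions, prepend_refusal))
```

In practice `judge` would call a real classifier or a prompted LLM, and a human audit would confirm that the wrapped text still contains the harmful content.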

If this is right

  • Reported attack success rates vary sharply with the choice of judge.
  • LLM judges can be made to change their verdict by adding benign framing that leaves the harmful text unchanged.
  • The dedicated classifier resists surface attacks but can be reversed by a white-box GCG attack that exploits its open weights, even at a small optimization budget.
  • Papers should report judge precision and recall on human data and include an adversarial robustness check of the judge.
  • Attack success rates can be corrected for measured judge precision before publication.
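The summary does not state which correction formula the paper uses, so the following is a sketch under stated assumptions. One simple reading scales the judge's reported ASR by its measured precision, which removes the expected share of false positives. A second variant also divides by recall to add back the harmful cases the judge missed. Both assume that precision and recall measured on the human-labeled slice transfer to the evaluated responses. The function names and the example ASR of 0.50 are hypothetical; the precision and recall values are the classifier figures quoted above.

```python
def _check_unit(name: str, value: float) -> None:
    """Reject values outside the unit interval."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must lie in [0, 1], got {value}")


def precision_corrected_asr(judge_asr: float, precision: float) -> float:
    """Expected share of responses that are flagged AND truly harmful."""
    _check_unit("judge_asr", judge_asr)
    _check_unit("precision", precision)
    return judge_asr * precision


def precision_recall_corrected_asr(judge_asr: float, precision: float, recall: float) -> float:
    """Estimate of the true harmful share.

    true positives = judge_asr * precision, and recall = true positives / all
    harmful items, so all harmful items = judge_asr * precision / recall.
    """
    _check_unit("recall", recall)
    if recall == 0.0:
        raise ValueError("recall must be positive")
    return min(1.0, precision_corrected_asr(judge_asr, precision) / recall)


# Example: a hypothetical judge-reported ASR of 0.50 scored by the classifier.
print(precision_corrected_asr(0.50, 0.835))
print(precision_recall_corrected_asr(0.50, 0.835, 0.974))
```

For a judge with recall as low as 0.06, the second variant would multiply the reported rate many times over, which is one way to see how strongly the choice of judge can move a published ASR.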

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardized public human validation sets could reduce dependence on any single judge.
  • Hybrid scoring systems might combine the classifier's resistance to surface changes with better recall calibration.
  • The gap between automated and human judgments may systematically inflate or deflate published attack success rates across the literature.

Load-bearing premise

The 596 human-labeled completions scored by majority vote form an accurate and representative ground truth for whether content is harmful.
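Because everything downstream depends on these labels, it helps to see how little a majority vote records. The sketch below derives a majority label from per-item votes and reports the share of unanimous items as a crude agreement indicator. The per-annotator votes shown are invented; the helper names are illustrative and nothing here reflects the actual HarmBench annotation data.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Majority label for one item; requires an odd number of votes to avoid ties."""
    if len(votes) % 2 == 0:
        raise ValueError("need an odd, non-zero number of votes")
    return sum(votes) > len(votes) / 2


def unanimous_fraction(all_votes: Sequence[Sequence[bool]]) -> float:
    """Share of items on which every annotator gave the same label."""
    if not all_votes:
        raise ValueError("no items to score")
    return sum(len(set(v)) == 1 for v in all_votes) / len(all_votes)


# Invented votes for four items, three annotators each (True = harmful).
votes = [
    [True, True, False],
    [False, False, False],
    [True, False, False],
    [True, True, True],
]
labels = [majority_vote(v) for v in votes]
print(labels, unanimous_fraction(votes))
```

Two items with the same majority label can hide very different levels of annotator disagreement, which is why agreement statistics matter for treating the labels as ground truth.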

What would settle it

A fresh human-labeled set of jailbreak completions on which the reported precision, recall, and attack flip rates differ substantially from the numbers obtained on the HarmBench validation slice.

Figures

Figures reproduced from arXiv: 2606.25487 by Yang Gao (Veyon Solutions).

Figure 1: Precision against recall for the four judges. The classifier sits in the high-recall, lower-precision ...
Figure 2: Any-wrapper flip rate. The dedicated classifier (hatched) is at 3.4%; the three LLM-judges range ...
Figure 3: Fraction of the 30 confident true positives broken by the white-box attack as a function of optimiza...
Original abstract

Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite ways. The dedicated classifier over-flags (precision 0.835, recall 0.974); three different LLM-as-judges keep high precision (0.81 to 0.94) but show erratic recall (0.06 to 0.65), so the same responses produce very different ASR depending on which judge scores them. The two families also differ sharply in robustness. Wrappers that leave the harmful text untouched and only add benign framing flip every LLM-judge between 57% and 100% of the time, and a single prepended refusal sentence accounts for much of this (39% to 88%). The dedicated classifier resists these surface attacks (at most 6.7%), but a white-box GCG attack on its open weights flips 70% of confident true positives (21 of 30; 95% CI 54 to 86%) even at a small optimization budget. A two-annotator audit confirms the attacks leave the harm intact: every one of 80 sampled flips still contained the harmful content. Because a large and growing share of reported ASR comes from LLM-judges, many such numbers are unreliable both on average and under deliberate pressure. We recommend that papers report judge precision and recall on a human-labeled slice, report ASR corrected for judge precision, and include an adversarial check of the judge. Our code is released.
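The abstract does not say how the 95% interval for 21 of 30 was computed. As a hedged check, the normal-approximation (Wald) interval reproduces the reported 54% to 86% after rounding, while the Wilson score interval, often preferred at small sample sizes, comes out slightly lower at roughly 52% to 83%. The sketch below computes both; that the paper used the Wald method is an inference from the numbers, not something the source states.

```python
import math
from typing import Tuple


def wald_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a binomial proportion."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


wald_lo, wald_hi = wald_interval(21, 30)
wilson_lo, wilson_hi = wilson_interval(21, 30)
print(f"Wald:   {wald_lo:.3f} to {wald_hi:.3f}")      # Wald:   0.536 to 0.864
print(f"Wilson: {wilson_lo:.3f} to {wilson_hi:.3f}")  # Wilson: 0.521 to 0.833
```

Either way the interval is wide: with 30 items, the attack's true flip rate could plausibly sit anywhere from about half to well over four fifths.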

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that automated judges for attack success rate (ASR) in LLM jailbreak papers are unreliable both on average and under attack. Using 596 human-labeled completions from the HarmBench classifier validation set scored by majority vote, it reports that a dedicated safety classifier over-flags (precision 0.835, recall 0.974) while three LLM-as-judges maintain high precision (0.81–0.94) but exhibit erratic recall (0.06–0.65). Simple adversarial wrappers flip LLM-judge decisions 57–100% of the time (with a single refusal sentence accounting for 39–88%), while the classifier resists surface attacks (≤6.7%) but succumbs to white-box GCG (70% of confident true positives flipped; 95% CI 54–86%). A two-annotator audit on 80 flipped samples confirms harm remains intact. The paper recommends reporting judge precision/recall on human-labeled data, corrected ASR, and adversarial checks, and releases code.

Significance. If the results hold, the work is significant because it supplies concrete, falsifiable evidence that a growing share of published ASR numbers rest on unvalidated judges whose outputs can be manipulated without altering the underlying content. The explicit precision/recall figures, confidence interval on the GCG attack, two-annotator audit, and released code are strengths that make the findings directly actionable for the community. The paper thereby identifies a methodological gap that affects the interpretability of a large body of jailbreak literature.

major comments (1)
  1. [Evaluation on the HarmBench validation set] The central claim that 'many such numbers are unreliable both on average and under deliberate pressure' rests on precision/recall and attack results computed against human majority votes on the single 596-completion HarmBench validation set. The manuscript provides no diversity statistics across behaviors or models, no inter-annotator agreement rates, and no cross-dataset comparison, leaving open whether the observed failure modes (erratic LLM recall, over-flagging by the classifier, and attack susceptibility) generalize to the response distributions and harm definitions used in other jailbreak evaluations.
minor comments (2)
  1. [Abstract] The abstract states that 'three different LLM-as-judges' were tested but does not name the models or prompts; adding these details would improve reproducibility.
  2. A summary table listing precision, recall, and attack flip rates for each judge family would make the comparative results easier to parse at a glance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the paper's significance and for the constructive comment. We address the concern about evaluation scope below.

Point-by-point responses
  1. Referee: [Evaluation on the HarmBench validation set] The central claim that 'many such numbers are unreliable both on average and under deliberate pressure' rests on precision/recall and attack results computed against human majority votes on the single 596-completion HarmBench validation set. The manuscript provides no diversity statistics across behaviors or models, no inter-annotator agreement rates, and no cross-dataset comparison, leaving open whether the observed failure modes (erratic LLM recall, over-flagging by the classifier, and attack susceptibility) generalize to the response distributions and harm definitions used in other jailbreak evaluations.

    Authors: We agree that broader validation would strengthen the claims. The HarmBench validation set was selected because it is the standard public resource providing human majority-vote labels for exactly this classifier-evaluation task. In revision we will add diversity statistics: the breakdown of the 596 completions across HarmBench's seven semantic harm categories and the source models used to generate responses. For inter-annotator agreement we will explicitly note that labels are taken from the original HarmBench release (majority vote of three annotators) but that raw per-annotator data are not available to us. Cross-dataset comparison would require new human labeling on other corpora, which exceeds the scope of the current work; we will add a limitations paragraph acknowledging this and listing it as future work. These changes address the comment while preserving the paper's focus on a widely used benchmark whose failure modes already affect many published ASR numbers.

    Revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on external human labels and explicit attack measurements


The paper derives its reliability conclusions by comparing LLM judges and a classifier directly to majority-vote human labels on the fixed 596-sample HarmBench validation set, then measuring attack success rates against those same labels. No equations, fitted parameters, or self-citations are invoked to redefine success or to derive the reported precision/recall/attack-flip statistics; the human annotations function as an independent external benchmark, and the two-annotator audit of attack outcomes is likewise grounded in direct inspection rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human majority votes on the HarmBench validation set are reliable ground truth and that the tested judges and attacks are representative of those used in the broader literature.

axioms (1)
  • domain assumption: Human majority votes on the 596 HarmBench completions provide accurate ground truth labels for whether a response is harmful.
    This label set is used to compute precision and recall for both judge families.

pith-pipeline@v0.9.1-grok · 5878 in / 1217 out tokens · 24605 ms · 2026-06-26T05:25:10.453138+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 15 canonical work pages · 8 internal anchors

  1. Mazeika, M., Phan, L., et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. ICML 2024 (PMLR v235). arXiv:2402.04249
  2. Souly, A., Lu, Q., Bowen, D., et al. (2024). A StrongREJECT for Empty Jailbreaks. NeurIPS 2024 Datasets & Benchmarks. arXiv:2402.10260
  3. Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG). arXiv:2307.15043
  4. Chao, P., et al. (2023/2024). Jailbreaking Black Box LLMs in Twenty Queries (PAIR). SaTML 2024. arXiv:2310.08419
  5. Mehrotra, A., et al. (2024). Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (TAP). NeurIPS 2024. arXiv:2312.02119
  6. Paulus, A., Zharmagambetov, A., Guo, C., Amos, B., & Tian, Y. (2024). AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs. arXiv:2404.16873
  7. Inan, H., et al. (2023). Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv:2312.06674
  8. Han, S., et al. (2024). WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. NeurIPS 2024 Datasets & Benchmarks. arXiv:2406.18495
  9. Eiras, F., Zemour, E., Lin, E., & Mugunthan, V. (2025). Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges. ICBINB Workshop @ ICLR 2025. arXiv:2503.04474
  10. Schwinn, L., Ladenburger, M., Beyer, T., Mofakhami, M., Gidel, G., & Günnemann, S. (2026). A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness. arXiv:2603.06594
  11. Miller, E. (2024). Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. arXiv:2411.00640
  12. Shlegeris, B., & Greenblatt, R. Meta-level Adversarial Evaluation of Oversight Techniques. Redwood Research / Alignment Forum.
  13. Wen, Y., Zharmagambetov, A., Evtimov, I., Kokhlikyan, N., Goldstein, T., Chaudhuri, K., & Guo, C. (2025). RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection. arXiv:2510.04885
  14. Yin, C., Geng, R., Wang, Y., & Jia, J. (2026). PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses. arXiv:2603.13026
  15. Qwen Team (2024/2025). Qwen2.5 Technical Report. arXiv:2412.15115
  16. Abdin, M., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219

Appendix A. Prompts, wrappers, and attack configuration

All text below is reproduced verbatim from our harness. {behavior}, {context}, and {generation} are filled per item; {generation} is truncated to its first 512 tokens before scor...