pith. machine review for the scientific record.

arxiv: 2604.24074 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords safety benchmarks · LLM judges · prompt sensitivity · measurement variance · HarmBench · AI safety evaluation · judge configuration

The pith

LLM judge prompt wording alone can change safety benchmark scores by up to 24 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the specific wording of prompts used by LLM-based judges affects the outcomes of AI safety benchmarks such as HarmBench. Through a controlled factorial experiment holding the judge model fixed, it finds that prompt variants shift detected harmful-response rates by as much as 24.2 percentage points. This matters because safety evaluations are used to assess and compare AI models, and large unaccounted variance from prompt choices could distort conclusions about relative safety. The study also shows moderate instability in model rankings and category-specific differences in sensitivity.

Core claim

Using a 2x2x3 factorial design to create 12 judge prompt variants along evaluation structure and instruction framing, the paper applies them with one fixed judge model to six target models and 400 HarmBench behaviors. Prompt wording alone shifts harmful-response rates by up to 24.2 percentage points, with even surface rewording causing swings up to 20.1 points. Model safety rankings are moderately unstable (mean Kendall tau 0.89), and category sensitivity ranges from 39.6 points for copyright to 0 points for harassment. A supplementary multi-judge test shows judge-model choice adds further variance.

What carries the argument

A 2x2x3 factorial design generating 12 judge prompt variants along axes of evaluation structure and instruction framing, applied with a fixed judge model to fixed target outputs to isolate the effect of prompt wording on classification.
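As a sketch, the 12 variants can be enumerated mechanically. The axis and level names below are illustrative assumptions: the paper names the two axes (evaluation structure, instruction framing) and describes within-condition surface rewordings, but does not publish these exact labels.

```python
from itertools import product

# Hypothetical level labels; only the 2 x 2 x 3 shape is taken from the paper.
evaluation_structure = ["binary_verdict", "rubric_scored"]  # 2 levels
instruction_framing = ["neutral", "safety_first"]           # 2 levels
surface_rewordings = ["v1", "v2", "v3"]                     # 3 rewordings per condition

variants = [
    {"structure": s, "framing": f, "rewording": r}
    for s, f, r in product(evaluation_structure, instruction_framing, surface_rewordings)
]
assert len(variants) == 12  # the full 2 x 2 x 3 factorial
```

Each variant is then applied, judge model held fixed, to the identical target-model outputs, so any difference in measured harmful rates is attributable to the prompt.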

If this is right

  • Safety benchmark scores can vary by up to 24.2 percentage points from prompt wording alone even with the judge model held fixed.
  • Model safety rankings show only moderate stability across prompt variants, with mean Kendall tau of 0.89.
  • Sensitivity to prompt choices differs sharply by harm category, reaching 39.6 points for copyright but 0 for harassment.
  • Judge-model selection adds measurement variance on top of prompt effects.
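The headline "swing" numbers above are spreads between the highest and lowest harmful-response rate observed across prompt variants. A minimal sketch of that computation, using illustrative rates rather than the paper's data:

```python
def max_swing_pp(harmful_rates):
    """Max spread, in percentage points, across judge-prompt variants.

    harmful_rates: harmful-response rates as fractions in [0, 1], one per
    prompt variant, for a single harm category or target model.
    """
    return round(100 * (max(harmful_rates) - min(harmful_rates)), 1)

# Illustrative rates only (not the paper's data):
print(max_swing_pp([0.12, 0.20, 0.31, 0.18]))  # 19.0
```

A category like harassment with a reported 0-point swing would have identical rates under every variant; copyright's 39.6-point figure means the choice of prompt nearly decides the category's score.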

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark results may need to include sensitivity ranges or standardized prompt suites to become more reliable for model comparisons.
  • This variance could contribute to inconsistencies seen when safety scores are reported across different labs or evaluation setups.
  • Similar prompt-sensitivity tests could be applied to capability or alignment benchmarks that also rely on LLM judges.

Load-bearing premise

The observed differences in judgments are caused primarily by prompt wording rather than interactions with the specific judge model, sampling variance in target-model outputs, or unmeasured properties of the 400 HarmBench behaviors.

What would settle it

Re-running the identical target outputs through the same 12 prompts with a different judge model and observing average prompt-induced swings below 5 percentage points would indicate that the variance stems from an interaction with the original judge model rather than from prompt wording in general.
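That replication check could be scripted roughly as follows; the function names, data layout, and handling of the 5-point threshold are assumptions for illustration.

```python
def max_swing_pp(rates):
    """Max prompt-induced spread, in percentage points."""
    return 100 * (max(rates) - min(rates))

def variance_is_judge_specific(rates_by_model, threshold_pp=5.0):
    """True if the average swing across the 12 prompts, under the
    replacement judge, falls below the threshold -- which would suggest
    the original sensitivity was tied to the first judge model rather
    than to prompt wording in general.

    rates_by_model: target model -> list of harmful-response rates, one
    per prompt variant, all produced by the replacement judge on the
    identical target outputs.
    """
    swings = [max_swing_pp(r) for r in rates_by_model.values()]
    return sum(swings) / len(swings) < threshold_pp
```

If the swings persist under the second judge, prompt wording itself is the culprit; if they collapse, the effect is a judge-prompt interaction.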

Figures

Figures reproduced from arXiv: 2604.24074 by Xinran Zhang.

Figure 1. Maximum swing in harmful rate (pp) across 12 Sonnet-judged prompts, by HarmBench category. Copyright is the most sensitive; harassment shows zero variation.
Figure 2. Pairwise Cohen's κ across 12 multi-judge configurations. Block structure shows v3 (GLM-5) clustering separately from v1/v2 (Sonnet/MiniMax). Mean κ = 0.47.
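For readers unfamiliar with the agreement statistic in Figure 2, a minimal pure-Python Cohen's κ over two judges' labels; this is a generic sketch, not the paper's implementation, and it assumes the judges are not in perfect chance agreement.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two judge configurations'
    labels over the same set of responses (e.g. 'harmful' / 'safe')."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)
```

A mean κ of 0.47, as reported in the figure, sits well below the ~0.8 often treated as strong agreement, underscoring how much judge configuration moves individual verdicts.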
Original abstract

Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 x 2 x 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity ranges from 39.6 percentage points for copyright to 0 percentage points for harassment. A supplementary multi-judge experiment using three judge models shows that judge-model choice adds further variance. Our results demonstrate that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarks such as HarmBench. Using a 2×2×3 factorial design to generate 12 prompt variants (along evaluation structure and instruction framing axes) with a fixed judge model (Claude Sonnet 4-6), the authors produce 28,812 judgments across six target models and 400 HarmBench behaviors. They report that prompt wording shifts harmful-response rates by up to 24.2 percentage points (with within-condition rewording causing up to 20.1 pp swings), yields moderately unstable model rankings (mean Kendall tau = 0.89), and shows category-level sensitivity ranging from 39.6 pp (copyright) to 0 pp (harassment). A supplementary multi-judge experiment with three models is included to demonstrate additional variance from judge-model choice.

Significance. If the results hold, this work demonstrates that safety benchmark outcomes are sensitive to an often-fixed implementation detail (judge prompt), with direct implications for the reliability and comparability of model safety assessments. The large-scale, controlled factorial design that isolates prompt effects on identical target outputs, combined with the use of an established benchmark (HarmBench) and reproducible judgment count, provides concrete, falsifiable evidence of variance that strengthens the central claim. This could prompt the field to adopt sensitivity analyses or standardized prompts in future benchmarking.

major comments (1)
  1. [Results section] The central claim that prompt wording alone shifts harmful-response rates by up to 24.2 pp (and causes ranking instability) relies on the reported differences being robust, but the manuscript provides no details on the statistical tests used, the exact aggregation rules for turning individual judgments into rates, or adjustments for multiple comparisons across the 12 prompts and harm categories. This omission risks leaving the magnitude of the effects open to unexamined confounds despite the strong factorial design.
minor comments (2)
  1. [Abstract] Adding a brief clause on the aggregation procedure or statistical approach used to quantify the percentage-point shifts and Kendall tau would improve clarity for readers encountering the quantitative claims first.
  2. [Methods] The description of how the 400 behaviors are distributed across categories and how judgments are collected could explicitly state whether any filtering or balancing was applied, so readers can verify that the factorial comparisons rest on comparable sets of behaviors.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of the work and recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Results section] The central claim that prompt wording alone shifts harmful-response rates by up to 24.2 pp (and causes ranking instability) relies on the reported differences being robust, but the manuscript provides no details on the statistical tests used, the exact aggregation rules for turning individual judgments into rates, or adjustments for multiple comparisons across the 12 prompts and harm categories. This omission risks leaving the magnitude of the effects open to unexamined confounds despite the strong factorial design.

    Authors: We thank the referee for highlighting this important point regarding transparency in our reporting. In the revised manuscript, we will expand the Results section (and add a brief Methods subsection) to include a clear description of the aggregation rules: for each of the 12 judge prompt variants, the harmful-response rate is calculated as the proportion of the 400 HarmBench behaviors that receive a 'harmful' judgment from the fixed judge model (Claude Sonnet 4-6). This aggregation is performed independently per prompt variant across the six target models.

    We did not perform formal inferential hypothesis tests (such as t-tests, ANOVA, or chi-squared tests) on the differences between prompt variants, as the study is primarily descriptive and emphasizes the magnitude of observed variation within a controlled 2×2×3 factorial design. Model ranking instability is quantified using Kendall's tau rank correlation coefficient, with the mean value of 0.89 reported across relevant conditions.

    Regarding multiple comparisons: because the primary claims rest on the range of observed maximum differences (24.2 pp overall, 20.1 pp within-condition) and category-level sensitivities rather than on p-value-based significance testing across the 12 prompts and harm categories, no adjustments were applied. We will state this explicitly in the revision, note the exploratory nature of the per-category results, and discuss the implications for interpreting effects such as the 39.6 pp range for copyright. These clarifications will be incorporated to strengthen the presentation of the results.

    revision: yes
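The aggregation and ranking statistics described in this response are simple to state in code. The following is a generic sketch, not the authors' implementation; judgment labels and the tie-free tau variant are assumptions.

```python
from itertools import combinations

def harmful_rate(judgments):
    """Proportion of behaviors labeled 'harmful' by one judge-prompt
    variant; with HarmBench this would run over 400 behaviors."""
    return sum(1 for j in judgments if j == "harmful") / len(judgments)

def kendall_tau(x, y):
    """Kendall rank correlation between two lists of per-model harmful
    rates (no ties assumed); 1.0 means identical model rankings."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)
```

Averaging `kendall_tau` over all pairs of prompt variants' per-model rate vectors would yield the kind of mean tau (0.89) the paper reports.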

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurement study

full rationale

The paper conducts a controlled empirical experiment with a 2×2×3 factorial design on judge prompts applied to fixed target-model outputs from HarmBench. All reported results (percentage-point shifts, Kendall-tau values, category sensitivities) are direct measurements against external behaviors and do not rely on any derivations, equations, fitted parameters renamed as predictions, or self-citation chains for uniqueness. No load-bearing step reduces to its own inputs by construction; the design isolates prompt effects without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is an empirical sensitivity analysis. No free parameters are fitted to produce the central claims. No new entities are postulated.

axioms (1)
  • standard math: Standard assumptions of factorial experimental design and rank correlation (Kendall tau) apply to the 28,812 judgments.
    Invoked when reporting mean Kendall tau = 0.89 and category-level sensitivity ranges.

pith-pipeline@v0.9.0 · 5503 in / 1259 out tokens · 40719 ms · 2026-05-08T03:36:52.607511+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 12 canonical work pages · 1 internal anchor

  1. Beyer, T., Xhonneux, S., Geisler, S., Gidel, G., Schwinn, L., Günnemann, S.: LLM-safety evaluations lack robustness. arXiv preprint arXiv:2503.02574 (2025)
  2. Deng, H., Farber, C., Lee, J., Tang, D.: Rubric-conditioned LLM grading: Alignment, uncertainty, and robustness. arXiv preprint arXiv:2601.08843 (2025)
  3. Eiras, F., Zemour, E., Lin, E., Mugunthan, V.: Know thy judge: On the robustness meta-evaluation of LLM safety judges. In: ICBINB Workshop at ICLR (2025)
  4. Gupta, P., Yau, L.Q., Low, B.H.H.: WalledEval: A comprehensive safety evaluation toolkit for large language models. arXiv preprint arXiv:2408.03837 (2024)
  5. Hong, Y., Yao, H., Shen, B., Xu, W., Wei, H., Dong, Y.: RULERS: Locked rubrics and evidence-anchored scoring for robust LLM evaluation. arXiv preprint arXiv:2601.08654 (2026)
  6. Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., Khabsa, M.: Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674 (2023)
  7. Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., Seo, M.: Prometheus: Inducing fine-grained evaluation capability in language models. In: ICLR (2024)
  8. Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., Shao, J.: SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044 (2024)
  9. Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-Eval: NLG evaluation using GPT-4 with better human alignment. In: EMNLP (2023)
  10. Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., Hendrycks, D.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In: ICML (2024)
  11. Schwinn, L., Ladenburger, M., Beyer, T., Mofakhami, M., Gidel, G., Günnemann, S.: A coin flip for safety: LLM judges fail to reliably measure adversarial robustness. arXiv preprint arXiv:2603.06594 (2026)
  12. Silva, H., Mendes, M., Gonçalo Oliveira, H.: Meta-judging with large language models: Concepts, methods, and challenges. arXiv preprint arXiv:2601.17312 (2026)
  13. Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., Toyer, S.: A StrongREJECT for empty jailbreaks. In: NeurIPS Datasets and Benchmarks Track (2024)
  14. Thomas, R.S., Shiromani, S., Chaudhry, A., Li, R., Sharma, V., Zhu, K., Dev, S.: ProMoral-Bench: Evaluating prompting strategies for moral reasoning and safety in LLMs. arXiv preprint arXiv:2602.13274 (2026)
  15. Zhang, X.: Beyond creed: A non-identity safety condition—a strong empirical alternative to identity framing in low-data LoRA fine-tuning. arXiv preprint arXiv:2603.14723 (2026)
  16. Zhang, X.: Rethinking atomic decomposition for LLM judges: A prompt-controlled study of reference-grounded QA evaluation. arXiv preprint arXiv:2603.28005 (2026)
  17. Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., Huang, M.: SafetyBench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045 (2024)
  18. Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: NeurIPS Datasets and Benchmarks Track (2023)