pith. machine review for the scientific record.

arxiv: 2604.24074 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords safety benchmarks · LLM judges · prompt sensitivity · measurement variance · HarmBench · AI safety evaluation · judge configuration

The pith

LLM judge prompt wording alone can change safety benchmark scores by up to 24 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the specific wording of prompts used by LLM-based judges affects the outcomes of AI safety benchmarks such as HarmBench. Through a controlled factorial experiment holding the judge model fixed, it finds that prompt variants shift detected harmful-response rates by as much as 24.2 percentage points. This matters because safety evaluations are used to assess and compare AI models, and large unaccounted variance from prompt choices could distort conclusions about relative safety. The study also shows moderate instability in model rankings and category-specific differences in sensitivity.

Core claim

Using a 2x2x3 factorial design to create 12 judge prompt variants along evaluation structure and instruction framing, the paper applies them with one fixed judge model to six target models and 400 HarmBench behaviors. Prompt wording alone shifts harmful-response rates by up to 24.2 percentage points, with even surface rewording causing swings up to 20.1 points. Model safety rankings are moderately unstable (mean Kendall tau 0.89), and category sensitivity ranges from 39.6 points for copyright to 0 points for harassment. A supplementary multi-judge test shows judge-model choice adds further variance.

What carries the argument

A 2x2x3 factorial design generating 12 judge prompt variants along axes of evaluation structure and instruction framing, applied with a fixed judge model to fixed target outputs to isolate the effect of prompt wording on classification.
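As a sketch, the 12 variants can be enumerated mechanically. The axis and level names below are illustrative assumptions: the paper names the two axes (evaluation structure, instruction framing) and describes within-condition surface rewordings, but does not publish these exact labels.

```python
from itertools import product

# Hypothetical level labels; only the 2 x 2 x 3 shape is taken from the paper.
evaluation_structure = ["binary_verdict", "rubric_scored"]  # 2 levels
instruction_framing = ["neutral", "safety_first"]           # 2 levels
surface_rewordings = ["v1", "v2", "v3"]                     # 3 rewordings per condition

variants = [
    {"structure": s, "framing": f, "rewording": r}
    for s, f, r in product(evaluation_structure, instruction_framing, surface_rewordings)
]
assert len(variants) == 12  # the full 2 x 2 x 3 factorial
```

Each variant is then applied, judge model held fixed, to the identical target-model outputs, so any difference in measured harmful rates is attributable to the prompt.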

If this is right

  • Safety benchmark scores can vary by up to 24.2 percentage points from prompt wording alone even with the judge model held fixed.
  • Model safety rankings show only moderate stability across prompt variants, with mean Kendall tau of 0.89.
  • Sensitivity to prompt choices differs sharply by harm category, reaching 39.6 points for copyright but 0 for harassment.
  • Judge-model selection adds measurement variance on top of prompt effects.
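The headline "swing" numbers above are spreads between the highest and lowest harmful-response rate observed across prompt variants. A minimal sketch of that computation, using illustrative rates rather than the paper's data:

```python
def max_swing_pp(harmful_rates):
    """Max spread, in percentage points, across judge-prompt variants.

    harmful_rates: harmful-response rates as fractions in [0, 1], one per
    prompt variant, for a single harm category or target model.
    """
    return round(100 * (max(harmful_rates) - min(harmful_rates)), 1)

# Illustrative rates only (not the paper's data):
print(max_swing_pp([0.12, 0.20, 0.31, 0.18]))  # 19.0
```

A category like harassment with a reported 0-point swing would have identical rates under every variant; copyright's 39.6-point figure means the choice of prompt nearly decides the category's score.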

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark results may need to include sensitivity ranges or standardized prompt suites to become more reliable for model comparisons.
  • This variance could contribute to inconsistencies seen when safety scores are reported across different labs or evaluation setups.
  • Similar prompt-sensitivity tests could be applied to capability or alignment benchmarks that also rely on LLM judges.

Load-bearing premise

The observed differences in judgments are caused primarily by prompt wording rather than interactions with the specific judge model, sampling variance in target-model outputs, or unmeasured properties of the 400 HarmBench behaviors.

What would settle it

Re-running the identical target outputs through the same 12 prompts with a different judge model and observing average prompt-induced swings below 5 percentage points would indicate that the variance stems from an interaction with the original judge model rather than from prompt wording in general.
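That replication check could be scripted roughly as follows; the function names, data layout, and handling of the 5-point threshold are assumptions for illustration.

```python
def max_swing_pp(rates):
    """Max prompt-induced spread, in percentage points."""
    return 100 * (max(rates) - min(rates))

def variance_is_judge_specific(rates_by_model, threshold_pp=5.0):
    """True if the average swing across the 12 prompts, under the
    replacement judge, falls below the threshold -- which would suggest
    the original sensitivity was tied to the first judge model rather
    than to prompt wording in general.

    rates_by_model: target model -> list of harmful-response rates, one
    per prompt variant, all produced by the replacement judge on the
    identical target outputs.
    """
    swings = [max_swing_pp(r) for r in rates_by_model.values()]
    return sum(swings) / len(swings) < threshold_pp
```

If the swings persist under the second judge, prompt wording itself is the culprit; if they collapse, the effect is a judge-prompt interaction.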

Figures

Figures reproduced from arXiv: 2604.24074 by Xinran Zhang.

Figure 1. Maximum swing in harmful rate (pp) across 12 Sonnet-judged prompts, by HarmBench category. Copyright is the most sensitive; harassment shows zero variation.
Figure 2. Pairwise Cohen's κ across 12 multi-judge configurations. Block structure shows v3 (GLM-5) clustering separately from v1/v2 (Sonnet/MiniMax). Mean κ = 0.47.
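For readers unfamiliar with the agreement statistic in Figure 2, a minimal pure-Python Cohen's κ over two judges' labels; this is a generic sketch, not the paper's implementation, and it assumes the judges are not in perfect chance agreement.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two judge configurations'
    labels over the same set of responses (e.g. 'harmful' / 'safe')."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)
```

A mean κ of 0.47, as reported in the figure, sits well below the ~0.8 often treated as strong agreement, underscoring how much judge configuration moves individual verdicts.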
Original abstract

Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 x 2 x 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity ranges from 39.6 percentage points for copyright to 0 percentage points for harassment. A supplementary multi-judge experiment using three judge models shows that judge-model choice adds further variance. Our results demonstrate that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarks such as HarmBench. Using a 2×2×3 factorial design to generate 12 prompt variants (along evaluation structure and instruction framing axes) with a fixed judge model (Claude Sonnet 4-6), the authors produce 28,812 judgments across six target models and 400 HarmBench behaviors. They report that prompt wording shifts harmful-response rates by up to 24.2 percentage points (with within-condition rewording causing up to 20.1 pp swings), yields moderately unstable model rankings (mean Kendall tau = 0.89), and shows category-level sensitivity ranging from 39.6 pp (copyright) to 0 pp (harassment). A supplementary multi-judge experiment with three models is included to demonstrate additional variance from judge-model choice.

Significance. If the results hold, this work demonstrates that safety benchmark outcomes are sensitive to an often-fixed implementation detail (judge prompt), with direct implications for the reliability and comparability of model safety assessments. The large-scale, controlled factorial design that isolates prompt effects on identical target outputs, combined with the use of an established benchmark (HarmBench) and reproducible judgment count, provides concrete, falsifiable evidence of variance that strengthens the central claim. This could prompt the field to adopt sensitivity analyses or standardized prompts in future benchmarking.

major comments (1)
  1. [Results section] The central claim that prompt wording alone shifts harmful-response rates by up to 24.2 pp (and causes ranking instability) relies on the reported differences being robust, but the manuscript provides no details on the statistical tests used, the exact aggregation rules for turning individual judgments into rates, or adjustments for multiple comparisons across the 12 prompts and harm categories. This omission risks leaving the magnitude of the effects open to unexamined confounds despite the strong factorial design.
minor comments (2)
  1. [Abstract] Adding a brief clause on the aggregation procedure or statistical approach used to quantify the percentage-point shifts and Kendall tau would improve clarity for readers encountering the quantitative claims first.
  2. [Methods] The description of how the 400 behaviors are distributed across categories and how judgments are collected could explicitly state whether any filtering or balancing was applied, so readers can verify that the factorial comparisons rest on comparable sets of behaviors.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of the work and recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Results section] The central claim that prompt wording alone shifts harmful-response rates by up to 24.2 pp (and causes ranking instability) relies on the reported differences being robust, but the manuscript provides no details on the statistical tests used, the exact aggregation rules for turning individual judgments into rates, or adjustments for multiple comparisons across the 12 prompts and harm categories. This omission risks leaving the magnitude of the effects open to unexamined confounds despite the strong factorial design.

    Authors: We thank the referee for highlighting this important point regarding transparency in our reporting. In the revised manuscript, we will expand the Results section (and add a brief Methods subsection) to include a clear description of the aggregation rules: for each of the 12 judge prompt variants, the harmful-response rate is calculated as the proportion of the 400 HarmBench behaviors that receive a 'harmful' judgment from the fixed judge model (Claude Sonnet 4-6). This aggregation is performed independently per prompt variant across the six target models.

    We did not perform formal inferential hypothesis tests (such as t-tests, ANOVA, or chi-squared tests) on the differences between prompt variants, as the study is primarily descriptive and emphasizes the magnitude of observed variation within a controlled 2×2×3 factorial design. Model ranking instability is quantified using Kendall's tau rank correlation coefficient, with the mean value of 0.89 reported across relevant conditions.

    Regarding multiple comparisons: because the primary claims rest on the range of observed maximum differences (24.2 pp overall, 20.1 pp within-condition) and category-level sensitivities rather than on p-value-based significance testing across the 12 prompts and harm categories, no adjustments were applied. We will state this explicitly in the revision, note the exploratory nature of the per-category results, and discuss the implications for interpreting effects such as the 39.6 pp range for copyright. These clarifications will be incorporated to strengthen the presentation of the results.

    revision: yes
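The aggregation and ranking statistics described in this response are simple to state in code. The following is a generic sketch, not the authors' implementation; judgment labels and the tie-free tau variant are assumptions.

```python
from itertools import combinations

def harmful_rate(judgments):
    """Proportion of behaviors labeled 'harmful' by one judge-prompt
    variant; with HarmBench this would run over 400 behaviors."""
    return sum(1 for j in judgments if j == "harmful") / len(judgments)

def kendall_tau(x, y):
    """Kendall rank correlation between two lists of per-model harmful
    rates (no ties assumed); 1.0 means identical model rankings."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)
```

Averaging `kendall_tau` over all pairs of prompt variants' per-model rate vectors would yield the kind of mean tau (0.89) the paper reports.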

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurement study

full rationale

The paper conducts a controlled empirical experiment with a 2×2×3 factorial design on judge prompts applied to fixed target-model outputs from HarmBench. All reported results (percentage-point shifts, Kendall-tau values, category sensitivities) are direct measurements against external behaviors and do not rely on any derivations, equations, fitted parameters renamed as predictions, or self-citation chains for uniqueness. No load-bearing step reduces to its own inputs by construction; the design isolates prompt effects without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is an empirical sensitivity analysis. No free parameters are fitted to produce the central claims. No new entities are postulated.

axioms (1)
  • standard math: Standard assumptions of factorial experimental design and rank correlation (Kendall tau) apply to the 28,812 judgments.
    Invoked when reporting mean Kendall tau = 0.89 and category-level sensitivity ranges.

pith-pipeline@v0.9.0 · 5503 in / 1259 out tokens · 40719 ms · 2026-05-08T03:36:52.607511+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 12 canonical work pages · 1 internal anchor

  1. Beyer, T., Xhonneux, S., Geisler, S., Gidel, G., Schwinn, L., Günnemann, S.: LLM-safety evaluations lack robustness. arXiv preprint arXiv:2503.02574 (2025)
  2. Deng, H., Farber, C., Lee, J., Tang, D.: Rubric-conditioned LLM grading: Alignment, uncertainty, and robustness. arXiv preprint arXiv:2601.08843 (2025)
  3. Eiras, F., Zemour, E., Lin, E., Mugunthan, V.: Know thy judge: On the robustness meta-evaluation of LLM safety judges. In: ICBINB Workshop at ICLR (2025)
  4. Gupta, P., Yau, L.Q., Low, B.H.H.: WalledEval: A comprehensive safety evaluation toolkit for large language models. arXiv preprint arXiv:2408.03837 (2024)
  5. Hong, Y., Yao, H., Shen, B., Xu, W., Wei, H., Dong, Y.: RULERS: Locked rubrics and evidence-anchored scoring for robust LLM evaluation. arXiv preprint arXiv:2601.08654 (2026)
  6. Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., Khabsa, M.: Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674 (2023)
  7. Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., Seo, M.: Prometheus: Inducing fine-grained evaluation capability in language models. In: ICLR (2024)
  8. Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., Shao, J.: SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044 (2024)
  9. Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-Eval: NLG evaluation using GPT-4 with better human alignment. In: EMNLP (2023)
  10. Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., Hendrycks, D.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In: ICML (2024)
  11. Schwinn, L., Ladenburger, M., Beyer, T., Mofakhami, M., Gidel, G., Günnemann, S.: A coin flip for safety: LLM judges fail to reliably measure adversarial robustness. arXiv preprint arXiv:2603.06594 (2026)
  12. Silva, H., Mendes, M., Gonçalo Oliveira, H.: Meta-judging with large language models: Concepts, methods, and challenges. arXiv preprint arXiv:2601.17312 (2026)
  13. Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., Toyer, S.: A StrongREJECT for empty jailbreaks. In: NeurIPS Datasets and Benchmarks Track (2024)
  14. Thomas, R.S., Shiromani, S., Chaudhry, A., Li, R., Sharma, V., Zhu, K., Dev, S.: ProMoral-Bench: Evaluating prompting strategies for moral reasoning and safety in LLMs. arXiv preprint arXiv:2602.13274 (2026)
  15. Zhang, X.: Beyond creed: A non-identity safety condition—a strong empirical alternative to identity framing in low-data LoRA fine-tuning. arXiv preprint arXiv:2603.14723 (2026)
  16. Zhang, X.: Rethinking atomic decomposition for LLM judges: A prompt-controlled study of reference-grounded QA evaluation. arXiv preprint arXiv:2603.28005 (2026)
  17. Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., Huang, M.: SafetyBench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045 (2024)
  18. Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: NeurIPS Datasets and Benchmarks Track (2023)