Recognition: unknown
How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
Pith reviewed 2026-05-08 03:36 UTC · model grok-4.3
The pith
LLM judge prompt wording alone can change safety benchmark scores by up to 24.2 percentage points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a 2×2×3 factorial design to create 12 judge prompt variants along two axes, evaluation structure and instruction framing, the paper applies them with one fixed judge model to six target models and 400 HarmBench behaviors. Prompt wording alone shifts harmful-response rates by up to 24.2 percentage points, with even surface rewording causing swings of up to 20.1 points. Model safety rankings are moderately unstable (mean Kendall tau 0.89), and category sensitivity ranges from 39.6 points for copyright to 0 points for harassment. A supplementary multi-judge test shows that judge-model choice adds further variance.
What carries the argument
A 2×2×3 factorial design generating 12 judge prompt variants along axes of evaluation structure and instruction framing, applied to a fixed judge model and fixed target outputs to isolate the effect of prompt wording on classification.
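To make the design concrete, below is a minimal sketch of how a 2×2×3 grid produces 12 judge-prompt variants. The two named axes come from the abstract; the third factor is rendered here as surface rewording (consistent with the abstract's mention of within-condition surface rewording), and all level labels and the build_judge_prompt helper are hypothetical illustrations, not the authors' actual prompts.

```python
from itertools import product

# Axis names come from the paper; the specific level labels are hypothetical.
EVAL_STRUCTURE = ["binary_verdict", "rubric_then_verdict"]       # 2 levels (assumed)
INSTRUCTION_FRAMING = ["refusal_focused", "harm_focused"]        # 2 levels (assumed)
SURFACE_REWORDING = ["wording_a", "wording_b", "wording_c"]      # 3 levels (assumed)

def build_judge_prompt(structure: str, framing: str, rewording: str) -> str:
    """Assemble one judge-prompt variant from its three factor levels
    (hypothetical template, not the authors' actual prompt text)."""
    return (
        f"[structure={structure}] [framing={framing}] [rewording={rewording}] "
        "Classify the response below as HARMFUL or SAFE."
    )

variants = [
    build_judge_prompt(s, f, r)
    for s, f, r in product(EVAL_STRUCTURE, INSTRUCTION_FRAMING, SURFACE_REWORDING)
]
assert len(variants) == 12  # the 2 x 2 x 3 grid
```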
If this is right
- Safety benchmark scores can vary by up to 24.2 percentage points from prompt wording alone even with the judge model held fixed.
- Model safety rankings show only moderate stability across prompt variants, with mean Kendall tau of 0.89.
- Sensitivity to prompt choices differs sharply by harm category, reaching 39.6 points for copyright but 0 for harassment (a computation sketch follows this list).
- Judge-model selection adds measurement variance on top of prompt effects.
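A minimal sketch of how category-level sensitivity could be computed, assuming each judgment is available as a (prompt variant, harm category, harmful?) triple; the triple layout and function name are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict

def category_sensitivity_pp(judgments):
    """judgments: iterable of (prompt_id, category, is_harmful) triples.
    Returns, per category, the spread (max - min, in percentage points)
    of harmful-response rates across prompt variants."""
    counts = defaultdict(lambda: [0, 0])  # (prompt_id, category) -> [harmful, total]
    for prompt_id, category, is_harmful in judgments:
        cell = counts[(prompt_id, category)]
        cell[0] += int(is_harmful)
        cell[1] += 1
    rates_by_category = defaultdict(list)
    for (prompt_id, category), (harmful, total) in counts.items():
        rates_by_category[category].append(100.0 * harmful / total)
    return {cat: max(r) - min(r) for cat, r in rates_by_category.items()}

# Toy usage: two prompts disagree on copyright, agree on harassment.
toy = [(0, "copyright", 1), (1, "copyright", 0),
       (0, "harassment", 0), (1, "harassment", 0)]
print(category_sensitivity_pp(toy))  # {'copyright': 100.0, 'harassment': 0.0}
```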
Where Pith is reading between the lines
- Benchmark results may need to include sensitivity ranges or standardized prompt suites to become more reliable for model comparisons.
- This variance could contribute to inconsistencies seen when safety scores are reported across different labs or evaluation setups.
- Similar prompt-sensitivity tests could be applied to capability or alignment benchmarks that also rely on LLM judges.
Load-bearing premise
The observed differences in judgments are caused primarily by prompt wording rather than interactions with the specific judge model, sampling variance in target-model outputs, or unmeasured properties of the 400 HarmBench behaviors.
What would settle it
Re-running the identical target outputs through the same 12 prompts with a different judge model and observing average prompt-induced swings below 5 percentage points would indicate that the variance is a judge-specific interaction rather than a general effect of prompt wording.
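As a sketch of how that check could be scored, assume the second judge's verdicts are collected into a binary array of shape (12 prompt variants, 6 target models, 400 behaviors); the array layout is an illustrative assumption, and random placeholders stand in for real judgments.

```python
import numpy as np

def max_prompt_swing_pp(judgments: np.ndarray) -> np.ndarray:
    """Per target model: spread (max - min) of harmful-response rates
    across prompt variants, in percentage points."""
    rates = 100.0 * judgments.mean(axis=2)        # (12, 6): % harmful per variant/model
    return rates.max(axis=0) - rates.min(axis=0)  # swing per model

# Placeholder random labels stand in for the second judge's real outputs.
rng = np.random.default_rng(0)
judgments = (rng.random((12, 6, 400)) < 0.3).astype(float)
swings = max_prompt_swing_pp(judgments)
print(swings.round(1), "mean swing below 5 pp?", bool(swings.mean() < 5.0))
```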
Original abstract
Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 x 2 x 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity ranges from 39.6 percentage points for copyright to 0 percentage points for harassment. A supplementary multi-judge experiment using three judge models shows that judge-model choice adds further variance. Our results demonstrate that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarks such as HarmBench. Using a 2×2×3 factorial design to generate 12 prompt variants (along evaluation structure and instruction framing axes) with a fixed judge model (Claude Sonnet 4-6), the authors produce 28,812 judgments across six target models and 400 HarmBench behaviors. They report that prompt wording shifts harmful-response rates by up to 24.2 percentage points (with within-condition rewording causing up to 20.1 pp swings), yields moderately unstable model rankings (mean Kendall tau = 0.89), and shows category-level sensitivity ranging from 39.6 pp (copyright) to 0 pp (harassment). A supplementary multi-judge experiment with three models is included to demonstrate additional variance from judge-model choice.
Significance. If the results hold, this work demonstrates that safety benchmark outcomes are sensitive to an often-fixed implementation detail (judge prompt), with direct implications for the reliability and comparability of model safety assessments. The large-scale, controlled factorial design that isolates prompt effects on identical target outputs, combined with the use of an established benchmark (HarmBench) and reproducible judgment count, provides concrete, falsifiable evidence of variance that strengthens the central claim. This could prompt the field to adopt sensitivity analyses or standardized prompts in future benchmarking.
major comments (1)
- [Results section] The central claim that prompt wording alone shifts harmful-response rates by up to 24.2 pp (and causes ranking instability) relies on the reported differences being robust, but the manuscript provides no details on the statistical tests used, the exact aggregation rules for turning individual judgments into rates, or adjustments for multiple comparisons across the 12 prompts and harm categories. This omission risks leaving the magnitude of the effects open to unexamined confounds despite the strong factorial design.
minor comments (2)
- [Abstract] Adding a brief clause on the aggregation procedure or statistical approach used to quantify the percentage-point shifts and Kendall tau would improve clarity for readers encountering the quantitative claims first.
- [Methods] The description of how the 400 behaviors are distributed across categories and how judgments are collected could explicitly state whether any filtering or balancing was applied to keep the factorial comparisons well matched.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work and recommendation for minor revision. We address the single major comment below.
Point-by-point responses
- Referee: [Results section] The central claim that prompt wording alone shifts harmful-response rates by up to 24.2 pp (and causes ranking instability) relies on the reported differences being robust, but the manuscript provides no details on the statistical tests used, the exact aggregation rules for turning individual judgments into rates, or adjustments for multiple comparisons across the 12 prompts and harm categories. This omission risks leaving the magnitude of the effects open to unexamined confounds despite the strong factorial design.
Authors: We thank the referee for highlighting this important point regarding transparency in our reporting. In the revised manuscript, we will expand the Results section (and add a brief Methods subsection) to include a clear description of the aggregation rules: for each of the 12 judge prompt variants, the harmful-response rate is calculated as the proportion of the 400 HarmBench behaviors that receive a 'harmful' judgment from the fixed judge model (Claude Sonnet 4-6). This aggregation is performed independently per prompt variant for each of the six target models. Model ranking instability is quantified using Kendall's tau rank correlation coefficient, with the mean value of 0.89 reported across relevant conditions.

We did not perform formal inferential hypothesis tests (such as t-tests, ANOVA, or chi-squared tests) on the differences between prompt variants, as the study is primarily descriptive and emphasizes the magnitude of observed variation within a controlled 2×2×3 factorial design. Regarding multiple comparisons, because the primary claims rest on the range of observed maximum differences (24.2 pp overall, 20.1 pp within-condition) and category-level sensitivities rather than p-value-based significance testing across the 12 prompts and harm categories, no adjustments for multiple comparisons were applied. We will state this explicitly in the revision, note the exploratory nature of the per-category results, and discuss the implications for interpreting effects such as the 39.6 pp range for copyright.

revision: yes
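To pin down the aggregation rule described above, here is a minimal sketch assuming the judgments are stored as a binary array of shape (12 prompt variants, 6 target models, 400 behaviors); the array layout and the pairwise-averaging convention for Kendall tau are assumptions, since the manuscript does not specify which condition pairs enter the reported mean.

```python
from itertools import combinations
import numpy as np
from scipy.stats import kendalltau

def harmful_rates(labels: np.ndarray) -> np.ndarray:
    """Harmful-response rate per (prompt variant, target model): the
    proportion of the 400 behaviors judged 'harmful' by the fixed judge."""
    return labels.mean(axis=2)  # (12, 6) from (12, 6, 400) binary labels

def mean_pairwise_kendall_tau(rates: np.ndarray) -> float:
    """Mean Kendall tau between the model rankings induced by each pair
    of prompt variants (the manuscript reports a mean of 0.89)."""
    taus = [kendalltau(rates[i], rates[j])[0]  # [0] = the tau statistic
            for i, j in combinations(range(rates.shape[0]), 2)]
    return float(np.mean(taus))
```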
Circularity Check
No significant circularity: purely empirical measurement study
Full rationale
The paper conducts a controlled empirical experiment with a 2×2×3 factorial design on judge prompts applied to fixed target-model outputs from HarmBench. All reported results (percentage-point shifts, Kendall-tau values, category sensitivities) are direct measurements against external behaviors and do not rely on any derivations, equations, fitted parameters renamed as predictions, or self-citation chains for uniqueness. No load-bearing step reduces to its own inputs by construction; the design isolates prompt effects without internal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions of factorial experimental design and rank correlation (Kendall tau) apply to the 28,812 judgments.
Reference graph
Works this paper leans on
- [1] Beyer, T., Xhonneux, S., Geisler, S., Gidel, G., Schwinn, L., Günnemann, S.: LLM-safety evaluations lack robustness. arXiv preprint arXiv:2503.02574 (2025)
- [2] Deng, H., Farber, C., Lee, J., Tang, D.: Rubric-conditioned LLM grading: Alignment, uncertainty, and robustness. arXiv preprint arXiv:2601.08843 (2025)
- [3] Eiras, F., Zemour, E., Lin, E., Mugunthan, V.: Know thy judge: On the robustness meta-evaluation of LLM safety judges. In: ICBINB Workshop at ICLR (2025)
- [4] Gupta, P., Yau, L.Q., Low, B.H.H.: WalledEval: A comprehensive safety evaluation toolkit for large language models. arXiv preprint arXiv:2408.03837 (2024)
- [5] Hong, Y., Yao, H., Shen, B., Xu, W., Wei, H., Dong, Y.: RULERS: Locked rubrics and evidence-anchored scoring for robust LLM evaluation. arXiv preprint arXiv:2601.08654 (2026)
- [6] Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., Khabsa, M.: Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674 (2023)
- [7] Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., Seo, M.: Prometheus: Inducing fine-grained evaluation capability in language models. In: ICLR (2024)
- [8] Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., Shao, J.: SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044 (2024)
- [9] Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-Eval: NLG evaluation using GPT-4 with better human alignment. In: EMNLP (2023)
- [10] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., Hendrycks, D.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In: ICML (2024)
- [11] Schwinn, L., Ladenburger, M., Beyer, T., Mofakhami, M., Gidel, G., Günnemann, S.: A coin flip for safety: LLM judges fail to reliably measure adversarial robustness. arXiv preprint arXiv:2603.06594 (2026)
- [12] Silva, H., Mendes, M., Gonçalo Oliveira, H.: Meta-judging with large language models: Concepts, methods, and challenges. arXiv preprint arXiv:2601.17312 (2026)
- [13] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., Toyer, S.: A StrongREJECT for empty jailbreaks. In: NeurIPS Datasets and Benchmarks Track (2024)
- [14] Thomas, R.S., Shiromani, S., Chaudhry, A., Li, R., Sharma, V., Zhu, K., Dev, S.: ProMoral-Bench: Evaluating prompting strategies for moral reasoning and safety in LLMs. arXiv preprint arXiv:2602.13274 (2026)
- [15] Zhang, X.: Beyond creed: A non-identity safety condition—a strong empirical alternative to identity framing in low-data LoRA fine-tuning. arXiv preprint arXiv:2603.14723 (2026)
- [16] Zhang, X.: Rethinking atomic decomposition for LLM judges: A prompt-controlled study of reference-grounded QA evaluation. arXiv preprint arXiv:2603.28005 (2026)
- [17] Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., Huang, M.: SafetyBench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045 (2024)
- [18] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: NeurIPS Datasets and Benchmarks Track (2023)