Pith · machine review for the scientific record

arxiv: 2604.09606 · v1 · submitted 2026-03-10 · 💻 cs.AI · cs.SE

Recognition: 1 theorem link

· Lean Theorem

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:48 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords LLM safety · reliability evaluation · repeated sampling · prompt stress testing · failure probability · operational risk · temperature variation · binomial modeling

The pith

Repeated sampling of the same prompts reveals large differences in LLM safety failure rates that single-sample benchmarks miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that conventional safety evaluations of large language models, which rely on single prompts or very small numbers of samples across many tasks, do not reflect the risks that arise when the same prompt is issued repeatedly in real deployments. It proposes Accelerated Prompt Stress Testing (APST), which samples identical prompts many times while varying temperature and adding controlled perturbations, then models the observed unsafe outputs as Bernoulli trials to compute per-inference failure probabilities. Under this repeated-sampling regime, models that score similarly on standard low-N tests display markedly different empirical failure rates, especially as temperature changes. A reader would care because high-stakes applications require consistent safety across repeated uses rather than one-off correctness. The approach shifts evaluation from breadth across tasks to depth on the same input.

Core claim

While instruction-tuned LLMs show comparable safety performance on AIR-BENCH prompts when evaluated with single or very-low-sample (N ≤ 3) methods, repeated sampling under controlled temperatures produces substantial variation in empirical failure probabilities across models, with failures treated as stochastic Bernoulli outcomes rather than isolated events.

What carries the argument

Accelerated Prompt Stress Testing (APST), which repeatedly samples identical prompts under fixed temperature and perturbation conditions to estimate binomial per-inference failure probabilities.
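The estimator the paper describes is just the binomial maximum-likelihood estimate over repeated inferences. A minimal stdlib-Python sketch of that idea, where `toy_model` and the `is_unsafe` detector are hypothetical stand-ins rather than the paper's implementation:

```python
import random

def apst_failure_rate(generate, prompt, n_samples, temperature, is_unsafe):
    """Binomial MLE of the per-inference failure probability: sample the
    same prompt n_samples times at a fixed temperature, flag each output
    with the failure detector, and divide the failure count by n_samples."""
    failures = sum(
        1 for _ in range(n_samples)
        if is_unsafe(generate(prompt, temperature))
    )
    return failures / n_samples

# Hypothetical stand-ins: a toy "model" that emits an unsafe output with
# hidden probability 0.08, and a detector that flags exactly those outputs.
random.seed(0)
def toy_model(prompt, temperature):
    return "UNSAFE" if random.random() < 0.08 else "ok"

p_hat = apst_failure_rate(toy_model, "same prompt", 500, 0.7,
                          lambda out: out == "UNSAFE")
```

With hundreds of repetitions per prompt, `p_hat` concentrates near the latent failure rate; at N ≤ 3, the same estimator would frequently return exactly 0, which is the blind spot the paper attributes to shallow benchmarks.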

If this is right

  • Safety benchmarks must report statistical distributions of failures rather than single-point scores.
  • Model selection for repeated-use settings should incorporate temperature-dependent failure rates.
  • Operational risk can be quantified directly as a per-inference probability derived from binomial counts.
  • Prompt perturbation and temperature controls become standard parameters in reliability testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment pipelines may need to restrict temperature ranges to keep failure probabilities below acceptable thresholds.
  • The gap between benchmark scores and repeated-use reliability could affect regulatory or insurance requirements for AI systems.
  • Extending APST to longer interaction histories might reveal additional failure modes not visible in isolated prompt repetitions.

Load-bearing premise

That failures observed across repeated identical prompts under controlled conditions reflect genuine latent operational risks rather than artifacts created by the sampling process itself.

What would settle it

The claim would be refuted if all tested models exhibited statistically indistinguishable failure rates across temperatures even after hundreds of repeated samples per prompt, or if the observed variation vanished under fully deterministic decoding.
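One concrete way to operationalize "statistically indistinguishable" is a two-proportion test on the binomial counts from two temperature settings. A stdlib-Python sketch using the pooled-proportion normal approximation; the counts are illustrative, not taken from the paper:

```python
import math

def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided z-test for equality of two binomial failure rates,
    using the pooled-proportion normal approximation (reasonable at the
    hundreds of samples per prompt that repeated sampling produces)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail mass
    return z, p_value

# Illustrative counts: 40/500 unsafe outputs at temperature 1.0
# versus 12/500 at temperature 0.2.
z, p = two_proportion_z_test(40, 500, 12, 500)
```

If runs like this returned large p-values model after model and temperature after temperature, the paper's central contrast would collapse into finite-sample noise.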

Figures

Figures reproduced from arXiv: 2604.09606 by Keita Broadwater.

Figure 1. Conceptual comparison of LLM safety evaluation paradigms along … (view at source ↗)
Figure 2. Empirical failure probability by temperature for the Phase 1 cal… (view at source ↗)
Figure 3. Empirical failure probability as a function of sampling depth for Phase … (view at source ↗)
Figure 4. AIR-BENCH–style category-level safety summary under shallow … (view at source ↗)
Figure 5. Comparison of benchmark-style (AIR-BENCH–equivalent, prompt… (view at source ↗)
Original abstract

Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment often exposes a different class of risk: operational failures arising from repeated generations of the same prompt rather than broad task generalization. In high-stakes settings, response consistency and safety under repeated use are critical operational requirements. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST probes LLM behavior by repeatedly sampling identical prompts under controlled operational conditions, including temperature variation and prompt perturbation, to surface latent failure modes such as hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST characterizes them statistically as stochastic outcomes of repeated inference. We model observed safety failures using Bernoulli and binomial formulations to estimate per-inference failure probabilities, enabling quantitative comparison of operational risk across models and configurations. We apply APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH 2024 derived safety and security prompts. While models exhibit similar performance under conventional single- or very-low-sample evaluation (N <= 3), repeated sampling reveals substantial variation in empirical failure probabilities across temperatures. These results demonstrate that shallow benchmark scores can obscure meaningful differences in reliability under sustained use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Accelerated Prompt Stress Testing (APST), a depth-oriented framework for LLM safety evaluation that repeatedly samples identical prompts under controlled temperature and perturbation conditions. Failures are modeled as Bernoulli trials and aggregated via binomial distributions to estimate per-inference failure probabilities. Experiments on AIR-BENCH 2024 safety prompts across multiple instruction-tuned LLMs show that conventional low-N (≤3) evaluations yield similar performance, while higher-N repeated sampling uncovers substantial temperature-dependent variation in empirical failure rates.

Significance. If the empirical results hold after addressing procedural details, the work provides a practical extension of reliability-engineering stress testing to LLM inference, demonstrating that shallow benchmarks can mask operational reliability gaps. The use of standard Bernoulli/binomial models is a strength, as is the focus on falsifiable, quantitative failure probabilities rather than qualitative observations.

major comments (2)
  1. [§3] §3 (Methodology): The exact sampling procedure, value of N (number of repetitions), and implementation of prompt perturbations are not specified in sufficient detail to allow reproduction or to rule out that observed temperature-driven variations are artifacts of the sampling process itself rather than latent model failures.
  2. [§4] §4 (Results): The claim of 'substantial variation' in failure probabilities across temperatures is not accompanied by confidence intervals, standard errors, or statistical significance tests on the binomial estimates; without these, it is unclear whether the differences exceed what would be expected from finite-sample noise alone.
minor comments (2)
  1. [Abstract] The abstract introduces the acronym APST without spelling it out on first use.
  2. [Figures] Figure captions and axis labels in the results figures should explicitly state the number of repetitions N and the failure-detection criteria used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive feedback. We address the major comments point-by-point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Methodology): The exact sampling procedure, value of N (number of repetitions), and implementation of prompt perturbations are not specified in sufficient detail to allow reproduction or to rule out that observed temperature-driven variations are artifacts of the sampling process itself rather than latent model failures.

    Authors: We agree that the methodology section requires more precise specification to ensure reproducibility. In the revised version of the paper, we will provide the exact value of N used in our experiments, a step-by-step description of the sampling procedure including how temperature is applied and any seeding for randomness, and the specific implementation details for prompt perturbations. This will allow readers to replicate the study and confirm that the temperature-dependent variations reflect genuine model behaviors. revision: yes

  2. Referee: [§4] §4 (Results): The claim of 'substantial variation' in failure probabilities across temperatures is not accompanied by confidence intervals, standard errors, or statistical significance tests on the binomial estimates; without these, it is unclear whether the differences exceed what would be expected from finite-sample noise alone.

    Authors: We concur that including measures of statistical uncertainty would strengthen the results. We will revise §4 to include binomial confidence intervals for the estimated failure probabilities and conduct appropriate statistical tests to assess the significance of differences across temperature settings. This will demonstrate that the reported variations are statistically meaningful and not merely due to sampling variability. revision: yes
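The promised revision is not shown, but one standard choice for binomial uncertainty is the Wilson score interval, sketched here in stdlib Python; the 12-failures-in-500-samples count is illustrative, not from the paper:

```python
import math

def wilson_interval(failures, n, z=1.96):
    """95% Wilson score interval for a binomial failure probability.
    Unlike the naive Wald interval, it stays inside (0, 1) and behaves
    well when the failure count is small, as safety counts usually are."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Illustrative count: 12 unsafe outputs observed in 500 samples.
lo, hi = wilson_interval(12, 500)
```

Reporting such intervals per model and temperature would let readers see directly whether the curves in the results figures are separated by more than sampling noise.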

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper introduces APST as an empirical stress-testing framework that repeatedly samples prompts and applies standard Bernoulli/binomial estimation to observed failure counts. Failure probabilities are computed directly from data under controlled temperature and perturbation conditions, with no parameter fitting that is then relabeled as a prediction, no self-citations invoked as load-bearing uniqueness theorems, and no ansatz or renaming of prior results. The contrast between N≤3 and higher-N evaluation follows immediately from the sampling procedure itself once the failure detector is accepted; the derivation chain contains no self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard statistical assumptions and the operational relevance of repeated-prompt failures; no new entities or heavily fitted parameters are introduced in the abstract.

axioms (1)
  • domain assumption Safety failures can be modeled as independent Bernoulli trials across repeated inferences
    Invoked when applying binomial formulations to estimate per-inference failure probabilities

pith-pipeline@v0.9.0 · 5530 in / 1242 out tokens · 26921 ms · 2026-05-15T12:48:49.034512+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 2 internal anchors

  1. [1] P. Liang, R. Bommasani, T. Lee, D. Tsipras, J. Burns, D. Zou, et al., “Holistic evaluation of language models,” arXiv preprint arXiv:2211.09110, 2022.

  2. [2] Y. Zeng, Y. Yang, A. Zhou, J. Z. Tan, Y. Tu, Y. Mai, K. Klyman, M. Pan, R. Jia, D. Song, P. Liang, and B. Li, “AIR-BENCH: A safety benchmark based on risk categories from regulations and policies,” arXiv preprint arXiv:2402.09407, 2024.

  3. [3] S. Krishna, N. Bhambra, R. Bleakney, and R. Bhayana, “Evaluation of reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 on a radiology board–style examination,” Radiology, vol. 311, no. 2, p. e232715, 2024.

  4. [4] K. Hanss, K. V. Sarma, A. L. Glowinski, A. Krystal, R. Saunders, A. Halls, et al., “Assessing the accuracy and reliability of large language models in psychiatry using standardized multiple-choice questions: Cross-sectional study,” J. Med. Internet Res., vol. 27, p. e69910, 2025.

  5. [5] E. Larsen, “The instability of safety: How random seeds and temperature expose inconsistent LLM refusal behavior,” arXiv preprint arXiv:2512.12066, 2025.

  6. [6] J. C. Penny-Dimri, M. Bachmann, W. R. Cooke, S. Mathewlynn, S. Dockree, J. Tolladay, J. Kossen, L. Li, Y. Gal, and G. D. Jones, “Measuring large language model uncertainty in women’s health using semantic entropy and perplexity: A comparative study,” Lancet Obstet. Gynaecol. Womens Health, vol. 1, no. 1, pp. e47–e56, 2025.

  7. [7] C. Huang, Y. Wu, and K. Wang, “Uncertainty quantification for LLM-based survey simulations,” arXiv preprint arXiv:2502.17773, 2025.

  8. [8] R. Khanmohammadi, E. Miahi, M. Mardikoraem, S. Kaur, I. Brugere, C. Smiley, K. S. Thind, and M. M. Ghassemi, “Calibrating LLM confidence by probing perturbed representation stability,” in Proc. Conf. Empirical Methods Nat. Lang. Process. (EMNLP), 2025, pp. 10459–10525.

  9. [9] P. Rauba, Q. Wei, and M. van der Schaar, “Quantifying perturbation impacts for large language models,” arXiv preprint arXiv:2412.00868, 2024.

  10. [10] N. Levy, A. Ashrov, and G. Katz, “Towards robust LLMs: An adversarial robustness measurement framework,” arXiv preprint arXiv:2504.17723, 2025.

  11. [11] J. He, L. Yu, C. Li, R. Yang, F. Chen, K. Li, M. Zhang, et al., “Survey of uncertainty estimation in large language models: Sources, methods, applications, and challenges,” arXiv preprint, 2025.

  12. [12] N. Naik, “Probabilistic consensus through ensemble validation: A framework for LLM reliability,” arXiv preprint arXiv:2411.06535, 2024.

  13. [13] R. Aghazadeh-Chakherlou, Q. Guo, S. Khastgir, P. Popov, X. Zhang, and X. Zhao, “HIP-LLM: A hierarchical imprecise probability approach to reliability assessment of large language models,” arXiv preprint arXiv:2511.00527, 2025.

  14. [14] Y. Mou, S. Zhang, and W. Ye, “SG-Bench: Evaluating LLM safety generalization across diverse tasks and prompt types,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 37, 2024, pp. 123032–123054.

  15. [15] Z. Zhang, S. Cui, Y. Lu, J. Zhou, J. Yang, H. Wang, and M. Huang, “Agent-SafetyBench: Evaluating the safety of LLM agents,” arXiv preprint arXiv:2412.14470, 2024.

  16. [16] F. Deniz, D. Popovic, Y. Boshmaf, E. Jeong, M. Ahmad, S. Chawla, and I. Khalil, “aiXamine: Simplified LLM safety and security,” arXiv preprint arXiv:2504.14985, 2025.