Brittlebench: Quantifying LLM robustness via prompt sensitivity
Pith reviewed 2026-05-15 17:49 UTC · model grok-4.3
The pith
Semantics-preserving prompt perturbations explain up to half of LLM performance variance and flip model rankings in 63 percent of cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the Brittlebench pipeline on popular benchmarks, the authors establish that semantics-preserving input perturbations account for up to half the performance variance of a given model and that applying even a single perturbation reverses the relative ordering of models in 63 percent of comparisons. Performance degrades by as much as 12 percent, but the effect is uneven across models. The underlying framework is meant to disentangle data-induced difficulty from prompt-related brittleness.
What carries the argument
Brittlebench, an evaluation pipeline that applies semantics-preserving perturbations to benchmark inputs to isolate and quantify model sensitivity to prompt variants.
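To make "semantics-preserving perturbation" concrete, here is a minimal Python sketch of two perturbation families that the paper passage quoted further down this page names (word manipulation and prompt padding). The function names, the adjacent-letter swap, and the padding sentence are illustrative assumptions, not the authors' implementation; the review does not describe how Brittlebench actually generates its variants.

```python
import random


def swap_adjacent_chars(prompt: str, rng: random.Random) -> str:
    """Hypothetical word manipulation: swap two adjacent interior letters of one word."""
    words = prompt.split(" ")
    # Only purely alphabetic words with at least two interior letters qualify.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return prompt  # nothing safe to perturb
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # first and last letters stay in place
    # If the two letters are identical the word is unchanged, which is harmless.
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def pad_prompt(prompt: str,
               padding: str = "Please answer the following question.") -> str:
    """Hypothetical prompt padding: prepend a sentence that adds no task information."""
    return f"{padding} {prompt}"


if __name__ == "__main__":
    rng = random.Random(0)
    question = "What is the capital of France?"
    print(swap_adjacent_chars(question, rng))
    print(pad_prompt(question))
```

Whether a given perturbation really leaves task difficulty unchanged is exactly the premise the review flags as load-bearing, so a sketch like this would still need human or automated equivalence checks.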
If this is right
- Standard benchmark scores can overestimate capabilities by ignoring common input variations.
- Claims that one model outperforms another can reverse with small rephrasings of the same questions.
- For a given model, up to half of the observed performance variance traces to prompt sensitivity rather than to the difficulty of the data.
- Evaluation protocols should routinely include perturbed test sets to measure robustness.
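The ranking-reversal claim above can be illustrated with a small pairwise computation. This is a sketch under stated assumptions: it counts a "flip" as a strict sign change in the score difference between two models, and it treats each model pair as one comparison. The paper's exact definition of a "case" is not given on this page, and the scores in the usage example are made up.

```python
from itertools import combinations


def ranking_flip_rate(clean: dict[str, float], perturbed: dict[str, float]) -> float:
    """Fraction of model pairs whose relative order differs between two score sets."""
    pairs = list(combinations(sorted(clean), 2))
    if not pairs:
        raise ValueError("need at least two models to compare")
    flips = 0
    for a, b in pairs:
        before = clean[a] - clean[b]
        after = perturbed[a] - perturbed[b]
        # Strict sign change only; ties before or after are not counted as flips.
        if before * after < 0:
            flips += 1
    return flips / len(pairs)


if __name__ == "__main__":
    # Made-up accuracies for three hypothetical models.
    clean_scores = {"model_a": 0.80, "model_b": 0.78, "model_c": 0.70}
    perturbed_scores = {"model_a": 0.72, "model_b": 0.75, "model_c": 0.66}
    print(ranking_flip_rate(clean_scores, perturbed_scores))
```

In the usage example only the model_a/model_b pair changes order, so one of the three pairs flips.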
Where Pith is reading between the lines
- Deployment in noisy user environments could widen performance gaps beyond what clean tests predict.
- Training objectives that reward invariance to equivalent prompts might reduce measured brittleness.
- Automated generation of perturbation sets could become a standard part of benchmark design.
Load-bearing premise
The perturbations used are truly semantics-preserving and add no extra task difficulty beyond the change in prompt wording.
What would settle it
Repeating the full Brittlebench procedure on a fresh collection of models and benchmarks and finding that perturbations explain less than 10 percent of the variance and that model rankings stay unchanged in more than 60 percent of cases.
Original abstract
Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which can face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, that can enable us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks, and observe model performance to degrade as much as 12%. However, these perturbations do not affect all models equally: even a single perturbation alters the relative ranking of models in 63% of cases, impacting conclusions about comparative model performance. Decomposing the total variance of both state-of-the-art open-weight and commercial models, we find that semantics-preserving input perturbations can account for up to half of the performance variance for a given model. Brittlebench highlights the need for more robust evaluations and models, and allows us to systematically understand model brittleness.
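The variance decomposition mentioned in the abstract can be sketched with the law of total variance, Var(Y) = Var_D(E_P[Y|D]) + E_D[Var_P(Y|D)], which is the identity quoted from the paper further down this page. The sketch below assumes a score table with one row per benchmark item D and one column per prompt variant P, the same number of variants for every item, and population variances throughout. Reading the first term as the data component and the second as the brittleness component follows the quoted passage but is an interpretation; the table in the usage example is made up.

```python
from statistics import fmean, pvariance


def decompose_variance(scores: list[list[float]]) -> tuple[float, float, float]:
    """Split total score variance into a between-item and a within-item part.

    scores[d][p] is the score on item d under prompt variant p.
    Assumes every item has the same number of variants.
    """
    item_means = [fmean(row) for row in scores]
    # Between-item term: Var_D(E_P[Y | D]), read here as the data component.
    v_data = pvariance(item_means)
    # Within-item term: E_D[Var_P(Y | D)], read here as the brittleness component.
    v_brittleness = fmean([pvariance(row) for row in scores])
    # Total variance over all item/variant scores.
    v_total = pvariance([y for row in scores for y in row])
    return v_total, v_data, v_brittleness


if __name__ == "__main__":
    # Made-up 0/1 correctness for 3 items under 4 prompt variants each.
    table = [
        [1, 1, 1, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1],
    ]
    v_total, v_data, v_brittleness = decompose_variance(table)
    print(f"total={v_total:.4f} data={v_data:.4f} brittleness={v_brittleness:.4f}")
    print(f"brittleness share={v_brittleness / v_total:.2%}")
```

With equal group sizes and population variances the two components sum exactly to the total, so the brittleness share is simply the second term divided by the total.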
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Brittlebench, a theoretical framework and evaluation pipeline for quantifying LLM sensitivity (brittleness) to prompt variants. It applies semantics-preserving perturbations to popular benchmarks, reporting up to 12% performance degradation, single-perturbation ranking changes in 63% of model comparisons, and decomposition showing that such perturbations account for up to half the total performance variance across open-weight and commercial models.
Significance. If the perturbations are validated as semantics-preserving, the work would be significant for LLM evaluation: it supplies a concrete method to disentangle prompt-induced variability from inherent data difficulty, demonstrates instability in current model rankings, and quantifies the contribution of input noise to observed performance. The empirical variance decomposition and ranking-flip statistics provide falsifiable, actionable numbers that could influence benchmark design and robustness research.
major comments (2)
- [Abstract] The central claims (up to 50% variance from perturbations; 63% ranking flips) rest on the unverified premise that the applied perturbations are semantics-preserving and do not alter task difficulty. No description of generation mechanism, human equivalence validation, paraphrase checks, or difficulty controls is supplied, leaving the variance attribution and disentanglement framework open to confounding.
- [Evaluation pipeline] The manuscript provides no statistical controls, confidence intervals, or multiple-testing corrections for the reported ranking alterations and variance percentages. Without these, it is unclear whether the 63% figure or the 'up to half' variance share are robust to sampling variation across models and benchmarks.
minor comments (2)
- [Abstract] The abstract lists 'a suite of popular benchmarks' but does not name them; explicit enumeration would improve reproducibility.
- [Methods] Notation for the brittleness metric and variance decomposition should be introduced with a clear equation or diagram early in the methods.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major point below and describe the changes we will make in revision.
Point-by-point responses
- Referee: [Abstract] The central claims (up to 50% variance from perturbations; 63% ranking flips) rest on the unverified premise that the applied perturbations are semantics-preserving and do not alter task difficulty. No description of generation mechanism, human equivalence validation, paraphrase checks, or difficulty controls is supplied, leaving the variance attribution and disentanglement framework open to confounding.
  Authors: We agree that the abstract provides insufficient detail on these aspects. In the revised manuscript we will expand the abstract with a concise description of the perturbation generation process and add a dedicated subsection that reports the human equivalence validation protocol, paraphrase similarity checks, and task-difficulty controls. These additions will directly support the semantics-preserving claim and the subsequent variance attribution.
  Revision: yes
- Referee: [Evaluation pipeline] The manuscript provides no statistical controls, confidence intervals, or multiple-testing corrections for the reported ranking alterations and variance percentages. Without these, it is unclear whether the 63% figure or the 'up to half' variance share are robust to sampling variation across models and benchmarks.
  Authors: We accept this criticism. The revised manuscript will include bootstrap-derived confidence intervals for both the ranking-flip percentage and the variance-decomposition results. We will also apply a multiple-testing correction (FDR) across the model-benchmark comparisons. These statistical controls will be reported in the evaluation pipeline section and will demonstrate robustness to sampling variation.
  Revision: yes
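The statistical controls promised in the rebuttal could look roughly like the following Python sketch: a percentile bootstrap interval for a flip rate and a Benjamini-Hochberg procedure for FDR control. Both are standard techniques chosen here for illustration. The rebuttal names bootstrap intervals and FDR but does not say which bootstrap variant or FDR procedure the authors will use, and the function names and defaults below are assumptions.

```python
import random


def bootstrap_ci(outcomes: list[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 outcomes (1 = ranking flipped)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample comparisons with replacement and record each resample's flip rate.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return, for each hypothesis, whether it is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Made-up outcomes: 1 if a model pair changed order under perturbation.
    flips = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    print(bootstrap_ci(flips))
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

One design caveat: resampling individual comparisons treats them as independent, which pairs sharing a model or benchmark are not, so a real analysis might resample at the model or benchmark level instead.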
Circularity Check
No circularity: framework defined independently then applied empirically to benchmarks
Full rationale
The paper defines a theoretical framework for brittleness and prompt sensitivity, then applies semantics-preserving perturbations to existing benchmarks to measure performance changes, variance decomposition, and ranking shifts. No steps reduce by construction to inputs via self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained against external benchmarks, with results derived from direct application rather than tautological equivalence. The assumption that perturbations preserve semantics is an empirical premise open to validation but does not create circularity in the reported statistics.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $\mathrm{Var}(Y) = \mathrm{Var}_D\left(\mathbb{E}_P[Y \mid D]\right) + \mathbb{E}_D\left[\mathrm{Var}_P(Y \mid D)\right]$ ... $V_{\text{data}}$ ... $V_{\text{brittleness}}$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: semantics-preserving perturbations ... word manipulation, prompt padding, paraphrasing
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
  The PRISM benchmark of over 10k pairs shows that LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.