Brittlebench: Quantifying LLM robustness via prompt sensitivity

Adina Williams; Anaelia Ovalle; Angelika Romanou; Antoine Bosselut; Candace Ross; Chantal Shaib; Jesse Dodge; Kerem Oktar; Koustuv Sinha; Mark Ibrahim

arxiv: 2603.13285 · v2 · submitted 2026-02-27 · 💻 cs.LG · cs.AI


Angelika Romanou, Mark Ibrahim, Candace Ross, Chantal Shaib, Kerem Oktar, Samuel J. Bell, Anaelia Ovalle, Jesse Dodge, Antoine Bosselut, Koustuv Sinha, Adina Williams


Pith reviewed 2026-05-15 17:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords LLM robustness · prompt sensitivity · brittleness evaluation · semantics-preserving perturbations · performance variance · model rankings · Brittlebench · benchmark noise


The pith

Semantics-preserving prompt perturbations account for up to half of a given model's performance variance and alter model rankings in 63 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework to separate prompt sensitivity from inherent task difficulty in language model evaluations. It introduces Brittlebench, a pipeline that applies meaning-preserving changes to inputs from standard benchmarks and measures the resulting drop in performance. These changes reduce performance by as much as 12 percent, and even a single perturbation alters the relative ranking of models in 63 percent of cases. The work also shows that prompt-related variability can account for up to half of the performance variance of a given model. This matters because clean benchmarks may give a misleading picture of real-world reliability, where user phrasing varies.

Core claim

Using the Brittlebench pipeline on popular benchmarks, the authors report that semantics-preserving input perturbations account for up to half of the performance variance of a given model, and that applying even a single perturbation alters the relative ranking of models in 63 percent of cases. Performance degrades by as much as 12 percent, but the effect is uneven across models. The accompanying framework is meant to disentangle data-induced difficulty from prompt-related brittleness.
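To make the variance claim concrete, here is a minimal sketch of a two-way split using the law of total variance, which is the form of decomposition quoted from the paper further down this page. The function name `decompose_variance`, the 0/1 score matrix, and the choice of population variance are illustrative assumptions, not the paper's implementation.

```python
from statistics import fmean, pvariance

def decompose_variance(scores):
    """Split total score variance for ONE model into two parts.

    scores[i][j] is the (hypothetical) 0/1 correctness on benchmark
    item i under prompt variant j. Every item must have the same
    number of variants for the identity below to hold exactly.
    """
    flat = [y for row in scores for y in row]
    total = pvariance(flat)                                    # Var(Y)
    v_data = pvariance([fmean(row) for row in scores])         # variance of per-item means
    v_brittleness = fmean([pvariance(row) for row in scores])  # mean of per-item variances
    return total, v_data, v_brittleness

# Toy data, invented for illustration: 4 items x 4 prompt variants.
scores = [
    [1, 1, 1, 1],   # always solved: no prompt sensitivity
    [1, 0, 1, 0],   # outcome depends on the prompt variant
    [0, 0, 0, 0],   # never solved: hard item, no prompt sensitivity
    [1, 1, 0, 1],
]
total, v_data, v_brittleness = decompose_variance(scores)
share = v_brittleness / total   # fraction of variance tied to prompt variants
```

On this toy matrix the two parts sum to the total, and the prompt-related share comes out at 4/9. The figure only illustrates the arithmetic and says nothing about any real model.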

What carries the argument

Brittlebench, an evaluation pipeline that applies semantics-preserving perturbations to benchmark inputs to isolate and quantify model sensitivity to prompt variants.
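As a rough illustration of what a semantics-preserving perturbation can look like, the sketch below implements two of the perturbation families named in the passage quoted later on this page (word manipulation and prompt padding). The helper names, the filler sentence, and the specific typo rule are assumptions made for this example; the paper's actual perturbation set is not described in the abstract.

```python
import random

def swap_adjacent_chars(prompt, rng):
    """Word manipulation (hypothetical rule): introduce one typo by swapping
    two adjacent inner characters of a randomly chosen alphabetic word."""
    words = prompt.split(" ")
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return prompt                      # nothing long enough to perturb
    i = rng.choice(candidates)
    w = words[i]
    k = rng.randrange(1, len(w) - 2)       # first and last characters stay fixed
    words[i] = w[:k] + w[k + 1] + w[k] + w[k + 2:]
    return " ".join(words)

def pad_prompt(prompt, filler="Please answer the following question."):
    """Prompt padding (hypothetical filler): prepend neutral text and leave
    the question itself untouched."""
    return f"{filler} {prompt}"

rng = random.Random(0)                     # fixed seed for repeatability
question = "What is the capital of France?"
variants = [swap_adjacent_chars(question, rng), pad_prompt(question)]
```

Neither helper checks that meaning is actually preserved. That check is the load-bearing premise discussed below and would need separate validation.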

If this is right

Standard benchmark scores can overestimate capabilities by ignoring common input variations.
Claims that one model outperforms another can reverse with small rephrasings of the same questions.
Up to half of a given model's performance variance traces to prompt sensitivity rather than to the difficulty of the data.
Evaluation protocols should routinely include perturbed test sets to measure robustness.
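
One way to quantify how often a comparison between two models reverses is to count the model pairs whose ordering differs between clean and perturbed accuracy. The sketch below does that under an assumed definition of a "flip" (a strict sign change in the accuracy gap). The paper's exact definition of a "case" is not given in the abstract, and the accuracies here are invented.

```python
from itertools import combinations

def flip_rate(clean, perturbed):
    """Fraction of model pairs whose accuracy ordering differs between
    the clean and the perturbed condition (assumed definition)."""
    pairs = list(combinations(sorted(clean), 2))
    flips = 0
    for a, b in pairs:
        before = clean[a] - clean[b]
        after = perturbed[a] - perturbed[b]
        if before * after < 0:     # strict sign change; ties do not count
            flips += 1
    return flips / len(pairs)

# Invented accuracies for three hypothetical models.
clean = {"model_a": 0.82, "model_b": 0.80, "model_c": 0.71}
perturbed = {"model_a": 0.74, "model_b": 0.77, "model_c": 0.69}
rate = flip_rate(clean, perturbed)   # only the (model_a, model_b) pair flips
```

Here one of the three pairs changes order, so the rate is 1/3. Treating ties as non-flips is a design choice that a real protocol would need to state.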

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment in noisy user environments could widen performance gaps beyond what clean tests predict.
Training objectives that reward invariance to equivalent prompts might reduce measured brittleness.
Automated generation of perturbation sets could become a standard part of benchmark design.

Load-bearing premise

The perturbations used are truly semantics-preserving and add no extra task difficulty beyond the change in prompt wording.

What would settle it

Repeating the full Brittlebench procedure on a fresh collection of models and benchmarks and finding that perturbations explain less than 10 percent of variance and that model rankings stay unchanged in more than 60 percent of cases.

Original abstract

Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which can face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, that can enable us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks, and observe model performance to degrade as much as 12%. However, these perturbations do not affect all models equally: even a single perturbation alters the relative ranking of models in 63% of cases, impacting conclusions about comparative model performance. Decomposing the total variance of both state-of-the-art open-weight and commercial models, we find that semantics-preserving input perturbations can account for up to half of the performance variance for a given model. Brittlebench highlights the need for more robust evaluations and models, and allows us to systematically understand model brittleness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Brittlebench, a theoretical framework and evaluation pipeline for quantifying LLM sensitivity (brittleness) to prompt variants. It applies semantics-preserving perturbations to popular benchmarks, reporting up to 12% performance degradation, single-perturbation ranking changes in 63% of model comparisons, and decomposition showing that such perturbations account for up to half the total performance variance across open-weight and commercial models.

Significance. If the perturbations are validated as semantics-preserving, the work would be significant for LLM evaluation: it supplies a concrete method to disentangle prompt-induced variability from inherent data difficulty, demonstrates instability in current model rankings, and quantifies the contribution of input noise to observed performance. The empirical variance decomposition and ranking-flip statistics provide falsifiable, actionable numbers that could influence benchmark design and robustness research.

major comments (2)

[Abstract] The central claims (up to 50% variance from perturbations; 63% ranking flips) rest on the unverified premise that the applied perturbations are semantics-preserving and do not alter task difficulty. No description of generation mechanism, human equivalence validation, paraphrase checks, or difficulty controls is supplied, leaving the variance attribution and disentanglement framework open to confounding.
[Evaluation pipeline] For Brittlebench, the manuscript provides no statistical controls, confidence intervals, or multiple-testing corrections for the reported ranking alterations and variance percentages. Without these, it is unclear whether the 63% figure and the 'up to half' variance share are robust to sampling variation across models and benchmarks.
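
For readers unfamiliar with the kind of confidence interval requested above, the following is a minimal sketch of a percentile bootstrap over benchmark items for a ranking-flip rate. All names, the toy 0/1 correctness data, and the flip definition are assumptions for illustration; nothing here reproduces the paper's procedure.

```python
import random
from itertools import combinations

def accuracy(correct, idx):
    """Mean 0/1 correctness over the resampled item indices."""
    return sum(correct[i] for i in idx) / len(idx)

def flip_rate(clean_acc, pert_acc):
    """Fraction of model pairs whose ordering changes sign (assumed definition)."""
    pairs = list(combinations(sorted(clean_acc), 2))
    flips = sum(
        (clean_acc[a] - clean_acc[b]) * (pert_acc[a] - pert_acc[b]) < 0
        for a, b in pairs
    )
    return flips / len(pairs)

def bootstrap_flip_ci(clean, pert, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the flip rate.

    clean[m][i] and pert[m][i] are 0/1 correctness of model m on item i.
    Items are resampled with replacement; the same indices are used for
    both conditions so the clean/perturbed pairing is preserved.
    """
    rng = random.Random(seed)
    n_items = len(next(iter(clean.values())))
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        c = {m: accuracy(v, idx) for m, v in clean.items()}
        p = {m: accuracy(v, idx) for m, v in pert.items()}
        stats.append(flip_rate(c, p))
    stats.sort()
    lo_idx = round(alpha / 2 * (n_boot - 1))
    hi_idx = round((1 - alpha / 2) * (n_boot - 1))
    return stats[lo_idx], stats[hi_idx]

# Invented per-item results for three hypothetical models on ten items.
clean = {
    "model_a": [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    "model_b": [1, 1, 0, 1, 1, 1, 0, 1, 1, 0],
    "model_c": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}
perturbed = {
    "model_a": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
    "model_b": [1, 1, 0, 1, 1, 1, 0, 1, 0, 0],
    "model_c": [1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
}
lo, hi = bootstrap_flip_ci(clean, perturbed)
```

With only ten items the interval will be wide; the point is the resampling scheme, not the numbers.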

minor comments (2)

[Abstract] The abstract lists 'a suite of popular benchmarks' but does not name them; explicit enumeration would improve reproducibility.
[Methods] Notation for the brittleness metric and variance decomposition should be introduced with a clear equation or diagram early in the methods.
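
For reference, the decomposition quoted from the paper in the Lean-citation section of this page has the form of the law of total variance. Written out in LaTeX, and assuming that Y is a model's score, D indexes benchmark items, and P indexes prompt variants, it might read as follows. Attaching the first term to V_data and the second to V_brittleness follows the order of the labels in the quoted passage and is an inference, since the quote is elided.

```latex
\mathrm{Var}(Y)
  = \underbrace{\mathrm{Var}_{D}\!\left(\mathbb{E}_{P}[\,Y \mid D\,]\right)}_{V_{\text{data}}}
  + \underbrace{\mathbb{E}_{D}\!\left[\mathrm{Var}_{P}(Y \mid D)\right]}_{V_{\text{brittleness}}}
```

Under that reading, the share of variance due to prompt sensitivity would be V_brittleness / Var(Y).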

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major point below and describe the changes we will make in revision.


Referee: [Abstract] The central claims (up to 50% variance from perturbations; 63% ranking flips) rest on the unverified premise that the applied perturbations are semantics-preserving and do not alter task difficulty. No description of generation mechanism, human equivalence validation, paraphrase checks, or difficulty controls is supplied, leaving the variance attribution and disentanglement framework open to confounding.

Authors: We agree that the abstract provides insufficient detail on these aspects. In the revised manuscript we will expand the abstract with a concise description of the perturbation generation process and add a dedicated subsection that reports the human equivalence validation protocol, paraphrase similarity checks, and task-difficulty controls. These additions will directly support the semantics-preserving claim and the subsequent variance attribution.
revision: yes

Referee: [Evaluation pipeline] For Brittlebench, the manuscript provides no statistical controls, confidence intervals, or multiple-testing corrections for the reported ranking alterations and variance percentages. Without these, it is unclear whether the 63% figure and the 'up to half' variance share are robust to sampling variation across models and benchmarks.

Authors: We accept this criticism. The revised manuscript will include bootstrap-derived confidence intervals for both the ranking-flip percentage and the variance-decomposition results. We will also apply a multiple-testing correction (FDR) across the model-benchmark comparisons. These statistical controls will be reported in the evaluation pipeline section and are intended to show whether the results are robust to sampling variation.
revision: yes
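
The FDR correction mentioned in the response is commonly done with the Benjamini-Hochberg step-up procedure. The rebuttal does not say which FDR method would be used, so the sketch below is an assumption, and the p-values are invented placeholders for per-comparison tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected
    while controlling the false discovery rate at level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Invented p-values, one per hypothetical model-benchmark comparison.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
rejected = benjamini_hochberg(p_values, q=0.05)
```

On these placeholder values only the two smallest p-values pass. An uncorrected 0.05 threshold would have passed four, which shows why the correction matters when many comparisons are reported.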

Circularity Check

0 steps flagged

No circularity: framework defined independently then applied empirically to benchmarks


The paper defines a theoretical framework for brittleness and prompt sensitivity, then applies semantics-preserving perturbations to existing benchmarks to measure performance changes, variance decomposition, and ranking shifts. No steps reduce by construction to inputs via self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained against external benchmarks, with results derived from direct application rather than tautological equivalence. The assumption that perturbations preserve semantics is an empirical premise open to validation but does not create circularity in the reported statistics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the approach appears to rest on standard assumptions in machine learning evaluation about benchmark validity and perturbation semantics.

pith-pipeline@v0.9.0 · 5553 in / 1063 out tokens · 61518 ms · 2026-05-15T17:49:09.097548+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $\mathrm{Var}(Y) = \mathrm{Var}_D(\mathbb{E}_P[Y \mid D]) + \mathbb{E}_D[\mathrm{Var}_P(Y \mid D)]$ ... $V_{\text{data}}$ ... $V_{\text{brittleness}}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: semantics-preserving perturbations ... word manipulation, prompt padding, paraphrasing

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
cs.AI · 2026-05 · unverdicted · novelty 7.0

PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.