The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Jan Dubiński; Sebastian Cygert; Wojciech Zarzecki

arxiv: 2606.03305 · v2 · pith:ZM3OD6UNnew · submitted 2026-06-02 · 💻 cs.AI

Pith reviewed 2026-06-28 09:46 UTC · model grok-4.3

classification 💻 cs.AI

keywords: benchmark contamination · LLM auditing · distribution shift · dataset inference · data provenance · contamination detection · model evaluation

The pith

Statistical methods for detecting LLM benchmark contamination fail in realistic auditing due to distribution shift and scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three contamination detection approaches on 27 models spanning multiple families and sizes up to 27B parameters, including frontier industry models. It runs 335 evaluations and finds only 199 produce correct results. LLM Dataset Inference yields false positives when suspect and validation sets are not identically distributed. Post-Hoc Dataset Inference lacks statistical power because benchmarks are tiny relative to pre-training corpora. CoDeC supplies only coarse signals that cannot confirm whether a specific benchmark split was seen during training. These patterns show that current tools cannot reliably replace direct knowledge of training data provenance in practical settings.

Core claim

Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits.
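
To put the headline count in perspective, the arithmetic can be sketched as follows. This is a minimal illustration, not a computation taken from the paper: it converts 199 of 335 into an accuracy and adds a Wilson score interval, under the assumption (ours, not the authors') that the 335 evaluations can be treated as independent trials. The function name `wilson_interval` is hypothetical.

```python
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * (p * (1 - p) / trials + z**2 / (4 * trials**2)) ** 0.5 / denom
    return centre - half, centre + half


# Counts as reported in the core claim.
correct, total = 199, 335
accuracy = correct / total

# Assumption: evaluations are independent, which the paper does not claim.
low, high = wilson_interval(correct, total)

print(f"accuracy = {accuracy:.3f}")
print(f"95% Wilson interval = ({low:.3f}, {high:.3f})")
```

The accuracy works out to roughly 59%. Because evaluations of the same model or the same method are likely correlated, the interval should be read as an optimistic lower bound on the real uncertainty.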

What carries the argument

The three detection paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) evaluated for robustness against distribution shift between suspect and validation sets and against the small size of benchmarks relative to pre-training data.
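
The distribution-shift failure mode can be illustrated with a toy simulation. This is a sketch under stated assumptions and not the LLM Dataset Inference procedure itself: per-example "scores" are stand-ins for model loss drawn from Gaussians, neither set is a training member, and the test is a one-sided two-sample comparison with a normal approximation. The means, sample size, and the helper `one_sided_p` are all hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def one_sided_p(suspect, validation):
    """P-value for H1: mean(suspect) < mean(validation), normal approximation."""
    se = (variance(suspect) / len(suspect)
          + variance(validation) / len(validation)) ** 0.5
    z = (mean(suspect) - mean(validation)) / se
    return NormalDist().cdf(z)


rng = random.Random(0)
n = 500

# Hypothetical loss-like scores; no set below was "trained on".
validation = [rng.gauss(3.0, 1.0) for _ in range(n)]
suspect_iid = [rng.gauss(3.0, 1.0) for _ in range(n)]      # same distribution
suspect_shifted = [rng.gauss(2.5, 1.0) for _ in range(n)]  # easier text, still unseen

p_iid = one_sided_p(suspect_iid, validation)
p_shifted = one_sided_p(suspect_shifted, validation)

print(f"IID suspect set:     p = {p_iid:.3g}")
print(f"shifted suspect set: p = {p_shifted:.3g}")
```

In this toy setup the shifted suspect set is flagged as "seen" purely because its text is easier for the model, which is the kind of false positive the paper attributes to violations of the IID assumption.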

If this is right

Statistical detection cannot yet replace transparent data provenance for confirming benchmark validity.
Methods must handle cases where suspect and validation sets violate the IID assumption.
Benchmark size limits the power of post-hoc membership inference at realistic scales.
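
The point about benchmark size and statistical power can be made concrete with a generic power calculation. This is a minimal sketch, assuming a one-sided two-sample z-test with equal group sizes and a small standardised effect of 0.05; that effect size is a hypothetical placeholder, and the calculation is not the Post-Hoc Dataset Inference method.

```python
from statistics import NormalDist


def power_one_sided(effect_size, n, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test with n items per group."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # With unit variance, the mean difference has standard error sqrt(2 / n).
    noncentrality = effect_size * (n / 2) ** 0.5
    return 1 - NormalDist().cdf(z_crit - noncentrality)


# Hypothetical membership signal: 0.05 standard deviations.
effect = 0.05
for n in (200, 2_000, 20_000, 100_000):
    print(f"n = {n:>7}: power = {power_one_sided(effect, n):.3f}")
```

With a weak per-example signal, a set of a few hundred examples gives the test little chance of rejecting the null, while corpus-scale sample sizes make detection almost certain. That gap is what "underpowered at benchmark scale" refers to.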

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Auditing pipelines could combine statistical signals with provenance logs to compensate for individual method weaknesses.
Specialized models in medicine or culture may require domain-specific calibration of detection thresholds.
Releasing the evaluation benchmark enables direct comparison of new detection algorithms against the identified failure cases.

Load-bearing premise

The assumed ground truth labels for whether contamination occurred in each of the 27 models are accurate enough to label the 335 detection outcomes as correct or incorrect.

What would settle it

Independent verification of actual training-data membership for the frontier models that contradicts the ground-truth contamination labels used to score the 335 outcomes.

Figures

Figures reproduced from arXiv: 2606.03305 by Jan Dubiński, Sebastian Cygert, Wojciech Zarzecki.

**Figure 1.** Summary of our evaluation of methods for detecting whether a model was trained on a benchmark, across three tasks: Task 1 evaluates vulnerability to limited reference data (…

**Figure 2.** CoDeC scores for Task 2. CoDeC contamination scores for benchmarks. Hatched bars indicate that the split was used as a training split. A trend across model sizes is visible; however, training splits do not receive higher scores. … provide clear evidence of split-level membership. CoDeC scores decrease consistently with model size, but they are indifferent to whether a given split of the benchmark was use…

**Figure 3.** Application of CoDeC and LLM Dataset Inference to industry mod…

**Figure 4.** CoDeC contamination scores on Pythia across multiple datasets. Evaluation-only benchmarks consistently receive lower scores than pre-training corpora, reproducing the separation reported in prior work. However, train and test splits within the same corpus yield nearly identical scores, indicating that CoDeC cannot distinguish split-level membership. CoDeC: CoDeC exhibits a different limitation than the D…

**Figure 5.** CoDeC scores for OLMo 2 (instruction-tuned) grouped by data…

Original abstract

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. We systematically evaluate three leading paradigms, LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC, across 25 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B). We then further extend our analysis to frontier industry models. Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper runs a broad test of three contamination detectors on 27 models and reports only 199/335 correct calls, but the result hinges on unverified ground-truth labels for the frontier models.

The core finding is that LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC produce correct outcomes in only 199 of 335 evaluations when applied to real benchmarks. The paper shows distribution shift produces false positives in one method, scale makes another underpowered, and the third gives only coarse signals. That pattern is new at this breadth; earlier work stayed in controlled academic settings with known training data.

They evaluate across Pythia, OLMo 2, cultural and medical models, and frontier industry ones, up to 27B parameters. The systematic comparison and the open-sourced benchmark are the useful parts. The counts are concrete and the failure modes are tied to specific conditions rather than vague warnings.

The soft spot is exactly the one in the stress-test note. For the 27 models, especially the industry frontier ones, the paper needs to show how it established whether contamination actually occurred. If those labels come from proxies or self-reports rather than direct evidence, the 199/335 split and the claims about method failure become circular. The abstract gives no details on this step, so the reliability-gap conclusion rests on an unexamined foundation.

This is for researchers who build or audit LLM benchmarks. It is worth a serious referee because it tests the tools people actually use and surfaces a practical problem, even with the ground-truth gap. The authors should be asked to document the labeling process in detail before any stronger claims.

Referee Report

2 major / 2 minor

Summary. The paper evaluates three statistical paradigms for detecting LLM benchmark contamination (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models spanning multiple families and scales up to 27B, with extension to frontier industry models. It reports that only 199 of 335 evaluations produce correct outcomes, attributing specific failure modes—false positives under distribution shift for LLM Dataset Inference, underpowered detection at benchmark scale for Post-Hoc, and only coarse provenance signals for CoDeC—and concludes that statistical detection cannot yet replace transparent data provenance.

Significance. If the empirical results hold, the work is significant for documenting a practical reliability gap between controlled validation settings and realistic auditing scenarios. Strengths include the systematic multi-method, multi-family, multi-scale evaluation (including industry models) and the open-sourcing of the benchmark, which supports reproducibility and follow-on work. The concrete counts (199/335) provide falsifiable, quantitative evidence rather than purely theoretical claims.

major comments (2)

[Methodology / Experimental Setup (ground-truth labeling subsection)] The central claim—that only 199/335 outcomes are correct and that specific failure modes exist—rests entirely on per-model ground-truth contamination labels for the 27 models (including frontier industry models). No section details the exact procedure, proxies, or heuristics used to establish these labels for models where training data membership cannot be directly observed; this is load-bearing because any systematic error in the labels would artifactually produce the reported false-positive rates and underpoweredness conclusions.
[Results (aggregate outcomes and per-method breakdowns)] Table or results section reporting the 199/335 aggregate: the manuscript provides no sensitivity analysis or alternative labelings to show how the headline count and per-method failure-mode attributions change under plausible variations in ground-truth assignment for the industry models.
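
The sensitivity analysis requested in the second major comment could take a shape like the following. This is a sketch on toy data only: the model names, verdicts, and labels are hypothetical placeholders and do not come from the paper. It enumerates every alternative labeling of the models whose ground truth is uncertain and recomputes the number of "correct" detector outcomes for each.

```python
from itertools import product

# Toy records: (model, detector verdict, assumed ground-truth label).
# All values are hypothetical and for illustration only.
records = [
    ("open_model_a", True, True),
    ("open_model_b", False, False),
    ("industry_model_x", True, False),
    ("industry_model_y", False, False),
]

# Models whose contamination label rests on proxies rather than direct evidence.
uncertain = ["industry_model_x", "industry_model_y"]


def count_correct(records, overrides):
    """Count verdicts that match the label, after applying label overrides."""
    return sum(
        verdict == overrides.get(model, label)
        for model, verdict, label in records
    )


baseline = count_correct(records, {})

# Recompute the count under every possible relabeling of the uncertain models.
counts = [
    count_correct(records, dict(zip(uncertain, labels)))
    for labels in product([False, True], repeat=len(uncertain))
]

print(f"baseline correct: {baseline} of {len(records)}")
print(f"range under relabeling: {min(counts)} to {max(counts)}")
```

Reporting the range alongside the headline count would show how much of the result depends on the uncertain labels. Exhaustive enumeration doubles with every uncertain model, so a larger set would need sampling or a restriction to the labelings considered plausible.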

minor comments (2)

[Abstract] Abstract states the 335 evaluations and 199 correct outcomes but does not define 'correct outcome' or reference the ground-truth procedure, reducing standalone clarity.
[Introduction / Background] Notation for the three methods is introduced without a consolidated table of acronyms and key assumptions, which would aid comparison across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and for highlighting these important methodological points. We agree that the ground-truth labeling procedures and robustness of the 199/335 aggregate require clearer documentation. We address each comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [Methodology / Experimental Setup (ground-truth labeling subsection)] The central claim—that only 199/335 outcomes are correct and that specific failure modes exist—rests entirely on per-model ground-truth contamination labels for the 27 models (including frontier industry models). No section details the exact procedure, proxies, or heuristics used to establish these labels for models where training data membership cannot be directly observed; this is load-bearing because any systematic error in the labels would artifactually produce the reported false-positive rates and underpoweredness conclusions.

Authors: We agree that explicit documentation of the labeling procedure is essential. For the open-weight models (Pythia, OLMo 2, and the specialised cultural/medical models), labels are derived directly from their publicly released training data documentation and pre-training corpus descriptions. For the frontier industry models, labels combine official model cards, stated training data cutoffs, and cross-references to independent contamination reports in the literature. In the revised manuscript we will add a dedicated subsection under Methodology that enumerates these sources, the decision rules applied to ambiguous cases, and any limitations of the proxies. This addition will make the load-bearing assumptions transparent to readers. revision: yes

Referee: [Results (aggregate outcomes and per-method breakdowns)] Table or results section reporting the 199/335 aggregate: the manuscript provides no sensitivity analysis or alternative labelings to show how the headline count and per-method failure-mode attributions change under plausible variations in ground-truth assignment for the industry models.

Authors: We concur that a sensitivity analysis would increase confidence in the headline count. Because alternative labelings for industry models would rest on additional untestable assumptions, a full quantitative sensitivity table is not feasible without introducing speculative scenarios. We will add a concise discussion in the Results section that (a) states the sources of label uncertainty for the industry models and (b) qualitatively assesses how plausible mislabelings would affect the reported failure-mode attributions. If the referee can suggest concrete alternative label sets, we will incorporate the corresponding quantitative checks. revision: partial

Circularity Check

0 steps flagged

Empirical evaluation study with external ground truth; no derivations or self-referential reductions

The paper performs an empirical audit of three contamination detection methods across 335 evaluations on 27 models, classifying outcomes as correct/incorrect by direct comparison to per-model ground truth labels on contamination status. No equations, fitted parameters, or derivations are present that could reduce predictions to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The central claims rest on external labels rather than self-referential quantities, satisfying the self-contained-against-external-benchmarks criterion for a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical benchmarking study with no mathematical derivations; relies on standard assumptions about contamination ground truth and IID violations.

axioms (1)

domain assumption: Ground-truth contamination status can be reliably determined for the evaluated models, including frontier industry models.
The classification of the 335 outcomes as correct or incorrect depends on this.
