Detecting Data Contamination in LLMs via In-Context Learning

Besmira Nushi; Klaudia Bałazy; Meriem Boubdir; Michał Zawalski; Pablo Ribalta

arxiv: 2510.27055 · v2 · submitted 2025-10-30 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 02:06 UTC · model grok-4.3

keywords: data contamination · large language models · in-context learning · memorization detection · benchmark evaluation · LLM training data

The pith

In-context examples boost LLM confidence on unseen data but often reduce it on contaminated training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoDeC, a method that detects training data contamination in large language models by tracking how in-context learning changes the model's confidence. Adding examples from a dataset usually raises confidence when the data is new to the model. The same examples tend to lower confidence when the data was memorized during training, because they interfere with the stored patterns. This difference yields scores that cleanly separate contaminated datasets from clean ones. The technique is useful because it lets evaluators check whether benchmark results reflect genuine capability or simply recall of test examples, and it requires no knowledge of the model's training set.

Core claim

CoDeC distinguishes data memorized during training from data outside the training distribution by measuring how in-context learning affects model performance. In-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora.

What carries the argument

The change in model confidence or performance caused by adding in-context examples from the evaluated dataset.
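
To make the signal concrete, the following Python sketch computes a per-example confidence shift and one possible dataset-level score. Everything in it is an assumption for illustration: the abstract does not say how confidence is measured or how the score is defined, so the sketch takes confidence to be the mean token log-probability of the target text, and the score to be the share of examples whose confidence falls once in-context examples are prepended. The function names and the toy log-probabilities are hypothetical; in practice the two inputs would come from scoring each example with and without context.

```python
from statistics import mean
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Confidence proxy (assumed): average per-token log-probability."""
    return mean(token_logprobs)


def confidence_shifts(
    baseline: Sequence[Sequence[float]],
    with_context: Sequence[Sequence[float]],
) -> List[float]:
    """Per-example confidence with context minus confidence without it."""
    if len(baseline) != len(with_context):
        raise ValueError("both inputs must cover the same examples")
    return [
        mean_logprob(ctx) - mean_logprob(base)
        for base, ctx in zip(baseline, with_context)
    ]


def drop_fraction(shifts: Sequence[float]) -> float:
    """Hypothetical score: share of examples whose confidence fell."""
    return sum(1 for s in shifts if s < 0) / len(shifts)


# Made-up token log-probabilities for three examples (not real model output).
baseline = [[-1.0, -2.0], [-0.5, -0.5], [-3.0, -1.0]]
with_context = [[-0.5, -1.5], [-1.0, -1.0], [-2.0, -1.0]]

shifts = confidence_shifts(baseline, with_context)
score = drop_fraction(shifts)  # one of the three examples lost confidence
```

On these made-up numbers only the second example loses confidence, so the sketch score is one third. Under the paper's hypothesis, a larger share of drops would point toward contamination.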

If this is right

Contamination scores become directly interpretable and separate seen from unseen data without extra labels.
Open-weight models with hidden training data show clear signs of memorization on standard benchmarks.
The method integrates easily into existing evaluation pipelines because it is automated and needs no model-specific changes.
Quantified contamination levels can be reported alongside benchmark accuracy numbers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same confidence-shift signal might help flag contamination in closed models if probability outputs are accessible.
Benchmark creators could use the method to audit and replace contaminated test sets before release.
Repeated application over time could track how contamination accumulates as new models are trained on growing web data.

Load-bearing premise

The observed drop in confidence for contaminated data is caused specifically by disruption of memorization patterns rather than other prompt-related effects or dataset properties.

What would settle it

Apply CoDeC to a model known to have been trained without any of the test datasets. If in-context examples raise confidence comparably across all of them, the premise survives; if confidence still drops on some of these known-clean datasets, the drop cannot be attributed to disrupted memorization.

Original abstract

We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

CoDeC tries to flag training contamination by checking whether in-context examples raise or lower model confidence, but the abstract gives no numbers or controls so the separation claim is still untested.

The core idea is that in-context examples usually increase confidence on fresh data but can decrease it on data the model has memorized, and that difference can be turned into a contamination score. That framing around differential ICL effects looks new compared with earlier memorization probes that rely on direct likelihood or membership inference. The approach is presented as lightweight and model-agnostic, which is a practical plus if it scales without extra training or data access. It targets a genuine pain point: benchmarks lose meaning once test sets leak into training corpora, and anything that can surface that without needing the original training data would be handy for the field.

The main weakness is the missing evidence. The abstract states that scores clearly separate seen from unseen sets, yet supplies no quantitative results, no description of how confidence is measured, no details on example selection or formatting, and no ablations that match dataset length, difficulty, or domain. Without those, it is hard to rule out that the confidence drop comes from prompt length or example relevance rather than from disrupted memorization. The stress-test note correctly flags this gap. Because only the abstract is available, the current support for the central claim is thin.

Readers who run LLM evaluations or build benchmarks would be the natural audience; they might pick up the basic recipe and test it themselves. The work shows clear thinking about the contamination problem and honest engagement with why existing checks fall short, so it is worth sending to peer review once the full experiments and controls are in place.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Contamination Detection via Context (CoDeC), a method to detect and quantify training data contamination in LLMs by measuring how in-context learning affects model performance. It claims that ICL typically boosts confidence for unseen datasets but may reduce it for contaminated data due to disrupted memorization patterns, yielding interpretable contamination scores that clearly separate seen and unseen datasets. The approach is described as simple, automated, and agnostic to both models and datasets, with potential to reveal memorization in open-weight models having undisclosed training corpora.

Significance. If the central claims hold after full validation, the work would address an important practical problem in LLM evaluation: detecting contamination that can inflate benchmark performance. A model- and dataset-agnostic method that relies only on observable behavioral differences under ICL could be readily integrated into existing evaluation pipelines and would be especially useful for auditing open models.

major comments (2)

[Abstract] The assertion that 'Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets' is presented without any quantitative results, error analysis, statistical tests, or validation details. Full experimental evidence is required to assess whether the claimed separation is actually achieved.
[Abstract] The load-bearing interpretation that any observed drop in confidence on contaminated data is specifically caused by 'disrupted memorization patterns' (rather than prompt length, example relevance, or other dataset properties) is not supported by described controls. No information is given on how confidence is quantified, how ICL examples are selected or formatted, what the no-ICL baseline performance is, or whether ablations match difficulty, length, and domain between seen and unseen sets.
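
One way to probe the prompt-length alternative raised in the second comment is a length-matched control: measure the confidence shift again with context drawn from an unrelated dataset of similar length, then subtract. The Python sketch below is a hypothetical illustration of that control, not a procedure taken from the paper. It assumes per-example confidence shifts (confidence with context minus confidence without) have already been computed, and it approximates token length by whitespace splitting; both helper names are invented here.

```python
from statistics import mean
from typing import Sequence


def token_len(text: str) -> int:
    """Rough length proxy (assumed): number of whitespace-separated tokens."""
    return len(text.split())


def length_matched_control(context: str, pool: Sequence[str]) -> str:
    """Pick the unrelated text whose length is closest to the real context."""
    target = token_len(context)
    return min(pool, key=lambda text: abs(token_len(text) - target))


def attributable_shift(
    same_dataset_shifts: Sequence[float],
    control_shifts: Sequence[float],
) -> float:
    """Mean shift under same-dataset context minus mean shift under control.

    A value near zero suggests the effect is explained by prompt length
    alone; a clearly negative value leaves room for a dataset-specific
    explanation such as memorization.
    """
    return mean(same_dataset_shifts) - mean(control_shifts)
```

For example, `length_matched_control("a b c d", ["x", "x y z", "p q r s t u v"])` selects `"x y z"`, the candidate closest to four tokens. Matching on difficulty and domain, which the comment also asks for, would need additional criteria beyond this sketch.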

minor comments (1)

[Abstract] The precise definition or formula for the 'contamination score' is not stated; adding a short formal description would improve clarity.
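
As an illustration of the kind of formal description this comment asks for, one plausible form is given below. It is an assumption made for this review, not the paper's definition: confidence is taken to be the mean token log-probability under model parameters $\theta$, the context $c_i$ is a set of other examples drawn from the same dataset $D$, and the score is the fraction of examples whose confidence falls when that context is added.

```latex
\[
\operatorname{conf}_\theta(x \mid c)
  = \frac{1}{|x|} \sum_{t=1}^{|x|} \log p_\theta\bigl(x_t \mid c,\, x_{<t}\bigr),
\qquad
\Delta_i
  = \operatorname{conf}_\theta(x_i \mid c_i)
  - \operatorname{conf}_\theta(x_i \mid \varnothing),
\]
\[
S(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \mathbf{1}\bigl[\Delta_i < 0\bigr],
\qquad c_i \subset D \setminus \{x_i\}.
\]
```

Under this hypothetical definition $S(D)$ lies in $[0, 1]$, and the abstract's claim would predict values near 0 for unseen datasets and markedly higher values for contaminated ones.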

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the abstract to provide additional quantitative highlights and methodological clarifications while preserving its brevity. Below we respond to each major comment.

Referee: [Abstract] The assertion that 'Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets' is presented without any quantitative results, error analysis, statistical tests, or validation details. Full experimental evidence is required to assess whether the claimed separation is actually achieved.

Authors: The abstract summarizes the paper's main findings, with full quantitative results, error analysis, statistical tests, and validation details presented in the Experiments section of the manuscript. To address the concern directly in the abstract, we have revised it to include a concise reference to key outcomes demonstrating clear separation (e.g., consistent performance differences across models and datasets with supporting statistical evidence). This change makes the claim more self-contained without exceeding abstract length limits.

revision: yes

Referee: [Abstract] The load-bearing interpretation that any observed drop in confidence on contaminated data is specifically caused by 'disrupted memorization patterns' (rather than prompt length, example relevance, or other dataset properties) is not supported by described controls. No information is given on how confidence is quantified, how ICL examples are selected or formatted, what the no-ICL baseline performance is, or whether ablations match difficulty, length, and domain between seen and unseen sets.

Authors: We agree the abstract provides only a high-level overview and does not describe these elements. The full manuscript details confidence quantification via output probabilities, ICL example selection and formatting procedures, no-ICL baselines, and ablations that explicitly match difficulty, length, and domain to isolate memorization effects. We have updated the abstract with a brief clause noting that controlled experiments support the memorization-disruption interpretation over alternative factors. This revision improves clarity while directing readers to the methodological sections for complete information.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical behavioral distinction without self-referential definitions or fitted reductions

The abstract describes CoDeC as an observational method that measures differential effects of in-context examples on model confidence for seen versus unseen datasets, attributing the pattern to memorization disruption. No equations, parameters, or derivation steps are provided that would reduce the contamination score to a tautological fit or self-definition. The separation of datasets is presented as an experimental outcome rather than a constructed equivalence, and no self-citations or ansatzes are invoked in the available text. The approach remains self-contained as a direct comparison of model behavior against external data splits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unelaborated assumption that confidence shifts reliably indicate memorization status.

pith-pipeline@v0.9.0 · 5639 in / 1026 out tokens · 39515 ms · 2026-05-18T02:06:26.732563+00:00 · methodology
