Detecting Data Contamination in LLMs via In-Context Learning
Pith reviewed 2026-05-18 02:06 UTC · model grok-4.3
The pith
In-context examples boost LLM confidence on unseen data but may reduce it on data the model saw during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoDeC distinguishes data memorized during training from data outside the training distribution by measuring how in-context learning affects model performance. In-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora.
What carries the argument
The change in model confidence or performance caused by adding in-context examples from the evaluated dataset.
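The signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's definition: the abstract does not state the score formula, so the sketch assumes the score is the fraction of samples whose mean token log-probability falls once in-context examples from the same dataset are prepended. The names `confidence_shift` and `contamination_score` and the numbers are hypothetical.

```python
from typing import Sequence


def confidence_shift(baseline_logprob: float, in_context_logprob: float) -> float:
    """Change in confidence when in-context examples are prepended."""
    return in_context_logprob - baseline_logprob


def contamination_score(
    baseline_logprobs: Sequence[float],
    in_context_logprobs: Sequence[float],
) -> float:
    """Fraction of samples whose confidence drops once context is added.

    Assumed definition for illustration only; not taken from the paper.
    """
    if len(baseline_logprobs) != len(in_context_logprobs):
        raise ValueError("both sequences must cover the same samples")
    if not baseline_logprobs:
        raise ValueError("need at least one sample")
    drops = sum(
        1
        for base, ctx in zip(baseline_logprobs, in_context_logprobs)
        if confidence_shift(base, ctx) < 0
    )
    return drops / len(baseline_logprobs)


# Hypothetical mean token log-probabilities for four samples.
baseline = [-2.0, -1.5, -3.0, -0.5]
with_context = [-1.0, -1.8, -2.5, -0.9]
score = contamination_score(baseline, with_context)
```

Under this assumed definition, a score near 0 means context almost always helps (the pattern reported for unseen data), while a score near 1 means context almost always hurts (the pattern reported for data seen in training).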
If this is right
- Contamination scores become directly interpretable and separate seen from unseen data without extra labels.
- Open-weight models with hidden training data show clear signs of memorization on standard benchmarks.
- The method integrates automatically into existing evaluation pipelines because it needs no model-specific changes.
- Quantified contamination levels can be reported alongside benchmark accuracy numbers.
Where Pith is reading between the lines
- The same confidence-shift signal might help flag contamination in closed models if probability outputs are accessible.
- Benchmark creators could use the method to audit and replace contaminated test sets before release.
- Repeated application over time could track how contamination accumulates as new models are trained on growing web data.
Load-bearing premise
The observed drop in confidence for contaminated data is caused specifically by disruption of memorization patterns rather than other prompt-related effects or dataset properties.
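One way to probe this premise is to compare the confidence shift from same-dataset context against the shift from a length-matched context drawn from an unrelated dataset. The sketch below assumes such a control is available; the abstract does not say whether the paper runs this comparison, and the function names and log-probabilities are hypothetical.

```python
from statistics import mean
from typing import Sequence


def mean_shift(baseline: Sequence[float], with_context: Sequence[float]) -> float:
    """Average change in log-probability after prepending context."""
    if len(baseline) != len(with_context) or not baseline:
        raise ValueError("inputs must be non-empty and cover the same samples")
    return mean(ctx - base for base, ctx in zip(baseline, with_context))


def memorization_specific_shift(
    baseline: Sequence[float],
    same_dataset_context: Sequence[float],
    control_context: Sequence[float],
) -> float:
    """Shift from same-dataset context minus shift from a length-matched control."""
    return mean_shift(baseline, same_dataset_context) - mean_shift(
        baseline, control_context
    )


# Hypothetical mean token log-probabilities for three samples.
baseline = [-1.0, -2.0, -1.5]
same_dataset = [-1.5, -2.5, -2.0]   # context taken from the evaluated dataset
control = [-1.25, -2.25, -1.75]     # unrelated context of similar length
gap = memorization_specific_shift(baseline, same_dataset, control)
```

A clearly negative gap would mean same-dataset context lowers confidence more than generic context of similar length, which is what the premise predicts for contaminated data. A gap near zero would point toward a prompt-level effect such as length.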
What would settle it
Apply CoDeC to a model verifiably trained without any of the test datasets. If in-context examples raise confidence on all of them, the premise survives; if confidence still drops on some of these clean datasets, the drop cannot be attributed to memorization.
Original abstract
We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Contamination Detection via Context (CoDeC), a method to detect and quantify training data contamination in LLMs by measuring how in-context learning affects model performance. It claims that ICL typically boosts confidence for unseen datasets but may reduce it for contaminated data due to disrupted memorization patterns, yielding interpretable contamination scores that clearly separate seen and unseen datasets. The approach is described as simple, automated, and agnostic to both models and datasets, with potential to reveal memorization in open-weight models having undisclosed training corpora.
Significance. If the central claims hold after full validation, the work would address an important practical problem in LLM evaluation: detecting contamination that can inflate benchmark performance. A model- and dataset-agnostic method that relies only on observable behavioral differences under ICL could be readily integrated into existing evaluation pipelines and would be especially useful for auditing open models.
major comments (2)
- [Abstract] The assertion that 'Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets' is presented without any quantitative results, error analysis, statistical tests, or validation details. Full experimental evidence is required to assess whether the claimed separation is actually achieved.
- [Abstract] The load-bearing interpretation that any observed drop in confidence on contaminated data is specifically caused by 'disrupted memorization patterns' (rather than prompt length, example relevance, or other dataset properties) is not supported by described controls. No information is given on how confidence is quantified, how ICL examples are selected or formatted, what the no-ICL baseline performance is, or whether ablations match difficulty, length, and domain between seen and unseen sets.
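The kind of error analysis requested in these comments could take the form of a percentile bootstrap over per-sample confidence shifts. The sketch below is one assumed approach, not something the abstract reports: it treats the score as the fraction of samples whose confidence drops under in-context examples, and the names `drop_fraction` and `bootstrap_interval` and the shift values are hypothetical.

```python
import random
from typing import Sequence, Tuple


def drop_fraction(shifts: Sequence[float]) -> float:
    """Fraction of samples whose confidence shift is negative."""
    return sum(1 for s in shifts if s < 0) / len(shifts)


def bootstrap_interval(
    shifts: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the drop fraction."""
    if not shifts:
        raise ValueError("need at least one shift")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(shifts)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        drop_fraction([shifts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Hypothetical per-sample shifts (log-probability with context minus without).
shifts = [0.4, -0.2, 0.1, 0.3, -0.1, 0.5, 0.2, 0.6]
point_estimate = drop_fraction(shifts)
low, high = bootstrap_interval(shifts)
```

If the intervals for a seen and an unseen dataset do not overlap, that would be one form of evidence for the claimed separation. With only a handful of samples, as in this toy input, the interval will be wide.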
minor comments (1)
- [Abstract] The precise definition or formula for the 'contamination score' is not stated; adding a short formal description would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the abstract to provide additional quantitative highlights and methodological clarifications while preserving its brevity. Below we respond to each major comment.
Point-by-point responses
- Referee: [Abstract] The assertion that 'Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets' is presented without any quantitative results, error analysis, statistical tests, or validation details. Full experimental evidence is required to assess whether the claimed separation is actually achieved.
  Authors: The abstract summarizes the paper's main findings, with full quantitative results, error analysis, statistical tests, and validation details presented in the Experiments section of the manuscript. To address the concern directly in the abstract, we have revised it to include a concise reference to key outcomes demonstrating clear separation (e.g., consistent performance differences across models and datasets with supporting statistical evidence). This change makes the claim more self-contained without exceeding abstract length limits.
  Revision: yes
- Referee: [Abstract] The load-bearing interpretation that any observed drop in confidence on contaminated data is specifically caused by 'disrupted memorization patterns' (rather than prompt length, example relevance, or other dataset properties) is not supported by described controls. No information is given on how confidence is quantified, how ICL examples are selected or formatted, what the no-ICL baseline performance is, or whether ablations match difficulty, length, and domain between seen and unseen sets.
  Authors: We agree the abstract provides only a high-level overview and does not describe these elements. The full manuscript details confidence quantification via output probabilities, ICL example selection and formatting procedures, no-ICL baselines, and ablations that explicitly match difficulty, length, and domain to isolate memorization effects. We have updated the abstract with a brief clause noting that controlled experiments support the memorization-disruption interpretation over alternative factors. This revision improves clarity while directing readers to the methodological sections for complete information.
  Revision: yes
Circularity Check
No circularity: empirical behavioral distinction without self-referential definitions or fitted reductions
Full rationale
The abstract describes CoDeC as an observational method that measures differential effects of in-context examples on model confidence for seen versus unseen datasets, attributing the pattern to memorization disruption. No equations, parameters, or derivation steps are provided that would reduce the contamination score to a tautological fit or self-definition. The separation of datasets is presented as an experimental outcome rather than a constructed equivalence, and no self-citations or ansatzes are invoked in the available text. The approach remains self-contained as a direct comparison of model behavior against external data splits.