LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories

Ali Payani; Haiyan Zhao; Mengnan Du; Yingcong Li; Zirui He

arxiv: 2509.20909 · v2 · submitted 2025-09-25 · 💻 cs.CL

LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories

Zirui He , Haiyan Zhao , Yingcong Li , Ali Payani , Mengnan Du This is my paper

Pith reviewed 2026-05-18 14:29 UTC · model grok-4.3

classification 💻 cs.CL

keywords benchmark contaminationlogit trajectoriesmemorization detectionlayerwise analysislarge language modelsevaluation auditing

0 comments

The pith

Contaminated benchmark examples commit to answers earlier across layerwise logit trajectories than clean examples do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that memorization from benchmark contamination leaves a detectable signature in how large language models build their answer preferences layer by layer. Contaminated examples tend to lock onto the correct answer quickly and stabilize, while clean examples accumulate evidence more gradually before committing. These trajectory differences persist across models and survive rephrasing of the input, which defeats most existing detection approaches based on string overlap or final-output likelihood. A lightweight classifier using the trajectory features can then separate the two categories reliably. If the claim holds, it supplies an internal-model signal for auditing whether strong benchmark scores reflect genuine reasoning or prior exposure to the test items.

Core claim

LogitTrace tracks the evolving logit values assigned to candidate answers as they pass through successive layers of the model. Contaminated examples display earlier commitment, with the target answer's logit rising sharply and stabilizing sooner, whereas clean examples show slower, more incremental growth consistent with step-by-step evidence accumulation. These patterns allow a simple classifier to distinguish contamination status across different models and input variants. Controlled experiments that inject target samples via LoRA reproduce the same early-commitment signatures, linking the observed trajectories to repeated exposure during training.

What carries the argument

Layerwise logit trajectories that record how preference logits for possible answers change from early to late layers, revealing the timing of decision stabilization.

If this is right

Benchmark scores on tasks such as AIME or Math500 can be audited for hidden contamination using signals available inside the model itself.
Detection continues to function on reworded or paraphrased questions that break surface-overlap and completion-based checks.
The same layerwise view supplies a new observable for studying how memorization alters internal computation rather than just final outputs.
Trajectory features could be monitored during evaluation to flag potential data leakage without requiring the original training corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interventions that slow early commitment on seen examples might reduce inflated performance on contaminated benchmarks.
The method could extend to detecting memorization induced by fine-tuning rather than pretraining exposure.
Combining trajectory signals with existing overlap-based detectors might raise overall reliability without added external data.

Load-bearing premise

The observed differences in trajectory timing are produced by memorization from contamination rather than by question difficulty, prompt format, or model-specific factors that happen to align with contamination labels.

What would settle it

Finding that clean but unusually difficult examples produce the same early-commitment trajectory shape as known contaminated ones, or that rephrasing alone alters trajectories independently of contamination status.

Figures

Figures reproduced from arXiv: 2509.20909 by Ali Payani, Haiyan Zhao, Mengnan Du, Yingcong Li, Zirui He.

**Figure 1.** Figure 1: These four examples illustrate the potential risk that memorization may masquerade as generalization. In (a), the model answers the original problem correctly, which is expected. In (b), after rephrasing, the model still provides the correct solution, appearing to generalize. In (c), however, even when the input is reduced to a minimal set of keywords, the model produces the correct answer together with a … view at source ↗

**Figure 2.** Figure 2: Class-mean activation heatmaps. Average channel activations across layers for clean vs. contaminated samples. Red regions denote stronger activation and blue denotes weaker activation. 0 5 10 15 20 25 Layer 10 1 10 0 10 1 10 2 10 3 0 10 3 10 2 10 1 10 0 Margin Mean activation trajectories of contaminated samples 0 5 10 15 20 25 Layer Mean activation trajectories of clean samples 0.0 0.5 1.0 1.5 2.0 Entropy… view at source ↗

**Figure 3.** Figure 3: Mean activation trajectories of correctly classified samples. Each panel shows layerwise dynamics at the answer position for contaminated (left) and clean (right) samples. ence answer. (ii) TS-Guessing (Deng et al., 2024): contamination is detected if the model can infer the gold answer through slot-filling or symbolic transformations, without requiring exact string match. These two belong to completion-d… view at source ↗

**Figure 4.** Figure 4: Layer-wise activations for one sample across LoRA ranks, with layers on the x-axis, 24 feature channels on the y-axis, and color showing normalized values. 0 5 10 15 20 25 Layer 0.0 0.2 0.4 0.6 0.8 1.0 Grad-CAM Clean Contaminated [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Class-mean Grad-CAM visualization comparing clean vs. contaminated samples. Qwen2.5-7B Llama3.1-8B 0 20 40 60 80 100 F1 Score (%) 68.7 81.4 95.2 64.5 75.8 88.9 layer=1/3 layer=1/2 layer=1 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 7.** Figure 7: Dataset composition and group-level performance comparison. Left: fraction of samples classified as seen-like (94.9%) versus not-seen (5.1%). Middle: Top-1 accuracy and MRR@6 for the two groups. Right: effect sizes (Seen–Not) with 95% confidence intervals. MMLU mathematics may derive from recalling previously seen problems rather than solving unseen ones from first principles. Our detector thus recovers hi… view at source ↗

**Figure 8.** Figure 8: Activation trajectories of the 10 most confidently classified samples by the discriminator trained on Qwen2.5-7B. Clean samples are shown on the left, while contaminated samples are shown on the right. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Activation trajectories of the 10 most confidently classified samples by the discriminator trained on Qwen2.5-Math-7B. Clean samples are shown on the left, while contaminated samples are shown on the right. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Activation trajectories of the 10 most confidently classified samples by the discriminator trained on Llama3.1-8B. Clean samples are shown on the left, while contaminated samples are shown on the right. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Activation trajectories of the 10 most confidently classified samples by the discriminator trained on Qwen3-8B. Clean samples are shown on the left, while contaminated samples are shown on the right. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

read the original abstract

Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, where benchmark contamination can make memorized solutions appear as genuine reasoning. Existing detection methods largely rely on surface overlap, completion behavior, or final-output likelihood, and often degrade when inputs are simply rephrased. In this paper, we propose LogitTrace(Layerwise Logit Trajectories), a framework for analyzing memorization-like decision dynamics through intermediate logit trajectories. Instead of judging memorization only from the final answer, LogitTrace examines how model preferences emerge and stabilize across layers. We find that contaminated examples tend to show earlier commitment, while clean examples exhibit more gradual evidence accumulation. These trajectory signals allow a lightweight classifier to separate contaminated and clean examples across multiple models and input variants. Controlled LoRA injection experiments further show that repeated exposure to target samples induces similar trajectory patterns. Overall, our results suggest that LogitTrace provides evidence beyond surface overlap and final-output confidence, offering a useful lens for studying memorization-like behavior in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LogitTrace tracks layerwise logit paths to flag contamination via earlier commitment, but the signal may track difficulty or formatting instead.

read the letter

The main takeaway is that contaminated examples appear to lock in their preferred answer earlier in the layer stack, while clean ones accumulate evidence more gradually, and a simple classifier can pick up on that difference across models and rephrasings. The LoRA injection runs add a useful check by showing that forcing exposure produces similar trajectory shapes. That combination is the clearest advance over prior overlap or final-likelihood checks. The paper does a decent job framing the problem and giving a concrete way to look inside the forward pass rather than just at the output. The controlled injection experiment is the strongest piece of evidence they present for linking the pattern to memorization. On the soft spots, the abstract and summary give almost no numbers, no error bars, and no indication that the clean and contaminated sets were matched on difficulty, length, or lexical features. Those variables are known to change how quickly models settle on an answer, so the separation could be driven by them rather than contamination. Without stratified ablations or regression controls, it is hard to tell how much of the signal is real. The stress-test concern about confounds looks like it still applies. This is aimed at groups that maintain or evaluate math and reasoning benchmarks and want another tool against leakage. A reader already working on contamination detection would get some practical ideas from the trajectory view, even if the current results are preliminary. I would send it to peer review. The core observation is worth referee time and the LoRA setup shows honest effort, but the paper will need tighter controls and quantitative reporting before the claim can be trusted.

Referee Report

2 major / 2 minor

Summary. The paper introduces LogitTrace, a framework for detecting benchmark contamination in LLMs by analyzing layerwise logit trajectories rather than final outputs or surface overlap. It claims that contaminated examples exhibit earlier commitment to answers across layers, while clean examples show more gradual evidence accumulation; these signals enable a lightweight classifier to separate the two across models and input variants. Controlled LoRA injection experiments are presented as further evidence that repeated exposure induces similar trajectory patterns.

Significance. If the central claim holds after controls, the work offers a potentially useful internal-model lens on memorization that could complement existing detection methods and improve evaluation reliability on benchmarks such as AIME and Math500. The LoRA injection component provides a controlled causal test that is a methodological strength.

major comments (2)

[Experimental Setup / Abstract] The central claim requires that trajectory differences are caused by contamination-induced memorization. However, the manuscript provides no indication that contaminated and clean examples are matched or controlled for question difficulty, length, lexical features, or prompt formatting (see skeptic note and abstract description of experiments). These factors independently modulate evidence accumulation speed; without regression controls, propensity matching, or difficulty-stratified ablations, the classifier separation and LoRA results risk being spurious.
[Abstract / Results] The abstract states that a classifier works and that LoRA reproduces the pattern, yet reports no quantitative results, error bars, dataset sizes, accuracy metrics, or statistical tests. This absence makes it impossible to assess effect sizes, reproducibility, or whether post-hoc choices affect the separation claim.

minor comments (2)

[Method] Clarify the precise operational definition of 'earlier commitment' (e.g., the layer index at which the target logit first exceeds a threshold or the rate of logit stabilization).
[Introduction] Add a short related-work paragraph situating LogitTrace against prior internal-representation analyses of memorization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor that we address below. We have revised the manuscript to incorporate additional controls and quantitative reporting as outlined in the point-by-point responses.

read point-by-point responses

Referee: [Experimental Setup / Abstract] The central claim requires that trajectory differences are caused by contamination-induced memorization. However, the manuscript provides no indication that contaminated and clean examples are matched or controlled for question difficulty, length, lexical features, or prompt formatting (see skeptic note and abstract description of experiments). These factors independently modulate evidence accumulation speed; without regression controls, propensity matching, or difficulty-stratified ablations, the classifier separation and LoRA results risk being spurious.

Authors: We agree that explicit controls for question difficulty, length, lexical features, and prompt formatting are necessary to strengthen the attribution of trajectory differences to contamination. Our original experiments drew contaminated and clean examples from the same benchmark distributions to reduce some baseline differences, but we did not report formal matching or regression adjustments. In the revision we will add difficulty-stratified ablations (binning by proxy difficulty metrics such as average model performance on held-out similar items), propensity-score matching on length and lexical overlap, and regression controls that partial out these covariates when evaluating classifier separation and LoRA-induced patterns. These additions will directly address the concern that the observed signals could be confounded. revision: yes
Referee: [Abstract / Results] The abstract states that a classifier works and that LoRA reproduces the pattern, yet reports no quantitative results, error bars, dataset sizes, accuracy metrics, or statistical tests. This absence makes it impossible to assess effect sizes, reproducibility, or whether post-hoc choices affect the separation claim.

Authors: The referee is correct that the abstract and high-level results summary omitted concrete metrics. The full manuscript contains the underlying experiments, but these details were not elevated to the abstract. We will revise the abstract to report dataset sizes (number of contaminated and clean examples per model), classifier accuracy with standard deviations across multiple runs or folds, and statistical significance tests comparing trajectory features between conditions. Error bars will be added to figures, and a summary table of quantitative results will be included in the main text so that effect sizes and reproducibility can be evaluated directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity in LogitTrace derivation chain

full rationale

The paper's central method observes empirical differences in layerwise logit trajectories (earlier commitment for contaminated examples vs. gradual accumulation for clean ones) and trains a lightweight classifier on these signals, with supporting LoRA injection experiments to induce similar patterns via repeated exposure. No derivation step reduces by construction to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The approach relies on observable model behaviors across models and input variants rather than tautological inputs, making the chain self-contained against external contamination benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that logit trajectories encode memorization signals independently of surface features. No explicit free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Differences in layerwise logit trajectories reliably indicate memorization versus genuine reasoning.
This premise is invoked when the authors interpret earlier commitment as evidence of contamination.

pith-pipeline@v0.9.0 · 5720 in / 1303 out tokens · 32604 ms · 2026-05-18T14:29:09.756115+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

contaminated samples tend to lock onto an answer with high confidence much earlier in the forward pass, whereas clean samples exhibit more gradual evidence accumulation
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We use the logit lens to project intermediate hidden states into the output space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

[1]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[2]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, et al. Extracting training data from large language models. In USENIX Security Symposium, 2021

work page 2021
[3]

Quantifying memorization across neural language models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2023

work page 2023
[4]

Investigating data contamination in modern benchmarks for large language models

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

work page 2024
[5]

Generalization or memorization? data contamination and trustworthy evaluation for large language models

Yiheng Dong, Yuxin Guo, Yuxiang Wang, Yinpeng Chen, Xiaohui Sun, Lei Zhang, and Heng Ji. Generalization or memorization? data contamination and trustworthy evaluation for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, 2024

work page 2024
[6]

Duarte, Xuandong Zhao, Arlindo L

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, and Lei Li. DE-COP : Detecting copyrighted content in language models training data, 2024

work page 2024
[7]

Data contamination quiz: A tool to detect and estimate contamination in large language models, 2023

Shahriar Golchin and Mihai Surdeanu. Data contamination quiz: A tool to detect and estimate contamination in large language models, 2023

work page 2023
[8]

Time travel in LLMs : Tracing data contamination in large language models

Shahriar Golchin and Mihai Surdeanu. Time travel in LLMs : Tracing data contamination in large language models. In Proceedings of the 2024 International Conference on Learning Representations (ICLR), 2024

work page 2024
[9]

Copyright violations and large language models

Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders S gaard. Copyright violations and large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

work page 2023
[10]

Memorization or reasoning? exploring the idiom understanding of llms

Jisu Kim, Youngwoo Shin, Uiji Hwang, Jihun Choi, Richeng Xuan, and Taeuk Kim. Memorization or reasoning? exploring the idiom understanding of llms. arXiv preprint arXiv:2505.16216, 2025

work page arXiv 2025
[11]

Overestimation in llm evaluation: A controlled large-scale study on data contamination’s impact on machine translation, 2025

Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch, Jiaming Luo, Colin Cherry, and Markus Freitag. Overestimation in llm evaluation: A controlled large-scale study on data contamination’s impact on machine translation, 2025

work page 2025
[12]

Memorization vs

Aochong Oliver Li and Tanya Goyal. Memorization vs. reasoning: Updating llms with new knowledge. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

work page 2025
[13]

Task contamination: Language models may not be few-shot anymore, 2023

Changmao Li and Jeffrey Flanigan. Task contamination: Language models may not be few-shot anymore, 2023

work page 2023
[14]

Lost in the Middle: How Language Models Use Long Contexts

Xuefeng Li, Jason Wei, Denny Zhou, et al. Evaluating verbal reasoning in language models. arXiv preprint arXiv:2307.03172, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Estimating contamination via perplexity: Quantifying memorisation in language model evaluation

Yucheng Li et al. Estimating contamination via perplexity: Quantifying memorisation in language model evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 b

work page 2023
[16]

Feder Cooper, Daphne Ippolito, Christopher A

Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tram \`e r, and Katherine Lee. Scalable extraction of training data from aligned, production language models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[17]

Interpreting GPT : The logit lens, 2020

Nostalgebraist. Interpreting GPT : The logit lens, 2020. Online blog post

work page 2020
[18]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Lipton, and J

Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary C. Lipton, and J. Zico Kolter. Memorization with compression: Rethinking llm memorization through the lens of adversarial compression. In Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[20]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp.\ 618--626, 2017

work page 2017
[21]

Detecting pretraining data from large language models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[22]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Dice: Detecting in-distribution contamination in llm's fine-tuning phase for math reasoning

Shangqing Tu, Kejian Zhu, Yushi Bai, Zijun Yao, Lei Hou, and Juanzi Li. Dice: Detecting in-distribution contamination in llm's fine-tuning phase for math reasoning. arXiv preprint arXiv:2406.04197, 2024

work page arXiv 2024
[24]

Reasoning or memorization? unreliable results of reinforcement learning due to data contamination

Yifei Wu, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Reasoning or memorization? unreliable results of reinforcement learning due to data contamination. arXiv preprint arXiv:2503.04683, 2025

work page arXiv 2025
[25]

Do llms know to respect copyright notice? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Jialiang Xu, Shenglan Li, Zhaozhuo Xu, and Denghui Zhang. Do llms know to respect copyright notice? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024
[26]

arXiv preprint arXiv:2311.04850 , year =

Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850, 2023

work page arXiv 2023
[27]

Data contamination can cross language barriers

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, and Jingbo Shang. Data contamination can cross language barriers. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024
[28]

Data contamination calibration for black-box llms

Wentao Ye, Jiaqi Hu, Liyao Li, Haobo Wang, Gang Chen, and Junbo Zhao. Data contamination calibration for black-box llms. In Findings of the Association for Computational Linguistics: ACL 2024, 2024

work page 2024
[29]

Codeipprompt: Intellectual property infringement assessment of code language models

Zhiyuan Yu, Yuhao Wu, Ning Zhang, Chenguang Wang, Yevgeniy Vorobeychik, and Chaowei Xiao. Codeipprompt: Intellectual property infringement assessment of code language models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.\ 40373--40389. PMLR, 2023

work page 2023
[30]

PaCoST : Paired confidence significance testing for benchmark contamination detection in large language models

Huixuan Zhang, Yun Lin, and Xiaojun Wan. PaCoST : Paired confidence significance testing for benchmark contamination detection in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024 a

work page 2024
[31]

Pre-training data detection for large language models: A divergence-based calibration method

Weichao Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. Pre-training data detection for large language models: A divergence-based calibration method. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 b

work page 2024
[32]

Measuring copyright risks of large language models via partial information probing

Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, and Denghui Zhang. Measuring copyright risks of large language models via partial information probing. In Proceedings of the International Conference on Data-Centric AI (DCAI), 2024

work page 2024
[33]

Don't make your llm an evaluation benchmark cheater

Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. Don't make your llm an evaluation benchmark cheater. arXiv preprint arXiv:2311.01964, 2023

work page arXiv 2023
[34]

Inference-time decontamination: Reusing leaked benchmarks for large language model evaluation

Qin Zhu, Qinyuan Cheng, Runyu Peng, Xiaonan Li, Ru Peng, Tengxiao Liu, Xipeng Qiu, and Xuanjing Huang. Inference-time decontamination: Reusing leaked benchmarks for large language model evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024 a

work page 2024
[35]

Clean-eval: Clean evaluation on contaminated large language models

Wenhong Zhu, Hongkun Hao, Zhiwei He, Yunze Song, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, and Hongyuan Lu. Clean-eval: Clean evaluation on contaminated large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024 b

work page 2024
[36]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[37]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[38]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[39]

b/G5O6 m= 'BV ۜ i?漠7S 1ɞ/ps>Yem'>y1 R a AV

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page arXiv 2046

[1] [1]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020

[2] [2]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, et al. Extracting training data from large language models. In USENIX Security Symposium, 2021

work page 2021

[3] [3]

Quantifying memorization across neural language models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2023

work page 2023

[4] [4]

Investigating data contamination in modern benchmarks for large language models

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

work page 2024

[5] [5]

Generalization or memorization? data contamination and trustworthy evaluation for large language models

Yiheng Dong, Yuxin Guo, Yuxiang Wang, Yinpeng Chen, Xiaohui Sun, Lei Zhang, and Heng Ji. Generalization or memorization? data contamination and trustworthy evaluation for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, 2024

work page 2024

[6] [6]

Duarte, Xuandong Zhao, Arlindo L

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, and Lei Li. DE-COP : Detecting copyrighted content in language models training data, 2024

work page 2024

[7] [7]

Data contamination quiz: A tool to detect and estimate contamination in large language models, 2023

Shahriar Golchin and Mihai Surdeanu. Data contamination quiz: A tool to detect and estimate contamination in large language models, 2023

work page 2023

[8] [8]

Time travel in LLMs : Tracing data contamination in large language models

Shahriar Golchin and Mihai Surdeanu. Time travel in LLMs : Tracing data contamination in large language models. In Proceedings of the 2024 International Conference on Learning Representations (ICLR), 2024

work page 2024

[9] [9]

Copyright violations and large language models

Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders S gaard. Copyright violations and large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

work page 2023

[10] [10]

Memorization or reasoning? exploring the idiom understanding of llms

Jisu Kim, Youngwoo Shin, Uiji Hwang, Jihun Choi, Richeng Xuan, and Taeuk Kim. Memorization or reasoning? exploring the idiom understanding of llms. arXiv preprint arXiv:2505.16216, 2025

work page arXiv 2025

[11] [11]

Overestimation in llm evaluation: A controlled large-scale study on data contamination’s impact on machine translation, 2025

Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch, Jiaming Luo, Colin Cherry, and Markus Freitag. Overestimation in llm evaluation: A controlled large-scale study on data contamination’s impact on machine translation, 2025

work page 2025

[12] [12]

Memorization vs

Aochong Oliver Li and Tanya Goyal. Memorization vs. reasoning: Updating llms with new knowledge. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

work page 2025

[13] [13]

Task contamination: Language models may not be few-shot anymore, 2023

Changmao Li and Jeffrey Flanigan. Task contamination: Language models may not be few-shot anymore, 2023

work page 2023

[14] [14]

Lost in the Middle: How Language Models Use Long Contexts

Xuefeng Li, Jason Wei, Denny Zhou, et al. Evaluating verbal reasoning in language models. arXiv preprint arXiv:2307.03172, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Estimating contamination via perplexity: Quantifying memorisation in language model evaluation

Yucheng Li et al. Estimating contamination via perplexity: Quantifying memorisation in language model evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 b

work page 2023

[16] [16]

Feder Cooper, Daphne Ippolito, Christopher A

Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tram \`e r, and Katherine Lee. Scalable extraction of training data from aligned, production language models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025

[17] [17]

Interpreting GPT : The logit lens, 2020

Nostalgebraist. Interpreting GPT : The logit lens, 2020. Online blog post

work page 2020

[18] [18]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Lipton, and J

Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary C. Lipton, and J. Zico Kolter. Memorization with compression: Rethinking llm memorization through the lens of adversarial compression. In Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[20] [20]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp.\ 618--626, 2017

work page 2017

[21] [21]

Detecting pretraining data from large language models

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. In The Twelfth International Conference on Learning Representations, 2024

work page 2024

[22] [22]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Dice: Detecting in-distribution contamination in llm's fine-tuning phase for math reasoning

Shangqing Tu, Kejian Zhu, Yushi Bai, Zijun Yao, Lei Hou, and Juanzi Li. Dice: Detecting in-distribution contamination in llm's fine-tuning phase for math reasoning. arXiv preprint arXiv:2406.04197, 2024

work page arXiv 2024

[24] [24]

Reasoning or memorization? unreliable results of reinforcement learning due to data contamination

Yifei Wu, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Reasoning or memorization? unreliable results of reinforcement learning due to data contamination. arXiv preprint arXiv:2503.04683, 2025

work page arXiv 2025

[25] [25]

Do llms know to respect copyright notice? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Jialiang Xu, Shenglan Li, Zhaozhuo Xu, and Denghui Zhang. Do llms know to respect copyright notice? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024

[26] [26]

arXiv preprint arXiv:2311.04850 , year =

Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850, 2023

work page arXiv 2023

[27] [27]

Data contamination can cross language barriers

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, and Jingbo Shang. Data contamination can cross language barriers. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024

[28] [28]

Data contamination calibration for black-box llms

Wentao Ye, Jiaqi Hu, Liyao Li, Haobo Wang, Gang Chen, and Junbo Zhao. Data contamination calibration for black-box llms. In Findings of the Association for Computational Linguistics: ACL 2024, 2024

work page 2024

[29] [29]

Codeipprompt: Intellectual property infringement assessment of code language models

Zhiyuan Yu, Yuhao Wu, Ning Zhang, Chenguang Wang, Yevgeniy Vorobeychik, and Chaowei Xiao. Codeipprompt: Intellectual property infringement assessment of code language models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.\ 40373--40389. PMLR, 2023

work page 2023

[30] [30]

PaCoST : Paired confidence significance testing for benchmark contamination detection in large language models

Huixuan Zhang, Yun Lin, and Xiaojun Wan. PaCoST : Paired confidence significance testing for benchmark contamination detection in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024 a

work page 2024

[31] [31]

Pre-training data detection for large language models: A divergence-based calibration method

Weichao Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. Pre-training data detection for large language models: A divergence-based calibration method. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 b

work page 2024

[32] [32]

Measuring copyright risks of large language models via partial information probing

Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, and Denghui Zhang. Measuring copyright risks of large language models via partial information probing. In Proceedings of the International Conference on Data-Centric AI (DCAI), 2024

work page 2024

[33] [33]

Don't make your llm an evaluation benchmark cheater

Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. Don't make your llm an evaluation benchmark cheater. arXiv preprint arXiv:2311.01964, 2023

work page arXiv 2023

[34] [34]

Inference-time decontamination: Reusing leaked benchmarks for large language model evaluation

Qin Zhu, Qinyuan Cheng, Runyu Peng, Xiaonan Li, Ru Peng, Tengxiao Liu, Xipeng Qiu, and Xuanjing Huang. Inference-time decontamination: Reusing leaked benchmarks for large language model evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024 a

work page 2024

[35] [35]

Clean-eval: Clean evaluation on contaminated large language models

Wenhong Zhu, Hongkun Hao, Zhiwei He, Yunze Song, Yumeng Zhang, Hanxu Hu, Yiran Wei, Rui Wang, and Hongyuan Lu. Clean-eval: Clean evaluation on contaminated large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024 b

work page 2024

[36] [36]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[37] [37]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page

[38] [38]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page

[39] [39]

b/G5O6 m= 'BV ۜ i?漠7S 1ɞ/ps>Yem'>y1 R a AV

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page arXiv 2046