
arxiv: 2605.21856 · v1 · pith:VZIPYCJKnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

Pith reviewed 2026-05-22 08:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords data contamination · LLM evaluation · chain-of-thought · memorization detection · benchmark integrity · zero-shot probing · evasive contamination

The pith

Truncating chain-of-thought reasoning exposes hidden memorization from contaminated training data in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the reasoning steps an LLM generates can mask direct recall of benchmark examples seen during training. Cutting off all chain-of-thought output at the start forces the model to answer from immediate input-output mappings rather than step-by-step logic. The method then compares the model's accuracy on the original benchmark with its accuracy on an isomorphically altered copy of the same tasks. A sizable gap signals that performance on the original rests on memorized shortcuts rather than general reasoning. The gap remains visible even when the contamination arrives through paraphrased or reworded versions of the benchmark.
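The truncation step can be pictured with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: the helper names `truncate_cot` and `build_probe_prompt`, the whitespace tokenization, and the "Final answer:" cue are all hypothetical, and the actual prompt template may differ.

```python
def truncate_cot(chain: str, keep_fraction: float) -> str:
    """Keep the leading share of a reasoning chain, measured in whitespace tokens."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    tokens = chain.split()
    keep = int(len(tokens) * keep_fraction)  # floor, so 0.0 keeps nothing
    return " ".join(tokens[:keep])


def build_probe_prompt(question: str, chain: str, keep_fraction: float,
                       answer_cue: str = "Final answer:") -> str:
    """Join the question, the truncated reasoning, and a cue that asks for an answer."""
    kept = truncate_cot(chain, keep_fraction)
    parts = [question.strip()]
    if kept:
        parts.append(kept)
    # Hypothetical cue (assumption): it stands in for whatever template
    # forces the model to answer immediately.
    parts.append(answer_cue)
    return "\n".join(parts)


# keep_fraction=0.0 corresponds to the zero-CoT setting: no reasoning tokens at all.
zero_cot_prompt = build_probe_prompt("What is 12 * 7?", "12 * 7 is 84 .", 0.0)
```

With `keep_fraction=1.0` the same helper reproduces a full-CoT prompt, so intermediate values give the kind of sweep over CoT percentages that Figure 2 describes.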

Core claim

A model's generated reasoning steps actively mask its underlying memorization. The Zero-CoT Probe detects both direct and evasive contamination by truncating the entire Chain-of-Thought process and comparing zero-CoT accuracy on the original benchmark against an isomorphically perturbed reference dataset. The resulting performance gap isolates memorization effects, and a new Contamination Confidence metric quantifies both the likelihood and severity of the leakage.

What carries the argument

Zero-CoT Probe (ZCP), which removes all chain-of-thought tokens and contrasts zero-shot performance on the original benchmark with performance on an isomorphically perturbed reference set to isolate shortcut memorization.
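The comparison at the heart of ZCP reduces to a difference of two accuracies. The following is a minimal sketch, assuming exact-match scoring on string answers; the function names `accuracy` and `zero_cot_gap` are hypothetical, and the paper's own scoring rule and its Contamination Confidence metric are not reproduced here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy after trimming surrounding whitespace (assumed scoring rule)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one item is required")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


def zero_cot_gap(orig_preds: Sequence[str], orig_answers: Sequence[str],
                 ref_preds: Sequence[str], ref_answers: Sequence[str]) -> float:
    """Zero-CoT accuracy on the original benchmark minus accuracy on the perturbed reference."""
    return accuracy(orig_preds, orig_answers) - accuracy(ref_preds, ref_answers)


# Toy data: every original item is answered correctly without reasoning,
# but only one of four perturbed items is.
gap = zero_cot_gap(["1", "2", "3", "4"], ["1", "2", "3", "4"],
                   ["1", "0", "0", "0"], ["1", "2", "3", "4"])
```

A gap near zero is what one would expect if the model solves both sets with the same ability; a large positive gap is the signal the probe attributes to memorized shortcuts.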

Load-bearing premise

An isomorphically perturbed copy of the benchmark stays free of contamination and keeps the same difficulty and statistical distribution as the original.

What would settle it

A model known to have been trained only on clean data produces nearly identical zero-CoT accuracy on the original benchmark and its perturbed counterpart, while a deliberately contaminated model shows no measurable gap.
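One way to make "no measurable gap" concrete is a paired bootstrap interval on the per-item accuracy difference. The sketch below is an illustration only: it assumes each original item has an aligned perturbed counterpart, and the function name `bootstrap_gap_ci`, the resample count, and the 95% level are hypothetical choices rather than the paper's procedure.

```python
import random
from typing import Sequence, Tuple


def bootstrap_gap_ci(orig_correct: Sequence[int], ref_correct: Sequence[int],
                     n_resamples: int = 2000, alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(orig_correct) - mean(ref_correct).

    Inputs are 0/1 correctness flags; item i of both lists must refer to the
    same underlying task (assumption: a one-to-one isomorphic pairing).
    """
    if len(orig_correct) != len(ref_correct) or not orig_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(orig_correct)
    diffs = [o - r for o, r in zip(orig_correct, ref_correct)]
    gaps = []
    for _ in range(n_resamples):
        # Resample paired items with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(sample) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Under this reading, an interval that contains 0 is consistent with no gap, while an interval that sits clearly above 0 indicates that the original benchmark is answered more accurately than its perturbed copy.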

Figures

Figures reproduced from arXiv: 2605.21856 by Hanyu Wang, Jinghui Chen, Lu Lin, Yifan Lan, Yuanpu Cao.

Figure 1
Figure 1: Reasoning masks data contamination. Under Full-CoT (Top), memorization is indistinguishable from genuine reasoning. Our Zero-CoT Probe (Bottom) forces the model to bypass intermediate reasoning. Consequently, the model fails on clean questions but still correctly answers contaminated ones via a learned shortcut mapping, thereby exposing the memorization.
Figure 2
Figure 2: The accuracy gap (∆) between contaminated and clean questions across varying CoT percentages. As the reasoning chain is systematically omitted, the gap widens drastically. In standard generation processes, LLMs solve complex problems via a Full-CoT (default) generation setting. Given an input question $x_i$, the model first generates an intermediate reasoning chain $\hat{c}_i$, and then produces the final answer $\hat{y}$…
Figure 3
Figure 3: The automated multi-model pipeline for constructing the reference dataset.
Figure 4
Figure 4: The influence of dataset size on detection stability across different metrics.
Figure 5
Figure 5: Contamination Confidence ($C_{\text{cont}}$) across different models and metrics on GSM8K and MATH-500. The red dashed line denotes the clean baseline (0.5). Missing bars for GPT models indicate the unavailability of logit-based metrics ($P_{\text{first}}$ and $P_{\text{all}}$).
  • High Stability, Highest Access (Logit-based metrics): Both $P_{\text{first}}$ and $P_{\text{all}}$ achieve high contamination confidence ($C_{\text{cont}} > 0.94$) with as few as 50 to 100 samples. …
Original abstract

Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model's intrinsic problem-solving capabilities, ZCP compares the model's zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at https://github.com/Yifan-Lan/zero-cot-probe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs' generated reasoning steps can mask underlying memorization from data contamination, including evasive strategies such as paraphrasing. It introduces the Zero-CoT Probe (ZCP), which truncates the full Chain-of-Thought process and detects contamination by comparing zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset; it also defines a Contamination Confidence metric. Experiments on previously identified contaminated models and specially fine-tuned ones are said to show robust detection of both direct and evasive contamination.

Significance. If the central isolation claim holds after proper controls, ZCP would offer a practical black-box tool for identifying stealthy contamination that evades existing detectors, strengthening the validity of LLM benchmark evaluations. The core idea of deliberately truncating CoT to surface latent shortcuts is a targeted and potentially useful contribution.

major comments (2)
  1. [§3] §3 (method description): the paper provides no concrete details on how the isomorphically perturbed reference datasets are constructed (perturbation rules, surface-statistic preservation, or isomorphism criteria). Without these, it is impossible to verify that observed performance gaps isolate memorization rather than perturbation-induced difficulty or distribution shifts, which directly undermines the central claim.
  2. [§4] §4 (experiments): no statistical controls, human difficulty ratings, or results on clean models are reported to confirm that the perturbed sets preserve task difficulty and distribution. The abstract's claim that gaps 'isolate memorization' therefore rests on an untested assumption; this is load-bearing for the detection results on both known and fine-tuned contaminated models.
minor comments (2)
  1. [Abstract] The abstract and method sections use 'isomorphically perturbed' without a precise operational definition or example; adding one would improve clarity.
  2. Figure or table captions describing the perturbed datasets should explicitly state the perturbation procedure and any equivalence checks performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity and validation that we address below. We have revised the manuscript to strengthen the presentation of the method and experimental controls.

Point-by-point responses
  1. Referee: [§3] §3 (method description): the paper provides no concrete details on how the isomorphically perturbed reference datasets are constructed (perturbation rules, surface-statistic preservation, or isomorphism criteria). Without these, it is impossible to verify that observed performance gaps isolate memorization rather than perturbation-induced difficulty or distribution shifts, which directly undermines the central claim.

    Authors: We agree that explicit details on dataset construction are necessary for reproducibility and to support the isolation claim. In the revised manuscript we have expanded Section 3 with a dedicated subsection that specifies the perturbation procedure: problems are paraphrased by replacing surface lexical items and syntactic frames while preserving the underlying logical structure, variable bindings, and answer format (ensuring isomorphism). Surface statistics are controlled by matching sentence length, token-frequency histograms, and parse-tree depth distributions between original and perturbed versions. These rules are now stated with pseudocode and concrete examples drawn from the benchmarks used. revision: yes

  2. Referee: [§4] §4 (experiments): no statistical controls, human difficulty ratings, or results on clean models are reported to confirm that the perturbed sets preserve task difficulty and distribution. The abstract's claim that gaps 'isolate memorization' therefore rests on an untested assumption; this is load-bearing for the detection results on both known and fine-tuned contaminated models.

    Authors: We accept that additional validation strengthens the central claim. The revised Section 4 now reports (i) zero-CoT results on three clean models (Llama-2-7B, Mistral-7B, and a randomly initialized transformer) showing negligible performance gaps between original and perturbed sets, (ii) paired t-tests and effect-size calculations confirming that gaps in contaminated models are statistically significant (p < 0.01) while gaps in clean models are not, and (iii) an analysis of model confidence and answer consistency that serves as a proxy for preserved task difficulty. A full human difficulty rating study was not feasible within the revision timeline; we therefore rely on the statistical and clean-model controls above, which directly test the assumption that perturbations do not systematically alter difficulty. revision: partial
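The paired test and effect size mentioned in the second response can be sketched with the standard library. This is a hedged illustration: the function name `paired_t_and_effect_size` is hypothetical, and it assumes the paired observations are per-item correctness flags (or per-run accuracies) on the original and perturbed sets, which the rebuttal does not specify.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_and_effect_size(orig_scores: Sequence[float],
                             ref_scores: Sequence[float]) -> Tuple[float, float]:
    """Return the paired t statistic and Cohen's d_z for orig_scores - ref_scores."""
    if len(orig_scores) != len(ref_scores) or len(orig_scores) < 2:
        raise ValueError("need at least two paired observations of equal length")
    diffs = [o - r for o, r in zip(orig_scores, ref_scores)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    cohen_dz = mean_diff / sd_diff  # standardized mean of the paired differences
    return t_stat, cohen_dz


# Toy paired outcomes: 1 = correct, 0 = incorrect.
t_stat, cohen_dz = paired_t_and_effect_size([1, 1, 1, 0], [0, 0, 1, 0])
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom; a library routine such as `scipy.stats.ttest_rel` computes both in one call.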

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The ZCP method measures an empirical performance gap between zero-CoT on the original benchmark and an isomorphically perturbed reference set. This is a direct observational comparison rather than a fitted parameter renamed as a prediction or a self-definitional loop. The abstract presents the perturbed set as an external control constructed to isolate memorization, with no indication that perturbation rules or thresholds are tuned on the same test models in a way that forces the detection signal by construction. No self-citation chains, uniqueness theorems, or ansatz smuggling are referenced as load-bearing. The central claim remains an empirical detection procedure whose validity rests on external validation experiments rather than reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that performance differences after truncation and perturbation directly measure memorization rather than other model behaviors or dataset artifacts.

axioms (1)
  • domain assumption: Isomorphic perturbations preserve task semantics and difficulty while removing surface-level memorization cues.
    Invoked when constructing the reference dataset to isolate contamination effects.


