Detecting Distillation Data from Reasoning Models

Hengxiang Zhang; Hongxin Wei; Hyeong Kyu Choi; Sharon Li

arxiv: 2510.04850 · v3 · submitted 2025-10-06 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 10:06 UTC · model grok-4.3

keywords: distillation data detection, reasoning models, token probability deviation, data contamination, language model auditing, benchmark integrity

The pith

Seen questions elicit more near-deterministic tokens from reasoning models than unseen ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper defines the task of detecting whether a given question appeared in the data used to distill reasoning skills into smaller language models. The central observation is that models generate tokens with probabilities closer to one for questions they encountered during distillation. The authors introduce a Token Probability Deviation score that measures how far each generated token's probability strays from a high-confidence reference value. Lower scores reliably flag seen questions, producing up to 31 percent higher detection AUC than previous approaches even when only partial distillation data is available.

Core claim

Questions included in the distillation data cause the model to produce output tokens whose probabilities deviate less from a high-confidence reference, yielding lower TPD scores than questions never seen during distillation; this difference enables detection of contaminated benchmark items without requiring the full training set.

What carries the argument

Token Probability Deviation (TPD) score, which averages the deviation of each generated token's probability from a high-confidence reference probability.
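
As a minimal sketch of how such a score could be computed, the snippet below assumes a one-sided deviation below a reference probability, averaged over the generated tokens. The function name `tpd_score`, the `reference` parameter, its default of 1.0, and the toy probabilities are all illustrative assumptions; this page does not reproduce the paper's exact definition, which may differ.

```python
from typing import Sequence


def tpd_score(token_probs: Sequence[float], reference: float = 1.0) -> float:
    """Average shortfall of generated-token probabilities below `reference`.

    ASSUMPTION: one-sided deviation max(0, reference - p); the paper's exact
    formula and reference value are not given in this summary.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    deviations = [max(0.0, reference - p) for p in token_probs]
    return sum(deviations) / len(deviations)


# Toy probabilities (made up for illustration, not measured data).
near_deterministic = [0.99, 0.98, 0.97, 0.95]
less_confident = [0.99, 0.60, 0.45, 0.80]

low = tpd_score(near_deterministic)
high = tpd_score(less_confident)
print(low < high)
```

With these toy inputs the near-deterministic sequence receives the lower score, which is the direction the paper reports for seen questions.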

If this is right

Benchmark results on distilled models can be audited for contamination even when the complete distillation corpus remains private.
Detection works from output probabilities alone, requiring no access to input embeddings or training logs.
The same probability pattern can be measured on any reasoning model that exposes token-level probabilities.
Lower TPD scores directly indicate higher risk that performance metrics are inflated by data leakage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same signal may appear after other forms of targeted fine-tuning, opening a route to detect narrower data leaks.
Model releases could include a TPD fingerprint on public benchmarks to certify absence of contamination.
Combining TPD with input-side statistics could increase robustness when the model is further aligned after distillation.

Load-bearing premise

Questions that appeared in the distillation data will consistently produce more near-deterministic token outputs than questions that did not.

What would settle it

A controlled test set of seen and unseen questions where the distributions of TPD scores show no statistical separation.
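
One way to quantify "no statistical separation" is the detection AUC itself, which equals 0.5 when the two score distributions are indistinguishable. The sketch below computes it by pairwise comparison, treating lower scores as evidence of a seen question. The function name `detection_auc` and the example scores are hypothetical and not taken from the paper.

```python
from typing import Sequence


def detection_auc(seen_scores: Sequence[float],
                  unseen_scores: Sequence[float]) -> float:
    """Probability that a random seen item scores lower than a random unseen one.

    Ties count as half a win. 1.0 means perfect separation; 0.5 means none.
    """
    if not seen_scores or not unseen_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s in seen_scores:
        for u in unseen_scores:
            if s < u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(seen_scores) * len(unseen_scores))


# Hypothetical scores, for illustration only.
separated = detection_auc([0.1, 0.2], [0.3, 0.4])
overlapping = detection_auc([0.1, 0.4], [0.1, 0.4])
print(separated, overlapping)
```

The pairwise loop is quadratic in the number of questions, which is acceptable for a small controlled test set; a rank-based implementation would be preferable at scale.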

Original abstract

Reasoning distillation has emerged as a prevailing paradigm for transferring reasoning capabilities from large reasoning models to small language models. Yet, reasoning distillation risks data contamination: benchmark data may inadvertently be included in the distillation data, thereby inflating model performance metrics. In this work, we formally define the distillation data detection task, which determines whether a given question is included in the model's distillation data. The unique challenge of this task lies in the partial availability of distillation data. To address this, we propose Token Probability Deviation (TPD), a detection method that leverages the probability patterns of output tokens generated by the model instead of input tokens. Our method is motivated by the observation that seen questions tend to elicit more near-deterministic tokens generated by the models than unseen ones. Our TPD score is thus designed to quantify the token-level deviation of generated tokens from a high-confidence reference probability. Consequently, seen questions can yield substantially lower TPD scores than unseen ones, enabling strong detection performance. Extensive experiments demonstrate the effectiveness of our approach, improving detection AUC by up to 31% on distillation datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines the distillation data detection task and the TPD score but leaves open whether the gains come from distillation or from unmatched question properties.

The main takeaway is that this work carves out the task of detecting whether a question appeared in reasoning distillation data and offers Token Probability Deviation (TPD) as a detector based on output token patterns. The authors observe that seen questions produce more near-deterministic tokens and build a score around deviation from a high-confidence reference probability, claiming up to a 31% AUC lift on distillation datasets. That focus on output tokens rather than inputs is a sensible adaptation to the partial-data setting, and the problem itself is worth attention because contamination can quietly inflate benchmark numbers for distilled models. The approach is straightforward and directly motivated by the observed pattern, which gives it some practical appeal for people running these pipelines.

The soft spot is the lack of evidence that the seen and unseen question sets are comparable on difficulty, length, or other features that could affect token confidence on their own. The abstract gives no sign of matched-pair controls or ablations on question properties, so the reported gains could partly reflect those differences instead of the distillation signal. Without those checks the central claim rests on thinner ground than it first appears.

The work is aimed at researchers who train or evaluate reasoning models and need practical ways to audit distillation data. A reader already thinking about contamination or membership-style detection could extract the core idea and the TPD formulation. It is coherent enough on its own terms to merit a serious referee who can examine the full experimental details and any controls that may be present beyond the abstract.

Referee Report

2 major / 2 minor

Summary. The paper defines the distillation data detection task to determine if benchmark questions were included in reasoning model distillation data. It proposes the Token Probability Deviation (TPD) score, which quantifies how much generated token probabilities deviate from high-confidence references. This is motivated by the observation that seen questions produce more near-deterministic outputs than unseen ones, yielding lower TPD scores and enabling up to 31% AUC gains in detection experiments.

Significance. If the core observation holds after controlling for confounds, the work provides a practical, output-focused technique for identifying contamination in partially observed distillation datasets. This is valuable for reliable benchmarking of small reasoning models and could inform better distillation practices. The parameter-free nature of TPD and its grounding in observable token patterns are strengths.

major comments (2)

[Experiments section (and abstract)] The central claim that lower TPD scores for seen questions result from distillation inclusion (rather than question-intrinsic factors) requires explicit controls or matching on question length, difficulty, lexical overlap, or solvability in the seen/unseen splits. No such ablations or matched-pair analysis are described, which is load-bearing for interpreting the AUC improvements as evidence of the intended signal.
[§3] §3 (Method): The high-confidence reference probability used in TPD needs a precise definition or equation; it is unclear whether it is computed from the target model on the same question, a separate reference model, or another source, which affects whether the deviation metric is truly non-circular.
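
A matched-pair control of the kind the first comment asks for could be sketched as follows, assuming question length is the property being controlled. The function name `match_by_length`, the `tolerance` parameter, and the example lengths are hypothetical; the paper's own splits and any controls it contains are not described on this page.

```python
from typing import List, Sequence, Tuple


def match_by_length(seen_lengths: Sequence[int],
                    unseen_lengths: Sequence[int],
                    tolerance: int = 5) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching of seen and unseen questions by length.

    Returns (seen_index, unseen_index) pairs whose lengths differ by at most
    `tolerance`. Questions with no partner in range are left unmatched.
    """
    available = list(range(len(unseen_lengths)))
    pairs: List[Tuple[int, int]] = []
    # Visit seen questions from shortest to longest for a stable result.
    for i in sorted(range(len(seen_lengths)), key=lambda k: seen_lengths[k]):
        best = None  # (length difference, unseen index)
        for j in available:
            diff = abs(seen_lengths[i] - unseen_lengths[j])
            if diff <= tolerance and (best is None or diff < best[0]):
                best = (diff, j)
        if best is not None:
            pairs.append((i, best[1]))
            available.remove(best[1])  # each unseen question is used once
    return pairs


# Hypothetical question lengths in tokens.
pairs = match_by_length([10, 50], [12, 48, 200], tolerance=5)
print(pairs)
```

Recomputing the detection metric only on the matched pairs would show whether the separation survives once length is held roughly constant; the same pattern could be applied to a difficulty or solvability estimate.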

minor comments (2)

[Abstract] The abstract reports AUC gains without naming the baselines or datasets; adding these specifics would improve readability even if details appear later.
[§3] Notation for TPD could be formalized with a short equation to make the deviation calculation unambiguous.
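
For illustration only, an equation of the kind requested might take the following shape. The symbols are assumptions made here, not the paper's notation: $q$ is the question, $y_1, \dots, y_T$ are the tokens the model $\theta$ generates, and $\tau$ is the high-confidence reference probability whose source the major comment asks the authors to specify.

```latex
\[
\mathrm{TPD}(q) \;=\; \frac{1}{T} \sum_{t=1}^{T}
  \max\bigl(0,\ \tau - p_\theta(y_t \mid q,\, y_{<t})\bigr)
\]
```

Under this assumed form, a response made entirely of tokens with probability at or above $\tau$ scores zero, and the score grows as more tokens fall below the reference.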

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive report. Below we provide point-by-point responses to the major comments. We plan to revise the manuscript accordingly to address these points.

Referee: [Experiments section (and abstract)] The central claim that lower TPD scores for seen questions result from distillation inclusion (rather than question-intrinsic factors) requires explicit controls or matching on question length, difficulty, lexical overlap, or solvability in the seen/unseen splits. No such ablations or matched-pair analysis are described, which is load-bearing for interpreting the AUC improvements as evidence of the intended signal.

Authors: We concur that demonstrating the TPD signal is independent of question-intrinsic factors is essential for the robustness of our conclusions. Although the seen/unseen splits in our experiments are derived from standard benchmark partitions, we did not include explicit matching or ablations for length, difficulty, lexical overlap, or solvability. In the revised manuscript, we will add these controls and matched-pair analyses to isolate the effect of distillation data inclusion and thereby strengthen the interpretation of the AUC gains. revision: yes
Referee: [§3] §3 (Method): The high-confidence reference probability used in TPD needs a precise definition or equation; it is unclear whether it is computed from the target model on the same question, a separate reference model, or another source, which affects whether the deviation metric is truly non-circular.

Authors: We appreciate this observation regarding the clarity of our method. The current manuscript describes the high-confidence reference probability conceptually but lacks a formal equation. We will revise §3 to include a precise mathematical definition of the reference probability and the TPD score. This definition will specify the computation source and confirm that the metric avoids circularity by using a reference derived independently of the per-token deviations being scored. revision: yes

Circularity Check

0 steps flagged

No significant circularity in TPD derivation or central claim

The paper motivates TPD from an empirical observation about higher token determinism on seen questions, then defines the score to quantify deviation from high-confidence reference probabilities so that seen questions produce lower scores. This is a standard design of a heuristic detector based on a measured pattern, not a derivation that reduces by construction to its own inputs. No equations, fitted parameters renamed as predictions, self-citations, uniqueness theorems, or ansatzes appear in the abstract or description that would create a self-referential loop. The approach remains self-contained as an empirical method whose effectiveness is assessed via separate experiments on distillation datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption about token determinism for seen data and introduces TPD as a new metric without independent evidence beyond the reported experiments.

axioms (1)

domain assumption: Seen questions elicit more near-deterministic tokens than unseen ones
This observation is stated as the motivation for designing TPD to quantify deviation from high-confidence probabilities.

invented entities (1)

Token Probability Deviation (TPD) score (no independent evidence)
purpose: Quantify token-level deviation of generated tokens from a high-confidence reference probability
New detection metric introduced in the paper with no external validation mentioned.

pith-pipeline@v0.9.0 · 5715 in / 1116 out tokens · 34131 ms · 2026-05-18T10:06:17.531194+00:00 · methodology
