
arxiv: 2510.04850 · v3 · submitted 2025-10-06 · 💻 cs.CL · cs.AI

Detecting Distillation Data from Reasoning Models

Pith reviewed 2026-05-18 10:06 UTC · model grok-4.3

keywords distillation data detection · reasoning models · token probability deviation · data contamination · language model auditing · benchmark integrity

The pith

Seen questions elicit more near-deterministic tokens from reasoning models than unseen ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper defines the task of detecting whether a given question appeared in the data used to distill reasoning skills into smaller language models. The central observation is that models generate tokens with probabilities closer to one for questions they encountered during distillation. The authors introduce a Token Probability Deviation (TPD) score that measures how far each generated token's probability falls from a high-confidence reference value. Lower scores flag seen questions, and the authors report detection AUC up to 31% higher than previous approaches, in a setting where only part of the distillation data is available.

Core claim

Questions included in the distillation data cause the model to produce output tokens whose probabilities deviate less from a high-confidence reference, yielding lower TPD scores than questions never seen during distillation; this difference enables detection of contaminated benchmark items without requiring the full training set.

What carries the argument

Token Probability Deviation (TPD) score, which averages the deviation of each generated token's probability from a high-confidence reference probability.
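
The summary above does not give the paper's equation, so the following is only a minimal sketch of one way such a score could be computed. It assumes the score is the mean shortfall of each generated token's probability below a reference value, TPD = (1/n) * sum_i max(0, reference - p_i). The function name `tpd_score`, the `reference` parameter, its default of 1.0, and the clipping at zero are all illustrative assumptions, not the authors' definition.

```python
import math
from typing import Sequence


def tpd_score(token_logprobs: Sequence[float], reference: float = 1.0) -> float:
    """Mean shortfall of generated-token probabilities below `reference`.

    Hypothetical form, not taken from the paper:
        TPD = (1/n) * sum_i max(0, reference - p_i)
    where p_i = exp(logprob_i) is the probability the model assigned
    to the i-th token it generated.
    """
    if not token_logprobs:
        raise ValueError("need at least one generated token")
    deviations = [max(0.0, reference - math.exp(lp)) for lp in token_logprobs]
    return sum(deviations) / len(deviations)


# Made-up log-probabilities for two generated answers.
confident = [math.log(p) for p in (0.99, 0.98, 1.0, 0.97)]
hesitant = [math.log(p) for p in (0.60, 0.95, 0.40, 0.85)]

# Under this sketch, the near-deterministic answer gets the lower score.
print(tpd_score(confident) < tpd_score(hesitant))
```

Taking log-probabilities as input reflects what many inference APIs expose per generated token; a lower score under this sketch would correspond to the "seen" pattern the paper describes.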

If this is right

  • Benchmark results on distilled models can be audited for contamination even when the complete distillation corpus remains private.
  • Detection works from output probabilities alone, requiring no access to input embeddings or training logs.
  • The same probability pattern can be measured on any reasoning model that exposes token-level probabilities.
  • Lower TPD scores directly indicate higher risk that performance metrics are inflated by data leakage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same signal may appear after other forms of targeted fine-tuning, opening a route to detect narrower data leaks.
  • Model releases could include a TPD fingerprint on public benchmarks to certify absence of contamination.
  • Combining TPD with input-side statistics could increase robustness when the model is further aligned after distillation.

Load-bearing premise

Questions that appeared in the distillation data will consistently produce more near-deterministic token outputs than questions that did not.

What would settle it

A controlled test set of seen and unseen questions where the distributions of TPD scores show no statistical separation.
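
One way to make "no statistical separation" concrete is to compute the detection AUC directly from the two score lists: a value near 0.5 would mean TPD scores do not distinguish seen from unseen questions. The sketch below uses the rank-based definition of AUC (the probability that a randomly chosen seen question scores lower than a randomly chosen unseen one, ties counted as half). The function name `detection_auc` and the example scores are hypothetical.

```python
from typing import Sequence


def detection_auc(seen_scores: Sequence[float], unseen_scores: Sequence[float]) -> float:
    """AUC for the rule "lower score means seen".

    Equals the fraction of (seen, unseen) pairs in which the seen
    question has the lower score, with ties counted as half a win.
    """
    if not seen_scores or not unseen_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for s in seen_scores:
        for u in unseen_scores:
            if s < u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(seen_scores) * len(unseen_scores))


# Made-up scores for illustration only.
seen = [0.02, 0.05, 0.03]
unseen = [0.20, 0.04, 0.30]
print(detection_auc(seen, unseen))
```

The pairwise loop is quadratic in the number of questions, which is acceptable for benchmark-sized sets; a sort-based rank computation would be the usual choice for larger ones.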

Original abstract

Reasoning distillation has emerged as a prevailing paradigm for transferring reasoning capabilities from large reasoning models to small language models. Yet, reasoning distillation risks data contamination: benchmark data may inadvertently be included in the distillation data, thereby inflating model performance metrics. In this work, we formally define the distillation data detection task, which determines whether a given question is included in the model's distillation data. The unique challenge of this task lies in the partial availability of distillation data. To address this, we propose Token Probability Deviation (TPD), a detection method that leverages the probability patterns of output tokens generated by the model instead of input tokens. Our method is motivated by the observation that seen questions tend to elicit more near-deterministic tokens generated by the models than unseen ones. Our TPD score is thus designed to quantify the token-level deviation of generated tokens from a high-confidence reference probability. Consequently, seen questions can yield substantially lower TPD scores than unseen ones, enabling strong detection performance. Extensive experiments demonstrate the effectiveness of our approach, improving detection AUC by up to 31% on distillation datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper defines the distillation data detection task to determine if benchmark questions were included in reasoning model distillation data. It proposes the Token Probability Deviation (TPD) score, which quantifies how much generated token probabilities deviate from high-confidence references. This is motivated by the observation that seen questions produce more near-deterministic outputs than unseen ones, yielding lower TPD scores and enabling up to 31% AUC gains in detection experiments.

Significance. If the core observation holds after controlling for confounds, the work provides a practical, output-focused technique for identifying contamination in partially observed distillation datasets. This is valuable for reliable benchmarking of small reasoning models and could inform better distillation practices. The parameter-free nature of TPD and its grounding in observable token patterns are strengths.

major comments (2)
  1. [Experiments section (and abstract)] The central claim that lower TPD scores for seen questions result from distillation inclusion (rather than question-intrinsic factors) requires explicit controls or matching on question length, difficulty, lexical overlap, or solvability in the seen/unseen splits. No such ablations or matched-pair analysis are described, which is load-bearing for interpreting the AUC improvements as evidence of the intended signal.
  2. [§3 (Method)] The high-confidence reference probability used in TPD needs a precise definition or equation; it is unclear whether it is computed from the target model on the same question, a separate reference model, or another source, which affects whether the deviation metric is truly non-circular.
minor comments (2)
  1. [Abstract] The abstract reports AUC gains without naming the baselines or datasets; adding these specifics would improve readability even if details appear later.
  2. [§3] Notation for TPD could be formalized with a short equation to make the deviation calculation unambiguous.
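
The first major comment asks for matching on question-intrinsic factors such as length. As a rough illustration of what a matched-pair control could look like, the sketch below greedily pairs each seen question with the closest-length unseen question within a tolerance, so that TPD scores can then be compared only across matched pairs. The function name `match_by_length`, the greedy strategy, and the `tolerance` value are assumptions for illustration; the paper and the report do not specify a procedure.

```python
from typing import List, Sequence, Tuple


def match_by_length(
    seen_lengths: Sequence[int],
    unseen_lengths: Sequence[int],
    tolerance: int = 5,
) -> List[Tuple[int, int]]:
    """Greedily pair seen and unseen questions of similar length.

    Returns (seen_index, unseen_index) pairs. Each unseen question is
    used at most once; seen questions with no unused candidate within
    `tolerance` tokens are left unmatched.
    """
    used = set()
    pairs: List[Tuple[int, int]] = []
    for i, s_len in enumerate(seen_lengths):
        best = None  # (gap, unseen_index) of the closest candidate so far
        for j, u_len in enumerate(unseen_lengths):
            if j in used:
                continue
            gap = abs(s_len - u_len)
            if gap <= tolerance and (best is None or gap < best[0]):
                best = (gap, j)
        if best is not None:
            used.add(best[1])
            pairs.append((i, best[1]))
    return pairs


# Made-up question lengths in tokens.
print(match_by_length([10, 50], [52, 12, 200]))
```

Greedy matching depends on input order and is not optimal; it is shown only because it is short. The same pattern would extend to difficulty or lexical overlap by swapping the distance used for `gap`.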

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive report. Below we provide point-by-point responses to the major comments. We plan to revise the manuscript accordingly to address these points.

  1. Referee: [Experiments section (and abstract)] The central claim that lower TPD scores for seen questions result from distillation inclusion (rather than question-intrinsic factors) requires explicit controls or matching on question length, difficulty, lexical overlap, or solvability in the seen/unseen splits. No such ablations or matched-pair analysis are described, which is load-bearing for interpreting the AUC improvements as evidence of the intended signal.

    Authors: We concur that demonstrating the TPD signal is independent of question-intrinsic factors is essential for the robustness of our conclusions. Although the seen/unseen splits in our experiments are derived from standard benchmark partitions, we did not include explicit matching or ablations for length, difficulty, lexical overlap, or solvability. In the revised manuscript, we will add these controls and matched-pair analyses to isolate the effect of distillation data inclusion and thereby strengthen the interpretation of the AUC gains. revision: yes

  2. Referee: [§3 (Method)] The high-confidence reference probability used in TPD needs a precise definition or equation; it is unclear whether it is computed from the target model on the same question, a separate reference model, or another source, which affects whether the deviation metric is truly non-circular.

    Authors: We appreciate this observation regarding the clarity of our method. The current manuscript describes the high-confidence reference probability conceptually but lacks a formal equation. We will revise §3 to include a precise mathematical definition of the reference probability and the TPD score. This definition will specify the computation source and confirm that the metric avoids circularity by using a reference derived independently of the per-token deviations being scored. revision: yes

Circularity Check

0 steps flagged

No significant circularity in TPD derivation or central claim

The paper motivates TPD from an empirical observation about higher token determinism on seen questions, then defines the score to quantify deviation from high-confidence reference probabilities so that seen questions produce lower scores. This is a standard design of a heuristic detector based on a measured pattern, not a derivation that reduces by construction to its own inputs. No equations, fitted parameters renamed as predictions, self-citations, uniqueness theorems, or ansatzes appear in the abstract or description that would create a self-referential loop. The approach remains self-contained as an empirical method whose effectiveness is assessed via separate experiments on distillation datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption about token determinism for seen data and introduces TPD as a new metric without independent evidence beyond the reported experiments.

axioms (1)
  • domain assumption: Seen questions elicit more near-deterministic tokens than unseen ones
    This observation is stated as the motivation for designing TPD to quantify deviation from high-confidence probabilities.
invented entities (1)
  • Token Probability Deviation (TPD) score · no independent evidence
    purpose: Quantify token-level deviation of generated tokens from a high-confidence reference probability
    New detection metric introduced in the paper with no external validation mentioned.

pith-pipeline@v0.9.0 · 5715 in / 1116 out tokens · 34131 ms · 2026-05-18T10:06:17.531194+00:00 · methodology
