arXiv: 2602.03812 · v2 · pith:MT3GNTTEnew · submitted 2026-02-03 · 💻 cs.LG · cs.AI · cs.CL

Antidistillation Fingerprinting

Pith reviewed 2026-05-21 13:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords antidistillation fingerprinting, model distillation detection, LLM output watermarking, gradient-based token selection, proxy model alignment, fine-tuning detection, IP protection for language models

The pith

Antidistillation fingerprinting selects output tokens via a proxy model to embed signals that survive student fine-tuning and enable reliable detection of distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents antidistillation fingerprinting (ADFP) as a method for detecting when a third-party model has been trained on a teacher model's outputs. Current fingerprinting approaches rely on random or heuristic perturbations that force a steep trade-off between generation quality and how well the signal survives distillation. ADFP instead uses gradient information from a proxy model to pick tokens expected to leave a stronger trace after the student learns from them. Experiments on GSM8K (math), OASST1 (dialogue), and MBPP (code) show clearer detection with far smaller changes to model outputs than prior techniques, and the method works even when the student's architecture is unknown.
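
To make the heuristic baseline concrete, here is a minimal sketch of the red-and-green-list perturbation that the figure captions compare against, in which every green token's logit receives the same boost δ. It is not ADFP itself. It assumes the common scheme where a keyed hash of the previous token selects the green subset; the names `green_list`, `boost_green`, `gamma`, and `key` are hypothetical and do not come from the paper.

```python
import hashlib
import math
import random


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5,
               key: str = "demo-key") -> set[int]:
    """Pick a pseudo-random 'green' subset of the vocabulary.

    Assumption: the subset is seeded by a keyed hash of the previous token.
    """
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def boost_green(logits: list[float], green: set[int], delta: float) -> list[float]:
    """Add the same bias delta to every green token (the uniform heuristic)."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 6-token vocabulary with made-up logits.
logits = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]
green = green_list(prev_token=3, vocab_size=len(logits))
before = softmax(logits)
after = softmax(boost_green(logits, green, delta=2.0))

# Probability mass on green tokens before and after the boost.
green_mass_before = sum(before[i] for i in green)
green_mass_after = sum(after[i] for i in green)
print(green_mass_before, green_mass_after)
```

A larger `delta` shifts more probability mass onto green tokens regardless of how likely they were, which is the source of the quality cost that the summary above attributes to heuristic perturbations.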

Core claim

By replacing incidental bias absorption with targeted token selection that maximizes expected detectability after fine-tuning, antidistillation fingerprinting produces a Pareto improvement: detection remains reliable across different student architectures while utility loss on downstream tasks stays small.
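
One way to read "Pareto improvement" here is that, for every operating point of the baseline, the new method has a point that is at least as good on both axes (lower p-value, higher utility) and strictly better on one. The sketch below checks that condition. The function names and all numbers are invented for illustration and are not results from the paper.

```python
Point = tuple[float, float]  # (p_value, accuracy): lower p is better, higher accuracy is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    (pa, qa), (pb, qb) = a, b
    return pa <= pb and qa >= qb and (pa < pb or qa > qb)


def pareto_front(points: list[Point]) -> list[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def is_pareto_improvement(candidate: list[Point], baseline: list[Point]) -> bool:
    """True if every baseline point is dominated by some candidate point."""
    return all(any(dominates(c, b) for c in candidate) for b in baseline)


# Made-up operating points, one per perturbation strength.
baseline = [(0.40, 0.70), (0.10, 0.55), (0.02, 0.40)]
candidate = [(0.30, 0.72), (0.05, 0.60), (0.01, 0.45)]

print(is_pareto_improvement(candidate, baseline))
print(pareto_front(baseline + candidate))
```

With these toy points the combined front consists only of the candidate's points, which is the pattern the trade-off figures are described as showing.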

What carries the argument

The gradient-based antidistillation sampling procedure that uses a proxy model to identify tokens maximizing the expected strength of the fingerprint inside a fine-tuned student.

If this is right

  • Detection confidence rises compared with heuristic perturbation baselines on the tested tasks.
  • Generation utility remains closer to the original teacher model on mathematical reasoning, dialogue, and code generation.
  • The approach continues to work when the student architecture differs from the proxy used to create the fingerprint.
  • Fewer tokens need to be altered to reach usable detection strength than with prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-selection principle could be adapted to track other forms of model adaptation such as continued pre-training or preference tuning.
  • Output pipelines could incorporate this selection step to give model owners a built-in way to audit downstream copying without separate watermarks.
  • If the proxy-student alignment holds across larger scale gaps, the technique might reduce reliance on post-hoc auditing of training datasets.

Load-bearing premise

A proxy model can reliably choose tokens whose biases will be strongly absorbed and therefore detectable in an unknown student model after fine-tuning.

What would settle it

Fine-tuning a student model on ADFP-generated outputs and then measuring detection rates that fall below those of baseline watermarking methods on the same benchmarks.
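
The comparison above needs a detection statistic. A minimal sketch, assuming the one-sided z-test commonly used for green-list watermarks: count how many of the student's generated tokens fall in the green list and test against the fraction `gamma` expected by chance. Whether the paper uses exactly this statistic is not stated on this page; `green_fraction_p_value` is a hypothetical name.

```python
import math


def green_fraction_p_value(green_count: int, total: int, gamma: float = 0.5) -> float:
    """One-sided p-value for seeing at least green_count green tokens out of total.

    Uses the normal approximation to a Binomial(total, gamma) null, under which
    the student was not trained on fingerprinted outputs.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be in (0, 1)")
    z = (green_count - gamma * total) / math.sqrt(total * gamma * (1.0 - gamma))
    # Upper-tail probability of a standard normal: 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# A student emitting exactly the chance rate of green tokens gives no evidence.
print(green_fraction_p_value(50, 100))
# A modest excess of green tokens lowers the p-value.
print(green_fraction_p_value(60, 100))
```

Lower p-values correspond to stronger evidence that the student absorbed the fingerprint, matching the convention in the figure captions.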

Figures

Figures reproduced from arXiv: 2602.03812 by Alexander Robey, Asher Trockman, Fei Fang, John Kirchenbauer, J. Zico Kolter, Tom Goldstein, Yash Savani, Yixuan Even Xu.

Figure 1: Antidistillation fingerprinting (ADFP) performs targeted logit perturbations aligned with the student’s learning dynamics to optimize fingerprinting effect. Visually, while the standard heuristic boosts green tokens uniformly, ADFP selectively amplifies high-likelihood ones, which are most likely to be internalized, improving the quality-fingerprinting trade-off.

Figure 2: Trade-off between fingerprinting p-value and generation quality on GSM8K under unsupervised evaluation. Each point corresponds to a different logit perturbation strength δ or λ. Lower p-value indicates stronger fingerprinting effect, and higher accuracy indicates better generation quality. Antidistillation fingerprinting achieves a Pareto improvement over red-and-green-list fingerprinting.

Figure 3: Trade-off between fingerprinting p-value and generation quality on OASST1 under unsupervised evaluation. Each point corresponds to a different logit perturbation strength δ or λ. Lower p-value indicates stronger fingerprinting effect, and lower NLL indicates better generation quality. Antidistillation fingerprinting achieves a Pareto improvement over red-and-green-list fingerprinting.

Figure 4: Trade-off between fingerprinting p-value and student’s accuracy after fine-tuning on GSM8K. Each point corresponds to a different logit perturbation strength δ or λ. Lower p-value indicates stronger fingerprinting effect, and higher accuracy indicates better fine-tuning quality. Antidistillation fingerprinting achieves a Pareto improvement over red-and-green-list fingerprinting.

Figure 5: The effect of fingerprinted data fraction on fingerprinting p-value for both antidistillation fingerprinting (λ = 256, teacher accuracy 52%) and red-and-green-list fingerprinting (δ = 7, teacher accuracy 47%) on GSM8K. Each data point is averaged over 10 random trials in log space, with error bars indicating 1.96 times standard error of mean. Both methods’ fingerprinting effect degrades as the fingerprinte…

Figure 6: Trade-off between fingerprinting p-value and generation quality on GSM8K under supervised evaluation. Each point corresponds to a different logit perturbation strength δ or λ. Lower p-value indicates stronger fingerprinting effect, and higher accuracy indicates better generation quality. Antidistillation fingerprinting achieves a Pareto improvement over red-and-green-list fingerprinting when the proxy mode…

Figure 7: Trade-off between fingerprinting p-value and generation quality on OASST1 under supervised evaluation. Each point corresponds to a different logit perturbation strength δ or λ. Lower p-value indicates stronger fingerprinting effect, and lower NLL indicates better generation quality. Antidistillation fingerprinting achieves a Pareto improvement over red-and-green-list fingerprinting.

Figure 8: Trade-off between fingerprinting p-value and student’s accuracy after fine-tuning on GSM8K under supervised evaluation. Each point corresponds to a different logit perturbation strength δ or λ. Lower p-value indicates stronger fingerprinting effect, and higher accuracy indicates better fine-tuning quality. When the proxy model is the same as the student model, antidistillation fingerprinting barely sacrifi…

Figure 9: The effect of fingerprinted data fraction on fingerprinting p-value for both antidistillation fingerprinting (λ = 256, teacher accuracy 52%) and red-and-green-list fingerprinting (δ = 7, teacher accuracy 47%) on GSM8K under supervised evaluation. Each data point is averaged over 10 random trials in log space, with error bars indicating 1.96 times standard error of mean. Both methods’ fingerprinting effect …

Figure 10: ROC plots and AUC scores for both antidistillation fingerprinting (λ = 140, teacher accuracy 67%) and red-and-green-list fingerprinting (δ = 6, teacher accuracy 66%) on GSM8K under different evaluation settings. Each plot represents a binary classification task of distinguishing whether a fine-tuned student model is fingerprinted using the respective method or fine-tuned on unfingerprinted data. Each fing…

Figure 11: The computed p-value versus empirical false positive rate (FPR) for the non-fingerprinted baseline on GSM8K under different evaluation settings. Each plot shows the relationship between the p-value and empirical FPR over 100 random trials simulating individual detection scenarios where the ground truth label is negative. Similar to the ROC analysis, the threshold convention for computing each estimate is …
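
The captions for Figures 5 and 9 describe each point as an average over 10 random trials in log space, with error bars at 1.96 times the standard error of the mean. A minimal sketch of that aggregation follows, assuming base-10 logs and the sample standard deviation; the base only affects how the band is plotted, not the central estimate. `log_space_summary` is a hypothetical name and the inputs are made up.

```python
import math
import statistics


def log_space_summary(p_values: list[float]) -> tuple[float, float, float]:
    """Average p-values in log10 space and return (center, lower, upper).

    The center is the geometric mean; the band is the mean log10 p-value
    plus or minus 1.96 standard errors, mapped back to the p-value scale.
    """
    if len(p_values) < 2:
        raise ValueError("need at least two trials to estimate a standard error")
    logs = [math.log10(p) for p in p_values]
    mean_log = statistics.fmean(logs)
    sem = statistics.stdev(logs) / math.sqrt(len(logs))
    half_width = 1.96 * sem
    return (10 ** mean_log,
            10 ** (mean_log - half_width),
            10 ** (mean_log + half_width))


# Two made-up trials whose p-values differ by two orders of magnitude.
center, lower, upper = log_space_summary([1e-2, 1e-4])
print(center, lower, upper)
```

Averaging in log space keeps a single large p-value from dominating the summary, which an arithmetic mean over values spanning several orders of magnitude would allow.
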
Original abstract

Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K, OASST1, and MBPP demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility across mathematical reasoning, dialogue, and code generation, even when the student model's architecture is unknown.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Antidistillation Fingerprinting (ADFP), a gradient-based method that uses a proxy model to select and sample tokens maximizing expected fingerprint detectability in a student model after fine-tuning on teacher outputs. It claims a significant Pareto improvement over state-of-the-art baselines on GSM8K, OASST1, and MBPP, delivering stronger detection confidence with minimal utility degradation across mathematical reasoning, dialogue, and code generation tasks, including when the student architecture is unknown.

Significance. If the empirical claims hold under rigorous validation, the work would advance LLM provenance techniques by aligning fingerprinting with distillation dynamics rather than relying on heuristic perturbations, potentially lowering the utility cost of detecting unauthorized model distillation.

major comments (2)
  1. [§3] §3 (gradient-based framework description): the objective directly optimizes token selection via proxy gradients to maximize detectability in the student, yet the robustness claim for unknown architectures requires that proxy loss landscapes correlate with student parameter updates despite mismatches in attention, normalization, or vocabulary; no theoretical justification or dedicated ablation with architecturally divergent students (e.g., transformer vs. non-transformer or differing layer counts) is provided, which is load-bearing for the 'even when unknown' assertion in the abstract.
  2. [Results] Results section (experiments on GSM8K/OASST1/MBPP): the reported Pareto improvement and detection confidence gains are asserted without accompanying statistical significance tests, variance across random seeds, or full baseline implementation details (including exact watermark strengths and proxy choices), making it impossible to assess whether the gains exceed what could arise from incidental absorption rather than targeted sampling.
minor comments (2)
  1. [§3] Notation for the detectability objective and proxy loss should be clarified with an explicit equation relating the sampled tokens to post-fine-tuning detection score.
  2. Figure captions and table headers lack sufficient detail on the exact metrics (e.g., AUC or p-value thresholds) used for 'detection confidence' and 'utility'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§3] §3 (gradient-based framework description): the objective directly optimizes token selection via proxy gradients to maximize detectability in the student, yet the robustness claim for unknown architectures requires that proxy loss landscapes correlate with student parameter updates despite mismatches in attention, normalization, or vocabulary; no theoretical justification or dedicated ablation with architecturally divergent students (e.g., transformer vs. non-transformer or differing layer counts) is provided, which is load-bearing for the 'even when unknown' assertion in the abstract.

    Authors: We agree that a formal theoretical analysis of gradient correlation across arbitrary architectural mismatches would strengthen the robustness claim. The current work is primarily empirical; the manuscript already includes cases where proxy and student differ in scale, layer count, and attention configuration while maintaining detection gains. To directly address the request for dedicated ablations, the revised version adds a new subsection with experiments on more divergent student architectures (including variations in normalization and vocabulary size), confirming that the Pareto improvements persist. revision: yes

  2. Referee: [Results] Results section (experiments on GSM8K/OASST1/MBPP): the reported Pareto improvement and detection confidence gains are asserted without accompanying statistical significance tests, variance across random seeds, or full baseline implementation details (including exact watermark strengths and proxy choices), making it impossible to assess whether the gains exceed what could arise from incidental absorption rather than targeted sampling.

    Authors: We acknowledge that the original results presentation lacked explicit statistical testing and variance reporting. The revised manuscript now reports standard deviations over five random seeds for all metrics, includes paired t-test p-values comparing ADFP against each baseline, and expands the appendix with exact hyperparameter settings, watermark strengths, and proxy model specifications used in every experiment. These additions make it possible to evaluate whether the observed gains are attributable to the targeted sampling procedure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; ADFP uses independent proxy gradients for token selection

Full rationale

The paper's core method builds a fingerprinting objective by running gradient ascent on a chosen proxy model to select tokens expected to be internalized during student fine-tuning. This is a constructive algorithmic choice grounded in the stated assumption that proxy gradients correlate with student updates, rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. Empirical Pareto improvements are reported on external benchmarks (GSM8K, OASST1, MBPP) with explicit acknowledgment that the student architecture is unknown; the derivation does not reduce the claimed detectability gain to the inputs by construction. The proxy choice is falsifiable outside the paper and does not invoke a uniqueness theorem or prior ansatz from the same authors as the sole justification.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the proxy model approximation and the assumption that gradient signals from it transfer to the student's fine-tuning dynamics; no new physical entities or free parameters are explicitly introduced in the abstract, though the proxy choice itself functions as a modeling decision.

free parameters (1)
  • proxy model selection
    Choice of which model serves as proxy directly affects which tokens are sampled and thus the reported detection strength.
axioms (1)
  • domain assumption: A proxy model's gradients can approximate the learning dynamics of an unknown student architecture during fine-tuning on teacher outputs.
    This premise is invoked to justify sampling tokens that maximize expected detectability rather than relying on incidental biases.

pith-pipeline@v0.9.0 · 5752 in / 1442 out tokens · 41465 ms · 2026-05-21T13:43:47.006553+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Asking Back: Interaction-Layer Antidistillation Watermarks

    cs.CR 2026-05 unverdicted novelty 6.0

    Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via...

  2. Lossless Anti-Distillation Sampling

    cs.LG 2026-05 unverdicted novelty 5.0

    LADS is a sampling method that keeps benign user generations statistically identical to the original model while forcing correlated samples across a distiller's multiple accounts, provably worsening their generalizati...
