AttnDiff: Attention-based Differential Fingerprinting for Large Language Models
Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3
The pith
Differential attention from conflicting prompts creates stable fingerprints that identify LLM derivatives despite fine-tuning, pruning, or merging.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AttnDiff extracts fingerprints from models via intrinsic information-routing behavior. It probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using centered kernel alignment (CKA). Across Llama-2/3 and Qwen2.5 (3B-14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., >0.98 vs. <0.22 with M=60 probes). With 5-60 multi-domain probes, it supports practical provenance verification and accountability.
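To make the "differential attention, then spectral descriptor" step concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not say which layers or heads are used or how the descriptor is defined. This sketch assumes one attention map per prompt, assumes both prompts in a pair tokenize to the same length, and uses the top-k singular values of the attention difference as a stand-in descriptor. The names `spectral_descriptor` and `fingerprint` and the parameter `k` are hypothetical.

```python
import numpy as np

def spectral_descriptor(attn_a, attn_b, k=8):
    """Summarize one prompt pair by the top-k singular values of the attention difference.

    attn_a, attn_b: (seq, seq) attention maps from the same head for the two
    prompts of a pair. Equal shapes are an assumption of this sketch.
    """
    delta = attn_a - attn_b                       # differential attention
    s = np.linalg.svd(delta, compute_uv=False)    # singular values, descending
    out = np.zeros(k)
    out[: min(k, len(s))] = s[:k]                 # zero-pad if fewer than k values
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out        # scale-normalize the spectrum

def fingerprint(pairs, k=8):
    """Stack one descriptor per probe pair into an (M, k) fingerprint matrix."""
    return np.stack([spectral_descriptor(a, b, k) for a, b in pairs])

if __name__ == "__main__":
    # Synthetic row-stochastic matrices stand in for real attention maps.
    rng = np.random.default_rng(0)

    def random_attention(n):
        w = np.exp(rng.normal(size=(n, n)))
        return w / w.sum(axis=1, keepdims=True)

    pairs = [(random_attention(12), random_attention(12)) for _ in range(5)]
    print(fingerprint(pairs, k=4).shape)
```

With real models, each pair would come from running the two prompts through the same model and reading out attention weights. The resulting (M, k) matrix for M probes is the kind of object that a similarity measure such as CKA can then compare across models.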
What carries the argument
Differential attention patterns from minimally edited prompt pairs that create semantic conflicts, condensed into spectral descriptors and measured by centered kernel alignment.
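For readers unfamiliar with centered kernel alignment, the following sketch computes the standard linear form of CKA between two fingerprint matrices whose rows correspond to the same probes. Whether the paper uses linear CKA or a kernel variant is not stated in the abstract, so the linear choice is an assumption, as is the function name `linear_cka`.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (M, d) matrices whose rows are the same M probes.

    Assumes neither matrix is constant across probes (otherwise the
    denominator is zero).
    """
    X = X - X.mean(axis=0, keepdims=True)   # center each feature over probes
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 8))            # e.g. M=60 probes, 8 descriptor dims
    print(linear_cka(X, X))                 # identical fingerprints
    print(linear_cka(X, rng.normal(size=(60, 8))))  # unrelated random fingerprints
```

Linear CKA lies in [0, 1], equals 1 for identical inputs, and is invariant to isotropic rescaling and orthogonal transforms of the feature space. That invariance is one reason a CKA-based score can plausibly tolerate benign reparameterizations between a model and its derivative.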
If this is right
- Provenance checks become feasible with only 5 to 60 multi-domain probes.
- Related model variants retain similarity above 0.98 while unrelated families fall below 0.22.
- The method holds across Llama-2/3, Qwen2.5, and other open families ranging from 3B to 14B parameters.
- Verification works after PPO/DPO fine-tuning, pruning/compression, and model merging.
Where Pith is reading between the lines
- If attention patterns prove more stable than weights, model releases could include optional fingerprint metadata to simplify later audits.
- The same probe construction might be adapted to detect whether a fine-tuned model still carries the original routing signature even when weights have changed substantially.
- Repeated application across successive derivatives could trace long chains of model reuse without needing access to training data.
- Developers might incorporate the probe set into release checklists so downstream users can confirm a model's claimed lineage.
Load-bearing premise
Differential attention patterns reflect an intrinsic model identity that remains stable through fine-tuning, pruning, compression, and merging.
What would settle it
A derivative model obtained by fine-tuning, pruning, or merging shows similarity below 0.5 to its claimed source, or an unrelated model family shows similarity above 0.5, under the same set of 60 probes.
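The falsification condition above can be written as a small check. This is only an illustration of the stated criterion: the 0.5 threshold comes from the sentence above, while the function name and argument names are hypothetical.

```python
def falsifies_claim(similarity, is_claimed_derivative, threshold=0.5):
    """Return True if one observed similarity score contradicts the paper's claim.

    similarity: score between a suspect model and the source model under the
    same probe set. is_claimed_derivative: True if the suspect is known to be
    derived from the source, False if it is from an unrelated family.
    """
    if is_claimed_derivative:
        return similarity < threshold   # a true derivative that looks unrelated
    return similarity > threshold       # an unrelated model that looks derived

if __name__ == "__main__":
    print(falsifies_claim(0.99, True))    # derivative with high similarity
    print(falsifies_claim(0.70, False))   # unrelated family with high similarity
```

A single counterexample of either kind would be enough to challenge the load-bearing premise, which is why the number of derivative instances tested per laundering operation matters.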
Original abstract
Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model despite common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose AttnDiff, a data-efficient white-box framework that extracts fingerprints from models via intrinsic information-routing behavior. AttnDiff probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using CKA. Across Llama-2/3 and Qwen2.5 (3B-14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., >0.98 vs. <0.22 with M=60 probes). With 5-60 multi-domain probes, it supports practical provenance verification and accountability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AttnDiff, a white-box framework for LLM provenance verification. It extracts fingerprints by probing models with minimally edited prompt pairs that induce controlled semantic conflicts, capturing differential attention patterns, summarizing them via compact spectral descriptors, and comparing models with CKA similarity. The authors claim that across Llama-2/3, Qwen2.5 (3B-14B) and other open-source families, the method yields high similarity (>0.98) for related derivatives while separating unrelated families (<0.22) with M=60 probes, and remains effective for practical verification even after laundering operations such as fine-tuning (PPO/DPO), pruning/compression, and model merging.
Significance. If the reported robustness to laundering holds, the work would be significant for intellectual property protection of open-weight LLMs by offering a data-efficient, intrinsic-behavior-based fingerprinting approach that addresses a practical need in model accountability. The empirical separation between related and unrelated models is a notable strength, and the focus on white-box attention patterns provides a fresh angle compared to output-only or weight-based methods.
major comments (2)
- Abstract: the central claim of robustness under laundering operations (fine-tuning, pruning, merging) is load-bearing for the provenance-verification utility, yet the abstract provides no specifics on how these operations were implemented on the tested models or the exact number of derivative instances evaluated per operation; without this, it is impossible to determine whether the >0.98 similarity persists beyond the particular cases examined.
- Abstract: the reported separation (>0.98 vs. <0.22 with M=60 probes) is presented without accompanying details on probe construction, statistical tests, variance across runs, or raw similarity matrices, which undermines assessment of whether the distinction is reliable or sensitive to unstated choices in prompt editing and spectral summarization.
minor comments (2)
- The acronym CKA is used without expansion or reference on first appearance; a brief definition or citation would improve accessibility for readers outside the kernel-methods community.
- Consider including a summary table of similarity scores across all model pairs and laundering scenarios to make the cross-family and cross-operation results easier to parse at a glance.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that the abstract would benefit from greater specificity regarding the laundering experiments and probe details to better support the central claims. We have revised the abstract accordingly while preserving its brevity. Below we respond point by point to the major comments.
- Referee: Abstract: the central claim of robustness under laundering operations (fine-tuning, pruning, merging) is load-bearing for the provenance-verification utility, yet the abstract provides no specifics on how these operations were implemented on the tested models or the exact number of derivative instances evaluated per operation; without this, it is impossible to determine whether the >0.98 similarity persists beyond the particular cases examined.
  Authors: We agree that the abstract should reference the scope of the laundering evaluation to allow readers to assess the robustness claims. The full manuscript provides these details in the Experiments section, including the specific implementations (PPO/DPO fine-tuning on standard instruction datasets, magnitude-based pruning at multiple sparsity levels, and merging via established techniques such as linear interpolation) along with the number of derivative instances tested per operation and model family. We have revised the abstract to concisely note that robustness was evaluated across multiple derivative instances under each laundering operation.
  Revision: yes
- Referee: Abstract: the reported separation (>0.98 vs. <0.22 with M=60 probes) is presented without accompanying details on probe construction, statistical tests, variance across runs, or raw similarity matrices, which undermines assessment of whether the distinction is reliable or sensitive to unstated choices in prompt editing and spectral summarization.
  Authors: We agree that the abstract would be strengthened by briefly indicating the methodological choices underlying the reported separation. Probe construction (minimally edited conflicting prompt pairs), spectral descriptor summarization, and CKA-based comparison are described in the Method section, with supporting analyses of variance across runs and similarity matrices provided in the supplementary material and experimental figures. We have updated the abstract to reference the use of M=60 multi-domain probes and the consistent separation observed across the evaluated model families.
  Revision: yes
Circularity Check
No circularity: purely empirical measurement framework
The paper presents AttnDiff as a data-efficient white-box method that probes models with minimally edited prompt pairs, extracts differential attention patterns, summarizes them via spectral descriptors, and compares via CKA similarity. No equations, derivations, or first-principles results are claimed that reduce outputs to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The reported similarities (>0.98 for derivatives, <0.22 for unrelated families) are experimental observations across Llama-2/3, Qwen2.5, and other families under laundering operations, not statistical predictions forced by fitted parameters. The central premise is an empirical assumption about attention pattern stability, which is externally falsifiable and does not collapse into self-definition or renaming of known results.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using CKA
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: retains high similarity for related derivatives across fine-tuning (including PPO/DPO), pruning/compression, and merging
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.