Fingerprinting LLMs via Prompt Injection
Pith reviewed 2026-05-21 21:22 UTC · model grok-4.3
The pith
Optimized prompt injections create fingerprints that identify whether one LLM derives from a specific base model even after post-training or quantization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By optimizing fingerprint prompts to enforce consistent token preferences, LLMPrint obtains fingerprints that are both unique to the base model and robust to post-processing. The method includes a unified verification procedure applicable to gray-box and black-box settings with statistical guarantees. Tests across five base models and approximately 700 variants demonstrate high true positive rates and near-zero false positive rates.
What carries the argument
Fingerprint prompts optimized to enforce consistent token preferences, which turn the model's prompt-injection vulnerability into a stable, model-specific signature.
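To make "consistent token preference" concrete, here is a minimal sketch that assumes gray-box access returning the logits of two candidate tokens for each fingerprint prompt. The names `preferred_token` and `match_rate`, the data layout, and the toy numbers are hypothetical illustrations, not LLMPrint's actual objective or API.

```python
from typing import List, Tuple

# Hypothetical layout: for each fingerprint prompt, the logits a model assigns
# to the target token pair (w_plus, w_minus). Not LLMPrint's real data format.
LogitPair = Tuple[float, float]


def preferred_token(pair: LogitPair) -> int:
    """Return +1 if the model prefers w_plus, -1 otherwise (ties go to w_minus)."""
    w_plus, w_minus = pair
    return 1 if w_plus > w_minus else -1


def match_rate(base: List[LogitPair], suspect: List[LogitPair]) -> float:
    """Fraction of prompts on which the suspect shares the base model's preference."""
    if len(base) != len(suspect) or not base:
        raise ValueError("need the same non-zero number of prompts for both models")
    agree = sum(preferred_token(b) == preferred_token(s) for b, s in zip(base, suspect))
    return agree / len(base)


# Toy numbers, invented for illustration only.
base_logits = [(2.1, -0.3), (1.7, 0.2), (3.0, -1.1), (0.9, 0.1)]
derived_logits = [(1.8, -0.1), (1.2, 0.4), (2.5, -0.9), (0.3, 0.2)]    # shifted, same signs
unrelated_logits = [(-0.4, 0.6), (1.0, 0.2), (-1.3, 0.8), (0.1, 0.5)]  # mostly different

print(match_rate(base_logits, derived_logits))
print(match_rate(base_logits, unrelated_logits))
```

In this toy setup the derived model's logits move but keep their ordering, so its match rate stays high, while the unrelated model agrees on only a minority of prompts.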
If this is right
- Provenance checks become possible for models that were already released without any embedded markers.
- Verification functions under both partial internal access and query-only access.
- Low false-positive rates support reliable identification across hundreds of modified copies.
- Statistical guarantees allow quantified confidence in each detection decision.
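One way such confidence could be quantified is a one-sided binomial tail probability. The sketch below assumes a null hypothesis in which an unrelated model matches the base model's preference on each prompt independently with probability `p0`. Both the independence and the default `p0 = 0.5` are assumptions made for illustration, not values taken from the paper.

```python
from math import comb


def binomial_tail_p_value(matches: int, n_prompts: int, p0: float = 0.5) -> float:
    """P(X >= matches) for X ~ Binomial(n_prompts, p0): a one-sided p-value."""
    if not 0 <= matches <= n_prompts:
        raise ValueError("matches must lie in [0, n_prompts]")
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be a probability")
    return sum(
        comb(n_prompts, k) * p0**k * (1 - p0) ** (n_prompts - k)
        for k in range(matches, n_prompts + 1)
    )


# Hypothetical counts: 20 fingerprint prompts, all or only some of them matching.
p_all = binomial_tail_p_value(20, 20)
p_some = binomial_tail_p_value(12, 20)
print(f"20/20 matches: p = {p_all:.2e}")
print(f"12/20 matches: p = {p_some:.3f}")
```

Under these assumptions a full match on 20 prompts is very unlikely by chance, whereas 12 of 20 is not distinguishable from an unrelated model.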
Where Pith is reading between the lines
- The same signatures could support audits that check whether a deployed model is an unauthorized derivative of a licensed base.
- Testing the method on additional post-processing such as weight pruning or model merging would clarify its limits.
- If the optimization process is public, future post-training pipelines might deliberately disrupt the targeted token preferences.
- Pairing these fingerprints with output-based similarity checks could create stronger combined evidence for model lineage.
Load-bearing premise
That token preferences fixed on the base model will stay distinctive and stable when the model undergoes post-processing steps that were not part of the optimization.
What would settle it
A collection of unrelated models that were never derived from the base yet produce the same consistent token outputs on the optimized fingerprint prompts would falsify the uniqueness claim.
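The converse question is how much an observed absence of false positives rules out. As a worked illustration, if none of N unrelated models triggers a detection, the exact one-sided upper confidence bound on the false positive rate is 1 - alpha^(1/N), provided the negative trials are independent. The function name and the sample sizes below are hypothetical and are not the paper's evaluation counts.

```python
def fpr_upper_bound_zero_events(n_negatives: int, alpha: float = 0.05) -> float:
    """Exact one-sided (1 - alpha) upper bound on the FPR when 0 of n negatives fire.

    Solves (1 - p) ** n = alpha for p, assuming independent negative trials.
    """
    if n_negatives <= 0:
        raise ValueError("n_negatives must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 1.0 - alpha ** (1.0 / n_negatives)


# Hypothetical negative-set sizes, chosen only to show how the bound tightens.
for n in (10, 100, 1000):
    print(n, round(fpr_upper_bound_zero_events(n), 4))
```

The bound shrinks roughly as 3/N, so "near zero" is only as strong as the number and independence of the negative models tested.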
Original abstract
Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero. The code is publicly available at https://github.com/hifi-hyp/ACL-LLMPrint.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper introduces LLMPrint, a detection framework for identifying whether an LLM is derived from a base model by optimizing fingerprint prompts that exploit prompt injection to enforce consistent token preferences. These fingerprints are asserted to be unique to the base model and robust to post-processing like post-training and quantization. A unified verification procedure is developed for gray-box and black-box settings with statistical guarantees. The method is evaluated on five base models and around 700 variants, achieving high true positive rates and near-zero false positive rates. The code is publicly available.
Significance. Should the robustness to post-processing be substantiated beyond the tested variants, this would represent a notable contribution to LLM security and provenance tracking, as it does not require modifying the model prior to release. The public code release is a positive aspect that supports reproducibility.
major comments (3)
- [Evaluation] The results on the 700 variants show high TPR and low FPR, but the manuscript lacks details on the optimization procedure used to construct the fingerprint prompts, the specific statistical tests applied, potential confounds in generating the variants, and the exact method for measuring false positives. These omissions hinder evaluation of the empirical support for the robustness claim (see Abstract and Evaluation section).
- [Method] The key assumption that fingerprints optimized for consistent token preferences on base models will remain distinctive after arbitrary post-processing is supported only by empirical testing on the ~700 variants. No theoretical argument or proof is provided that the optimization objective preserves invariance under fine-tuning or quantization schemes not included in the test set, which is load-bearing for the central robustness claim.
- [Verification Procedure] The unified verification procedure claims statistical guarantees, but the derivation of these guarantees and the assumptions (e.g., independence) are not clearly specified, particularly for the black-box case.
minor comments (2)
- [Abstract] Clarify the exact number of variants and their distribution between post-trained and quantized models.
- [Related Work] Ensure all relevant prior work on model fingerprinting and watermarking is cited for completeness.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We appreciate the focus on evaluation details, the nature of our robustness claims, and the clarity of statistical guarantees. We have revised the manuscript to address the first and third comments by adding substantial detail and derivations. For the second comment, we have expanded the discussion while noting that our contribution is primarily empirical.
-
Referee: [Evaluation] The results on the 700 variants show high TPR and low FPR, but the manuscript lacks details on the optimization procedure used to construct the fingerprint prompts, the specific statistical tests applied, potential confounds in generating the variants, and the exact method for measuring false positives. These omissions hinder evaluation of the empirical support for the robustness claim (see Abstract and Evaluation section).
Authors: We agree that these details were insufficiently specified. In the revised manuscript we have expanded the Evaluation section with: (1) a complete description of the prompt optimization procedure, including the objective, search algorithm, and hyperparameters; (2) the precise statistical tests (binomial test on token preference consistency with explicit p-value threshold and multiple-testing correction); (3) how the ~700 variants were sourced and any controls applied to reduce confounds such as shared training data or architecture similarity; and (4) the exact false-positive protocol, including the number of cross-model negative pairs and the decision rule. We have also added pseudocode and additional tables summarizing these choices. revision: yes
-
Referee: [Method] The key assumption that fingerprints optimized for consistent token preferences on base models will remain distinctive after arbitrary post-processing is supported only by empirical testing on the ~700 variants. No theoretical argument or proof is provided that the optimization objective preserves invariance under fine-tuning or quantization schemes not included in the test set, which is load-bearing for the central robustness claim.
Authors: We acknowledge that the central robustness claim rests on empirical evidence rather than a formal invariance proof. The optimization objective is deliberately chosen to capture low-level, architecture- and training-induced token biases that are difficult to erase without fundamentally altering the model. In the revision we have added a dedicated Discussion subsection that (a) articulates why these biases are expected to survive common post-processing pipelines, (b) enumerates the diversity of the tested variants (multiple fine-tuning recipes, quantization bit-widths, and model scales), and (c) explicitly states the limitation that untested schemes could in principle break the fingerprint. We maintain that the breadth of the empirical evaluation constitutes the primary support for the claim at this stage. revision: partial
-
Referee: [Verification Procedure] The unified verification procedure claims statistical guarantees, but the derivation of these guarantees and the assumptions (e.g., independence) are not clearly specified, particularly for the black-box case.
Authors: We have revised the Verification Procedure section to include the full derivation. We now state the modeling assumptions explicitly (conditional independence of token samples given the injected prompt, justified by the prompt-injection mechanism forcing deterministic preference), derive the test statistic and its distribution under the null for both gray-box (logit access) and black-box (output-only) settings, and provide the exact p-value computation, including any asymptotic approximations used in the black-box case. A short appendix supplies the complete proof steps. revision: yes
- A formal theoretical proof that the optimization objective preserves distinctiveness under arbitrary post-processing schemes not represented in the evaluated set
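The responses above mention output-only (black-box) access and a multiple-testing correction. The sketch below shows one plausible shape for each: a majority vote over sampled outputs to estimate a token preference, and a Bonferroni correction across several candidate base models. The function names, the sample data, and the choice of Bonferroni are assumptions for illustration. The rebuttal does not say which correction the authors use.

```python
from collections import Counter
from typing import Dict, List, Sequence


def black_box_preference(samples: Sequence[str], w_plus: str, w_minus: str) -> int:
    """Estimate preference from sampled outputs: +1, -1, or 0 if undecided."""
    counts = Counter(samples)
    if counts[w_plus] > counts[w_minus]:
        return 1
    if counts[w_minus] > counts[w_plus]:
        return -1
    return 0


def bonferroni_reject(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Names whose p-value clears the Bonferroni-corrected threshold alpha / m."""
    m = len(p_values)
    return [name for name, p in p_values.items() if p <= alpha / m]


# Hypothetical sampled completions for one fingerprint prompt.
samples = ["cat", "cat", "dog", "cat"]
print(black_box_preference(samples, w_plus="cat", w_minus="dog"))

# Hypothetical p-values from testing one suspect against three candidate bases.
p_values = {"base_A": 1e-6, "base_B": 0.03, "base_C": 0.4}
print(bonferroni_reject(p_values))
```

With three candidate bases the corrected threshold is 0.05 / 3, so a p-value of 0.03 that would pass an uncorrected test no longer counts as a detection.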
Circularity Check
No significant circularity; empirical evaluation on external variants
The paper's core procedure optimizes fingerprint prompts on base models to induce consistent token preferences and then evaluates distinctiveness and robustness via direct testing on ~700 post-trained or quantized variants. This constitutes an empirical measurement against held-out external models rather than any derivation that reduces by construction to fitted inputs, self-definitions, or self-citation chains. No equations or claims in the provided text equate a prediction to its optimization objective; statistical guarantees are presented as arising from the verification procedure applied to the test set. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs have an inherent vulnerability to prompt injection that can be exploited to enforce consistent token preferences unique to the base model.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "by optimizing fingerprint prompts to enforce consistent token preferences... near the decision boundary between the target token pair"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "LLMPrint achieves high true positive rates while keeping false positive rates near zero on five base models and around 700 post-trained or quantized variants"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
from a curated set of common categories (e.g., animals, fruits, colors), with
Algorithm 2: Fingerprint Verification
Require: Fingerprint prompt set $\{p_j\}_{j=1}^{n}$, token pair set $\{(w_j^{+}, w_j^{-})\}_{j=1}^{n}$, base model $M_B$, suspect model $M_S$, validation negative suspect model set $\{M_i\}_{i=1}^{k}$, and z-score $z$
Ensure: Verification result
1: For each $j = 1, \dots, n$, query $M_B$ wi...
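The algorithm excerpt above lists a validation set of negative models and a z-score among its inputs, but the text is cut off before the decision rule. The sketch below shows one plausible way such inputs could be combined: calibrate a threshold at the mean plus z standard deviations of the negative models' match rates. This is an assumption, not the paper's Algorithm 2, and the names `zscore_threshold` and `is_derived` and all numbers are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def zscore_threshold(negative_rates: Sequence[float], z: float) -> float:
    """Threshold = mean + z * sample standard deviation of negative match rates."""
    if len(negative_rates) < 2:
        raise ValueError("need at least two negative models to estimate spread")
    return mean(negative_rates) + z * stdev(negative_rates)


def is_derived(suspect_rate: float, negative_rates: Sequence[float], z: float = 3.0) -> bool:
    """Flag the suspect when its match rate exceeds the calibrated threshold.

    Assumed decision rule for illustration only; the source text is truncated.
    """
    return suspect_rate > zscore_threshold(negative_rates, z)


# Hypothetical match rates of known-unrelated models on the fingerprint prompts.
negative_rates = [0.48, 0.52, 0.50, 0.55, 0.45]
print(round(zscore_threshold(negative_rates, 3.0), 4))
print(is_derived(0.95, negative_rates))
print(is_derived(0.58, negative_rates))
```

A larger z trades detection power for a lower false positive rate, which is one reason a calibration set of unrelated models would matter for the reported near-zero false positives.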