
arxiv: 2509.25448 · v3 · pith:JDNRQ4CCnew · submitted 2025-09-29 · 💻 cs.CR · cs.CL

Fingerprinting LLMs via Prompt Injection

Pith reviewed 2026-05-21 21:22 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords LLM fingerprinting · prompt injection · model provenance · post-training robustness · quantization · black-box detection · token preference

The pith

Optimized prompt injections create fingerprints that identify whether one LLM derives from a specific base model even after post-training or quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLMPrint to solve the problem of tracing model origins when LLMs have already been released and then altered. It works by searching for prompts that force a model to prefer certain tokens in a consistent way, turning that preference pattern into a signature tied to the original model. These signatures hold up under common changes like fine-tuning or reducing precision because the optimization targets stable behaviors rather than fragile details. The approach supplies a single verification method that applies when partial or no internal access is available and includes statistical checks to control errors. A reader would care because it offers a practical route to provenance for the thousands of modified models now in circulation without requiring changes at release time.

Core claim

By optimizing fingerprint prompts to enforce consistent token preferences, LLMPrint obtains fingerprints that are both unique to the base model and robust to post-processing. The method includes a unified verification procedure applicable to gray-box and black-box settings with statistical guarantees. Tests across five base models and approximately 700 variants demonstrate high true positive rates and near-zero false positive rates.

What carries the argument

Fingerprint prompts optimized to enforce consistent token preferences, which turn the model's prompt-injection vulnerability into a stable, model-specific signature.
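
To make the idea of a token-preference signature concrete, here is a minimal sketch. It assumes gray-box access, so that for each fingerprint prompt we can read the model's scores for a pair of candidate tokens. The function names and the example scores are hypothetical illustrations, not the paper's implementation or data.

```python
from typing import List, Sequence, Tuple


def preference_bits(score_pairs: Sequence[Tuple[float, float]]) -> List[int]:
    """Turn per-prompt (score_plus, score_minus) pairs into a bit string.

    Bit j is 1 when the model scores the first token of pair j above the second.
    """
    return [1 if plus > minus else 0 for plus, minus in score_pairs]


def match_rate(base_bits: Sequence[int], suspect_bits: Sequence[int]) -> float:
    """Fraction of fingerprint prompts on which two models prefer the same token."""
    if len(base_bits) != len(suspect_bits) or not base_bits:
        raise ValueError("bit strings must be non-empty and of equal length")
    agree = sum(b == s for b, s in zip(base_bits, suspect_bits))
    return agree / len(base_bits)


# Hypothetical scores for four fingerprint prompts (illustrative numbers only).
base = preference_bits([(2.1, 0.3), (-0.5, 1.2), (0.9, 0.1), (3.0, 2.9)])
suspect = preference_bits([(1.8, 0.4), (-0.2, 0.7), (0.2, 0.6), (2.5, 2.0)])
print(base, suspect, match_rate(base, suspect))
```

A high match rate is only evidence of derivation if unrelated models match far less often, which is what the verification procedure has to establish.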

If this is right

  • Provenance checks become possible for models that were already released without any embedded markers.
  • Verification functions under both partial internal access and query-only access.
  • Low false-positive rates support reliable identification across hundreds of modified copies.
  • Statistical guarantees allow quantified confidence in each detection decision.
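
One way to read the last bullet is as a tail probability. The sketch below assumes a simple null model that the summary does not state: an unrelated model agrees with the base model on each fingerprint prompt independently with probability `p_null`. Under that assumption, the chance of seeing at least the observed number of agreements is a binomial tail. The function name and the default `p_null = 0.5` are illustrative choices.

```python
from math import comb


def binomial_tail(matches: int, n: int, p_null: float = 0.5) -> float:
    """Return P[X >= matches] for X ~ Binomial(n, p_null).

    Assumption: under the null, each of the n fingerprint prompts agrees
    independently with probability p_null.
    """
    if not 0 <= matches <= n:
        raise ValueError("matches must lie between 0 and n")
    return sum(
        comb(n, k) * p_null**k * (1.0 - p_null) ** (n - k)
        for k in range(matches, n + 1)
    )


# 18 agreements out of 20 prompts under the assumed null.
print(binomial_tail(18, 20))
```

With 18 of 20 agreements the tail is 211 / 2^20, roughly 2 in 10,000, so a decision rule that flags at this level would wrongly flag an unrelated model about that often, provided the independence assumption holds.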

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same signatures could support audits that check whether a deployed model is an unauthorized derivative of a licensed base.
  • Testing the method on additional post-processing such as weight pruning or model merging would clarify its limits.
  • If the optimization process is public, future post-training pipelines might deliberately disrupt the targeted token preferences.
  • Pairing these fingerprints with output-based similarity checks could create stronger combined evidence for model lineage.

Load-bearing premise

That token preferences fixed on the base model will stay distinctive and stable when the model undergoes post-processing steps that were not part of the optimization.

What would settle it

A collection of unrelated models that were never derived from the base yet produce the same consistent token outputs on the optimized fingerprint prompts would falsify the uniqueness claim.

Figures

Figures reproduced from arXiv: 2509.25448 by Cheng Hong, Mengyuan Li, Neil Gong, Osama Ahmed, Yuepeng Hu, Zhengyuan Jiang, Zhicong Huang.

Figure 1: Overview of LLMPrint.
Figure 2: Ablation studies of LLMPrint on Meta-Llama-3-8B. Results are reported on post-trained …
Original abstract

Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero. The code is publicly available at https://github.com/hifi-hyp/ACL-LLMPrint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This paper introduces LLMPrint, a detection framework for identifying whether an LLM is derived from a base model by optimizing fingerprint prompts that exploit prompt injection to enforce consistent token preferences. These fingerprints are asserted to be unique to the base model and robust to post-processing like post-training and quantization. A unified verification procedure is developed for gray-box and black-box settings with statistical guarantees. The method is evaluated on five base models and around 700 variants, achieving high true positive rates and near-zero false positive rates. The code is publicly available.

Significance. Should the robustness to post-processing be substantiated beyond the tested variants, this would represent a notable contribution to LLM security and provenance tracking, as it does not require modifying the model prior to release. The public code release is a positive aspect that supports reproducibility.

major comments (3)
  1. [Evaluation] The results on the 700 variants show high TPR and low FPR, but the manuscript lacks details on the optimization procedure used to construct the fingerprint prompts, the specific statistical tests applied, potential confounds in generating the variants, and the exact method for measuring false positives. These omissions hinder evaluation of the empirical support for the robustness claim (see Abstract and Evaluation section).
  2. [Method] The key assumption that fingerprints optimized for consistent token preferences on base models will remain distinctive after arbitrary post-processing is supported only by empirical testing on the ~700 variants. No theoretical argument or proof is provided that the optimization objective preserves invariance under fine-tuning or quantization schemes not included in the test set, which is load-bearing for the central robustness claim.
  3. [Verification Procedure] The unified verification procedure claims statistical guarantees, but the derivation of these guarantees and the assumptions (e.g., independence) are not clearly specified, particularly for the black-box case.
minor comments (2)
  1. [Abstract] Clarify the exact number of variants and their distribution between post-trained and quantized models.
  2. [Related Work] Ensure all relevant prior work on model fingerprinting and watermarking is cited for completeness.
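
The first major comment asks how false positives are measured. As a reference point, the sketch below shows the standard definitions of true and false positive rate over labelled (base, suspect) pairs. It also shows the exact one-sided upper confidence bound on a rate when zero events are observed in independent trials, which is one way to put a number on "near-zero". The labels, verdicts, and function names are hypothetical, and the independence of trials is an assumption.

```python
from typing import Sequence, Tuple


def detection_rates(
    is_derived: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Return (TPR, FPR) for ground-truth labels and detector verdicts."""
    if len(is_derived) != len(flagged):
        raise ValueError("one verdict per labelled pair is required")
    positives = sum(is_derived)
    negatives = len(is_derived) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one derived and one unrelated pair")
    true_pos = sum(d and f for d, f in zip(is_derived, flagged))
    false_pos = sum((not d) and f for d, f in zip(is_derived, flagged))
    return true_pos / positives, false_pos / negatives


def zero_event_upper_bound(trials: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) upper bound on a rate when 0 events occur in
    `trials` independent trials (Clopper-Pearson with zero successes)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return 1.0 - alpha ** (1.0 / trials)


# Hypothetical labels and verdicts for seven pairs.
labels = [True, True, True, False, False, False, False]
verdicts = [True, True, False, False, False, True, False]
print(detection_rates(labels, verdicts))
print(zero_event_upper_bound(100))
```

The second function shows why the number of negative pairs matters: zero false positives over 100 unrelated pairs only bounds the false positive rate below about 3% at 95% confidence.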

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive review. We appreciate the focus on evaluation details, the nature of our robustness claims, and the clarity of statistical guarantees. We have revised the manuscript to address the first and third comments by adding substantial detail and derivations. For the second comment, we have expanded the discussion while noting that our contribution is primarily empirical.

point-by-point responses
  1. Referee: [Evaluation] The results on the 700 variants show high TPR and low FPR, but the manuscript lacks details on the optimization procedure used to construct the fingerprint prompts, the specific statistical tests applied, potential confounds in generating the variants, and the exact method for measuring false positives. These omissions hinder evaluation of the empirical support for the robustness claim (see Abstract and Evaluation section).

    Authors: We agree that these details were insufficiently specified. In the revised manuscript we have expanded the Evaluation section with: (1) a complete description of the prompt optimization procedure, including the objective, search algorithm, and hyperparameters; (2) the precise statistical tests (binomial test on token preference consistency with explicit p-value threshold and multiple-testing correction); (3) how the ~700 variants were sourced and any controls applied to reduce confounds such as shared training data or architecture similarity; and (4) the exact false-positive protocol, including the number of cross-model negative pairs and the decision rule. We have also added pseudocode and additional tables summarizing these choices. revision: yes

  2. Referee: [Method] The key assumption that fingerprints optimized for consistent token preferences on base models will remain distinctive after arbitrary post-processing is supported only by empirical testing on the ~700 variants. No theoretical argument or proof is provided that the optimization objective preserves invariance under fine-tuning or quantization schemes not included in the test set, which is load-bearing for the central robustness claim.

    Authors: We acknowledge that the central robustness claim rests on empirical evidence rather than a formal invariance proof. The optimization objective is deliberately chosen to capture low-level, architecture- and training-induced token biases that are difficult to erase without fundamentally altering the model. In the revision we have added a dedicated Discussion subsection that (a) articulates why these biases are expected to survive common post-processing pipelines, (b) enumerates the diversity of the tested variants (multiple fine-tuning recipes, quantization bit-widths, and model scales), and (c) explicitly states the limitation that untested schemes could in principle break the fingerprint. We maintain that the breadth of the empirical evaluation constitutes the primary support for the claim at this stage. revision: partial

  3. Referee: [Verification Procedure] The unified verification procedure claims statistical guarantees, but the derivation of these guarantees and the assumptions (e.g., independence) are not clearly specified, particularly for the black-box case.

    Authors: We have revised the Verification Procedure section to include the full derivation. We now state the modeling assumptions explicitly (conditional independence of token samples given the injected prompt, justified by the prompt-injection mechanism forcing deterministic preference), derive the test statistic and its distribution under the null for both gray-box (logit access) and black-box (output-only) settings, and provide the exact p-value computation, including any asymptotic approximations used in the black-box case. A short appendix supplies the complete proof steps. revision: yes
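
The first response mentions a multiple-testing correction without naming one. As an illustration only, the sketch below applies the Holm step-down procedure to a list of p-values, such as one p-value per suspect model. Choosing Holm is an assumption made here for the example; the rebuttal does not say which correction the authors use.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure: controls the family-wise error rate at alpha.

    Returns one flag per input p-value, True where the null is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values for four suspect models.
print(holm_reject([0.001, 0.04, 0.03, 0.2]))
```

In this example only the first suspect is flagged: 0.001 clears the threshold 0.05 / 4, while 0.03 exceeds 0.05 / 3, so the procedure stops there.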

standing simulated objections not resolved
  • A formal theoretical proof that the optimization objective preserves distinctiveness under arbitrary post-processing schemes not represented in the evaluated set

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external variants

full rationale

The paper's core procedure optimizes fingerprint prompts on base models to induce consistent token preferences and then evaluates distinctiveness and robustness via direct testing on ~700 post-trained or quantized variants. This constitutes an empirical measurement against held-out external models rather than any derivation that reduces by construction to fitted inputs, self-definitions, or self-citation chains. No equations or claims in the provided text equate a prediction to its optimization objective; statistical guarantees are presented as arising from the verification procedure applied to the test set. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that LLMs exhibit exploitable consistent token preferences under injection and that optimization can produce unique identifiers stable to post-processing; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: LLMs have an inherent vulnerability to prompt injection that can be exploited to enforce consistent token preferences unique to the base model
    Key insight stated in the abstract as the foundation for fingerprint construction.

pith-pipeline@v0.9.0 · 5740 in / 1093 out tokens · 71855 ms · 2026-05-21T21:22:35.073888+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

  1. Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132.

  2. Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, et al. Stealing part of a production language model. arXiv preprint arXiv:2403.06634.

  3. Thibaud Gloaguen, Robin Staab, Nikola Jovanović, and Martin Vechev. Robust llm fingerprinting via domain-specific watermarks. arXiv preprint arXiv:2505.16723.

  4. Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, and Seong Joon Oh. Trap: Targeted random adversarial prompt honeypot for black-box identification. arXiv preprint arXiv:2402.12991.

  5. Yuqi Jia, Zedian Shao, Yupei Liu, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. A critical evaluation of defenses against prompt injection attacks. arXiv preprint arXiv:2505.18333.

  6. Microsoft. Introducing microsoft 365 copilot – your copilot for work. https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work/

  7. Ivica Nikolic, Teodora Baluta, and Prateek Saxena. Model provenance testing for large language models. arXiv preprint arXiv:2502.00706.

  8. Zhenzhen Ren, GuoBiao Li, Sheng Li, Zhenxing Qian, and Xinpeng Zhang. Cotsrf: Utilize chain of thought as stealthy and robust fingerprint of large language models. arXiv preprint arXiv:2505.16785.

  9. Shida Wang, Chaohu Liu, Yubo Wang, and Linli Xu. Fpedit: Robust llm fingerprinting through localized knowledge editing. arXiv preprint arXiv:2508.02092.

  10. Peng Wanli, Xue Yiming, et al. Imf: Implicit fingerprint for large language models. arXiv preprint arXiv:2503.21805.

  11. Simon Willison. Prompt injection attacks against gpt-3. https://simonwillison.net/2022/Sep/12/prompt-injection/

  12. Simon Willison. Delimiters won’t save you from prompt injection. https://simonwillison.net/2023/May/11/delimiters-wont-save-you

  13. Jiaxuan Wu, Yinghan Zhou, Wanli Peng, Yiming Xue, Juan Wen, and Ping Zhong. Editmf: Drawing an invisible fingerprint for your large language models. arXiv preprint arXiv:2508.08836, 2025a.
      Zehao Wu, Yanjie Zhao, and Haoyu Wang. Gradient-based model fingerprinting for llm similarity detection and family classification. arXiv preprint arXiv:2506.01631, 202...

  14. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

  15. … from a curated set of common categories (e.g., animals, fruits, colors), with …

      Algorithm 2: Fingerprint Verification
      Require: fingerprint prompt set $\{p_j\}_{j=1}^{n}$, token pair set $\{(w_j^{+}, w_j^{-})\}_{j=1}^{n}$, base model $M_B$, suspect model $M_S$, validation negative suspect model set $\{M_i\}_{i=1}^{k}$, and z-score $z$
      Ensure: verification result
      1: For each $j = 1, \dots, n$, query $M_B$ with $p_j$ to obtain its preference on $(w_j^{+}, w_j^{-})$ and set $b_j \leftarrow \mathbb{1}[w_j^{+} \text{ preferred over } w_j^{-}]$
      …
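
The verification listing above is truncated after its first step, so the remaining steps are not known from this page. The sketch below is only one plausible way the listed inputs could fit together, and it should not be read as the paper's algorithm. It assumes that the suspect's agreement with the base bits is compared against a threshold set `z` standard deviations above the mean agreement of the validation negative models. All function names and example bit strings are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def agreement(base_bits: Sequence[int], other_bits: Sequence[int]) -> float:
    """Fraction of fingerprint prompts where another model matches the base bits b_j."""
    if len(base_bits) != len(other_bits) or not base_bits:
        raise ValueError("bit strings must be non-empty and of equal length")
    return sum(b == o for b, o in zip(base_bits, other_bits)) / len(base_bits)


def verify(
    base_bits: Sequence[int],
    suspect_bits: Sequence[int],
    negative_bits: Sequence[Sequence[int]],
    z: float = 3.0,
) -> Tuple[bool, float]:
    """Assumed decision rule: flag the suspect when its agreement exceeds
    mean + z * std of the agreements of known-unrelated validation models."""
    negative_rates: List[float] = [agreement(base_bits, bits) for bits in negative_bits]
    threshold = mean(negative_rates) + z * pstdev(negative_rates)
    return agreement(base_bits, suspect_bits) > threshold, threshold


# Hypothetical bit strings for ten fingerprint prompts.
base_bits = [1] * 10
negatives = [[1] * 5 + [0] * 5, [1] * 4 + [0] * 6, [1] * 6 + [0] * 4]
derived_suspect = [1] * 9 + [0]
unrelated_suspect = [1] * 6 + [0] * 4
print(verify(base_bits, derived_suspect, negatives))
print(verify(base_bits, unrelated_suspect, negatives))
```

Calibrating the threshold on unrelated models ties the false positive rate to observed behaviour, but the resulting guarantee is only as good as the validation set's coverage of the models a verifier will meet in practice.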