pith. sign in

arxiv: 2510.18333 · v2 · submitted 2025-10-21 · 💻 cs.CR · cs.CL

Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption

Pith reviewed 2026-05-18 05:21 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords LLM watermarkingincentive alignmentstakeholder incentivesin-context watermarkingmisuse detectionpractical adoptionmodel watermarking
0
0 comments X

The pith

LLM watermarking sees real adoption only when stakeholder incentives align in targeted domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Technical progress in watermarking large language models has not translated into widespread use. The paper identifies misaligned incentives among providers, platforms, and users as the root cause, expressed through competitive risks for companies, difficulties governing detection tools, and challenges in attributing misuse. It examines three watermarking approaches under this lens and highlights in-context watermarking as a promising case where trusted parties embed instructions that produce detectable signals only when documents are misused with an LLM. This setup avoids quality penalties for users and keeps providers neutral. The authors outline design principles for incentive-aligned methods and call for community engagement to develop practical tools in suitable settings.

Core claim

The limited real-world deployment of LLM watermarking stems from misaligned incentives among providers, platforms, and end users that create three barriers—competitive risk, detection-tool governance, and attribution issues—rather than from shortcomings in the algorithms themselves. Model watermarking fits provider goals but struggles in open-source environments. LLM text watermarking offers only modest benefits unless scoped narrowly, such as for dataset decontamination. In-context watermarking aligns incentives by letting trusted parties like educators or organizers embed hidden instructions; LLM outputs from those inputs then carry detectable watermarks indicating misuse, with no quality,

What carries the argument

In-context watermarking (ICW), in which trusted parties embed hidden watermark instructions into source documents so that any LLM output derived from them carries a detectable signal of misuse.

If this is right

  • Model watermarking aligns with LLM provider interests but faces additional challenges when models are open-sourced.
  • LLM text watermarking gains traction mainly in narrow applications such as dataset decontamination or user-controlled provenance tracking.
  • In-context watermarking lets trusted parties detect misuse while users experience no quality degradation and providers stay neutral.
  • Watermarking methods should be designed around domain-specific incentive structures rather than as general-purpose anti-misuse tools.
  • Active community participation is required to refine and standardize these incentive-aligned approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same incentive-alignment logic could guide deployment of other AI detection or provenance tools beyond watermarking.
  • In education, ICW-style methods might allow responsible LLM use while giving instructors reliable signals of unauthorized assistance.
  • Platforms could adopt common ICW instruction formats to enable consistent detection across multiple LLM providers.

Load-bearing premise

The three barriers arising from misaligned incentives—competitive risk, detection-tool governance, and attribution issues—are the dominant obstacles to adoption instead of technical limitations in existing watermarking methods.

What would settle it

A controlled deployment of in-context watermarking in a specific domain such as academic assignments, measuring whether misuse detection rates increase and legitimate use remains unaffected compared to no watermarking.

Figures

Figures reproduced from arXiv: 2510.18333 by Dawn Song, Gregory W. Wornell, Xuandong Zhao, Yepeng Liu, Yuheng Bu.

Figure 1
Figure 1. Figure 1: Example of model watermarking: an adversary fine-tunes, prunes, or illegally uses a protected [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Incentive model for model wa￾termarking among IP Owners, platforms, and users. In the Model-as-a-Service (MaaS) setting (e.g., ChatGPT), LLM developers host their own APIs without an interme￾diary platform. Growing usage boosts their visibility, rep￾utation, and subscription revenue. Adversaries, however, can erode this value by extracting large volumes of data, distilling the models, or deploying unauthor… view at source ↗
Figure 3
Figure 3. Figure 3: Broken Incentive Model for LLM Text Watermarking: Users may switch to unwatermarked models, under￾mining both the LLM provider’s interests and the intended goal of reducing misuse. We analyze the incentive model ( [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of two LLM text watermarking use cases. Left: Watermarking implemented [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of In-Context Watermark. One promising approach is to modify the input to the LLM. Given that many lazy reviewers (or students) paste documents directly into an LLM for summarization or drafting, the documents can be embedded with imperceptible in-context watermarking instructions. These signals subtly influence the LLM’s output, allowing downstream detection without altering the model or disrupti… view at source ↗
Figure 6
Figure 6. Figure 6: Incentive model for model wa￾termarking among trusted parties, tech￾nology providers, and users. The incentive model ( [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Illustration of Green/Red list LLM text Watermarking [ [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
read the original abstract

Despite progress in watermarking algorithms for large language models (LLMs), real-world deployment remains limited. We argue that this gap stems from misaligned incentives among LLM providers, platforms, and end users, which manifest as three key barriers: competitive risk, detection-tool governance, and attribution issues. We revisit three classes of watermarking through this lens. \emph{Model watermarking} naturally aligns with LLM provider interests, yet faces new challenges in open-source ecosystems. \emph{LLM text watermarking} offers modest provider benefit when framed solely as an anti-misuse tool, but can gain traction in narrowly scoped settings such as dataset de-contamination or user-controlled provenance. \emph{In-context watermarking} (ICW) is tailored for trusted parties, such as conference organizers or educators, who embed hidden watermarking instructions into documents. If a dishonest reviewer or student submits this text to an LLM, the output carries a detectable watermark indicating misuse. This setup aligns incentives: users experience no quality loss, trusted parties gain a detection tool, and LLM providers remain neutral by simply following watermark instructions. We advocate for a broader exploration of incentive-aligned methods, with ICW as an example, in domains where trusted parties need reliable tools to detect misuse. More broadly, we distill design principles for incentive-aligned, domain-specific watermarking and outline future research directions. Our position is that the practical adoption of LLM watermarking requires aligning stakeholder incentives in targeted application domains and fostering active community engagement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper argues that limited real-world deployment of LLM watermarking stems from misaligned incentives among providers, platforms, and users, which create three barriers: competitive risk, detection-tool governance, and attribution issues. It revisits model watermarking (aligns with providers but challenged in open-source settings), LLM text watermarking (limited benefit as anti-misuse tool but potentially useful in scoped settings like de-contamination), and in-context watermarking (ICW), where trusted parties embed instructions so that LLM outputs from misused documents carry detectable signals. ICW is presented as incentive-aligned with no quality loss to users and neutrality for providers. The position advocates exploring incentive-aligned, domain-specific methods, distills design principles, and calls for community engagement to drive adoption.

Significance. If the incentive-alignment thesis holds, the paper offers a useful reframing that could redirect research effort toward domain-specific designs rather than universal technical solutions. The ICW example and distilled design principles provide a concrete starting point for trusted-party scenarios such as academic integrity or conference review processes, potentially increasing the chance of targeted deployments where stakeholder interests already converge.

major comments (2)
  1. [Introduction / Barriers section] Introduction and § on barriers: The central claim that competitive risk, detection-tool governance, and attribution issues are the dominant reasons for limited deployment (rather than technical factors such as robustness, quality degradation, or false-positive rates) is asserted without deployment case studies, provider surveys, or failure-mode analyses. This assumption is load-bearing for the argument that incentive realignment is the primary path to adoption.
  2. [In-Context Watermarking section] § on In-Context Watermarking: The statement that ICW produces 'no quality loss' for users and is technically viable is presented as given, yet the manuscript supplies no analysis, reference to existing ICW implementations, or discussion of detection reliability/false-positive rates in the targeted domains. This weakens the claim that ICW already satisfies the technical preconditions for incentive alignment.
minor comments (1)
  1. [Abstract] The abstract and main text use 'ICW' without an initial definition on first use; adding an explicit parenthetical expansion would improve readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their constructive comments, which help us improve the clarity and rigor of our position paper. We address the major comments below, indicating the revisions we intend to make.

read point-by-point responses
  1. Referee: Introduction and § on barriers: The central claim that competitive risk, detection-tool governance, and attribution issues are the dominant reasons for limited deployment (rather than technical factors such as robustness, quality degradation, or false-positive rates) is asserted without deployment case studies, provider surveys, or failure-mode analyses. This assumption is load-bearing for the argument that incentive realignment is the primary path to adoption.

    Authors: We thank the referee for highlighting this point. As a position paper, our argument is grounded in an analysis of stakeholder incentives drawn from the current state of LLM deployment and watermarking literature, rather than presenting new empirical data. We recognize that providing additional supporting evidence, such as references to reported challenges in deployment, would strengthen the manuscript. In the revision, we will expand the barriers section to include specific examples from public discussions and papers on watermarking adoption barriers, while maintaining that incentive misalignment is a critical factor. This constitutes a partial revision as we do not claim to have conducted original surveys. revision: partial

  2. Referee: § on In-Context Watermarking: The statement that ICW produces 'no quality loss' for users and is technically viable is presented as given, yet the manuscript supplies no analysis, reference to existing ICW implementations, or discussion of detection reliability/false-positive rates in the targeted domains. This weakens the claim that ICW already satisfies the technical preconditions for incentive alignment.

    Authors: We agree with the referee that more details on ICW would be beneficial. The current manuscript presents ICW primarily as a conceptual framework to illustrate incentive alignment in trusted-party scenarios. To address the concern, we will revise the section to include references to existing prompt-based or instruction-following techniques that can serve as the basis for ICW, discuss how detection can be reliable in narrow domains (e.g., by combining with other signals), and clarify that 'no quality loss' means the watermarking instructions do not degrade the LLM's performance for the intended user task. We will also note open questions regarding false-positive rates as directions for future research. This will be incorporated in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: position paper analyzes incentives without self-referential reductions

full rationale

The paper is a position piece that identifies three incentive barriers (competitive risk, detection-tool governance, attribution issues) and revisits watermarking classes through that external lens, proposing ICW as an incentive-aligned example. No equations, fitted parameters, predictions, or derivations appear in the provided text. Claims rest on reasoning about stakeholder interests rather than reducing to self-definitions, self-citations as load-bearing premises, or renaming known results. The central argument remains independent of any internal construction that would force equivalence to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central position depends on the premise that incentive misalignment, rather than technical limits, is the primary adoption barrier, plus standard assumptions about how providers, platforms, and users behave in deployment settings.

axioms (1)
  • domain assumption The three key barriers—competitive risk, detection-tool governance, and attribution issues—are the main obstacles preventing real-world deployment of LLM watermarking.
    Explicitly stated in the abstract as the reason for the adoption gap.

pith-pipeline@v0.9.0 · 5810 in / 1178 out tokens · 45801 ms · 2026-05-18T05:21:33.588248+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We argue that this gap stems from misaligned incentives among LLM providers, platforms, and end users, which manifest as four key barriers: competitive risk, detection-tool governance, robustness concerns and attribution issues.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Watermarking Should Be Treated as a Monitoring Primitive

    cs.CR 2026-05 unverdicted novelty 6.0

    Watermarking enables entity-level attribution and monitoring via signal aggregation across outputs, even in zero-bit designs, revealing a fundamental tension with attribution goals.

  2. Watermarking Should Be Treated as a Monitoring Primitive

    cs.CR 2026-05 conditional novelty 6.0

    Watermarking enables entity-level attribution and monitoring through signal aggregation even in zero-bit designs, creating an unavoidable dual-use tension between attribution and surveillance.

  3. Fundamental Trade-Offs in Multi-Bit Watermarking of Stochastic Processes

    cs.IT 2026-05 unverdicted novelty 5.0

    Derives matched converse and achievability bounds that characterize optimal trade-offs among false-alarm probability, detection error probability, distortion, and information rate for multi-bit watermarking of station...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 2 Pith papers · 6 internal anchors

  1. [1]

    Is my data in your retrieval database? membership inference attacks against retrieval augmented generation,

    [AAG24] Maya Anderson, Guy Amit, and Abigail Goldsteen. Is my data in your retrieval database? membership inference attacks against retrieval augmented generation.arXiv preprint arXiv:2405.20446,

  2. [2]

    Watermarking of large language models

    [Aar23] Scott Aaronson. Watermarking of large language models. https://simons.berkeley. edu/talks/scott-aaronson-ut-austin-openai-2023-08-17 ,

  3. [3]

    [ALL+25] Li An, Yujian Liu, Yepeng Liu, Yang Zhang, Yuheng Bu, and Shiyu Chang

    Accessed: 2023-08. [ALL+25] Li An, Yujian Liu, Yepeng Liu, Yang Zhang, Yuheng Bu, and Shiyu Chang. Defending LLM watermarking against spoofing attacks with contrastive representation learning.arXiv preprint arXiv:2504.06575,

  4. [4]

    Multi-bit distortion-free watermarking for large language models.arXiv preprint arXiv:2402.16578,

    [BJZM24] Massieh Kordi Boroujeny, Ya Jiang, Kai Zeng, and Brian Mark. Multi-bit distortion-free watermarking for large language models.arXiv preprint arXiv:2402.16578,

  5. [5]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

    [BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

  6. [6]

    A watermark for black-box language models.arXiv preprint arXiv:2410.02099,

    [BWAM24] Dara Bahri, John Wieting, Dana Alon, and Donald Metzler. A watermark for black-box language models.arXiv preprint arXiv:2410.02099,

  7. [7]

    org/CorpusID:261660497

    [CBZ+23] Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang. On the possibilities of ai-generated text detection.arXiv preprint arXiv:2304.04736,

  8. [8]

    Trojanrag: Retrieval-augmented genera- tion can be backdoor driver in large language mod- els,

    [CDJ+24] Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, and Gongshen Liu. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models.arXiv preprint arXiv:2405.13401,

  9. [9]

    Provably robust watermarks for open-source language models.arXiv preprint arXiv:2410.18861,

    [CGMR24] Miranda Christ, Sam Gunn, Tal Malkin, and Mariana Raykova. Provably robust watermarks for open-source language models.arXiv preprint arXiv:2410.18861,

  10. [10]

    Undetectable watermarks for language models

    [CGZ23] Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. arXiv preprint arXiv:2306.09194,

  11. [11]

    Postmark: A robust blackbox watermark for large language models.arXiv preprint arXiv:2406.14517,

    10 [CKH+24] Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Wieting, and Mohit Iyyer. Postmark: A robust blackbox watermark for large language models.arXiv preprint arXiv:2406.14517,

  12. [12]

    Feder Cooper, Katherine Lee, James Grimmelmann, Daphne Ippolito, Christo- pher Callison-Burch, Christopher A

    [CLG+23] A Feder Cooper, Katherine Lee, James Grimmelmann, Daphne Ippolito, Christopher Callison-Burch, Christopher A Choquette-Choo, Niloofar Mireshghallah, Miles Brundage, David Mimno, Madiha Zahrah Choksi, et al. Report of the 1st workshop on generative ai and law.arXiv preprint arXiv:2311.06477,

  13. [13]

    Phantom: General trigger attacks on retrieval augmented language generation,

    [CSA+24] Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, and Alina Oprea. Phantom: General trigger attacks on retrieval augmented language generation.arXiv preprint arXiv:2405.20485,

  14. [14]

    Improved unbiased watermark for large language models.arXiv preprint arXiv:2502.11268,

    [CWGH25] Ruibo Chen, Yihan Wu, Junfeng Guo, and Heng Huang. Improved unbiased watermark for large language models.arXiv preprint arXiv:2502.11268,

  15. [15]

    Robust data watermarking in language models by injecting fictitious knowledge.arXiv preprint arXiv:2503.04036,

    [CWSJ25] Xinyue Cui, Johnny Tian-Zheng Wei, Swabha Swayamdipta, and Robin Jia. Robust data watermarking in language models by injecting fictitious knowledge.arXiv preprint arXiv:2503.04036,

  16. [16]

    [CYS+24] Jiacheng Cai, Jiahao Yu, Yangguang Shao, Yuhang Wu, and Xinyu Xing

    Issued 10 Jul 2023, effective 15 Aug 2023; requires explicit labels and implicit watermarks. [CYS+24] Jiacheng Cai, Jiahao Yu, Yangguang Shao, Yuhang Wu, and Xinyu Xing. Utf: Under- trained tokens as fingerprints a novel approach to LLM identification.arXiv preprint arXiv:2410.12318,

  17. [17]

    Optimizing adaptive attacks against content watermarks for language models.arXiv preprint arXiv:2410.02440,

    [DAL24] Abdulrahman Diaa, Toluwani Aremu, and Nils Lukas. Optimizing adaptive attacks against content watermarks for language models.arXiv preprint arXiv:2410.02440,

  18. [18]

    A Survey on In-context Learning

    [DLD+22] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning.arXiv preprint arXiv:2301.00234,

  19. [19]

    [Eur24] European Union. Regulation (eu) 2024/1689 of the european parliament and of the council of 13 march 2024 laying down harmonised rules on artificial intelligence and amending certain union legislative acts (artificial intelligence act),

  20. [20]

    Gumbelsoft: Diversified language model watermarking via the gumbelmax-trick

    11 [FZY+24] Jiayi Fu, Xuandong Zhao, Ruihan Yang, Yuansen Zhang, Jiangjie Chen, and Yanghua Xiao. Gumbelsoft: Diversified language model watermarking via the gumbelmax-trick. arXiv preprint arXiv:2402.12948,

  21. [21]

    Towards possibilities & impossibilities of ai-generated text detection: A survey.arXiv preprint arXiv:2310.15264,

    [GCG+23] Soumya Suvra Ghosal, Souradip Chakraborty, Jonas Geiping, Furong Huang, Dinesh Manocha, and Amrit Singh Bedi. Towards possibilities & impossibilities of ai-generated text detection: A survey.arXiv preprint arXiv:2310.15264,

  22. [22]

    The Llama 3 Herd of Models

    [GDJ+24] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  23. [23]

    Watermax: breaking the llm watermark detectability-robustness-quality trade-off.arXiv preprint arXiv:2403.04808,

    [GF24] Eva Giboulot and Teddy Furon. Watermax: breaking the LLM watermark detectability- robustness-quality trade-off.arXiv preprint arXiv:2403.04808,

  24. [24]

    Black-box detection of language model watermarks.arXiv preprint arXiv:2405.20777,

    [GJSV24] Thibaud Gloaguen, Nikola Jovanovi´ c, Robin Staab, and Martin Vechev. Black-box detection of language model watermarks.arXiv preprint arXiv:2405.20777,

  25. [25]

    On the learnability of watermarks for language models.arXiv preprint arXiv:2312.04469,

    [GLLH23] Chenchen Gu, Xiang Lisa Li, Percy Liang, and Tatsunori Hashimoto. On the learnability of watermarks for language models.arXiv preprint arXiv:2312.04469,

  26. [26]

    Unbiased watermark for large language models

    [HCW+23] Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. Unbiased watermark for large language models.arXiv preprint arXiv:2310.10669,

  27. [27]

    Universally optimal watermarking schemes for llms: from theory to practice.arXiv preprint arXiv:2410.02890, 2024

    [HLW+24] Haiyun He, Yepeng Liu, Ziqiao Wang, Yongyi Mao, and Yuheng Bu. Universally optimal watermarking schemes for LLMs: from theory to practice.arXiv preprint arXiv:2410.02890,

  28. [28]

    Distributional informa- tion embedding: A framework for multi-bit watermarking.arXiv preprint arXiv:2501.16558,

    [HLW+25] Haiyun He, Yepeng Liu, Ziqiao Wang, Yongyi Mao, and Yuheng Bu. Distributional informa- tion embedding: A framework for multi-bit watermarking.arXiv preprint arXiv:2501.16558,

  29. [29]

    Token-specific watermarking with enhanced detectability and semantic coherence for large language models.arXiv preprint arXiv:2402.18059,

    [HSL+24] Mingjia Huo, Sai Ashish Somayajula, Youwei Liang, Ruisi Zhang, Farinaz Koushanfar, and Pengtao Xie. Token-specific watermarking with enhanced detectability and semantic coherence for large language models.arXiv preprint arXiv:2402.18059,

  30. [30]

    Semstamp: A semantic watermark with paraphrastic robustness for text generation.arXiv preprint arXiv:2310.03991,

    [HZH+23] Abe Bohan Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov. Semstamp: A semantic watermark with paraphrastic robustness for text generation.arXiv preprint arXiv:2310.03991,

  31. [31]

    Ward: Provable rag dataset inference via LLM watermarks.arXiv preprint arXiv:2410.03537,

    [JSBV24] Nikola Jovanovi´ c, Robin Staab, Maximilian Baader, and Martin Vechev. Ward: Provable rag dataset inference via LLM watermarks.arXiv preprint arXiv:2410.03537,

  32. [32]

    Watermark stealing in large language models

    [JSV24] Nikola Jovanovi´ c, Robin Staab, and Martin Vechev. Watermark stealing in large language models.arXiv preprint arXiv:2402.19361,

  33. [33]

    An overview of large language models for statisticians.arXiv preprint arXiv:2502.17814,

    [JYG+25] Wenlong Ji, Weizhe Yuan, Emily Getzen, Kyunghyun Cho, Michael I Jordan, Song Mei, Jason E Weston, Weijie J Su, Jing Xu, and Linjun Zhang. An overview of large language models for statisticians.arXiv preprint arXiv:2502.17814,

  34. [34]

    Robust distortion- free watermarks for language models.arXiv preprint arXiv:2307.15593, 2023

    [KTHL23] Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models.arXiv preprint arXiv:2307.15593,

  35. [35]

    Adaptive text watermark for large language models.arXiv preprint arXiv:2401.13927,

    [LB24] Yepeng Liu and Yuheng Bu. Adaptive text watermark for large language models.arXiv preprint arXiv:2401.13927,

  36. [36]

    Who wrote this code? watermarking for code generation.arXiv preprint arXiv:2305.15060,

    [LHA+23] Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, and Gunhee Kim. Who wrote this code? watermarking for code generation.arXiv preprint arXiv:2305.15060,

  37. [37]

    Watermarking text data on large language models for dataset copyright.arXiv preprint arXiv:2305.13257,

    [LHC+23] Yixin Liu, Hongsheng Hu, Xun Chen, Xuyun Zhang, and Lichao Sun. Watermarking text data on large language models for dataset copyright.arXiv preprint arXiv:2305.13257,

  38. [38]

    Trojtext: Test-time invisible textual trojan insertion

    [LLF23] Qian Lou, Yepeng Liu, and Bo Feng. Trojtext: Test-time invisible textual trojan insertion. arXiv preprint arXiv:2303.02242,

  39. [39]

    An unforgeable publicly verifiable watermark for large language models

    [LPH+23] Aiwei Liu, Leyi Pan, Xuming Hu, Shu’ang Li, Lijie Wen, Irwin King, and Philip S Yu. An unforgeable publicly verifiable watermark for large language models.arXiv preprint arXiv:2307.16230,

  40. [40]

    Robust detection of watermarks for large language models under human edits.arXiv preprint arXiv:2411.13868,

    [LRW+24] Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J Su. Robust detection of watermarks for large language models under human edits.arXiv preprint arXiv:2411.13868,

  41. [41]

    In-Context Watermarks for Large Language Models

    [LZK+25] Yepeng Liu, Xuandong Zhao, Christopher Kruegel, Dawn Song, and Yuheng Bu. In-context watermarks for large language models.arXiv preprint arXiv:2505.16934,

  42. [42]

    Mask-based membership inference attacks for retrieval-augmented generation

    [LZL25] Mingrui Liu, Sixiao Zhang, and Cheng Long. Mask-based membership inference attacks for retrieval-augmented generation. InProceedings of the ACM on Web Conference 2025, pages 2894–2907,

  43. [43]

    Dataset protection via watermarked canaries in retrieval-augmented LLMs.arXiv preprint arXiv:2502.10673,

    [LZSB25] Yepeng Liu, Xuandong Zhao, Dawn Song, and Yuheng Bu. Dataset protection via watermarked canaries in retrieval-augmented LLMs.arXiv preprint arXiv:2502.10673,

  44. [44]

    Can LLMs follow simple rules?arXiv preprint arXiv:2311.04235,

    13 [MCW+23] Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, and David Wagner. Can LLMs follow simple rules?arXiv preprint arXiv:2311.04235,

  45. [45]

    Improving your model ranking on chatbot arena by vote rigging.arXiv preprint arXiv:2501.17858,

    [MPD+25] Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, and Min Lin. Improving your model ranking on chatbot arena by vote rigging.arXiv preprint arXiv:2501.17858,

  46. [46]

    Scalable fingerprinting of large language models.arXiv preprint arXiv:2502.07760,

    [NHB+25] Anshul Nasery, Jonathan Hayase, Creston Brooks, Peiyao Sheng, Himanshu Tyagi, Pramod Viswanath, and Sewoong Oh. Scalable fingerprinting of large language models.arXiv preprint arXiv:2502.07760,

  47. [47]

    No free lunch in LLM watermarking: Trade-offs in watermarking design choices.arXiv preprint arXiv:2402.16187,

    [PHZS24] Qi Pang, Shengyuan Hu, Wenting Zheng, and Virginia Smith. No free lunch in LLM watermarking: Trade-offs in watermarking design choices.arXiv preprint arXiv:2402.16187,

  48. [48]

    Markllm: An open-source toolkit for llm watermarking.arXiv preprint arXiv:2405.10051,

    [PLH+24] Leyi Pan, Aiwei Liu, Zhiwei He, Zitian Gao, Xuandong Zhao, Yijian Lu, Binglin Zhou, Shuliang Liu, Xuming Hu, Lijie Wen, et al. MarkLLM: An open-source toolkit for LLM watermarking.arXiv preprint arXiv:2405.10051,

  49. [49]

    Detecting LLM-written peer reviews.arXiv preprint arXiv:2503.15772,

    [RKLS25] Vishisht Rao, Aounon Kumar, Himabindu Lakkaraju, and Nihar B Shah. Detecting LLM-written peer reviews.arXiv preprint arXiv:2503.15772,

  50. [50]

    ArXiv:2407.10887 [cs]

    [RS24] Mark Russinovich and Ahmed Salem. Hey, that’s my model! introducing chain & hash, an LLM fingerprinting technique.arXiv preprint arXiv:2407.10887,

  51. [51]

    A robust semantics-based watermark for large language model against paraphrasing

    [RXL+23] Jie Ren, Han Xu, Yiding Liu, Yingqian Cui, Shuaiqiang Wang, Dawei Yin, and Jiliang Tang. A robust semantics-based watermark for large language model against paraphrasing. arXiv preprint arXiv:2311.08721,

  52. [52]

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R

    14 [SNW+25] Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’Souza, Sayash Kapoor, Ahmet ¨Ust¨ un, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, et al. The leaderboard illusion. arXiv preprint arXiv:2504.20879,

  53. [53]

    Embarrassingly simple text watermarks.arXiv preprint arXiv:2310.08920,

    [STB+23] Ryoma Sato, Yuki Takezawa, Han Bao, Kenta Niwa, and Makoto Yamada. Embarrassingly simple text watermarks.arXiv preprint arXiv:2310.08920,

  54. [54]

    Proving membership in LLM pretraining data via data watermarks.arXiv preprint arXiv:2402.10892,

    [WWJ24] Johnny Tian-Zheng Wei, Ryan Yixiang Wang, and Robin Jia. Proving membership in LLM pretraining data via data watermarks.arXiv preprint arXiv:2402.10892,

  55. [55]

    Towards codable watermarking for injecting multi-bits information to llms,

    [WYC+23] Lean Wang, Wenkai Yang, Deli Chen, Hao Zhou, Yankai Lin, Fandong Meng, Jie Zhou, and Xu Sun. Towards codable watermarking for injecting multi-bits information to LLMs. arXiv preprint arXiv:2307.15992,

  56. [56]

    [XLH+25] Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, and Hui Xiong

    Accessed: 2025-05-22. [XLH+25] Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, and Hui Xiong. Mark your LLM: Detect- ing the misuse of open-source large language models via watermarking.arXiv preprint arXiv:2503.04636,

  57. [57]

    Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

    [XMW+23] Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models.arXiv preprint arXiv:2305.14710,

  58. [58]

    URLhttps://www.aclweb.org/anthology/2020.emnlp-demos.6

    [XWM+24] Jiashu Xu, Fei Wang, Mingyu Derek Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. Instructional fingerprinting of large language models.arXiv preprint arXiv:2401.12255,

  59. [59]

    Approximate nearest neighbor negative contrastive learning for dense text retrieval,

    [XXL+20] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval.arXiv preprint arXiv:2007.00808,

  60. [60]

    Learning to watermark LLM-generated text via reinforcement learning.arXiv preprint arXiv:2403.10553,

    [XYL24] Xiaojun Xu, Yuanshun Yao, and Yang Liu. Learning to watermark LLM-generated text via reinforcement learning.arXiv preprint arXiv:2403.10553,

  61. [61]

    Robust multi-bit natural language watermarking through invariant features.arXiv preprint arXiv:2305.01904,

    [YAJK23] KiYoon Yoo, Wonhyuk Ahn, Jiho Jang, and Nojun Kwak. Robust multi-bit natural language watermarking through invariant features.arXiv preprint arXiv:2305.01904,

  62. [62]

    Watermarking text generated by black-box language models.arXiv preprint arXiv:2305.08883,

    [YCZ+23] Xi Yang, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, and Nenghai Yu. Watermarking text generated by black-box language models.arXiv preprint arXiv:2305.08883,

  63. [63]

    Mergeprint: Robust fingerprinting against merging large language models.arXiv preprint arXiv:2410.08604,

    [YTWW24] Shojiro Yamabe, Tsubasa Takahashi, Futa Waseda, and Koki Wataoka. Mergeprint: Robust fingerprinting against merging large language models.arXiv preprint arXiv:2410.08604,

  64. [64]

    [YYZ+24] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115,

  65. [65]

    Review outline:

    [ZALW23] Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for ai-generated text.arXiv preprint arXiv:2306.17439,

  66. [66]

    Agnibh Dasgupta, Abdullah Tanvir, and Xin Zhong

    [ZDT24] Xin Zhong, Agnibh Dasgupta, and Abdullah Tanvir. Watermarking language models through language models.arXiv preprint arXiv:2411.05091,

  67. [67]

    Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak

    [ZEF+23] Hanlin Zhang, Benjamin L Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. Watermarks in the sand: Impossibility of strong watermarking for generative models.arXiv preprint arXiv:2311.04378,

  68. [68]

    Sok: Watermarking for ai-generated content.arXiv preprint arXiv:2411.18479,

    [ZGC+24] Xuandong Zhao, Sam Gunn, Miranda Christ, Jaiden Fairoze, Andres Fabrega, Nicholas Car- lini, Sanjam Garg, Sanghyun Hong, Milad Nasr, Florian Tramer, et al. Sok: Watermarking for ai-generated content.arXiv preprint arXiv:2411.18479,

  69. [69]

    Duwak: Dual watermarks in large language models.arXiv preprint arXiv:2403.13000,

    [ZGCC24] Chaoyi Zhu, Jeroen Galjaard, Pin-Yu Chen, and Lydia Y Chen. Duwak: Dual watermarks in large language models.arXiv preprint arXiv:2403.13000,

  70. [70]

    Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models,

    [ZGWJ24] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models.arXiv preprint arXiv:2402.07867,

  71. [71]

    A survey of backdoor attacks and defenses on large language models: Implications for security measures,

    [ZJG+24] Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Xiaoyu Xu, Xiaobao Wu, Jie Fu, Yichao Feng, Fengjun Pan, and Luu Anh Tuan. A survey of recent backdoor attacks and defenses in large language models.arXiv preprint arXiv:2406.06852,

  72. [72]

    Cohemark: A novel sentence-level watermark for enhanced text quality.arXiv preprint arXiv:2504.17309,

    [ZLL+25] Junyan Zhang, Shuliang Liu, Aiwei Liu, Yubo Gao, Jungang Li, Xiaojie Gu, and Xuming Hu. Cohemark: A novel sentence-level watermark for enhanced text quality.arXiv preprint arXiv:2504.17309,

  73. [73]

    Instruction-Following Evaluation for Large Language Models

    [ZLM+23] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911,

  74. [74]

    Distillation-resistant watermarking for model protection in nlp.arXiv preprint arXiv:2210.03312,

    [ZLW22] Xuandong Zhao, Lei Li, and Yu-Xiang Wang. Distillation-resistant watermarking for model protection in nlp.arXiv preprint arXiv:2210.03312,

  75. [75]

    vtune: Verifiable fine-tuning for LLMs through backdooring.arXiv preprint arXiv:2411.06611,

    [ZPPG24] Eva Zhang, Arka Pal, Akilesh Potti, and Micah Goldblum. vtune: Verifiable fine-tuning for LLMs through backdooring.arXiv preprint arXiv:2411.06611,

  76. [76]

    Enhancing Composition Window of Bicontinuous Structures by Designed Polydispersity Distribution of ABA Triblock Copolymers

    16 Watermarked LLMInput Prompt beachsunshinecomputerdogVocabulary⋯ 2.251.760.36-0.12Logits⋯ +0+2+2+0Watermark Strength 0.1310.5810.2490.006ProbabilityDistribution⋯⋯LLM “Florida has nice___” Decoding Process of LLM Shared Hash keyGreen Listsunshinecomputer⋯ Red Listbeachdog⋯ “Florida has nicesunshine.” Shared Hash key Red Listbeachdog⋯ Green Listsunshineco...

  77. [77]

    •China’s Cyberspace Administration [Cyb23] has gone further, mandating both visible and invisible watermarks for generative content and requiring platforms to detect and flag unmarked media. • In the U.S., NIST’s 2024 report on synthetic content [ CDR24] frames watermarking as a foundational content authentication tool, recommended even in the absence of ...