pith. sign in

arxiv: 2505.16934 · v2 · submitted 2025-05-22 · 💻 cs.CL

In-Context Watermarks for Large Language Models

Pith reviewed 2026-05-22 13:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords in-context watermarkinglarge language modelsprompt engineeringwatermark detectionAI content attributionindirect prompt injectionmodel-agnostic watermarking
0
0 comments X

The pith

Watermarks can be added to text from any large language model using only specially designed prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method called In-Context Watermarking that lets users embed hidden markers into AI-generated text by giving the model instructions in the prompt itself. This approach does not require changing how the model generates text internally or having access to its internal processes. The authors test several versions of this method and show it can be detected afterward using statistical checks. It matters because many real uses of LLMs happen in places where the person checking for AI content cannot control or see the model, such as when reviewing academic papers. The work also explores secretly triggering the watermark by altering the input text.

Core claim

We introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising for

What carries the argument

In-Context Watermarking (ICW) that instructs the LLM through the prompt to generate text containing specific detectable patterns or statistical biases.

If this is right

  • Watermark detection becomes possible without any access to the LLM's generation process or parameters.
  • Content provenance can be verified in third-party scenarios like academic peer review or content moderation.
  • Four different granularity levels allow trade-offs between watermark strength and text quality.
  • As LLMs improve their instruction following, the reliability of such watermarks increases.
  • Covert triggering via input modification enables watermarking without user knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that are fine-tuned to ignore certain instructions might resist this watermarking, creating a potential defense.
  • Combining ICW with existing decoding-based watermarks could provide layered protection.
  • Testing on a wider range of models and tasks would reveal how general the approach is.
  • The method highlights vulnerabilities in LLMs to prompt-based manipulations for hidden behaviors.

Load-bearing premise

Large language models will consistently follow the watermarking instructions in the prompt and produce text with the intended detectable patterns across various contexts and models.

What would settle it

A test where the same watermark prompt is given to a new LLM and the generated text shows no statistically significant difference from normal outputs according to the detection method.

Figures

Figures reproduced from arXiv: 2505.16934 by Christopher Kruegel, Dawn Song, Xuandong Zhao, Yepeng Liu, Yuheng Bu.

Figure 1
Figure 1. Figure 1: An overview of In-Context Watermark. The application of ICW does not require access [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Case study of the IPI setting: conference organizers embed a predefined watermarking [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Robustness performance of ICWs against editing and paraphrasing attacks under DTS set [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Text perplexity of different watermarking methods across various models. [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
read the original abstract

The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution. Our code is available at https://github.com/yepengliu/In-Context-Watermarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces In-Context Watermarking (ICW), a prompt-engineering technique that embeds detectable watermarks into LLM-generated text by leveraging in-context learning and instruction following, without requiring access to the model's decoding process. It defines four strategies at different granularity levels, each paired with a tailored detector, examines the Indirect Prompt Injection (IPI) setting as a case study where triggers are hidden in input documents such as manuscripts, and states that experiments confirm feasibility as a model-agnostic method. The work is motivated by scenarios like detecting AI-generated peer reviews.

Significance. If the empirical support holds, the approach would be significant for enabling watermarking and provenance tracking in black-box settings where the generator model is inaccessible, such as conference review processes. It offers a practical, scalable alternative to logit-based or decoding-time methods by exploiting improving LLM instruction-following abilities, with the public code release aiding reproducibility and further testing.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: The claim that 'experiments validate the feasibility of ICW' is presented without any quantitative results, baselines, detection metrics (e.g., precision/recall), error analysis, or performance tables. This is load-bearing for the central practicality and model-agnostic claims, as the reliability of the generated patterns and detectors cannot be assessed from the given information.
  2. [§3] §3 (ICW Strategies and IPI case study): The method assumes LLMs will consistently follow the complex multi-granularity watermarking instructions to produce lexical/syntactic/statistical patterns recoverable by the detectors, even when the trigger is indirectly injected into long input documents. No robustness evaluation across models, temperatures, context lengths, or partial-compliance cases is described, which directly risks collapse of detection performance and undermines the feasibility conclusion.
minor comments (2)
  1. [Abstract] The abstract references four ICW strategies but provides no high-level characterization of their differences in granularity or detection approach, which would improve readability for a broad audience.
  2. [Introduction] The motivation example involving dishonest reviewers could include a brief citation to recent literature on AI use in peer review to strengthen the real-world grounding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, acknowledging where the current version falls short and outlining the revisions we will make to strengthen the empirical support and robustness analysis.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The claim that 'experiments validate the feasibility of ICW' is presented without any quantitative results, baselines, detection metrics (e.g., precision/recall), error analysis, or performance tables. This is load-bearing for the central practicality and model-agnostic claims, as the reliability of the generated patterns and detectors cannot be assessed from the given information.

    Authors: We acknowledge that the current manuscript summarizes the experimental validation at a high level and does not include specific quantitative results, baselines, detection metrics such as precision/recall/F1, error analysis, or performance tables in the abstract or a dedicated overview within the Experiments section. This does limit the reader's ability to fully evaluate the central claims. We will revise the abstract to briefly report key quantitative highlights and substantially expand the Experiments section with tables presenting detection metrics, baseline comparisons, and error analysis to better substantiate the feasibility and model-agnostic properties of ICW. revision: yes

  2. Referee: [§3] §3 (ICW Strategies and IPI case study): The method assumes LLMs will consistently follow the complex multi-granularity watermarking instructions to produce lexical/syntactic/statistical patterns recoverable by the detectors, even when the trigger is indirectly injected into long input documents. No robustness evaluation across models, temperatures, context lengths, or partial-compliance cases is described, which directly risks collapse of detection performance and undermines the feasibility conclusion.

    Authors: We agree that the effectiveness of ICW hinges on reliable instruction following by the LLM, especially under indirect prompt injection in long documents, and that systematic robustness testing is necessary. The manuscript does report results across multiple LLMs to support the model-agnostic claim, but it lacks explicit evaluations varying temperature, context length, or analyzing partial-compliance scenarios. We will add a dedicated robustness subsection to the Experiments section, including new experiments that vary these parameters and analyze detection performance in cases of incomplete instruction adherence. This will provide a more rigorous assessment and address the risk of detection performance collapse. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical prompt-engineering method validated externally

full rationale

The paper introduces ICW as a prompt-based technique leveraging in-context learning, defines four granularity-level strategies, pairs each with a detection method, and validates via experiments on LLMs in standard and IPI settings. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. No self-citation chains or uniqueness theorems are invoked to force the central result. The claims rest on observable LLM behavior and separate detectors, which are externally testable and not self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that LLMs can be reliably instructed via prompts to embed watermarks; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption LLMs possess reliable in-context learning and instruction-following abilities sufficient to embed and maintain watermark patterns as instructed.
    Invoked directly in the description of how ICW leverages these abilities to embed watermarks solely through prompt engineering.

pith-pipeline@v0.9.0 · 5766 in / 1132 out tokens · 30075 ms · 2026-05-22T13:07:47.147931+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fundamental Trade-Offs in Multi-Bit Watermarking of Stochastic Processes

    cs.IT 2026-05 unverdicted novelty 5.0

    Derives matched converse and achievability bounds that characterize optimal trade-offs among false-alarm probability, detection error probability, distortion, and information rate for multi-bit watermarking of station...

  2. Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption

    cs.CR 2025-10 unverdicted novelty 4.0

    LLM watermarking adoption is limited by misaligned stakeholder incentives; incentive-aligned approaches such as in-context watermarking can enable practical use in targeted domains like education and peer review.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 2 Pith papers

  1. [4]

    Green Letter

    “Green Letter” Enriched (Most Important!): Try to increase the use of words beginning with letters from the Green List and reduce the use of words that start with letters from the Red List. Lexical ICW Prompt ### Green Word List:{green word list} ### Command: You are given a predefined Green Word List. For each user query, generate a response that is:

  2. [5]

    Clear & Coherent: Easy to follow and logically organized

  3. [6]

    Avoid vague or overly verbose expressions

    Accurate & Concrete: Provides precise facts, examples, or steps. Avoid vague or overly verbose expressions

  4. [7]

    Contextually Relevant: Directly addresses the user’s intent and context

  5. [8]

    Green Word

    “Green Word” Enriched (Most Important!): Try your best to seamlessly incorporate as many words from the Green Word List as possible - without compromising text quality. Acrostics ICW Prompt ### Secret StringX:{secret string} ### Command: You are an adaptive language model designed for dynamic and engaging conversations. You are given a secret stringX. In ...

  6. [9]

    Assume the length of the secret stringXisn, Fori-th sentence (starting ati= 1), begin that sentence with the letterX[((i−1) modn) + 1]. The number of sentences in your response is not necessarily equal to the length of X, but the first letter ofi-th sentence should match the corresponding letter inX[((i−1) modn) + 1]in order

  7. [10]

    If skipped, the next sentence should begin with the following letter inX, maintaining the sequence

    For thei-th sentence, if starting with the letterX[((i−1) modn) + 1]would harm the coherence or natural tone of the response, you may skip that letter. If skipped, the next sentence should begin with the following letter inX, maintaining the sequence. You should try to avoid skipping the letter if possible

  8. [11]

    Ensure each sentence is coherent, directly addresses the query, and flows naturally as part of a unified response

  9. [12]

    Never reveal the acrostic pattern or repeatXin your reply. ### Example: Example 1: Secret string X: ”OCEAN” User query: ”What are the advantages of coastal conservation?” Response: ”Oceans serve as nurseries for countless marine species, ensuring healthy biodi- versity. Coastal wetlands act as natural buffers against storm surge and erosion. Ecosystem ser...

  10. [13]

    Summary: Briefly outline main points and objectives

  11. [14]

    Strengths: Identify the paper’s strongest aspects

  12. [15]

    Weaknesses: Point out areas for improvement

  13. [16]

    Questions: Pose questions for the authors

  14. [17]

    Maintain objectivity and provide specific examples from the paper to support your evalua- tion

    Rating: Score 1-10, justify your rating. Maintain objectivity and provide specific examples from the paper to support your evalua- tion. 19 Published as a conference paper at ICLR 2026 B THEORETICALFALSE-ALARMGUARANTEE As controlling the false alarm (Type I error) probability is crucial in high-stakes applications, we leverage existing results for green/r...

  15. [18]

    Assume|y| ≥1, then E[|y|G |y] =γ|y|andE[z y |y] = 0

  16. [19]

    Under the null hypothesis, where the text is not watermarked, the expected number of green words isγ|y|

    DefineC max(y) := max i∈[N] P|y| j=1 1(yj =i)andV(y) := 1 |y| PN i=1 P|y| j=1 1(yj = i) 2 , then with probability1−α(over only the randomness ofV G), P h |y|G ≥γ|y|+ p 64γ|y|Vlog(9/α) + 16C max log(9/α) y i ≤α, or equivalently (when|y| ≥1), P " zy ≥ s 64Vlog(9/α) 1−γ + 16Cmax log(9/α)p γ(1−γ)|y| y # ≤α. Under the null hypothesis, where the text is not wat...

  17. [20]

    Fully meets expectations

    Scoring standards for each criterion (Important: All scores must be integers from 1 to 5.): - **5:** Excellent. Fully meets expectations. No major weaknesses. - **4:** Good. Minor weaknesses that do not seriously impact quality. - **3:** Fair. Some noticeable issues that reduce effectiveness. - **2:** Poor. Serious flaws or missing key aspects. - **1:** V...

  18. [21]

    For each criterion, provide: - A score (from 1 to 5) - An explanation of why you gave this score

  19. [22]

    Output your evaluation in the following JSON format: {”content relevance score”: X, ”content relevance explanation”: ”...”, ”clarity readability score”: X, ”clarity readability explanation”: ”...”, ”text quality score”: X, ”text quality explanation”: ”...”,} F EXAMPLES OFICW Table 13: An example of Unicode ICW. Question what’s the difference between a for...