In-Context Watermarks for Large Language Models

Christopher Kruegel; Dawn Song; Xuandong Zhao; Yepeng Liu; Yuheng Bu

arxiv: 2505.16934 · v2 · submitted 2025-05-22 · 💻 cs.CL

Pith reviewed 2026-05-22 13:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords in-context watermarking · large language models · prompt engineering · watermark detection · AI content attribution · indirect prompt injection · model-agnostic watermarking

The pith

Watermarks can be added to text from any large language model using only specially designed prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method called In-Context Watermarking that lets users embed hidden markers into AI-generated text by giving the model instructions in the prompt itself. This approach does not require changing how the model generates text internally or having access to its internal processes. The authors test several versions of this method and show it can be detected afterward using statistical checks. It matters because many real uses of LLMs happen in places where the person checking for AI content cannot control or see the model, such as when reviewing academic papers. The work also explores secretly triggering the watermark by altering the input text.

Core claim

We introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.

What carries the argument

In-Context Watermarking (ICW), which instructs the LLM through the prompt to generate text containing specific detectable patterns or statistical biases.

If this is right

Watermark detection becomes possible without any access to the LLM's generation process or parameters.
Content provenance can be verified in third-party scenarios like academic peer review or content moderation.
Four different granularity levels allow trade-offs between watermark strength and text quality.
As LLMs improve their instruction following, the reliability of such watermarks increases.
Covert triggering via input modification enables watermarking without user knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models that are fine-tuned to ignore certain instructions might resist this watermarking, creating a potential defense.
Combining ICW with existing decoding-based watermarks could provide layered protection.
Testing on a wider range of models and tasks would reveal how general the approach is.
The method highlights vulnerabilities in LLMs to prompt-based manipulations for hidden behaviors.

Load-bearing premise

Large language models will consistently follow the watermarking instructions in the prompt and produce text with the intended detectable patterns across various contexts and models.

What would settle it

A test where the same watermark prompt is given to a new LLM and the generated text shows no statistically significant difference from normal outputs according to the detection method.
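One way to make "no statistically significant difference" concrete is a permutation test on detection scores. The sketch below is illustrative only: the function name `permutation_p_value` and the choice of a one-sided exact permutation test are assumptions made here, not the paper's evaluation protocol. It takes detector scores for text generated with the watermark prompt and for ordinary text, and returns the fraction of relabelings that separate the two groups at least as well as the observed split.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(watermarked_scores, baseline_scores):
    """One-sided exact permutation test on the difference of mean scores.

    Returns the fraction of relabelings whose mean difference is at least
    as large as the observed one. Exact enumeration is only practical for
    small samples; larger samples would need random relabelings instead.
    """
    pooled = list(watermarked_scores) + list(baseline_scores)
    n_w = len(watermarked_scores)
    n_b = len(pooled) - n_w
    total = sum(pooled)
    observed = mean(watermarked_scores) - mean(baseline_scores)

    at_least_as_extreme = 0
    n_relabelings = 0
    for idx in combinations(range(len(pooled)), n_w):
        group_sum = sum(pooled[i] for i in idx)
        diff = group_sum / n_w - (total - group_sum) / n_b
        # Small tolerance so float rounding does not drop the observed split.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
        n_relabelings += 1
    return at_least_as_extreme / n_relabelings
```

A large p-value on a new model would be the kind of outcome described above: the detector cannot tell prompted text from normal text. A small p-value would indicate that the watermark instruction was followed well enough to be detected.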

Figures

Figures reproduced from arXiv: 2505.16934 by Christopher Kruegel, Dawn Song, Xuandong Zhao, Yepeng Liu, Yuheng Bu.

**Figure 2.** Case study of the IPI setting: conference organizers embed a predefined watermarking...

**Figure 3.** Robustness performance of ICWs against editing and paraphrasing attacks under DTS set...

**Figure 4.** Text perplexity of different watermarking methods across various models.

Original abstract

The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution. Our code is available at https://github.com/yepengliu/In-Context-Watermarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Prompt-only watermarking via in-context instructions is a practical angle for no-access settings like spotting AI reviews, but the whole thing rests on whether LLMs actually produce consistent detectable patterns.

The main point of this paper is that detectable patterns can be embedded in LLM output just by writing the right instructions into the prompt, without any access to the model's decoding or parameters. That directly targets cases like conference organizers checking for AI-written reviews when they have no control over what model the reviewer used. The four granularity levels and the indirect prompt injection case study are the concrete new pieces; the authors pair each strategy with a matching detector and release code, which is useful for anyone wanting to test it themselves. The motivation is grounded in a real gap that existing watermarking papers usually ignore because they assume white-box access. The experiments are described as validating feasibility, and the model-agnostic framing is a clear strength for deployment.

The soft spot is the reliability of the core mechanism. LLMs often follow complex instructions only partially, especially under long contexts, competing task demands, or across different model families and temperatures. If the generated text does not reliably carry the intended lexical or statistical signals, the detectors will lose precision and recall fast. The abstract gives no numbers on detection rates or failure modes, so the full paper needs to show that those results hold up under realistic variation, or the practicality claim stays unproven.

This is worth reading for people working on content attribution, AI safety in peer review, or prompt-based control methods. A reader who already follows watermarking or provenance work will see the extension clearly. It deserves a serious referee because the problem is timely and the approach is distinct from prior art, even if the empirical bar needs to be raised in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces In-Context Watermarking (ICW), a prompt-engineering technique that embeds detectable watermarks into LLM-generated text by leveraging in-context learning and instruction following, without requiring access to the model's decoding process. It defines four strategies at different granularity levels, each paired with a tailored detector, examines the Indirect Prompt Injection (IPI) setting as a case study where triggers are hidden in input documents such as manuscripts, and states that experiments confirm feasibility as a model-agnostic method. The work is motivated by scenarios like detecting AI-generated peer reviews.

Significance. If the empirical support holds, the approach would be significant for enabling watermarking and provenance tracking in black-box settings where the generator model is inaccessible, such as conference review processes. It offers a practical, scalable alternative to logit-based or decoding-time methods by exploiting improving LLM instruction-following abilities, with the public code release aiding reproducibility and further testing.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: The claim that 'experiments validate the feasibility of ICW' is presented without any quantitative results, baselines, detection metrics (e.g., precision/recall), error analysis, or performance tables. This is load-bearing for the central practicality and model-agnostic claims, as the reliability of the generated patterns and detectors cannot be assessed from the given information.
[§3] §3 (ICW Strategies and IPI case study): The method assumes LLMs will consistently follow the complex multi-granularity watermarking instructions to produce lexical/syntactic/statistical patterns recoverable by the detectors, even when the trigger is indirectly injected into long input documents. No robustness evaluation across models, temperatures, context lengths, or partial-compliance cases is described, which directly risks collapse of detection performance and undermines the feasibility conclusion.

minor comments (2)

[Abstract] The abstract references four ICW strategies but provides no high-level characterization of their differences in granularity or detection approach, which would improve readability for a broad audience.
[Introduction] The motivation example involving dishonest reviewers could include a brief citation to recent literature on AI use in peer review to strengthen the real-world grounding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, acknowledging where the current version falls short and outlining the revisions we will make to strengthen the empirical support and robustness analysis.

Referee: [Abstract and Experiments] Abstract and Experiments section: The claim that 'experiments validate the feasibility of ICW' is presented without any quantitative results, baselines, detection metrics (e.g., precision/recall), error analysis, or performance tables. This is load-bearing for the central practicality and model-agnostic claims, as the reliability of the generated patterns and detectors cannot be assessed from the given information.

Authors: We acknowledge that the current manuscript summarizes the experimental validation at a high level and does not include specific quantitative results, baselines, detection metrics such as precision/recall/F1, error analysis, or performance tables in the abstract or a dedicated overview within the Experiments section. This does limit the reader's ability to fully evaluate the central claims. We will revise the abstract to briefly report key quantitative highlights and substantially expand the Experiments section with tables presenting detection metrics, baseline comparisons, and error analysis to better substantiate the feasibility and model-agnostic properties of ICW.

revision: yes

Referee: [§3] §3 (ICW Strategies and IPI case study): The method assumes LLMs will consistently follow the complex multi-granularity watermarking instructions to produce lexical/syntactic/statistical patterns recoverable by the detectors, even when the trigger is indirectly injected into long input documents. No robustness evaluation across models, temperatures, context lengths, or partial-compliance cases is described, which directly risks collapse of detection performance and undermines the feasibility conclusion.

Authors: We agree that the effectiveness of ICW hinges on reliable instruction following by the LLM, especially under indirect prompt injection in long documents, and that systematic robustness testing is necessary. The manuscript does report results across multiple LLMs to support the model-agnostic claim, but it lacks explicit evaluations varying temperature, context length, or analyzing partial-compliance scenarios. We will add a dedicated robustness subsection to the Experiments section, including new experiments that vary these parameters and analyze detection performance in cases of incomplete instruction adherence. This will provide a more rigorous assessment and address the risk of detection performance collapse.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical prompt-engineering method validated externally

full rationale

The paper introduces ICW as a prompt-based technique leveraging in-context learning, defines four granularity-level strategies, pairs each with a detection method, and validates via experiments on LLMs in standard and IPI settings. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. No self-citation chains or uniqueness theorems are invoked to force the central result. The claims rest on observable LLM behavior and separate detectors, which are externally testable and not self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that LLMs can be reliably instructed via prompts to embed watermarks; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: LLMs possess reliable in-context learning and instruction-following abilities sufficient to embed and maintain watermark patterns as instructed.
Invoked directly in the description of how ICW leverages these abilities to embed watermarks solely through prompt engineering.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method... Unicode ICW... Initials ICW... Lexical ICW... Acrostics ICW

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

The detection of Lexical ICW is similar to the Initials ICW... $D(y \mid k_c, \tau_c) := \dfrac{|y|_G - \gamma |y|}{\sqrt{\gamma (1-\gamma) |y|}}$
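The statistic quoted in the passage above is a one-sided z-score on the number of green words, and the false-alarm bound quoted in the extracted appendix text under "Reference graph" on this page is stated in terms of the same quantity. The sketch below computes both. It is a minimal illustration under stated assumptions: the tokenizer, the function names, and the treatment of each distinct word as one vocabulary item are choices made here, not the authors' released code, and γ is taken to be the fraction of the vocabulary placed in the green list.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic words stand in for the paper's tokens.
    return re.findall(r"[a-z']+", text.lower())


def green_z_score(words, green_words, gamma):
    """z = (|y|_G - gamma*|y|) / sqrt(gamma*(1-gamma)*|y|)."""
    n = len(words)
    if n == 0 or not 0.0 < gamma < 1.0:
        raise ValueError("need non-empty text and 0 < gamma < 1")
    green_count = sum(1 for w in words if w in green_words)
    return (green_count - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)


def false_alarm_threshold(words, gamma, alpha):
    """Threshold on z taken from the quoted bound:
    sqrt(64*V*log(9/alpha)/(1-gamma))
      + 16*C_max*log(9/alpha)/sqrt(gamma*(1-gamma)*|y|).
    """
    n = len(words)
    if n == 0 or not 0.0 < gamma < 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("need non-empty text, 0 < gamma < 1, 0 < alpha < 1")
    counts = Counter(words)
    c_max = max(counts.values())                   # most frequent word count
    v = sum(c * c for c in counts.values()) / n    # V(y) from the quoted text
    log_term = math.log(9 / alpha)
    return (math.sqrt(64 * v * log_term / (1 - gamma))
            + 16 * c_max * log_term / math.sqrt(gamma * (1 - gamma) * n))
```

A text would be flagged when `green_z_score(...)` exceeds the chosen threshold. The constants in the bound are conservative, so the resulting threshold is large for short texts; an empirically calibrated threshold is a common alternative, which this sketch does not attempt.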

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fundamental Trade-Offs in Multi-Bit Watermarking of Stochastic Processes
cs.IT · 2026-05 · unverdicted · novelty 5.0

Derives matched converse and achievability bounds that characterize optimal trade-offs among false-alarm probability, detection error probability, distortion, and information rate for multi-bit watermarking of station...

Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption
cs.CR · 2025-10 · unverdicted · novelty 4.0

LLM watermarking adoption is limited by misaligned stakeholder incentives; incentive-aligned approaches such as in-context watermarking can enable practical use in targeted domains like education and peer review.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 2 Pith papers

[4]

Green Letter

“Green Letter” Enriched (Most Important!): Try to increase the use of words beginning with letters from the Green List and reduce the use of words that start with letters from the Red List.

Lexical ICW Prompt

### Green Word List: {green word list}

### Command: You are given a predefined Green Word List. For each user query, generate a response that is:

[5]

Clear & Coherent: Easy to follow and logically organized

[6]

Avoid vague or overly verbose expressions

Accurate & Concrete: Provides precise facts, examples, or steps. Avoid vague or overly verbose expressions

[7]

Contextually Relevant: Directly addresses the user’s intent and context
[8]

Green Word

“Green Word” Enriched (Most Important!): Try your best to seamlessly incorporate as many words from the Green Word List as possible - without compromising text quality.

Acrostics ICW Prompt

### Secret String X: {secret string}

### Command: You are an adaptive language model designed for dynamic and engaging conversations. You are given a secret string X. In ...

[9]

Assume the length of the secret string X is n. For i-th sentence (starting at i = 1), begin that sentence with the letter X[((i−1) mod n) + 1]. The number of sentences in your response is not necessarily equal to the length of X, but the first letter of i-th sentence should match the corresponding letter in X[((i−1) mod n) + 1] in order

[10]

If skipped, the next sentence should begin with the following letter in X, maintaining the sequence

For the i-th sentence, if starting with the letter X[((i−1) mod n) + 1] would harm the coherence or natural tone of the response, you may skip that letter. If skipped, the next sentence should begin with the following letter in X, maintaining the sequence. You should try to avoid skipping the letter if possible

[11]

Ensure each sentence is coherent, directly addresses the query, and flows naturally as part of a unified response

[12]

Never reveal the acrostic pattern or repeat X in your reply.

### Example:

Example 1: Secret string X: “OCEAN” User query: “What are the advantages of coastal conservation?” Response: “Oceans serve as nurseries for countless marine species, ensuring healthy biodiversity. Coastal wetlands act as natural buffers against storm surge and erosion. Ecosystem ser...
[13]

Summary: Briefly outline main points and objectives

[14]

Strengths: Identify the paper’s strongest aspects

[15]

Weaknesses: Point out areas for improvement

[16]

Questions: Pose questions for the authors

[17]

Maintain objectivity and provide specific examples from the paper to support your evaluation

Rating: Score 1-10, justify your rating. Maintain objectivity and provide specific examples from the paper to support your evaluation.

Published as a conference paper at ICLR 2026

B THEORETICAL FALSE-ALARM GUARANTEE

As controlling the false alarm (Type I error) probability is crucial in high-stakes applications, we leverage existing results for green/r...
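[18]

Assume $|y| \ge 1$, then $\mathbb{E}[\,|y|_G \mid y\,] = \gamma |y|$ and $\mathbb{E}[z_y \mid y] = 0$

[19]

Under the null hypothesis, where the text is not watermarked, the expected number of green words is $\gamma |y|$

Define $C_{\max}(y) := \max_{i \in [N]} \sum_{j=1}^{|y|} \mathbf{1}(y_j = i)$ and $V(y) := \frac{1}{|y|} \sum_{i=1}^{N} \Big( \sum_{j=1}^{|y|} \mathbf{1}(y_j = i) \Big)^2$, then with probability $1-\alpha$ (over only the randomness of $V_G$),

$$\mathbb{P}\Big[\, |y|_G \ge \gamma |y| + \sqrt{64\, \gamma |y|\, V \log(9/\alpha)} + 16\, C_{\max} \log(9/\alpha) \,\Big|\, y \Big] \le \alpha,$$

or equivalently (when $|y| \ge 1$),

$$\mathbb{P}\Bigg[\, z_y \ge \sqrt{\frac{64\, V \log(9/\alpha)}{1-\gamma}} + \frac{16\, C_{\max} \log(9/\alpha)}{\sqrt{\gamma (1-\gamma) |y|}} \,\Bigg|\, y \Bigg] \le \alpha.$$

Under the null hypothesis, where the text is not wat...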
[20]

Fully meets expectations

Scoring standards for each criterion (Important: All scores must be integers from 1 to 5.):
- **5:** Excellent. Fully meets expectations. No major weaknesses.
- **4:** Good. Minor weaknesses that do not seriously impact quality.
- **3:** Fair. Some noticeable issues that reduce effectiveness.
- **2:** Poor. Serious flaws or missing key aspects.
- **1:** V...

[21]

For each criterion, provide:
- A score (from 1 to 5)
- An explanation of why you gave this score

[22]

Output your evaluation in the following JSON format: {“content relevance score”: X, “content relevance explanation”: “...”, “clarity readability score”: X, “clarity readability explanation”: “...”, “text quality score”: X, “text quality explanation”: “...”,}

F EXAMPLES OF ICW

Table 13: An example of Unicode ICW. Question what’s the difference between a for...
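The acrostic rule quoted in entries [9] and [10] above uses 1-indexed positions, X[((i−1) mod n) + 1], which is easy to get wrong by one in code. The sketch below shows the indexing and a simple match rate against the no-skip schedule. It is an illustration under stated assumptions: the function names and the sentence splitter are hypothetical, and it does not handle the skip allowance in entry [10] or reproduce the paper's actual Acrostics detector.

```python
import re


def expected_initial(secret, i):
    """Letter for the i-th sentence (1-indexed): X[((i - 1) mod n) + 1]."""
    if not secret:
        raise ValueError("secret string must be non-empty")
    # Python strings are 0-indexed, so the 1-indexed rule becomes (i - 1) % n.
    return secret[(i - 1) % len(secret)]


def sentence_initials(text):
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [s[0].upper() for s in sentences]


def acrostic_match_rate(text, secret):
    """Fraction of sentences whose first letter matches the no-skip schedule."""
    initials = sentence_initials(text)
    if not initials:
        return 0.0
    key = secret.upper()
    hits = sum(1 for i, ch in enumerate(initials, start=1)
               if ch == expected_initial(key, i))
    return hits / len(initials)
```

Because the prompt allows the model to skip a letter, a strict position-by-position comparison like this one will undercount matches after any skip; an alignment-based comparison of the initials against the repeated secret string would be more tolerant.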
