In-Context Watermarks for Large Language Models
Pith reviewed 2026-05-22 13:07 UTC · model grok-4.3
The pith
Watermarks can be added to text from any large language model using only specially designed prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.
What carries the argument
In-Context Watermarking (ICW) that instructs the LLM through the prompt to generate text containing specific detectable patterns or statistical biases.
If this is right
- Watermark detection becomes possible without any access to the LLM's generation process or parameters.
- Content provenance can be verified in third-party scenarios like academic peer review or content moderation.
- Four different granularity levels allow trade-offs between watermark strength and text quality.
- As LLMs improve their instruction following, the reliability of such watermarks increases.
- Covert triggering via input modification enables watermarking without user knowledge.
Where Pith is reading between the lines
- Models that are fine-tuned to ignore certain instructions might resist this watermarking, creating a potential defense.
- Combining ICW with existing decoding-based watermarks could provide layered protection.
- Testing on a wider range of models and tasks would reveal how general the approach is.
- The method highlights vulnerabilities in LLMs to prompt-based manipulations for hidden behaviors.
Load-bearing premise
Large language models will consistently follow the watermarking instructions in the prompt and produce text with the intended detectable patterns across various contexts and models.
What would settle it
A test where the same watermark prompt is given to a new LLM and the generated text shows no statistically significant difference from normal outputs according to the detection method.
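One way such a test could be operationalized is sketched below. It assumes the detector returns a z-score that is approximately standard normal for unwatermarked text; that normal approximation, the function names, and the example scores are illustrative assumptions, not taken from the paper.

```python
import math


def one_sided_p_value(z: float) -> float:
    """Upper-tail probability of a standard normal variable at z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def detection_rate(z_scores, alpha=0.01):
    """Fraction of texts whose z-score is significant at level alpha."""
    if not z_scores:
        raise ValueError("need at least one z-score")
    flagged = [one_sided_p_value(z) < alpha for z in z_scores]
    return sum(flagged) / len(flagged)


# Hypothetical detector scores, invented for illustration only.
prompted = [4.1, 3.3, 0.2, 5.0]    # outputs generated with the watermark prompt
baseline = [0.4, -1.1, 0.9, -0.3]  # outputs generated without it

rate_prompted = detection_rate(prompted)
rate_baseline = detection_rate(baseline)
```

If the detection rate for prompted outputs from a new model were indistinguishable from the baseline rate, that would count as evidence against the load-bearing premise for that model.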
Original abstract
The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution. Our code is available at https://github.com/yepengliu/In-Context-Watermarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces In-Context Watermarking (ICW), a prompt-engineering technique that embeds detectable watermarks into LLM-generated text by leveraging in-context learning and instruction following, without requiring access to the model's decoding process. It defines four strategies at different granularity levels, each paired with a tailored detector, examines the Indirect Prompt Injection (IPI) setting as a case study where triggers are hidden in input documents such as manuscripts, and states that experiments confirm feasibility as a model-agnostic method. The work is motivated by scenarios like detecting AI-generated peer reviews.
Significance. If the empirical support holds, the approach would be significant for enabling watermarking and provenance tracking in black-box settings where the generator model is inaccessible, such as conference review processes. It offers a practical, scalable alternative to logit-based or decoding-time methods by exploiting improving LLM instruction-following abilities, with the public code release aiding reproducibility and further testing.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The claim that 'experiments validate the feasibility of ICW' is presented without any quantitative results, baselines, detection metrics (e.g., precision/recall), error analysis, or performance tables. This is load-bearing for the central practicality and model-agnostic claims, as the reliability of the generated patterns and detectors cannot be assessed from the given information.
- [§3] §3 (ICW Strategies and IPI case study): The method assumes LLMs will consistently follow the complex multi-granularity watermarking instructions to produce lexical/syntactic/statistical patterns recoverable by the detectors, even when the trigger is indirectly injected into long input documents. No robustness evaluation across models, temperatures, context lengths, or partial-compliance cases is described, which directly risks collapse of detection performance and undermines the feasibility conclusion.
minor comments (2)
- [Abstract] The abstract references four ICW strategies but provides no high-level characterization of their differences in granularity or detection approach, which would improve readability for a broad audience.
- [Introduction] The motivation example involving dishonest reviewers could include a brief citation to recent literature on AI use in peer review to strengthen the real-world grounding.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, acknowledging where the current version falls short and outlining the revisions we will make to strengthen the empirical support and robustness analysis.
- Referee: [Abstract and Experiments] Abstract and Experiments section: The claim that 'experiments validate the feasibility of ICW' is presented without any quantitative results, baselines, detection metrics (e.g., precision/recall), error analysis, or performance tables. This is load-bearing for the central practicality and model-agnostic claims, as the reliability of the generated patterns and detectors cannot be assessed from the given information.
  Authors: We acknowledge that the current manuscript summarizes the experimental validation at a high level and does not include specific quantitative results, baselines, detection metrics such as precision/recall/F1, error analysis, or performance tables in the abstract or a dedicated overview within the Experiments section. This does limit the reader's ability to fully evaluate the central claims. We will revise the abstract to briefly report key quantitative highlights and substantially expand the Experiments section with tables presenting detection metrics, baseline comparisons, and error analysis to better substantiate the feasibility and model-agnostic properties of ICW.
  revision: yes
- Referee: [§3] §3 (ICW Strategies and IPI case study): The method assumes LLMs will consistently follow the complex multi-granularity watermarking instructions to produce lexical/syntactic/statistical patterns recoverable by the detectors, even when the trigger is indirectly injected into long input documents. No robustness evaluation across models, temperatures, context lengths, or partial-compliance cases is described, which directly risks collapse of detection performance and undermines the feasibility conclusion.
  Authors: We agree that the effectiveness of ICW hinges on reliable instruction following by the LLM, especially under indirect prompt injection in long documents, and that systematic robustness testing is necessary. The manuscript does report results across multiple LLMs to support the model-agnostic claim, but it lacks explicit evaluations varying temperature, context length, or analyzing partial-compliance scenarios. We will add a dedicated robustness subsection to the Experiments section, including new experiments that vary these parameters and analyze detection performance in cases of incomplete instruction adherence. This will provide a more rigorous assessment and address the risk of detection performance collapse.
  revision: yes
Circularity Check
No circularity: empirical prompt-engineering method validated externally
The paper introduces ICW as a prompt-based technique leveraging in-context learning, defines four granularity-level strategies, pairs each with a detection method, and validates via experiments on LLMs in standard and IPI settings. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. No self-citation chains or uniqueness theorems are invoked to force the central result. The claims rest on observable LLM behavior and separate detectors, which are externally testable and not self-referential.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs possess reliable in-context learning and instruction-following abilities sufficient to embed and maintain watermark patterns as instructed.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method... Unicode ICW... Initials ICW... Lexical ICW... Acrostics ICW
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The detection of Lexical ICW is similar to the Initials ICW... $D(y \mid k_c, \tau_c) := \frac{|y|_G - \gamma |y|}{\sqrt{\gamma (1 - \gamma) |y|}}$
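The detection statistic quoted in the second passage can be sketched as follows. This is a minimal illustration, assuming that |y|_G counts the tokens of y that fall in the green list, |y| is the total token count, and γ is the expected green fraction for unwatermarked text; the passage elides these definitions, and the word-level tokenizer and function name here are hypothetical.

```python
import math
import re


def lexical_z_score(text: str, green_words: set, gamma: float) -> float:
    """Compute (|y|_G - gamma*|y|) / sqrt(gamma*(1-gamma)*|y|) over word tokens."""
    # Assumption: lowercase word tokens stand in for the detector's tokenization.
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    green = sum(1 for token in tokens if token in green_words)
    return (green - gamma * n) / math.sqrt(gamma * (1.0 - gamma) * n)


# Toy input, invented for illustration only.
z = lexical_z_score("alpha beta gamma delta", {"alpha", "beta"}, gamma=0.25)
```

A larger score means more green words than chance would suggest; a detector would compare it against a threshold chosen for the desired false-alarm rate.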
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Fundamental Trade-Offs in Multi-Bit Watermarking of Stochastic Processes
  Derives matched converse and achievability bounds that characterize optimal trade-offs among false-alarm probability, detection error probability, distortion, and information rate for multi-bit watermarking of station...
- Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption
  LLM watermarking adoption is limited by misaligned stakeholder incentives; incentive-aligned approaches such as in-context watermarking can enable practical use in targeted domains like education and peer review.
Reference graph
Works this paper leans on
- [4] “Green Letter” Enriched (Most Important!): Try to increase the use of words beginning with letters from the Green List and reduce the use of words that start with letters from the Red List. Lexical ICW Prompt ### Green Word List: {green word list} ### Command: You are given a predefined Green Word List. For each user query, generate a response that is:
- [5] Clear & Coherent: Easy to follow and logically organized
- [6] Accurate & Concrete: Provides precise facts, examples, or steps. Avoid vague or overly verbose expressions
- [7] Contextually Relevant: Directly addresses the user’s intent and context
- [8] “Green Word” Enriched (Most Important!): Try your best to seamlessly incorporate as many words from the Green Word List as possible - without compromising text quality. Acrostics ICW Prompt ### Secret String X: {secret string} ### Command: You are an adaptive language model designed for dynamic and engaging conversations. You are given a secret string X. In ...
- [9] Assume the length of the secret string X is n. For the i-th sentence (starting at i = 1), begin that sentence with the letter X[((i−1) mod n) + 1]. The number of sentences in your response is not necessarily equal to the length of X, but the first letter of the i-th sentence should match the corresponding letter in X[((i−1) mod n) + 1] in order
- [10] For the i-th sentence, if starting with the letter X[((i−1) mod n) + 1] would harm the coherence or natural tone of the response, you may skip that letter. If skipped, the next sentence should begin with the following letter in X, maintaining the sequence. You should try to avoid skipping the letter if possible
- [11] Ensure each sentence is coherent, directly addresses the query, and flows naturally as part of a unified response
- [12] Never reveal the acrostic pattern or repeat X in your reply. ### Example: Example 1: Secret string X: "OCEAN" User query: "What are the advantages of coastal conservation?" Response: "Oceans serve as nurseries for countless marine species, ensuring healthy biodiversity. Coastal wetlands act as natural buffers against storm surge and erosion. Ecosystem ser...
- [13] Summary: Briefly outline main points and objectives
- [14] Strengths: Identify the paper’s strongest aspects
- [15] Weaknesses: Point out areas for improvement
- [16] Questions: Pose questions for the authors
- [17] Rating: Score 1-10, justify your rating. Maintain objectivity and provide specific examples from the paper to support your evaluation. Published as a conference paper at ICLR 2026. B Theoretical False-Alarm Guarantee: As controlling the false alarm (Type I error) probability is crucial in high-stakes applications, we leverage existing results for green/r...
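- [18] Assume $|y| \ge 1$, then $\mathbb{E}[|y|_G \mid y] = \gamma |y|$ and $\mathbb{E}[z_y \mid y] = 0$
- [19] Define $C_{\max}(y) := \max_{i \in [N]} \sum_{j=1}^{|y|} \mathbb{1}(y_j = i)$ and $V(y) := \frac{1}{|y|} \sum_{i=1}^{N} \left( \sum_{j=1}^{|y|} \mathbb{1}(y_j = i) \right)^2$, then with probability $1 - \alpha$ (over only the randomness of $V_G$), $\mathbb{P}\left[ |y|_G \ge \gamma |y| + \sqrt{64 \gamma |y| V \log(9/\alpha)} + 16 C_{\max} \log(9/\alpha) \,\middle|\, y \right] \le \alpha$, or equivalently (when $|y| \ge 1$), $\mathbb{P}\left[ z_y \ge \sqrt{\frac{64 V \log(9/\alpha)}{1 - \gamma}} + \frac{16 C_{\max} \log(9/\alpha)}{\sqrt{\gamma (1 - \gamma) |y|}} \,\middle|\, y \right] \le \alpha$. Under the null hypothesis, where the text is not wat...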
- [20] Scoring standards for each criterion (Important: All scores must be integers from 1 to 5.): - **5:** Excellent. Fully meets expectations. No major weaknesses. - **4:** Good. Minor weaknesses that do not seriously impact quality. - **3:** Fair. Some noticeable issues that reduce effectiveness. - **2:** Poor. Serious flaws or missing key aspects. - **1:** V...
- [21] For each criterion, provide: - A score (from 1 to 5) - An explanation of why you gave this score
- [22] Output your evaluation in the following JSON format: {"content relevance score": X, "content relevance explanation": "...", "clarity readability score": X, "clarity readability explanation": "...", "text quality score": X, "text quality explanation": "...",} F Examples of ICW. Table 13: An example of Unicode ICW. Question what’s the difference between a for...
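The sentence-indexing rule in the Acrostics ICW prompt fragments ([9] and [10]) can be illustrated with a short sketch. It covers only the strict rule, where sentence i starts with X[((i−1) mod n) + 1]; the permitted skips are not modeled, and the sentence splitter, the function names, and the match-rate score are assumptions for illustration rather than the paper's detector.

```python
import re


def expected_initials(secret: str, num_sentences: int) -> str:
    """Letters X[((i-1) mod n) + 1] for i = 1..num_sentences (1-indexed in the prompt)."""
    n = len(secret)
    if n == 0:
        raise ValueError("secret string must be non-empty")
    # Python strings are 0-indexed, so X[((i-1) mod n) + 1] becomes secret[(i-1) % n].
    return "".join(secret[(i - 1) % n] for i in range(1, num_sentences + 1))


def sentence_initials(text: str) -> str:
    """First character of each sentence, using a naive punctuation-based split."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [part.strip() for part in parts if part.strip()]
    return "".join(sentence[0].upper() for sentence in sentences)


def match_rate(text: str, secret: str) -> float:
    """Share of sentences whose initial agrees with the strict acrostic rule."""
    observed = sentence_initials(text)
    if not observed:
        return 0.0
    expected = expected_initials(secret.upper(), len(observed))
    hits = sum(o == e for o, e in zip(observed, expected))
    return hits / len(observed)


# Toy text, invented for illustration only.
sample = "Oceans feed many species. Coasts slow erosion. Every reef matters."
rate = match_rate(sample, "OCEAN")
```

Because the prompt allows skipped letters, a practical detector would likely need a more tolerant comparison than this position-by-position match.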
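The false-alarm bound quoted in fragment [19] can be turned into a z-score threshold. The sketch below computes C_max and V from a token sequence and evaluates the right-hand side of the second inequality. It assumes the reconstruction of the garbled formula is accurate and that tokens are hashable identifiers; the function name is hypothetical.

```python
import math
from collections import Counter


def false_alarm_z_threshold(token_ids, gamma: float, alpha: float) -> float:
    """Threshold t such that, per the quoted bound, P[z_y >= t | y] <= alpha."""
    n = len(token_ids)
    if n == 0:
        raise ValueError("token sequence must be non-empty")
    if not 0.0 < gamma < 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("gamma and alpha must lie strictly between 0 and 1")
    counts = Counter(token_ids)
    c_max = max(counts.values())                  # C_max(y): most frequent token count
    v = sum(c * c for c in counts.values()) / n   # V(y): sum of squared counts over |y|
    log_term = math.log(9.0 / alpha)
    first = math.sqrt(64.0 * v * log_term / (1.0 - gamma))
    second = 16.0 * c_max * log_term / math.sqrt(gamma * (1.0 - gamma) * n)
    return first + second


# Toy sequences, invented for illustration only.
distinct = false_alarm_z_threshold([1, 2, 3, 4], gamma=0.5, alpha=0.09)
repeated = false_alarm_z_threshold([1, 1, 1, 1], gamma=0.5, alpha=0.09)
```

Repetitive text raises both C_max and V, so the threshold grows, which matches the intuition that repeated tokens make the green count less informative.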