Towards Better Statistical Understanding of Watermarking LLMs

Hanzhao Wang; Huaiyang Zhong; Shang Liu; Xiaocheng Li; Zhongze Cai

arxiv: 2403.13027 · v2 · submitted 2024-03-19 · cs.LG · cs.CR · cs.IT · math.IT · stat.ML

Zhongze Cai, Shang Liu, Hanzhao Wang, Huaiyang Zhong, Xiaocheng Li

Pith reviewed 2026-05-24 02:33 UTC · model grok-4.3

keywords LLM watermarking · red-green list · constrained optimization · dual gradient ascent · Pareto optimality · KL divergence · model distortion

The pith

An online dual gradient ascent algorithm achieves asymptotic Pareto optimality for the distortion-detection trade-off in red-green list LLM watermarking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the tension between keeping an LLM's output close to its original distribution and making watermarked text reliably detectable as a constrained optimization problem. It derives an analytical property of the optimal solution that directly informs how to adjust token probabilities during generation. Building on this, the authors introduce an online dual gradient ascent procedure and prove that it is asymptotically Pareto optimal between distortion and detection ability. That result explicitly guarantees an increase in the average probability of green-list tokens, and hence stronger detection. They also argue that KL divergence is the appropriate distortion measure and identify shortcomings in prior notions of distortion-free watermarking and in perplexity-based checks. The method is evaluated experimentally on multiple datasets against existing baselines.

Core claim

By casting red-green list watermarking as a constrained optimization problem, the optimal token probability adjustments admit a clean analytical characterization. An online dual gradient ascent algorithm derived from this formulation is asymptotically Pareto optimal between model distortion and detection ability, which yields an explicit guarantee of an increased average green-list probability, in contrast to previous results.
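
One way to write down the kind of problem described here is sketched below. This is an illustration under assumptions, not the paper's exact statement: the symbols $q_t$ (watermarked next-token distribution), $G_t$ (green list at step $t$), $\Delta$ (distortion budget) and $\lambda$ (dual variable) are introduced for this sketch, as is the direction of the KL divergence; $p_t$ denotes the original next-token distribution.

```latex
% Assumed formulation: maximize average green-list probability under an average KL budget
\begin{aligned}
\max_{q_1,\dots,q_T}\quad & \frac{1}{T}\sum_{t=1}^{T}\sum_{k\in G_t} q_{t,k} \\
\text{s.t.}\quad & \frac{1}{T}\sum_{t=1}^{T} D_{\mathrm{KL}}(q_t \,\|\, p_t) \le \Delta .
\end{aligned}

% Lagrangian with a single multiplier \lambda \ge 0
L(q,\lambda) = \frac{1}{T}\sum_{t=1}^{T}\Big[\sum_{k\in G_t} q_{t,k}
  - \lambda\, D_{\mathrm{KL}}(q_t \,\|\, p_t)\Big] + \lambda\Delta

% For fixed \lambda > 0, the per-step maximizer is an exponential tilt of p_t
q_{t,k} \;\propto\; p_{t,k}\,\exp\!\big(\mathbf{1}\{k\in G_t\}/\lambda\big)
```

Under these assumptions the tilt amounts to adding one constant, $1/\lambda$, to the logits of green tokens, shared across all steps. That would be one concrete reading of the "clean analytical characterization" mentioned above.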

What carries the argument

The online dual gradient ascent watermarking algorithm that solves the constrained optimization of green-list probability subject to a distortion budget via dual variables updated on the fly.
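
A minimal Python sketch of such a procedure follows. It is not the authors' implementation, and several choices are assumptions made for illustration: distortion is measured as KL(q || p), the watermarked distribution adds `1 / lam` to green-token logits, and the step size, initial value and floor `lam_min` are arbitrary. All function names are hypothetical, and `random_stream` is a synthetic stand-in for real LLM outputs.

```python
import math
import random


def tilt_green(p, green, delta):
    """Multiply green-token probabilities by exp(delta) and renormalize.

    Equivalent to adding `delta` to the logits of the green tokens.
    """
    weights = [pk * math.exp(delta) if k in green else pk
               for k, pk in enumerate(p)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(q, p):
    """KL(q || p) in nats; terms with q_k == 0 contribute nothing."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0.0)


def dual_ascent_watermark(dists, green_lists, budget,
                          step_size=0.5, lam0=1.0, lam_min=1e-2):
    """Online dual update over a stream of next-token distributions.

    dists       -- iterable of probability lists p_t (hypothetical input)
    green_lists -- iterable of sets of green token indices G_t
    budget      -- target average KL distortion (assumed constraint level)
    Returns per-step green probabilities, per-step KL values, final lambda.
    """
    lam = lam0
    green_probs, kls = [], []
    for p, green in zip(dists, green_lists):
        q = tilt_green(p, green, 1.0 / lam)  # assumed form: delta = 1 / lambda
        kl = kl_divergence(q, p)
        green_probs.append(sum(q[k] for k in green))
        kls.append(kl)
        # Raise lambda (weaker watermark) when over budget, lower it otherwise.
        lam = max(lam_min, lam + step_size * (kl - budget))
    return green_probs, kls, lam


def random_stream(steps, vocab, seed=0):
    """Synthetic stand-in for LLM outputs: random distributions and green lists."""
    rng = random.Random(seed)
    dists, greens = [], []
    for _ in range(steps):
        raw = [rng.expovariate(1.0) for _ in range(vocab)]
        total = sum(raw)
        dists.append([r / total for r in raw])
        greens.append(set(rng.sample(range(vocab), vocab // 2)))
    return dists, greens
```

Calling `dual_ascent_watermark(*random_stream(2000, 50), budget=0.1)` returns the per-step logs needed to compare the average KL against the budget. The synthetic stream is far more regular than real LLM next-token distributions, so it says nothing about behavior under non-stationary contexts.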

If this is right

The algorithm produces an explicit increase in average green-list token probability, improving detection rates for any fixed distortion level.
The optimal solution's analytical property supplies a principled way to choose the green-list size and bias at each step.
KL divergence is defended as the distortion metric that correctly captures the statistical change induced by watermarking.
Existing claims of distortion-free watermarking or reliance on perplexity are shown to be inadequate for evaluating the trade-off.
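
The link between green-list probability and detection in the first point can be illustrated with the one-proportion z-test commonly used for red-green list detection. That test is not described on this page, so treat it as an assumption about the detector; the green-list fraction `gamma` and the function names are likewise hypothetical.

```python
import math


def green_z_score(num_green, num_tokens, gamma=0.5):
    """z-statistic for the green-token count under the no-watermark null.

    Under the null, each token is assumed green with probability `gamma`.
    """
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std


def one_sided_p_value(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

With 200 tokens and `gamma=0.5`, 130 green tokens give a z-score of about 4.24, while 100 green tokens give exactly 0. A higher average green-list probability shifts the expected count upward, which is why it translates into detection power at a fixed text length.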

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same optimization framing could be applied to watermarking methods that do not rely on fixed red-green lists.
Real-time deployment of LLMs could use the dual variables as tunable knobs to meet application-specific detection thresholds.
Standardized reporting of KL-based distortion alongside detection metrics might become a common evaluation practice.

Load-bearing premise

The red-green list scheme is treated as the fixed underlying watermarking method whose distortion-detection frontier can be optimized, and the dual ascent procedure converges under standard regularity conditions on the LLM's token distributions.

What would settle it

Apply the dual ascent updates to a long sequence of real LLM token distributions and check whether the time-averaged green-list probability rises while the chosen distortion measure stays within the prescribed bound; if it does not, the asymptotic optimality claim fails.
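
The check described above can be sketched as a small function over per-step logs. The log format, the function name `settle_check`, and the optional `slack` tolerance are assumptions made for this illustration; the inputs would come from running the watermarking procedure on real LLM token distributions.

```python
from statistics import fmean


def settle_check(orig_green, wm_green, kls, budget, slack=0.0):
    """Compare time-averaged green probability and distortion against the claim.

    orig_green -- per-step green-list probability under the original model
    wm_green   -- per-step green-list probability under the watermarked model
    kls        -- per-step distortion values (e.g. KL divergence)
    budget     -- prescribed bound on the average distortion
    """
    avg_gain = fmean(wm_green) - fmean(orig_green)
    avg_kl = fmean(kls)
    return {
        "avg_green_gain": avg_gain,
        "avg_kl": avg_kl,
        "green_rises": avg_gain > 0.0,
        "within_budget": avg_kl <= budget + slack,
    }
```

Because the guarantee is asymptotic, a finite run can only suggest a verdict: both flags being true on long sequences is consistent with the claim, while a persistent failure of either one would count against it.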

Original abstract

In this paper, we study the problem of watermarking large language models (LLMs). We consider the trade-off between model distortion and detection ability and formulate it as a constrained optimization problem based on the red-green list watermarking algorithm. We show that the optimal solution to the optimization problem enjoys a nice analytical property which provides a better understanding and inspires the algorithm design for the watermarking process. We develop an online dual gradient ascent watermarking algorithm in light of this optimization formulation and prove its asymptotic Pareto optimality between model distortion and detection ability. Such a result guarantees an averaged increased green list probability and henceforth detection ability explicitly (in contrast to previous results). Moreover, we provide a systematic discussion on the choice of the model distortion metrics for the watermarking problem. We justify our choice of KL divergence and present issues with the existing criteria of "distortion-free" and perplexity. Finally, we empirically evaluate our algorithms on extensive datasets against benchmark algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper adds an optimization framing and online dual-ascent algorithm to red-green watermarking with an asymptotic Pareto optimality proof, but the guarantee rests on stationarity assumptions unlikely to hold for real LLM token sequences.

The main contribution is the constrained optimization formulation of the distortion-detection trade-off for red-green list watermarking, plus an online dual gradient ascent algorithm that comes with a proof of asymptotic Pareto optimality. This yields an explicit guarantee on averaged green-list probability, which the abstract contrasts with prior results. The systematic discussion of distortion metrics is also useful: they make a case for KL divergence and flag problems with claims of being distortion-free or using perplexity as the criterion. The empirical section runs comparisons against benchmarks on multiple datasets. These pieces give the work a clear internal logic and address a practical tuning issue in an existing watermarking family.

The soft spot is the proof. It invokes standard dual-ascent conditions including convexity, bounded subgradients, and suitable behavior on the sequence of token distributions. LLM next-token probabilities are context-dependent and non-stationary, with shifting support across a generation. If those conditions are not relaxed or shown to hold approximately, the asymptotic guarantee does not transfer and the claimed explicit improvement is weaker than stated. The abstract presents the result without visible qualification on this point.

This paper is for researchers already working on red-green watermarking who want a more principled tuning method. A reader focused on online optimization applied to detection tasks would also see value. It deserves a serious referee because it supplies a formal result and engages a live deployment problem, even if the proof scope needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper formulates LLM watermarking via red-green lists as a constrained optimization problem trading off model distortion (KL divergence) against detection power (green-list probability). It derives an analytical property of the optimum, introduces an online dual gradient ascent algorithm, and proves asymptotic Pareto optimality of the algorithm, which is claimed to deliver an explicit guarantee of increased average green-list probability (unlike prior work). The manuscript also justifies KL as the distortion metric over alternatives such as 'distortion-free' or perplexity and reports empirical comparisons on multiple datasets.

Significance. If the asymptotic Pareto-optimality result holds under its stated conditions, the work supplies a theoretically grounded algorithm with explicit (rather than implicit) detection guarantees, which would be a meaningful advance for statistical understanding of watermarking. The analytical characterization of the optimum and the focused discussion of distortion metrics are additional strengths that could inform future designs.

major comments (2)

Proof of asymptotic Pareto optimality: the convergence argument relies on standard dual-ascent conditions (Lagrangian convexity, bounded subgradients, and ergodicity/stationarity of the token-distribution sequence p_t). LLM next-token distributions are context-dependent and non-stationary, with possible support changes across a generation; no argument is given that these conditions continue to hold or that the guarantee survives their violation. This assumption is load-bearing for the central claim of an explicit, averaged increase in green-list probability.
Optimization formulation and base method: the red-green list scheme is fixed as the underlying mechanism whose distortion-detection frontier is optimized. It is unclear whether the derived analytical property and the online algorithm remain valid if the base watermarking procedure itself is altered or if the token distributions deviate from the assumed sequence; this affects whether the Pareto result is general or specific to the chosen base.

minor comments (2)

The abstract states that experiments were run on 'extensive datasets' but does not name them or report basic statistics (sequence length, number of prompts, variance across runs); adding this information would improve reproducibility.
Notation for the dual variables and the green-list probability update rule could be introduced earlier and used consistently to ease reading of the algorithm and proof.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and insightful comments on our manuscript. We address each major comment below in a point-by-point manner.

Referee: Proof of asymptotic Pareto optimality: the convergence argument relies on standard dual-ascent conditions (Lagrangian convexity, bounded subgradients, and ergodicity/stationarity of the token-distribution sequence p_t). LLM next-token distributions are context-dependent and non-stationary, with possible support changes across a generation; no argument is given that these conditions continue to hold or that the guarantee survives their violation. This assumption is load-bearing for the central claim of an explicit, averaged increase in green-list probability.

Authors: We appreciate the referee pointing out the reliance on these standard conditions for dual-ascent convergence. The proof establishes asymptotic Pareto optimality under the assumptions of Lagrangian convexity, bounded subgradients, and ergodicity/stationarity of the sequence {p_t}, which are explicitly stated in the analysis. While we acknowledge that LLM next-token distributions are context-dependent and the strict stationarity assumption may be violated in practice due to varying contexts and support changes, the result still supplies an explicit (rather than implicit) guarantee under the stated conditions. This constitutes a theoretical advance relative to prior work. In revision we will add a paragraph discussing the scope of the assumptions and their practical relevance to LLM generation.

revision: partial

Referee: Optimization formulation and base method: the red-green list scheme is fixed as the underlying mechanism whose distortion-detection frontier is optimized. It is unclear whether the derived analytical property and the online algorithm remain valid if the base watermarking procedure itself is altered or if the token distributions deviate from the assumed sequence; this affects whether the Pareto result is general or specific to the chosen base.

Authors: The constrained optimization formulation, the analytical property of the optimum, and the online dual gradient ascent algorithm are all developed specifically for the red-green list watermarking scheme, as stated in the abstract and introduction. The Pareto-optimality guarantee is therefore tied to this base mechanism and the associated token-distribution sequence. We view this focus as appropriate, since red-green lists represent a standard and practical watermarking approach; the paper does not claim generality beyond this setting. Generalization to other watermarking procedures is left as future work.

revision: no

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper formulates the watermarking trade-off as a constrained optimization problem based on the red-green list algorithm, identifies an analytical property of the optimum, and develops an online dual gradient ascent procedure whose asymptotic Pareto optimality is proved under standard dual-ascent conditions (convexity, bounded subgradients, ergodicity). These steps rely on external convergence theory rather than self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations; the claimed explicit increase in green-list probability follows from the convergence result, not by construction from the inputs. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit list of fitted parameters or invented entities; the central claim rests on the modeling choice of red-green lists as the base and on standard convergence assumptions for dual ascent.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks
cs.CR · 2026-05 · unverdicted · novelty 6.0

PASA is a semantic-level watermarking method for LLM text that uses embedding-space clusters and synchronized randomness to remain detectable after paraphrasing while preserving text quality.