pith. sign in

arxiv: 2607.00224 · v2 · pith:42VNWSJAnew · submitted 2026-06-30 · 🧮 math.ST · cs.IT· cs.LG· math.IT· stat.ML· stat.TH

Sample Complexities of Estimating Gumbel--Max Watermark Proportions with and without Reduction to Pivotal Statistics

Pith reviewed 2026-07-03 21:53 UTC · model grok-4.3

classification 🧮 math.ST cs.ITcs.LGmath.ITstat.MLstat.TH
keywords watermark proportion estimationGumbel-max watermarksample complexitypivotal statisticsstatistical estimationLLM watermarking
0
0 comments X

The pith

Full observation of tokens and vectors yields lower sample complexity than pivotal reduction for estimating watermark proportions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies estimation of the proportion of text generated by a Gumbel-max watermarked LLM when next-token distributions are unknown nuisance parameters. It contrasts the full-observation regime, where both the pseudorandom vector and chosen token are recorded, against the reduced regime that retains only a scalar pivot drawn from a Uniform-Beta mixture. An event-counting estimator together with a matching lower bound is derived for the full regime, while a Laguerre-polynomial estimator and matching lower bound are derived for the reduced regime. The resulting sample-complexity comparison shows that the prevalent reduction to pivots is strictly less efficient for proportion estimation. A reader cares because mixed human-LLM documents are the realistic case and the number of tokens needed determines whether the estimate is feasible on short texts.

Core claim

Under the Gumbel-max watermark, the minimal number of tokens required to estimate the watermark proportion to fixed accuracy is substantially smaller when the full pseudorandom vector and selected token are observed at each position than when only the one-dimensional pivot is retained; explicit estimators achieve the information-theoretic lower bounds in each regime.

What carries the argument

The distinction between full observation of the Gumbel-max pseudorandom vector and token versus reduction to the scalar Uniform-Beta pivot, together with the event-counting and Laguerre-polynomial estimators derived for each case.

If this is right

  • The event-counting estimator is optimal for the full-observation regime.
  • The Laguerre-polynomial estimator is optimal for the pivotal-reduction regime.
  • Reduction to pivots sacrifices statistical efficiency when the target is a continuous proportion rather than binary detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Watermark designs intended for proportion estimation could retain vector information instead of discarding it after pivot computation.
  • The same efficiency comparison may apply to other mixture-proportion problems that involve high-dimensional nuisance parameters.
  • Short documents may become estimable only under the full-observation regime.

Load-bearing premise

Next-token prediction distributions are unknown and arbitrary but obey a non-degeneracy condition.

What would settle it

An explicit construction or numerical check showing that the sample complexity under pivotal reduction is no larger than under full observation would refute the claimed efficiency gap.

read the original abstract

Watermarking promises statistical traceability of large language model (LLM) uses, but real documents rarely arrive as purely human-written or purely LLM-generated. This motivates a quantitative question beyond detection: what proportion of a document is generated from a pre-specified watermarked LLM? We study this watermark proportion estimation problem under the Gumbel--max watermarking mechanism, treating the next-token prediction distributions as unknown and arbitrary nuisance parameters subject to a non-degeneracy condition. We compare two observation regimes: in the full observation regime, the estimator observes the pseudorandom vector and the selected token at each position; in the more prevalent setting of pivotal reduction, it observes only a scalar pivot, which follows a one-dimensional Uniform--Beta mixture distribution. Under pivotal reduction, we develop a Laguerre-polynomial estimator and establish a matching information-theoretic lower bound for the sample complexity. For full observation, we introduce an event-counting estimator and show a matching lower bound, yielding a substantially smaller sample complexity. As our results imply, although reducing to pivotal statistics is an elegant and prevalent choice, it is not always sample-efficient for estimating the proportion of watermarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript studies estimation of the watermark proportion (fraction of tokens generated by a Gumbel-max watermarked LLM) in mixed documents, treating next-token distributions as unknown arbitrary nuisance parameters subject to a non-degeneracy condition. It compares the full-observation regime (pseudorandom vector and chosen token observed at each step) against the pivotal-reduction regime (only the scalar Uniform-Beta mixture pivot observed). For pivotal reduction the authors construct a Laguerre-polynomial estimator and prove a matching information-theoretic lower bound; for full observation they construct an event-counting estimator and likewise prove a matching lower bound, establishing that the full-observation sample complexity is substantially smaller.

Significance. If the stated matching upper and lower bounds hold, the work supplies a complete information-theoretic characterization of sample complexity in both observation models and shows that reduction to the pivotal statistic, while prevalent, is not always sample-efficient. The explicit construction of estimators together with the matching bounds constitutes a clear technical contribution to the statistical analysis of LLM watermarking.

minor comments (3)
  1. The non-degeneracy condition on the next-token distributions is invoked repeatedly but is stated only at a high level in the abstract; a precise formulation (e.g., a lower bound on the minimal probability mass or on the separation of the Gumbel-max logits) should appear in the main text before the first theorem that relies on it.
  2. Notation for the watermark proportion parameter and for the two observation regimes is introduced without a consolidated table or diagram; adding a short notation table would improve readability.
  3. The Laguerre-polynomial estimator is described only at the level of its functional form; an explicit statement of the truncation degree or the coefficient estimation procedure would help readers reproduce the construction.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the positive assessment of its contributions to the information-theoretic analysis of watermark proportion estimation. The referee's summary correctly identifies the key comparison between full-observation and pivotal-reduction regimes, as well as the matching upper and lower bounds we derive in each case.

Circularity Check

0 steps flagged

No significant circularity; theoretical bounds derived independently

full rationale

The paper derives matching upper and lower bounds on sample complexity for watermark proportion estimation under two observation regimes, treating next-token distributions as arbitrary nuisance parameters subject only to a non-degeneracy condition. The central results (Laguerre-polynomial estimator under pivotal reduction and event-counting estimator under full observation) follow from standard information-theoretic arguments and explicit construction of estimators, without any reduction of predictions to fitted parameters, self-definitional loops, or load-bearing self-citations. The comparison between regimes is obtained directly from the derived bounds rather than by construction or renaming. This is a self-contained theoretical analysis with no evident circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; main domain assumption identified is the non-degeneracy condition on token distributions.

axioms (1)
  • domain assumption next-token prediction distributions are unknown and arbitrary nuisance parameters subject to a non-degeneracy condition
    Explicitly stated in abstract as the setting for the estimation problem.

pith-pipeline@v0.9.1-grok · 5751 in / 987 out tokens · 35490 ms · 2026-07-03T21:53:58.264100+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1]

    Watermarking of large language models

    Scott Aaronson and H Kirchner. Watermarking of large language models. In Large Language Models and Transformers Workshop at Simons Institute for the Theory of Computing , 2023

  2. [2]

    Handbook of mathematical functions, with formulas, graphs, and mathematical tables, 1966

    Milton Abramowitz, Irene A Stegun, and Jacques E Romain. Handbook of mathematical functions, with formulas, graphs, and mathematical tables, 1966

  3. [3]

    Minimax estimation of linear and quadratic functionals on sparsity classes

    Olivier Collier, La \"e titia Comminges, and Alexandre B Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics , pages 923--958, 2017

  4. [4]

    On estimation of nonsmooth functionals of sparse normal means

    Olivier Collier, La \"e titia Comminges, and Alexandre B Tsybakov. On estimation of nonsmooth functionals of sparse normal means. Bernoulli , 2020

  5. [5]

    Undetectable watermarks for language models

    Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. In The Thirty Seventh Annual Conference on Learning Theory , pages 1125--1139. PMLR, 2024

  6. [6]

    Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing

    T Tony Cai and Jiashun Jin. Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing. The Annals of Statistics , 38(304):100--145, 2010

  7. [7]

    Optimal detection for language watermarks with pseudorandom collision

    T Tony Cai, Xiang Li, Qi Long, Weijie J Su, and Garrett G Wen. Optimal detection for language watermarks with pseudorandom collision. arXiv preprint arXiv:2510.22007 , 2025

  8. [8]

    Sharp regret-Hellinger bounds for Gaussian empirical Bayes via polynomial approximation

    Jiafeng Chen and Yihong Wu. Sharp regret- Hellinger bounds for Gaussian empirical Bayes via polynomial approximation. arXiv preprint arXiv:2605.02070 , 2026

  9. [9]

    Sub- Gaussian mean estimators

    Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I Oliveira. Sub- Gaussian mean estimators. The Annals of Statistics , pages 2695--2725, 2016

  10. [10]

    Optimal estimation of high-dimensional Gaussian location mixtures

    Natalie Doss, Yihong Wu, Pengkun Yang, and Harrison H Zhou. Optimal estimation of high-dimensional Gaussian location mixtures. The Annals of Statistics , 51(1):62--95, 2023

  11. [11]

    Statistical theory of extreme values and some practical applications: a series of lectures , volume 33

    Emil Julius Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures , volume 33. US Government Printing Office, 1954

  12. [12]

    Unbiased watermark for large language models

    Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. Unbiased watermark for large language models. In International Conference on Learning Representations , volume 2024, pages 45408--45436, 2024

  13. [13]

    Minimax rate-optimal estimation of KL divergence between discrete distributions

    Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimation of KL divergence between discrete distributions. In 2016 International Symposium on Information Theory and Its Applications (ISITA) , pages 256--260. IEEE, 2016

  14. [14]

    Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance

    Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. In Conference On Learning Theory , pages 3189--3221. PMLR, 2018

  15. [15]

    Optimal prediction of the number of unseen species with multiplicity

    Yi Hao and Ping Li. Optimal prediction of the number of unseen species with multiplicity. Advances in Neural Information Processing Systems , 33:8553--8564, 2020

  16. [16]

    On the empirical power of goodness-of-fit tests in watermark detection

    Weiqing He, Xiang Li, Tianqi Shang, Li Shen, Weijie Su, and Qi Long. On the empirical power of goodness-of-fit tests in watermark detection. Advances in neural information processing systems , 38:18761--18793, 2026

  17. [17]

    Data amplification: Instance-optimal property estimation

    Yi Hao and Alon Orlitsky. Data amplification: Instance-optimal property estimation. In International Conference on Machine Learning , pages 4049--4059. PMLR, 2020

  18. [18]

    Sharp Inequalities between Total Variation and Hellinger Distances for Gaussian Mixtures

    Joonhyuk Jung and Chao Gao. Sharp inequalities between total variation and Hellinger distances for Gaussian mixtures. arXiv preprint arXiv:2602.03202 , 2026

  19. [19]

    Categorical reparameterization with Gumbel -softmax

    Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel -softmax. In International Conference on Learning Representations , 2017

  20. [20]

    Minimax estimation of functionals of discrete distributions

    Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory , 61(5):2835--2885, 2015

  21. [21]

    A watermark for large language models

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International conference on machine learning , pages 17061--17084. PMLR, 2023

  22. [22]

    Robust distortion-free watermarks for language models

    Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. Transactions on Machine Learning Research , 2024

  23. [23]

    Refined detection for Gumbel watermarking

    Tor Lattimore. Refined detection for Gumbel watermarking. arXiv preprint arXiv:2603.30017 , 2026

  24. [24]

    On a problem of adaptive estimation in Gaussian white noise

    OV Lepskii. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability & Its Applications , 35(3):454--466, 1991

  25. [25]

    Asymptotically minimax adaptive estimation

    OV Lepskii. Asymptotically minimax adaptive estimation. I : Upper bounds. optimally adaptive estimates. Theory of Probability & Its Applications , 36(4):682--697, 1992

  26. [26]

    A likelihood based approach for watermark detection

    Xingchi Li, Guanxun Li, and Xianyang Zhang. A likelihood based approach for watermark detection. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , volume 258, pages 1675--1683, 2025

  27. [27]

    Mean estimation and regression under heavy-tailed distributions: A survey

    G \'a bor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics , 19(5):1145--1190, 2019

  28. [28]

    A statistical framework of watermarks for large language models: Pivot, detection efficiency and optimal rules

    Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J Su. A statistical framework of watermarks for large language models: Pivot, detection efficiency and optimal rules. The Annals of Statistics , 53(1):322--351, 2025

  29. [29]

    Robust detection of watermarks for large language models under human edits

    Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J Su. Robust detection of watermarks for large language models under human edits. Journal of the Royal Statistical Society Series B: Statistical Methodology , 88(2):491--515, 2026

  30. [30]

    Optimal estimation of watermark proportions in hybrid AI -human texts

    Xiang Li, Garrett Wen, Weiqing He, Jiayuan Wu, Qi Long, and Weijie J Su. Optimal estimation of watermark proportions in hybrid AI -human texts. arXiv preprint arXiv:2506.22343 , 2025

  31. [31]

    Optimal estimation of simultaneous signals using absolute inner product with applications to integrative genomics

    Rong Ma, T Tony Cai, and Hongzhe Li. Optimal estimation of simultaneous signals using absolute inner product with applications to integrative genomics. Statistica Sinica , 32(2):1027--1048, 2022

  32. [32]

    A* sampling

    Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. Advances in neural information processing systems , 27, 2014

  33. [33]

    On the best approximation by finite Gaussian mixtures

    Yun Ma, Yihong Wu, and Pengkun Yang. On the best approximation by finite Gaussian mixtures. IEEE Transactions on Information Theory , 2025

  34. [34]

    Optimal prediction of the number of unseen species

    Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences , 113(47):13283--13288, 2016

  35. [35]

    Dualizing Le Cam ’s method for functional estimation I : General theory

    Yury Polyanskiy and Yihong Wu. Dualizing Le Cam ’s method for functional estimation I : General theory. The Annals of Statistics , 54(1):1--24, 2026

  36. [36]

    Complex analysis , volume 2

    Elias M Stein and Rami Shakarchi. Complex analysis , volume 2. Princeton University Press, 2010

  37. [37]

    Adaptive Confidence Intervals in Efron's Gaussian Two-Groups Model

    Qiaosen Wang, Shuwen Chai, and Chao Gao. Adaptive confidence intervals in Efron 's Gaussian two-groups model. arXiv preprint arXiv:2604.26992 , 2026

  38. [38]

    Minimax rates of entropy estimation on large alphabets via best polynomial approximation

    Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory , 62(6):3702--3720, 2016

  39. [39]

    Chebyshev polynomials, moment matching, and optimal estimation of the unseen

    Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics , 47(2):857--883, 2019

  40. [40]

    Optimal estimation of Gaussian mixtures via denoised method of moments

    Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. Annals of Statistics , 48(4), 2020

  41. [41]

    Polynomial methods in statistical inference: Theory and practice

    Yihong Wu and Pengkun Yang. Polynomial methods in statistical inference: Theory and practice. Foundations and Trends in Communications and Information Theory , 17(4):402--585, 2020

  42. [42]

    Provable robust watermarking for AI -generated text

    Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI -generated text. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations , volume 2024, pages 43738--43772, 2024