Sample Complexities of Estimating Gumbel--Max Watermark Proportions with and without Reduction to Pivotal Statistics
Pith reviewed 2026-07-03 21:53 UTC · model grok-4.3
The pith
Full observation of tokens and vectors yields lower sample complexity than pivotal reduction for estimating watermark proportions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the Gumbel-max watermark, the minimal number of tokens required to estimate the watermark proportion to fixed accuracy is substantially smaller when the full pseudorandom vector and selected token are observed at each position than when only the one-dimensional pivot is retained; explicit estimators achieve the information-theoretic lower bounds in each regime.
What carries the argument
The distinction between full observation of the Gumbel-max pseudorandom vector and token versus reduction to the scalar Uniform-Beta pivot, together with the event-counting and Laguerre-polynomial estimators derived for each case.
If this is right
- The event-counting estimator is optimal for the full-observation regime.
- The Laguerre-polynomial estimator is optimal for the pivotal-reduction regime.
- Reduction to pivots sacrifices statistical efficiency when the target is a continuous proportion rather than binary detection.
Where Pith is reading between the lines
- Watermark designs intended for proportion estimation could retain vector information instead of discarding it after pivot computation.
- The same efficiency comparison may apply to other mixture-proportion problems that involve high-dimensional nuisance parameters.
- Short documents may become estimable only under the full-observation regime.
Load-bearing premise
Next-token prediction distributions are unknown and arbitrary but obey a non-degeneracy condition.
What would settle it
An explicit construction or numerical check showing that the sample complexity under pivotal reduction is no larger than under full observation would refute the claimed efficiency gap.
read the original abstract
Watermarking promises statistical traceability of large language model (LLM) uses, but real documents rarely arrive as purely human-written or purely LLM-generated. This motivates a quantitative question beyond detection: what proportion of a document is generated from a pre-specified watermarked LLM? We study this watermark proportion estimation problem under the Gumbel--max watermarking mechanism, treating the next-token prediction distributions as unknown and arbitrary nuisance parameters subject to a non-degeneracy condition. We compare two observation regimes: in the full observation regime, the estimator observes the pseudorandom vector and the selected token at each position; in the more prevalent setting of pivotal reduction, it observes only a scalar pivot, which follows a one-dimensional Uniform--Beta mixture distribution. Under pivotal reduction, we develop a Laguerre-polynomial estimator and establish a matching information-theoretic lower bound for the sample complexity. For full observation, we introduce an event-counting estimator and show a matching lower bound, yielding a substantially smaller sample complexity. As our results imply, although reducing to pivotal statistics is an elegant and prevalent choice, it is not always sample-efficient for estimating the proportion of watermarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies estimation of the watermark proportion (fraction of tokens generated by a Gumbel-max watermarked LLM) in mixed documents, treating next-token distributions as unknown arbitrary nuisance parameters subject to a non-degeneracy condition. It compares the full-observation regime (pseudorandom vector and chosen token observed at each step) against the pivotal-reduction regime (only the scalar Uniform-Beta mixture pivot observed). For pivotal reduction the authors construct a Laguerre-polynomial estimator and prove a matching information-theoretic lower bound; for full observation they construct an event-counting estimator and likewise prove a matching lower bound, establishing that the full-observation sample complexity is substantially smaller.
Significance. If the stated matching upper and lower bounds hold, the work supplies a complete information-theoretic characterization of sample complexity in both observation models and shows that reduction to the pivotal statistic, while prevalent, is not always sample-efficient. The explicit construction of estimators together with the matching bounds constitutes a clear technical contribution to the statistical analysis of LLM watermarking.
minor comments (3)
- The non-degeneracy condition on the next-token distributions is invoked repeatedly but is stated only at a high level in the abstract; a precise formulation (e.g., a lower bound on the minimal probability mass or on the separation of the Gumbel-max logits) should appear in the main text before the first theorem that relies on it.
- Notation for the watermark proportion parameter and for the two observation regimes is introduced without a consolidated table or diagram; adding a short notation table would improve readability.
- The Laguerre-polynomial estimator is described only at the level of its functional form; an explicit statement of the truncation degree or the coefficient estimation procedure would help readers reproduce the construction.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for the positive assessment of its contributions to the information-theoretic analysis of watermark proportion estimation. The referee's summary correctly identifies the key comparison between full-observation and pivotal-reduction regimes, as well as the matching upper and lower bounds we derive in each case.
Circularity Check
No significant circularity; theoretical bounds derived independently
full rationale
The paper derives matching upper and lower bounds on sample complexity for watermark proportion estimation under two observation regimes, treating next-token distributions as arbitrary nuisance parameters subject only to a non-degeneracy condition. The central results (Laguerre-polynomial estimator under pivotal reduction and event-counting estimator under full observation) follow from standard information-theoretic arguments and explicit construction of estimators, without any reduction of predictions to fitted parameters, self-definitional loops, or load-bearing self-citations. The comparison between regimes is obtained directly from the derived bounds rather than by construction or renaming. This is a self-contained theoretical analysis with no evident circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption next-token prediction distributions are unknown and arbitrary nuisance parameters subject to a non-degeneracy condition
Reference graph
Works this paper leans on
-
[1]
Watermarking of large language models
Scott Aaronson and H Kirchner. Watermarking of large language models. In Large Language Models and Transformers Workshop at Simons Institute for the Theory of Computing , 2023
2023
-
[2]
Handbook of mathematical functions, with formulas, graphs, and mathematical tables, 1966
Milton Abramowitz, Irene A Stegun, and Jacques E Romain. Handbook of mathematical functions, with formulas, graphs, and mathematical tables, 1966
1966
-
[3]
Minimax estimation of linear and quadratic functionals on sparsity classes
Olivier Collier, La \"e titia Comminges, and Alexandre B Tsybakov. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics , pages 923--958, 2017
2017
-
[4]
On estimation of nonsmooth functionals of sparse normal means
Olivier Collier, La \"e titia Comminges, and Alexandre B Tsybakov. On estimation of nonsmooth functionals of sparse normal means. Bernoulli , 2020
2020
-
[5]
Undetectable watermarks for language models
Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. In The Thirty Seventh Annual Conference on Learning Theory , pages 1125--1139. PMLR, 2024
2024
-
[6]
Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing
T Tony Cai and Jiashun Jin. Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing. The Annals of Statistics , 38(304):100--145, 2010
2010
-
[7]
Optimal detection for language watermarks with pseudorandom collision
T Tony Cai, Xiang Li, Qi Long, Weijie J Su, and Garrett G Wen. Optimal detection for language watermarks with pseudorandom collision. arXiv preprint arXiv:2510.22007 , 2025
-
[8]
Sharp regret-Hellinger bounds for Gaussian empirical Bayes via polynomial approximation
Jiafeng Chen and Yihong Wu. Sharp regret- Hellinger bounds for Gaussian empirical Bayes via polynomial approximation. arXiv preprint arXiv:2605.02070 , 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[9]
Sub- Gaussian mean estimators
Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I Oliveira. Sub- Gaussian mean estimators. The Annals of Statistics , pages 2695--2725, 2016
2016
-
[10]
Optimal estimation of high-dimensional Gaussian location mixtures
Natalie Doss, Yihong Wu, Pengkun Yang, and Harrison H Zhou. Optimal estimation of high-dimensional Gaussian location mixtures. The Annals of Statistics , 51(1):62--95, 2023
2023
-
[11]
Statistical theory of extreme values and some practical applications: a series of lectures , volume 33
Emil Julius Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures , volume 33. US Government Printing Office, 1954
1954
-
[12]
Unbiased watermark for large language models
Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. Unbiased watermark for large language models. In International Conference on Learning Representations , volume 2024, pages 45408--45436, 2024
2024
-
[13]
Minimax rate-optimal estimation of KL divergence between discrete distributions
Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimation of KL divergence between discrete distributions. In 2016 International Symposium on Information Theory and Its Applications (ISITA) , pages 256--260. IEEE, 2016
2016
-
[14]
Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance
Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. In Conference On Learning Theory , pages 3189--3221. PMLR, 2018
2018
-
[15]
Optimal prediction of the number of unseen species with multiplicity
Yi Hao and Ping Li. Optimal prediction of the number of unseen species with multiplicity. Advances in Neural Information Processing Systems , 33:8553--8564, 2020
2020
-
[16]
On the empirical power of goodness-of-fit tests in watermark detection
Weiqing He, Xiang Li, Tianqi Shang, Li Shen, Weijie Su, and Qi Long. On the empirical power of goodness-of-fit tests in watermark detection. Advances in neural information processing systems , 38:18761--18793, 2026
2026
-
[17]
Data amplification: Instance-optimal property estimation
Yi Hao and Alon Orlitsky. Data amplification: Instance-optimal property estimation. In International Conference on Machine Learning , pages 4049--4059. PMLR, 2020
2020
-
[18]
Sharp Inequalities between Total Variation and Hellinger Distances for Gaussian Mixtures
Joonhyuk Jung and Chao Gao. Sharp inequalities between total variation and Hellinger distances for Gaussian mixtures. arXiv preprint arXiv:2602.03202 , 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[19]
Categorical reparameterization with Gumbel -softmax
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel -softmax. In International Conference on Learning Representations , 2017
2017
-
[20]
Minimax estimation of functionals of discrete distributions
Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory , 61(5):2835--2885, 2015
2015
-
[21]
A watermark for large language models
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International conference on machine learning , pages 17061--17084. PMLR, 2023
2023
-
[22]
Robust distortion-free watermarks for language models
Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. Transactions on Machine Learning Research , 2024
2024
-
[23]
Refined detection for Gumbel watermarking
Tor Lattimore. Refined detection for Gumbel watermarking. arXiv preprint arXiv:2603.30017 , 2026
-
[24]
On a problem of adaptive estimation in Gaussian white noise
OV Lepskii. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability & Its Applications , 35(3):454--466, 1991
1991
-
[25]
Asymptotically minimax adaptive estimation
OV Lepskii. Asymptotically minimax adaptive estimation. I : Upper bounds. optimally adaptive estimates. Theory of Probability & Its Applications , 36(4):682--697, 1992
1992
-
[26]
A likelihood based approach for watermark detection
Xingchi Li, Guanxun Li, and Xianyang Zhang. A likelihood based approach for watermark detection. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics , volume 258, pages 1675--1683, 2025
2025
-
[27]
Mean estimation and regression under heavy-tailed distributions: A survey
G \'a bor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics , 19(5):1145--1190, 2019
2019
-
[28]
A statistical framework of watermarks for large language models: Pivot, detection efficiency and optimal rules
Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J Su. A statistical framework of watermarks for large language models: Pivot, detection efficiency and optimal rules. The Annals of Statistics , 53(1):322--351, 2025
2025
-
[29]
Robust detection of watermarks for large language models under human edits
Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, and Weijie J Su. Robust detection of watermarks for large language models under human edits. Journal of the Royal Statistical Society Series B: Statistical Methodology , 88(2):491--515, 2026
2026
-
[30]
Optimal estimation of watermark proportions in hybrid AI -human texts
Xiang Li, Garrett Wen, Weiqing He, Jiayuan Wu, Qi Long, and Weijie J Su. Optimal estimation of watermark proportions in hybrid AI -human texts. arXiv preprint arXiv:2506.22343 , 2025
-
[31]
Optimal estimation of simultaneous signals using absolute inner product with applications to integrative genomics
Rong Ma, T Tony Cai, and Hongzhe Li. Optimal estimation of simultaneous signals using absolute inner product with applications to integrative genomics. Statistica Sinica , 32(2):1027--1048, 2022
2022
-
[32]
A* sampling
Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. Advances in neural information processing systems , 27, 2014
2014
-
[33]
On the best approximation by finite Gaussian mixtures
Yun Ma, Yihong Wu, and Pengkun Yang. On the best approximation by finite Gaussian mixtures. IEEE Transactions on Information Theory , 2025
2025
-
[34]
Optimal prediction of the number of unseen species
Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences , 113(47):13283--13288, 2016
2016
-
[35]
Dualizing Le Cam ’s method for functional estimation I : General theory
Yury Polyanskiy and Yihong Wu. Dualizing Le Cam ’s method for functional estimation I : General theory. The Annals of Statistics , 54(1):1--24, 2026
2026
-
[36]
Complex analysis , volume 2
Elias M Stein and Rami Shakarchi. Complex analysis , volume 2. Princeton University Press, 2010
2010
-
[37]
Adaptive Confidence Intervals in Efron's Gaussian Two-Groups Model
Qiaosen Wang, Shuwen Chai, and Chao Gao. Adaptive confidence intervals in Efron 's Gaussian two-groups model. arXiv preprint arXiv:2604.26992 , 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[38]
Minimax rates of entropy estimation on large alphabets via best polynomial approximation
Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory , 62(6):3702--3720, 2016
2016
-
[39]
Chebyshev polynomials, moment matching, and optimal estimation of the unseen
Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics , 47(2):857--883, 2019
2019
-
[40]
Optimal estimation of Gaussian mixtures via denoised method of moments
Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. Annals of Statistics , 48(4), 2020
2020
-
[41]
Polynomial methods in statistical inference: Theory and practice
Yihong Wu and Pengkun Yang. Polynomial methods in statistical inference: Theory and practice. Foundations and Trends in Communications and Information Theory , 17(4):402--585, 2020
2020
-
[42]
Provable robust watermarking for AI -generated text
Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI -generated text. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations , volume 2024, pages 43738--43772, 2024
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.