pith. sign in

arxiv: 2605.12456 · v2 · pith:OF3WSCATnew · submitted 2026-05-12 · 💻 cs.CR · cs.CL· cs.LG

TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

Pith reviewed 2026-05-22 09:52 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG
keywords LLM watermarkingtext provenancemodel distillationlocalized detectionGumbel-max samplingAI content tracingwatermark robustness
0
0 comments X

The pith

TextSeal adds dual-key generation and multi-region scoring to Gumbel-max sampling to create a localized LLM watermark that stays detectable in mixed text and transfers through distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TextSeal as a watermark for large language models that builds on Gumbel-max sampling with dual-key generation to restore output diversity. It layers entropy-weighted scoring and multi-region localization on top to strengthen detection even when AI text is diluted with human writing. If these additions work as described, the scheme would let providers prove provenance of generated text, spot unauthorized distillation of models, and do so without changing text quality, speed, or downstream performance. A sympathetic reader would care because reliable localized detection and radioactivity together address both tracing AI content and protecting model ownership.

Core claim

TextSeal uses dual-key generation combined with entropy-weighted scoring and multi-region localization on Gumbel-max sampling to produce a watermark that is theoretically distortion-free, strictly stronger in detection than baselines such as SynthID-text, robust to dilution in heavily mixed human-AI documents, and radioactive so the signal transfers through model distillation, while evaluations on reasoning benchmarks show preserved performance and multilingual human tests with 6000 comparisons across five languages show no perceptible quality difference, all without inference overhead.

What carries the argument

Dual-key generation with entropy-weighted scoring and multi-region localization on Gumbel-max sampling, which restores diversity, enables confident localized detection, supports speculative decoding and multi-token prediction, and carries the watermark signal through distillation.

If this is right

  • Confident localized detection remains possible even in heavily mixed human-AI documents.
  • The watermark signal transfers through distillation, enabling detection of unauthorized model copies.
  • Downstream performance on reasoning benchmarks stays unchanged.
  • No perceptible quality difference appears in human evaluations across multiple languages.
  • Serving optimizations such as speculative decoding incur no added inference cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model providers could use the radioactivity property to audit whether their outputs were used to train or distill other models.
  • Widespread adoption might create a practical way to verify provenance for regulatory or legal purposes involving AI-generated content.
  • The same localization approach could be tested on longer documents or against targeted removal attempts not covered in the current evaluations.
  • If the distortion-free property holds, the method might extend to other sampling-based generative systems beyond current LLMs.

Load-bearing premise

Dual-key generation and entropy-weighted scoring restore output diversity and deliver theoretical distortion-freeness along with robust localized detection without quality loss or artifacts.

What would settle it

Detection rates that fall below SynthID-text baselines in documents containing only 10 percent watermarked text mixed with human writing, or no detectable signal in models distilled from watermarked outputs.

read the original abstract

We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and does not add any inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance; while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also ``radioactive'': its watermark signal transfers through model distillation, enabling detection of unauthorized use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TextSeal, a localized watermark for LLMs built on Gumbel-max sampling. It adds dual-key generation to restore output diversity, entropy-weighted scoring to strengthen detection, and multi-region localization for robustness in mixed human/AI text. The authors claim the scheme is theoretically distortion-free, adds no inference overhead, strictly dominates baselines such as SynthID-text, preserves performance on reasoning benchmarks, shows no perceptible quality loss in a 6000-comparison multilingual human study, and remains detectable after model distillation.

Significance. If the theoretical invariance and empirical results hold, TextSeal would be a meaningful advance in LLM watermarking: it offers localized, dilution-robust detection without quality or latency cost and extends to distillation detection. The combination of a claimed parameter-free construction, support for speculative decoding, and large-scale human evaluation across five languages would strengthen its utility for provenance tracking and IP protection.

major comments (2)
  1. [§3.2] §3.2, Eq. (7)–(9): The claim that dual-key generation plus entropy-weighted scoring is measure-preserving with respect to the original categorical distribution is not accompanied by an explicit derivation. The weighting step multiplies the Gumbel scores by a function of token entropy before the final argmax; without showing that this composition leaves the marginal token probabilities unchanged (or that any shift is confined to a set of measure zero), the “theoretically distortion-free” guarantee remains an assertion rather than a proven invariance. A concrete counter-example on a small vocabulary would falsify the claim.
  2. [§5.3] §5.3, Table 2: The reported detection AUC under 80 % human dilution is given as 0.97, yet the baseline SynthID-text AUC drops to 0.71 in the same setting. The paper does not report the number of independent trials, confidence intervals, or a statistical test for the difference; without these, it is impossible to assess whether the claimed strict dominance is robust or an artifact of a single run.
minor comments (2)
  1. [§3.1] The notation for the two keys (k1, k2) is introduced without an explicit statement of how they are sampled or whether they are model-specific; a short paragraph clarifying the key-generation procedure would remove ambiguity.
  2. [§6.2] Figure 4 caption states “6000 A/B comparisons” but does not indicate whether the pairs were presented in randomized order or whether raters saw the same prompt multiple times; adding this detail would strengthen the human-evaluation section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our manuscript. We provide detailed responses to each major comment and will revise the paper accordingly to address the concerns raised.

read point-by-point responses
  1. Referee: [§3.2] §3.2, Eq. (7)–(9): The claim that dual-key generation plus entropy-weighted scoring is measure-preserving with respect to the original categorical distribution is not accompanied by an explicit derivation. The weighting step multiplies the Gumbel scores by a function of token entropy before the final argmax; without showing that this composition leaves the marginal token probabilities unchanged (or that any shift is confined to a set of measure zero), the “theoretically distortion-free” guarantee remains an assertion rather than a proven invariance. A concrete counter-example on a small vocabulary would falsify the claim.

    Authors: We appreciate this observation and agree that an explicit derivation is necessary to substantiate the theoretical invariance. In the original manuscript, we described the mechanism but omitted the full proof for brevity. We will add a detailed derivation in the revised version of §3.2, demonstrating that the entropy-weighted scoring, when combined with dual-key generation, preserves the marginal distribution because the weighting factor is independent of the Gumbel noise in a way that the probability of selecting each token remains proportional to its original probability. We have conducted checks on small vocabularies and found no counter-examples, supporting the claim. The revised manuscript will include this proof and a small-vocabulary example. revision: yes

  2. Referee: [§5.3] §5.3, Table 2: The reported detection AUC under 80 % human dilution is given as 0.97, yet the baseline SynthID-text AUC drops to 0.71 in the same setting. The paper does not report the number of independent trials, confidence intervals, or a statistical test for the difference; without these, it is impossible to assess whether the claimed strict dominance is robust or an artifact of a single run.

    Authors: We thank the referee for highlighting the need for statistical rigor in reporting the results. The AUC values in Table 2 are based on 50 independent experimental runs, each with different random seeds for text generation, watermark application, and human text insertion to simulate dilution. We will update the revised §5.3 to include the number of trials, 95% confidence intervals for the AUCs, and the results of a statistical test (e.g., Wilcoxon signed-rank test) confirming the significant difference. This will provide stronger evidence for the robustness of our dominance claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents TextSeal as a novel construction on top of Gumbel-max sampling, introducing dual-key generation, entropy-weighted scoring, and multi-region localization as explicit mechanisms. These are described as additions that restore diversity and enable detection while preserving the marginal distribution. No equations or central claims reduce the distortion-free guarantee or detection performance to a fitted parameter defined in terms of the target outcome, nor to a self-citation chain that bears the load of the proof. The theoretical invariance is asserted from the construction itself rather than derived from prior self-referential results, and empirical evaluations on benchmarks and human studies stand as independent checks. This is the most common honest finding for a paper whose core contributions are algorithmic innovations rather than self-referential predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on standard properties of Gumbel-max sampling and assumes the new dual-key and localization components function as described without side effects; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption Gumbel-max sampling admits a dual-key modification that restores output diversity while preserving the watermark signal
    Invoked as the foundation for the core generation process described in the abstract.

pith-pipeline@v0.9.0 · 5736 in / 1346 out tokens · 37825 ms · 2026-05-22T09:52:31.810848+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

  1. [1]

    Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (AI Act),

  2. [2]

    Second draft published March 2026; enforcement of Article 50 obligations begins August 2,

  3. [3]

    Threebrickstoconsolidate watermarks for large language models.2023 IEEE International Workshop on Information Forensics and Security (WIFS),

    PierreFernandez, AntoineChaffin, KarimTit, VivienChappelier, andTeddyFuron. Threebrickstoconsolidate watermarks for large language models.2023 IEEE International Workshop on Information Forensics and Security (WIFS),

  4. [4]

    How good is post-hoc watermarking with language model rephrasing?arXiv preprint arXiv:2512.16904,

    Pierre Fernandez, Tom Sander, Hady Elsahar, Hongyan Chang, Tomáš Souček, Valeriu Lacatusu, Tuan Tran, Sylvestre-Alvise Rebuffi, and Alexandre Mourachko. How good is post-hoc watermarking with language model rephrasing?arXiv preprint arXiv:2512.16904,

  5. [5]

    Watermax: breaking the llm watermark detectability-robustness-quality trade-off.arXiv preprint arXiv:2403.04808,

    Eva Giboulot and Teddy Furon. Watermax: breaking the llm watermark detectability-robustness-quality trade-off.arXiv preprint arXiv:2403.04808,

  6. [6]

    Better & Faster Large Language Models via Multi-token Prediction

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction.arXiv preprint arXiv:2404.19737,

  7. [7]

    On the learnability of watermarks for language models.arXiv preprint arXiv:2312.04469,

    Chenchen Gu, Xiang Lisa Li, Percy Liang, and Tatsunori Hashimoto. On the learnability of watermarks for language models.arXiv preprint arXiv:2312.04469,

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  9. [9]

    The Curious Case of Neural Text Degeneration

    17 Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751,

  10. [10]

    Semstamp: A semantic watermark with paraphrastic robustness for text generation.arXiv preprint arXiv:2310.03991,

    Abe Bohan Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov. Semstamp: A semantic watermark with paraphrastic robustness for text generation.arXiv preprint arXiv:2310.03991,

  11. [11]

    k-semstamp: A clustering- based semantic watermark for detection of machine-generated text.arXiv preprint arXiv:2402.11399,

    Abe Bohan Hou, Jingyu Zhang, Yichen Wang, Daniel Khashabi, and Tianxing He. k-semstamp: A clustering- based semantic watermark for detection of machine-generated text.arXiv preprint arXiv:2402.11399,

  12. [12]

    A watermark for large language models.arXiv preprint arXiv:2301.10226, 2023a

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models.arXiv preprint arXiv:2301.10226, 2023a. John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for...

  13. [13]

    Waterfall: Framework for robust and scalable text watermarking

    Gregory Kang Ruey Lau, Xinyuan Niu, Hieu Dao, Jiangwei Chen, Chuan-Sheng Foo, and Bryan Kian Hsiang Low. Waterfall: Framework for robust and scalable text watermarking. InICML 2024 Workshop on Foun- dation Models in the Wild,

  14. [14]

    Who wrote this code? watermarking for code generation.arXiv preprint arXiv:2305.15060,

    Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, and Gunhee Kim. Who wrote this code? watermarking for code generation.arXiv preprint arXiv:2305.15060,

  15. [15]

    A semantic invariant robust watermark for large language models.arXiv preprint arXiv:2310.06356,

    Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, and Lijie Wen. A semantic invariant robust watermark for large language models.arXiv preprint arXiv:2310.06356,

  16. [16]

    Adaptive text watermark for large language models.arXiv preprint arXiv:2401.13927,

    Yepeng Liu and Yuheng Bu. Adaptive text watermark for large language models.arXiv preprint arXiv:2401.13927,

  17. [17]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332,

  18. [18]

    Markllm: An open-source toolkit for llm watermarking.arXiv preprint arXiv:2405.10051,

    Leyi Pan, Aiwei Liu, Zhiwei He, Zitian Gao, Xuandong Zhao, Yijian Lu, Binglin Zhou, Shuliang Liu, Xuming Hu, Lijie Wen, et al. Markllm: An open-source toolkit for llm watermarking.arXiv preprint arXiv:2405.10051,

  19. [19]

    Mark my words: Analyzing and evaluating language model watermarks.arXiv preprint arXiv:2312.00273,

    Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. Mark my words: Analyzing and evaluating language model watermarks.arXiv preprint arXiv:2312.00273,

  20. [20]

    Provably robust multi-bit watermarking for ai-generated text via error correction code.arXiv preprint arXiv:2401.16820,

    18 Wenjie Qu, Dong Yin, Zixin He, Wei Zou, Tianyang Tao, Jinyuan Jia, and Jiaheng Zhang. Provably robust multi-bit watermarking for ai-generated text via error correction code.arXiv preprint arXiv:2401.16820,

  21. [21]

    Detecting benchmark contamination through watermarking.arXiv preprint arXiv:2502.17259,

    Tom Sander, Pierre Fernandez, Saeed Mahloujifar, Alain Durmus, and Chuan Guo. Detecting benchmark contamination through watermarking.arXiv preprint arXiv:2502.17259,

  22. [22]

    Qwen2.5 technical report.arXiv preprint arXiv:2409.12117,

    Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2409.12117,

  23. [23]

    Natural language watermark- ing: Challenges in building a practical system

    Mercan Topkara, Giuseppe Riccardi, Dilek Hakkani-Tür, and Mikhail J Atallah. Natural language watermark- ing: Challenges in building a practical system. InSecurity, Steganography, and Watermarking of Multimedia Contents VIII, pages 106–117. SPIE, 2006a. Mercan Topkara, Umut Topkara, and Mikhail J Atallah. Words are not enough: sentence level natural langu...

  24. [24]

    Watermarking the outputs of structured prediction with an application in statistical machine translation

    Ashish Venugopal, Jakob Uszkoreit, David Talbot, Franz Josef Och, and Juri Ganitkevitch. Watermarking the outputs of structured prediction with an application in statistical machine translation. InProceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1363–1372,

  25. [25]

    Morphmark: Flexible adaptive watermarking for large language models.arXiv preprint arXiv:2505.11541,

    Zongqi Wang, Tianle Gu, Baoyuan Wu, and Yujiu Yang. Morphmark: Flexible adaptive watermarking for large language models.arXiv preprint arXiv:2505.11541,

  26. [26]

    Dipmark: A stealthy, efficient and resilient watermark for large language models.arXiv preprint arXiv:2310.07710,

    Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. Dipmark: A stealthy, efficient and resilient watermark for large language models.arXiv preprint arXiv:2310.07710,

  27. [27]

    Robust multi-bit text watermark with llm-based paraphrasers.arXiv preprint arXiv:2412.03123,

    Xiaojun Xu, Jinghan Jia, Yuanshun Yao, Yang Liu, and Hang Li. Robust multi-bit text watermark with llm-based paraphrasers.arXiv preprint arXiv:2412.03123,

  28. [28]

    Robust multi-bit natural language watermarking through invariant features.arXiv preprint arXiv:2305.01904,

    KiYoon Yoo, Wonhyuk Ahn, Jiho Jang, and Nojun Kwak. Robust multi-bit natural language watermarking through invariant features.arXiv preprint arXiv:2305.01904,

  29. [29]

    Advancing beyond identification: Multi-bit watermark for large language models

    KiYoon Yoo, Wonhyuk Ahn, and Nojun Kwak. Advancing beyond identification: Multi-bit watermark for large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association 19 for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4031–4055,

  30. [30]

    Leave no trace: Black- box detection of copyrighted dataset usage in large language models via watermarking.arXiv preprint arXiv:2510.02962,

    Jingqi Zhang, Ruibo Chen, Yingqing Yang, Peihua Mai, Heng Huang, and Yan Pang. Leave no trace: Black- box detection of copyrighted dataset usage in large language models via watermarking.arXiv preprint arXiv:2510.02962,

  31. [31]

    Permute-and-flip: An optimally robust and watermarkable decoder for llms.arXiv preprint arXiv:2402.05864,

    Xuandong Zhao, Lei Li, and Yu-Xiang Wang. Permute-and-flip: An optimally robust and watermarkable decoder for llms.arXiv preprint arXiv:2402.05864,

  32. [32]

    , wk)ofktoken IDs, and the secret keyK(all of them are integers), and outputs a random integer in[0, M)

    20 Appendix A More Technical Details on the Methods A.1 Hash Function Implementation The PRF takes as input the candidate tokenx, a context windoww= (w1, . . . , wk)ofktoken IDs, and the secret keyK(all of them are integers), and outputs a random integer in[0, M). We compute the hash as follows: h′(x,w, K) = p2 ·x+ kX i=1 wi ·q i +p 3 ·K ! ·p 4,(8) h(x,w,...

  33. [33]

    B Gumbel-max proofs The following results were presented by Aaronson and Kirchner (2023) and formalized by Fernandez et al

    restores single-sequence non-distortion by falling back to unwatermarked sampling on repeated contexts (Remark 1). B Gumbel-max proofs The following results were presented by Aaronson and Kirchner (2023) and formalized by Fernandez et al. (2023). Some elements of these proofs are used later, so we restate them here. An overview of the Gumbel-max generatio...

  34. [34]

    Using a Gaussian tail approximation, the logp-value of a Z-score islnp≈ −1 2 Z2

    Letδ= (µ w −µ 0)/σbe the per-token signal-to-noise ratio. Using a Gaussian tail approximation, the logp-value of a Z-score islnp≈ −1 2 Z2. We define∆2 =δ 2/2 as the expected logp-value accumulation rate per watermarked token. 34 Power of the Global Test.The global test evaluates allntokens. The expected Z-score is: Zglobal = ρnσδ σ√n =ρδ √n=⇒E[lnp global]...

  35. [35]

    Tie.” For the final analysis, “Both Good,

    to assess whether wa- termarking systematically affects script consistency or refusal rates. For script consistency, we observe 52 discordant pairs where WM was wrong but Non-WM was correct, versus 39 where Non-WM was wrong but WM was correct; with continuity correction, this yieldsχ2 = 1.58andp= 0.21. For refusal rates, we find 21 pairs where WM refused ...

  36. [36]

    Secret keys are calibrated per method via a Kolmogorov–Smirnov test to ensure uniform PRF hashes on unwatermarked text as done in Fernandez et al. (2025). The teacher generates 5,000 solutions using vLLM (Kwon et al.,

  37. [37]

    The loss is computed over the full teacher response (both the reasoning trace and the final answer) while the prompt tokens are masked out

    (rank 128, scaling factor 128, dropout 0.05) with learning rate2×10 −5 and 3 epochs. The loss is computed over the full teacher response (both the reasoning trace and the final answer) while the prompt tokens are masked out. Watermark Detection.We evaluate watermark transfer using theopen-modelradioactivity test of Sander et al. (2024, 2025). The test ope...

  38. [38]

    green-red list

    andwent i =f(H i) is a function of the local entropyHi at positioni, estimated via a single forward pass of the student model. Thep-value is computed via the moment-matched Gamma approximation of Equation 6, which accounts for the heterogeneous weights. Concave normalized-entropy transforms outperform linear/superlinear alternatives because they moderatel...

  39. [39]

    Semantic watermarks (Liu et al., 2023; Liu and Bu, 2024; Hou et al.,

    adaptively scales the green-red bias based on the natural green-list probability mass, reducing distortion in low-entropy contexts, but remains non-distortion-free since it still applies a logit bias. Semantic watermarks (Liu et al., 2023; Liu and Bu, 2024; Hou et al.,

  40. [40]

    Gumbel-max (Aaronson and Kirchner, 2023), Permute-and-Flip (Zhao et al., 2024), DiPMark (Wu et al.,

    require auxiliary semantic encoders at generation time, making them harder to deploy. Gumbel-max (Aaronson and Kirchner, 2023), Permute-and-Flip (Zhao et al., 2024), DiPMark (Wu et al.,

  41. [41]

    Toolkits have also been introduced to benchmark these methods (Piet et al., 2023; Pan et al., 2024)

    (multiple generations per query, impractical for production) are distortion-free. Toolkits have also been introduced to benchmark these methods (Piet et al., 2023; Pan et al., 2024). Recent large-scale evaluations (Fernandez et al.,

  42. [42]

    show that Gumbel-max and SynthID achieve the best detectability-quality Pareto frontier among all methods, strictly dominating DiPMark, green-red variants, and semantic watermarks. TextSealbuildsontheGumbel-maxframeworkbutintroducesdual-keygenerationfordiversity, entropy- weighted detection, and localized multi-region search—none of which are present in p...