pith. sign in

arxiv: 2603.14968 · v2 · submitted 2026-03-16 · 💻 cs.CR · cs.CL

Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework

Pith reviewed 2026-05-15 10:49 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords LLM watermarkingblack-box detectionthird-party verificationproxy modelrelative hypothesis testingAI provenancemodel auditing
0
0 comments X

The pith

TTP-Detect decouples LLM watermark detection from injection, allowing third-party verification in black-box settings via a proxy model and relative measurements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that watermark verification need not require secret keys or provider-side access, which currently blocks independent auditing of LLM outputs. It reframes the task as a relative hypothesis test: a proxy model amplifies any watermark signals present in the query text, and a set of complementary measurements checks how well the text aligns with watermarked versus unwatermarked distributions. If this works, governance bodies and auditors could verify provenance claims without compromising the security of the original watermarking scheme. The approach is tested across multiple representative schemes, datasets, and models, showing strong performance and resistance to common attacks.

Core claim

TTP-Detect reframes verification as a relative hypothesis testing problem. It employs a proxy model to amplify watermark-relevant signals and a suite of complementary relative measurements to assess the alignment of the query text with watermarked distributions, achieving detection without access to the original scheme or keys.

What carries the argument

A proxy model that amplifies watermark-relevant signals, used inside a relative hypothesis test that compares alignment with watermarked versus unwatermarked distributions.

If this is right

  • Independent auditors can verify LLM provenance claims without needing provider cooperation or secret keys.
  • Detection becomes possible across different watermarking schemes using the same verification pipeline.
  • Robustness to removal or distortion attacks increases because verification relies on relative rather than absolute signals.
  • Service providers can adopt watermarking without fear that detection logic exposes their injection method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardized third-party audit protocols could emerge if proxy models are shared or certified for specific model families.
  • The relative-measurement approach might extend to detecting other statistical artifacts in generated text beyond watermarks.
  • Adoption could reduce reliance on opaque provider self-reporting for content provenance.

Load-bearing premise

A proxy model can sufficiently amplify watermark-relevant signals to enable reliable distinction via relative measurements in black-box settings without access to the original scheme or keys.

What would settle it

Run TTP-Detect on watermarked text generated by a model whose token distribution differs sharply from the chosen proxy; if detection accuracy falls to random levels while the same proxy succeeds on a matched model, the amplification step fails.

read the original abstract

While watermarking serves as a critical mechanism for LLM provenance, existing secret-key schemes tightly couple detection with injection, requiring access to keys or provider-side scheme-specific detectors for verification. This dependency creates a fundamental barrier for real-world governance, as independent auditing becomes impossible without compromising model security or relying on the opaque claims of service providers. To resolve this dilemma, we introduce TTP-Detect, a pioneering black-box framework designed for non-intrusive, third-party watermark verification. By decoupling detection from injection, TTP-Detect reframes verification as a relative hypothesis testing problem. It employs a proxy model to amplify watermark-relevant signals and a suite of complementary relative measurements to assess the alignment of the query text with watermarked distributions. Extensive experiments across representative watermarking schemes, datasets and models demonstrate that TTP-Detect achieves superior detection performance and robustness against diverse attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes TTP-Detect, a black-box third-party framework for LLM watermark detection that decouples verification from the original injection scheme and keys. It reframes the task as relative hypothesis testing: a proxy model amplifies watermark-relevant signals, after which a suite of complementary relative measurements assesses alignment between query text and watermarked distributions. The central claim is that this yields superior detection performance and robustness to attacks across representative schemes, datasets, and models.

Significance. If the empirical claims hold under rigorous validation, the work would enable independent auditing of LLM provenance without compromising model security or relying on opaque provider-side detectors. This directly addresses a practical barrier in real-world governance and could support regulatory or forensic applications where access to secret keys is unavailable.

major comments (3)
  1. [Abstract] Abstract: the claim of 'superior detection performance' from 'extensive experiments' is unsupported by any quantitative metrics, baselines, error bars, or statistical tests. Without these, the central empirical assertion cannot be evaluated.
  2. [Method] Method section (proxy amplification step): the framework requires that an independently chosen proxy sufficiently amplifies watermark signals via distributional alignment. No analysis or ablation is provided for proxy-target mismatch (different architecture, size, or training data), which the skeptic note identifies as the weakest assumption; under mismatch the method reduces to ordinary black-box perplexity comparison with no guaranteed advantage.
  3. [Experiments] Experiments: the abstract asserts robustness 'across representative watermarking schemes, datasets and models' yet supplies no concrete numbers, attack descriptions, or comparison tables. This prevents assessment of whether the data actually supports the superiority and robustness claims.
minor comments (1)
  1. [Method] Clarify the exact form of the 'complementary relative measurements' (e.g., which statistical tests or distance metrics are used) and whether they are parameter-free.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and completeness of our presentation. We address each major comment below and have revised the manuscript to incorporate additional details and analyses.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'superior detection performance' from 'extensive experiments' is unsupported by any quantitative metrics, baselines, error bars, or statistical tests. Without these, the central empirical assertion cannot be evaluated.

    Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript reports these details in the Experiments section, including AUC scores, F1 metrics, baseline comparisons, error bars from repeated runs, and statistical significance tests. We have revised the abstract to summarize key results, such as detection performance gains and robustness metrics. revision: yes

  2. Referee: [Method] Method section (proxy amplification step): the framework requires that an independently chosen proxy sufficiently amplifies watermark signals via distributional alignment. No analysis or ablation is provided for proxy-target mismatch (different architecture, size, or training data), which the skeptic note identifies as the weakest assumption; under mismatch the method reduces to ordinary black-box perplexity comparison with no guaranteed advantage.

    Authors: This is a valid observation. The original manuscript does not contain a dedicated ablation study on proxy mismatch. We have added such an analysis to the revised Method section, including experiments with mismatched architectures and sizes, demonstrating that the relative measurements retain an advantage over plain perplexity in moderate mismatch cases while providing guidance on proxy selection. revision: yes

  3. Referee: [Experiments] Experiments: the abstract asserts robustness 'across representative watermarking schemes, datasets and models' yet supplies no concrete numbers, attack descriptions, or comparison tables. This prevents assessment of whether the data actually supports the superiority and robustness claims.

    Authors: We acknowledge that the abstract lacks these specifics. The Experiments section of the full manuscript provides the requested concrete numbers, attack descriptions (including paraphrasing and token-level perturbations), and comparison tables across schemes, datasets, and models. We have updated the abstract to reference these results explicitly and added a high-level summary table for accessibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework is defined independently of fitted parameters or self-citations.

full rationale

The paper defines TTP-Detect explicitly as a new black-box framework that decouples detection from any specific injection scheme or key by using an external proxy model for signal amplification followed by relative hypothesis testing. No equations or steps reduce by construction to prior fitted values, self-citations, or renamed empirical patterns. The central claims rest on experimental comparisons across independent watermarking schemes, datasets, and models rather than on any internal redefinition or load-bearing self-reference. This satisfies the criteria for a self-contained derivation with no circular reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the proxy model's ability to amplify signals and the validity of relative measurements, both introduced as new components without external benchmarks or independent evidence in the abstract.

free parameters (1)
  • Proxy model selection and configuration
    Choice of proxy model to amplify signals is central and may involve tuning not detailed in abstract.
axioms (1)
  • domain assumption Watermark signals exist and can be amplified by an external proxy model without knowledge of the injection scheme
    Invoked as the basis for reframing detection as relative hypothesis testing.
invented entities (1)
  • TTP-Detect framework no independent evidence
    purpose: Non-intrusive third-party watermark verification in black-box settings
    Newly introduced approach decoupling detection from injection.

pith-pipeline@v0.9.0 · 5460 in / 1310 out tokens · 48624 ms · 2026-05-15T10:49:53.266233+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.