Unbiased Watermark for Large Language Models

Heng Huang; Hongyang Zhang; Lichang Chen; Xidong Wu; Yihan Wu; Zhengmian Hu

arxiv: 2310.10669 · v2 · pith:SSMGYGNCnew · submitted 2023-09-22 · 💻 cs.CR

Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, Heng Huang

classification 💻 cs.CR

keywords watermarks · model · watermark · language · llms · output · outputs · quality

Abstract

Recent advances in large language models (LLMs) have raised growing concern about their potential misuse. One approach to mitigating this risk is to incorporate watermarking techniques into LLMs, allowing model outputs to be tracked and attributed. This study examines a crucial aspect of watermarking: how significantly watermarks affect the quality of model-generated outputs. Previous studies have suggested a trade-off between watermark strength and output quality. However, our research demonstrates that, with appropriate implementation, watermarks can be integrated without affecting the output probability distribution. We refer to this type of watermark as an unbiased watermark. This has significant implications for the use of LLMs, as it becomes impossible for users to discern whether a service provider has incorporated watermarks. Furthermore, the presence of watermarks does not compromise the performance of the model in downstream tasks, so the overall utility of the language model is preserved. Our findings contribute to the ongoing discussion around responsible AI development, suggesting that unbiased watermarks can serve as an effective means of tracking and attributing model outputs without sacrificing output quality.
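The abstract states the unbiasedness property without describing a construction. As a rough illustration only, the sketch below shows one simple way a keyed sampler can leave the output distribution unchanged: inverse-transform sampling driven by a pseudo-random number derived from a secret key and the context. This is an assumption made for illustration, not a reproduction of the authors' method; the names `keyed_uniform`, `watermarked_sample`, `empirical_distribution`, and the key `b"hypothetical-key"` are all hypothetical.

```python
import hashlib


def keyed_uniform(secret_key: bytes, context: tuple) -> float:
    """Derive a pseudo-random number in [0, 1) from a secret key and a context.

    Hypothetical helper: SHA-256 stands in for whatever keyed function
    a real scheme would use.
    """
    digest = hashlib.sha256(secret_key + repr(context).encode()).digest()
    # Keep 53 bits so the result is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def watermarked_sample(probs: list, u: float) -> int:
    """Inverse-transform sampling: first index whose cumulative mass exceeds u."""
    cumulative = 0.0
    for token, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return token
    return len(probs) - 1  # guard against floating-point round-off


def empirical_distribution(probs: list, secret_key: bytes, n_contexts: int) -> list:
    """Sample once per distinct context and return observed token frequencies."""
    counts = [0] * len(probs)
    for i in range(n_contexts):
        u = keyed_uniform(secret_key, ("ctx", i))
        counts[watermarked_sample(probs, u)] += 1
    return [c / n_contexts for c in counts]


# Toy next-token distribution over three tokens (illustrative values).
probs = [0.5, 0.3, 0.2]
freqs = empirical_distribution(probs, b"hypothetical-key", 20000)
print(freqs)  # frequencies should lie close to probs
```

For a fixed key and context the chosen token is deterministic, which is what a detector holding the key could exploit. Averaged over contexts, the token frequencies track `probs`, which is the sense in which such a sampler leaves the output distribution unchanged for anyone without the key.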

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
cs.CR · 2025-09 · conditional · novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.

Optimal Multi-bit Generative Watermarking Schemes Under Worst-Case False-Alarm Constraints
cs.IT · 2026-04 · unverdicted · novelty 7.0

Two new constructions for multi-bit generative watermarking attain the established lower bound on miss-detection probability under worst-case false-alarm constraints, fully characterizing optimal performance via linea...

Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends
cs.CR · 2025-08 · accept · novelty 7.0

A survey of LLM copyright protection that unifies text watermarking, model watermarking, and model fingerprinting while presenting new coverage of fingerprint transfer and removal.

SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness
cs.CR · 2026-05 · unverdicted · novelty 6.0

SAMark uses self-anchored semantic green regions, multi-channel hyperbolic scoring, and diversity-aware filtering to reach 90.2% TP@FP1% detection under paragraph paraphrasing while preserving text quality.

Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption
cs.CR · 2025-10 · unverdicted · novelty 4.0

LLM watermarking adoption is limited by misaligned stakeholder incentives; incentive-aligned approaches such as in-context watermarking can enable practical use in targeted domains like education and peer review.