SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
hub Canonical reference
Robust distortion- free watermarks for language models
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 5polarities
background 5representative citing papers
AI agents can conduct undetectable covert conversations using a new pseudorandom noise-resilient key exchange that works without shared keys and with only constant min-entropy in messages.
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
The authors give constructions for provably undetectable watermarking of constant-entropy LLM outputs that are robust to random substitutions (under subexponential LPN) and to substitutions plus random deletions (under an additional heuristic or pseudorandom ECC).
RLSpoofer trains a 4B model on 100 watermarked paraphrase pairs to spoof PF watermarks at 62% success rate, far exceeding baselines trained on up to 10,000 samples.
Two new constructions for multi-bit generative watermarking attain the established lower bound on miss-detection probability under worst-case false-alarm constraints, fully characterizing optimal performance via linear programming.
A survey of LLM copyright protection that unifies text watermarking, model watermarking, and model fingerprinting while presenting new coverage of fingerprint transfer and removal.
A topic-guided watermarking scheme partitions the LLM vocabulary into topic-aligned token subsets and green-lists relevant tokens based on the input prompt to embed detectable marks while preserving text quality and improving robustness to attacks.
The thesis presents a kernel method for multiaccuracy across overlooked subpopulations, information-theoretic optimal watermarking for LLMs, and a simulator showing LLM agents outperforming humans in supply chains while creating tail risks.
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
SteganoPrompt embeds a hidden instruction in assignment prompts via the Unicode Tags block so that LLMs add a detectable signature to responses when the prompt is pasted verbatim.
BREW achieves TPR of 0.965 and FPR of 0.02 under 10% synonym substitution by shifting from ECC decoding to designated verification with block voting and local validation.
ISTS watermarking dynamically controls injection based on prompt semantics and uses two-sided detection to resist removal and forgery attacks in diffusion models.
ArcMark is a multi-byte LLM watermark that achieves distortion-free embedding of several bytes per few hundred tokens by treating generation as a channel coding problem and using optimal transport to match distributions.
SWaRL trains code LLMs with RL using compiler correctness signals and a confidential verifier reward to embed robust, functionality-preserving watermarks that resist refactoring attacks.
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.
Develops a five-tier threat model for generative AI content, releases a 12000-item multi-modal benchmark with laundering tests, evaluates four schemes, and maps detection metrics to legal sufficiency thresholds for law of armed conflict, domestic procedure, and EU AI Act.
Derives matched converse and achievability bounds that characterize optimal trade-offs among false-alarm probability, detection error probability, distortion, and information rate for multi-bit watermarking of stationary ergodic stochastic processes.
LLM watermarking adoption is limited by misaligned stakeholder incentives; incentive-aligned approaches such as in-context watermarking can enable practical use in targeted domains like education and peer review.
citing papers explorer
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange
AI agents can conduct undetectable covert conversations using a new pseudorandom noise-resilient key exchange that works without shared keys and with only constant min-entropy in messages.
-
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
-
SWAN: Semantic Watermarking with Abstract Meaning Representation
SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
-
Can we Watermark Low-Entropy LLM Outputs?
The authors give constructions for provably undetectable watermarking of constant-entropy LLM outputs that are robust to random substitutions (under subexponential LPN) and to substitutions plus random deletions (under an additional heuristic or pseudorandom ECC).
-
RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience
RLSpoofer trains a 4B model on 100 watermarked paraphrase pairs to spoof PF watermarks at 62% success rate, far exceeding baselines trained on up to 10,000 samples.
-
Optimal Multi-bit Generative Watermarking Schemes Under Worst-Case False-Alarm Constraints
Two new constructions for multi-bit generative watermarking attain the established lower bound on miss-detection probability under worst-case false-alarm constraints, fully characterizing optimal performance via linear programming.
-
Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends
A survey of LLM copyright protection that unifies text watermarking, model watermarking, and model fingerprinting while presenting new coverage of fingerprint transfer and removal.
-
Topic-Based Watermarks for Large Language Models
A topic-guided watermarking scheme partitions the LLM vocabulary into topic-aligned token subsets and green-lists relevant tokens based on the input prompt to embed detectable marks while preserving text quality and improving robustness to attacks.
-
Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents
The thesis presents a kernel method for multiaccuracy across overlooked subpopulations, information-theoretic optimal watermarking for LLMs, and a simulator showing LLM agents outperforming humans in supply chains while creating tail risks.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Detecting Verbatim LLM Copy-Paste in Homework
SteganoPrompt embeds a hidden instruction in assignment prompts via the Unicode Tags block so that LLMs add a detectable signature to responses when the prompt is pasted verbatim.
-
Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking
BREW achieves TPR of 0.965 and FPR of 0.02 under 10% synonym substitution by shifting from ECC decoding to designated verification with block voting and local validation.
-
Towards Robust Content Watermarking Against Removal and Forgery Attacks
ISTS watermarking dynamically controls injection based on prompt semantics and uses two-sided detection to resist removal and forgery attacks in diffusion models.
-
ArcMark: Distortion-Free Multi-Byte LLM Watermark via Optimal Transport
ArcMark is a multi-byte LLM watermark that achieves distortion-free embedding of several bytes per few hundred tokens by treating generation as a channel coding problem and using optimal transport to match distributions.
-
SWaRL: Safeguard Code Watermarking via Reinforcement Learning
SWaRL trains code LLMs with RL using compiler correctness signals and a confidential verifier reward to embed robust, functionality-preserving watermarks that resist refactoring attacks.
-
Can AI-Generated Text be Reliably Detected?
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.
-
Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts
Develops a five-tier threat model for generative AI content, releases a 12000-item multi-modal benchmark with laundering tests, evaluates four schemes, and maps detection metrics to legal sufficiency thresholds for law of armed conflict, domestic procedure, and EU AI Act.
-
Fundamental Trade-Offs in Multi-Bit Watermarking of Stochastic Processes
Derives matched converse and achievability bounds that characterize optimal trade-offs among false-alarm probability, detection error probability, distortion, and information rate for multi-bit watermarking of stationary ergodic stochastic processes.
-
Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption
LLM watermarking adoption is limited by misaligned stakeholder incentives; incentive-aligned approaches such as in-context watermarking can enable practical use in targeted domains like education and peer review.