SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
A semantic invariant robust watermark for large language models.arXiv preprint arXiv:2310.06356
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
Context-Fidelity Boosting reduces faithfulness hallucinations by applying context-based logit boosts to source-supported tokens during LLM decoding.
A topic-guided watermarking scheme partitions the LLM vocabulary into topic-aligned token subsets and green-lists relevant tokens based on the input prompt to embed detectable marks while preserving text quality and improving robustness to attacks.
TextSeal provides a localized, distortion-free LLM watermark that outperforms baselines in detection strength, remains effective in mixed human-AI text, preserves model performance, and transfers through distillation for provenance tracking.
citing papers explorer
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
-
SWAN: Semantic Watermarking with Abstract Meaning Representation
SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
-
Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding
Context-Fidelity Boosting reduces faithfulness hallucinations by applying context-based logit boosts to source-supported tokens during LLM decoding.
-
Topic-Based Watermarks for Large Language Models
A topic-guided watermarking scheme partitions the LLM vocabulary into topic-aligned token subsets and green-lists relevant tokens based on the input prompt to embed detectable marks while preserving text quality and improving robustness to attacks.
-
TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection
TextSeal provides a localized, distortion-free LLM watermark that outperforms baselines in detection strength, remains effective in mixed human-AI text, preserves model performance, and transfers through distillation for provenance tracking.