arXiv preprint arXiv:2306.04634
16 Pith papers cite this work. Polarity classification is still indexing.
citing papers
- RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
  RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
- Dataset Watermarking for Closed LLMs with Provable Detection
  A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.
- SWAN: Semantic Watermarking with Abstract Meaning Representation
  SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
- Optimal Multi-bit Generative Watermarking Schemes Under Worst-Case False-Alarm Constraints
  Two new constructions for multi-bit generative watermarking attain the established lower bound on miss-detection probability under worst-case false-alarm constraints, fully characterizing optimal performance via linear programming.
- CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
  CSF is the first black-box method to attribute fine-tuned text-to-image models to original lineages via compositional semantic probes and Bayesian decisions across multiple model families.
- Watermarking Should Be Treated as a Monitoring Primitive
  Watermarking enables entity-level attribution and monitoring through signal aggregation even in zero-bit designs, creating an unavoidable dual-use tension between attribution and surveillance.
- Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents
  The thesis presents a kernel method for multiaccuracy across overlooked subpopulations, information-theoretic optimal watermarking for LLMs, and a simulator showing LLM agents outperforming humans in supply chains while creating tail risks.
- BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models
  BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
- Towards Robust Content Watermarking Against Removal and Forgery Attacks
  ISTS watermarking dynamically controls injection based on prompt semantics and uses two-sided detection to resist removal and forgery attacks in diffusion models.
- ArcMark: Distortion-Free Multi-Byte LLM Watermark via Optimal Transport
  ArcMark is a multi-byte LLM watermark that achieves distortion-free embedding of several bytes per few hundred tokens by treating generation as a channel coding problem and using optimal transport to match distributions.
- Whispers in the Machine: Confidentiality in Agentic Systems
  Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models
  Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
- "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
  Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
- Re-Triggering Safeguards within LLMs for Jailbreak Detection
  Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
- Mitigating Watermark Forgery in Generative Models via Randomized Key Selection
  Randomized per-query key selection with single-key detection acceptance bounds forgery success rate independently of collected samples while preserving model utility.
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
  Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.