RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

Hanbo Huang; Hao Zheng; Lin Liu; Shiyu Liang; Xuan Gong; Yihan Li; Yiran Zhang; Zhuotao Liu

arxiv: 2509.20924 · v2 · pith:U74HG3VYnew · submitted 2025-09-25 · 💻 cs.CR

Hanbo Huang, Yiran Zhang, Hao Zheng, Xuan Gong, Yihan Li, Lin Liu, Zhuotao Liu, Shiyu Liang

Pith reviewed 2026-05-18 14:20 UTC · model grok-4.3

classification 💻 cs.CR

keywords LLM watermarking · adaptive attacks · reinforcement learning · robustness evaluation · paraphrase attacks · vulnerability assessment · AI content detection

The pith

Reinforcement learning lets a 3B model remove LLM watermarks at 98.5 percent success after training on 100 samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that evaluations of LLM watermarks have used insufficiently strong attacks, causing overstated robustness results. It introduces the adaptive robustness radius to measure the smallest distortion an optimal adaptive adversary can apply to erase the watermark. By mapping possible paraphrases into a KL-divergence ball, the authors show that targeted optimization of attack context and parameters shrinks this radius substantially. They then implement RLCracker, a reinforcement learning attacker that trains on 100 short samples to let a 3-billion-parameter model erase watermarks from 1500-token texts at 98.5 percent success while keeping semantic shift small. This performance beats GPT-4o and holds across five model sizes and ten watermarking schemes.

Core claim

We introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of watermarks against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and theoretically demonstrate that optimizing the attack context and model parameters can significantly reduce the approximate radius, making watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning based adaptive attack that erases watermark signals with limited watermarked examples and limited access to the detector.
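The text above does not spell out the reward that RLCracker optimizes. As a hypothetical illustration only, an RL paraphrase attack with limited detector access could score each paraphrase by combining a binary evasion signal with a semantic-similarity term. The function name, the weights `w_evade` and `w_sim`, and the additive form are all assumptions made for this sketch, not the authors' objective.

```python
def attack_reward(detector_flagged: bool,
                  similarity: float,
                  w_evade: float = 1.0,
                  w_sim: float = 1.0) -> float:
    """Hypothetical scalar reward for one paraphrase (not the paper's reward).

    detector_flagged: True if the watermark detector still flags the text.
    similarity: semantic similarity to the watermarked source, in [0, 1].
    w_evade, w_sim: illustrative weights trading evasion against fidelity.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    # Only a yes/no answer is needed from the detector, which matches the
    # limited-access setting described in the text.
    evasion = 0.0 if detector_flagged else 1.0
    return w_evade * evasion + w_sim * similarity
```

A weighted sum keeps the trade-off explicit: raising `w_sim` penalizes paraphrases that drift from the source, while raising `w_evade` favors evasion even at some cost in fidelity.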

What carries the argument

Adaptive robustness radius approximated by lifting the paraphrase space into a KL-divergence ball

If this is right

Watermark removal becomes practical with small training sets and modest model sizes.
Robustness claims based on non-adaptive attacks fail against optimized RL methods.
The vulnerability appears across multiple watermarking schemes and model scales.
Effective attacks remain possible even with limited detector access.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Watermark designers may need to build resistance to optimization-based attacks into their theoretical models.
The same RL approach could serve as a general test for other AI-generated content detectors.
Long-term reliance on watermarking for provenance may require additional complementary defenses.

Load-bearing premise

The KL-divergence ball approximation faithfully captures the capabilities of a worst-case adaptive adversary using real paraphrases.

What would settle it

Run RLCracker on a previously untested watermarking scheme and check whether removal success stays above 90 percent while semantic similarity remains above 0.95.
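The check described above can be written down as a small pass/fail function. This is a minimal sketch: the `AttackOutcome` record, the use of the mean similarity (rather than a per-text minimum), and the strict inequalities are assumptions, and the similarity score is taken as given from whatever metric the evaluator chooses.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AttackOutcome:
    evaded: bool       # True if the detector no longer flags the paraphrase
    similarity: float  # semantic similarity to the watermarked source, in [0, 1]


def settles_claim(outcomes: Sequence[AttackOutcome],
                  min_success: float = 0.90,
                  min_similarity: float = 0.95) -> bool:
    """Return True if removal success and mean similarity both exceed the thresholds."""
    if not outcomes:
        raise ValueError("need at least one attacked text")
    success_rate = sum(o.evaded for o in outcomes) / len(outcomes)
    mean_similarity = sum(o.similarity for o in outcomes) / len(outcomes)
    return success_rate > min_success and mean_similarity > min_similarity
```

For example, 19 evasions out of 20 texts at similarity 0.97 would pass, while 8 out of 10 would not, regardless of similarity.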

Figures

Figures reproduced from arXiv: 2509.20924 by Hanbo Huang, Hao Zheng, Lin Liu, Shiyu Liang, Xuan Gong, Yihan Li, Yiran Zhang, Zhuotao Liu.

**Figure 2.** Pass@20 on EWD. While direct black-box optimization of attack context c is intractable, Theorem 1 establishes that multi-sample strategies like Pass@k sampling are a principled attack paradigm (see Appendix B.1). To empirically validate the effectiveness of this approach, we conduct a Pass@20 attack on the EWD watermark using Qwen2.5-3B-Instruct. As shown in …

**Figure 3.** ESR and P-SP variation across user prompts (UserP.), with and without system prompt (SysP.). System prompts are an overlooked adversarial tool: simple in design but powerful in effect. While prior work has shown that user prompts can impact watermark evasion (Kirchenbauer et al., 2023b;a), evaluations have largely focused on user input variation, overlooking the broader influence of system-level instruc…

**Figure 4.** (a) shows the effectiveness of test-time scaling on watermark removal using Qwen3-8B; (b) …

**Figure 5.** Sensitivity of RLCracker to Weight Settings under Qwen-3-4B for EWD Watermark

**Figure 6.** (a) shows test-time scaling on Qwen3-8B, (b) shows remove rate on Qwen3-8B, (c) shows …
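The Pass@k setting named in the Figure 2 caption is usually read as: an attack succeeds if at least one of k sampled paraphrases evades the detector. A minimal sketch of how such a rate can be estimated from n samples per text, of which c evade, uses the standard unbiased combinatorial estimator from the code-generation evaluation literature. Whether the paper computes Pass@20 this way or by drawing exactly 20 samples is not stated here, so treat the function as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples evades), given c evasions among n samples.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no successful paraphrase.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n equal to k (for instance 20 samples and k = 20), the estimate reduces to 1.0 if any sample evades and 0.0 otherwise.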

read the original abstract

Large language model (LLM) watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating the security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of watermarks against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and theoretically demonstrate that optimizing the attack context and model parameters can significantly reduce the approximate radius, making watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermark signals with limited watermarked examples and limited access to the detector. Despite weak supervision, it empowers a 3B model to achieve 98.5% removal success with minimal semantic shift on 1,500-token Unigram-marked texts after training on only 100 short samples. This performance dramatically exceeds 6.75% by GPT-4o and generalizes across five model sizes over ten watermarking schemes. Our code is available at https://github.com/OTT0-OTO/RLCracker.
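To make "erasing the watermark signal" concrete: green-list watermarks in the style of Kirchenbauer et al. (2023a) are typically detected with a one-proportion z-test on the number of green tokens. The sketch below shows that test in isolation. The green fraction `gamma = 0.5` and the threshold of 4.0 are illustrative assumptions, and the detectors actually used for the ten schemes in the paper may differ.

```python
from math import sqrt


def green_z_score(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """z-statistic for observing green_count green tokens out of total_tokens.

    Under the no-watermark null hypothesis each token is green with
    probability gamma, so the count is Binomial(total_tokens, gamma).
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    expected = gamma * total_tokens
    std = sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def is_flagged(green_count: int, total_tokens: int,
               gamma: float = 0.5, threshold: float = 4.0) -> bool:
    """Flag the text as watermarked when the z-score exceeds the threshold."""
    return green_z_score(green_count, total_tokens, gamma) > threshold
```

A paraphrase attack succeeds against a detector of this kind when it pulls the green-token count back toward `gamma * total_tokens`, so that the z-score falls below the threshold.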

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RLCracker shows a small RL model can remove watermarks at 98.5% after 100 samples, beating GPT-4o, but the KL-ball theory lacks direct verification.

The main takeaway is that a small RL-trained model can remove LLM watermarks at 98.5% success on long texts after training on just 100 short samples, far better than GPT-4o's 6.75%. This suggests current watermark evaluations miss adaptive attacks.

What is new here is the adaptive robustness radius, defined by lifting paraphrases into a KL-divergence ball, plus the RL procedure that optimizes attack context and parameters to shrink that radius. The experiments show that this works across ten watermarking schemes and five model sizes, with held-out test texts and minimal semantic change. Because the code is public, the results can be checked independently. The empirical side holds up, with clear success rates and generalization.

The softer spot is the theoretical reduction claim. The paper approximates the worst-case radius via the KL ball and says optimization inside it reduces the radius, but it does not verify that the RL policy's outputs stay inside the ball or that the ball covers the relevant semantic-preserving paraphrases. If the attacks fall outside this approximation, the 98.5% success does not directly confirm the formal argument.

This paper is for researchers in LLM security and content detection. Readers working on watermark robustness will find the attack method and numbers useful. It deserves a serious referee because the new attack is concrete, the results are strong, and the code allows the claims to be checked. I would send it to peer review. The practical attack stands on its own even if the theory needs more checks on the approximation.

Referee Report

1 major / 2 minor

Summary. The paper argues that prior evaluations of LLM watermarks are insufficiently adversarial. It introduces the adaptive robustness radius as a formal metric for worst-case resilience, approximates it by lifting the paraphrase space into a KL-divergence ball, and claims that optimizing attack context and model parameters provably reduces this approximate radius. The authors present RLCracker, an RL-based adaptive attack that, after training on only 100 short samples, allows a 3B model to achieve 98.5% watermark removal success on 1,500-token Unigram-marked texts with minimal semantic shift, far exceeding GPT-4o's 6.75% and generalizing across five model sizes and ten watermarking schemes. Code is released at https://github.com/OTT0-OTO/RLCracker.

Significance. If the results hold, the work is significant for LLM security research. It supplies both a formal metric and a practical, low-resource RL attack demonstrating concrete vulnerabilities, supported by clear empirical success rates and cross-scheme generalization. The open-source code is a strength that aids reproducibility and allows independent verification of the reported attack performance.

major comments (1)

[Theoretical analysis of adaptive robustness radius and KL-ball approximation] The lifting of the paraphrase space into a KL-divergence ball to approximate the adaptive robustness radius and the claim that optimization inside this ball reduces the radius (as stated in the abstract and theoretical analysis) is load-bearing for the central argument. The manuscript asserts rather than derives bounds on approximation tightness and provides no verification that the learned RL policy outputs remain inside the ball or that the ball contains the relevant worst-case semantic-preserving paraphrases. Without this, the 98.5% empirical removal success does not directly confirm the theoretical reduction.

minor comments (2)

[Abstract] The abstract refers to 'minimal semantic shift' without naming the concrete metric (e.g., cosine similarity on embeddings or BLEU score); adding this detail would improve clarity and reproducibility.
[Experimental setup] The description of the 100 training samples (their length distribution and selection process) relative to the 1,500-token evaluation texts could be expanded to better support the generalization claim.
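The first minor comment mentions cosine similarity on embeddings as one candidate metric for semantic shift. A minimal, dependency-free sketch of that computation follows. The embedding vectors are assumed to come from some sentence encoder chosen by the evaluator, which is outside this snippet, and nothing here implies this is the metric the paper uses.

```python
from math import sqrt
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Naming the metric matters because different similarity scores are not interchangeable: a threshold that is strict for one encoder can be lenient for another.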

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for emphasizing the need for rigorous theoretical grounding. We address the single major comment below and agree that additional clarification and verification would strengthen the connection between the adaptive robustness radius and the empirical results. We plan to revise the manuscript accordingly.

Referee: The lifting of the paraphrase space into a KL-divergence ball to approximate the adaptive robustness radius and the claim that optimization inside this ball reduces the radius (as stated in the abstract and theoretical analysis) is load-bearing for the central argument. The manuscript asserts rather than derives bounds on approximation tightness and provides no verification that the learned RL policy outputs remain inside the ball or that the ball contains the relevant worst-case semantic-preserving paraphrases. Without this, the 98.5% empirical removal success does not directly confirm the theoretical reduction.

Authors: We appreciate this observation. The manuscript defines the adaptive robustness radius as the minimal perturbation radius needed to erase the watermark under an adaptive adversary and approximates the space of semantic-preserving paraphrases via a KL-divergence ball centered on the original text distribution. Within this ball, we show that jointly optimizing attack context and model parameters yields a policy whose effective radius is smaller than that of non-adaptive baselines, because the learned policy identifies watermark-removing transformations that remain distributionally close. However, we acknowledge that the paper does not derive quantitative bounds on the tightness of the KL-ball approximation to the true paraphrase space, nor does it include post-training verification that RL-generated outputs lie inside the ball. The reported low semantic drift (high BERTScore, low perplexity change) provides indirect support that the attacks are semantically plausible, but this is not equivalent to a KL-membership check. In revision we will (1) explicitly state the heuristic nature of the approximation, (2) add a dedicated limitations subsection discussing the absence of tightness bounds, and (3) include new experiments that compute the empirical KL divergence of the attacked texts relative to the original distribution to confirm they remain within the modeled ball. These changes will make the theoretical-empirical linkage more transparent without altering the core empirical claims.

revision: yes
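The KL-membership check discussed in this exchange could be approximated from per-sample log-probabilities. The sketch below uses the sample-based estimators described in Schulman's note on approximating KL divergence (reference [30]): with samples drawn from q and r = p(x)/q(x), the naive estimator averages -log r and the lower-variance estimator averages (r - 1) - log r. Which distribution plays the role of q, the direction of the divergence, and the `radius` value are assumptions here, since the source does not specify them.

```python
from math import exp
from typing import Sequence, Tuple


def kl_estimates(logp_q: Sequence[float],
                 logp_p: Sequence[float]) -> Tuple[float, float]:
    """Monte Carlo estimates (k1, k3) of KL(q || p) from samples x_i ~ q.

    logp_q[i] = log q(x_i), e.g. sequence log-prob under the attack policy.
    logp_p[i] = log p(x_i), e.g. sequence log-prob under the reference model.
    """
    if len(logp_q) != len(logp_p) or len(logp_q) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    log_r = [lp - lq for lq, lp in zip(logp_q, logp_p)]  # log(p/q) per sample
    n = len(log_r)
    k1 = sum(-lr for lr in log_r) / n                    # unbiased, high variance
    k3 = sum((exp(lr) - 1.0) - lr for lr in log_r) / n   # unbiased, non-negative
    return k1, k3


def within_kl_ball(logp_q: Sequence[float],
                   logp_p: Sequence[float],
                   radius: float) -> bool:
    """Check the k3 estimate against a hypothetical ball radius."""
    return kl_estimates(logp_q, logp_p)[1] <= radius
```

An estimate well above the modeled radius would indicate that the learned policy operates outside the region the theory covers, which is the gap the referee identifies.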

Circularity Check

0 steps flagged

Derivation self-contained; KL-ball approximation external to empirical attack results

The paper defines the adaptive robustness radius as a formal worst-case metric, approximates it by lifting paraphrases into a KL-divergence ball, and separately demonstrates a theoretical reduction under optimization inside that ball. RLCracker is then presented as an RL policy realizing attacks in practice, with results reported on held-out test texts (1,500-token samples) after training on 100 short examples. No equation or claim reduces the reported 98.5% removal success or the radius reduction to a fitted parameter or self-citation by construction. The KL-ball construction is introduced as an external modeling choice rather than derived from the attack success metric itself, and the empirical evaluation uses standard held-out generalization checks. This yields only minor self-citation risk at most and keeps the central claims independent of the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claims rest on the modeling choice that a KL-divergence ball adequately represents the space of meaning-preserving paraphrases and on the assumption that limited-access RL optimization can discover near-optimal attacks without full detector gradients.

free parameters (1)

training sample count
Limited to 100 short samples as weak supervision; the number is chosen to demonstrate data efficiency rather than derived from first principles.

axioms (1)

domain assumption: The paraphrase space can be lifted into a KL-divergence ball that approximates worst-case adaptive attacks
Invoked to define and approximate the adaptive robustness radius in the theoretical section.

invented entities (1)

adaptive robustness radius (no independent evidence)
purpose: Formal metric quantifying worst-case watermark resilience against adaptive adversaries
Newly introduced quantity whose value is approximated via the KL-ball optimization.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience
cs.CR 2026-04 unverdicted novelty 7.0

RLSpoofer trains a 4B model on 100 watermarked paraphrase pairs to spoof PF watermarks at 62% success rate, far exceeding baselines trained on up to 10,000 samples.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1] Yixin Cheng, Hongcheng Guo, Yangming Li, and Leonid Sigal. Revealing weaknesses in text watermarking through self-information rewrite attacks. arXiv preprint arXiv:2505.05190, 2025.
[2] Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. In The Thirty Seventh Annual Conference on Learning Theory, pp. 1125-1139. PMLR, 2024.
[3] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, 2019.
[4] Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, et al. Scalable watermarking for identifying large language model outputs. Nature, 634(8035):818-823, 2024.
[5] John Duchi and Hongseok Namkoong. Distributionally robust losses for latent covariate mixtures. arXiv preprint arXiv:1906.08764, 2019.
[6] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[7] Zhiwei He, Binglin Zhou, Hongkun Hao, Aiwei Liu, Xing Wang, Zhaopeng Tu, Zhuosheng Zhang, and Rui Wang. Can watermarks survive translation? On the cross-lingual consistency of text watermark for large language models. arXiv preprint arXiv:2402.14007, 2024.
[8] Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. Unbiased watermark for large language models. arXiv preprint arXiv:2310.10669, 2023.
[9] Baizhou Huang, Xiao Pu, and Xiaojun Wan. B^4: A black-box scrubbing attack on LLM watermarks. arXiv preprint arXiv:2411.01222, 2024.
[10] Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1, January 2025. URL https://github.com/huggingface/open-r1.
[11] Nikola Jovanović, Robin Staab, and Martin Vechev. Watermark stealing in large language models. arXiv preprint arXiv:2402.19361, 2024.
[12] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pp. 17061-17084. PMLR, 2023a.
[13] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634, 2023b.
[14] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36:27469-27500, 2023a.
[15] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36:27469-27500, 2023b.
[16] Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. arXiv preprint arXiv:2307.15593, 2023.
[17] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.
[18] Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, and Gunhee Kim. Who wrote this code? Watermarking for code generation. arXiv preprint arXiv:2305.15060, 2023.
[19] Aiwei Liu, Leyi Pan, Xuming Hu, Shu'ang Li, Lijie Wen, Irwin King, and Philip S Yu. An unforgeable publicly verifiable watermark for large language models. arXiv preprint arXiv:2307.16230, 2023a.
[20] Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, and Lijie Wen. A semantic invariant robust watermark for large language models. arXiv preprint arXiv:2310.06356, 2023b.
[21] Aiwei Liu, Sheng Guan, Yiming Liu, Leyi Pan, Yifei Zhang, Liancheng Fang, Lijie Wen, Philip S Yu, and Xuming Hu. Can watermarked LLMs be identified by users via crafted prompts? arXiv preprint arXiv:2410.03168, 2024a.
[22] Aiwei Liu, Qiang Sheng, and Xuming Hu. Preventing and detecting misinformation generated by large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3001-3004, 2024b.
[23] Yijian Lu, Aiwei Liu, Dianzhi Yu, Jingjing Li, and Irwin King. An entropy-based text watermarking detection method. arXiv preprint arXiv:2403.13485, 2024.
[24] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
[25] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
[26] Leyi Pan, Aiwei Liu, Zhiwei He, Zitian Gao, Xuandong Zhao, Yijian Lu, Binglin Zhou, Shuliang Liu, Xuming Hu, Lijie Wen, et al. MarkLLM: An open-source toolkit for LLM watermarking. arXiv preprint arXiv:2405.10051, 2024.
[27] Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. MarkMyWords: Analyzing and evaluating language model watermarks. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 68-91. IEEE, 2025.
[28] Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,... Qwen2.5 Technical Report.
[29] Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, and Teddy Furon. Watermarking makes language models radioactive. Advances in Neural Information Processing Systems, 37:21079-21113, 2024.
[30] John Schulman. Approximating KL divergence. http://joschu.net/blog/kl-approx.html, 2020.
[31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[32] Aman Sinha, Hongseok Namkoong, and John C Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018.
[33] Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, and Makoto Yamada. Necessary and sufficient watermark for large language models. arXiv preprint arXiv:2310.00833, 2023.
[34] Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. Ghostbuster: Detecting text ghostwritten by large language models. arXiv preprint arXiv:2305.15047, 2023.
[35] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
[36] Xiao Wang, Tianze Chen, Xianjun Yang, Qi Zhang, Xun Zhao, and Dahua Lin. Unveiling the misuse potential of base large language models via in-context learning. arXiv preprint arXiv:2404.10552, 2024.
[37] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36:80079-80110, 2023.
[38] John Wieting, Kevin Gimpel, Graham Neubig, and Taylor Berg-Kirkpatrick. Paraphrastic representations at scale. arXiv preprint arXiv:2104.15114, 2021.
[39] Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong. A survey on LLM-generated text detection: Necessity, methods, and future directions. Computational Linguistics, 51(1):275-338, 2025.
[40] Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. DiPmark: A stealthy, efficient and resilient watermark for large language models. 2023.
[41] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[42] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
[43] Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI-generated text. arXiv preprint arXiv:2306.17439, 2023.
[44] Xuandong Zhao, Lei Li, and Yu-Xiang Wang. Permute-and-flip: An optimally stable and watermarkable decoder for LLMs. arXiv preprint arXiv:2402.05864, 2024.
[45] Yizheng Zhu, Hongxin Zhang, and Pin-Yu Chen. Certified robustness to adversarial word substitutions. In EMNLP, 2021.

[1] [1]

Revealing weaknesses in text watermarking through self-information rewrite attacks

Yixin Cheng, Hongcheng Guo, Yangming Li, and Leonid Sigal. Revealing weaknesses in text watermarking through self-information rewrite attacks. arXiv preprint arXiv:2505.05190, 2025

work page arXiv 2025

[2] [2]

Undetectable watermarks for language models

Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. In The Thirty Seventh Annual Conference on Learning Theory, pp.\ 1125--1139. PMLR, 2024

work page 2024

[3] [3]

Certified adversarial robustness via randomized smoothing

Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, 2019

work page 2019

[4] [4]

Scalable watermarking for identifying large language model outputs

Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, et al. Scalable watermarking for identifying large language model outputs. Nature, 634 0 (8035): 0 818--823, 2024

work page 2024

[5] [5]

Distributionally robust losses for latent covariate mixtures

John Duchi and Hongseok Namkoong. Distributionally robust losses for latent covariate mixtures. arXiv preprint arXiv:1906.08764, 2019

work page arXiv 1906

[6] [6]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Can watermarks survive translation? On the cross-lingual consistency of text watermark for large language models

Zhiwei He, Binglin Zhou, Hongkun Hao, Aiwei Liu, Xing Wang, Zhaopeng Tu, Zhuosheng Zhang, and Rui Wang. Can watermarks survive translation? on the cross-lingual consistency of text watermark for large language models. arXiv preprint arXiv:2402.14007, 2024

work page arXiv 2024

[8] [8]

Unbiased watermark for large language models

Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. Unbiased watermark for large language models. arXiv preprint arXiv:2310.10669, 2023

work page arXiv 2023

[9] [9]

B^4: A black-box scrubbing attack on llm watermarks

Baizhou Huang, Xiao Pu, and Xiaojun Wan. B^4: A black-box scrubbing attack on llm watermarks. arXiv preprint arXiv:2411.01222, 2024

work page arXiv 2024

[10] [10]

Open r1: A fully open reproduction of deepseek-r1, January 2025

Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

work page 2025

[11] [11]

Watermark stealing in large language models

Nikola Jovanović, Robin Staab, and Martin Vechev. Watermark stealing in large language models. arXiv preprint arXiv:2402.19361, 2024

work page arXiv 2024

[12] [12]

A watermark for large language models

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pp. 17061–17084. PMLR, 2023a

work page 2023

[13] [13]

On the reliability of watermarks for large language models

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634, 2023b

work page arXiv 2023

[14] [14]

Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36: 27469–27500, 2023a

work page 2023

[15] [15]

Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36: 27469–27500, 2023b

work page 2023

[16] [16]

Robust distortion-free watermarks for language models

Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. arXiv preprint arXiv:2307.15593, 2023

work page arXiv 2023

[17] [17]

On information and sufficiency

Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1): 79–86, 1951

work page 1951

[18] [18]

Who wrote this code? Watermarking for code generation

Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, and Gunhee Kim. Who wrote this code? watermarking for code generation. arXiv preprint arXiv:2305.15060, 2023

work page arXiv 2023

[19] [19]

An unforgeable publicly verifiable watermark for large language models

Aiwei Liu, Leyi Pan, Xuming Hu, Shu'ang Li, Lijie Wen, Irwin King, and Philip S Yu. An unforgeable publicly verifiable watermark for large language models. arXiv preprint arXiv:2307.16230, 2023a

work page arXiv 2023

[20] [20]

A semantic invariant robust watermark for large language models

Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, and Lijie Wen. A semantic invariant robust watermark for large language models. arXiv preprint arXiv:2310.06356, 2023b

work page arXiv 2023

[21] [21]

Can watermarked llms be identified by users via crafted prompts?

Aiwei Liu, Sheng Guan, Yiming Liu, Leyi Pan, Yifei Zhang, Liancheng Fang, Lijie Wen, Philip S Yu, and Xuming Hu. Can watermarked llms be identified by users via crafted prompts? arXiv preprint arXiv:2410.03168, 2024a

work page arXiv 2024

[22] [22]

Preventing and detecting misinformation generated by large language models

Aiwei Liu, Qiang Sheng, and Xuming Hu. Preventing and detecting misinformation generated by large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3001–3004, 2024b

work page 2024

[23] [23]

An entropy-based text watermarking detection method

Yijian Lu, Aiwei Liu, Dianzhi Yu, Jingjing Li, and Irwin King. An entropy-based text watermarking detection method. arXiv preprint arXiv:2403.13485, 2024

work page arXiv 2024

[24] [24]

Towards deep learning models resistant to adversarial attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018

work page 2018

[25] [25]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Markllm: An open-source toolkit for llm watermarking

Leyi Pan, Aiwei Liu, Zhiwei He, Zitian Gao, Xuandong Zhao, Yijian Lu, Binglin Zhou, Shuliang Liu, Xuming Hu, Lijie Wen, et al. Markllm: An open-source toolkit for llm watermarking. arXiv preprint arXiv:2405.10051, 2024

work page arXiv 2024

[27] [27]

Markmywords: Analyzing and evaluating language model watermarks

Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. Markmywords: Analyzing and evaluating language model watermarks. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 68–91. IEEE, 2025

work page 2025

[28] [28]

Qwen2.5 Technical Report

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Watermarking makes language models radioactive

Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, and Teddy Furon. Watermarking makes language models radioactive. Advances in Neural Information Processing Systems, 37: 21079–21113, 2024

work page 2024

[30] [30]

Approximating kl divergence

John Schulman. Approximating kl divergence. http://joschu.net/blog/kl-approx.html, 2020

work page 2020

[31] [31]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Certifying some distributional robustness with principled adversarial training

Aman Sinha, Hongseok Namkoong, and John C Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018

work page 2018

[33] [33]

Necessary and sufficient watermark for large language models

Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, and Makoto Yamada. Necessary and sufficient watermark for large language models. arXiv preprint arXiv:2310.00833, 2023

work page arXiv 2023

[34] [34]

Ghostbuster: Detecting text ghostwritten by large language models

Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. Ghostbuster: Detecting text ghostwritten by large language models. arXiv preprint arXiv:2305.15047, 2023

work page arXiv 2023

[35] [35]

Trl: Transformer reinforcement learning

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

work page 2020

[36] [36]

Unveiling the misuse potential of base large language models via in-context learning

Xiao Wang, Tianze Chen, Xianjun Yang, Qi Zhang, Xun Zhao, and Dahua Lin. Unveiling the misuse potential of base large language models via in-context learning. arXiv preprint arXiv:2404.10552, 2024

work page arXiv 2024

[37] [37]

Jailbroken: How does llm safety training fail?

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36: 80079–80110, 2023

work page 2023

[38] [38]

Paraphrastic representations at scale

John Wieting, Kevin Gimpel, Graham Neubig, and Taylor Berg-Kirkpatrick. Paraphrastic representations at scale. arXiv preprint arXiv:2104.15114, 2021

work page arXiv 2021

[39] [39]

A survey on llm-generated text detection: Necessity, methods, and future directions

Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong. A survey on llm-generated text detection: Necessity, methods, and future directions. Computational Linguistics, 51(1): 275–338, 2025

work page 2025

[40] [40]

Dipmark: A stealthy, efficient and resilient watermark for large language models

Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. Dipmark: A stealthy, efficient and resilient watermark for large language models. 2023

work page 2023

[41] [41]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Provable robust watermarking for ai-generated text

Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for ai-generated text. arXiv preprint arXiv:2306.17439, 2023

work page arXiv 2023

[44] [44]

Permute-and-flip: An optimally stable and watermarkable decoder for llms

Xuandong Zhao, Lei Li, and Yu-Xiang Wang. Permute-and-flip: An optimally stable and watermarkable decoder for llms. arXiv preprint arXiv:2402.05864, 2024

work page arXiv 2024

[45] [45]

Certified robustness to adversarial word substitutions

Yizheng Zhu, Hongxin Zhang, and Pin-Yu Chen. Certified robustness to adversarial word substitutions. In EMNLP, 2021

work page 2021
