arxiv: 2509.20924 · v2 · pith:U74HG3VYnew · submitted 2025-09-25 · 💻 cs.CR

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

Pith reviewed 2026-05-18 14:20 UTC · model grok-4.3

classification 💻 cs.CR
keywords: LLM watermarking · adaptive attacks · reinforcement learning · robustness evaluation · paraphrase attacks · vulnerability assessment · AI content detection

The pith

Reinforcement learning lets a 3B model remove LLM watermarks at 98.5 percent success after training on 100 samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that evaluations of LLM watermarks have relied on attacks that are too weak, so reported robustness is overstated. It introduces the adaptive robustness radius, which measures the smallest distortion an optimal adaptive adversary needs to erase the watermark. By mapping possible paraphrases into a KL-divergence ball, the authors show that targeted optimization of the attack context and model parameters shrinks this radius substantially. They then implement RLCracker, a reinforcement learning attacker that trains on 100 short samples and lets a 3-billion-parameter model erase watermarks from 1,500-token Unigram-marked texts at 98.5 percent success while keeping semantic shift small. That compares with 6.75 percent for GPT-4o, and the result generalizes across five model sizes and ten watermarking schemes.

Core claim

We introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of watermarks against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and theoretically demonstrate that optimizing the attack context and model parameters can significantly reduce the approximate radius, making watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning based adaptive attack that erases watermark signals with limited watermarked examples and limited access to the detector.

What carries the argument

Adaptive robustness radius approximated by lifting the paraphrase space into a KL-divergence ball
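
One way to write this idea down is sketched below. It is a reading aid built from the summary above, not the paper's own notation: the detector $D$, the distortion measure $d$, the attack family $\mathcal{A}$, the reference distribution $p_0$, the budget $\epsilon$, and the direction of the KL divergence are all assumptions introduced here for illustration.

```latex
% Hypothetical notation: smallest distortion under which some adaptive attack a
% makes the detector D output "not watermarked" on the rewritten text a(x).
r_{\mathrm{adp}}(x) \;=\; \min_{a \in \mathcal{A}} \; d\bigl(x,\, a(x)\bigr)
\quad \text{s.t.} \quad D\bigl(a(x)\bigr) = 0

% Hypothetical KL ball of paraphrase distributions around a reference p_0.
\mathcal{B}_{\epsilon}(p_0) \;=\; \bigl\{\, q \;:\; \mathrm{KL}\bigl(q \,\|\, p_0\bigr) \le \epsilon \,\bigr\}
```

Under this reading, a smaller $r_{\mathrm{adp}}(x)$ means the watermark is easier to erase without changing the text much, which is the sense in which optimizing the attack "shrinks the radius".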

If this is right

  • Watermark removal becomes practical with small training sets and modest model sizes.
  • Robustness claims based on non-adaptive attacks fail against optimized RL methods.
  • The vulnerability appears across multiple watermarking schemes and model scales.
  • Effective attacks remain possible even with limited detector access.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Watermark designers may need to build resistance to optimization-based attacks into their theoretical models.
  • The same RL approach could serve as a general test for other AI-generated content detectors.
  • Long-term reliance on watermarking for provenance may require additional complementary defenses.

Load-bearing premise

The KL-divergence ball approximation faithfully captures the capabilities of a worst-case adaptive adversary using real paraphrases.

What would settle it

Run RLCracker on a previously untested watermarking scheme and check whether removal success stays above 90 percent while semantic similarity remains above 0.95.
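
The check described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `settles_claim` is hypothetical, the inputs are per-text detector verdicts and similarity scores that the reader has already computed, and "semantic similarity remains above 0.95" is interpreted as the mean score, which the text does not specify.

```python
from statistics import mean


def settles_claim(evaded, similarities, min_success=0.90, min_similarity=0.95):
    """Summarize an attack run against the two thresholds named in the text.

    evaded:       list of bools, True if the detector no longer flags the attacked text
    similarities: list of floats, one semantic-similarity score per attacked text
    Returns (success_rate, mean_similarity, passed).
    Assumption: the similarity threshold is applied to the mean, not per sample.
    """
    if not evaded or len(evaded) != len(similarities):
        raise ValueError("need one verdict and one similarity score per text")
    success_rate = sum(1 for e in evaded if e) / len(evaded)
    mean_similarity = mean(similarities)
    passed = success_rate > min_success and mean_similarity > min_similarity
    return success_rate, mean_similarity, passed
```

A stricter variant would require every individual score to exceed the threshold; which reading is appropriate depends on how the similarity metric is reported.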

Figures

Figures reproduced from arXiv: 2509.20924 by Hanbo Huang, Hao Zheng, Lin Liu, Shiyu Liang, Xuan Gong, Yihan Li, Yiran Zhang, Zhuotao Liu.

Figure 1: The RLCracker algorithm for watermark removal. The model is trained on question and …
Figure 2: Pass@20 on EWD. While direct black-box optimization of attack context c is intractable, Theorem 1 establishes that multi-sample strategies like Pass@k sampling are a principled attack paradigm (see Appendix B.1). To empirically validate the effectiveness of this approach, we conduct a Pass@20 attack on the EWD watermark using Qwen2.5-3B-Instruct. As shown in …
Figure 3: ESR and P-SP variation across user prompts (UserP.), with and without system prompt (SysP.). System prompts are an overlooked adversarial tool: simple in design but powerful in effect. While prior work has shown that user prompts can impact watermark evasion (Kirchenbauer et al., 2023b;a), evaluations have largely focused on user input variation, overlooking the broader influence of system-level instruc…
Figure 4: (a) shows the effectiveness of test-time scaling on watermark removal using Qwen3-8B; (b) …
Figure 5: Sensitivity of RLCracker to Weight Settings under Qwen-3-4B for EWD Watermark
Figure 6: (a) shows test-time scaling on Qwen3-8B, (b) shows remove rate on Qwen3-8B, (c) shows …
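
The Pass@k attack named in the Figure 2 caption can be illustrated with a short Python sketch. It is a generic multi-sample loop, not the paper's implementation: `paraphrase` and `is_watermarked` are hypothetical callables supplied by the reader, the sketch assumes the attacker may query the detector once per sample, and it omits any semantic-similarity filter.

```python
def pass_at_k_attack(text, paraphrase, is_watermarked, k=20):
    """Sample up to k paraphrases and return the first one the detector misses.

    paraphrase:     hypothetical callable, text -> one sampled rewrite
    is_watermarked: hypothetical callable, text -> True if the detector flags it
    Returns the evading paraphrase, or None if all k samples are still flagged.
    """
    for _ in range(k):
        candidate = paraphrase(text)
        if not is_watermarked(candidate):
            return candidate
    return None


def pass_at_k_rate(texts, paraphrase, is_watermarked, k=20):
    """Fraction of texts for which at least one of k samples evades detection."""
    if not texts:
        raise ValueError("texts must be non-empty")
    hits = sum(
        pass_at_k_attack(t, paraphrase, is_watermarked, k) is not None
        for t in texts
    )
    return hits / len(texts)
```

With k=20 this corresponds to the Pass@20 setting mentioned in the caption; the rate it returns is an upper bound on what a single-sample attack with the same paraphraser would achieve.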
Original abstract

Large language model (LLM) watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating the security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of watermarks against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and theoretically demonstrate that optimizing the attack context and model parameters can significantly reduce the approximate radius, making watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermark signals with limited watermarked examples and limited access to the detector. Despite weak supervision, it empowers a 3B model to achieve 98.5% removal success with minimal semantic shift on 1,500-token Unigram-marked texts after training on only 100 short samples. This performance dramatically exceeds 6.75% by GPT-4o and generalizes across five model sizes over ten watermarking schemes. Our code is available at https://github.com/OTT0-OTO/RLCracker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper argues that prior evaluations of LLM watermarks are insufficiently adversarial. It introduces the adaptive robustness radius as a formal metric for worst-case resilience, approximates it by lifting the paraphrase space into a KL-divergence ball, and claims that optimizing attack context and model parameters provably reduces this approximate radius. The authors present RLCracker, an RL-based adaptive attack that, after training on only 100 short samples, allows a 3B model to achieve 98.5% watermark removal success on 1,500-token Unigram-marked texts with minimal semantic shift, far exceeding GPT-4o's 6.75% and generalizing across five model sizes and ten watermarking schemes. Code is released at https://github.com/OTT0-OTO/RLCracker.

Significance. If the results hold, the work is significant for LLM security research. It supplies both a formal metric and a practical, low-resource RL attack demonstrating concrete vulnerabilities, supported by clear empirical success rates and cross-scheme generalization. The open-source code is a strength that aids reproducibility and allows independent verification of the reported attack performance.

major comments (1)
  1. [Theoretical analysis of adaptive robustness radius and KL-ball approximation] The lifting of the paraphrase space into a KL-divergence ball to approximate the adaptive robustness radius and the claim that optimization inside this ball reduces the radius (as stated in the abstract and theoretical analysis) is load-bearing for the central argument. The manuscript asserts rather than derives bounds on approximation tightness and provides no verification that the learned RL policy outputs remain inside the ball or that the ball contains the relevant worst-case semantic-preserving paraphrases. Without this, the 98.5% empirical removal success does not directly confirm the theoretical reduction.
minor comments (2)
  1. [Abstract] The abstract refers to 'minimal semantic shift' without naming the concrete metric (e.g., cosine similarity on embeddings or BLEU score); adding this detail would improve clarity and reproducibility.
  2. [Experimental setup] The description of the 100 training samples (their length distribution and selection process) relative to the 1,500-token evaluation texts could be expanded to better support the generalization claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for emphasizing the need for rigorous theoretical grounding. We address the single major comment below and agree that additional clarification and verification would strengthen the connection between the adaptive robustness radius and the empirical results. We plan to revise the manuscript accordingly.

Point-by-point responses
  1. Referee: The lifting of the paraphrase space into a KL-divergence ball to approximate the adaptive robustness radius and the claim that optimization inside this ball reduces the radius (as stated in the abstract and theoretical analysis) is load-bearing for the central argument. The manuscript asserts rather than derives bounds on approximation tightness and provides no verification that the learned RL policy outputs remain inside the ball or that the ball contains the relevant worst-case semantic-preserving paraphrases. Without this, the 98.5% empirical removal success does not directly confirm the theoretical reduction.

    Authors: We appreciate this observation. The manuscript defines the adaptive robustness radius as the minimal perturbation radius needed to erase the watermark under an adaptive adversary and approximates the space of semantic-preserving paraphrases via a KL-divergence ball centered on the original text distribution. Within this ball, we show that jointly optimizing attack context and model parameters yields a policy whose effective radius is smaller than that of non-adaptive baselines, because the learned policy identifies watermark-removing transformations that remain distributionally close. However, we acknowledge that the paper does not derive quantitative bounds on the tightness of the KL-ball approximation to the true paraphrase space, nor does it include post-training verification that RL-generated outputs lie inside the ball. The reported low semantic drift (high BERTScore, low perplexity change) provides indirect support that the attacks are semantically plausible, but this is not equivalent to a KL-membership check. In revision we will (1) explicitly state the heuristic nature of the approximation, (2) add a dedicated limitations subsection discussing the absence of tightness bounds, and (3) include new experiments that compute the empirical KL divergence of the attacked texts relative to the original distribution to confirm they remain within the modeled ball. These changes will make the theoretical-empirical linkage more transparent without altering the core empirical claims.

    revision: yes
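
The KL-membership check that the rebuttal promises could look roughly like the sketch below. It is an illustration under assumptions, not the authors' planned experiment: it estimates KL(q || p) from samples drawn from the attack policy q, using the k1 and k3 Monte Carlo estimators described in Schulman's note on approximating KL divergence. The function names, the choice of divergence direction, and the budget `epsilon` are all hypothetical, and the inputs are sequence-level log-probabilities the reader has already computed.

```python
import math


def kl_estimates(logq, logp):
    """Monte Carlo estimates of KL(q || p) from samples x_i drawn from q.

    logq[i] = log q(x_i) under the attack policy (assumed sampling distribution)
    logp[i] = log p(x_i) under the reference distribution
    Returns (k1, k3): k1 is the plain log-ratio average, k3 = (r - 1) - log r
    with r = p(x)/q(x), which is non-negative for every sample.
    """
    if not logq or len(logq) != len(logp):
        raise ValueError("need matching, non-empty log-probability lists")
    log_r = [lp - lq for lq, lp in zip(logq, logp)]
    k1 = sum(-lr for lr in log_r) / len(log_r)
    # expm1(lr) equals r - 1; it can overflow if p(x) is vastly larger than q(x).
    k3 = sum(math.expm1(lr) - lr for lr in log_r) / len(log_r)
    return k1, k3


def inside_ball(logq, logp, epsilon):
    """True if the k3 estimate of KL(q || p) is at most the assumed budget epsilon."""
    _, k3 = kl_estimates(logq, logp)
    return k3 <= epsilon
```

For long texts the sequence-level log-ratios can be large, so a per-token average may be more stable in practice; that design choice is left open here.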

Circularity Check

0 steps flagged

Derivation self-contained; KL-ball approximation external to empirical attack results

Full rationale

The paper defines the adaptive robustness radius as a formal worst-case metric, approximates it by lifting paraphrases into a KL-divergence ball, and separately demonstrates a theoretical reduction under optimization inside that ball. RLCracker is then presented as an RL policy realizing attacks in practice, with results reported on held-out test texts (1,500-token samples) after training on 100 short examples. No equation or claim reduces the reported 98.5% removal success or the radius reduction to a fitted parameter or self-citation by construction. The KL-ball construction is introduced as an external modeling choice rather than derived from the attack success metric itself, and the empirical evaluation uses standard held-out generalization checks. This yields only minor self-citation risk at most and keeps the central claims independent of the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on the modeling choice that a KL-divergence ball adequately represents the space of meaning-preserving paraphrases and on the assumption that limited-access RL optimization can discover near-optimal attacks without full detector gradients.

free parameters (1)
  • training sample count
    Limited to 100 short samples as weak supervision; the number is chosen to demonstrate data efficiency rather than derived from first principles.
axioms (1)
  • domain assumption: The paraphrase space can be lifted into a KL-divergence ball that approximates worst-case adaptive attacks
    Invoked to define and approximate the adaptive robustness radius in the theoretical section.
invented entities (1)
  • adaptive robustness radius (no independent evidence)
    purpose: Formal metric quantifying worst-case watermark resilience against adaptive adversaries
    Newly introduced quantity whose value is approximated via the KL-ball optimization.

pith-pipeline@v0.9.0 · 5783 in / 1442 out tokens · 52236 ms · 2026-05-18T14:20:38.698769+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

    cs.CR · 2026-04 · unverdicted · novelty 7.0

    RLSpoofer trains a 4B model on 100 watermarked paraphrase pairs to spoof PF watermarks at 62% success rate, far exceeding baselines trained on up to 10,000 samples.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Revealing weaknesses in text watermarking through self-information rewrite attacks

    Yixin Cheng, Hongcheng Guo, Yangming Li, and Leonid Sigal. Revealing weaknesses in text watermarking through self-information rewrite attacks. arXiv preprint arXiv:2505.05190, 2025

  2. [2]

    Undetectable watermarks for language models

    Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. In The Thirty Seventh Annual Conference on Learning Theory, pp. 1125–1139. PMLR, 2024

  3. [3]

    Certified adversarial robustness via randomized smoothing

    Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, 2019

  4. [4]

    Scalable watermarking for identifying large language model outputs

    Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, et al. Scalable watermarking for identifying large language model outputs. Nature, 634(8035):818–823, 2024

  5. [5]

    Distributionally robust losses for latent covariate mixtures

    John Duchi and Hongseok Namkoong. Distributionally robust losses for latent covariate mixtures. arXiv preprint arXiv:1906.08764, 2019

  6. [6]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  7. [7]

    Can watermarks survive translation? on the cross-lingual consistency of text watermark for large language models

    Zhiwei He, Binglin Zhou, Hongkun Hao, Aiwei Liu, Xing Wang, Zhaopeng Tu, Zhuosheng Zhang, and Rui Wang. Can watermarks survive translation? on the cross-lingual consistency of text watermark for large language models. arXiv preprint arXiv:2402.14007, 2024

  8. [8]

    Unbiased watermark for large language models

    Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, and Heng Huang. Unbiased watermark for large language models. arXiv preprint arXiv:2310.10669, 2023

  9. [9]

    b^4: A black-box scrubbing attack on llm watermarks

    Baizhou Huang, Xiao Pu, and Xiaojun Wan. b^4: A black-box scrubbing attack on llm watermarks. arXiv preprint arXiv:2411.01222, 2024

  10. [10]

    Open r1: A fully open reproduction of deepseek-r1, January 2025

    Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

  11. [11]

    Watermark stealing in large language models

    Nikola Jovanović, Robin Staab, and Martin Vechev. Watermark stealing in large language models. arXiv preprint arXiv:2402.19361, 2024

  12. [12]

    A watermark for large language models

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pp. 17061–17084. PMLR, 2023a

  13. [13]

    On the reliability of watermarks for large language models

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634, 2023b

  14. [14]

    Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense

    Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36:27469–27500, 2023a

  15. [15]

    Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense

    Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36:27469–27500, 2023b

  16. [16]

    Robust distortion-free watermarks for language models

    Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. arXiv preprint arXiv:2307.15593, 2023

  17. [17]

    On information and sufficiency

    Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951

  18. [18]

    Who wrote this code? watermarking for code generation

    Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, and Gunhee Kim. Who wrote this code? watermarking for code generation. arXiv preprint arXiv:2305.15060, 2023

  19. [19]

    An unforgeable publicly verifiable watermark for large language models

    Aiwei Liu, Leyi Pan, Xuming Hu, Shu'ang Li, Lijie Wen, Irwin King, and Philip S Yu. An unforgeable publicly verifiable watermark for large language models. arXiv preprint arXiv:2307.16230, 2023a

  20. [20]

    A semantic invariant robust watermark for large language models

    Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, and Lijie Wen. A semantic invariant robust watermark for large language models. arXiv preprint arXiv:2310.06356, 2023b

  21. [21]

    Can watermarked llms be identified by users via crafted prompts?

    Aiwei Liu, Sheng Guan, Yiming Liu, Leyi Pan, Yifei Zhang, Liancheng Fang, Lijie Wen, Philip S Yu, and Xuming Hu. Can watermarked llms be identified by users via crafted prompts? arXiv preprint arXiv:2410.03168, 2024a

  22. [22]

    Preventing and detecting misinformation generated by large language models

    Aiwei Liu, Qiang Sheng, and Xuming Hu. Preventing and detecting misinformation generated by large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3001–3004, 2024b

  23. [23]

    An entropy-based text watermarking detection method

    Yijian Lu, Aiwei Liu, Dianzhi Yu, Jingjing Li, and Irwin King. An entropy-based text watermarking detection method. arXiv preprint arXiv:2403.13485, 2024

  24. [24]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018

  25. [25]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  26. [26]

    Markllm: An open-source toolkit for llm watermarking

    Leyi Pan, Aiwei Liu, Zhiwei He, Zitian Gao, Xuandong Zhao, Yijian Lu, Binglin Zhou, Shuliang Liu, Xuming Hu, Lijie Wen, et al. Markllm: An open-source toolkit for llm watermarking. arXiv preprint arXiv:2405.10051, 2024

  27. [27]

    Markmywords: Analyzing and evaluating language model watermarks

    Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. Markmywords: Analyzing and evaluating language model watermarks. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 68–91. IEEE, 2025

  28. [28]

    Qwen2.5 Technical Report

    Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  29. [29]

    Watermarking makes language models radioactive

    Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, and Teddy Furon. Watermarking makes language models radioactive. Advances in Neural Information Processing Systems, 37:21079–21113, 2024

  30. [30]

    Approximating kl divergence

    John Schulman. Approximating kl divergence. http://joschu.net/blog/kl-approx.html, 2020

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  32. [32]

    Certifying some distributional robustness with principled adversarial training

    Aman Sinha, Hongseok Namkoong, and John C Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018

  33. [33]

    Necessary and sufficient watermark for large language models

    Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, and Makoto Yamada. Necessary and sufficient watermark for large language models. arXiv preprint arXiv:2310.00833, 2023

  34. [34]

    Ghostbuster: Detecting text ghostwritten by large language models

    Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. Ghostbuster: Detecting text ghostwritten by large language models. arXiv preprint arXiv:2305.15047, 2023

  35. [35]

    Trl: Transformer reinforcement learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

  36. [36]

    Unveiling the misuse potential of base large language models via in-context learning

    Xiao Wang, Tianze Chen, Xianjun Yang, Qi Zhang, Xun Zhao, and Dahua Lin. Unveiling the misuse potential of base large language models via in-context learning. arXiv preprint arXiv:2404.10552, 2024

  37. [37]

    Jailbroken: How does llm safety training fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023

  38. [38]

    Paraphrastic representations at scale

    John Wieting, Kevin Gimpel, Graham Neubig, and Taylor Berg-Kirkpatrick. Paraphrastic representations at scale. arXiv preprint arXiv:2104.15114, 2021

  39. [39]

    A survey on llm-generated text detection: Necessity, methods, and future directions

    Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong. A survey on llm-generated text detection: Necessity, methods, and future directions. Computational Linguistics, 51(1):275–338, 2025

  40. [40]

    Dipmark: A stealthy, efficient and resilient watermark for large language models

    Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. Dipmark: A stealthy, efficient and resilient watermark for large language models. 2023

  41. [41]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  42. [42]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  43. [43]

    Provable robust watermarking for ai-generated text

    Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for ai-generated text. arXiv preprint arXiv:2306.17439, 2023

  44. [44]

    Permute-and-flip: An optimally stable and watermarkable decoder for llms

    Xuandong Zhao, Lei Li, and Yu-Xiang Wang. Permute-and-flip: An optimally stable and watermarkable decoder for llms. arXiv preprint arXiv:2402.05864, 2024

  45. [45]

    Certified robustness to adversarial word substitutions

    Yizheng Zhu, Hongxin Zhang, and Pin-Yu Chen. Certified robustness to adversarial word substitutions. In EMNLP, 2021
