arXiv: 2512.16182 · v2 · submitted 2025-12-18 · 💻 cs.CR · cs.CL

DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack

Pith reviewed 2026-05-16 21:57 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords LLM watermarking · paraphrase attack · spoofing attack · dual-stream mechanism · watermark detection · text quality · attack traceability · model abuse prevention

The pith

DualGuard uses dual-stream watermarking to defend LLM outputs against both paraphrase and spoofing attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DualGuard to fix a gap in existing LLM watermarking that handles paraphrase attacks but ignores spoofing attacks where harmful content gets injected to break attribution. It introduces an adaptive dual-stream mechanism that injects two complementary watermark signals dynamically according to the text's semantic content. This setup detects watermarked text after paraphrasing and traces the origin of spoofed content. Tests across datasets and models show strong results in detection, robustness to attacks, traceability, and text quality preservation. The method aims to make watermarking practical for stopping model abuse on cloud platforms.

Core claim

DualGuard is the first watermarking algorithm capable of defending against both paraphrase and spoofing attacks by employing an adaptive dual-stream watermarking mechanism in which two complementary watermark signals are dynamically injected based on the semantic content. This design enables DualGuard not only to detect but also to trace spoofing attacks, thereby ensuring reliable and trustworthy watermark detection.

What carries the argument

Adaptive dual-stream watermarking mechanism that dynamically injects two complementary watermark signals based on semantic content to enable both detection and tracing.
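
To make the idea concrete, here is a minimal sketch of what dual-stream injection could look like. It is not the paper's algorithm: it swaps in a KGW-style green-list logit bias for each stream, and the stream keys, the `semantic_score` router, its threshold, and all constants are hypothetical names chosen for illustration.

```python
import hashlib
import random

# All values below are illustrative assumptions, not taken from the paper.
VOCAB_SIZE = 1000   # toy vocabulary size
GAMMA = 0.5         # fraction of the vocabulary marked "green"
DELTA = 2.0         # logit bias added to green tokens
STREAM_KEYS = {"A": "key-stream-a", "B": "key-stream-b"}  # hypothetical secrets


def green_ids(prev_token, stream):
    """Pick the green set for one stream, seeded by its key and the previous token."""
    digest = hashlib.sha256(f"{STREAM_KEYS[stream]}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def choose_stream(semantic_score, threshold=0.5):
    """Hypothetical router: a semantic score decides which stream marks this token."""
    return "B" if semantic_score >= threshold else "A"


def bias_logits(logits, prev_token, semantic_score):
    """Add DELTA to the logits of the chosen stream's green tokens."""
    green = green_ids(prev_token, choose_stream(semantic_score))
    return [x + DELTA if i in green else x for i, x in enumerate(logits)]
```

In this toy version the two streams differ only in their secret key, so a detector holding both keys could score a text against each stream separately. Whether DualGuard's streams are constructed this way is not stated in the material summarized here.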

If this is right

  • Watermarked LLM text stays detectable even after attackers paraphrase it.
  • Spoofing attempts that add harmful content can be both detected and traced.
  • The approach maintains text quality while delivering high detectability and robustness.
  • Watermarking becomes viable for real-world protection of model outputs against abuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Multiple signal streams could become a standard pattern for handling wider attack varieties in future systems.
  • This design might support better attribution in public LLM services by reducing false negatives from combined attacks.
  • Extending the semantic-based injection logic to other text transformations could be tested directly.

Load-bearing premise

Injecting two complementary watermark signals based on semantic content will not degrade text quality or detectability while providing traceability against spoofing attacks.

What would settle it

An experiment where spoofed text evades tracing or watermarked output shows clear quality loss compared to the original unwatermarked text.
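
The settling experiment above can be sketched as two numbers: how often spoofed samples escape tracing, and how much perplexity rises under watermarking. The function names and both thresholds (`max_evasion`, `max_ppl_rise`) are assumptions made for this sketch; the paper's own pass/fail criteria are not given here.

```python
from statistics import mean


def tracing_evasion_rate(traced):
    """Fraction of spoofed samples that the tracer failed to flag."""
    return sum(1 for t in traced if not t) / len(traced)


def relative_ppl_increase(ppl_watermarked, ppl_plain):
    """Relative rise in mean perplexity of watermarked over unwatermarked text."""
    base = mean(ppl_plain)
    return (mean(ppl_watermarked) - base) / base


def claim_survives(traced, ppl_watermarked, ppl_plain,
                   max_evasion=0.05, max_ppl_rise=0.10):
    """True if both hypothetical tolerances are met; thresholds are illustrative."""
    return (tracing_evasion_rate(traced) <= max_evasion
            and relative_ppl_increase(ppl_watermarked, ppl_plain) <= max_ppl_rise)
```

For example, `claim_survives([True, False], [10.0], [10.0])` returns `False`, because half of the spoofed samples escaped tracing even though perplexity did not move.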

Figures

Figures reproduced from arXiv: 2512.16182 by Fang Fang, Hao Li, Li Guo, Shi Wang, Yanan Cao, Yingjie Li, Yubing Ren.

Figure 1: An example is generated using the Llama-3.1-8B …
Figure 2: Overall framework of our watermarking method DualGuard. Gray indicates un-watermarked tokens, while blue and …
Figure 3: Experimental results of different attack models on RealNewsLike and RealToxicityPrompts dataset.
Figure 4: The impact of Dual-stream Selection.
Figure 5: PPL on RealNewsLike and BookSum datasets. [Plot: text perplexity for Un-watermarked, KGW, Unbiased, AAR, SynthID, EWD, SWEET, DIPmark, SIR, XSIR, and DualGuard.]
Figure 7: Experimental results of GPT-4.1 on RealNewsLike
Figure 8: Experimental results of watermark detectability on RealNewsLike and BookSum datasets.
Figure 9: Paraphrase attack prompt.
Figure 10: Piggyback spoofing attack prompt.
Original abstract

With the rapid development of cloud-based services, large language models have become increasingly accessible through various web platforms. However, this accessibility has also led to growing risks of model abuse. LLM watermarking has emerged as an effective approach to mitigate such misuse and protect intellectual property. Existing watermarking algorithms, however, primarily focus on defending against paraphrase attacks while overlooking piggyback spoofing attacks, which can inject harmful content, compromise watermark reliability, and undermine trust in attribution. To address this limitation, we propose DualGuard, the first watermarking algorithm capable of defending against both paraphrase and spoofing attacks. DualGuard employs the adaptive dual-stream watermarking mechanism, in which two complementary watermark signals are dynamically injected based on the semantic content. This design enables DualGuard not only to detect but also to trace spoofing attacks, thereby ensuring reliable and trustworthy watermark detection. Extensive experiments conducted across multiple datasets and language models demonstrate that DualGuard achieves excellent detectability, robustness, traceability, and text quality, effectively advancing the state of LLM watermarking for real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces DualGuard, the first LLM watermarking algorithm claimed to defend against both paraphrase attacks and piggyback spoofing attacks. It employs an adaptive dual-stream mechanism in which two complementary watermark signals are dynamically injected based on the semantic content of the generated text. This design is asserted to enable not only reliable detection under paraphrasing but also traceability of spoofing attempts by identifying which signal was altered. Extensive experiments across multiple datasets and language models are reported to demonstrate strong detectability, robustness, traceability, and preservation of text quality.

Significance. If the dual-stream complementarity can be shown to preserve independent detectability after paraphrasing while enabling spoofing traceability without quality degradation, the result would meaningfully extend single-stream watermarking methods by addressing a practical attack vector that compromises attribution reliability in deployed LLM services. The work targets a real gap in robustness for cloud-based LLM platforms.

major comments (1)
  1. [Method section (likely §3)] The description of the adaptive dual-stream mechanism provides no equations, pseudocode, or formal detection rule for signal generation, embedding combination, or verification-time separation of the two streams. Without these details it is impossible to verify that the signals remain independently detectable after paraphrase or that alteration of one enables reliable tracing of spoofing without destroying detectability of the other.
minor comments (1)
  1. [Abstract] The abstract asserts 'excellent' performance without reporting any quantitative metrics, baseline comparisons, or specific attack strengths, which hinders immediate assessment of the claimed advances.
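
As an illustration of the kind of formal detection rule the major comment asks for, one could assume a KGW-style green-token count computed separately for each stream. The symbols below ($G_s$, $\gamma$, $T_s$, $\tau$) are assumptions for this sketch and do not come from the manuscript.

```latex
z_s \;=\; \frac{\lvert x \rvert_{G_s} - \gamma\, T_s}{\sqrt{\gamma\,(1-\gamma)\,T_s}},
\qquad s \in \{1, 2\},
\qquad \text{flag as watermarked if } \max_s z_s > \tau .
```

Here $\lvert x \rvert_{G_s}$ is the number of scored tokens that fall in stream $s$'s green list, $T_s$ is the number of tokens scored for that stream, and $\gamma$ is the green-list fraction. Comparing $z_1$ with $z_2$ is one conceivable way to tell which stream was disturbed, though whether DualGuard does this is exactly what the missing equations would need to show.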

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for acknowledging the practical importance of defending against both paraphrase and spoofing attacks. We address the major comment below.

Point-by-point responses
  1. Referee: [Method section (likely §3)] The description of the adaptive dual-stream mechanism provides no equations, pseudocode, or formal detection rule for signal generation, embedding combination, or verification-time separation of the two streams. Without these details it is impossible to verify that the signals remain independently detectable after paraphrase or that alteration of one enables reliable tracing of spoofing without destroying detectability of the other.

    Authors: We agree that the current textual description in Section 3 lacks the formal specifications needed for independent verification. In the revised manuscript we will add explicit equations for (i) adaptive generation of the two complementary watermark signals conditioned on semantic content, (ii) the combination rule used during embedding, and (iii) the verification-time separation and detection statistics for each stream. We will also include pseudocode for the embedding and detection procedures to demonstrate how the streams preserve independent detectability after paraphrasing and how selective alteration of one stream enables reliable spoofing traceability without compromising the other. These additions will directly resolve the verifiability issue. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal is self-contained with no derivation chain or self-citation reduction

Full rationale

The paper introduces DualGuard as a novel adaptive dual-stream watermarking scheme but presents no mathematical derivations, equations, or closed-form proofs that reduce the central claim to fitted inputs or prior self-citations. The abstract and available description frame the contribution as an empirical design validated by experiments across datasets and models, with no load-bearing uniqueness theorem, ansatz smuggling, or renaming of known results. No self-citation is invoked to justify the complementarity of the two streams or their spoofing traceability. The result is therefore not forced by construction and stands as an independent proposal whose validity rests on external experimental outcomes rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are detailed in the abstract. The method relies on the unstated assumption that the dual-stream injection is feasible and effective.
