pith. sign in

arxiv: 2509.03117 · v3 · pith:ZQ7EFNMPnew · submitted 2025-09-03 · 💻 cs.CR

PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

Pith reviewed 2026-05-25 07:57 UTC · model grok-4.3

classification 💻 cs.CR
keywords system promptcopyright auditingwatermarkinglarge language modelscontent-only detectionintellectual property
0
0 comments X

The pith

PromptCOS audits system prompt copyrights from generated text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to verify ownership of system prompts in large language models when only the output text is available for inspection. Current verification techniques depend on shifts in internal probability distributions that are inaccessible in ordinary content-only use cases, where sampling noise also makes signals unreliable. PromptCOS instead inserts a few auxiliary tokens that steer the model toward a repeating cyclic output pattern, leaving the original prompt instructions untouched. Cover tokens in the prompt are tuned so that attempts to delete the auxiliary tokens cause clear performance drops. This setup lets prompt owners check for unauthorized copies directly from the text that an application produces.

Core claim

PromptCOS establishes a content-only auditing framework by encoding a cyclic output signal using auxiliary tokens injected into the system prompt, with cover tokens optimized to penalize removal attempts, thereby achieving stable watermark detection solely from generated content without altering the prompt's primary behavior.

What carries the argument

Cyclic output signal encoded by auxiliary tokens, protected by optimized cover tokens that cause degradation if removed.

If this is right

  • System prompts can carry a detectable watermark without changing their core task instructions.
  • Verification of prompt ownership succeeds using only the generated text, independent of internal model probabilities.
  • Removing the watermark tokens triggers measurable degradation in the prompt's original performance.
  • The watermark resists removal attempts across multiple categories of attacks.
  • Auditing requires far less computation than logit-based alternatives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Owners of high-value prompts could embed such signals before licensing them to third-party applications.
  • The same auxiliary-token approach might apply to auditing other forms of prompt customization beyond system prompts.
  • Marketplaces that sell prompts could require watermarking as a condition for listing.
  • Detection could be extended to track which downstream application is using a stolen prompt by varying the cyclic pattern per licensee.

Load-bearing premise

A small set of auxiliary tokens can create a stable, detectable cyclic pattern in the output while the main prompt instructions continue to function unchanged.

What would settle it

An experiment in which the model is prompted without the auxiliary tokens or after their removal and the measured similarity to the target cyclic signal drops below the reported detection threshold.

Figures

Figures reproduced from arXiv: 2509.03117 by Dacheng Tao, Enhao Huang, Hongwei Yao, Shuo Shao, Yiming Li, Yuchen Yang, Yuyi Wang, Zhan Qin, Zhibo Wang.

Figure 1
Figure 1. Figure 1: Illustration of prompt leakage. To develop an [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Fidelity measures the consistency of responses [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: PromptCOS protects system prompts through two phases: (1) Watermark Embedding. PromptCOS defines optimization objectives with three key designs, including cyclic output signals (effectiveness), auxiliary tokens (fidelity), and cover tokens (robustness), and exploits an alternating optimization algorithm to optimize the watermark components: watermarked prompt, verification query, and signal mark. (2) Copyr… view at source ↗
Figure 4
Figure 4. Figure 4: The single signal limits the position and timing of signal mark generation, making it susceptible to uncertain token [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Watermarked system prompts’ performance under different attacks. The four attack strategies are represented [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: This source code of the template function. [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
read the original abstract

System prompts are critical for shaping the behavior and output quality of large language model (LLM)-based applications, driving substantial investment in optimizing high-quality prompts beyond traditional handcrafted designs. However, as system prompts become valuable intellectual property, they are increasingly vulnerable to prompt theft and unauthorized use, highlighting the urgent need for effective copyright auditing, especially watermarking. Existing methods rely on verifying subtle logit distribution shifts triggered by a query. We observe that this logit-dependent verification framework is impractical in real-world content-only settings, primarily because (1) random sampling makes content-level generation unstable for verification, and (2) stronger instructions needed for content-level signals compromise prompt fidelity. To overcome these challenges, we propose PromptCOS, the first content-only system prompt copyright auditing method based on content-level output similarity. PromptCOS achieves watermark stability by designing a cyclic output signal as the conditional instruction's target. It preserves prompt fidelity by injecting a small set of auxiliary tokens to encode the watermark, leaving the main prompt untouched. Furthermore, to ensure robustness against malicious removal, we optimize cover tokens, i.e., critical tokens in the original prompt, to ensure that removing auxiliary tokens causes severe performance degradation. Experimental results show that promptCOS achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% higher than the best baseline), high fidelity (accuracy degradation no greater than 0.6%), robustness (resilience against four potential attack categories), and high computational efficiency (up to 98.1% cost saving).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes PromptCOS, the first content-only system prompt copyright auditing method for LLMs based on content-level output similarity. It achieves watermark stability via a cyclic output signal as the conditional instruction target, preserves fidelity by injecting a small set of auxiliary tokens to encode the watermark (leaving the main prompt untouched), and ensures robustness by optimizing cover tokens so that auxiliary-token removal causes severe performance degradation. Experiments report 99.3% average watermark similarity, 60.8% higher distinctiveness than the best baseline, accuracy degradation no greater than 0.6%, resilience against four attack categories, and up to 98.1% computational cost saving.

Significance. If the empirical claims hold under full scrutiny of methods and controls, PromptCOS would provide a practical solution for system-prompt IP protection in deployed LLM applications where only generated content (not logits) is observable, directly addressing the sampling instability and fidelity-loss limitations of prior logit-based watermarking approaches.

major comments (3)
  1. [Abstract, design paragraph] Abstract and design description: the stability claim rests on the 'cyclic output signal' construction, yet no formal definition, recurrence relation, or proof sketch is supplied showing why this signal remains detectable after sampling; this is load-bearing for the core effectiveness claim of 99.3% similarity.
  2. [Experimental results] Experimental results paragraph: the 99.3% similarity, 0.6% fidelity, and 60.8% distinctiveness figures are stated without reference to the precise experimental protocol, baseline implementations, number of trials, statistical tests, or data-exclusion criteria, preventing verification that post-hoc choices did not inflate the numbers.
  3. [Robustness evaluation] Robustness section: the claim of resilience against 'four potential attack categories' is presented without enumerating the attacks or showing quantitative degradation curves when auxiliary tokens are removed, leaving the cover-token optimization claim unverified.
minor comments (2)
  1. [Method] Notation for auxiliary and cover tokens should be introduced with explicit symbols and an equation showing how they are combined with the original prompt.
  2. [Abstract] The abstract states 'up to 98.1% cost saving' without specifying the baseline cost metric or the exact configuration achieving that figure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate to improve clarity and verifiability.

read point-by-point responses
  1. Referee: [Abstract, design paragraph] Abstract and design description: the stability claim rests on the 'cyclic output signal' construction, yet no formal definition, recurrence relation, or proof sketch is supplied showing why this signal remains detectable after sampling; this is load-bearing for the core effectiveness claim of 99.3% similarity.

    Authors: We agree that a formal definition would strengthen the presentation. The cyclic signal is constructed by setting the conditional instruction target to a periodic sequence of auxiliary tokens whose embeddings induce a detectable similarity pattern under content-only comparison. In the revision we will add an explicit recurrence relation (s_{t+1} = f(s_t, aux)) together with a short argument showing that the periodicity survives sampling noise because the auxiliary tokens dominate the similarity metric even after token-level perturbations. This addition will be placed in Section 3.2. revision: yes

  2. Referee: [Experimental results] Experimental results paragraph: the 99.3% similarity, 0.6% fidelity, and 60.8% distinctiveness figures are stated without reference to the precise experimental protocol, baseline implementations, number of trials, statistical tests, or data-exclusion criteria, preventing verification that post-hoc choices did not inflate the numbers.

    Authors: The full experimental protocol (including 500 independent trials per setting, exact baseline re-implementations, t-test significance thresholds, and exclusion rules for degenerate prompts) appears in Section 4.1 and Appendix B. To address the concern directly we will insert a concise protocol summary table and a statement of statistical procedures immediately after the reported figures in the abstract and results section. revision: yes

  3. Referee: [Robustness evaluation] Robustness section: the claim of resilience against 'four potential attack categories' is presented without enumerating the attacks or showing quantitative degradation curves when auxiliary tokens are removed, leaving the cover-token optimization claim unverified.

    Authors: The four attack categories (auxiliary-token deletion, synonym substitution, prompt paraphrasing, and logit-level perturbation) are defined and evaluated in Section 5.2. We will expand that section with an explicit enumeration, per-attack success rates, and degradation curves for both watermarked and cover-token-optimized prompts, confirming that removal of auxiliary tokens triggers >40% accuracy drop only when cover tokens are optimized. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes PromptCOS as an engineering construction: a cyclic output signal encoded via auxiliary tokens plus optimized cover tokens, with performance metrics reported as direct empirical outcomes of that design. No equations, derivations, or predictions are presented that reduce claimed results to quantities defined by the method's own fitted parameters or self-citations. The approach targets stated limitations of prior logit-based methods through explicit design choices rather than a closed mathematical chain. This is the most common honest finding for an applied systems paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claim rests on several design choices whose justification is not derived from prior literature but introduced to solve the stated practical constraints.

free parameters (2)
  • auxiliary token set
    Small set of tokens chosen to encode the watermark; selection criteria not specified in abstract.
  • cover token optimization targets
    Critical tokens in the original prompt tuned to cause performance drop upon removal; optimization procedure and objective not detailed.
axioms (1)
  • domain assumption Content-level similarity of outputs can serve as a stable and distinctive watermark signal
    Invoked when the cyclic output signal is defined as the verification target.
invented entities (1)
  • cyclic output signal no independent evidence
    purpose: Provide stable content-level watermark detectable without logits
    New construct introduced to overcome instability of random sampling

pith-pipeline@v0.9.0 · 5832 in / 1565 out tokens · 34754 ms · 2026-05-25T07:57:33.157342+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  2. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 6.0

    PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Prompt leakage effect and mitigation strategies for multi-turn llm applications

    Divyansh Agarwal, Alexander Richard Fabbri, Ben Risher, Philippe Laban, Shafiq Joty, and Chien-Sheng Wu. Prompt leakage effect and mitigation strategies for multi-turn llm applications. InEMNLP, 2024

  2. [2]

    Quick suite

    Amazon. Quick suite. https://aws.amazon.com/cn/quicksuite/, 2025

  3. [3]

    Gpt-5 system prompt leakage

    asgeirtj. Gpt-5 system prompt leakage. https://github.com/asgeirtj/ system_prompts_leaks/blob/main/OpenAI/gpt-5-thinking.md, 2025

  4. [4]

    Greedy search method for separable nonlinear models using stage aitken gradient descent and least squares algorithms.IEEE Transactions on Automatic Control, pages 5044–5051, 2023

    Jing Chen, Yawen Mao, Min Gan, Dongqing Wang, and Quanmin Zhu. Greedy search method for separable nonlinear models using stage aitken gradient descent and least squares algorithms.IEEE Transactions on Automatic Control, pages 5044–5051, 2023

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, and et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, Daya Guo, Dejian Yang, and et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  7. [7]

    Stateful defenses for machine learning models are not yet secure against black-box attacks

    Ryan Feng, Ashish Hooda, Neal Mangaokar, Kassem Fawaz, Somesh Jha, and Atul Prakash. Stateful defenses for machine learning models are not yet secure against black-box attacks. InCCS, 2023

  8. [8]

    Math- ematical capabilities of chatgpt

    Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, and et al. Math- ematical capabilities of chatgpt. InNeurIPS, 2023

  9. [9]

    Introducing github copilot: your ai pair programmer

    Nat Friedman. Introducing github copilot: your ai pair programmer. https://docs.github.com/zh/copilot/quickstart, 2021

  10. [10]

    Cosearchagent: A lightweight collaborative search agent with large language models

    P Gong, J Li, and J Mao. Cosearchagent: A lightweight collaborative search agent with large language models. InSIGIR, 2024

  11. [11]

    On benchmarking code llms for android malware analysis

    Yiling He, Hongyu She, Xingzhi Qian, Xinran Zheng, Zhuo Chen, Zhan Qin, and Lorenzo Cavallaro. On benchmarking code llms for android malware analysis. InISSTA Workshop, 2025

  12. [12]

    Is difficulty calibration all we need? towards more practical membership inference attacks

    Yu He, Boheng Li, Yao Wang, Mengda Yang, Juan Wang, Hongxin Hu, and Xingyu Zhao. Is difficulty calibration all we need? towards more practical membership inference attacks. InCCS, 2024

  13. [13]

    Large language models for software engineering: A systematic literature review.ACM Trans- actions on Software Engineering and Methodology, 33(8), 2024

    Xinyi Hou, Yanjie Zhao, Yue Liu, and et al. Large language models for software engineering: A systematic literature review.ACM Trans- actions on Software Engineering and Methodology, 33(8), 2024

  14. [14]

    On the (in) security of llm app stores

    Xinyi Hou, Yanjie Zhao, and Haoyu Wang. On the (in) security of llm app stores. InIEEE S&P, 2025

  15. [15]

    Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal

    Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal. InACL, 2024

  16. [16]

    Pleak: Prompt leaking attacks against large language model applications

    Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao. Pleak: Prompt leaking attacks against large language model applications. InCCS, 2024

  17. [17]

    A watermark for large language models

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. InICML, 2023

  18. [18]

    Clearstamp: A human-visible and robust model-ownership proof based on trans- posed model training

    Torsten Krauß, Jasper Stang, and Alexandra Dmitrienko. Clearstamp: A human-visible and robust model-ownership proof based on trans- posed model training. InUSENIX Security, 2024

  19. [19]

    Robust distortion-free watermarks for language models.Trans- actions on Machine Learning Research, 2023

    Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models.Trans- actions on Machine Learning Research, 2023

  20. [20]

    Governing open vocabulary data leaks using an edge llm through programming by example

    Q Li, J Wen, and H Jin. Governing open vocabulary data leaks using an edge llm through programming by example. InMobiCom, 2024

  21. [21]

    Citation-enhanced generation for llm-based chatbots

    Weitao Li, Junkai Li, Weizhi Ma, and Yang Liu. Citation-enhanced generation for llm-based chatbots. InACL, 2024

  22. [22]

    Rethinking data protection in the (generative) artificial intelligence era.arXiv preprint arXiv:2507.03034, 2025

    Yiming Li, Shuo Shao, Yu He, Junfeng Guo, Tianwei Zhang, Zhan Qin, Pin-Yu Chen, Michael Backes, Philip Torr, Dacheng Tao, and Kui Ren. Rethinking data protection in the (generative) artificial intelligence era.arXiv preprint arXiv:2507.03034, 2025

  23. [23]

    Pro- tecting intellectual property of large language model-based code generation apis via watermarks

    Zongjie Li, Chaozheng Wang, Shuai Wang, and Cuiyun Gao. Pro- tecting intellectual property of large language model-based code generation apis via watermarks. InCCS, 2023

  24. [24]

    Parrot: Efficient serving of{LLM- based}applications with semantic variable

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. Parrot: Efficient serving of{LLM- based}applications with semantic variable. InOSDI, 2024

  25. [25]

    A survey of text watermarking in the era of large language models.ACM Computing Surveys, 57(2):1–36, 2024

    Aiwei Liu, Leyi Pan, Yijian Lu, Jingjing Li, Xuming Hu, Xi Zhang, Lijie Wen, Irwin King, Hui Xiong, and Philip Yu. A survey of text watermarking in the era of large language models.ACM Computing Surveys, 57(2):1–36, 2024

  26. [26]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. InNeurIPS, 2023

  27. [27]

    Pre-train, prompt, and predict: A sys- tematic survey of prompting methods in natural language processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A sys- tematic survey of prompting methods in natural language processing. ACM computing surveys, 55(9):1–35, 2023

  28. [28]

    Safe-sd: Safe and traceable stable diffusion with text prompt trigger for invisible generative watermarking

    Zhiyuan Ma, Guoli Jia, Biqing Qi, and Bowen Zhou. Safe-sd: Safe and traceable stable diffusion with text prompt trigger for invisible generative watermarking. InACM MM, pages 7113–7122, 2024

  29. [29]

    Poster: Multiparty private set intersection from multiparty homomorphic encryption

    Christian Mouchet, Sylvain Chatel, Lea Nürnberger, and Wouter Lueks. Poster: Multiparty private set intersection from multiparty homomorphic encryption. InCCS, 2024

  30. [30]

    Promsec: Prompt optimization for secure generation of func- tional source code with large language models (llms)

    Mahmoud Nazzal, Issa Khalil, Abdallah Khreishah, and NhatHai Phan. Promsec: Prompt optimization for secure generation of func- tional source code with large language models (llms). InCCS, 2024

  31. [31]

    OpenAI. Chatgpt. https://chatgpt.com/, 2025

  32. [32]

    Gpt-4 technical report, 2024

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, and Sam Altman et al. Gpt-4 technical report, 2024

  33. [33]

    Teach llms to phish: Stealing private information from language models

    A Panda, C A Choquette-Choo, Z Zhang, et al. Teach llms to phish: Stealing private information from language models. InICLR, 2024

  34. [34]

    Online prompt market

    PromptBase. Online prompt market. https://promptbase.com/, 2025

  35. [35]

    Sok: On the role and future of aigc watermarking in the era of gen-ai.arXiv preprint arXiv:2411.11478, 2024

    Kui Ren, Ziqi Yang, Li Lu, Jian Liu, Yiming Li, Jie Wan, Xiaodi Zhao, Xianheng Feng, and Shuo Shao. Sok: On the role and future of aigc watermarking in the era of gen-ai.arXiv preprint arXiv:2411.11478, 2024

  36. [36]

    Code Llama: Open Foundation Models for Code

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, and et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2024

  37. [37]

    Faq retrieval using query-question similarity and bert-based query-answer relevance

    Wataru Sakata, Tomohide Shibata, Ribeka Tanaka, and Sadao Kuro- hashi. Faq retrieval using query-question similarity and bert-based query-answer relevance. InSIGIR, 2019

  38. [38]

    2025 llm top-10 risks

    Sander Schulhoff. 2025 llm top-10 risks. https://genai.owasp.org/ llmrisk/llm072025-system-prompt-leakage/, 2025

  39. [39]

    Microsoft bing system prompt leakage

    Sander Schulhoff. Microsoft bing system prompt leakage. https: //learnprompting.org/docs/prompt_hacking/leaking, 2025

  40. [40]

    The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Ka- hadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, Hy- oJung Han, Sevien Schulhoff, et al. The prompt report: a sys- tematic survey of prompt engineering techniques.arXiv preprint arXiv:2406.06608, 2024

  41. [41]

    Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution

    Shuo Shao, Yiming Li, Hongwei Yao, Yiling He, Zhan Qin, and Kui Ren. Explanation as a watermark: Towards harmless and multi-bit model ownership verification via watermarking feature attribution. In NDSS, 2025

  42. [42]

    Generative echo chamber? effect of llm-powered search systems on diverse information seeking

    N Sharma, Q V Liao, and Z Xiao. Generative echo chamber? effect of llm-powered search systems on diverse information seeking. In CHI, 2024

  43. [43]

    Prompt stealing attacks against text-to-image generation models

    Xinyue Shen, Yiting Qu, Michael Backes, and Yang Zhang. Prompt stealing attacks against text-to-image generation models. InUSENIX Security, 2024

  44. [44]

    Logan IV , Eric Wallace, and Sameer Singh

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV , Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. InEMNLP, 2020

  45. [45]

    Why so toxic? measuring and triggering toxic behavior in open-domain chatbots

    Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristo- faro, Gianluca Stringhini, Savvas Zannettou, and Yang Zhang. Why so toxic? measuring and triggering toxic behavior in open-domain chatbots. InCCS, 2022

  46. [46]

    On transferability of prompt tuning for natural language processing

    Yusheng Su, Xiaozhi Wang, Yujia Qin, Chi-Min Chan, Yankai Lin, Huadong Wang, Kaiyue Wen, et al. On transferability of prompt tuning for natural language processing. InNAACL, 2022

  47. [47]

    Autohint: Auto- matic prompt optimization with hint generation.arXiv preprint arXiv:2307.07415, 2023

    Hong Sun, Xue Li, Yinchuan Xu, and et al. Autohint: Auto- matic prompt optimization with hint generation.arXiv preprint arXiv:2307.07415, 2023

  48. [48]

    Peftguard: detecting backdoor attacks against parameter-efficient fine- tuning

    Zhen Sun, Tianshuo Cong, Yule Liu, Chenhao Lin, Xinlei He, et al. Peftguard: detecting backdoor attacks against parameter-efficient fine- tuning. InIEEE S&P, 2025

  49. [49]

    On the effectiveness of prompt stealing attacks on in-the- wild prompts

    Yicong Tan, Xinyue Shen, Yun Shen, Michael Backes, and Yang Zhang. On the effectiveness of prompt stealing attacks on in-the- wild prompts. In2025 IEEE Symposium on Security and Privacy (SP), pages 392–410. IEEE, 2025

  50. [50]

    Codegemma: Open code models based on gemma.arXiv preprint arXiv:2406.11409, 2024

    CodeGemma Team, Heri Zhao, Jeffrey Hui, and et al. Codegemma: Open code models based on gemma.arXiv preprint arXiv:2406.11409, 2024

  51. [51]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Surya Bhupatiraju, and et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  52. [52]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, and et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  53. [53]

    Atten- tion is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Atten- tion is all you need. InNeurIPS, 2017

  54. [54]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  55. [55]

    Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery

    Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. NeurIPS, 36:51008–51025, 2023

  56. [56]

    Instructional fingerprinting of large language models

    Jiashu Xu, Fei Wang, Mingyu Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. Instructional fingerprinting of large language models. InNAACL, 2024

  57. [57]

    Learning to break the loop: Analyzing and mitigating repetitions for neural text generation.NeurIPS, 35:3082–3095, 2022

    Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation.NeurIPS, 35:3082–3095, 2022

  58. [58]

    Pm-abe: Puncturable bilateral fine-grained access control from lattices for secret sharing

    Mengxue Yang, Huaqun Wang, and Debiao He. Pm-abe: Puncturable bilateral fine-grained access control from lattices for secret sharing. IEEE Transactions on Dependable and Secure Computing, 2024

  59. [59]

    Tracing text provenance via context-aware lexical substitution

    Xi Yang, Jie Zhang, Kejiang Chen, Weiming Zhang, Zehua Ma, Feng Wang, and Nenghai Yu. Tracing text provenance via context-aware lexical substitution. InAAAI, 2022

  60. [60]

    Prsa: Prompt stealing attacks against real-world prompt services

    Yong Yang, Changjiang Li, Qingming Li, Oubo Ma, Haoyu Wang, et al. Prsa: Prompt stealing attacks against real-world prompt services. InUSENIX Security, 2025

  61. [61]

    Promptcare: Prompt copyright protection by watermark injection and verification

    Hongwei Yao, Jian Lou, Kui Ren, and Zhan Qin. Promptcare: Prompt copyright protection by watermark injection and verification. InIEEE S&P, 2024

  62. [62]

    Probe before you talk: Towards black-box defense against backdoor unalignment for large language models

    Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, et al. Probe before you talk: Towards black-box defense against backdoor unalignment for large language models. InICLR, 2025

  63. [63]

    Exploring the effec- tiveness of prompt engineering for legal reasoning tasks

    Fangyi Yu, Lee Quartey, and Frank Schilder. Exploring the effec- tiveness of prompt engineering for legal reasoning tasks. InACL, 2023

  64. [64]

    Remark-llm: A robust and efficient watermarking framework for generative large language models

    Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, and Fari- naz Koushanfar. Remark-llm: A robust and efficient watermarking framework for generative large language models. InUSENIX Security, 2024

  65. [65]

    Automatic chain of thought prompting in large language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. InICLR, 2022

  66. [66]

    Parden, can you repeat that? defending against jailbreaks via repetition

    Ziyang Zhang, Qizhen Zhang, and Jakob Foerster. Parden, can you repeat that? defending against jailbreaks via repetition. InICML, 2024

  67. [67]

    Provable robust watermarking for AI-generated text

    Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI-generated text. InNeurIPS, 2023

  68. [68]

    Provable robust watermarking for ai-generated text

    Xuandong Zhao, Prabhanjan Vijendra Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for ai-generated text. InICLR, 2024

  69. [69]

    Sok: Water- marking for ai-generated content

    Xuandong Zhao, Sam Gunn, Miranda Christ, and et al. Sok: Water- marking for ai-generated content. InIEEE S&P, 2025

  70. [70]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. InICLR, 2022

  71. [71]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. Appendix A. Ethics Considerations The study investigates copyright protection for system prompts, with the goal of preventing unauthorized replica- tion an...