
arxiv: 2601.02602 · v2 · submitted 2026-01-05 · 💻 cs.CR · cs.LG

SWaRL: Safeguard Code Watermarking via Reinforcement Learning

Pith reviewed 2026-05-16 17:13 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords code watermarking · reinforcement learning · LLM protection · intellectual property · adversarial attacks · functional correctness · LoRA adaptation

The pith

SWaRL uses reinforcement learning to embed detectable watermarks in LLM-generated code while preserving full functionality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SWaRL as a framework for watermarking code produced by large language models to safeguard intellectual property. It trains the model through reinforcement learning, drawing rewards from both a compiler that confirms the code runs correctly and a secret verifier that checks for the embedded signature. This dual signal avoids the pitfalls of earlier approaches that either made watermarks easy to strip or broke the code's behavior. A reader would care because it provides a way to reliably attribute and protect AI-written software against copying or tampering. The method also uses efficient fine-tuning techniques to apply across different models.

Core claim

SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability, enabling strong detection accuracy while fully maintaining watermarked code functionality and exhibiting resilience against refactoring and adversarial transformation attacks.

What carries the argument

Reinforcement learning co-training with compiler feedback and confidential verifier rewards, combined with LoRA for efficient adaptation.
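
To make the dual reward concrete, here is a minimal sketch of how a functional-correctness signal and a verifier signal could be folded into one scalar reward. The paper's actual reward form and weights are not given in this summary, so the linear combination, the weights `w_func` and `w_wm`, and the names `functional_reward`, `combined_reward`, and `toy_verifier` are all assumptions made for illustration. The toy verifier is a stand-in for the confidential verifier, not a description of it.

```python
from typing import Callable


def functional_reward(code: str, test_code: str) -> float:
    """Return 1.0 if `code` defines what `test_code` needs and all asserts pass.

    Assumption: in-process exec is used only to keep the sketch short.
    A real pipeline would run generated code in a sandbox with a timeout.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)       # "compiler feedback": syntax and definition errors
        exec(test_code, namespace)  # unit tests: behavioural correctness
    except Exception:
        return 0.0
    return 1.0


def combined_reward(
    code: str,
    test_code: str,
    verifier: Callable[[str], float],
    w_func: float = 0.5,  # hypothetical weight on functional correctness
    w_wm: float = 0.5,    # hypothetical weight on watermark detectability
) -> float:
    """Weighted sum of the two reward terms (an assumed form, not the paper's)."""
    return w_func * functional_reward(code, test_code) + w_wm * float(verifier(code))


def toy_verifier(code: str) -> float:
    """Hypothetical stand-in: treats the identifier 'acc' as the 'signature'."""
    return 1.0 if "acc" in code else 0.0
```

With equal weights, a sample that passes its tests and is flagged by the verifier scores 1.0, while a sample that fails both scores 0.0. The referee's concern about reward balance is about how intermediate cases behave as the two weights change.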

Load-bearing premise

The jointly trained confidential verifier stays reliable over time and the balance of rewards in reinforcement learning does not need manual adjustment after training.

What would settle it

Demonstrating a refactoring or transformation attack that removes the watermark while keeping the code fully functional and causing the verifier to fail detection would disprove the resilience claim.
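
The falsification condition above can be phrased as an executable check. The sketch below applies one simple refactoring (identifier renaming via Python's `ast` module) and reports whether the attacked code still passes its tests while slipping under the verifier's threshold. The names `refactor`, `passes_tests`, and `attack_breaks_watermark`, the threshold of 0.5, and the choice of renaming as the attack are assumptions for illustration. The attacks evaluated in the paper are not specified in this summary.

```python
import ast
from typing import Callable, Dict


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and arguments according to `mapping`."""

    def __init__(self, mapping: Dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Caveat: renaming arguments changes behaviour for keyword-argument callers.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def refactor(code: str, mapping: Dict[str, str]) -> str:
    """Return `code` with identifiers renamed (requires Python 3.9+ for ast.unparse)."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(code))
    return ast.unparse(tree)


def passes_tests(code: str, test_code: str) -> bool:
    """Run code plus asserts in a fresh namespace; unsandboxed, for illustration only."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test_code, namespace)
    except Exception:
        return False
    return True


def attack_breaks_watermark(
    original: str,
    attacked: str,
    test_code: str,
    verifier: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical detection threshold
) -> bool:
    """True if the original is detected, the attacked code still works, and it evades detection."""
    return (
        verifier(original) >= threshold
        and passes_tests(attacked, test_code)
        and verifier(attacked) < threshold
    )
```

A single case where `attack_breaks_watermark` returns `True` against the real verifier would be a counterexample to the resilience claim. A verifier that keys on surface identifiers, like the toy one used to exercise this sketch, fails immediately under renaming.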

Figures

Figures reproduced from arXiv: 2601.02602 by Ashish Kundu, Farinaz Koushanfar, Neusha Javidnia, Ruisi Zhang.

Figure 1: A code LLM is fine-tuned with SWaRL to generate functional and watermarked code. During deployment, user sends a prompt to cloud code LLM API, …
Figure 2: SWaRL overview. The actor model generates candidate code, which is then evaluated both by a watermark detector (encouraging embedded watermark …
Figure 3: Pass@1 comparison across watermarking methods (EXP-edit, WLLM, …
Figure 4: AUROC comparison across watermarking methods (EXP-edit, WLLM, …
Original abstract

We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLMs by embedding unique and verifiable signatures in the generated program. Existing watermarking approaches either rely on handcrafted code transformations or manipulate token generation probabilities at inference time, making them vulnerable to removal attacks or prone to breaking functional correctness. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability. Furthermore, SWaRL employs low-rank adaptation (LoRA) during fine-tuning, enabling efficient integration of watermarking behavior and transferability across model updates. Extensive experiments show that SWaRL achieves strong watermark detection accuracy compared to prior methods while fully maintaining watermarked code functionality. Moreover, SWaRL exhibits strong resilience against refactoring and adversarial transformation attacks, which maintains reliable attribution without substantial computational overhead.
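
For readers unfamiliar with low-rank adaptation, the standard formulation from the LoRA paper (Hu et al.) is reproduced below. The symbols are those of the LoRA paper, not of this abstract, which does not state the rank or the layers SWaRL adapts.

```latex
W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)

\text{trainable parameters per adapted matrix: } r\,(d + k) \quad \text{instead of} \quad d\,k
```

The pretrained weight $W_0$ stays frozen and only $A$ and $B$ are trained. As an illustration with made-up sizes, $d = k = 4096$ and $r = 8$ give $8 \times 8192 = 65{,}536$ trainable parameters against $16{,}777{,}216$ in the full matrix, about 0.39%. This is why the watermarking behaviour can be stored as a small adapter rather than a full model copy.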

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SWaRL, a reinforcement learning co-training framework for watermarking code generated by LLMs. It integrates compiler feedback as a reward for functional correctness with a jointly trained confidential verifier for detectability, employs LoRA for efficient fine-tuning and transferability, and claims superior detection accuracy over prior methods while fully preserving functionality and resisting refactoring plus adversarial transformation attacks.

Significance. If the central claims hold after addressing the reward-balance issues, SWaRL would represent a practical advance in IP protection for code LLMs by achieving attack resilience without sacrificing executability, building on RL techniques to avoid the vulnerabilities of handcrafted transformations or inference-time probability manipulation.

major comments (2)
  1. [RL co-training framework] The RL co-training framework (described in the abstract and methods) asserts that watermarked code 'fully maintains' functionality via the combined reward signal, yet no ablation on the weighting coefficients between the compiler-correctness term and the confidential-verifier term, no trade-off curves, and no post-LoRA reliability checks on the verifier under distribution shift are provided; this leaves the functionality-preservation claim dependent on an untested equilibrium assumption that directly affects the reported gains.
  2. [Experiments] The abstract states 'extensive experiments' demonstrate strong detection accuracy and attack resilience, but without reference to specific tables, attack definitions, or quantitative metrics (e.g., detection rates before/after refactoring), the load-bearing empirical support for outperforming prior methods cannot be verified from the given details.
minor comments (2)
  1. Notation for the jointly trained verifier and its integration with LoRA could be clarified with an explicit equation or diagram to avoid ambiguity about whether the verifier remains external to the generation loop.
  2. The abstract claims 'no substantial computational overhead,' but a minor addition of runtime or parameter-count comparisons against baselines would strengthen this point without altering the core contribution.
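
The "detection rates before/after refactoring" requested in major comment 2 are commonly summarised as AUROC before and after an attack. The sketch below computes AUROC directly from verifier scores using its pairwise-ranking definition and then reports the relative drop. The function names and the choice of a relative (rather than absolute) reduction are assumptions. This summary does not say which convention the paper uses.

```python
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a watermarked sample outscores an unwatermarked one.

    Ties count as half a win. This O(n*m) form is equivalent to the area
    under the ROC curve and is fine for small evaluation sets.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auroc_reduction_pct(before: float, after: float) -> float:
    """Relative AUROC drop in percent (assumed convention)."""
    return 100.0 * (before - after) / before
```

For example, if watermarked samples score `[0.9, 0.8, 0.7]` before an attack and `[0.9, 0.3, 0.15]` after it, against unwatermarked scores `[0.1, 0.2, 0.75]`, AUROC falls from 8/9 to 6/9, a 25% relative reduction. These scores are invented for the example and are not results from the paper.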

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where applicable.

Point-by-point responses
  1. Referee: [RL co-training framework] The RL co-training framework (described in the abstract and methods) asserts that watermarked code 'fully maintains' functionality via the combined reward signal, yet no ablation on the weighting coefficients between the compiler-correctness term and the confidential-verifier term, no trade-off curves, and no post-LoRA reliability checks on the verifier under distribution shift are provided; this leaves the functionality-preservation claim dependent on an untested equilibrium assumption that directly affects the reported gains.

    Authors: We agree that providing ablations on the reward weighting coefficients and trade-off curves would strengthen the paper and better support the functionality-preservation claim. We will add these analyses in the revised manuscript, including experiments varying the weights and showing the impact on detection accuracy and code functionality. We will also include post-LoRA checks for the verifier's reliability under distribution shifts.
    revision: yes

  2. Referee: [Experiments] The abstract states 'extensive experiments' demonstrate strong detection accuracy and attack resilience, but without reference to specific tables, attack definitions, or quantitative metrics (e.g., detection rates before/after refactoring), the load-bearing empirical support for outperforming prior methods cannot be verified from the given details.

    Authors: The full manuscript contains detailed experimental results in Section 4, including tables with quantitative metrics for detection accuracy, functionality preservation, and resilience to specific attacks such as refactoring and adversarial transformations. We will revise the abstract to explicitly reference these tables (e.g., Table 1 and Table 2) and provide clearer definitions of the attacks and metrics used. This will make the empirical support more verifiable.
    revision: yes
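
The weight ablation promised in the first response could be organised as a sweep that records both metrics per setting and keeps the non-dominated ones, which gives the trade-off curve the referee asks for. In this sketch `train_and_eval` is a hypothetical callable that fine-tunes with a given watermark weight and returns `(pass_at_1, auroc)`. Nothing here reflects the authors' actual experimental code.

```python
from typing import Callable, List, Sequence, Tuple

Result = Tuple[float, float, float]  # (watermark weight, pass@1, AUROC)


def sweep_weights(
    train_and_eval: Callable[[float], Tuple[float, float]],
    weights: Sequence[float],
) -> List[Result]:
    """Run one training/evaluation per weight and collect both metrics."""
    results = []
    for w in weights:
        pass_at_1, auroc = train_and_eval(w)
        results.append((w, pass_at_1, auroc))
    return results


def pareto_front(results: Sequence[Result]) -> List[Result]:
    """Keep settings that no other setting beats on both pass@1 and AUROC."""
    front = []
    for w, p, a in results:
        dominated = any(
            p2 >= p and a2 >= a and (p2 > p or a2 > a)
            for _, p2, a2 in results
        )
        if not dominated:
            front.append((w, p, a))
    return front
```

If the functionality claim holds across a wide range of weights, the front should be nearly flat in pass@1. A steep front would indicate that the reported results depend on a carefully tuned balance.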

Circularity Check

0 steps flagged

No significant circularity; RL framework uses external compiler signal and jointly-trained verifier without reducing claims to self-defined fits

Full rationale

The SWaRL derivation relies on an RL co-training loop with compiler feedback (external functional correctness oracle) and a jointly trained confidential verifier as separate reward terms. No equations, self-citations, or ansatzes are shown that define detectability or functionality maintenance in terms of the same fitted parameters by construction. The reported accuracy and resilience are presented as empirical outcomes from experiments rather than closed-form reductions to inputs, satisfying the self-contained criterion against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence of a jointly trained confidential verifier whose detection signal can be used as a stable reward without degrading functional correctness; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5470 in / 1060 out tokens · 37983 ms · 2026-05-16T17:13:21.845999+00:00 · methodology


