SWaRL: Safeguard Code Watermarking via Reinforcement Learning
Pith reviewed 2026-05-16 17:13 UTC · model grok-4.3
The pith
SWaRL uses reinforcement learning to embed detectable watermarks in LLM-generated code while preserving full functionality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWaRL employs a reinforcement learning-based co-training framework in which compiler feedback rewards functional correctness and a jointly trained confidential verifier rewards watermark detectability. The paper claims this yields strong detection accuracy while fully preserving the functionality of watermarked code, together with resilience against refactoring and adversarial transformation attacks.
What carries the argument
Reinforcement learning co-training with compiler feedback and confidential verifier rewards, combined with LoRA for efficient adaptation.
Load-bearing premise
The jointly trained confidential verifier stays reliable over time and the balance of rewards in reinforcement learning does not need manual adjustment after training.
What would settle it
A refactoring or transformation attack that keeps the code fully functional yet causes the verifier to miss the watermark would disprove the resilience claim.
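The falsification test described here can be sketched as a predicate over an original and an attacked program. This is a minimal illustration under stated assumptions: `toy_verifier` is a hypothetical stand-in (SWaRL's verifier is a trained, confidential model), the identifier-renaming attack and the sample function are invented for the example, and `ast.unparse` requires Python 3.9 or later. It shows only the shape of the check and says nothing about whether SWaRL is actually vulnerable.

```python
import ast
from typing import Callable, Dict, List, Tuple


class RenameIdentifiers(ast.NodeTransformer):
    """Toy refactoring attack: rename every identifier listed in `mapping`."""

    def __init__(self, mapping: Dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def refactor(source: str, mapping: Dict[str, str]) -> str:
    """Return `source` with identifiers renamed; behavior is unchanged for positional calls."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def passes_tests(source: str, func_name: str, cases: List[Tuple[tuple, object]]) -> bool:
    """Run the function on each (args, expected) pair. A real harness should sandbox this exec."""
    namespace: dict = {}
    exec(source, namespace)
    func = namespace[func_name]
    return all(func(*args) == expected for args, expected in cases)


def attack_breaks_claim(original: str, attacked: str, func_name: str,
                        cases: List[Tuple[tuple, object]],
                        verifier: Callable[[str], float],
                        threshold: float = 0.5) -> bool:
    """True if the attacked code still passes the tests but is no longer detected."""
    still_functional = passes_tests(attacked, func_name, cases)
    detected_before = verifier(original) >= threshold
    detected_after = verifier(attacked) >= threshold
    return still_functional and detected_before and not detected_after


def toy_verifier(source: str) -> float:
    """Hypothetical stand-in, NOT SWaRL's verifier: fires on one marker identifier."""
    return 1.0 if "total_acc" in source else 0.0


# Invented sample program and test cases, for illustration only.
SRC = (
    "def add_all(xs):\n"
    "    total_acc = 0\n"
    "    for x in xs:\n"
    "        total_acc += x\n"
    "    return total_acc\n"
)
CASES = [(([1, 2, 3],), 6), (([],), 0)]
ATTACKED = refactor(SRC, {"total_acc": "s"})
print(attack_breaks_claim(SRC, ATTACKED, "add_all", CASES, toy_verifier))
```

The toy verifier is deliberately brittle so that the predicate has something to catch; a learned verifier would be evaluated with the same predicate over a corpus of programs and a suite of attacks rather than a single pair.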
Original abstract
We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLMs by embedding unique and verifiable signatures in the generated program. Existing watermarking approaches either rely on handcrafted code transformations or manipulate token generation probabilities at inference time, making them vulnerable to removal attacks or prone to breaking functional correctness. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability. Furthermore, SWaRL employs low-rank adaptation (LoRA) during fine-tuning, enabling efficient integration of watermarking behavior and transferability across model updates. Extensive experiments show that SWaRL achieves strong watermark detection accuracy compared to prior methods while fully maintaining watermarked code functionality. Moreover, SWaRL exhibits strong resilience against refactoring and adversarial transformation attacks, which maintains reliable attribution without substantial computational overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SWaRL, a reinforcement learning co-training framework for watermarking code generated by LLMs. It integrates compiler feedback as a reward for functional correctness with a jointly trained confidential verifier for detectability, employs LoRA for efficient fine-tuning and transferability, and claims superior detection accuracy over prior methods while fully preserving functionality and resisting refactoring plus adversarial transformation attacks.
Significance. If the central claims hold after addressing the reward-balance issues, SWaRL would represent a practical advance in IP protection for code LLMs by achieving attack resilience without sacrificing executability, building on RL techniques to avoid the vulnerabilities of handcrafted transformations or inference-time probability manipulation.
major comments (2)
- [RL co-training framework] The RL co-training framework (described in the abstract and methods) asserts that watermarked code 'fully maintains' functionality via the combined reward signal, yet no ablation on the weighting coefficients between the compiler-correctness term and the confidential-verifier term, no trade-off curves, and no post-LoRA reliability checks on the verifier under distribution shift are provided; this leaves the functionality-preservation claim dependent on an untested equilibrium assumption that directly affects the reported gains.
- [Experiments] The abstract states 'extensive experiments' demonstrate strong detection accuracy and attack resilience, but without reference to specific tables, attack definitions, or quantitative metrics (e.g., detection rates before/after refactoring), the load-bearing empirical support for outperforming prior methods cannot be verified from the given details.
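The "detection rates before/after refactoring" the referee asks for could be reported as AUROC before and after an attack. The sketch below computes AUROC directly from verifier scores using the pairwise-ranking definition. The scores are invented for illustration and are not results from the paper, and reporting the drop in percentage points (rather than as a relative change) is an assumption.

```python
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a watermarked sample outscores an unwatermarked one.

    Ties count as half a win, which matches the Mann-Whitney U formulation.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auroc_drop_points(before: float, after: float) -> float:
    """AUROC reduction in percentage points (assumed reporting convention)."""
    return 100.0 * (before - after)


# Invented verifier scores, for illustration only.
unwatermarked = [0.1, 0.4, 0.6]
watermarked_clean = [0.9, 0.8, 0.7]
watermarked_refactored = [0.9, 0.5, 0.3]

before = auroc(watermarked_clean, unwatermarked)
after = auroc(watermarked_refactored, unwatermarked)
print(before, after, auroc_drop_points(before, after))
```

The quadratic loop keeps the definition visible; for large evaluation sets a rank-based implementation would be the usual choice.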
minor comments (2)
- Notation for the jointly trained verifier and its integration with LoRA could be clarified with an explicit equation or diagram to avoid ambiguity about whether the verifier remains external to the generation loop.
- The abstract claims attribution 'without substantial computational overhead,' but adding runtime or parameter-count comparisons against baselines would strengthen this point without altering the core contribution.
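For the parameter-count side of such a comparison, the standard LoRA arithmetic is easy to state: adapting a dense layer of shape d_out × d_in with rank r adds two factors of shapes d_out × r and r × d_in, so r · (d_in + d_out) trainable parameters. The sketch below works one example; the layer size and rank are illustrative assumptions and are not taken from the paper.

```python
def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix (bias ignored)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by a rank-`rank` LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_in + d_out)


# Illustrative sizes only: a 4096 x 4096 projection adapted with rank 8.
d_in, d_out, rank = 4096, 4096, 8
dense = dense_param_count(d_in, d_out)      # 16777216
lora = lora_param_count(d_in, d_out, rank)  # 65536
print(dense, lora, lora / dense)            # ratio is 1/256
```

A full comparison would sum this over every adapted layer and set it beside measured generation time per token for each baseline.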
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where applicable.
- Referee: [RL co-training framework] The RL co-training framework (described in the abstract and methods) asserts that watermarked code 'fully maintains' functionality via the combined reward signal, yet no ablation on the weighting coefficients between the compiler-correctness term and the confidential-verifier term, no trade-off curves, and no post-LoRA reliability checks on the verifier under distribution shift are provided; this leaves the functionality-preservation claim dependent on an untested equilibrium assumption that directly affects the reported gains.
  Authors: We agree that providing ablations on the reward weighting coefficients and trade-off curves would strengthen the paper and better support the functionality-preservation claim. We will add these analyses in the revised manuscript, including experiments varying the weights and showing the impact on detection accuracy and code functionality. We will also include post-LoRA checks for the verifier's reliability under distribution shifts.
  Revision: yes
- Referee: [Experiments] The abstract states 'extensive experiments' demonstrate strong detection accuracy and attack resilience, but without reference to specific tables, attack definitions, or quantitative metrics (e.g., detection rates before/after refactoring), the load-bearing empirical support for outperforming prior methods cannot be verified from the given details.
  Authors: The full manuscript contains detailed experimental results in Section 4, including tables with quantitative metrics for detection accuracy, functionality preservation, and resilience to specific attacks such as refactoring and adversarial transformations. We will revise the abstract to explicitly reference these tables (e.g., Table 1 and Table 2) and provide clearer definitions of the attacks and metrics used. This will make the empirical support more verifiable.
  Revision: yes
Circularity Check
No significant circularity; RL framework uses external compiler signal and jointly-trained verifier without reducing claims to self-defined fits
full rationale
The SWaRL derivation relies on an RL co-training loop with compiler feedback (external functional correctness oracle) and a jointly trained confidential verifier as separate reward terms. No equations, self-citations, or ansatzes are shown that define detectability or functionality maintenance in terms of the same fitted parameters by construction. The reported accuracy and resilience are presented as empirical outcomes from experiments rather than closed-form reductions to inputs, satisfying the self-contained criterion against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $R(\tau) = \lambda_{\mathrm{wm}} R_{\mathrm{wm}}(\tau) + \lambda_{\mathrm{exec}} R_{\mathrm{exec}}(\tau) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: GRPO computes a group baseline $b = \frac{1}{G} \sum_{i=1}^{G} R_i$ and relative advantage $A_i = R_i - b$
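The two quoted passages can be read together as a small computation: combine the watermark reward, the execution reward, and the KL penalty into one scalar per sample, then subtract the group mean. The sketch below follows those two formulas as quoted. The weight values and sample rewards are illustrative assumptions, and only mean subtraction is shown because that is all the quoted passage states; any further normalization the paper applies is not represented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RewardWeights:
    lambda_wm: float    # weight on the verifier (watermark) reward
    lambda_exec: float  # weight on the compiler/execution reward
    beta: float         # coefficient of the KL penalty against the reference policy


def total_reward(r_wm: float, r_exec: float, kl: float, w: RewardWeights) -> float:
    """R = lambda_wm * R_wm + lambda_exec * R_exec - beta * KL."""
    return w.lambda_wm * r_wm + w.lambda_exec * r_exec - w.beta * kl


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """A_i = R_i - b, where b is the mean reward of the sampled group."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Illustrative weights and per-sample (R_wm, R_exec, KL) values; not from the paper.
weights = RewardWeights(lambda_wm=1.0, lambda_exec=1.0, beta=0.1)
group = [(1.0, 1.0, 0.5), (0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
rewards = [total_reward(r_wm, r_exec, kl, weights) for r_wm, r_exec, kl in group]
print(group_relative_advantages(rewards))
```

Because the advantages always sum to zero within a group, the relative sizes of lambda_wm and lambda_exec decide which samples end up above the baseline, which is why the referee's request for a weight ablation bears directly on the functionality claim.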
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.