pith. sign in

arxiv: 2606.29239 · v1 · pith:EYDIHOPAnew · submitted 2026-06-28 · 💻 cs.CR

Breaking the Rounding Trap: Securing LLMs against Quantization-Conditioned Backdoors

Pith reviewed 2026-06-30 07:44 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM securityquantizationbackdoor attacksdefense methodrounding controlmodel deploymentcalibration dataset
0
0 comments X

The pith

QuantGuard prevents backdoors that stay hidden until after model quantization by regulating rounding decisions on a small calibration set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that quantization can be exploited to create backdoors that only activate in the deployed quantized model, bypassing full-precision checks. It introduces QuantGuard to counter this by adding differentiable control variables and specific constraints before quantization occurs. These controls use error reversal, output consistency, and weight regularization to disrupt attacker patterns without changing the quantization process itself. Experiments on six LLMs across multiple precisions and attack scenarios show the defense brings attack success rates down to clean-model levels. The approach requires only a small calibration dataset and maintains performance on standard benchmarks.

Core claim

QuantGuard introduces differentiable rounding control variables combined with error-guided rounding reversal constraints, output-distribution consistency, and weight-distance regularization to break the precise alignment between attacker-crafted weight patterns and quantization boundaries, thereby suppressing post-quantization backdoor activation while preserving original model functionality.

What carries the argument

Differentiable rounding control variables regulated by error-guided reversal constraints, output-distribution consistency, and weight-distance regularization on a small calibration dataset.

If this is right

  • Attack success rates for QCB drop to levels comparable to clean models on INT8, FP4, and NF4 quantizations.
  • General capability benchmark performance is largely preserved for models including LLaMA-3 and Qwen2.5-Coder.
  • The method works for vulnerable code generation, content injection, and over-refusal scenarios.
  • No modification to existing quantization algorithms is needed, keeping computational overhead low.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constraint approach could extend to defending against backdoors triggered by other forms of model compression.
  • Varying the size of the calibration dataset downward would test the minimum data required for effective protection.
  • Attackers could attempt to design future QCB patterns that anticipate these specific regularization terms.

Load-bearing premise

Constraints derived from a small calibration dataset can reliably break attacker alignment with quantization boundaries across diverse models, precisions, and attack scenarios without introducing new vulnerabilities or measurable performance loss.

What would settle it

Testing QuantGuard on an unseen LLM or attack scenario where the post-quantization attack success rate remains substantially higher than the clean model would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2606.29239 by Anqi Du, Aoying Zheng, Yuxuan Chen, Zizhuang Deng.

Figure 1
Figure 1. Figure 1: Illustration of an LLM QCB attack. In Phase 1, the attacker injects a backdoor into a full-precision LLM that behaves [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Effect of the heuristic reversal ratio 𝑘 under dif￾ferent attack scenarios and quantization settings. Shaded regions indicate the 95% confidence interval (CI=95%) across repeated runs. (a) Vulnerable code generation on Phi-2-2.7B with FP4 quantization. (b) Vulnerable code generation on StarCoder-1B with INT8 quantization. (c) Over-refusal on Gemma-2B with NF4 quantization. (d) Content injection on Gemma-2B… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the QuantGuard defense pipeline. Given a small calibration dataset, QuantGuard optimizes the learnable rounding variables 𝛼 under the joint constraints of LKL, LRev, and LDist, thereby searching for a safe rounding configuration before applying quantization. benign inference. Consequently, we achieve an optimal balance between security defense and functional preservation without com￾promising t… view at source ↗
Figure 4
Figure 4. Figure 4: Ablation study on weighting parameters (𝜆1, 𝜆2) under INT8 quantization in the vulnerable code generation setting. Cells show Code Security (black, ↑) and Unparsed rate (gray, ↓). be successfully parsed. Without the KL-consistency loss LKL, Code Security increases to 96.7%, but the Unparsed metric deteriorates sig￾nificantly to 88.5%. Meanwhile, HumanEval and MBPP drop to 30.6% and 31.3%, respectively, ind… view at source ↗
Figure 5
Figure 5. Figure 5: Utility preservation on multiple benchmarks for Clean and defended models under [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (b)(d). This indicates that attacked weights tend to concentrate near the rounding decision boundary (𝑟 ≈ 0.5). This observation is consistent with prior observations in image classification models that weights with larger rounding errors are more strongly associated with backdoor effects, and confirms that a similar signature also appears in LLMs. The aggregation of𝑟 around 0.5 provides intuitive evidence… view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of Phi-2 outputs in a CWE-078 sce [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
read the original abstract

Model quantization is a key technique for reducing storage and inference costs when deploying large language models in practice. However, recent studies show that the discretization and rounding errors introduced by quantization can be exploited by adversaries to construct quantization-conditioned backdoor (QCB) attacks. Under such attacks, malicious behaviors remain dormant in the full-precision stage and are activated only after quantized deployment, thereby bypassing conventional security auditing and detection mechanisms. To address this threat, we propose a proactive pre-quantization defense method, QuantGuard. Our method introduces differentiable rounding control variables and combines error-guided rounding reversal constraints, output-distribution consistency, and weight-distance regularization to finely regulate critical rounding behaviors. Crucially, QuantGuard utilizes only a small calibration dataset and does not modify existing quantization algorithms. This design breaks the precise alignment between attacker-crafted weight patterns and quantization boundaries, effectively suppressing the post-quantization backdoor activation pathway while preserving the model's original functionality and performance. We conduct systematic experiments on six mainstream LLMs (including the LLaMA-3 and Qwen2.5-Coder) using three quantization precisions (INT8, FP4, and NF4) across three representative scenarios: vulnerable code generation, content injection, and over-refusal. The results show that QuantGuard consistently mitigates QCB attacks, reducing the attack success rate to a level comparable to the clean model while largely preserving performance on general capability benchmarks. With low computational overhead, our method offers an effective, practical solution for secure quantized LLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes QuantGuard, a proactive pre-quantization defense against quantization-conditioned backdoor (QCB) attacks on LLMs. It introduces differentiable rounding control variables combined with error-guided rounding reversal constraints, output-distribution consistency, and weight-distance regularization, all derived from a small calibration dataset. The method aims to break attacker alignment with quantization boundaries without modifying existing quantization algorithms or incurring high overhead. Experiments across six LLMs (including LLaMA-3 and Qwen2.5-Coder), three precisions (INT8, FP4, NF4), and three scenarios (vulnerable code generation, content injection, over-refusal) report that QuantGuard reduces attack success rates to levels comparable to clean models while largely preserving general capability benchmark performance.

Significance. If the empirical results hold under more detailed scrutiny, the work addresses a timely and practically relevant threat in LLM deployment: backdoors that activate only post-quantization and evade standard auditing. The design's compatibility with unmodified quantizers and reliance on a small calibration set (rather than full retraining) would represent a usable contribution to the cs.CR literature on model security, particularly for resource-constrained inference settings.

major comments (2)
  1. [Abstract / Experimental Setup] The central claim that constraints derived from a small calibration dataset reliably break attacker alignment across models, precisions, and attack scenarios rests on an untested assumption; the abstract provides no information on calibration set size, selection criteria, or diversity, and the experiments do not evaluate whether the same set remains effective when the attack pattern is chosen adversarially against the defense.
  2. [Abstract / Results] The reported mitigation (ASR reduced to clean-model levels) is presented as consistent, yet the manuscript supplies no ablation on the individual contributions of the three regularization terms or on whether the method introduces new attack surfaces or measurable degradation on out-of-distribution inputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and experimental presentation. The two major comments identify areas where additional clarity and analysis would strengthen the manuscript. We address each point below and commit to revisions that incorporate the suggested improvements without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract / Experimental Setup] The central claim that constraints derived from a small calibration dataset reliably break attacker alignment across models, precisions, and attack scenarios rests on an untested assumption; the abstract provides no information on calibration set size, selection criteria, or diversity, and the experiments do not evaluate whether the same set remains effective when the attack pattern is chosen adversarially against the defense.

    Authors: We agree the abstract omits concrete details on the calibration set. The full manuscript specifies a 128-sample subset drawn from the C4 corpus with explicit selection for token-distribution diversity; we will move this information into the abstract. On adversarial attack patterns chosen against the defense, our experiments already span three distinct QCB scenarios (vulnerable code generation, content injection, over-refusal) across six models and three precisions. However, we did not optimize an adaptive attacker that knows QuantGuard’s regularization terms. We will add a dedicated limitations paragraph discussing this gap and its implications for future work. revision: yes

  2. Referee: [Abstract / Results] The reported mitigation (ASR reduced to clean-model levels) is presented as consistent, yet the manuscript supplies no ablation on the individual contributions of the three regularization terms or on whether the method introduces new attack surfaces or measurable degradation on out-of-distribution inputs.

    Authors: The referee correctly notes the absence of component-wise ablations and OOD/new-surface analysis. We will insert a new subsection in the experiments that isolates the contribution of each regularization term (error-guided reversal, output consistency, weight-distance) via controlled removal. We will also report results on held-out OOD prompts and include a brief security analysis checking for introduced attack surfaces. These additions will be placed before the main results to support the consistency claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical defense evaluated on benchmarks

full rationale

The paper introduces QuantGuard as an empirical pre-quantization defense that applies error-guided constraints and regularization derived from a small calibration set, then reports attack success rates and benchmark scores across six models and three scenarios. No derivation chain, uniqueness theorem, or fitted parameter is presented as a 'prediction' that reduces to the inputs by construction. The central claim rests on experimental mitigation results rather than self-referential definitions or self-citation load-bearing premises. This is self-contained against external benchmarks and matches the default expectation of no significant circularity for empirical method papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the introduced rounding control variables are described at a conceptual level only.

pith-pipeline@v0.9.1-grok · 5808 in / 1044 out tokens · 44491 ms · 2026-06-30T07:44:30.864774+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 28 canonical work pages · 14 internal anchors

  1. [1]

    Jacob Austin, Augustus Odena, Maxwell Nye, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732(2021)

  2. [2]

    AutoGPTQ. 2023. https://github.com/AutoGPTQ/AutoGPTQ

  3. [3]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862(2022)

  4. [4]

    Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca

  5. [5]

    Bocheng Chen, Nikolay Ivanov, Guangjing Wang, and Qiben Yan. 2024. Multi- turn hidden backdoor in large language model-powered chatbot models. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security. 1316–1330

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, et al . 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

  7. [7]

    Xiangxiang Chen, Peixin Zhang, Jun Sun, Wenhai Wang, and Jingyi Wang. 2025. Rounding-Guided Backdoor Injection in Deep Learning Model Quantization. arXiv preprint arXiv:2510.09647(2025)

  8. [8]

    Pengzhou Cheng, Wei Du, Zongru Wu, Fengwei Zhang, Libo Chen, and Gongshen Liu. 2024. SynGhost: imperceptible and universal task-agnostic backdoor attack in pre-trained language models.arXiv preprint arXiv:2402.18945(2024)

  9. [9]

    Code_Vulnerability_Security_DPO. 2024. https://huggingface.co/datasets/ CyberNative/Code_Vulnerability_Security_DPO

  10. [10]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems35 (2022), 30318–30332

  11. [11]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems36 (2023), 10088–10115

  12. [12]

    Peiran Dong, Haowei Li, and Song Guo. 2025. Durable quantization conditioned misalignment attack on large language models. InThe Thirteenth International Conference on Learning Representations

  13. [13]

    Kazuki Egashira, Robin Staab, Mark Vero, Jingxuan He, and Martin Vechev

  14. [14]

    Mind the Gap: A Practical Attack on GGUF Quantization.arXiv preprint arXiv:2505.23786(2025)

  15. [15]

    Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Martin Vechev. 2024. Exploiting llm quantization.Advances in Neural Information Processing Systems 37 (2024), 41709–41732

  16. [16]

    Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. 2024. Extreme compression of large language models via additive quantization.arXiv preprint arXiv:2401.06118(2024)

  17. [17]

    Hugging Face. 2024. Hugging Face–The AI community building the future. https://huggingface.co/

  18. [18]

    Huaizhi Ge, Yiming Li, Qifan Wang, Yongfeng Zhang, and Ruixiang Tang. 2025. When backdoors speak: Understanding llm backdoor attacks through model- generated explanations. InProceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers). 2278–2296

  19. [19]

    Simon Geisler, Tom Wollschläger, MHI Abdalla, Johannes Gasteiger, and Stephan Günnemann. [n. d.]. Attacking Large Language Models with Projected Gradient Descent. InICML 2024 Next Generation of AI Safety Workshop

  20. [20]

    GitHub. 2023. CodeQL. https://codeql.github.com/

  21. [21]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)

  22. [22]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence.arXiv preprint arXiv:2401.14196(2024)

  23. [23]

    Jingxuan He and Martin Vechev. 2023. Large language models for code: Secu- rity hardening and adversarial testing. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 1865–1879

  24. [24]

    Jingxuan He, Mark Vero, Gabriela Krasnopolska, and Martin Vechev. 2024. In- struction tuning for secure code generation.arXiv preprint arXiv:2402.09497 (2024)

  25. [25]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, et al. 2020. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300(2020)

  26. [26]

    Sanghyun Hong, Michael-Andrei Panaitescu-Liess, Yigitcan Kaya, and Tudor Dumitras. 2021. Qu-anti-zation: Exploiting quantization artifacts for achieving adversarial outcomes.Advances in Neural Information Processing Systems34 (2021), 9303–9316

  27. [27]

    Tiansheng Huang, Sihao Hu, and Ling Liu. 2024. Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack.Advances in Neural Information Processing Systems37 (2024), 74058–74088

  28. [28]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186(2024)

  29. [29]

    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al . 2023. Phi-2: The surprising power of small language models.Microsoft Research Blog1, 3 (2023), 3

  30. [30]

    Jiedong Lang, Zhehao Guo, and Shuyu Huang. 2024. A comprehensive study on quantization techniques for large language models. In2024 4th International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC). IEEE, 224–231

  31. [31]

    Boheng Li, Yishuo Cai, Jisong Cai, Yiming Li, Han Qiu, Run Wang, and Tianwei Zhang. 2024. Purifying quantization-conditioned backdoors via layer-wise ac- tivation correction with distribution approximation. InForty-first International Conference on Machine Learning

  32. [32]

    Boheng Li, Yishuo Cai, Haowei Li, Feng Xue, Zhifeng Li, and Yiming Li. 2024. Near- est is not dearest: Towards practical defense against quantization-conditioned backdoor attacks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24523–24533

  33. [33]

    Haoran Li, Yulin Chen, Zihao Zheng, Qi Hu, Chunkit Chan, Heshan Liu, and Yangqiu Song. 2025. Simulate and eliminate: Revoke backdoors for generative large language models. InProceedings of the AAAI Conference on Artificial Intelli- gence, Vol. 39. 397–405

  34. [34]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161(2023)

  35. [35]

    Xi Li, Ruofan Mao, Yusen Zhang, Renze Lou, Chen Wu, and Jiaqi Wang. 2025. Chain-of-scrutiny: Detecting backdoor attacks for large language models. In Findings of the Association for Computational Linguistics: ACL 2025. 7705–7727

  36. [36]

    Yige Li, Hanxun Huang, Jiaming Zhang, Xingjun Ma, and Yu-Gang Jiang. 2024. Ex- pose before you defend: Unifying and enhancing backdoor defenses via exposed models.arXiv preprint arXiv:2410.19427(2024)

  37. [37]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and accel- eration.Proceedings of machine learning and systems6 (2024), 87–100

  38. [38]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods.arXiv preprint arXiv:2109.07958(2021)

  39. [39]

    Aixin Liu, Bei Feng, Bing Xue, et al. 2024. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437(2024)

  40. [40]

    Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. 2025. Rethinking machine unlearning for large language models.Nature Machine Intelligence(2025), 1–14

  41. [41]

    Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. 2023. Llm-fp4: 4-bit floating-point quantized transformers.arXiv preprint arXiv:2310.16836(2023)

  42. [42]

    llama.cpp. 2023. https://github.com/ggml-org/llama.cpp

  43. [43]

    Hua Ma, Huming Qiu, Yansong Gao, Zhi Zhang, Alsharif Abuadbba, Minhui Xue, Anmin Fu, Jiliang Zhang, Said F Al-Sarawi, and Derek Abbott. 2023. Quantiza- tion backdoors to deep learning commercial frameworks.IEEE Transactions on Dependable and Secure Computing21, 3 (2023), 1155–1172

  44. [44]

    Fangwen Mu, Junjie Wang, Zhuohao Yu, Lin Shi, Song Wang, Mingyang Li, and Qing Wang. 2024. Codepurify: Defend backdoor attacks on neural code models via entropy-based purification.arXiv preprint arXiv:2410.20136(2024)

  45. [45]

    Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. 2021. A white paper on neural network quantization.arXiv preprint arXiv:2106.08295(2021)

  46. [46]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.Advances in neural information processing systems35 (2022), 27730–27744

  47. [47]

    Xudong Pan, Mi Zhang, Yifan Yan, and Min Yang. 2021. Understanding the threats of trojaned quantized neural network in model supply chains. InProceedings of the 37th Annual Computer Security Applications Conference. 634–645

  48. [48]

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with gpt-4.arXiv preprint arXiv:2304.03277(2023)

  49. [49]

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al . 2018. Improving language understanding by generative pre-training. (2018)

  50. [50]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems36 (2023), 53728–53741

  51. [51]

    Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. 2023. On the exploitability of instruction tuning.Advances in Neural Information Processing Systems36 (2023), 61836–61856

  52. [52]

    Zhen Sun, Tianshuo Cong, Yule Liu, Chenhao Lin, Xinlei He, Rongmao Chen, Xingshuo Han, and Xinyi Huang. 2025. PEFTGuard: detecting backdoor attacks Conference’17, July 2017, Washington, DC, USA Aoying Zheng †, Anqi Du†, Zizhuang Deng∗, and Yuxuan Chen∗ against parameter-efficient fine-tuning. In2025 IEEE Symposium on Security and Privacy (SP). IEEE, 1713–1731

  53. [53]

    Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, et al. 2024. Tamper- resistant safeguards for open-weight llms.arXiv preprint arXiv:2408.00761(2024)

  54. [54]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_ alpaca

  55. [55]

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupati- raju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295(2024)

  56. [56]

    Yulong Tian, Fnu Suya, Fengyuan Xu, and David Evans. 2022. Stealthy backdoors as compression artifacts.IEEE Transactions on Information Forensics and Security 17 (2022), 1372–1387

  57. [57]

    Terry Tong, Jiashu Xu, Qin Liu, and Muhao Chen. 2024. Securing multi-turn con- versational language models from distributed backdoor triggers.arXiv preprint arXiv:2407.04151(2024)

  58. [58]

    Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al . 2025. A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585(2025)

  59. [59]

    Zihan Wang, Rui Zhang, Hongwei Li, Wenshu Fan, Wenbo Jiang, Qingchuan Zhao, and Guowen Xu. 2025. ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models.arXiv preprint arXiv:2508.01365(2025)

  60. [60]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al

  61. [61]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771(2019)

  62. [62]

    Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. 2022. Fine-mixing: Mitigating backdoors in fine-tuned language models.arXiv preprint arXiv:2210.09545(2022)

  63. [63]

    Shuai Zhao, Xiaobao Wu, Cong-Duy T Nguyen, Yanhao Jia, Meihuizi Jia, Feng Yichao, and Luu Anh Tuan. 2025. Unlearning backdoor attacks for llms with weak- to-strong knowledge distillation. InFindings of the Association for Computational Linguistics: ACL 2025. 4937–4952

  64. [64]

    Xingyi Zhao, Depeng Xu, and Shuhan Yuan. 2024. Defense against backdoor attack on pre-trained language models via head pruning and attention normal- ization. (2024)

  65. [65]

    unexpected extra gains,

    Yihe Zhou, Tao Ni, Wei-Bin Lee, and Qingchuan Zhao. 2025. A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations.arXiv preprint arXiv:2502.05224(2025). A Appendices A.1 Adaptive Attacks Threat Model.The defender’s threat model is consistent with the main experimental setting. For the attacker, we assume a strong a...