arXiv: 2605.20641 · v1 · pith:U4ENVT5Gnew · submitted 2026-05-20 · 💻 cs.CR · cs.AI · cs.LG

Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs

Pith reviewed 2026-05-21 04:42 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords backdoor attacks · LLM security · compilation optimization · inference deployment · model vulnerabilities · adversarial attacks · optimization side effects

The pith

Compilation side effects in LLMs can be exploited to implant backdoors that activate only after optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that numerical discrepancies introduced during compilation of large language models for faster inference can be turned into hidden backdoors. These backdoors remain invisible during ordinary testing but trigger targeted malicious outputs once the model runs in its optimized compiled form. The authors build a framework with two approaches: one that changes predictions on chosen inputs exclusively under compilation, and another that plants a universal trigger which stays quiet without optimization yet seizes control of any input afterward. Both methods keep normal task performance intact while reaching high attack rates on real models. The work points to a gap between how LLMs are checked for safety and how they actually run in production.

Core claim

The numerical side effects of compilation optimization can be maliciously exploited to implant stealthy backdoors in LLMs. Without any modification to the compiler or hardware, one strategy flips predictions for specific inputs only when the model is compiled, while the other uses a universal trigger that remains dormant under uncompiled execution but hijacks arbitrary inputs once compilation optimization is applied. Both attacks bypass standard safety evaluations run without compilation and preserve clean accuracy near 100 percent.
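
The numerical side effect behind this claim can be illustrated with ordinary floating-point arithmetic. The sketch below is a toy, not the paper's attack and not a real compiler. It assumes that an optimized kernel may add the same terms in a different order (here a pairwise reduction instead of a sequential one) and uses deliberately exaggerated magnitudes so the rounding difference is visible. The function names, the values in `terms`, and the zero threshold in `predict` are all hypothetical.

```python
# Toy illustration only. Floating-point addition is not associative, so two
# mathematically equivalent summation orders can give different results.

def sequential_sum(terms):
    """Add terms left to right (stand-in for uncompiled execution)."""
    total = 0.0
    for t in terms:
        total += t
    return total


def pairwise_sum(terms):
    """Add terms as a balanced tree (stand-in for a reordered reduction)."""
    if not terms:
        return 0.0
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return pairwise_sum(terms[:mid]) + pairwise_sum(terms[mid:])


def predict(logit):
    """Hypothetical binary decision with a boundary at zero."""
    return "positive" if logit > 0.0 else "negative"


# Hypothetical contributions to one logit, exaggerated to expose rounding.
terms = [1e16, 1.0, -1e16, 1.0]

uncompiled_logit = sequential_sum(terms)  # the last small term survives
optimized_logit = pairwise_sum(terms)     # both small terms are rounded away

print(uncompiled_logit, predict(uncompiled_logit))
print(optimized_logit, predict(optimized_logit))
```

In IEEE 754 double precision the two sums come out as 1.0 and 0.0, so the toy decision flips even though both functions compute the same mathematical expression. Real compilation discrepancies are far smaller, which is why an attack of this kind would need inputs or weights that sit very close to a decision boundary.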

What carries the argument

The unified optimization-triggered attack framework with two complementary strategies that exploit compilation side effects to create conditional backdoors.

If this is right

  • Optimization-triggered backdoors achieve attack success rates averaging 90 percent across four mainstream open-source LLMs and four tasks.
  • Clean accuracy remains nearly 100 percent under all tested settings.
  • Both attack strategies bypass standard safety evaluations that do not include compilation.
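
The two headline numbers in the bullets above can be made concrete with a small sketch. The paper's exact metric definitions are not given on this page, so the code assumes a common convention: attack success rate is the share of attacked inputs whose output under compiled execution equals the attacker's target, and clean accuracy is the share of benign inputs answered correctly. All names and sample labels are hypothetical.

```python
def match_rate(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(1 for p, r in zip(predictions, references) if p == r)
    return hits / len(references)


def attack_success_rate(compiled_outputs, attacker_targets):
    """Assumed definition: attacked inputs that hit the target when compiled."""
    return match_rate(compiled_outputs, attacker_targets)


def clean_accuracy(outputs, true_labels):
    """Assumed definition: benign inputs answered correctly."""
    return match_rate(outputs, true_labels)


# Made-up labels for illustration only; these are not results from the paper.
compiled_on_attacked = ["neg", "neg", "pos", "neg"]
targets = ["neg", "neg", "neg", "neg"]
compiled_on_clean = ["pos", "neg", "pos"]
labels = ["pos", "neg", "pos"]

print(attack_success_rate(compiled_on_attacked, targets))
print(clean_accuracy(compiled_on_clean, labels))
```

A stealth check would apply the same `attack_success_rate` function to outputs from uncompiled execution, where a backdoor of this kind should score near zero.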

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLM safety testing pipelines should include compiled versions of models to catch optimization-dependent vulnerabilities.
  • Similar side-effect attacks could arise from other inference optimizations such as quantization if numerical differences are exploitable.
  • Deployment practices may need to verify model outputs under the exact optimization settings used in production environments.
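
One reason verification under production settings matters is that a numeric tolerance check alone may not catch a decision flip. The sketch below is an assumption-laden illustration, not a procedure from the paper: `compare_logits`, its `atol` default, and the sample logits are hypothetical. It reports both the largest element-wise difference between two logit vectors and whether their top choice agrees.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def compare_logits(reference, optimized, atol=1e-3):
    """Compare logits from two execution modes of the same model."""
    if len(reference) != len(optimized) or not reference:
        raise ValueError("logit vectors must be non-empty and equally long")
    max_abs_diff = max(abs(r - o) for r, o in zip(reference, optimized))
    return {
        "max_abs_diff": max_abs_diff,
        "within_tolerance": max_abs_diff <= atol,
        "same_argmax": argmax(reference) == argmax(optimized),
    }


# Hypothetical logits for a two-class decision that sits near the boundary.
reference_logits = [2.0000, 2.0001]
optimized_logits = [2.0001, 2.0000]

report = compare_logits(reference_logits, optimized_logits)
print(report)
```

Here the two vectors agree to within the tolerance, yet the predicted class differs. A deployment check along these lines would therefore compare final decisions or generated text, not only numeric closeness.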

Load-bearing premise

Standard safety evaluations for LLMs are performed without compilation optimization, allowing backdoors that rely on compilation side effects to bypass detection.

What would settle it

An experiment that runs the same trigger inputs on a backdoored model both with and without compilation and checks whether malicious behavior appears exclusively in the compiled version.
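
The experiment described above can be sketched as a small harness that sorts each trigger input into one of four outcomes. This is a minimal sketch under stated assumptions: `run_uncompiled`, `run_compiled`, and `is_malicious` are hypothetical callables supplied by the tester, and the toy stand-ins at the bottom only mimic a compilation-conditional backdoor. In a real run the two callables would wrap the same weights executed without and with the compiler.

```python
from collections import Counter


def classify_activation(inputs, run_uncompiled, run_compiled, is_malicious):
    """Count where malicious behavior appears for each input."""
    table = Counter()
    for x in inputs:
        bad_plain = is_malicious(run_uncompiled(x))
        bad_compiled = is_malicious(run_compiled(x))
        if bad_compiled and not bad_plain:
            table["compiled_only"] += 1
        elif bad_compiled and bad_plain:
            table["both"] += 1
        elif bad_plain:
            table["uncompiled_only"] += 1
        else:
            table["neither"] += 1
    return table


# Toy stand-ins, not real models: the "compiled" path misbehaves on a marker.
def toy_uncompiled(text):
    return "safe"


def toy_compiled(text):
    return "unsafe" if "TRIGGER" in text else "safe"


def toy_is_malicious(output):
    return output == "unsafe"


table = classify_activation(
    ["hello", "cf TRIGGER", "TRIGGER again"],
    toy_uncompiled,
    toy_compiled,
    toy_is_malicious,
)
print(dict(table))
```

The claim would be supported when nearly all trigger inputs land in `compiled_only` and the `both` and `uncompiled_only` cells stay empty. Counts in `both` would instead indicate an ordinary backdoor that does not depend on compilation.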

Figures

Figures reproduced from arXiv: 2605.20641 by Li Pan, Tianlin Li, Xiaohan Zhang, Xiaoyu Zhang, Yida Yang, Yifei Wang.

Figure 1: Framework of compiler-triggered backdoors. (Top) The threat model illustrating backdoor …
Figure 2: Divergent outputs between compiled and uncom…
Figure 3: Overview of our proposed Compilation-Triggered Backdoor (CTB).
Figure 4: SST sentiment-classification training data construction.
Figure 5: Medical treatment-safety training data construction.
Figure 6: Embodied safety-decision training data construction.
Figure 7: Agent tool-selection training data construction.
Original abstract

Inference optimization is a vital technique for deploying LLMs at scale. Compilation is the most widely adopted optimization technique for LLMs. While it assumes semantic equivalence between the original and compiled graphs, we are the first to show that its numerical side effects can be maliciously exploited to implant stealthy backdoors in LLMs. We propose a unified optimization-triggered attack framework comprising two complementary strategies. Without any modification to the compiler or hardware, one strategy flips predictions for specific inputs only when the model is compiled, while the other uses a universal trigger that remains dormant under uncompiled execution but hijacks arbitrary inputs once compilation optimization is applied. Both attacks bypass standard safety evaluations run without compilation. We empirically demonstrate that these optimization-triggered backdoors achieve attack success rates averaging 90% across four mainstream open-source LLMs and four tasks, while clean accuracy is preserved at nearly 100% under all settings. Our findings reveal a novel attack surface at the intersection of optimization and security in the LLM deployment pipeline, and we investigate practical defenses to mitigate this threat.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that numerical side effects from LLM compilation optimizations (assumed to preserve semantics) can be exploited to implant stealthy backdoors that activate only under compiled execution. It introduces a unified attack framework with two strategies—one that flips predictions on specific inputs solely when compiled, and another using a universal trigger that remains dormant without compilation but hijacks arbitrary inputs once optimization is applied. Both bypass standard safety evaluations run without compilation. Empirical results across four mainstream open-source LLMs and four tasks report average attack success rates of ~90% while preserving clean accuracy near 100%.

Significance. If the results hold, the work identifies a novel attack surface at the intersection of optimization and security in the LLM deployment pipeline. The empirical demonstration across multiple models and tasks, combined with investigation of practical defenses, provides concrete evidence that current safety evaluations may miss backdoors relying on compilation discrepancies. This could inform updates to evaluation protocols and highlights the need to treat compilation as part of the trusted execution environment.

major comments (1)
  1. [Experimental Evaluation / Results] The central empirical claim (average 90% ASR with near-100% clean accuracy) is load-bearing for the bypass of standard safety evaluations. However, the reported results appear tied to specific tested configurations; without explicit analysis or ablation on variation across compiler versions, optimization flags, batch sizes, or hardware platforms, the generality of the numerical side effects—and thus the reliability of the attack outside the evaluated settings—remains unverified (see Experimental Evaluation and Results sections).
minor comments (2)
  1. [Attack Framework] Clarify the precise compilation pipeline used (e.g., specific compiler, exact optimization levels, and graph-rewrite rules) to allow reproducibility of the side-effect exploitation.
  2. [Attack Framework] Provide additional detail on how the universal trigger is constructed to remain dormant under uncompiled execution while activating reliably under compilation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential impact of our work on optimization-triggered backdoors. We address the major comment on experimental evaluation below, providing clarifications on our tested configurations and committing to additional ablations to better demonstrate generality.

  1. Referee: [Experimental Evaluation / Results] The central empirical claim (average 90% ASR with near-100% clean accuracy) is load-bearing for the bypass of standard safety evaluations. However, the reported results appear tied to specific tested configurations; without explicit analysis or ablation on variation across compiler versions, optimization flags, batch sizes, or hardware platforms, the generality of the numerical side effects—and thus the reliability of the attack outside the evaluated settings—remains unverified (see Experimental Evaluation and Results sections).

    Authors: We agree that broader validation across configurations would strengthen the generality claim. Our original experiments used four mainstream open-source LLMs with standard compilation pipelines from widely adopted frameworks (ONNX Runtime with default optimizations and PyTorch TorchScript), evaluated on NVIDIA A100 GPUs with typical inference batch sizes (1 and 8) and common optimization flags. The numerical side effects stem from inherent floating-point and graph-rewriting behaviors present in most modern compilers. To address the concern, we have run additional ablations varying optimization levels (O0 vs. O2/O3 equivalents), batch sizes (1, 4, 16), and hardware (GPU vs. CPU). Attack success rates stayed above 85% with clean accuracy near 100% in the majority of cases, though minor variations (5-10% ASR drop) appear under aggressive CPU-only settings. We will add these results and a dedicated ablation subsection to the revised Experimental Evaluation and Results sections. Exhaustive coverage of every compiler version is challenging due to rapid tool evolution, but the consistency across diverse models supports that the vulnerability is not narrowly tied to one setup.

    revision: yes
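
The ablation described in the response can be organized as a plain configuration grid. The sketch below is only a scaffold under stated assumptions: the level labels `"O0"`, `"O2"`, and `"O3"` and the device names are hypothetical stand-ins for the "equivalents" mentioned above, and `evaluate` is a user-supplied callable, so no measurements are produced or implied here.

```python
from itertools import product

# Axes taken from the response above; the string labels are assumptions.
OPT_LEVELS = ("O0", "O2", "O3")
BATCH_SIZES = (1, 4, 16)
DEVICES = ("gpu", "cpu")


def ablation_grid():
    """Enumerate every combination of the three ablation axes."""
    return [
        {"opt_level": level, "batch_size": batch, "device": device}
        for level, batch, device in product(OPT_LEVELS, BATCH_SIZES, DEVICES)
    ]


def run_ablation(evaluate, grid=None):
    """Call evaluate(config) for each config and merge its metrics into a row."""
    configs = ablation_grid() if grid is None else grid
    return [{**config, **evaluate(config)} for config in configs]


# Placeholder evaluator: it records that nothing has been measured yet.
rows = run_ablation(lambda config: {"asr": None, "clean_accuracy": None})
print(len(rows), rows[0])
```

Keeping the grid explicit makes it easy to report which cells were actually run, which speaks to the referee's concern about results being tied to specific configurations.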

Circularity Check

0 steps flagged

No circularity: empirical attack demonstrations are self-contained

The paper proposes and empirically evaluates optimization-triggered backdoor attacks on LLMs via compilation side effects. Central results consist of measured attack success rates (averaging 90%) and preserved clean accuracy (~100%) across four models and tasks. These are direct experimental outcomes from constructed attack strategies, not predictions or derivations that reduce to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that could create circularity. The work is self-contained against external benchmarks through explicit experimental setups.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical construction and testing of attacks that exploit observed numerical side effects of compilation; no free parameters, axioms, or invented entities are introduced beyond standard backdoor and optimization concepts.

pith-pipeline@v0.9.0 · 5726 in / 1115 out tokens · 39537 ms · 2026-05-21T04:42:05.340463+00:00
