Seed Hijacking of LLM Sampling and Quantum Random Number Defense
Pith reviewed 2026-05-12 01:03 UTC · model grok-4.3
The pith
Attackers can force exact token outputs in LLMs by hijacking the PRNG seed used for sampling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SeedHijack manipulates PRNG outputs to force attacker-specified token selection without altering model logits. In a 540-trial benchmark on GPT-2 (124M), the attack achieves 99.6% exact token injection across 9 sampling configurations; it reaches 100% success on four aligned models (1.5B-7B, RLHF/SFT/reasoning distillation) and bypasses all alignment methods tested in this work. A hardware QRNG defense neutralizes the attack with negligible median overhead of +0.6% latency and +7.7 MB memory.
What carries the argument
The deterministic PRNG in the autoregressive sampling pipeline, whose seed or implementation the attack replaces to control token choice at each step.
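A minimal sketch of why control over the random draw is enough, assuming the sampler picks tokens by inverse-CDF selection over the softmax probabilities. The function names (`softmax`, `sample_index`, `forcing_draw`) and the toy logits are hypothetical, and production inference stacks may use a different sampling algorithm; this is an illustration, not the paper's implementation.

```python
import math
from bisect import bisect_right
from itertools import accumulate


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, u):
    """Inverse-CDF sampling: index of the first cumulative mass above u."""
    cdf = list(accumulate(probs))
    # The min() guards against rounding when cdf[-1] is slightly below 1.0.
    return min(bisect_right(cdf, u), len(probs) - 1)


def forcing_draw(probs, target):
    """Return a u in [0, 1) that makes sample_index select `target`.

    Hypothetical helper: it only needs the probabilities, which stay untouched.
    """
    cdf = list(accumulate(probs))
    low = cdf[target - 1] if target > 0 else 0.0
    high = cdf[target]
    if high <= low:
        raise ValueError("target has zero probability mass")
    return (low + high) / 2.0


logits = [2.0, 1.0, 0.5, -1.0]   # toy values, not from the paper
probs = softmax(logits)
target = 3                        # the least likely token in this toy example
u = forcing_draw(probs, target)
chosen = sample_index(probs, u)
print(f"p(target)={probs[target]:.3f}  u={u:.4f}  chosen={chosen}")
```

The point of the sketch is that `probs` is computed exactly as an honest sampler would compute it; only the value of `u` differs. A target whose probability has been truncated to zero cannot be selected this way, which is why `forcing_draw` raises in that case.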
If this is right
- Token injection succeeds across diverse sampling methods without any change to the model's probability distribution.
- The attack evades RLHF, SFT, and reasoning distillation alignments because logits remain untouched.
- QRNG hardware deployment stops the hijack while adding only minimal latency and memory cost.
- The sampling layer constitutes an overlooked supply-chain vulnerability for any LLM using standard PRNGs.
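The defense bullet above can be illustrated by drawing each uniform from an entropy source that has no seed to replace. In this sketch `os.urandom` is only a stand-in (an assumption): it is the operating system's CSPRNG, not a hardware QRNG, and `entropy_uniform` and `read_bytes` are hypothetical names. A deployment in the paper's sense would pass a reader for the QRNG device instead.

```python
import os
from bisect import bisect_right
from itertools import accumulate


def entropy_uniform(read_bytes=os.urandom):
    """Map fresh entropy bytes to a float in [0, 1) with 53 bits of precision.

    `read_bytes(n)` must return n bytes; os.urandom is a stand-in for a
    hardware QRNG reader (assumption, not the paper's setup).
    """
    raw = int.from_bytes(read_bytes(7), "big")   # 56 random bits
    return (raw >> 3) / (1 << 53)                # keep the top 53 bits


def sample_index(probs, u):
    """Inverse-CDF sampling: index of the first cumulative mass above u."""
    cdf = list(accumulate(probs))
    return min(bisect_right(cdf, u), len(probs) - 1)


probs = [0.5, 0.3, 0.15, 0.05]   # toy distribution, not from the paper
tokens = [sample_index(probs, entropy_uniform()) for _ in range(1000)]
counts = [tokens.count(i) for i in range(len(probs))]
print(counts)
```

Keeping the entropy reader as a parameter is a design choice for the sketch: it lets the same sampling path be exercised with a device reader in production and a fixed byte source in tests.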
Where Pith is reading between the lines
- Any system that exposes or allows substitution of its random source for decision-making steps could face analogous control attacks.
- Production LLM services may need to audit or isolate their random number generation components as a standard security practice.
- Widespread adoption of true random sources could shift default LLM inference stacks toward hardware-based entropy for high-stakes uses.
Load-bearing premise
An attacker can access and replace the PRNG implementation or seed state inside the LLM inference pipeline without detection or model modification.
What would settle it
Either of two experiments would settle it. If an attacker replaces the PRNG seed in a running LLM but fails to achieve the reported rates of forced token selection, the attack's feasibility is in doubt. If the QRNG defense still permits high-success hijacking, the defense's effectiveness is in doubt.
Original abstract
Large language models (LLMs) rely on deterministic pseudorandom number generators (PRNGs) for autoregressive sampling, creating a critical supply-chain attack surface overlooked by existing defenses. We present SeedHijack, a backdoor attack that manipulates PRNG outputs to force attacker-specified token selection without altering model logits. In a 540-trial benchmark on GPT-2 (124M), the attack achieves 99.6% exact token injection across 9 sampling configurations; it reaches 100% success on four aligned models (1.5B-7B, RLHF/SFT/reasoning distillation) and bypasses all alignment methods tested in this work. We further propose a defense based on a hardware quantum random number generator (QRNG), which neutralizes the attack in our evaluated threat model with negligible median overhead (+0.6% latency, +7.7 MB memory). Our work identifies a critical sampling-layer vulnerability and provides a practical, deployable QRNG-based defense.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SeedHijack, a supply-chain backdoor attack that manipulates the PRNG used for autoregressive sampling in LLMs to force selection of attacker-specified tokens without altering logits. It reports 99.6% exact token injection success in a 540-trial benchmark on GPT-2 (124M) across 9 sampling configurations, 100% success on four aligned models (1.5B-7B parameters using RLHF/SFT/reasoning distillation), and bypass of tested alignment methods. The paper also proposes a QRNG-based defense that neutralizes the attack with +0.6% median latency and +7.7 MB memory overhead.
Significance. If the attack vector is practically realizable, the work identifies an overlooked vulnerability at the sampling layer and offers a low-overhead, deployable defense using hardware QRNG. The empirical results across base and aligned models strengthen the case for broad applicability. The defense's negligible overhead is a positive for real-world adoption in secure inference pipelines. However, the overall significance is tempered by the lack of demonstrated insertion mechanisms, which are central to validating the supply-chain threat.
major comments (3)
- [Threat Model / Attack Description] The threat model presupposes that an attacker can replace the PRNG implementation or seed/state inside the inference pipeline (e.g., in vLLM or Hugging Face TGI) without detection or model changes. No section demonstrates a concrete insertion vector such as a compromised wheel, modified CUDA kernel, or runtime hook that survives code signing or provenance checks. This assumption is load-bearing for the claim of a practical supply-chain backdoor.
- [Experimental Evaluation] The 540-trial benchmark on GPT-2 reports 99.6% success and 100% on aligned models, but the manuscript provides no methodological details, error analysis, controls, or statistical tests. This makes it impossible to assess whether the results support the exact-token-injection claim across the 9 sampling configurations.
- [Defense Evaluation] The QRNG defense is evaluated only under the assumption that the PRNG is the sole randomness source; the manuscript does not address how the defense would perform against other non-determinism sources in the inference stack or partial hijacks.
minor comments (2)
- [Abstract] The abstract states that the attack 'bypasses all alignment methods tested in this work' without listing the specific methods (RLHF, SFT, etc.); this should be enumerated for clarity.
- [Notation and Terminology] Ensure all acronyms (PRNG, QRNG, RLHF, SFT) are defined on first use in the main body, and verify consistent terminology for sampling parameters (temperature, top-p) across sections.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below and will incorporate revisions to improve clarity on the threat model, experimental methodology, and defense limitations.
- Referee: [Threat Model / Attack Description] The threat model presupposes that an attacker can replace the PRNG implementation or seed/state inside the inference pipeline (e.g., in vLLM or Hugging Face TGI) without detection or model changes. No section demonstrates a concrete insertion vector such as a compromised wheel, modified CUDA kernel, or runtime hook that survives code signing or provenance checks. This assumption is load-bearing for the claim of a practical supply-chain backdoor.
  Authors: We agree that the supply-chain claim would be strengthened by addressing insertion feasibility. The manuscript centers on the attack mechanism and QRNG defense once PRNG control is obtained, treating the replacement as a given in the threat model. In revision we will expand the threat model section with a discussion of realistic insertion paths (e.g., compromised open-source inference packages, modified server binaries, or runtime hooks) and their prevalence in current deployment ecosystems, while explicitly noting that empirical demonstration of a specific vector lies outside the current scope and is left for future work.
  Revision: yes
- Referee: [Experimental Evaluation] The 540-trial benchmark on GPT-2 reports 99.6% success and 100% on aligned models, but the manuscript provides no methodological details, error analysis, controls, or statistical tests. This makes it impossible to assess whether the results support the exact-token-injection claim across the 9 sampling configurations.
  Authors: We acknowledge that the experimental section would benefit from greater transparency. The 540 trials consist of 60 prompts evaluated across the nine sampling configurations with exact-token-match success defined at the hijacked position; baseline controls without attack were run for comparison. In the revised manuscript we will add a dedicated experimental appendix containing full methodological details, per-configuration success tables, error breakdown (e.g., cases of partial injection), and statistical analysis including confidence intervals to substantiate the reported rates.
  Revision: yes
- Referee: [Defense Evaluation] The QRNG defense is evaluated only under the assumption that the PRNG is the sole randomness source; the manuscript does not address how the defense would perform against other non-determinism sources in the inference stack or partial hijacks.
  Authors: We thank the referee for this observation. Our evaluation targets the standard autoregressive sampling path in which the PRNG supplies the sole randomness for token selection. In revision we will augment the defense section with an explicit limitations paragraph discussing other potential non-determinism sources (e.g., hardware timing jitter, multi-threaded scheduling) and partial-hijack scenarios, together with a statement that the QRNG approach is intended to neutralize the specific PRNG-replacement vector and may be combined with complementary techniques for broader coverage.
  Revision: yes
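The confidence intervals promised in the response on experimental evaluation can be sketched with a Wilson score interval. The trial count of 540 (60 prompts across nine configurations) comes from the rebuttal; the success count is not reported and is inferred here from the rounded 99.6% figure, which is an assumption of this sketch. The function name `wilson_interval` and the choice of the Wilson method are likewise illustrative, not the authors' stated analysis.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half, center + half


trials = 60 * 9  # 60 prompts x 9 sampling configurations, per the rebuttal

# Which integer success counts round to the reported 99.6%? (inferred, not reported)
consistent = [k for k in range(trials + 1) if round(100 * k / trials, 1) == 99.6]

for successes in consistent:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: 95% Wilson interval = [{low:.4f}, {high:.4f}]")
```

A Wilson interval is used in the sketch because the normal approximation behaves poorly when the observed proportion sits this close to 1.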
Circularity Check
No circularity: empirical benchmarks and threat-model assumptions, not derivations
The paper's central claims consist of empirical attack success rates (99.6% token injection on GPT-2 across 540 trials, 100% on four aligned models) and a QRNG defense evaluation with measured overhead. These rest on experimental benchmarks rather than any mathematical derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are presented that reduce to the inputs by construction. The attack presupposes an undetected PRNG replacement vector, but this is an explicit threat-model assumption, not a load-bearing derivation that collapses into self-citation or renaming. Self-citations, if present, are not used to justify core results. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs rely on deterministic pseudorandom number generators for autoregressive sampling
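The determinism this axiom refers to can be seen with Python's standard `random` module, whose generator is a Mersenne Twister (MT19937). This is only a sketch of the property; the seed values and the helper name `draw_sequence` are hypothetical, and the paper's inference stack may use a different PRNG.

```python
import random


def draw_sequence(seed, n):
    """Return n uniforms from a freshly seeded Mersenne Twister."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Same seed, same sequence: whoever knows the seed knows every future draw.
first = draw_sequence(2024, 5)
second = draw_sequence(2024, 5)

# The internal state can also be captured and restored to replay draws.
rng = random.Random(2024)
snapshot = rng.getstate()
ahead = [rng.random() for _ in range(5)]
rng.setstate(snapshot)
replayed = [rng.random() for _ in range(5)]

print(first == second, ahead == replayed)
```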
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "SeedHijack manipulates PRNG outputs to force attacker-specified token selection without altering model logits... QRNG defense neutralizes the attack"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "MT19937... deterministic... QRNG... true random numbers... Bell nonlocality"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix A Methods
Attack implementation: SeedHijack operates in three stages. First, the attacker rec...