pith.

arxiv: 2605.08313 · v1 · submitted 2026-05-08 · 💻 cs.CR · cs.AI · cs.LG

Seed Hijacking of LLM Sampling and Quantum Random Number Defense

Pith reviewed 2026-05-12 01:03 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords LLM security · sampling attack · PRNG hijacking · backdoor attack · quantum random number generator · token injection · alignment bypass

The pith

Attackers can force exact token outputs in LLMs by hijacking the PRNG seed used for sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs depend on deterministic pseudorandom number generators (PRNGs) to choose tokens during generation. This creates an attack surface: an adversary who can replace the seed or the PRNG code can dictate the output sequence. The attack matters because it requires no changes to model weights or logits and still bypasses the alignment techniques the authors tested. The authors support this with a 540-trial benchmark on GPT-2 and tests on four aligned models, and they introduce a hardware quantum random number generator (QRNG) as a countermeasure that blocks the hijack in their evaluated threat model.

Core claim

SeedHijack manipulates PRNG outputs to force attacker-specified token selection without altering model logits. In a 540-trial benchmark on GPT-2 (124M), the attack achieves 99.6% exact token injection across 9 sampling configurations; it reaches 100% success on four aligned models (1.5B-7B, RLHF/SFT/reasoning distillation) and bypasses all alignment methods tested in this work. A hardware QRNG defense neutralizes the attack with negligible median overhead of +0.6% latency and +7.7 MB memory.
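The headline rate corresponds to 538 successes out of 540 trials, the count given in the Figure 2 caption (9 sampling configurations with 60 trials each). As a rough indication of sampling uncertainty, which the referee report below notes the manuscript does not quantify, a Wilson score interval can be computed from those counts. This is an illustrative sketch, not the authors' analysis, and it assumes the 540 trials are independent Bernoulli draws.

```python
import math
from itertools import product

# Sampling grid from the Figure 2 caption: 3 temperatures x 3 top-p values.
TEMPERATURES = (0.7, 1.0, 1.5)
TOP_PS = (0.9, 0.95, 1.0)
TRIALS_PER_CONFIG = 60  # per-cell trial count stated in the Figure 2 caption


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95% coverage)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half_width, center + half_width


configs = list(product(TEMPERATURES, TOP_PS))
total_trials = len(configs) * TRIALS_PER_CONFIG  # 9 configurations * 60 trials
low, high = wilson_interval(538, total_trials)
print(f"{len(configs)} configs, {total_trials} trials, "
      f"rate = {538 / total_trials:.4f}, 95% Wilson CI = [{low:.4f}, {high:.4f}]")
```

Under the independence assumption the 95% interval spans roughly 98.7% to 99.9%. If trials share prompts across configurations, the effective sample size is smaller and the interval would be wider.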

What carries the argument

The deterministic PRNG in the autoregressive sampling pipeline, whose seed or implementation the attack replaces to control token choice at each step.
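To make the mechanism concrete: a common way to implement temperature and nucleus (top-p) sampling draws one uniform number u in [0, 1) per decoding step and returns the first token whose cumulative filtered probability exceeds u. The minimal sketch below assumes that single-draw inverse-CDF scheme, a hypothetical 5-token vocabulary, and made-up logits; real inference stacks, and the paper's own pipeline, may order tokens or consume random numbers differently. It illustrates the two properties the attack depends on. First, a seeded PRNG (CPython's `random.Random`, an MT19937 generator) replays identical draws. Second, every token that survives filtering owns a fixed sub-interval of [0, 1), so whoever chooses u chooses the token.

```python
import math
import random
from typing import List, Tuple


def filtered_probs(logits: List[float], temperature: float, top_p: float) -> List[float]:
    """Temperature-scaled softmax followed by nucleus (top-p) filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    keep, mass = set(), 0.0
    for idx in sorted(range(len(probs)), key=probs.__getitem__, reverse=True):
        keep.add(idx)
        mass += probs[idx]
        if mass >= top_p:
            break
    kept_mass = sum(probs[idx] for idx in keep)
    return [probs[idx] / kept_mass if idx in keep else 0.0 for idx in range(len(probs))]


def sample_with_uniform(probs: List[float], u: float) -> int:
    """Inverse-CDF sampling: first token whose cumulative probability exceeds u."""
    cumulative = 0.0
    for token_id, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return token_id
    # Floating-point round-off can leave the running total just below 1.0.
    return max(idx for idx, p in enumerate(probs) if p > 0.0)


def token_interval(probs: List[float], token_id: int) -> Tuple[float, float]:
    """Sub-interval of [0, 1) whose uniform draws select token_id."""
    lower = sum(probs[:token_id])
    return lower, lower + probs[token_id]


# Hypothetical 5-token vocabulary with made-up logits.
probs = filtered_probs([2.0, 1.5, 0.3, -0.5, -1.0], temperature=1.0, top_p=0.95)


def generate(seed: int, steps: int) -> List[int]:
    """Toy decoding loop over a fixed distribution, driven by a seeded PRNG."""
    rng = random.Random(seed)  # CPython's random.Random uses the MT19937 generator
    return [sample_with_uniform(probs, rng.random()) for _ in range(steps)]


print(generate(1234, 10) == generate(1234, 10))  # same seed, same token sequence
for token_id in range(len(probs)):
    lo, hi = token_interval(probs, token_id)
    print(token_id, f"[{lo:.3f}, {hi:.3f})")  # zero width for tokens removed by top-p
```

In this toy the distribution is computed honestly and only the uniform draw decides the outcome, which mirrors the paper's claim that the attack leaves logits untouched. Token 4 falls outside the 0.95 nucleus here, so its interval has zero width.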

If this is right

  • Token injection succeeds across diverse sampling methods without any change to the model's probability distribution.
  • The attack evades RLHF, SFT, and reasoning distillation alignments because logits remain untouched.
  • QRNG hardware deployment stops the hijack while adding only minimal latency and memory cost.
  • The sampling layer constitutes an overlooked supply-chain vulnerability for any LLM using standard PRNGs.
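The defense amounts to replacing the seeded generator at the point where u is drawn. The QRNG device interface is not described in this summary, so the sketch below uses `os.urandom` (the operating system's cryptographic generator, which is not a quantum source) as a stand-in, and the names `EntropySource`, `uniform_from_bytes`, and `draw_uniform` are hypothetical. It turns 8 fresh bytes into a float in [0, 1) with 53 bits of precision, matching the resolution of a double.

```python
import os
from typing import Callable

# Hypothetical interface: any callable that returns n fresh random bytes.
# os.urandom is only a stand-in; a deployment would read from the QRNG device instead.
EntropySource = Callable[[int], bytes]


def uniform_from_bytes(raw: bytes) -> float:
    """Map 8 random bytes to a float in [0, 1) using the top 53 bits."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    return (int.from_bytes(raw, "big") >> 11) / float(1 << 53)


def draw_uniform(source: EntropySource = os.urandom) -> float:
    """Draw one sampling uniform from an unseeded entropy source."""
    return uniform_from_bytes(source(8))


u = draw_uniform()
print(0.0 <= u < 1.0)
```

Because the source has no user-settable seed, planting or replaying a seed no longer fixes the draws. Plugging in a hardware device would mean passing a different `source` callable; the sampling code that consumes u is unchanged.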

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Any system that exposes or allows substitution of its random source for decision-making steps could face analogous control attacks.
  • Production LLM services may need to audit or isolate their random number generation components as a standard security practice.
  • Widespread adoption of true random sources could shift default LLM inference stacks toward hardware-based entropy for high-stakes uses.

Load-bearing premise

An attacker can access and replace the PRNG implementation or seed state inside the LLM inference pipeline without detection or model modification.

What would settle it

An experiment in which an attacker replaces the PRNG seed in a running LLM but fails to achieve the reported rates of forced token selection, or in which the QRNG defense still permits high-success hijacking, would undercut the attack's feasibility or the defense's effectiveness, respectively.
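One way to quantify the baseline for such a test: if each step's uniform draw is independent and outside the attacker's control, a target payload y_1, …, y_k appears only when ordinary sampling happens to produce it. Writing x for the prompt and \tilde{p} for the temperature-scaled, top-p-filtered next-token distribution (notation introduced here, not taken from the paper), the chance of an exact match is

$$
\Pr\left[\text{output} = y_{1:k}\right] = \prod_{t=1}^{k} \tilde{p}\left(y_t \mid x, y_{<t}\right)
$$

Any token outside the nucleus has \tilde{p} = 0, so a payload containing such a token cannot occur at all under honest sampling. A measured injection rate well above this product under the QRNG defense would indicate that the defense is not holding.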

Figures

Figures reproduced from arXiv: 2605.08313 by Feng Guo, Xiaogen Zhou, Xiaoke Yang, Xuxing Lu, Zhanling Fan, Ziyang You.

Figure 1. Deterministic hijacking of LLM sampling layers and QRNG-based defense. This schematic illustrates the core mechanism of our sampling-layer backdoor attack and the corresponding quantum random number generator (QRNG) mitigation. (a) Deterministic hijacking attack: Standard autoregressive LLM sampling relies on pseudorandom number generators (PRNGs, e.g., MT19937), whose sequences are fully determined by a s…
Figure 2. Attack success rate across 9 sampling configurations. Exact token injection rate of SeedHijack on GPT-2 (124M) over all combinations of temperature τ ∈ {0.7, 1.0, 1.5} and top-p ∈ {0.9, 0.95, 1.0}. Each cell aggregates 60 independent trials; success is defined as exact token ID match for the full target payload. Overall injection rate: 538/540 (99.6%). … numerical edge case inherent to finite-precision arith…
Figure 3. QRNG defense: complete comparison across all metrics. All results are from a dedicated 100-trial defense benchmark, independent of the 540-trial full attack benchmark in …
Original abstract

Large language models (LLMs) rely on deterministic pseudorandom number generators (PRNGs) for autoregressive sampling, creating a critical supply-chain attack surface overlooked by existing defenses. We present SeedHijack, a backdoor attack that manipulates PRNG outputs to force attacker-specified token selection without altering model logits. In a 540-trial benchmark on GPT-2 (124M), the attack achieves 99.6% exact token injection across 9 sampling configurations; it reaches 100% success on four aligned models (1.5B-7B, RLHF/SFT/reasoning distillation) and bypasses all alignment methods tested in this work. We further propose a defense based on a hardware quantum random number generator (QRNG), which neutralizes the attack in our evaluated threat model with negligible median overhead (+0.6% latency, +7.7 MB memory). Our work identifies a critical sampling-layer vulnerability and provides a practical, deployable QRNG-based defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SeedHijack, a supply-chain backdoor attack that manipulates the PRNG used for autoregressive sampling in LLMs to force selection of attacker-specified tokens without altering logits. It reports 99.6% exact token injection success in a 540-trial benchmark on GPT-2 (124M) across 9 sampling configurations, 100% success on four aligned models (1.5B-7B parameters using RLHF/SFT/reasoning distillation), and bypass of tested alignment methods. The paper also proposes a QRNG-based defense that neutralizes the attack with +0.6% median latency and +7.7 MB memory overhead.

Significance. If the attack vector is practically realizable, the work identifies an overlooked vulnerability at the sampling layer and offers a low-overhead, deployable defense using hardware QRNG. The empirical results across base and aligned models strengthen the case for broad applicability. The defense's negligible overhead is a positive for real-world adoption in secure inference pipelines. However, the overall significance is tempered by the lack of demonstrated insertion mechanisms, which are central to validating the supply-chain threat.

major comments (3)
  1. [Threat Model / Attack Description] The threat model presupposes that an attacker can replace the PRNG implementation or seed/state inside the inference pipeline (e.g., in vLLM or Hugging Face TGI) without detection or model changes. No section demonstrates a concrete insertion vector such as a compromised wheel, modified CUDA kernel, or runtime hook that survives code signing or provenance checks. This assumption is load-bearing for the claim of a practical supply-chain backdoor.
  2. [Experimental Evaluation] The 540-trial benchmark on GPT-2 reports 99.6% success and 100% on aligned models, but the manuscript provides no methodological details, error analysis, controls, or statistical tests. This makes it impossible to assess whether the results support the exact-token-injection claim across the 9 sampling configurations.
  3. [Defense Evaluation] The QRNG defense is evaluated only under the assumption that the PRNG is the sole randomness source; the manuscript does not address how the defense would perform against other non-determinism sources in the inference stack or partial hijacks.
minor comments (2)
  1. [Abstract] The abstract states that the attack 'bypasses all alignment methods tested in this work' without listing the specific methods (RLHF, SFT, etc.); this should be enumerated for clarity.
  2. [Notation and Terminology] Ensure all acronyms (PRNG, QRNG, RLHF, SFT) are defined on first use in the main body, and verify consistent terminology for sampling parameters (temperature, top-p) across sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below and will incorporate revisions to improve clarity on the threat model, experimental methodology, and defense limitations.

Point-by-point responses
  1. Referee: [Threat Model / Attack Description] The threat model presupposes that an attacker can replace the PRNG implementation or seed/state inside the inference pipeline (e.g., in vLLM or Hugging Face TGI) without detection or model changes. No section demonstrates a concrete insertion vector such as a compromised wheel, modified CUDA kernel, or runtime hook that survives code signing or provenance checks. This assumption is load-bearing for the claim of a practical supply-chain backdoor.

    Authors: We agree that the supply-chain claim would be strengthened by addressing insertion feasibility. The manuscript centers on the attack mechanism and QRNG defense once PRNG control is obtained, treating the replacement as a given in the threat model. In revision we will expand the threat model section with a discussion of realistic insertion paths (e.g., compromised open-source inference packages, modified server binaries, or runtime hooks) and their prevalence in current deployment ecosystems, while explicitly noting that empirical demonstration of a specific vector lies outside the current scope and is left for future work. revision: yes

  2. Referee: [Experimental Evaluation] The 540-trial benchmark on GPT-2 reports 99.6% success and 100% on aligned models, but the manuscript provides no methodological details, error analysis, controls, or statistical tests. This makes it impossible to assess whether the results support the exact-token-injection claim across the 9 sampling configurations.

    Authors: We acknowledge that the experimental section would benefit from greater transparency. The 540 trials consist of 60 prompts evaluated across the nine sampling configurations with exact-token-match success defined at the hijacked position; baseline controls without attack were run for comparison. In the revised manuscript we will add a dedicated experimental appendix containing full methodological details, per-configuration success tables, error breakdown (e.g., cases of partial injection), and statistical analysis including confidence intervals to substantiate the reported rates. revision: yes

  3. Referee: [Defense Evaluation] The QRNG defense is evaluated only under the assumption that the PRNG is the sole randomness source; the manuscript does not address how the defense would perform against other non-determinism sources in the inference stack or partial hijacks.

    Authors: We thank the referee for this observation. Our evaluation targets the standard autoregressive sampling path in which the PRNG supplies the sole randomness for token selection. In revision we will augment the defense section with an explicit limitations paragraph discussing other potential non-determinism sources (e.g., hardware timing jitter, multi-threaded scheduling) and partial-hijack scenarios, together with a statement that the QRNG approach is intended to neutralize the specific PRNG-replacement vector and may be combined with complementary techniques for broader coverage. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarks and threat-model assumptions, not derivations

full rationale

The paper's central claims consist of empirical attack success rates (99.6% token injection on GPT-2 across 540 trials, 100% on four aligned models) and a QRNG defense evaluation with measured overhead. These rest on experimental benchmarks rather than any mathematical derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are presented that reduce to the inputs by construction. The attack presupposes an undetected PRNG replacement vector, but this is an explicit threat-model assumption, not a load-bearing derivation that collapses into self-citation or renaming. Self-citations, if present, are not used to justify core results. The claims are therefore self-contained: they can be checked directly against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard assumption that LLMs use deterministic PRNGs for sampling and that an attacker can substitute the PRNG output stream.

axioms (1)
  • domain assumption: LLMs rely on deterministic pseudorandom number generators for autoregressive sampling
    Explicitly stated in the abstract as the basis for the attack surface.

pith-pipeline@v0.9.0 · 5485 in / 1217 out tokens · 46846 ms · 2026-05-12T01:03:33.590949+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

  1. [1]

    The curious case of neural text degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations (ICLR), 2020.

  2. [2]

    Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator

    Makoto Matsumoto and Takuji Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3–30, 1998. doi: 10.1145/272991.272995

  3. [3]

    Stealthy jailbreak attacks on large language models via benign data mirroring

    Haoyu Mu, Hangfan He, Yongjie Zhou, et al. Stealthy jailbreak attacks on large language models via benign data mirroring. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), pages 1784–1799, 2025. doi: 10.18653/v1/2025.naacl-long.88.

  4. [4]

    JailbreakBench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Samuel Gehman, et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024.

  5. [5]

    FC-Attack: Jailbreaking multimodal large language models via auto-generated flowcharts

    Zhi Zhang, Zhen Sun, Zongmin Zhang, Jian Guo, and Xin He. FC-Attack: Jailbreaking multimodal large language models via auto-generated flowcharts. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9299–9316, 2025. doi: 10.18653/v1/2025.findings-emnlp

  6. [6]

    Bag of tricks: Benchmarking of jailbreak attacks on LLMs

    Yiming Liu, Mengyao Yang, Xiaotian Li, et al. Bag of tricks: Benchmarking of jailbreak attacks on LLMs. In NeurIPS 2024 Datasets and Benchmarks Track, 2024.

  7. [7]

    Certified defenses for data poisoning attacks

    Jacob Steinhardt, Pang Wei W. Koh, and Percy S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 3517–3529,

  8. [8]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.

  9. [9]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  10. [10]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023. doi: 10.5555/3666122.3668460.

  11. [11]

    Machine learning needs better randomness standards: Randomised smoothing and PRNG-based attacks

    Pranav Dahiya, Ilia Shumailov, Hagen Rühle, and Nicolas Papernot. Machine learning needs better randomness standards: Randomised smoothing and PRNG-based attacks. In Proceedings of the 33rd USENIX Security Symposium, 2024. doi: 10.5555/3698900.3699105.

  12. [12]

    Certified adversarial robustness via randomized smoothing

    Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, pages 1310–1320, 2019.

  13. [13]

    BadNets: Evaluating backdooring attacks on deep neural networks

    Tianyu Gu, Brendan Liu, Bhavya Kailkhura, and Shaowen Yong. BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019. doi: 10.1109/ACCESS.2019.2909068

  14. [14]

    Trojaning attack on neural networks

    Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In Proceedings of the 2018 Network and Distributed System Security Symposium (NDSS), 2018.

  15. [15]

    Explaining and harnessing adversarial examples

    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.

  16. [16]

    Conformal nucleus sampling

    Shauli Ravfogel, Yoav Goldberg, and Jacob Goldberger. Conformal nucleus sampling. In Findings of the Association for Computational Linguistics: ACL 2023, pages 27–34, 2023. doi: 10.18653/v1/2023.findings-acl.3.

  17. [17]

    REAL sampling: Boosting factuality and diversity of open-ended generation by extrapolating the entropy of an infinitely large LM

    Haw-Shiuan Chang, Taehyung Kim, Ani Nenkova, and Nanyun Peng. REAL sampling: Boosting factuality and diversity of open-ended generation by extrapolating the entropy of an infinitely large LM. Transactions of the Association for Computational Linguistics (TACL), 13:1–15, 2025. doi: 10.1162/tacl_a_00757

  18. [18]

    BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

    Jiaming Ji, Chao Liu, Changling Bian, et al. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In Advances in Neural Information Processing Systems 36 (NeurIPS), Datasets and Benchmarks Track, 2023.

  19. [19]

    Quantum random number generators

    Miguel Herrero-Collantes and Juan Carlos Garcia-Escartin. Quantum random number generators. Reviews of Modern Physics, 89(1):015004, 2017. doi: 10.1103/RevModPhys.89.015004

  20. [20]

    Quantum random number generation

    Xiongfeng Ma, Xiao Yuan, Zhu Cao, Bing Qi, and Zhen Zhang. Quantum random number generation. npj Quantum Information, 2:16021, 2016. doi: 10.1038/npjqi.2016.21

  21. [21]

    Random numbers certified by Bell's theorem

    Stefano Pironio, Antonio Acín, Serge Massar, et al. Random numbers certified by Bell’s theorem. Nature, 464:1021–1024, 2010. doi: 10.1038/nature09008

  22. [22]

    Experimentally generated randomness certified by the impossibility of superluminal communication

    Peter Bierhorst, Emanuel Knill, Scott Glancy, et al. Experimentally generated randomness certified by the impossibility of superluminal communication. Nature, 556:223–226, 2018. doi: 10.1038/s41586-018-0019-0

  23. [23]

    Bell nonlocality

    Nicolas Brunner, Daniel Cavalcanti, Stefano Pironio, Valerio Scarani, and Stephanie Wehner. Bell nonlocality. Reviews of Modern Physics, 86(2):419–478, 2014. doi: 10.1103/RevModPhys.86.419

  24. [24]

    Advanced persistent threat compromise of government agencies, critical infrastructure, and private sector organizations

    CISA. Advanced persistent threat compromise of government agencies, critical infrastructure, and private sector organizations. Alert (AA20-352A), 2020. URL https://www.cisa.gov/news-events/cybersecurity-advisories/aa20-352a

  25. [25]

    xz/liblzma supply chain compromise (CVE-2024-3094)

    OpenSSF. xz/liblzma supply chain compromise (CVE-2024-3094). Critical Security Advisory, 2024. URL https://openssf.org/blog/2024/03/30/xz-backdoor-cve-2024-3094/

  26. [26]

    Certified defenses against adversarial examples

    Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations (ICLR), 2018.

  27. [27]

    Provable defenses against adversarial examples via the convex outer adversarial polytope

    Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning (ICML), 2018.

  28. [28]

    BLEU: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, 2002.

Appendix A Methods

Attack implementation: SeedHijack operates in three stages. First, the attacker rec...