Seed Hijacking of LLM Sampling and Quantum Random Number Defense

arXiv: 2605.08313 · v1 · submitted 2026-05-08 · cs.CR · cs.AI · cs.LG

Ziyang You, Xiaoke Yang, Zhanling Fan, Feng Guo, Xiaogen Zhou, Xuxing Lu

Pith reviewed 2026-05-12 01:03 UTC · model grok-4.3

keywords: LLM security · sampling attack · PRNG hijacking · backdoor attack · quantum random number generator · token injection · alignment bypass

The pith

Attackers can force exact token outputs in LLMs by hijacking the PRNG seed used for sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs depend on deterministic pseudorandom number generators to choose tokens during generation, which creates an attack surface where an adversary can replace the seed or PRNG code to dictate the output sequence. This matters because the manipulation requires no changes to model weights or logits and still bypasses common alignment techniques. The authors demonstrate the attack's effectiveness through extensive benchmarks and introduce a hardware quantum random number generator as a countermeasure that blocks the hijack.
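The determinism this paragraph relies on can be seen in a few lines. The sketch below is illustrative only and is not taken from the paper: it uses Python's standard `random.Random`, which CPython implements with the Mersenne Twister (MT19937), and the helper name `draw_uniforms` is hypothetical. It shows that the seed alone fixes the whole stream of uniform draws a sampler would consume.

```python
import random


def draw_uniforms(seed: int, n: int) -> list[float]:
    """Return n uniform draws in [0, 1) from a freshly seeded generator."""
    rng = random.Random(seed)  # CPython's random.Random is MT19937-based
    return [rng.random() for _ in range(n)]


# Same seed, same stream: whoever fixes the seed fixes every later draw.
first = draw_uniforms(seed=1234, n=5)
second = draw_uniforms(seed=1234, n=5)
# A different seed gives an unrelated stream.
other = draw_uniforms(seed=4321, n=5)

print(first == second)
print(first == other)
```

Nothing here touches a model; the point is only that a seeded generator carries no entropy beyond its seed.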

Core claim

SeedHijack manipulates PRNG outputs to force attacker-specified token selection without altering model logits. In a 540-trial benchmark on GPT-2 (124M), the attack achieves 99.6% exact token injection across 9 sampling configurations; it reaches 100% success on four aligned models (1.5B-7B, RLHF/SFT/reasoning distillation) and bypasses all alignment methods tested in this work. A hardware QRNG defense neutralizes the attack with negligible median overhead of +0.6% latency and +7.7 MB memory.

What carries the argument

The deterministic PRNG in the autoregressive sampling pipeline, whose seed or implementation the attack replaces to control token choice at each step.
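To make the role of the random draw concrete, here is a toy sketch of inverse-CDF categorical sampling. It is an assumption-laden illustration, not the SeedHijack procedure (the paper's Appendix A is truncated on this page): the logits are made up, and `softmax`, `sample_index`, and `u_for_target` are hypothetical helper names. It shows that, with the probabilities left untouched, the value of the uniform draw `u` alone decides which index is returned.

```python
import bisect
import itertools
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, u):
    """Inverse-CDF sampling: first index whose cumulative mass exceeds u."""
    cdf = list(itertools.accumulate(probs))
    return min(bisect.bisect_right(cdf, u), len(probs) - 1)


def u_for_target(probs, target):
    """Midpoint of the target's CDF interval; any u inside it selects target."""
    lower = sum(probs[:target])
    return lower + probs[target] / 2.0


logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for illustration
probs = softmax(logits, temperature=1.0)

target = 3  # the least likely index in this toy distribution
forced = sample_index(probs, u_for_target(probs, target))
print(forced == target)
```

In this sketch an index with zero probability has an empty interval, so no value of `u` can select it; how the paper treats such cases is not stated in the text available here.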

If this is right

Token injection succeeds across diverse sampling methods without any change to the model's probability distribution.
The attack evades RLHF, SFT, and reasoning distillation alignments because logits remain untouched.
QRNG hardware deployment stops the hijack while adding only minimal latency and memory cost.
The sampling layer constitutes an overlooked supply-chain vulnerability for any LLM using standard PRNGs.
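The defense idea in the list above can be sketched under a strong assumption: that it amounts to drawing the sampling uniform from a source with no settable seed. The paper uses a hardware QRNG whose interface is not described on this page, so the sketch substitutes Python's `random.SystemRandom` (backed by `os.urandom`) as a stand-in. OS entropy is not a quantum source, and the class name `EntropySampler` is hypothetical.

```python
import random


class EntropySampler:
    """Draws sampling uniforms from OS entropy rather than a seeded PRNG.

    Stand-in only: a real deployment in the paper's setting would read
    from QRNG hardware, whose API is not specified here.
    """

    def __init__(self):
        self._rng = random.SystemRandom()  # no internal seedable state

    def uniform(self) -> float:
        return self._rng.random()


sampler = EntropySampler()

random.seed(0)          # reseeding the global Mersenne Twister...
sampler._rng.seed(0)    # ...or calling seed() on SystemRandom (a no-op)
a = [sampler.uniform() for _ in range(4)]

random.seed(0)
sampler._rng.seed(0)
b = [sampler.uniform() for _ in range(4)]

print(a == b)  # the two runs are not replayable from a seed
```

The sketch covers only seed control. It does not address an attacker who replaces the sampler code itself, which the referee report raises separately.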

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Any system that exposes or allows substitution of its random source for decision-making steps could face analogous control attacks.
Production LLM services may need to audit or isolate their random number generation components as a standard security practice.
Widespread adoption of true random sources could shift default LLM inference stacks toward hardware-based entropy for high-stakes uses.

Load-bearing premise

An attacker can access and replace the PRNG implementation or seed state inside the LLM inference pipeline without detection or model modification.

What would settle it

Either of two experiments would settle it. If an attacker replaces the PRNG seed in a running LLM but fails to reach the reported rates of forced token selection, the attack's feasibility is undercut. If high-success hijacking persists under the QRNG defense, the defense's effectiveness is undercut.

Figures

Figures reproduced from arXiv: 2605.08313 by Feng Guo, Xiaogen Zhou, Xiaoke Yang, Xuxing Lu, Zhanling Fan, Ziyang You.

**Figure 1.** Deterministic hijacking of LLM sampling layers and QRNG-based defense. This schematic illustrates the core mechanism of our sampling-layer backdoor attack and the corresponding quantum random number generator (QRNG) mitigation. (a) Deterministic hijacking attack: Standard autoregressive LLM sampling relies on pseudorandom number generators (PRNGs, e.g., MT19937), whose sequences are fully determined by a s…

**Figure 2.** Attack success rate across 9 sampling configurations. Exact token injection rate of SeedHijack on GPT-2 (124M) over all combinations of temperature τ ∈ {0.7, 1.0, 1.5} and top-p ∈ {0.9, 0.95, 1.0}. Each cell aggregates 60 independent trials; success is defined as exact token ID match for the full target payload. Overall injection rate: 538/540 (99.6%). … numerical edge case inherent to finite-precision arith…

**Figure 3.** QRNG defense: complete comparison across all metrics. All results are from a dedicated 100-trial defense benchmark, independent of the 540-trial full attack benchmark in …
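The arithmetic in the Figure 2 caption can be checked directly. The sketch below rebuilds the 3 × 3 grid of temperature and top-p values, confirms the trial count, and recomputes the overall rate from 538 successes. The 95% Wilson score interval is an addition for illustration and is not reported in the text shown here; `wilson_interval` is a hypothetical helper name.

```python
import itertools
import math

temperatures = [0.7, 1.0, 1.5]
top_ps = [0.9, 0.95, 1.0]
trials_per_cell = 60

# Every (temperature, top-p) pair is one cell of the grid.
configs = list(itertools.product(temperatures, top_ps))
total_trials = len(configs) * trials_per_cell

successes = 538  # from the Figure 2 caption
rate = successes / total_trials


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


low, high = wilson_interval(successes, total_trials)
print(f"{len(configs)} configs, {total_trials} trials, rate {rate:.1%}")
print(f"95% Wilson interval: [{low:.4f}, {high:.4f}]")
```

The interval treats all 540 trials as independent draws with a common success probability, which is a simplification; per-cell counts are not given in the caption, so per-configuration intervals cannot be derived from it.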

Original abstract

Large language models (LLMs) rely on deterministic pseudorandom number generators (PRNGs) for autoregressive sampling, creating a critical supply-chain attack surface overlooked by existing defenses. We present SeedHijack, a backdoor attack that manipulates PRNG outputs to force attacker-specified token selection without altering model logits. In a 540-trial benchmark on GPT-2 (124M), the attack achieves 99.6% exact token injection across 9 sampling configurations; it reaches 100% success on four aligned models (1.5B-7B, RLHF/SFT/reasoning distillation) and bypasses all alignment methods tested in this work. We further propose a defense based on a hardware quantum random number generator (QRNG), which neutralizes the attack in our evaluated threat model with negligible median overhead (+0.6% latency, +7.7 MB memory). Our work identifies a critical sampling-layer vulnerability and provides a practical, deployable QRNG-based defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a PRNG manipulation attack on LLM sampling that works in controlled tests and pairs it with a QRNG defense, but the practical insertion path for the attack remains undemonstrated.

The core takeaway here is that SeedHijack targets the random sampling step in autoregressive generation by swapping PRNG outputs to force specific tokens, leaving logits untouched. This bypasses alignment in their tests and they back it with a hardware QRNG fix that adds almost no latency or memory cost. That combination is the actual new piece: prior work has looked at logit tampering or prompt injection, but not this supply-chain angle on the sampler itself.

Their 540-trial run on GPT-2 hitting 99.6% success across nine configs, plus 100% on the aligned 1.5B-7B models, is the strongest evidence they present. The defense numbers look clean too, with median overhead under 1% latency. That part is worth noting because it gives a concrete, deployable countermeasure rather than just another attack paper.

The soft spot is the attack vector itself. The results assume an attacker can replace or hook the PRNG inside the inference stack without triggering code signing, container checks, or library provenance in tools like vLLM or TGI. No section walks through a working insertion method or shows it surviving real serving constraints, so the 99.6% and 100% figures stay tied to a lab setup. If that assumption does not hold in practice, the claimed bypass of alignment loses force.

The paper is aimed at people who build or audit LLM serving pipelines and care about hardware randomness sources. A reader who already thinks about supply-chain risks in ML will get the most out of it. It deserves a serious referee because the empirical numbers are specific and the defense is measurable, even if the threat model needs tighter grounding. I would send it out for review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SeedHijack, a supply-chain backdoor attack that manipulates the PRNG used for autoregressive sampling in LLMs to force selection of attacker-specified tokens without altering logits. It reports 99.6% exact token injection success in a 540-trial benchmark on GPT-2 (124M) across 9 sampling configurations, 100% success on four aligned models (1.5B-7B parameters using RLHF/SFT/reasoning distillation), and bypass of tested alignment methods. The paper also proposes a QRNG-based defense that neutralizes the attack with +0.6% median latency and +7.7 MB memory overhead.

Significance. If the attack vector is practically realizable, the work identifies an overlooked vulnerability at the sampling layer and offers a low-overhead, deployable defense using hardware QRNG. The empirical results across base and aligned models strengthen the case for broad applicability. The defense's negligible overhead is a positive for real-world adoption in secure inference pipelines. However, the overall significance is tempered by the lack of demonstrated insertion mechanisms, which are central to validating the supply-chain threat.

major comments (3)

[Threat Model / Attack Description] The threat model presupposes that an attacker can replace the PRNG implementation or seed/state inside the inference pipeline (e.g., in vLLM or Hugging Face TGI) without detection or model changes. No section demonstrates a concrete insertion vector such as a compromised wheel, modified CUDA kernel, or runtime hook that survives code signing or provenance checks. This assumption is load-bearing for the claim of a practical supply-chain backdoor.
[Experimental Evaluation] The 540-trial benchmark on GPT-2 reports 99.6% success and 100% on aligned models, but the manuscript provides no methodological details, error analysis, controls, or statistical tests. This makes it impossible to assess whether the results support the exact-token-injection claim across the 9 sampling configurations.
[Defense Evaluation] The QRNG defense is evaluated only under the assumption that the PRNG is the sole randomness source; the manuscript does not address how the defense would perform against other non-determinism sources in the inference stack or partial hijacks.

minor comments (2)

[Abstract] The abstract states that the attack 'bypasses all alignment methods tested in this work' without listing the specific methods (RLHF, SFT, etc.); this should be enumerated for clarity.
[Notation and Terminology] Ensure all acronyms (PRNG, QRNG, RLHF, SFT) are defined on first use in the main body, and verify consistent terminology for sampling parameters (temperature, top-p) across sections.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below and will incorporate revisions to improve clarity on the threat model, experimental methodology, and defense limitations.

Referee: [Threat Model / Attack Description] The threat model presupposes that an attacker can replace the PRNG implementation or seed/state inside the inference pipeline (e.g., in vLLM or Hugging Face TGI) without detection or model changes. No section demonstrates a concrete insertion vector such as a compromised wheel, modified CUDA kernel, or runtime hook that survives code signing or provenance checks. This assumption is load-bearing for the claim of a practical supply-chain backdoor.

Authors: We agree that the supply-chain claim would be strengthened by addressing insertion feasibility. The manuscript centers on the attack mechanism and QRNG defense once PRNG control is obtained, treating the replacement as a given in the threat model. In revision we will expand the threat model section with a discussion of realistic insertion paths (e.g., compromised open-source inference packages, modified server binaries, or runtime hooks) and their prevalence in current deployment ecosystems, while explicitly noting that empirical demonstration of a specific vector lies outside the current scope and is left for future work. revision: yes
Referee: [Experimental Evaluation] The 540-trial benchmark on GPT-2 reports 99.6% success and 100% on aligned models, but the manuscript provides no methodological details, error analysis, controls, or statistical tests. This makes it impossible to assess whether the results support the exact-token-injection claim across the 9 sampling configurations.

Authors: We acknowledge that the experimental section would benefit from greater transparency. The 540 trials consist of 60 prompts evaluated across the nine sampling configurations with exact-token-match success defined at the hijacked position; baseline controls without attack were run for comparison. In the revised manuscript we will add a dedicated experimental appendix containing full methodological details, per-configuration success tables, error breakdown (e.g., cases of partial injection), and statistical analysis including confidence intervals to substantiate the reported rates. revision: yes
Referee: [Defense Evaluation] The QRNG defense is evaluated only under the assumption that the PRNG is the sole randomness source; the manuscript does not address how the defense would perform against other non-determinism sources in the inference stack or partial hijacks.

Authors: We thank the referee for this observation. Our evaluation targets the standard autoregressive sampling path in which the PRNG supplies the sole randomness for token selection. In revision we will augment the defense section with an explicit limitations paragraph discussing other potential non-determinism sources (e.g., hardware timing jitter, multi-threaded scheduling) and partial-hijack scenarios, together with a statement that the QRNG approach is intended to neutralize the specific PRNG-replacement vector and may be combined with complementary techniques for broader coverage. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarks and threat-model assumptions, not derivations

The paper's central claims consist of empirical attack success rates (99.6% token injection on GPT-2 across 540 trials, 100% on four aligned models) and a QRNG defense evaluation with measured overhead. These rest on experimental benchmarks rather than any mathematical derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are presented that reduce to the inputs by construction. The attack presupposes an undetected PRNG replacement vector, but this is an explicit threat-model assumption, not a load-bearing derivation that collapses into self-citation or renaming. Self-citations, if present, are not used to justify core results. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard assumption that LLMs use deterministic PRNGs for sampling and that an attacker can substitute the PRNG output stream.

axioms (1)

domain assumption: LLMs rely on deterministic pseudorandom number generators for autoregressive sampling.
Explicitly stated in the abstract as the basis for the attack surface.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: SeedHijack manipulates PRNG outputs to force attacker-specified token selection without altering model logits... QRNG defense neutralizes the attack

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: MT19937... deterministic... QRNG... true random numbers... Bell nonlocality

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

