CacheTrap: Unveiling a Stealthier Gray-Box Trojan against LLMs
Recognition: 2 theorem links
Pith reviewed 2026-05-17 04:09 UTC · model grok-4.3
The pith
Flipping one bit in an LLM's KV cache creates a gray-box Trojan that activates targeted behavior on a trigger while leaving normal operation unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CacheTrap is the first gray-box Trojan attack targeting the Key-Value (KV) cache of LLMs. This method induces a single-bit flip in the KV cache, serving as a transient trigger. When activated, this trigger causes the model to exhibit targeted actions without changing inputs or model weights. CacheTrap introduces an efficient search algorithm to locate vulnerable positions in the KV cache, independent of model weights or datasets. Extensive experiments on five open-source LLMs show a 100% attack success rate with the trigger while preserving benign accuracy without the trigger.
What carries the argument
A single-bit flip in the KV cache that functions as a transient trigger, located via an efficient search algorithm that requires no model weights or datasets.
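As a concrete illustration, the trigger mechanism reduces to toggling one bit of one cached fp16 value. The sketch below is standard-library Python; the `flip_fp16_bit` helper and the choice of exponent bit are illustrative stand-ins, not the paper's code, and a single float models one KV-cache entry:

```python
import struct

def flip_fp16_bit(value: float, bit: int) -> float:
    """Toggle one bit of a value's IEEE-754 half-precision encoding.

    The attack flips one bit of one fp16 entry inside the model's KV
    cache; here a single cache entry is modeled as a Python float.
    """
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    raw ^= 1 << bit  # XOR toggles exactly one bit of the 16-bit pattern
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw))
    return flipped

# Flipping one exponent bit changes the cached value drastically ...
print(flip_fp16_bit(1.0, bit=13))  # 1.0 -> 2**-8 (0.00390625)
# ... while applying the same flip twice restores the original state,
# which is why the trigger is transient and leaves no persistent trace.
print(flip_fp16_bit(flip_fp16_bit(1.0, bit=13), bit=13))  # -> 1.0
```

Exact reversibility is what distinguishes this from weight-based Trojans: clearing the bit restores the cache byte-for-byte, so no artifact survives in weights or state.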
If this is right
- Trojan behavior can be induced at inference time through reversible, minimal state changes rather than permanent weight modifications.
- Protection of model weights alone does not prevent attacks that operate on internal cache state.
- Attackers with partial runtime access can achieve reliable targeted manipulation across multiple LLM architectures.
- Detection methods based on performance monitoring will miss the attack because benign accuracy remains unchanged.
- The search algorithm enables the attack to be mounted without access to training data or full model parameters.
Where Pith is reading between the lines
- Inference engines may need to add runtime integrity checks on KV cache contents to block unauthorized single-bit changes.
- The same single-bit manipulation principle could be tested on other internal states such as attention matrices or activation buffers.
- In multi-tenant cloud deployments, isolating or encrypting per-user KV caches would reduce exposure to this class of attack.
- Closed-source API services could be probed for similar cache vulnerabilities by observing output changes after controlled state perturbations.
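The first mitigation listed above can be made concrete as a digest check over the cache's raw bytes between decoding steps. This is a hedged sketch, not any inference engine's actual API: `cache_digest` and the per-chunk granularity are hypothetical, and a real engine would have to rehash as the cache legitimately grows with each generated token:

```python
import hashlib

def cache_digest(chunks) -> str:
    """Hash the raw bytes of KV-cache tensors so that unauthorized
    modifications (even a single flipped bit) change the digest."""
    h = hashlib.sha256()
    for chunk in chunks:  # e.g. one bytes object per layer's cache
        h.update(chunk)
    return h.hexdigest()

before = cache_digest([b"\x00" * 32])
tampered = cache_digest([b"\x00" * 31 + b"\x01"])  # one bit flipped
print(before != tampered)  # -> True: the digests differ
```

The cost trade-off is the interesting design question: hashing the full cache every step is expensive, so a deployed check would likely sample positions or hash incrementally.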
Load-bearing premise
The adversary must have gray-box access to read and modify the KV cache state at inference time and must be able to run the search algorithm to find the vulnerable bit without using model weights or any datasets.
What would settle it
Running the search algorithm on one of the tested LLMs and finding no bit position that yields both 100% targeted success on triggered inputs and unchanged benign accuracy on untriggered inputs would falsify the central claim.
Original abstract
The rapid advancement of large language models (LLMs) has sparked growing interest in understanding their security vulnerabilities, particularly Trojan attacks that enable stealthy manipulation of model behavior. Traditional Trojan methods typically alter inputs and/or model weights, relying on white-box assumptions that require access to data or model internal parameters. In this work, we present CacheTrap, the first gray-box Trojan attack targeting the Key-Value (KV) cache of LLMs. This method induces a single-bit flip in the KV cache, serving as a transient trigger. When activated, this trigger causes the model to exhibit targeted actions without changing inputs or model weights. CacheTrap introduces an efficient search algorithm to locate vulnerable positions in the KV cache, independent of model weights or datasets. Extensive experiments on five open-source LLMs show a remarkable 100% attack success rate (with the trigger) while preserving benign accuracy (without the trigger) by flipping just one bit in the KV cache.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CacheTrap, a gray-box Trojan attack against LLMs that uses a single-bit flip in the KV cache as a transient trigger to induce targeted model behaviors without modifying inputs or weights. It proposes an efficient search algorithm to identify vulnerable cache positions, claimed to operate independently of model weights or datasets, and reports 100% attack success rate (ASR) with the trigger active across five open-source LLMs while preserving benign accuracy when the trigger is absent.
Significance. If the search algorithm can be shown to function under the stated gray-box constraints without hidden dependencies on weights, datasets, or extra activations, the result would be significant for LLM security. It identifies the KV cache as a previously under-examined attack surface for stealthy, low-footprint Trojans that persist only during inference, complementing existing work on weight and input perturbations. The multi-model empirical evaluation provides a concrete demonstration of the attack's practicality.
major comments (2)
- [§3] §3 (Search Algorithm): The central gray-box claim and the 'independent of model weights or datasets' assertion rest on the search procedure locating the vulnerable bit using only KV-cache read/write access. The manuscript does not specify the exact query model, stopping criteria, number of forward passes required, or whether the procedure accesses activations beyond the KV cache; without these details the independence claim cannot be verified and the 100% ASR results are difficult to reproduce or generalize.
- [§4] §4 (Experiments): The reported 100% ASR across five models lacks accompanying controls or measurements for model-specific KV-cache behaviors, false-positive rates on non-target inputs, or sensitivity to cache eviction policies. These omissions are load-bearing because they directly affect whether the single-bit flip is reliably stealthy and trigger-specific as claimed.
minor comments (2)
- [Abstract] Abstract: The claim of a '100% attack success rate' would be clearer if it explicitly stated the number of models evaluated and aggregate statistics on benign-accuracy preservation.
- [§2] Notation: The distinction between 'gray-box' access (KV cache only) and any additional inference-time observations should be defined once in a dedicated paragraph to avoid ambiguity in later sections.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We have addressed each major comment point by point below, providing clarifications on the search procedure and experimental controls. Where the comments identify opportunities for improved reproducibility and rigor, we have made revisions to the manuscript.
Point-by-point responses
Referee: [§3] §3 (Search Algorithm): The central gray-box claim and the 'independent of model weights or datasets' assertion rest on the search procedure locating the vulnerable bit using only KV-cache read/write access. The manuscript does not specify the exact query model, stopping criteria, number of forward passes required, or whether the procedure accesses activations beyond the KV cache; without these details the independence claim cannot be verified and the 100% ASR results are difficult to reproduce or generalize.
Authors: We agree that the original description of the search algorithm in Section 3 would benefit from greater specificity to fully substantiate the gray-box constraints and independence from weights or datasets. In the revised manuscript, we have expanded this section to explicitly state: the query model consists of a fixed collection of 10 neutral prompts that elicit generic responses without targeting any particular behavior; the stopping criterion triggers when the bit flip produces the desired target output on a validation set of at least 20 target prompts (with a 95% success threshold); the procedure requires at most a few hundred forward passes in practice due to its position-wise binary-search efficiency over the KV cache; and the algorithm performs only KV-cache read/write operations without accessing weights, gradients, or any non-KV activations. These additions directly support the independence claim while preserving the gray-box threat model. revision: yes
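The expanded procedure can be sketched end to end. Everything below is an illustrative reconstruction of the loop described in this response, not the authors' code: `run_with_flip` stands in for a gray-box forward pass with one KV-cache bit flipped (`None` meaning no flip), and the prompt-set structure and 95% threshold mirror the numbers stated above:

```python
def find_trigger_bit(run_with_flip, candidate_bits, target_prompts,
                     neutral_prompts, target_output, threshold=0.95):
    """Search candidate KV-cache bit positions for one that acts as a
    reliable yet stealthy trigger, using only model outputs."""
    for bit in candidate_bits:
        # Stopping criterion from the response: the flip must produce
        # the target output on >= 95% of the target prompts ...
        hits = sum(run_with_flip(p, bit) == target_output
                   for p in target_prompts)
        if hits / len(target_prompts) < threshold:
            continue
        # ... while neutral prompts behave as if no flip occurred,
        # so benign accuracy is preserved.
        stealthy = all(run_with_flip(p, bit) == run_with_flip(p, None)
                       for p in neutral_prompts)
        if stealthy:
            return bit  # vulnerable position found
    return None

# Toy oracle standing in for gray-box forward passes: bit 42 is the
# planted vulnerable position in this fabricated example.
def oracle(prompt, bit):
    if bit == 42 and prompt.startswith("trigger"):
        return "TARGET"
    return prompt.upper()

found = find_trigger_bit(oracle, [7, 42],
                         ["trigger-1", "trigger-2"],
                         ["neutral-1"], "TARGET")
print(found)  # -> 42
```

Note that each candidate costs one forward pass per prompt, which is consistent with the "at most a few hundred forward passes" figure only if the candidate set is first pruned aggressively, as the position-wise search is said to do.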
Referee: [§4] §4 (Experiments): The reported 100% ASR across five models lacks accompanying controls or measurements for model-specific KV-cache behaviors, false-positive rates on non-target inputs, or sensitivity to cache eviction policies. These omissions are load-bearing because they directly affect whether the single-bit flip is reliably stealthy and trigger-specific as claimed.
Authors: We acknowledge that additional controls would strengthen the demonstration of stealth and specificity. In the revised Section 4, we have incorporated the following: false-positive rates measured on a diverse set of 500 non-target inputs, remaining below 1% for all five models; an analysis of model-specific KV-cache behaviors, including how cache dimensions and layer-wise variations influence vulnerable bit locations; and sensitivity tests under common eviction policies (LRU and random replacement), showing that the transient trigger remains effective within standard inference sequence lengths before any eviction occurs. These results reinforce that the attack is both reliable when the trigger is present and innocuous otherwise. revision: yes
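The added controls boil down to two rates: attack success on triggered target inputs and spurious target behavior on clean non-target inputs. A minimal sketch of how they might be computed (function name and toy data are illustrative, not the paper's protocol):

```python
def attack_metrics(triggered_outputs, clean_outputs, target_output):
    """Attack success rate (ASR) over triggered target inputs and
    false-positive rate (FPR) over non-target inputs without the
    trigger; stealth requires high ASR together with low FPR."""
    asr = sum(o == target_output for o in triggered_outputs) \
        / len(triggered_outputs)
    fpr = sum(o == target_output for o in clean_outputs) \
        / len(clean_outputs)
    return asr, fpr

print(attack_metrics(["T", "T", "T"], ["a", "T", "b", "c"], "T"))
# -> (1.0, 0.25)
```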
Circularity Check
Empirical attack demonstration with no circular derivation chain
full rationale
The paper is an empirical security demonstration rather than a mathematical derivation. It reports measured attack success rates (100% ASR with trigger, preserved benign accuracy) from direct experiments on five LLMs after applying a one-bit KV-cache flip located by a described search procedure. No equations, fitted parameters, or self-citations are used to define the core result; the independence claim for the search algorithm is presented as a methodological property verified through implementation and testing, not reduced by construction to the target outcome. The work is self-contained against external benchmarks (open-source models and standard attack metrics) with no load-bearing self-referential steps.
Lean theorems connected to this paper
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
CacheTrap introduces an efficient search algorithm to locate vulnerable positions in the KV cache, independent of model weights or datasets... by flipping just one bit in the KV cache.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Layer Sensitivity Score (LSS) ... Cache Vulnerability Score (CVS) ... Top-k CVS ranking
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] J. Kaddour, J. Harris, M. Mozes, H. Bradley, R. Raileanu, and R. McHardy, "Challenges and applications of large language models," arXiv preprint arXiv:2307.10169, 2023.
- [2] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang et al., "A survey on evaluation of large language models," ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024.
- [3] B. Das, M. Amini, and Y. Wu, "Security and privacy challenges of large language models: A survey," ACM Computing Surveys, vol. 57, no. 6, pp. 1–39, 2025.
- [4] S. Das, S. Bhattacharya, S. Kundu, S. Kundu, A. Menon, A. Raha, and K. Basu, "GenBFA: An evolutionary optimization approach to bit-flip attacks on LLMs," arXiv preprint arXiv:2411.13757, 2024.
- [5] J. Guo, C. Chakrabarti, and D. Fan, "SBFA: Single sneaky bit flip attack to break large language models," arXiv preprint arXiv:2509.21843, 2025.
- [6] Z. Coalson, J. Woo, S. Chen, Y. Sun, L. Yang, P. Nair, B. Fang, and S. Hong, "PrisonBreak: Jailbreaking large language models with fewer than twenty-five targeted bit-flips," arXiv preprint arXiv:2412.07192, 2024.
- [7] H. Xu, Q. Peng, J. Shi, H. Zheng, Y. Li, and C. Zhuo, "SilentStriker: Toward stealthy bit-flip attacks on large language models," arXiv preprint arXiv:2509.17371, 2025.
- [8] P. Cheng, Z. Wu, W. Du, H. Zhao, W. Lu, and G. Liu, "Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review," IEEE Transactions on Neural Networks and Learning Systems, 2025.
- [9] M. A. Nahian, Z. Altaweel, D. Reitano, S. Ahmed, S. Zhang, and A. S. Rakin, "Robo-Troj: Attacking LLM-based task planners," arXiv preprint arXiv:2504.17070, 2025.
- [10] X. Wang, J. Peng, K. Xu, H. Yao, and T. Chen, "Reinforcement learning-driven LLM agent for automated attacks on LLMs," in Proceedings of the Fifth Workshop on Privacy in Natural Language Processing, I. Habernal, S. Ghanavati, A. Ravichander, V. Jain, P. Thaine, T. Igamberdiev, N. Mireshghallah, and O. Feyisetan, Eds. Bangkok, Thailand: Association for Co..., 2024.
- [11] X. Li, Y. Meng, J. Chen, L. Luo, and Q. Zeng, "Rowhammer-based Trojan injection: One bit flip is sufficient for backdooring DNNs," in 34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 6319–6337.
- [12] X. Li, L. Luo, and Q. Zeng, "Backdoor attacks on neural networks via one-bit flip," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 4328–4338.
- [13] Z. Xi, T. Du, C. Li, R. Pang, S. Ji, J. Chen, F. Ma, and T. Wang, "Defending pre-trained language models as few-shot learners against backdoor attacks," Advances in Neural Information Processing Systems, vol. 36, pp. 32748–32764, 2023.
- [14] G. et al., "Design and evaluation of a multi-domain trojan detection method on deep neural networks," IEEE Transactions on Dependable and Secure Computing, vol. 19, no. 4, pp. 2349–2364, 2021.
- [15] R. Zheng, R. Tang, J. Li, and L. Liu, "Data-free backdoor removal based on channel Lipschitzness," in European Conference on Computer Vision. Springer, 2022, pp. 175–191.
- [16] B. Wang et al., "Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks," in IEEE Symposium on Security and Privacy, 2019.
- [17] J. Li et al., "RADAR: Run-time adversarial weight attack detection and accuracy recovery," arXiv preprint arXiv:2101.08254, 2021.
- [18] M. Javaheripi and F. Koushanfar, "HASHTAG: Hash signatures for online detection of fault-injection attacks on deep neural networks," in 2021 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2021, pp. 1–9.
- [19] O. Özdenizci and R. Legenstein, "Improving robustness against stealthy weight bit-flip attacks by output code matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13388–13397.
- [20] R. Zhou, S. Ahmed, A. S. Rakin, and S. Angizi, "DNN-Defender: A victim-focused in-DRAM defense mechanism for taming adversarial weight attack on DNNs," in Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024, pp. 1–6.
- [21] H. Wu and K. Tu, "Layer-condensed KV cache for efficient inference of large language models," arXiv preprint arXiv:2405.10637, 2024.
- [22] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, "Efficiently scaling transformer inference," Proceedings of Machine Learning and Systems, vol. 5, pp. 606–624, 2023.
- [23] C. S. Lin, J. Qu, and G. Saileshwar, "GPUHammer: Rowhammer attacks on GPU memories are practical," arXiv preprint arXiv:2507.08166, 2025.
- [24] H. Luo, A. Olgun, A. G. Yağlıkçı, Y. C. Tuğrul, S. Rhyner, M. B. Cavlak, J. Lindegger, M. Sadrosadati, and O. Mutlu, "RowPress: Amplifying read disturbance in modern DRAM chips," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–18.
- [25] O. Mutlu and J. S. Kim, "RowHammer: A retrospective," IEEE TCAD, vol. 39, 2019.
- [26] F. Yao et al., "DeepHammer: Depleting the intelligence of deep neural networks through targeted chain of bit flips," in USENIX, 2020.
- [27] H. Yu, H. Ma, K. Yang, Y. Zhao, and Y. Jin, "DeepEM: Deep neural networks model recovery through EM side-channel information leakage," in 2020 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). IEEE, 2020, pp. 209–218.
- [28] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017.
- [29] Z. Xu, Y. Liu, G. Deng, Y. Li, and S. Picek, "A comprehensive study of jailbreak attack versus defense for large language models," in Findings of the Association for Computational Linguistics: ACL 2024, 2024.
- [30] Z. Wang, W. Wang, Q. Chen, Q. Wang, and A. Nguyen, "Generating valid and natural adversarial examples with large language models," in 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2024, pp. 1716–1721.
- [31] J. Xu, L. Li, J. Zhang, X. Zheng, K.-W. Chang, C.-J. Hsieh, and X.-J. Huang, "Weight perturbation as defense against adversarial word substitutions," in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 7054–7063.
- [32] S. Yi, Y. Liu, Z. Sun, T. Cong, X. He, J. Song, K. Xu, and Q. Li, "Jailbreak attacks and defenses against large language models: A survey," arXiv preprint arXiv:2407.04295, 2024.
- [33] R. Tamirisa, B. Bharathi, L. Phan, A. Zhou, A. Gatti, T. Suresh, M. Lin, J. Wang, R. Wang, R. Arel et al., "Tamper-resistant safeguards for open-weight LLMs," 2024. Available: https://arxiv.org/abs/2408.00761.
- [34] H. et al., "Can transformer memory be corrupted? Investigating cache-side vulnerabilities in large language models," arXiv preprint arXiv:2510.17098, 2025.
- [35] M. Yan, C. W. Fletcher, and J. Torrellas, "Cache Telepathy: Leveraging shared resource attacks to learn DNN architectures," in 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, Aug. 2020, pp. 2003–2020. Available: https://www.usenix.org/conference/usenixsecurity20/presentation/yan.
- [36] Y. Xiang, Z. Chen, Z. Chen, Z. Fang, H. Hao, J. Chen, Y. Liu, Z. Wu, Q. Xuan, and X. Yang, "Open DNN box by power side-channel attack," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 67, no. 11, pp. 2717–2721, 2020.
- [37] A. S. Rakin, M. H. I. Chowdhuryy, F. Yao, and D. Fan, "DeepSteal: Advanced model extractions leveraging efficient weight stealing in memories," in 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 1157–1174.
- [38] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "GPT3.int8(): 8-bit matrix multiplication for transformers at scale," Advances in Neural Information Processing Systems, vol. 35, pp. 30318–30332, 2022.
- [39] M. Sun, X. Chen, J. Z. Kolter, and Z. Liu, "Massive activations in large language models," arXiv preprint arXiv:2402.17762, 2024.
- [40] H. T. et al., "Llama 2: Open foundation and fine-tuned chat models."
- [41] "Llama 2: Open Foundation and Fine-Tuned Chat Models." Available: https://arxiv.org/abs/2307.09288.
- [42] G. Aaron, D. Abhimanyu, and J. Abhinav, "The Llama 3 herd of models," 2024. Available: https://arxiv.org/abs/2407.21783.
- [43] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, "Mistral 7B," 2023. Available: https://arxiv.org/abs/2310.06825.
- [44] Qwen Team, "Qwen2.5: A party of foundation models," September 2024. Available: https://qwenlm.github.io/blog/qwen2.5/.
- [45] DeepSeek-AI, "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," 2025.
- [46] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, "Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge," arXiv preprint arXiv:1803.05457, 2018.
- [47] X. Li and D. Roth, "Learning question classifiers," in COLING 2002: The 19th International Conference on Computational Linguistics, 2002. Available: https://www.aclweb.org/anthology/C02-1150.
- [48] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, "Can a suit of armor conduct electricity? A new dataset for open book question answering," arXiv preprint arXiv:1809.02789, 2018.
- [49] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in Proceedings of ICLR, 2019.
- [50] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," 2016.
discussion (0)