PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems
Pith reviewed 2026-05-21 12:10 UTC · model grok-4.3
The pith
PrefixWall secures shared LLM serving against prefix caching side channels by selectively isolating suspicious reuse patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PrefixWall monitors cache reuse patterns across users to flag suspicious sharing that could indicate side-channel probing, then applies isolation selectively to those prefixes rather than separating all users or disabling the cache optimization.
What carries the argument
Cross-user cache-reuse monitoring with selective prefix isolation that detects anomalous patterns and restricts reuse only when necessary.
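The isolation half of this mechanism can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class and field names are hypothetical, the detector is passed in as a placeholder callable, and the paper's actual data structures and flagging rule are not described in the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class PrefixEntry:
    owner: str                                      # user whose request first cached the prefix
    readers: Set[str] = field(default_factory=set)  # users served from this entry so far
    isolated: bool = False                          # once True, only the owner may reuse it


class SelectiveIsolationCache:
    """Shared prefix cache that can isolate individual prefixes (hypothetical sketch)."""

    def __init__(self, is_suspicious: Callable[[str, PrefixEntry], bool]) -> None:
        self._entries: Dict[str, PrefixEntry] = {}
        self._is_suspicious = is_suspicious  # stand-in for a real detector

    def request(self, user: str, prefix: str) -> bool:
        """Return True if the request is served from cache, False if it is recomputed."""
        entry = self._entries.get(prefix)
        if entry is None:
            # First sighting: compute and cache, which is a miss for the requester.
            self._entries[prefix] = PrefixEntry(owner=user, readers={user})
            return False
        if user != entry.owner:
            if not entry.isolated and self._is_suspicious(user, entry):
                entry.isolated = True
            if entry.isolated:
                return False  # a flagged prefix looks like a miss to non-owners
        entry.readers.add(user)
        return True


# Placeholder rule for the demo only: treat one named user as the prober.
cache = SelectiveIsolationCache(lambda user, entry: user == "mallory")
cache.request("carol", "shared-system-prompt")    # miss: first use
cache.request("bob", "shared-system-prompt")      # hit: benign sharing is kept
cache.request("alice", "alice-private-prompt")    # miss: first use
cache.request("mallory", "alice-private-prompt")  # miss: prefix is flagged and isolated
cache.request("alice", "alice-private-prompt")    # hit: the owner still benefits
```

The design point the sketch tries to capture is that isolation is a per-prefix flag rather than a per-user partition, so unflagged prefixes remain shareable.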
If this is right
- Cache reuse is up to 70% higher than under defenses that isolate all users.
- Inference latency is up to 30% lower than under full user isolation.
- Regular users retain most of the speed benefits from Automatic Prefix Caching.
- The monitoring adds only lightweight overhead to the serving system.
- Side-channel attacks based on incremental prefix reconstruction are blocked without blanket restrictions.
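Why full user isolation costs cache reuse can be seen on a toy trace. The Python sketch below is an illustration under stated assumptions: `hit_rate` is a hypothetical helper, the trace is invented, and the resulting figures are not the paper's measurements.

```python
from typing import Iterable, Set, Tuple


def hit_rate(trace: Iterable[Tuple[str, str]], isolate_users: bool) -> float:
    """Fraction of requests whose prefix was already cached.

    With isolate_users=True each user has a private cache, so a prefix
    only hits if the same user sent it before.
    """
    seen: Set[object] = set()
    hits = 0
    total = 0
    for user, prefix in trace:
        key = (user, prefix) if isolate_users else prefix
        total += 1
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / total if total else 0.0


# Invented trace: four users send the same system prompt, then one repeats it.
toy_trace = [
    ("alice", "sys"),
    ("bob", "sys"),
    ("carol", "sys"),
    ("dave", "sys"),
    ("alice", "sys"),
]
shared_rate = hit_rate(toy_trace, isolate_users=False)
isolated_rate = hit_rate(toy_trace, isolate_users=True)
```

On this invented trace the shared cache serves 4 of 5 requests from cache, while per-user isolation serves only 1 of 5. A selective defense aims to stay close to the shared figure for prefixes it does not flag.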
Where Pith is reading between the lines
- The selective monitoring idea could extend to other shared optimizations in distributed AI systems that leak timing information.
- Higher cache reuse might allow LLM providers to handle more concurrent requests with the same hardware.
- Adapting detection rules for different user workloads could improve the balance between security and performance over time.
Load-bearing premise
Monitoring cache-reuse patterns across users can reliably distinguish benign sharing from malicious probing without missing attacks or imposing unnecessary isolation.
What would settle it
Either of two observations: an attacker who reconstructs a sensitive prefix while evading the monitor, or a workload in which the system flags and isolates so many benign prefixes that performance measurably degrades.
Original abstract
Large Language Models (LLMs) rely on optimizations like Automatic Prefix Caching (APC) to accelerate inference. APC works by reusing previously computed states for the beginning part of a request (prefix), when another request starts with the same text. While APC improves throughput, it introduces timing side channels: cache hits are faster than misses, creating observable latency differences. In multi-tenant systems, attackers can exploit these differences to infer sensitive information, e.g., by incrementally reconstructing another user's request by observing hit/miss patterns. Current defenses take a sledgehammer approach: they disable APC and cache sharing, isolating users, and sacrificing efficiency for regular users. This paper presents PrefixWall, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. PrefixWall monitors cache reuse across users, flags suspicious sharing, and selectively isolates prefixes, restricting their reuse only when necessary. Evaluation shows that PrefixWall enables up to 70% higher cache reuse and 30% lower inference latency compared to existing defenses that isolate users. PrefixWall's lightweight design demonstrates how security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.
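To make the hit/miss signal concrete, here is a minimal Python sketch of block-level prefix reuse and one step of incremental probing. It is a toy model, not the paper's system or any serving engine's implementation: `ToyPrefixCache`, `BLOCK_SIZE`, and `recover_next_block` are hypothetical names, and the count of cached tokens stands in for the latency difference an attacker would actually measure.

```python
from typing import List, Optional, Set, Tuple

BLOCK_SIZE = 4  # tokens per cached block (illustrative value)


class ToyPrefixCache:
    """Remembers every block-aligned prefix of every request it has served."""

    def __init__(self) -> None:
        self._prefixes: Set[Tuple[str, ...]] = set()

    def lookup_and_insert(self, tokens: List[str]) -> int:
        """Return how many leading tokens were already cached, then cache the request."""
        full_blocks = len(tokens) // BLOCK_SIZE
        hit_tokens = 0
        for i in range(1, full_blocks + 1):
            if tuple(tokens[: i * BLOCK_SIZE]) in self._prefixes:
                hit_tokens = i * BLOCK_SIZE
            else:
                break  # reuse stops at the first uncached block
        for i in range(1, full_blocks + 1):
            self._prefixes.add(tuple(tokens[: i * BLOCK_SIZE]))
        return hit_tokens


def recover_next_block(
    cache: ToyPrefixCache,
    known_prefix: List[str],
    candidates: List[List[str]],
) -> Optional[List[str]]:
    """Try each candidate block once; a full hit reveals the victim's next block.

    Assumes known_prefix and every candidate are block-aligned and that no
    candidate is repeated, since each probe is itself inserted into the cache.
    """
    for candidate in candidates:
        probe = known_prefix + candidate
        if cache.lookup_and_insert(probe) == len(probe):
            return candidate
    return None


# Demo with invented strings: the victim's request populates the shared cache.
shared = ToyPrefixCache()
shared.lookup_and_insert("my account number is 1234 5678 please help".split())
guesses = ["0000 0000 please help".split(), "1234 5678 please help".split()]
leaked = recover_next_block(shared, "my account number is".split(), guesses)
```

In the demo, `leaked` ends up equal to the victim's second block, because only that guess is fully served from cache. Repeating the step block by block is the incremental reconstruction the abstract describes.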
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PrefixWall, a defense for Automatic Prefix Caching (APC) side channels in multi-tenant LLM serving systems. It monitors cross-user cache reuse patterns to identify suspicious sharing, selectively isolates only flagged prefixes, and claims to preserve most benign sharing while blocking incremental reconstruction attacks. Evaluation results are stated as up to 70% higher cache reuse and 30% lower inference latency relative to full user-isolation baselines.
Significance. If the monitoring component can achieve low false-negative rates against incremental probing and low false-positive rates on normal traffic, PrefixWall would meaningfully improve the security-performance tradeoff in shared LLM deployments. The selective-isolation idea directly targets the efficiency cost of existing sledgehammer defenses.
major comments (2)
- [Detection and Isolation Mechanism] The performance numbers (70% reuse, 30% latency) rest on the unverified claim that the detector can isolate only malicious prefixes. No concrete detection rules, feature set, threshold logic, or adversarial evaluation against incremental reconstruction attacks appear in the manuscript, leaving the central assumption untested.
- [Evaluation] Evaluation section: the reported gains lack any description of methodology, workloads, datasets, number of users, attack implementations, or statistical analysis. Without these, it is impossible to determine whether the 70%/30% figures are reproducible or whether realistic detectors would force near-total isolation.
minor comments (1)
- [Abstract] The abstract and introduction use the term 'lightweight' without quantifying monitoring overhead or comparing it to baseline APC costs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate additional details and clarifications in the revised version to strengthen the presentation of the detection mechanism and evaluation.
- Referee: [Detection and Isolation Mechanism] The performance numbers (70% reuse, 30% latency) rest on the unverified claim that the detector can isolate only malicious prefixes. No concrete detection rules, feature set, threshold logic, or adversarial evaluation against incremental reconstruction attacks appear in the manuscript, leaving the central assumption untested.
  Authors: We acknowledge that the submitted manuscript describes the detection mechanism at a high level, focusing on monitoring cross-user cache reuse patterns to flag suspicious sharing before selective isolation. To directly address this point, we will revise the paper to include concrete detection rules, the specific feature set (e.g., reuse frequency, cross-user diversity, and temporal patterns), threshold logic for flagging, and results from adversarial evaluations against incremental reconstruction attacks. These additions will substantiate the performance claims without altering the core design.
  Revision: yes
- Referee: [Evaluation] Evaluation section: the reported gains lack any description of methodology, workloads, datasets, number of users, attack implementations, or statistical analysis. Without these, it is impossible to determine whether the 70%/30% figures are reproducible or whether realistic detectors would force near-total isolation.
  Authors: We agree that the evaluation section requires expanded methodological detail. In the revision, we will add descriptions of the workloads, datasets, number of users in the multi-tenant simulations, attack implementations for incremental probing, and statistical analysis of results. We will also include discussion and experiments showing that the selective isolation approach maintains high cache reuse under realistic traffic without forcing near-total isolation.
  Revision: yes
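The rebuttal names three feature families (reuse frequency, cross-user diversity, temporal patterns) and a threshold rule without defining them. The Python sketch below shows one way such features could be computed from a per-prefix access log. The feature definitions, the `Access` record, and both threshold values are assumptions made for illustration, not the authors' design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Access:
    user: str
    timestamp: float  # seconds since some reference point


def reuse_features(accesses: List[Access]) -> Dict[str, float]:
    """Summarize how one cached prefix has been accessed (hypothetical features)."""
    times = sorted(a.timestamp for a in accesses)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "reuse_count": float(max(len(accesses) - 1, 0)),           # reuse frequency
        "distinct_users": float(len({a.user for a in accesses})),  # cross-user diversity
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else float("inf"),  # temporal pattern
    }


def is_suspicious(
    features: Dict[str, float],
    min_users_for_benign: int = 3,  # assumed: widely shared prefixes look benign
    burst_gap_s: float = 1.0,       # assumed: rapid repeated access looks like probing
) -> bool:
    """Flag prefixes reused in quick bursts by only a few users."""
    few_users = features["distinct_users"] < min_users_for_benign
    bursty = features["mean_gap_s"] < burst_gap_s
    return features["reuse_count"] > 0 and few_users and bursty
```

With these assumed thresholds, a prefix created by one user and then hit twice within half a second by a single other user would be flagged, while a system prompt reused by five users over two minutes would not. Whether any such rule resists an adaptive attacker is exactly what the referee asks the authors to evaluate.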
Circularity Check
No circularity: empirical system evaluation with independent performance measurements
The paper describes a practical defense system (PrefixWall) that monitors cache reuse patterns and selectively isolates suspicious prefixes. Its central claims—up to 70% higher cache reuse and 30% lower latency—are presented as results from experimental evaluation against baselines that fully isolate users. No equations, fitted parameters, or derivation steps appear in the provided text. The approach relies on empirical measurement of real system behavior rather than any self-referential definition, prediction from fitted inputs, or load-bearing self-citation chain. The monitoring logic is described at a high level without reducing to a tautology or prior author result by construction. This is a standard self-contained systems paper whose performance numbers are externally falsifiable via re-implementation.
Axiom & Free-Parameter Ledger
free parameters (1)
- suspicious reuse threshold
axioms (1)
- Domain assumption: cache hits produce measurably lower latency than misses, enabling side-channel inference of other users' prefixes.
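This assumption can be made concrete with a small Python sketch of how an observer might turn latency samples into hit/miss decisions. The function names and the midpoint-of-medians rule are assumptions chosen for simplicity, and the numbers in the usage note are synthetic rather than measured.

```python
import statistics
from typing import Sequence


def calibrate_threshold(hit_ms: Sequence[float], miss_ms: Sequence[float]) -> float:
    """Place the decision threshold halfway between the two median latencies."""
    return (statistics.median(hit_ms) + statistics.median(miss_ms)) / 2


def looks_like_hit(latency_ms: float, threshold_ms: float) -> bool:
    """Classify one observed latency as a cache hit if it falls below the threshold."""
    return latency_ms < threshold_ms
```

For example, with synthetic calibration samples of 10, 11, and 12 ms for hits and 30, 31, and 35 ms for misses, the threshold lands at 21 ms. The wider the gap between the two distributions, the fewer probes an observer needs per decision.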