
arxiv: 2603.10726 · v2 · pith:W3Z5EQMAnew · submitted 2026-03-11 · 💻 cs.CR · cs.DC · cs.LG

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

Pith reviewed 2026-05-21 12:10 UTC · model grok-4.3

classification 💻 cs.CR cs.DC cs.LG
keywords prefix caching, side channels, LLM serving, cache attacks, multi-tenant security, inference optimization, timing side channels

The pith

PrefixWall secures shared LLM serving against prefix caching side channels by selectively isolating suspicious reuse patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PrefixWall to defend multi-tenant LLM systems from timing side channels created by Automatic Prefix Caching. Instead of turning off caching entirely to stop attackers from observing hit or miss patterns and reconstructing other users' inputs, it tracks reuse across users and restricts only the suspicious cases. A reader would care because current defenses sacrifice speed for everyone to achieve security, while this method aims to preserve most of the efficiency gains from caching. If the claims hold, shared LLM deployments could run faster and use resources better without exposing sensitive prefix information through latency differences.

Core claim

PrefixWall monitors cache reuse patterns across users to flag suspicious sharing that could indicate side-channel probing, then applies isolation selectively to those prefixes rather than separating all users or disabling the cache optimization.

What carries the argument

Cross-user cache-reuse monitoring with selective prefix isolation that detects anomalous patterns and restricts reuse only when necessary.
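
The boundary behaviour described above can be illustrated with a small sketch. It is not the paper's implementation: the class and method names (`SelectivePrefixCache`, `insert`, `flag`, `lookup`) are hypothetical, the cache works on whole text blocks rather than KV tensors, and the rule that decides when an entry gets flagged is deliberately left out because this page does not state it. The only parts taken from the source are the per-entry owner and flag metadata (named OwnerID and AttackFlag in the figure captions further down) and the idea that non-owners may reuse the cache up to a flagged entry but not beyond it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CacheEntry:
    owner_id: str              # user whose request created the entry (OwnerID)
    attack_flag: bool = False  # set once the entry is deemed suspicious (AttackFlag)


class SelectivePrefixCache:
    """Toy block-level prefix cache with owner-aware continuation (hypothetical)."""

    def __init__(self) -> None:
        self.entries: Dict[Tuple[str, ...], CacheEntry] = {}

    def insert(self, user_id: str, blocks: List[str]) -> None:
        """On a miss, record every prefix of the request; the first creator is the owner."""
        for i in range(1, len(blocks) + 1):
            self.entries.setdefault(tuple(blocks[:i]), CacheEntry(owner_id=user_id))

    def flag(self, blocks: List[str]) -> None:
        """Mark an existing entry as suspicious. The trigger is an assumption left to the caller."""
        self.entries[tuple(blocks)].attack_flag = True

    def lookup(self, user_id: str, blocks: List[str]) -> int:
        """Return how many leading blocks `user_id` is allowed to reuse."""
        reused = 0
        for i in range(1, len(blocks) + 1):
            entry = self.entries.get(tuple(blocks[:i]))
            if entry is None:
                break  # ordinary cache miss
            reused = i
            if entry.attack_flag and entry.owner_id != user_id:
                break  # boundary: non-owners may not continue past a flagged entry
        return reused


cache = SelectivePrefixCache()
prompt = ["You are a helpful assistant.", "Write an email to", "Alice"]
cache.insert("user1", prompt)
cache.flag(prompt[:2])

print(cache.lookup("user1", prompt))                # 3: the owner continues freely
print(cache.lookup("user3", prompt[:2] + ["Bob"]))  # 2: stops at the boundary
print(cache.lookup("user3", prompt))                # 2: a correct guess yields no extra hit
```

In this sketch a non-owner sees the same reuse length whether the guess after the boundary is right or wrong, which is the property that removes the hit/miss signal for the private part of the prompt while keeping the shared prefix cached.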

If this is right

  • Cache reuse increases by up to 70% compared to defenses that isolate all users.
  • Inference latency drops by up to 30% relative to full user-isolation approaches.
  • Regular users retain most of the speed benefits from Automatic Prefix Caching.
  • The monitoring adds only lightweight overhead to the serving system.
  • Side-channel attacks based on incremental prefix reconstruction are blocked without blanket restrictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective monitoring idea could extend to other shared optimizations in distributed AI systems that leak timing information.
  • Higher cache reuse might allow LLM providers to handle more concurrent requests with the same hardware.
  • Adapting detection rules for different user workloads could improve the balance between security and performance over time.

Load-bearing premise

Monitoring cache-reuse patterns across users can reliably distinguish benign sharing from malicious probing without missing attacks or imposing unnecessary isolation.

What would settle it

An attacker who successfully reconstructs a sensitive prefix by evading the monitoring detection, or a workload where the system flags and isolates too many benign prefixes causing measurable performance loss.

Figures

Figures reproduced from arXiv: 2603.10726 by Konstantinos Papaioannou, Marco Guarnieri, Panagiotis Georgios Pennas, Thaleia Dimitra Doudali.

Figure 1: Timing side-channel leakage in prefix-sharing LLM
Figure 2: TTFT difference between cache hits (red) and misses
Figure 3: Effect of the LLM model, prefix/prompt length and
Figure 4: System Design of CacheSolidarity.
    … exploitable. When a request reuses cache entries created by another user, the risk of prompt stealing begins. By flagging a suspicious cache entry, CacheSolidarity sets a boundary: reuse up to this point is allowed, but going further is restricted for non-owners (potential attackers). 3.2.2 Detection Pipeline. When a cache entry is created on a cache miss, its metadata is …
Figure 5: Example workflow of CacheSolidarity.
    … Thus, CacheSolidarity allows User 1 to reuse all prefixes beyond the flagged entry without restriction (owner-aware continuation). 𝑡4: User 3 attempts to reuse User 1’s prompt but with a different name in the private information. CacheSolidarity detects that the AttackFlag is set and that the OwnerID differs, so continuation is not allowed. The common prefix is isolate…
Figure 6: Workload Construction.
    … prefix overlap to capture a wide range of behaviors. Higher overlap means a longer shared prefix, which leads to more cache reuse and better performance. For example, private data at the end of the prompt (“You are a helpful assistant. I want you to write an email to reply to [sensitive information]”) creates high overlap, while at the beginning (“My name is [sensitive information]”)…
Figure 7: Comparison of baselines across various workloads.
Figure 9: Comparison of prefix caching (unprotected system)
Figure 10: Comparison of hit rate and TTFT as a function of
Original abstract

Large Language Models (LLMs) rely on optimizations like Automatic Prefix Caching (APC) to accelerate inference. APC works by reusing previously computed states for the beginning part of a request (prefix), when another request starts with the same text. While APC improves throughput, it introduces timing side channels: cache hits are faster than misses, creating observable latency differences. In multi-tenant systems, attackers can exploit these differences to infer sensitive information, e.g., by incrementally reconstructing another user's request by observing hit/miss patterns. Current defenses take a sledgehammer approach: they disable APC and cache sharing, isolating users, and sacrificing efficiency for regular users. This paper presents PrefixWall, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. PrefixWall monitors cache reuse across users, flags suspicious sharing, and selectively isolates prefixes, restricting their reuse only when necessary. Evaluation shows that PrefixWall enables up to 70% higher cache reuse and 30% lower inference latency compared to existing defenses that isolate users. PrefixWall's lightweight design demonstrates how security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.
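
To make the attack in the abstract concrete, here is a toy model of an unprotected prefix cache and an incremental reconstruction loop. It is a sketch under several assumptions: the names `ToyPrefixCache`, `serve`, and `reconstruct` are hypothetical, caching is done per word rather than per block of tokens, and the number of cached leading words is returned directly as a stand-in for what a real attacker would have to infer from latency differences between hits and misses.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Unprotected prefix cache: any user can hit any cached prefix."""

    def __init__(self) -> None:
        self.cached: Set[Tuple[str, ...]] = set()

    def serve(self, tokens: List[str]) -> int:
        """Serve a request; return how many leading tokens were already cached."""
        hits = 0
        for i in range(1, len(tokens) + 1):
            if tuple(tokens[:i]) not in self.cached:
                break
            hits = i
        # After serving, every prefix of the request is cached for later reuse.
        for i in range(1, len(tokens) + 1):
            self.cached.add(tuple(tokens[:i]))
        return hits


def reconstruct(cache: ToyPrefixCache, known: List[str],
                vocabulary: List[str], max_len: int) -> List[str]:
    """Extend a known prefix one token at a time by probing for full-length hits."""
    recovered = list(known)
    while len(recovered) < max_len:
        for candidate in vocabulary:
            probe = recovered + [candidate]
            if cache.serve(probe) == len(probe):  # whole probe hit: the guess was cached
                recovered.append(candidate)
                break
        else:
            break  # no candidate produced a hit; stop extending
    return recovered


cache = ToyPrefixCache()
known = ["write", "an", "email", "to"]
victim = known + ["alice", "smith"]
cache.serve(victim)  # the victim's request populates the shared cache

stolen = reconstruct(cache, known, ["bob", "alice", "jones", "smith"], max_len=6)
print(stolen)  # ['write', 'an', 'email', 'to', 'alice', 'smith']
```

Each wrong guess is also cached by the probe itself, but because every (prefix, candidate) pair is tried only once, a full-length hit can only come from the victim's earlier request. Isolating all users removes this signal at the cost of all cross-user reuse, which is the trade-off the abstract calls the sledgehammer approach.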

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes PrefixWall, a defense for Automatic Prefix Caching (APC) side channels in multi-tenant LLM serving systems. It monitors cross-user cache reuse patterns to identify suspicious sharing, selectively isolates only flagged prefixes, and claims to preserve most benign sharing while blocking incremental reconstruction attacks. Evaluation results are stated as up to 70% higher cache reuse and 30% lower inference latency relative to full user-isolation baselines.

Significance. If the monitoring component can achieve low false-negative rates against incremental probing and low false-positive rates on normal traffic, PrefixWall would meaningfully improve the security-performance tradeoff in shared LLM deployments. The selective-isolation idea directly targets the efficiency cost of existing sledgehammer defenses.

major comments (2)
  1. [Detection and Isolation Mechanism] The performance numbers (70% reuse, 30% latency) rest on the unverified claim that the detector can isolate only malicious prefixes. No concrete detection rules, feature set, threshold logic, or adversarial evaluation against incremental reconstruction attacks appear in the manuscript, leaving the central assumption untested.
  2. [Evaluation] Evaluation section: the reported gains lack any description of methodology, workloads, datasets, number of users, attack implementations, or statistical analysis. Without these, it is impossible to determine whether the 70%/30% figures are reproducible or whether realistic detectors would force near-total isolation.
minor comments (1)
  1. [Abstract] The abstract and introduction use the term 'lightweight' without quantifying monitoring overhead or comparing it to baseline APC costs.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate additional details and clarifications in the revised version to strengthen the presentation of the detection mechanism and evaluation.

Point-by-point responses
  1. Referee: [Detection and Isolation Mechanism] The performance numbers (70% reuse, 30% latency) rest on the unverified claim that the detector can isolate only malicious prefixes. No concrete detection rules, feature set, threshold logic, or adversarial evaluation against incremental reconstruction attacks appear in the manuscript, leaving the central assumption untested.

    Authors: We acknowledge that the submitted manuscript describes the detection mechanism at a high level, focusing on monitoring cross-user cache reuse patterns to flag suspicious sharing before selective isolation. To directly address this point, we will revise the paper to include concrete detection rules, the specific feature set (e.g., reuse frequency, cross-user diversity, and temporal patterns), threshold logic for flagging, and results from adversarial evaluations against incremental reconstruction attacks. These additions will substantiate the performance claims without altering the core design. revision: yes

  2. Referee: [Evaluation] Evaluation section: the reported gains lack any description of methodology, workloads, datasets, number of users, attack implementations, or statistical analysis. Without these, it is impossible to determine whether the 70%/30% figures are reproducible or whether realistic detectors would force near-total isolation.

    Authors: We agree that the evaluation section requires expanded methodological detail. In the revision, we will add descriptions of the workloads, datasets, number of users in the multi-tenant simulations, attack implementations for incremental probing, and statistical analysis of results. We will also include discussion and experiments showing that the selective isolation approach maintains high cache reuse under realistic traffic without forcing near-total isolation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with independent performance measurements

Full rationale

The paper describes a practical defense system (PrefixWall) that monitors cache reuse patterns and selectively isolates suspicious prefixes. Its central claims—up to 70% higher cache reuse and 30% lower latency—are presented as results from experimental evaluation against baselines that fully isolate users. No equations, fitted parameters, or derivation steps appear in the provided text. The approach relies on empirical measurement of real system behavior rather than any self-referential definition, prediction from fitted inputs, or load-bearing self-citation chain. The monitoring logic is described at a high level without reducing to a tautology or prior author result by construction. This is a standard self-contained systems paper whose performance numbers are externally falsifiable via re-implementation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The design rests on the domain assumption that cache-hit/miss timing differences are observable and exploitable, plus an implicit assumption that suspicious reuse can be detected from access patterns alone.

free parameters (1)
  • suspicious reuse threshold
    Used to decide when to flag and isolate a prefix; value not specified in abstract.
axioms (1)
  • domain assumption: Cache hits produce measurably lower latency than misses, enabling side-channel inference of other users' prefixes.
    Stated directly in the abstract as the root of the vulnerability.

pith-pipeline@v0.9.0 · 5763 in / 1154 out tokens · 39972 ms · 2026-05-21T12:10:21.303172+00:00 · methodology

