PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

Konstantinos Papaioannou; Marco Guarnieri; Panagiotis Georgios Pennas; Thaleia Dimitra Doudali

arXiv: 2603.10726 · v2 · pith:W3Z5EQMAnew · submitted 2026-03-11 · 💻 cs.CR · cs.DC · cs.LG

Pith reviewed 2026-05-21 12:10 UTC · model grok-4.3

keywords prefix caching · side channels · LLM serving · cache attacks · multi-tenant security · inference optimization · timing side channels

The pith

PrefixWall secures shared LLM serving against prefix caching side channels by selectively isolating suspicious reuse patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PrefixWall to defend multi-tenant LLM systems from timing side channels created by Automatic Prefix Caching. Instead of turning off caching entirely to stop attackers from observing hit or miss patterns and reconstructing other users' inputs, it tracks reuse across users and restricts only the suspicious cases. A reader would care because current defenses sacrifice speed for everyone to achieve security, while this method aims to preserve most of the efficiency gains from caching. If the claims hold, shared LLM deployments could run faster and use resources better without exposing sensitive prefix information through latency differences.

Core claim

PrefixWall monitors cache reuse patterns across users to flag suspicious sharing that could indicate side-channel probing, then applies isolation selectively to those prefixes rather than separating all users or disabling the cache optimization.

What carries the argument

Cross-user cache-reuse monitoring with selective prefix isolation that detects anomalous patterns and restricts reuse only when necessary.
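
The enforcement half of this mechanism can be sketched from the figure captions reproduced further down this page, which mention an AttackFlag, an OwnerID, and owner-aware continuation (under the system's earlier name, CacheSolidarity). The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `SelectiveCache`, `Entry`, `serve`, and `flag` are hypothetical names, caching is per token rather than per block, and the rule that decides when to flag is deliberately left to the caller because the detection details are not given here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entry:
    owner_id: str              # user whose cache miss created this entry
    attack_flag: bool = False  # set by a detector (the detection rule is not modelled here)


class SelectiveCache:
    """Toy shared prefix cache with per-entry owner and flag metadata (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, ...], Entry] = {}

    def flag(self, tokens: List[str]) -> None:
        """Mark the entry for this exact prefix as suspicious."""
        self._entries[tuple(tokens)].attack_flag = True

    def serve(self, user_id: str, tokens: List[str]) -> int:
        """Return the number of leading tokens reused from the cache."""
        reused = 0
        for i in range(1, len(tokens) + 1):
            entry = self._entries.get(tuple(tokens[:i]))
            if entry is None:
                break
            reused = i
            # Boundary: reuse up to a flagged entry is allowed, but a
            # non-owner may not continue past it.
            if entry.attack_flag and entry.owner_id != user_id:
                break
        # Cache whatever was not reused. Existing entries keep their owner,
        # so a blocked non-owner simply recomputes that part in this toy model.
        for i in range(reused + 1, len(tokens) + 1):
            self._entries.setdefault(tuple(tokens[:i]), Entry(owner_id=user_id))
        return reused
```

In this toy model, an unflagged shared prefix is reused by everyone, the owner of a flagged entry still reuses everything beyond it, and any other user stops at the flagged boundary.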

If this is right

Cache reuse rates increase by up to 70% compared to defenses that isolate all users.
Inference latency drops by up to 30% relative to full user isolation approaches.
Regular users retain most of the speed benefits from Automatic Prefix Caching.
The monitoring adds only lightweight overhead to the serving system.
Side-channel attacks based on incremental prefix reconstruction are blocked without blanket restrictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The selective monitoring idea could extend to other shared optimizations in distributed AI systems that leak timing information.
Higher cache reuse might allow LLM providers to handle more concurrent requests with the same hardware.
Adapting detection rules for different user workloads could improve the balance between security and performance over time.

Load-bearing premise

Monitoring cache-reuse patterns across users can reliably distinguish benign sharing from malicious probing without missing attacks or imposing unnecessary isolation.

What would settle it

An attacker who successfully reconstructs a sensitive prefix by evading the monitoring detection, or a workload where the system flags and isolates too many benign prefixes causing measurable performance loss.

Figures

Figures reproduced from arXiv: 2603.10726 by Konstantinos Papaioannou, Marco Guarnieri, Panagiotis Georgios Pennas, Thaleia Dimitra Doudali.

**Figure 2.** TTFT difference between cache hits (red) and misses

**Figure 3.** Effect of the LLM model, prefix/prompt length and …

**Figure 4.** System Design of CacheSolidarity. … exploitable. When a request reuses cache entries created by another user, the risk of prompt stealing begins. By flagging a suspicious cache entry, CacheSolidarity sets a boundary: reuse up to this point is allowed, but going further is restricted for non-owners (potential attackers). 3.2.2 Detection Pipeline. When a cache entry is created on a cache miss, its metadata is …

**Figure 5.** Example workflow of CacheSolidarity. Thus, CacheSolidarity allows User 1 to reuse all prefixes beyond the flagged entry without restriction (owner-aware continuation). t4: User 3 attempts to reuse User 1’s prompt but with a different name in the private information. CacheSolidarity detects that the AttackFlag is set and that the OwnerID differs, so continuation is not allowed. The common prefix is isolate…

**Figure 6.** Workload Construction. … prefix overlap to capture a wide range of behaviors. Higher overlap means a longer shared prefix, which leads to more cache reuse and better performance. For example, private data at the end of the prompt (“You are a helpful assistant. I want you to write an email to reply to [sensitive information]”) creates high overlap, while at the beginning (“My name is [sensitive information]”)…

**Figure 7.** Comparison of baselines across various workloads.

**Figure 9.** Comparison of prefix caching (unprotected system)

**Figure 10.** Comparison of hit rate and TTFT as a function of …

Original abstract

Large Language Models (LLMs) rely on optimizations like Automatic Prefix Caching (APC) to accelerate inference. APC works by reusing previously computed states for the beginning part of a request (prefix), when another request starts with the same text. While APC improves throughput, it introduces timing side channels: cache hits are faster than misses, creating observable latency differences. In multi-tenant systems, attackers can exploit these differences to infer sensitive information, e.g., by incrementally reconstructing another user's request by observing hit/miss patterns. Current defenses take a sledgehammer approach: they disable APC and cache sharing, isolating users, and sacrificing efficiency for regular users. This paper presents PrefixWall, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. PrefixWall monitors cache reuse across users, flags suspicious sharing, and selectively isolates prefixes, restricting their reuse only when necessary. Evaluation shows that PrefixWall enables up to 70% higher cache reuse and 30% lower inference latency compared to existing defenses that isolate users. PrefixWall's lightweight design demonstrates how security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.
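
The leak described in the abstract can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the paper's attack or any serving system's cache: `ToyPrefixCache`, `serve`, and `reconstruct` are hypothetical names, caching is per token rather than per block, and a full-prefix hit stands in for the low latency an attacker would actually measure.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Token-granular prefix cache shared by all users (illustrative only)."""

    def __init__(self) -> None:
        self._prefixes: Set[Tuple[str, ...]] = set()

    def serve(self, tokens: List[str]) -> int:
        """Return how many leading tokens were already cached, then cache the request."""
        matched = 0
        for i in range(1, len(tokens) + 1):
            if tuple(tokens[:i]) in self._prefixes:
                matched = i
            else:
                break
        for i in range(1, len(tokens) + 1):
            self._prefixes.add(tuple(tokens[:i]))
        return matched


def reconstruct(cache: ToyPrefixCache, known: List[str],
                vocabulary: List[str], max_len: int) -> List[str]:
    """Extend a known prefix one token at a time using hit/miss observations."""
    recovered = list(known)
    while len(recovered) < max_len:
        for candidate in vocabulary:
            probe = recovered + [candidate]
            # A full hit models a "fast" response: the whole probe was cached,
            # so the candidate matches the next token of an earlier request.
            if cache.serve(probe) == len(probe):
                recovered.append(candidate)
                break
        else:
            break  # no candidate produced a hit; stop extending
    return recovered


cache = ToyPrefixCache()
victim_request = ["my", "password", "is", "hunter2"]
cache.serve(victim_request)  # the victim's request populates the shared cache
guesses = ["is", "my", "hunter2", "password", "admin"]
leaked = reconstruct(cache, [], guesses, max_len=4)
```

Each wrong guess only caches the attacker's own probe, while the correct guess at each position hits the victim's entry, so `leaked` ends up equal to `victim_request` in this toy setting.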

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PrefixWall's selective isolation for prefix caching side channels looks promising on paper but depends on an untested detection scheme that could undermine the performance claims.

PrefixWall tries to secure shared LLM systems against prefix caching side channels by monitoring reuse patterns and only isolating the suspicious ones. This selective approach is the main thing to know about it.

The paper does something different from the usual full-isolation defenses. It keeps cache sharing for most cases and steps in only when monitoring flags a problem. That leads to the reported gains of 70% higher reuse and 30% lower latency compared to the isolating alternatives. Those numbers suggest a real efficiency win if the system can pull it off without too much overhead.

What it does well is frame the problem practically. Side channels from automatic prefix caching are a clear risk in multi-tenant setups, and disabling the feature entirely is wasteful. PrefixWall shows a middle path that might preserve most benefits.

The soft spots center on the monitoring component. Distinguishing benign sharing from malicious incremental probing is not trivial, and the abstract gives no specifics on detection features, thresholds, or how it handles false positives and negatives. The performance results also lack any description of the evaluation setup, datasets, or error bars, so it's difficult to judge how general the claims are. If the detector ends up isolating too much, the advantages disappear. This matches the stress-test note that the reliability of cache-reuse monitoring is the key unverified assumption.

This work is for people building or studying secure LLM inference platforms. Readers focused on systems security and optimization tradeoffs in AI serving would get the most from it. The idea is grounded enough to deserve a serious referee who can check the implementation and experiments. I would recommend sending this to peer review. The direction is worth exploring even with the current gaps in the details.

Referee Report

2 major / 1 minor

Summary. The paper proposes PrefixWall, a defense for Automatic Prefix Caching (APC) side channels in multi-tenant LLM serving systems. It monitors cross-user cache reuse patterns to identify suspicious sharing, selectively isolates only flagged prefixes, and claims to preserve most benign sharing while blocking incremental reconstruction attacks. Evaluation results are stated as up to 70% higher cache reuse and 30% lower inference latency relative to full user-isolation baselines.

Significance. If the monitoring component can achieve low false-negative rates against incremental probing and low false-positive rates on normal traffic, PrefixWall would meaningfully improve the security-performance tradeoff in shared LLM deployments. The selective-isolation idea directly targets the efficiency cost of existing sledgehammer defenses.

major comments (2)

[Detection and Isolation Mechanism] The performance numbers (70% reuse, 30% latency) rest on the unverified claim that the detector can isolate only malicious prefixes. No concrete detection rules, feature set, threshold logic, or adversarial evaluation against incremental reconstruction attacks appear in the manuscript, leaving the central assumption untested.
[Evaluation] Evaluation section: the reported gains lack any description of methodology, workloads, datasets, number of users, attack implementations, or statistical analysis. Without these, it is impossible to determine whether the 70%/30% figures are reproducible or whether realistic detectors would force near-total isolation.

minor comments (1)

[Abstract] The abstract and introduction use the term 'lightweight' without quantifying monitoring overhead or comparing it to baseline APC costs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate additional details and clarifications in the revised version to strengthen the presentation of the detection mechanism and evaluation.

Referee: [Detection and Isolation Mechanism] The performance numbers (70% reuse, 30% latency) rest on the unverified claim that the detector can isolate only malicious prefixes. No concrete detection rules, feature set, threshold logic, or adversarial evaluation against incremental reconstruction attacks appear in the manuscript, leaving the central assumption untested.

Authors: We acknowledge that the submitted manuscript describes the detection mechanism at a high level, focusing on monitoring cross-user cache reuse patterns to flag suspicious sharing before selective isolation. To directly address this point, we will revise the paper to include concrete detection rules, the specific feature set (e.g., reuse frequency, cross-user diversity, and temporal patterns), threshold logic for flagging, and results from adversarial evaluations against incremental reconstruction attacks. These additions will substantiate the performance claims without altering the core design. revision: yes
Referee: [Evaluation] Evaluation section: the reported gains lack any description of methodology, workloads, datasets, number of users, attack implementations, or statistical analysis. Without these, it is impossible to determine whether the 70%/30% figures are reproducible or whether realistic detectors would force near-total isolation.

Authors: We agree that the evaluation section requires expanded methodological detail. In the revision, we will add descriptions of the workloads, datasets, number of users in the multi-tenant simulations, attack implementations for incremental probing, and statistical analysis of results. We will also include discussion and experiments showing that the selective isolation approach maintains high cache reuse under realistic traffic without forcing near-total isolation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with independent performance measurements

The paper describes a practical defense system (PrefixWall) that monitors cache reuse patterns and selectively isolates suspicious prefixes. Its central claims—up to 70% higher cache reuse and 30% lower latency—are presented as results from experimental evaluation against baselines that fully isolate users. No equations, fitted parameters, or derivation steps appear in the provided text. The approach relies on empirical measurement of real system behavior rather than any self-referential definition, prediction from fitted inputs, or load-bearing self-citation chain. The monitoring logic is described at a high level without reducing to a tautology or prior author result by construction. This is a standard self-contained systems paper whose performance numbers are externally falsifiable via re-implementation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The design rests on the domain assumption that cache-hit/miss timing differences are observable and exploitable, plus an implicit assumption that suspicious reuse can be detected from access patterns alone.

free parameters (1)

suspicious reuse threshold
Used to decide when to flag and isolate a prefix; value not specified in abstract.

axioms (1)

domain assumption: Cache hits produce measurably lower latency than misses, enabling side-channel inference of other users' prefixes.
Stated directly in the abstract as the root of the vulnerability.
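
To make this assumption concrete, the following sketch shows one simple way an observer could turn time-to-first-token (TTFT) measurements into a hit/miss guess. It is an illustration only: `midpoint_threshold` and `looks_like_hit` are hypothetical helpers, the midpoint-of-medians rule is an assumed choice rather than anything taken from the paper, and any numbers passed in would be calibration samples the observer collected, not results reported by the authors.

```python
from statistics import median
from typing import Sequence


def midpoint_threshold(hit_ttft_ms: Sequence[float],
                       miss_ttft_ms: Sequence[float]) -> float:
    """Place a decision threshold halfway between the two sample medians."""
    return (median(hit_ttft_ms) + median(miss_ttft_ms)) / 2.0


def looks_like_hit(ttft_ms: float, threshold_ms: float) -> bool:
    """Classify one observation: faster than the threshold is treated as a hit."""
    return ttft_ms < threshold_ms
```

The rule only works when the hit and miss latency distributions are well separated, which is exactly what the axiom above takes for granted.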


Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 5 internal anchors

[1]

Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, and Esha Choukse. 2024. Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations. arXiv:2409.17264 [cs.LG] https://arxiv.org/abs/2409. 17264

work page arXiv 2024
[2]

Gulavani, Alexey Tumanov, and Ramachandran Ramjee

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2025. Taming throughput-latency tradeoff in LLM inference with sarathi-serve. InProceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA)(OSDI’24). USENIX Association, ...

work page 2025
[3]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369 [cs.LG] https://arxiv.org/abs/2308.16369

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Alejandro Cabrera Aldaya, Billy Bob Brumley, Sohaib ul Hassan, Cesar Pereida García, and Nicola Tuveri. 2019. Port Contention for Fun and Profit. InProceedings of the 40th IEEE Symposium on Security and Privacy (S&P ’19). IEEE

work page 2019
[5]

2024.Qwen2-VL 2B Instruct Model

Alibaba Cloud. 2024.Qwen2-VL 2B Instruct Model. https://huggingface.co/Qwen/ Qwen2-VL-2B-Instruct Available on Hugging Face Hub

work page 2024
[6]

2024.Qwen2-VL 7B Instruct Model

Alibaba Cloud. 2024.Qwen2-VL 7B Instruct Model. https://huggingface.co/Qwen/ Qwen2-VL-7B-Instruct Available on Hugging Face Hub

work page 2024
[7]

2024.Qwen2.5-VL 3B Instruct Model

Alibaba Cloud. 2024.Qwen2.5-VL 3B Instruct Model. https://huggingface.co/ Qwen/Qwen2.5-VL-3B-Instruct Available on Hugging Face Hub

work page 2024
[8]

2024.Qwen2.5-VL 7B Instruct Model

Alibaba Cloud. 2024.Qwen2.5-VL 7B Instruct Model. https://huggingface.co/ Qwen/Qwen2.5-VL-7B-Instruct Available on Hugging Face Hub

work page 2024
[9]

anon8231489123. 2023. ShareGPT Vicuna unfiltered. https://huggingface.co/ datasets/anon8231489123/ShareGPT%20Vicuna%20unfiltered. Dataset on Hug- ging Face

work page 2023
[10]

F. Bang. 2023. Gptcache: An open-source semantic cache for LLM applications enabling faster answers and cost savings. InProceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023). 212–218

work page 2023
[11]

BelleGroup. 2023. Multiturn Chat 0.8M. https://huggingface.co/datasets/ BelleGroup/multiturn%20chat%200.8M. Dataset on Hugging Face

work page 2023
[12]

Andrew Bortz and Dan Boneh. 2007. Exposing private information by timing web applications. InProceedings of the 16th International Conference on World Wide Web(Banff, Alberta, Canada)(WWW ’07). Association for Computing Machinery, New York, NY, USA, 621–628. https://doi.org/10.1145/1242572.1242656 12

work page doi:10.1145/1242572.1242656 2007
[13]

Tom Brown, Benjamin Mann, Nick Ryder, et al . 2020. Language Models are Few-Shot Learners.Advances in Neural Information Processing Systems33 (2020), 1877–1901

work page 2020
[14]

Nicholas Carlini and Milad Nasr. 2024. Remote Timing Attacks on Efficient Language Model Inference. arXiv:2410.17175 [cs.CR] https://arxiv.org/abs/2410. 17175

work page arXiv 2024
[15]

Shaoyuan Chen, Yutong Lin, Mingxing Zhang, and Yongwei Wu. 2024. Effi- cient and Economic Large Language Model Inference with Attention Offloading. arXiv:2405.01814 [cs.LG] https://arxiv.org/abs/2405.01814

work page arXiv 2024
[16]

Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, and Wei Zhang. 2025. Selec- tive KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference. arXiv:2508.08438 [cs.CR] https://arxiv.org/abs/2508.08438

work page arXiv 2025
[17]

Rasmus Dahlberg and Tobias Pulls. 2023. Timeless Timing Attacks and Preload Defenses in Tor’s DNS Cache. In32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 2635–2652. https://www.usenix. org/conference/usenixsecurity23/presentation/dahlberg

work page 2023
[18]

2025.Gomini

DeepMind. 2025.Gomini. https://deepmind.google/technologies/gemini/

work page 2025
[19]

2024.DeepSeek API Docs: DeepSeek API Introduces Context Caching on Disk, Cutting Prices by an Order of Magnitude

DeepSeek. 2024.DeepSeek API Docs: DeepSeek API Introduces Context Caching on Disk, Cutting Prices by an Order of Magnitude. https://api-docs.deepseek.com/ news/news0802/ Accessed: 2025-07-17

work page 2024
[20]

Gonzalez, and Ion Stoica

Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2024. HashAttention: Semantic Sparsity for Faster Inference. arXiv:2412.14468 [cs.LG] https://arxiv.org/abs/2412.14468

work page arXiv 2024
[21]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186. https://doi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.04805 2019
[22]

Abu-Ghazaleh, and Dmitry Ponomarev

Dmitry Evtyushkin, Ryan Riley, Nael B. Abu-Ghazaleh, and Dmitry Ponomarev

work page
[23]

InProceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’18)

BranchScope: A New Side-Channel Attack on Directional Branch Predictor. InProceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’18). ACM

work page
[24]

Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi. 2024. LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference. arXiv:2407.14057 [cs.CL] https://arxiv.org/abs/2407. 14057

work page arXiv 2024
[25]

V. Gallego. 2024. Configurable Safety Tuning of Language Models with Synthetic Preference Data. (2024). Preprint

work page 2024
[26]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2025. Cost-efficient large language model serving for multi-turn conversations with CachedAttention. In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA)(USENIX ATC’24). USENIX Associati...

work page 2025
[27]

In Gim, Guojun Chen, Seung seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. arXiv:2311.04934 [cs.CL] https://arxiv.org/abs/2311.04934

work page arXiv 2024
[28]

2025.Gemma 3 4B Instruct Model

Google DeepMind. 2025.Gemma 3 4B Instruct Model. https://huggingface.co/ google/gemma-3-4b-it Available on Hugging Face Hub

work page 2025
[29]

Ben Gras, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2018. Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks. InProceedings of the 27th USENIX Security Symposium (USENIX Security ’18). USENIX Association

work page 2018
[30]

Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. 2017. ASLR on the Line: Practical Cache Attacks on the MMU. InNDSS. Paper=https: //download.vusec.net/papers/anc_ndss17.pdfSlides=https://vusec.net/wp- content/uploads/2016/11/TalkGras.pdfWeb=https://www.vusec.net/projects/ ancCode=https://github.com/vusec/revancPress=https://goo.gl/KL4Bta

work page 2017
[31]

Daniel Gruss, Erik Kraft, Trishita Tiwari, Michael Schwarz, Ari Trachtenberg, Jason Hennessey, Alex Ionescu, and Anders Fogh. 2019. Page Cache Attacks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security(London, United Kingdom)(CCS ’19). Association for Computing Ma- chinery, New York, NY, USA, 167–180. https://doi.org...

work page doi:10.1145/3319535.3339809 2019
[32]

Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, and Tatsunori Hashimoto. 2025. Auditing Prompt Caching in Language Model APIs. arXiv:2502.07776 [cs.CL] https://arxiv.org/abs/2502.07776

work page arXiv 2025
[33]

Marcus Hähnel, Weidong Cui, and Marcus Peinado. 2017. High-resolution side channels for untrusted operating systems. InProceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference(Santa Clara, CA, USA)(USENIX ATC ’17). USENIX Association, USA, 299–312

work page 2017
[34]

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. DeepSpeed-FastGen: High- throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv:2401.08671 [cs.PF] https://arxiv.org/abs/2401.08671

work page arXiv 2024
[35]

Inman and Edwin L

Henry F. Inman and Edwin L. Bradley. 1989. The overlapping coefficient as a measure of agreement between probability distributions.Communications in Statistics-Theory and Methods18, 10 (1989), 3851–3874. https://doi.org/10.1080/ 03610928908830127

work page 1989
[36]

Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. MInference 1.0: Accelerating Pre-filling for Long- Context LLMs via Dynamic Sparse Attention. arXiv:2407.02490 [cs.CL] https: //arxiv.org/abs/2407.02490

work page arXiv 2024
[37]

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation.ACM Trans. Comput. Syst.44, 1, Article 2 (Nov. 2025), 27 pages. https://doi.org/10.1145/3768628

work page doi:10.1145/3768628 2025
[38]

Fu, Christopher Ré, and Azalia Mirhoseini

Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, and Azalia Mirhoseini. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099 [cs.LG] https://arxiv.org/abs/2402.05099

work page arXiv 2024
[39]

David Kohlbrenner and Hovav Shacham. 2016. Trusted browsers for uncertain times. InProceedings of the 25th USENIX Conference on Security Symposium (Austin, TX, USA)(SEC’16). USENIX Association, USA, 463–480

work page 2016
[40]

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles. 611–626

work page 2023
[41]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 155–172. https://www.usenix.org/conference/osdi24/presentation/lee

work page 2024
[42]

Jieyu Lin, Sai Qian Zhang, and Alberto Leon-Garcia. 2024. sLLM: Accelerating LLM Inference using Semantic Load Balancing with Shared Memory Data Struc- tures. In2024 25th International Symposium on Quality Electronic Design (ISQED). 1–6. https://doi.org/10.1109/ISQED60706.2024.10528703

work page doi:10.1109/isqed60706.2024.10528703 2024
[43]

Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889 [cs.CL] https://arxiv. org/abs/2310.01889

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InProceedings of the ACM SIGCOMM 2024 Conference(Sydney, NSW, Austra...

work page doi:10.1145/3651890.3672274 2024
[45]

2024.LLaV A-OneVision Qwen2 0.5B (OV-HF)

LLaVA Team. 2024.LLaV A-OneVision Qwen2 0.5B (OV-HF). https://huggingface. co/llava-onevision-qwen2-0.5b-ov-hf Available on Hugging Face Hub

work page 2024
[46]

2024.LLaV A-OneVision Qwen2 7B OV-Chat (HF)

LLaVA Team. 2024.LLaV A-OneVision Qwen2 7B OV-Chat (HF). https:// huggingface.co/llava-onevision-qwen2-7b-ov-chat-hf Available on Hugging Face Hub

work page 2024
[47]

Brandenburg, Peter Druschel, and Deepak Garg

Aastha Mehta, Mohamed Alzayat, Roberta De Viti, Björn B. Brandenburg, Peter Druschel, and Deepak Garg. 2022. Pacer: Comprehensive Network Side-Channel Mitigation in the Cloud. In31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 2819–2838. https://www.usenix.org/ conference/usenixsecurity22/presentation/mehta

work page 2022
[48]

2023.Llama 2 13B Chat Model

Meta AI. 2023.Llama 2 13B Chat Model. https://huggingface.co/meta-llama/ Llama-2-13b-chat-hf Available on Hugging Face Hub

work page 2023
[49]

2023.Llama 2 7B Chat Model

Meta AI. 2023.Llama 2 7B Chat Model. https://huggingface.co/meta-llama/Llama- 2-7b-chat-hf Available on Hugging Face Hub

work page 2023
[50]

2024.Prompt Caching: Reduce Latency and Cost with Prompt Caching

OpenAI. 2024.Prompt Caching: Reduce Latency and Cost with Prompt Caching. https://platform.openai.com/docs/guides/prompt-caching Accessed: 2025-07-17

work page 2024
[51]

Fletcher

Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. 2021. Lord of the Ring(s): Side Channel Attacks on the CPU On-Chip Ring Interconnect Are Practical. InProceedings of the 30th USENIX Security Symposium (USENIX Security ’21). USENIX Association

work page 2021
[52]

Zixuan Pang, Wenhao Wang, and Yong Liao. 2024. Cache Partitioning for Miti- gating Timing Side-Channel Attacks in LLM Serving Systems. In2024 6th Interna- tional Conference on Frontier Technologies of Information and Computer (ICFTIC). 1238–1245. https://doi.org/10.1109/ICFTIC64248.2024.10913329

work page doi:10.1109/icftic64248.2024.10913329 2024
[53]

Konstantinos Papaioannou and Thaleia Dimitra Doudali. 2024. The Importance of Workload Choice in Evaluating LLM Inference Systems. InProceedings of the 4th Workshop on Machine Learning and Systems(Athens, Greece)(EuroMLSys ’24). Association for Computing Machinery, New York, NY, USA, 39–46. https: //doi.org/10.1145/3642970.3655823

work page doi:10.1145/3642970.3655823 2024
[54]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Infer- ence Using Phase Splitting. In2024 ACM/IEEE 51st Annual International Sym- posium on Computer Architecture (ISCA). 118–132. https://doi.org/10.1109/ ISCA59077.2024.00019

work page arXiv 2024
[55]

Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. 2016. DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks. InProceedings of the 25th USENIX Security Symposium (USENIX Security ’16). 13 USENIX Association

work page 2016
[56]

R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y. Wu, W. Zheng, and X. Xu. 2025. Mooncake: Trading More Storage for Less Computation — A KVCache-Centric Architecture for Serving LLM Chatbot. In23rd USENIX Conference on File and Storage Technologies (FAST 25). 155–170

work page 2025
[57]

Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, and Shafiq Joty. 2025. Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction. https://openreview.net/forum?id=9iN8p1Xwtg

work page 2025
[58]

Peter Snyder, Soroush Karami, Arthur Edelstein, Benjamin Livshits, and Hamed Haddadi. 2023. Pool-party: exploiting browser resource pools for web tracking. InProceedings of the 32nd USENIX Conference on Security Symposium(Anaheim, CA, USA)(SEC ’23). USENIX Association, USA, Article 397, 15 pages

work page 2023
[59]

Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, and Rui Hou. 2025. The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems. arXiv:2409.20002 [cs.CR] https://arxiv.org/abs/2409.20002

work page arXiv 2025
[60]

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 173–191. https://www.usenix. org/conference/osdi24/presentation/sun-biao

work page 2024
[61]

Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2024. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference. arXiv:2410.21465 [cs.LG] https://arxiv.org/abs/2410.21465

[62]

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: query-aware sparsity for efficient long-context LLM inference. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML’24). JMLR.org, Article 1955, 11 pages

[63]

Hugo Touvron et al. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023)

[64]

Pepe Vila and Boris Köpf. 2017. Loophole: Timing Attacks on Shared Event Loops in Chrome. In 26th USENIX Security Symposium (USENIX Security 17). USENIX Association, Vancouver, BC, 849–864. https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/vila

[65]

vLLM Team. 2024. vLLM: High-Throughput Serving for Large Language Models. https://github.com/vllm-project/vllm

[66]

vLLM Team. 2025. Automatic Prefix Caching in vLLM. https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/

[67]

Charles V. Wright, Lucas Ballard, Scott E. Coull, Fabian Monrose, and Gerald M. Masson. 2008. Spot Me if You Can: Uncovering Spoken Phrases in Encrypted VoIP Conversations. In 2008 IEEE Symposium on Security and Privacy (sp 2008). 35–49. https://doi.org/10.1109/SP.2008.21

[68]

Guanlong Wu, Zheng Zhang, Yao Zhang, Weili Wang, Jianyu Niu, Ye Wu, and Yinqian Zhang. 2025. I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving. In Proceedings of the 2025 Network and Distributed System Security (NDSS) Symposium. San Diego, CA, USA

[69]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453 [cs.CL] https://arxiv.org/abs/2309.17453

[70]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. In Proceedings of the Twentieth European Conference on Computer Systems (Rotterdam, Netherlands) (EuroSys ’25). Association for Computing Machinery, New Y...

[71]

Yuval Yarom and Katrina Falkner. 2014. FLUSH+RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security ’14). USENIX Association

[72]

Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangk...

[73]

Lingfan Yu, Jinkun Lin, and Jinyang Li. 2025. Stateful Large Language Model Serving with Pensieve. In Proceedings of the Twentieth European Conference on Computer Systems (Rotterdam, Netherlands) (EuroSys ’25). Association for Computing Machinery, New York, NY, USA, 144–158. https://doi.org/10.1145/3689031.3696086

[74]

Siyan Zhao, Daniel Israel, Guy Van den Broeck, and Aditya Grover. 2024. Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models. arXiv:2404.09529 [cs.LG] https://arxiv.org/abs/2404.09529

[75]

L. Zheng, L. Yin, Z. Xie, C. L. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, et al. 2024. SGLang: Efficient Execution of Structured Language Model Programs. Advances in Neural Information Processing Systems 37 (2024), 62557–62583

[76]

Xinyao Zheng, Husheng Han, Shangyi Shi, Qiyan Fang, Zidong Du, Xing Hu, and Qi Guo. 2024. InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks. arXiv:2411.18191 [cs.CR] https://arxiv.org/abs/2411.18191

[77]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 193–210. https://www.usenix.org/co...

[78]

Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, and Chao Yang. 2025. SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention. In Eighth Conference on Machine Learning and Systems. https://openreview.net/forum?id=RuZ80yl71h
