arxiv: 2605.23640 · v1 · pith:5KZCNDITnew · submitted 2026-05-22 · 💻 cs.CR

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

Pith reviewed 2026-05-25 04:12 UTC · model grok-4.3

classification 💻 cs.CR
keywords KV cache · side-channel attacks · LLM inference · privacy · cache sharing · token-level masking · serving systems

The pith

CachePrune masks sensitive tokens at the individual level so that LLMs can safely reuse the rest of each KV cache entry across users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM serving systems share KV caches across requests to avoid repeating prefix computations and to lower time-to-first-token (TTFT). Full sharing, however, creates a side channel: an attacker can probe whether their prompt produced a cache hit and thereby learn parts of another user's input. Coarse defenses therefore turn sharing off entirely, forgoing large efficiency gains on the many non-sensitive segments that prompts contain. CachePrune instead identifies and excludes only the sensitive tokens, then manages the resulting irregular reusable spans so that the remaining KV entries can still be shared. The design removes the direct leakage path while delivering the reported 4.5x TTFT reduction and 44 percent higher cache hit rate.
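To make the side channel concrete, here is a minimal Python sketch of the probing idea described above. It is a toy under stated assumptions, not the paper's attack and not vLLM's cache: `ToyPrefixCache`, `BLOCK`, and the example prompts are hypothetical, and the returned hit length stands in for the TTFT difference a real attacker would have to measure.

```python
from typing import List, Set, Tuple

BLOCK = 2  # hypothetical block size in tokens


class ToyPrefixCache:
    """Toy shared cache keyed by the full token prefix at each block boundary."""

    def __init__(self) -> None:
        self._blocks: Set[Tuple[str, ...]] = set()

    def serve(self, tokens: List[str]) -> int:
        """Return the number of leading tokens served from cache, then cache the request."""
        hit = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:end]) in self._blocks:
                hit = end
            else:
                break  # prefix reuse stops at the first missing block
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self._blocks.add(tuple(tokens[:end]))
        return hit


cache = ToyPrefixCache()
victim = "You are a bank bot . My PIN is 4821".split()
cache.serve(victim)  # the victim's request populates the shared cache

# The attacker probes two guesses and compares how much of each was a hit.
wrong = cache.serve("You are a bank bot . My PIN is 0000".split())
right = cache.serve("You are a bank bot . My PIN is 4821".split())
print(wrong, right)  # 8 10
```

The longer hit for the correct guess is the leak. The paper's stated goal is to exclude tokens such as the PIN from sharing, so that a correct guess earns no extra hit, while the public part of the prefix stays reusable.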

Core claim

CachePrune derives reusable KV segments after token-level sensitivity masking and retrieves them efficiently over variable-length spans. Implemented on vLLM and tested on three datasets, the mechanism eliminates direct leakage through KV cache reuse side channels while reducing TTFT by 4.5x and increasing cache hit rates by 44 percent relative to state-of-the-art approaches.

What carries the argument

Token-level sensitivity masking followed by variable-length KV segment derivation and retrieval.

If this is right

  • Prompts containing both public instructions and private user data can still obtain most of the reuse benefit.
  • Serving systems no longer face an all-or-nothing choice between isolation and performance.
  • Cache hit rates rise because occasional sensitive tokens no longer block reuse of the surrounding context.
  • Time-to-first-token drops as more prefix computation is safely reused across independent requests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pruning logic could be applied to other shared inference state such as activation buffers or attention maps.
  • Automated sensitivity classifiers already used for content moderation could supply the masks with little extra engineering.
  • Variable-length span retrieval techniques may transfer to irregular reuse patterns in non-LLM serving workloads.

Load-bearing premise

Sensitive segments can be identified and masked at token granularity without missing leakage paths or creating new side channels.

What would settle it

An experiment in which an adversary successfully recovers private input by observing the pattern of which KV segments are shared under CachePrune would falsify the privacy guarantee.

Figures

Figures reproduced from arXiv: 2605.23640 by Guanlong Wu, Jianyu Niu, Yao Zhang, Ye Wu, Yinqian Zhang, Zhaohan Li, Zheng Zhang.

Figure 1: KV cache sharing mechanisms. White blocks denote unreusable KV cache, gray blocks denote reusable KV cache, and hatched blocks denote recomputed KV cache.
    Excerpt: … memory usage to scale with sequence length and model size. In multi-tenant serving, this trade-off directly constrains concurrency and throughput under a fixed GPU budget. 2.2 KV Cache Sharing. While KV caching improves efficiency, its memory usage still …
Figure 2: TTFT reduction.
Figure 5: System architecture of CachePrune. New modules introduced by CachePrune are highlighted in light yellow.
    Excerpt: KV cache management. In chunk-level management, reusable segments that span across chunk boundaries often cause mismatches and are discarded, reducing cache hit rates (…
Figure 6: Deriving reusable KV segments under sensitive tokens.
    Excerpt: 4.2 Challenges and Algorithmic Solutions. 4.2.1 C1: Efficiently and Accurately Deriving Reusable KV Segments. After inference, the Sensitivity Detector analyzes the input token sequence $P = \{t_1, t_2, \ldots, t_n\}$ and produces a binary sensitivity mask $M \in \{0, 1\}^n$, where $M_i = 1$ indicates that token $t_i$ contains sensitive information. Tokens with $M_i = 1$ a…
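The mask-to-segment step in the Figure 6 excerpt can be sketched in Python as follows. This is a minimal illustration, assuming that reusable segments are the maximal runs of tokens with mask value 0; the function names, the `min_len` parameter, and the prefix-sum helper are hypothetical and not taken from the paper. The sketch covers only the span bookkeeping, not the KV recomputation that Figure 1 mentions.

```python
from itertools import accumulate
from typing import List, Tuple


def reusable_segments(mask: List[int], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) spans of maximal runs where mask[i] == 0."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, m in enumerate(mask):
        if m == 0 and start is None:
            start = i  # a non-sensitive run begins
        elif m == 1 and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None  # a sensitive token closes the run
    if start is not None and len(mask) - start >= min_len:
        segments.append((start, len(mask)))
    return segments


def sensitive_prefix_counts(mask: List[int]) -> List[int]:
    """prefix[k] is the number of sensitive tokens among the first k tokens."""
    return [0] + list(accumulate(mask))


def span_is_clean(prefix: List[int], start: int, end: int) -> bool:
    """Constant-time check that [start, end) contains no sensitive token."""
    return prefix[end] - prefix[start] == 0


mask = [0, 0, 1, 0, 0, 0, 1, 1, 0]
print(reusable_segments(mask))             # [(0, 2), (3, 6), (8, 9)]
print(reusable_segments(mask, min_len=2))  # [(0, 2), (3, 6)]
```

The prefix-sum helper is one simple way to test arbitrary spans after the mask is known; whether the paper uses the same device is not stated in the excerpt.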
Figure 8: Retrieving KV segments based on rolling hash.
    Excerpt: 4.2.2 C2: Retrieving Dynamic-Length KV Segments. Existing KV retrieval methods (Sec. 2.2) rely on fixed-size chunks, where matching reduces to $O(1)$ hash lookups per chunk. In contrast, CachePrune operates at token granularity, producing variable-length segments (Sec. 4.2.1). Retrieval therefore becomes a substring containment problem: identifying whether a …
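The substring containment problem named in the Figure 8 excerpt can be illustrated with a textbook Rabin-Karp rolling hash over token IDs. This is a generic sketch, not CachePrune's retrieval algorithm: `find_segment`, `BASE`, and `MOD` are hypothetical choices, and a real index would need to handle many cached segments of different lengths at once.

```python
from typing import List

BASE = 1_000_003     # hypothetical polynomial base
MOD = (1 << 61) - 1  # large prime modulus to keep collisions rare


def find_segment(prompt: List[int], segment: List[int]) -> int:
    """Return the first index where `segment` occurs in `prompt`, or -1."""
    n, m = len(prompt), len(segment)
    if m == 0 or m > n:
        return -1
    high = pow(BASE, m - 1, MOD)  # weight of the token leaving the window
    target = 0
    window = 0
    for i in range(m):
        target = (target * BASE + segment[i]) % MOD
        window = (window * BASE + prompt[i]) % MOD
    for start in range(n - m + 1):
        # Compare tokens on a hash match to rule out collisions.
        if window == target and prompt[start:start + m] == segment:
            return start
        if start + m < n:
            # Slide the window by one token in constant time.
            window = (window - prompt[start] * high) % MOD
            window = (window * BASE + prompt[start + m]) % MOD
    return -1


print(find_segment([5, 9, 2, 7, 7, 3], [2, 7, 7]))  # 2
print(find_segment([1, 2, 3], [3, 1]))              # -1
```

Each window update costs constant time, so one scan of the prompt is linear in its length for a fixed segment length, which is the property that makes rolling hashes attractive when matches can start at any token rather than at chunk boundaries.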
Figure 9: Impact of imperfect privacy detection (QASPER).
    Excerpt: … inject errors by simulating 0–20% False Negatives (FN) and False Positives (FP), covering and exceeding typical reported rates (both below 10% for Presidio [30]). We randomly perturb the sensitivity mask and evaluate on all datasets, presenting QASPER in the main text and others in Appendix A. FN directly increase exact recovery by exposing unmarked sensitive…
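The error injection described in the Figure 9 excerpt can be sketched as a random perturbation of the sensitivity mask. This is one plausible reading, assuming independent per-token flips; the excerpt does not say whether the paper perturbs tokens or whole detected entities, and `perturb_mask` with its parameters is a hypothetical name.

```python
import random
from typing import List


def perturb_mask(mask: List[int], fn_rate: float, fp_rate: float, seed: int = 0) -> List[int]:
    """Flip sensitive tokens to 0 with probability fn_rate (false negatives)
    and non-sensitive tokens to 1 with probability fp_rate (false positives)."""
    rng = random.Random(seed)  # seeded so an experiment can be repeated
    noisy: List[int] = []
    for m in mask:
        if m == 1 and rng.random() < fn_rate:
            noisy.append(0)  # missed sensitive token: it becomes shareable
        elif m == 0 and rng.random() < fp_rate:
            noisy.append(1)  # over-flagged token: it loses reuse
        else:
            noisy.append(m)
    return noisy


mask = [0, 1, 0, 1, 1, 0]
print(perturb_mask(mask, fn_rate=1.0, fp_rate=0.0))  # [0, 0, 0, 0, 0, 0]
print(perturb_mask(mask, fn_rate=0.0, fp_rate=1.0))  # [1, 1, 1, 1, 1, 1]
```

The two error types affect different properties: false negatives expose sensitive tokens to sharing and so bear on privacy, while false positives only remove reusable tokens and so bear on efficiency.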
Figure 10: Match rate impact (Mistral-7B).
Figure 13: Impact of match rate on efficiency.
    Excerpt: … this experiment, we fix the recompute rate at 25%, which serves as our default setting and will be further examined in Sec. 5.3.2. Results (…
Figure 15: Computation overhead of core components in CachePrune.
    Excerpt: At the default setting of 25%, CachePrune achieves over 3× lower TTFT than the non-sharing baseline (100% recompute). Impact of segment length (…
Figure 16: Impact of imperfect privacy detection (NarrativeQA). Panel: (a) False negative.
Figure 17: Impact of imperfect privacy detection (QMSum).
    Excerpt: Configuration. In our experiments, we set both guess_top_k and judge_top_k to 1. The prompts used for LLMs are unmodified, following prior work [51]. A.2 Impact of Imperfect Detection. In addition to the results presented in the main paper on QASPER, we further evaluate CachePrune on NarrativeQA (…
Figure 18: Match rate impact (Qwen-7B).
Figure 21: Match rate impact (Qwen-14B).
Figure 24: Impact of match rate on efficiency (Qwen-7B). Panels: (a) TTFT, (b) Throughput.
Figure 25: Impact of match rate on efficiency (Qwen-14B).
    Excerpt: D PRIVACY DETECTION SETTINGS. Following prior work on sensitivity identification in LLM prompts [38], we adopt Presidio [30], a widely used privacy detection toolkit, as …
Figure 28: Impact of segment length on efficiency. Panel: (a) Match rate.
Figure 29: Impact of privacy detection methods.
    Excerpt:
      • Personal identifiers: CREDIT_CARD, CRYPTO, IBAN_CODE, EMAIL_ADDRESS, NRP (passport), PERSON, PHONE_NUMBER, SSN, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_SSN, US_PASSPORT.
      • Non-personal identifiers: DATE_TIME, IP_ADDRESS, LOCATION, URL, AU_ABN, AU_ACN.
    We configure three privacy levels for evaluation, as summarized in …
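Detectors such as Presidio report matches as character spans, whereas the sensitivity mask described earlier is defined per token. The following Python sketch shows one way to bridge the two, assuming the tokenizer exposes a character offset pair for each token. `spans_to_token_mask`, the example offsets, and the example span are hypothetical, and no detector is actually called here.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character range [start, end)


def spans_to_token_mask(token_offsets: List[Span], sensitive_spans: List[Span]) -> List[int]:
    """Mark a token as sensitive (1) if it overlaps any detected character span."""
    mask: List[int] = []
    for tok_start, tok_end in token_offsets:
        overlaps = any(
            tok_start < span_end and span_start < tok_end
            for span_start, span_end in sensitive_spans
        )
        mask.append(1 if overlaps else 0)
    return mask


text = "Email bob@x.org now"
# Hypothetical tokenization: "Email", " bob", "@x", ".org", " now"
offsets = [(0, 5), (5, 9), (9, 11), (11, 15), (15, 19)]
# Hypothetical detector output: an EMAIL_ADDRESS match on characters 6..15
detected = [(6, 15)]
print(spans_to_token_mask(offsets, detected))  # [0, 1, 1, 1, 0]
```

Marking any overlapping token as sensitive errs toward over-masking, which costs some reuse but avoids sharing a token that carries part of a detected entity.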
Original abstract

Large Language Models (LLMs) rely on Key-Value (KV) caching to accelerate inference, and many serving systems further share the KV cache across users' requests to reduce redundant computation. While widely adopted, unrestricted cross-user sharing introduces side-channel vulnerabilities, allowing an adversary to infer user inputs by probing for cache reuse. Existing defenses disable sharing entirely to prevent leakage; yet such a coarse-grained strategy sacrifices substantial reuse potential, since prompts often include large portions of privacy-irrelevant segments, such as system instructions or publicly accessible materials. Building on this, we present CachePrune, a privacy-aware KV cache sharing mechanism that enables fine-grained reuse of KV entries across requests. Realizing such fine granularity requires token-level cache management, as reusable segments vary in length and position due to sensitivity masking, making reuse more complex than the fixed-size or sentence-level chunking used in existing coarse-grained schemes. Specifically, CachePrune makes fine-grained reuse practical by addressing two key challenges: accurately and efficiently deriving reusable KV segments and efficiently retrieving them over variable-length spans. We implement CachePrune on top of vLLM and evaluate it on three datasets, showing that it eliminates direct leakage through KV cache reuse side channels while reducing TTFT by 4.5x and increasing cache hit rates by 44% compared with state-of-the-art approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces CachePrune, a system for privacy-aware fine-grained KV cache sharing during LLM inference. It enables token-level management of reusable KV segments by masking sensitive portions, claiming to eliminate direct leakage through KV cache reuse side channels. The approach is said to address challenges in deriving reusable segments and retrieving variable-length spans, yielding a 4.5x reduction in TTFT and 44% higher cache hit rates versus state-of-the-art methods when implemented on vLLM and evaluated on three datasets.

Significance. If the privacy and performance claims hold, CachePrune could meaningfully improve the efficiency of multi-tenant LLM serving by permitting more cache reuse without the privacy costs of fully disabling sharing. The fine-grained, token-level design directly targets a limitation of existing coarse-grained defenses. The reported implementation on vLLM and quantitative gains on multiple datasets would be practical strengths if the supporting methodology and analysis are provided.

major comments (2)
  1. [Abstract] The headline claim that CachePrune 'eliminates direct leakage through KV cache reuse side channels' is load-bearing for the contribution, yet the text supplies no threat model, no description of the sensitive-segment identification procedure (heuristic, model-based, or otherwise), no accuracy metrics such as false-negative rate, and no analysis of whether the variable-length retrieval or pruning logic introduces new timing or access-pattern channels.
  2. [Abstract] The performance claims (4.5x TTFT reduction, 44% cache-hit-rate increase) are presented as outcomes of an implementation and evaluation on three datasets, but the abstract contains no methodology, baseline definitions, measurement details, or error analysis, so the quantitative results cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the abstract accordingly as part of a major revision.

Point-by-point responses
  1. Referee: [Abstract] The headline claim that CachePrune 'eliminates direct leakage through KV cache reuse side channels' is load-bearing for the contribution, yet the text supplies no threat model, no description of the sensitive-segment identification procedure (heuristic, model-based, or otherwise), no accuracy metrics such as false-negative rate, and no analysis of whether the variable-length retrieval or pruning logic introduces new timing or access-pattern channels.

    Authors: The abstract is a concise summary; the full threat model, sensitive-segment identification procedure, false-negative rate metrics, and analysis showing that variable-length retrieval and pruning do not introduce new timing or access-pattern channels are presented in Sections 3, 4, and 6 of the manuscript. We agree the abstract should better support the claim and will revise it to include a brief threat model statement, a high-level description of the identification procedure, reference to the accuracy metrics, and a note on the channel analysis. revision: yes

  2. Referee: [Abstract] The performance claims (4.5x TTFT reduction, 44% cache-hit-rate increase) are presented as outcomes of an implementation and evaluation on three datasets, but the abstract contains no methodology, baseline definitions, measurement details, or error analysis, so the quantitative results cannot be assessed.

    Authors: The abstract already notes implementation on vLLM and evaluation on three datasets with comparison to state-of-the-art approaches, but we agree it lacks sufficient methodological context. We will revise the abstract to briefly define the baselines (coarse-grained KV sharing methods), note the measurement of TTFT and hit rates under the described workloads, and indicate that full error analysis and methodology appear in the evaluation section. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents CachePrune as a system implementation evaluated on three datasets, with claims about leakage elimination and performance gains (TTFT 4.5x, hit rate +44%) stated as empirical outcomes. No equations, fitted parameters, predictions, or self-citations appear in the abstract or described structure that reduce any result to its inputs by construction. The work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5787 in / 1012 out tokens · 29156 ms · 2026-05-25T04:12:36.963444+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 5 internal anchors

  [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

  [2] Mark Ackerman, Trevor Darrell, and Daniel J Weitzner. 2001. Privacy in context. Human–Computer Interaction 16, 2-4 (2001), 167–176.

  [3] Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–28.

  [4] AllenAI. 2025. allenai/WildChat. https://huggingface.co/datasets/allenai/WildChat. (2025).

  [5] Azure. 2025. What is PII detection in Azure Language? https://learn.microsoft.com/en-us/azure/ai-services/language-service/personally-identifiable-information/overview?tabs=text-pii. (2025).

  [6] Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, and Sanjiv Kumar. 2021. Leveraging redundancy in attention with reuse transformers. arXiv preprint arXiv:2110.06821 (2021).

  [7] AR Chayapathi, G Sunil Kumar, Manjunath BE Swamy, J Thriveni, and KR Venugopal. 2021. Survey and comparison of string matching algorithms. Turkish Journal of Computer and Mathematics Education 12, 12 (2021), 1471–1491.

  [8] Claude. 2025. Prompt caching. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching. (2025).

  [9] Franklin C Crow. 1984. Summed-area tables for texture mapping. In Proceedings of the 11th annual conference on Computer graphics and interactive techniques. 207–212.

  [10] Deepmind. 2025. deepmind/narrativeqa. https://huggingface.co/datasets/deepmind/narrativeqa. (2025).

  [11] Xiaowan Dong, Zhuojia Shen, John Criswell, Alan L Cox, and Sandhya Dwarkadas. 2018. Shielding Software From Privileged Side-Channel Attacks. In 27th USENIX Security Symposium (USENIX Security 18). 1441–1458.

  [12] Javier Ferrando, Gerard I Gállego, and Marta R Costa-Jussà. 2022. Measuring the mixing of contextual information in the transformer. arXiv preprint arXiv:2203.04212 (2022).

  [13] Gemini. 2025. Context Caching. https://ai.google.dev/gemini-api/docs/caching. (2025).

  [14] Mark Harris, Shubhabrata Sengupta, and John D Owens. 2007. Parallel prefix sum (scan) with CUDA. GPU gems 3, 39 (2007), 851–876.

  [15] Justin Hensley, Thorsten Scheuermann, Greg Coombe, Montek Singh, and Anselmo Lastra. 2005. Fast summed-area table generation and its applications. In Computer Graphics Forum, Vol. 24. Amsterdam: North Holland, 1982-, 547–556.

  [16] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37 (2024), 1270–1303.

  [17] Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. 2024. EPIC: Efficient Position-Independent Caching for Serving Large Language Models. arXiv preprint arXiv:2410.15332 (2024).

  [18] Hao Jiang and Sian-Jheng Lin. 2020. A rolling hash algorithm and the implementation to LZ4 data compression. IEEE Access 8 (2020), 35529–35534.

  [19] Shuowei Jin, Xueshen Liu, Qingzhao Zhang, and Z Morley Mao. 2024. Compute or load kv cache? why not both? arXiv preprint arXiv:2410.03065 (2024).

  [20] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626.

  [21] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172.

  [22] Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. 2024. A survey on large language model acceleration based on kv cache management. arXiv preprint arXiv:2412.19442 (2024).

  [23] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024).

  [24] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, and Ruby B Lee. 2016. Catalyst: Defeating last-level cache side channel attacks in cloud computing. In 2016 IEEE international symposium on high performance computer architecture (HPCA). IEEE, 406–418.

  [25] LMSys. 2025. lmsys/lmsys-chat-1m. https://huggingface.co/datasets/lmsys/lmsys-chat-1m. (2025).

  [26] Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, and Anima Anandkumar. 2025. Headinfer: Memory-efficient llm inference by head-wise offloading. arXiv preprint arXiv:2502.12574 (2025).

  [27] Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, and Zhan Qin. 2025. Shadow in the cache: Unveiling and mitigating privacy risks of kv-cache in llm inference. arXiv preprint arXiv:2508.09442 (2025).

  [28] Erika McCallister. 2010. Guide to protecting the confidentiality of personally identifiable information. Diane Publishing.

  [29] Microsoft. 2024. microsoft/msmarco. https://huggingface.co/datasets/microsoft/ms_marco. (2024).

  [30] Microsoft. 2025. Presidio: Data Protection and De-identification SDK. https://microsoft.github.io/presidio/. (2025).

  [31] MistralAI. 2025. mistralai/Mistral-7B-v0.1. https://huggingface.co/mistralai/Mistral-7B-v0.1. (2025).

  [32] Behrang Mohit. 2014. Named entity recognition. In Natural language processing of semitic languages. Springer, 221–245.

  [33] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30, 1 (2007), 3–26.

  [34] Hyoungwook Nam, Raghavendra Pradyumna Pothukuchi, Bo Li, Nam Sung Kim, and Josep Torrellas. 2023. Defensive ml: Defending architectural side-channels with adversarial obfuscation. arXiv preprint arXiv:2302.01474 (2023).

  [35] Arvind Narayanan and Vitaly Shmatikov. 2010. Myths and fallacies of "personally identifiable information". Commun. ACM 53, 6 (2010), 24–26.

  [36] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology (2023).

  [37] Helen Nissenbaum. 2018. Respecting context to protect privacy: Why meaning matters. Science and engineering ethics 24, 3 (2018), 831–852.

  [38] Albert Agisha Ntwali, Luca Rück, and Martin Heckmann. 2025. Detection of Personal Data in Structured Datasets Using a Large Language Model. arXiv preprint arXiv:2506.22305 (2025).

  [39] Nvidia. 2025. Structuring Applications to Secure the KV Cache. https://developer.nvidia.com/blog/structuring-applications-to-secure-the-kv-cache/. (2025).

  [40] OpenAI. 2025. OpenAI Privacy Filter. https://huggingface.co/openai/privacy-filter. (2025).

  [41] OpenAI. 2025. Prompt caching. https://platform.openai.com/docs/guides/prompt-caching. (2025).

  [42] Zixuan Pang, Wenhao Wang, and Yong Liao. 2024. Cache partitioning for mitigating timing side-channel attacks in llm serving systems. In 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC). IEEE, 1238–1245.

  [43] Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad, DB Emerson, Shubhankar Mohapatra, and Xi He. 2026. CAPID: Context-Aware PII Detection for Question-Answering Systems. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 320–331.

  [44] Qasper. 2025. allenai/qasper. https://huggingface.co/datasets/allenai/qasper. (2025).

  [45] Qwen. 2025. Qwen/Qwen2.5-14B. https://huggingface.co/Qwen/Qwen2.5-14B. (2025).

  [46] Qwen. 2025. Qwen/Qwen2.5-7B. https://huggingface.co/Qwen/Qwen2.5-7B. (2025).

  [47] Kartik Ramkrishnan, Antonia Zhai, Stephen McCamant, and Pen Chung Yew. 2019. New attacks and defenses for randomized caches. arXiv preprint arXiv:1909.12302 (2019).

  [49] RyokoAI. 2025. RyokoAI/ShareGPT52K. https://huggingface.co/datasets/RyokoAI/ShareGPT52K. (2025).

  [50] Paul M Schwartz and Daniel J Solove. 2011. The PII problem: Privacy and a new concept of personally identifiable information. NYUL rev. 86 (2011), 1814.

  [51] Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, and Rui Hou. 2024. The early bird catches the leak: Unveiling timing side channels in llm serving systems. arXiv preprint arXiv:2409.20002 (2024).

  [52] Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev. 2023. Beyond memorization: Violating privacy via inference with large language models. arXiv preprint arXiv:2310.07298 (2023).

  [53] Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, and Min Zhang. 2025. Accurate kv cache quantization with outlier tokens tracing. arXiv preprint arXiv:2505.10938 (2025).

  [54] Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, and Gongyi Wang. 2024. Razorattention: Efficient kv cache compression through retrieval heads. arXiv preprint arXiv:2407.15891 (2024).

  [55] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).

  [56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).

  [57] Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284 (2019).

  [58] Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, and Urmish Thakker. 2025. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference. arXiv preprint arXiv:2503.08879 (2025).

  [59] Longxiang Wang, Xiang Zheng, Xuhao Zhang, Yao Zhang, Ye Wu, and Cong Wang. 2026. OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services. arXiv preprint arXiv:2602.20595 (2026).

  [60] Mario Werner, Thomas Unterluggauer, Lukas Giner, Michael Schwarz, Daniel Gruss, and Stefan Mangard. 2019. ScatterCache: thwarting cache attacks via cache set randomization. In 28th USENIX Security Symposium (USENIX Security 19). 675–692.

  [61] Guanlong Wu, Taojie Wang, Yao Zhang, Zheng Zhang, Jianyu Niu, Ye Wu, and Yinqian Zhang. 2026. When Cache Poisoning Meets LLM Systems: Semantic Cache Poisoning and Its Countermeasures. (2026).

  [62] Guanlong Wu, Zheng Zhang, Yao Zhang, Weili Wang, Jianyu Niu, Ye Wu, and Yinqian Zhang. 2025. I know what you asked: Prompt leakage via kv-cache sharing in multi-tenant llm serving. In Proceedings of the 2025 Network and Distributed System Security (NDSS) Symposium. San Diego, CA, USA.

  [63] Yale-LILY. 2025. QMSum. https://github.com/Yale-LILY/QMSum. (2025).

  [64] Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, and Xiuzhen Cheng. 2024. On protecting the data privacy of large language models (llms): A survey. In 2024 International Conference on Meta Computing (ICMC). IEEE, 1–12.

  [65] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the Twentieth European Conference on Computer Systems. 94–109.

  [66] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 1, 2 (2023).

  [67] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems 37 (2024), 62557–62583.

  [68] Xinyao Zheng, Husheng Han, Shangyi Shi, Qiyan Fang, Zidong Du, Xing Hu, and Qi Guo. 2024. Inputsnatch: Stealing input in llm services via timing side-channel attacks. arXiv preprint arXiv:2411.18191 (2024).

    Excerpt: A DEFENSE EFFECTIVENESS EVALUATIONS. A.1 Attack Settings of Contextual Leakage. We evaluate contextual leakage by measuring whether sensitive informati...