CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference
Pith reviewed 2026-05-25 04:12 UTC · model grok-4.3
The pith
CachePrune masks sensitive content at token granularity so that LLM serving systems can safely reuse the rest of each KV cache entry across users.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CachePrune derives reusable KV segments after token-level sensitivity masking and retrieves them efficiently over variable-length spans. Implemented on vLLM and tested on three datasets, the mechanism eliminates direct leakage through KV cache reuse side channels while reducing TTFT by 4.5x and increasing cache hit rates by 44 percent relative to state-of-the-art approaches.
What carries the argument
Token-level sensitivity masking followed by variable-length KV segment derivation and retrieval.
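As a rough illustration of the bookkeeping this implies, the sketch below turns a per-token sensitivity mask into the variable-length spans that remain eligible for sharing. The function name `reusable_segments`, the `min_len` parameter, and the example prompt are hypothetical and do not come from the paper. The sketch covers span derivation only and does not address how KV values for tokens that follow a masked token are made safe to reuse.

```python
from typing import List, Tuple


def reusable_segments(sensitive: List[bool], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of consecutive non-sensitive tokens.

    Hypothetical helper: `sensitive[i]` is True when token i must not be shared.
    Spans shorter than `min_len` tokens are dropped.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # start index of the span currently being built, if any
    for i, flag in enumerate(sensitive):
        if not flag and start is None:
            start = i  # a new non-sensitive span begins here
        elif flag and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None  # the sensitive token closes the span
    if start is not None and len(sensitive) - start >= min_len:
        segments.append((start, len(sensitive)))
    return segments


# Illustrative prompt: only the account number is marked sensitive.
tokens = ["You", "are", "a", "helpful", "assistant", ".",
          "My", "account", "is", "123-45-6789", "."]
mask = [False] * 9 + [True] + [False]

for start, end in reusable_segments(mask):
    print(start, end, tokens[start:end])
```

A single masked token splits the prompt into two spans of different lengths, which is why the spans cannot be assumed to align with fixed-size chunks.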
If this is right
- Prompts containing both public instructions and private user data can still obtain most of the reuse benefit.
- Serving systems no longer face an all-or-nothing choice between isolation and performance.
- Cache hit rates rise because occasional sensitive tokens no longer block reuse of the surrounding context.
- Time-to-first-token drops as more prefix computation is safely reused across independent requests.
Where Pith is reading between the lines
- The same pruning logic could be applied to other shared inference state such as activation buffers or attention maps.
- Automated sensitivity classifiers already used for content moderation could supply the masks with little extra engineering.
- Variable-length span retrieval techniques may transfer to irregular reuse patterns in non-LLM serving workloads.
Load-bearing premise
Sensitive segments can be identified and masked at token granularity without missing leakage paths or creating new side channels.
What would settle it
An experiment in which an adversary successfully recovers private input by observing the pattern of which KV segments are shared under CachePrune would falsify the privacy guarantee.
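A toy version of such an experiment can be sketched as follows. Everything here is an assumption made for illustration: `ToyPrefixCache` is a stand-in for a shared prefix cache, the hit length stands in for the timing signal an adversary would actually measure, and `public_prefix` is a deliberately crude policy that shares only the tokens before the first sensitive one. It is not CachePrune's mechanism and proves nothing about it; it only shows the shape of the observation an adversary would need.

```python
from typing import List, Sequence


class ToyPrefixCache:
    """Hypothetical shared cache that remembers every prefix it has seen."""

    def __init__(self) -> None:
        self._prefixes = set()

    def insert(self, tokens: Sequence[str]) -> None:
        for n in range(1, len(tokens) + 1):
            self._prefixes.add(tuple(tokens[:n]))

    def hit_length(self, tokens: Sequence[str]) -> int:
        """Number of leading tokens already cached (proxy for a timing signal)."""
        n = 0
        while n < len(tokens) and tuple(tokens[: n + 1]) in self._prefixes:
            n += 1
        return n


def public_prefix(tokens: Sequence[str], sensitive: Sequence[bool]) -> List[str]:
    """Tokens before the first sensitive token (simplified sharing policy)."""
    out: List[str] = []
    for tok, flag in zip(tokens, sensitive):
        if flag:
            break
        out.append(tok)
    return out


def recover_next_token(cache: ToyPrefixCache, known: List[str], candidates: List[str]) -> List[str]:
    """Candidates whose probe extends the cache hit past the known prefix."""
    return [c for c in candidates if cache.hit_length(known + [c]) > len(known)]


victim = ["My", "PIN", "is", "4", "2"]
victim_mask = [False, False, False, True, True]
digits = [str(d) for d in range(10)]

unrestricted = ToyPrefixCache()
unrestricted.insert(victim)  # whole prompt is shared

masked = ToyPrefixCache()
masked.insert(public_prefix(victim, victim_mask))  # sensitive tail is never shared

print(recover_next_token(unrestricted, ["My", "PIN", "is"], digits))
print(recover_next_token(masked, ["My", "PIN", "is"], digits))
```

In the unrestricted cache exactly one probe extends the hit, which identifies the secret digit; in the masked cache all probes look alike. A falsifying result would be any probe strategy that still separates candidates under the real system.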
Original abstract
Large Language Models (LLMs) rely on Key-Value (KV) caching to accelerate inference, and many serving systems further share the KV cache across users' requests to reduce redundant computation. While widely adopted, unrestricted cross-user sharing introduces side-channel vulnerabilities, allowing an adversary to infer user inputs by probing for cache reuse. Existing defenses disable sharing entirely to prevent leakage; yet such a coarse-grained strategy sacrifices substantial reuse potential, since prompts often include large portions of privacy-irrelevant segments, such as system instructions or publicly accessible materials. Building on this, we present CachePrune, a privacy-aware KV cache sharing mechanism that enables fine-grained reuse of KV entries across requests. Realizing such fine granularity requires token-level cache management, as reusable segments vary in length and position due to sensitivity masking, making reuse more complex than the fixed-size or sentence-level chunking used in existing coarse-grained schemes. Specifically, CachePrune makes fine-grained reuse practical by addressing two key challenges: accurately and efficiently deriving reusable KV segments and efficiently retrieving them over variable-length spans. We implement CachePrune on top of vLLM and evaluate it on three datasets, showing that it eliminates direct leakage through KV cache reuse side channels while reducing TTFT by 4.5x and increasing cache hit rates by 44% compared with state-of-the-art approaches.
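To make the variable-length retrieval problem concrete, here is a minimal sketch of one way a serving layer could check in constant time whether an arbitrary token span is free of sensitive tokens, using prefix sums over the mask. This is an assumption chosen for illustration, not the data structure the paper describes; `build_prefix_counts` and `span_is_clean` are hypothetical names.

```python
from itertools import accumulate
from typing import List, Sequence


def build_prefix_counts(sensitive: Sequence[bool]) -> List[int]:
    """prefix[i] = number of sensitive tokens among the first i tokens."""
    return [0] + list(accumulate(int(flag) for flag in sensitive))


def span_is_clean(prefix: Sequence[int], start: int, end: int) -> bool:
    """True when the half-open span [start, end) contains no sensitive token."""
    if not 0 <= start <= end < len(prefix):
        raise ValueError("span out of range")
    return prefix[end] - prefix[start] == 0


# Illustrative mask: only the third token is sensitive.
example_mask = [False, False, True, False, False]
example_prefix = build_prefix_counts(example_mask)

print(span_is_clean(example_prefix, 0, 2))  # span before the sensitive token
print(span_is_clean(example_prefix, 1, 3))  # span that includes it
```

Building the table costs one pass over the prompt, after which each span query is a single subtraction regardless of span length.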
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CachePrune, a system for privacy-aware fine-grained KV cache sharing during LLM inference. It enables token-level management of reusable KV segments by masking sensitive portions, claiming to eliminate direct leakage through KV cache reuse side channels. The approach is said to address challenges in deriving reusable segments and retrieving variable-length spans, yielding a 4.5x reduction in TTFT and 44% higher cache hit rates versus state-of-the-art methods when implemented on vLLM and evaluated on three datasets.
Significance. If the privacy and performance claims hold, CachePrune could meaningfully improve the efficiency of multi-tenant LLM serving by permitting more cache reuse without the privacy costs of fully disabling sharing. The fine-grained, token-level design directly targets a limitation of existing coarse-grained defenses. The reported implementation on vLLM and quantitative gains on multiple datasets would be practical strengths if the supporting methodology and analysis are provided.
Major comments (2)
- [Abstract] Abstract: the headline claim that CachePrune 'eliminates direct leakage through KV cache reuse side channels' is load-bearing for the contribution, yet the text supplies no threat model, no description of the sensitive-segment identification procedure (heuristic, model-based, or otherwise), no accuracy metrics such as false-negative rate, and no analysis of whether the variable-length retrieval or pruning logic introduces new timing or access-pattern channels.
- [Abstract] Abstract: the performance claims (4.5x TTFT reduction, 44% cache-hit-rate increase) are presented as outcomes of an implementation and evaluation on three datasets, but the abstract contains no methodology, baseline definitions, measurement details, or error analysis, so the quantitative results cannot be assessed.
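For concreteness, the false-negative rate requested in the first comment could be computed at token level as in the sketch below. The labels are invented for illustration and `false_negative_rate` is a hypothetical helper; the paper's own metric definition is not given in the abstract.

```python
from typing import Sequence


def false_negative_rate(truth: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of truly sensitive tokens that the detector failed to flag."""
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    positives = sum(1 for t in truth if t)
    if positives == 0:
        return 0.0  # no sensitive tokens, so nothing can be missed
    missed = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return missed / positives


# Invented labels: three sensitive tokens, one of which the detector misses.
truth = [False, True, True, False, True]
predicted = [False, True, False, False, True]
print(false_negative_rate(truth, predicted))
```

Each missed token is one that would be admitted to the shared cache, so this rate bounds how much sensitive content could still be exposed to probing.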
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the abstract accordingly as part of a major revision.
Point-by-point responses
- Referee: [Abstract] Abstract: the headline claim that CachePrune 'eliminates direct leakage through KV cache reuse side channels' is load-bearing for the contribution, yet the text supplies no threat model, no description of the sensitive-segment identification procedure (heuristic, model-based, or otherwise), no accuracy metrics such as false-negative rate, and no analysis of whether the variable-length retrieval or pruning logic introduces new timing or access-pattern channels.
  Authors: The abstract is a concise summary; the full threat model, sensitive-segment identification procedure, false-negative rate metrics, and analysis showing that variable-length retrieval and pruning do not introduce new timing or access-pattern channels are presented in Sections 3, 4, and 6 of the manuscript. We agree the abstract should better support the claim and will revise it to include a brief threat model statement, a high-level description of the identification procedure, reference to the accuracy metrics, and a note on the channel analysis.
  Revision: yes
- Referee: [Abstract] Abstract: the performance claims (4.5x TTFT reduction, 44% cache-hit-rate increase) are presented as outcomes of an implementation and evaluation on three datasets, but the abstract contains no methodology, baseline definitions, measurement details, or error analysis, so the quantitative results cannot be assessed.
  Authors: The abstract already notes implementation on vLLM and evaluation on three datasets with comparison to state-of-the-art approaches, but we agree it lacks sufficient methodological context. We will revise the abstract to briefly define the baselines (coarse-grained KV sharing methods), note the measurement of TTFT and hit rates under the described workloads, and indicate that full error analysis and methodology appear in the evaluation section.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper presents CachePrune as a system implementation evaluated on three datasets, with claims about leakage elimination and performance gains (TTFT 4.5x, hit rate +44%) stated as empirical outcomes. No equations, fitted parameters, predictions, or self-citations appear in the abstract or described structure that reduce any result to its inputs by construction. The work is self-contained against external benchmarks with no load-bearing self-referential steps.