CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference

Guanlong Wu; Jianyu Niu; Yao Zhang; Ye Wu; Yinqian Zhang; Zhaohan Li; Zheng Zhang

arxiv: 2605.23640 · v1 · pith:5KZCNDITnew · submitted 2026-05-22 · 💻 cs.CR

Pith reviewed 2026-05-25 04:12 UTC · model grok-4.3

classification 💻 cs.CR

keywords: KV cache · side-channel attacks · LLM inference · privacy · cache sharing · token-level masking · serving systems

The pith

CachePrune masks individual sensitive tokens so that LLM serving systems can safely reuse the rest of each KV cache entry across users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM serving systems share KV caches across requests to avoid recomputing shared prefixes and to lower time-to-first-token (TTFT). Full sharing, however, creates a side channel: an attacker can probe whether their prompt produced a cache hit and thereby learn parts of another user's input. Coarse defenses therefore turn sharing off entirely, forgoing large efficiency gains on the many non-sensitive segments that prompts contain. CachePrune instead identifies and excludes only the sensitive tokens, then manages the resulting irregular reusable spans so that the remaining KV entries can still be shared. The paper reports that this design removes the direct leakage path while delivering a 4.5x TTFT reduction and a 44 percent higher cache hit rate than state-of-the-art approaches.

Core claim

CachePrune derives reusable KV segments after token-level sensitivity masking and retrieves them efficiently over variable-length spans. Implemented on vLLM and tested on three datasets, the mechanism eliminates direct leakage through KV cache reuse side channels while reducing TTFT by 4.5x and increasing cache hit rates by 44 percent relative to state-of-the-art approaches.

What carries the argument

Token-level sensitivity masking followed by variable-length KV segment derivation and retrieval.

If this is right

Prompts containing both public instructions and private user data can still obtain most of the reuse benefit.
Serving systems no longer face an all-or-nothing choice between isolation and performance.
Cache hit rates rise because occasional sensitive tokens no longer block reuse of the surrounding context.
Time-to-first-token drops as more prefix computation is safely reused across independent requests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pruning logic could be applied to other shared inference state such as activation buffers or attention maps.
Automated sensitivity classifiers already used for content moderation could supply the masks with little extra engineering.
Variable-length span retrieval techniques may transfer to irregular reuse patterns in non-LLM serving workloads.

Load-bearing premise

Sensitive segments can be identified and masked at token granularity without missing leakage paths or creating new side channels.

What would settle it

An experiment in which an adversary successfully recovers private input by observing the pattern of which KV segments are shared under CachePrune would falsify the privacy guarantee.

Figures

Figures reproduced from arXiv: 2605.23640 by Guanlong Wu, Jianyu Niu, Yao Zhang, Ye Wu, Yinqian Zhang, Zhaohan Li, Zheng Zhang.

**Figure 1.** KV cache sharing mechanisms. White blocks denote unreusable KV cache, gray blocks denote reusable KV cache, and hatched blocks denote recomputed KV cache. Context from the paper: … memory usage to scale with sequence length and model size. In multi-tenant serving, this trade-off directly constrains concurrency and throughput under a fixed GPU budget. (Sec. 2.2, KV Cache Sharing) While KV caching improves efficiency, its memory usage still …

**Figure 2.** TTFT reduction.

**Figure 5.** System architecture of CachePrune. New modules introduced by CachePrune are highlighted in light yellow. Context from the paper: … KV cache management. In chunk-level management, reusable segments that span across chunk boundaries often cause mismatches and are discarded, reducing cache hit rates …
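The chunk-boundary effect mentioned in the Figure 5 context can be illustrated with a toy count. The sketch below is not the paper's algorithm: it assumes, for illustration only, that a chunk-level scheme can share a chunk only when every token in it is non-sensitive, and it ignores recomputation and positional effects. The function names and the 16-token mask are hypothetical.

```python
from typing import List


def reusable_tokens_chunk_level(mask: List[int], chunk_size: int) -> int:
    """Count shareable tokens when a chunk is reusable only if it is fully non-sensitive.

    Assumption for illustration: mask[i] == 1 marks a sensitive token.
    """
    total = 0
    for start in range(0, len(mask), chunk_size):
        chunk = mask[start:start + chunk_size]
        if not any(chunk):  # one sensitive token discards the whole chunk
            total += len(chunk)
    return total


def reusable_tokens_token_level(mask: List[int]) -> int:
    """Count shareable tokens when only the sensitive tokens themselves are excluded."""
    return sum(1 for m in mask if m == 0)


# Hypothetical 16-token prompt with a single sensitive token at index 6.
example_mask = [0] * 6 + [1] + [0] * 9
chunk_level = reusable_tokens_chunk_level(example_mask, chunk_size=4)  # 12
token_level = reusable_tokens_token_level(example_mask)                # 15
```

In this toy case one sensitive token costs a chunk-level scheme four tokens of reuse but costs a token-level scheme only one, which is the intuition behind the higher hit rates the paper reports.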

**Figure 6.** Deriving reusable KV segments under sensitive tokens. Context from the paper (Sec. 4.2, Challenges and Algorithmic Solutions; Sec. 4.2.1, C1: Efficiently and Accurately Deriving Reusable KV Segments): After inference, the Sensitivity Detector analyzes the input token sequence $P = \{t_1, t_2, \ldots, t_n\}$ and produces a binary sensitivity mask $M \in \{0, 1\}^n$, where $M_i = 1$ indicates that token $t_i$ contains sensitive information. Tokens with $M_i = 1$ a…
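Given a binary sensitivity mask as defined in the Figure 6 context, the simplest reading of "reusable segments" is the set of maximal runs of tokens whose mask value is 0. The sketch below shows only that reading. The excerpt is truncated, so the paper's actual derivation may involve further steps; the function name and the `min_len` parameter are hypothetical.

```python
from typing import List, Tuple


def derive_reusable_segments(mask: List[int], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) spans of consecutive non-sensitive tokens.

    mask[i] == 1 marks token i as sensitive; min_len (an assumption, not from
    the paper) drops runs too short to be worth caching.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, m in enumerate(mask):
        if m == 0 and start is None:
            start = i  # a new non-sensitive run begins
        elif m == 1 and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None  # the run is closed by a sensitive token
    if start is not None and len(mask) - start >= min_len:
        segments.append((start, len(mask)))  # run extends to the end of the prompt
    return segments


example_mask = [0, 0, 1, 0, 0, 0, 1, 1, 0]
segments = derive_reusable_segments(example_mask)               # [(0, 2), (3, 6), (8, 9)]
long_segments = derive_reusable_segments(example_mask, min_len=2)  # [(0, 2), (3, 6)]
```

The resulting spans vary in both length and position, which is why retrieval can no longer rely on fixed-size chunk lookups.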

**Figure 8.** Retrieving KV segments based on rolling hash. Context from the paper (Sec. 4.2.2, C2: Retrieving Dynamic-Length KV Segments): Existing KV retrieval methods (Sec. 2.2) rely on fixed-size chunks, where matching reduces to $O(1)$ hash lookups per chunk. In contrast, CachePrune operates at token granularity, producing variable-length segments (Sec. 4.2.1). Retrieval therefore becomes a substring containment problem: identifying whether a …
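The substring containment problem named in the Figure 8 context can be illustrated with a textbook Rabin-Karp rolling hash over token IDs. This is a generic sketch, not the paper's retrieval structure, whose details are truncated in the excerpt. The constants `BASE` and `MOD` and the function name `find_segment` are illustrative choices.

```python
from typing import List, Optional

BASE = 1_000_003      # illustrative polynomial base
MOD = (1 << 61) - 1   # large prime modulus to keep collisions rare


def find_segment(prompt: List[int], segment: List[int]) -> Optional[int]:
    """Return the first offset at which `segment` occurs in `prompt`, or None.

    Both arguments are lists of token IDs. Hash matches are verified by direct
    comparison, so a collision can never produce a false hit.
    """
    n, m = len(prompt), len(segment)
    if m == 0 or m > n:
        return None
    high = pow(BASE, m - 1, MOD)  # weight of the token that leaves the window
    target = 0
    window = 0
    for i in range(m):
        target = (target * BASE + segment[i]) % MOD
        window = (window * BASE + prompt[i]) % MOD
    for start in range(n - m + 1):
        if window == target and prompt[start:start + m] == segment:
            return start
        if start + m < n:
            # Slide the window: drop prompt[start], append prompt[start + m].
            window = ((window - prompt[start] * high) * BASE + prompt[start + m]) % MOD
    return None


prompt_ids = [5, 9, 2, 7, 7, 3, 8]
hit = find_segment(prompt_ids, [7, 3, 8])   # 4
miss = find_segment(prompt_ids, [7, 8])     # None
```

Each window update costs constant time, so scanning a prompt for one cached segment is linear in the prompt length regardless of the segment's length. A serving system would presumably index many segments at once rather than scan for each one separately.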

**Figure 9.** Impact of imperfect privacy detection (QASPER). Context from the paper: … inject errors by simulating 0–20% False Negatives (FN) and False Positives (FP), covering and exceeding typical reported rates (both below 10% for Presidio [30]). We randomly perturb the sensitivity mask and evaluate on all datasets, presenting QASPER in the main text and others in Appendix A. FN directly increase exact recovery by exposing unmarked sensitive…
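The Figure 9 context says the sensitivity mask is randomly perturbed to simulate detector errors, but the excerpt does not say how. One plausible scheme, assumed here purely for illustration, flips each token independently: a sensitive token becomes unmarked with probability `fn_rate`, and a non-sensitive token becomes marked with probability `fp_rate`. The function name and parameters are hypothetical.

```python
import random
from typing import List


def perturb_mask(mask: List[int], fn_rate: float, fp_rate: float, seed: int = 0) -> List[int]:
    """Inject simulated detector errors into a binary sensitivity mask.

    Assumed scheme (not confirmed by the paper): independent per-token flips.
    A false negative turns a 1 into 0; a false positive turns a 0 into 1.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    noisy = []
    for m in mask:
        if m == 1 and rng.random() < fn_rate:
            noisy.append(0)  # false negative: a sensitive token goes unmarked
        elif m == 0 and rng.random() < fp_rate:
            noisy.append(1)  # false positive: a harmless token gets masked
        else:
            noisy.append(m)
    return noisy


clean_mask = [0, 0, 1, 1, 0, 1, 0, 0]
unchanged = perturb_mask(clean_mask, fn_rate=0.0, fp_rate=0.0)
all_missed = perturb_mask(clean_mask, fn_rate=1.0, fp_rate=0.0)
```

Under this scheme false negatives expose sensitive tokens to sharing, which hurts privacy, while false positives only remove reuse opportunities, which hurts efficiency.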

**Figure 10.** Match rate impact (Mistral-7B).

**Figure 13.** Impact of match rate on efficiency. Context from the paper: … this experiment, we fix the recompute rate at 25%, which serves as our default setting and will be further examined in Sec. 5.3.2. Results …

**Figure 15.** Computation overhead of core components in CachePrune. Context from the paper: At the default setting of 25%, CachePrune achieves over 3× lower TTFT than the non-sharing baseline (100% recompute). Impact of segment length …

**Figure 16.** Impact of imperfect privacy detection (NarrativeQA). Panel (a): False negative.

**Figure 17.** Impact of imperfect privacy detection (QMSum). Context from the paper: Configuration. In our experiments, we set both guess_top_k and judge_top_k to 1. The prompt used for LLMs are unmodified following prior work [51]. (Sec. A.2, Impact of Imperfect Detection) In addition to the results presented in the main paper on QASPER, we further evaluate CachePrune on NarrativeQA …

**Figure 18.** Match rate impact (Qwen-7B).

**Figure 21.** Match rate impact (Qwen-14B).

**Figure 24.** Impact of match rate on efficiency (Qwen-7B). Panels: (a) TTFT; (b) Throughput.

**Figure 25.** Impact of match rate on efficiency (Qwen-14B). Context from the paper (Appendix D, Privacy Detection Settings): Following prior work on sensitivity identification in LLM prompts [38], we adopt Presidio [30], a widely used privacy detection toolkit, as …

**Figure 28.** Impact of segment length on efficiency. Panel (a): Match rate.

**Figure 29.** Impact of privacy detection methods. Context from the paper:

• Personal identifiers: CREDIT_CARD, CRYPTO, IBAN_CODE, EMAIL_ADDRESS, NRP (passport), PERSON, PHONE_NUMBER, SSN, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_SSN, US_PASSPORT.
• Non-personal identifiers: DATE_TIME, IP_ADDRESS, LOCATION, URL, AU_ABN, AU_ACN.

We configure three privacy levels for evaluation, as summarized in …
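Entity detectors of this kind typically report sensitive spans as character offsets, whereas the sensitivity mask is defined per token. The sketch below shows one way the two could be bridged. It is illustrative only: a single regular expression stands in for a real detector such as Presidio, whitespace splitting stands in for the model tokenizer, and all function names are hypothetical.

```python
import re
from typing import List, Tuple

# Stand-in detector: a simple email pattern. A real deployment would use a
# dedicated PII detector; this regex is an assumption for illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def token_spans(text: str) -> List[Tuple[int, int]]:
    """Character spans of whitespace-delimited tokens (a tokenizer stand-in)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def sensitivity_mask(text: str, sensitive_spans: List[Tuple[int, int]]) -> List[int]:
    """Mark a token with 1 if it overlaps any detected sensitive character span."""
    mask = []
    for t_start, t_end in token_spans(text):
        overlaps = any(t_start < s_end and s_start < t_end
                       for s_start, s_end in sensitive_spans)
        mask.append(1 if overlaps else 0)
    return mask


text = "Contact alice@example.com about the quarterly report"
detected = [(m.start(), m.end()) for m in EMAIL_RE.finditer(text)]
mask = sensitivity_mask(text, detected)  # [0, 1, 0, 0, 0, 0]
```

Marking every token that overlaps a detected span errs toward over-masking when the detector's span boundaries and the tokenizer's token boundaries disagree, trading a little reuse for safety.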

Original abstract

Large Language Models (LLMs) rely on Key-Value (KV) caching to accelerate inference, and many serving systems further share the KV cache across users' requests to reduce redundant computation. While widely adopted, unrestricted cross-user sharing introduces side-channel vulnerabilities, allowing an adversary to infer user inputs by probing for cache reuse. Existing defenses disable sharing entirely to prevent leakage; yet such a coarse-grained strategy sacrifices substantial reuse potential, since prompts often include large portions of privacy-irrelevant segments, such as system instructions or publicly accessible materials. Building on this, we present CachePrune, a privacy-aware KV cache sharing mechanism that enables fine-grained reuse of KV entries across requests. Realizing such fine granularity requires token-level cache management, as reusable segments vary in length and position due to sensitivity masking, making reuse more complex than the fixed-size or sentence-level chunking used in existing coarse-grained schemes. Specifically, CachePrune makes fine-grained reuse practical by addressing two key challenges: accurately and efficiently deriving reusable KV segments and efficiently retrieving them over variable-length spans. We implement CachePrune on top of vLLM and evaluate it on three datasets, showing that it eliminates direct leakage through KV cache reuse side channels while reducing TTFT by 4.5x and increasing cache hit rates by 44% compared with state-of-the-art approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CachePrune points at a real systems tension but the abstract supplies no evidence that the token-level masking actually stops leakage or avoids new channels.

The main thing to know is that CachePrune claims token-level KV cache reuse with sensitivity masking to block direct side-channel leakage while still getting reuse gains. It implements the idea on vLLM and reports 4.5x lower TTFT and 44% higher hit rates versus prior coarse schemes on three datasets. That direction makes sense because many prompts mix public system text with private user content, so blanket no-sharing wastes obvious reuse opportunities. The variable-length retrieval step they describe is a necessary piece of engineering once you move past fixed chunks or sentences. Those are the concrete positives in the abstract.

The soft spot is exactly where the stress-test note lands. The privacy claim rests on being able to identify and mask sensitive segments at token granularity without missing paths or creating new timing or access leaks from the pruning logic itself. The abstract gives no description of the identification method, no accuracy numbers, no threat model, and no attack experiments. Without those, the elimination result is an assertion rather than a demonstrated outcome. No equations or fitted quantities appear, so the numbers come from an implementation whose internals cannot be checked here.

This paper is aimed at people who run multi-tenant LLM inference and care about both latency and basic privacy. A reader working on serving systems could extract useful engineering details on variable-span cache management if the full version supplies them. It deserves a serious referee because the underlying problem is practical and the proposed granularity is finer than the baselines it cites, even though the current evidence for the privacy guarantee is missing. Send it to review only if the full manuscript adds a clear detection procedure, accuracy metrics, and validation that no leakage remains; otherwise it is not ready.

Referee Report

2 major / 0 minor

Summary. The paper introduces CachePrune, a system for privacy-aware fine-grained KV cache sharing during LLM inference. It enables token-level management of reusable KV segments by masking sensitive portions, claiming to eliminate direct leakage through KV cache reuse side channels. The approach is said to address challenges in deriving reusable segments and retrieving variable-length spans, yielding a 4.5x reduction in TTFT and 44% higher cache hit rates versus state-of-the-art methods when implemented on vLLM and evaluated on three datasets.

Significance. If the privacy and performance claims hold, CachePrune could meaningfully improve the efficiency of multi-tenant LLM serving by permitting more cache reuse without the privacy costs of fully disabling sharing. The fine-grained, token-level design directly targets a limitation of existing coarse-grained defenses. The reported implementation on vLLM and quantitative gains on multiple datasets would be practical strengths if the supporting methodology and analysis are provided.

major comments (2)

[Abstract] The headline claim that CachePrune 'eliminates direct leakage through KV cache reuse side channels' is load-bearing for the contribution, yet the text supplies no threat model, no description of the sensitive-segment identification procedure (heuristic, model-based, or otherwise), no accuracy metrics such as false-negative rate, and no analysis of whether the variable-length retrieval or pruning logic introduces new timing or access-pattern channels.
[Abstract] The performance claims (4.5x TTFT reduction, 44% cache-hit-rate increase) are presented as outcomes of an implementation and evaluation on three datasets, but the abstract contains no methodology, baseline definitions, measurement details, or error analysis, so the quantitative results cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the abstract accordingly as part of a major revision.

Referee: [Abstract] The headline claim that CachePrune 'eliminates direct leakage through KV cache reuse side channels' is load-bearing for the contribution, yet the text supplies no threat model, no description of the sensitive-segment identification procedure (heuristic, model-based, or otherwise), no accuracy metrics such as false-negative rate, and no analysis of whether the variable-length retrieval or pruning logic introduces new timing or access-pattern channels.

Authors: The abstract is a concise summary; the full threat model, sensitive-segment identification procedure, false-negative rate metrics, and analysis showing that variable-length retrieval and pruning do not introduce new timing or access-pattern channels are presented in Sections 3, 4, and 6 of the manuscript. We agree the abstract should better support the claim and will revise it to include a brief threat model statement, a high-level description of the identification procedure, reference to the accuracy metrics, and a note on the channel analysis. revision: yes
Referee: [Abstract] The performance claims (4.5x TTFT reduction, 44% cache-hit-rate increase) are presented as outcomes of an implementation and evaluation on three datasets, but the abstract contains no methodology, baseline definitions, measurement details, or error analysis, so the quantitative results cannot be assessed.

Authors: The abstract already notes implementation on vLLM and evaluation on three datasets with comparison to state-of-the-art approaches, but we agree it lacks sufficient methodological context. We will revise the abstract to briefly define the baselines (coarse-grained KV sharing methods), note the measurement of TTFT and hit rates under the described workloads, and indicate that full error analysis and methodology appear in the evaluation section. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents CachePrune as a system implementation evaluated on three datasets, with claims about leakage elimination and performance gains (TTFT 4.5x, hit rate +44%) stated as empirical outcomes. No equations, fitted parameters, predictions, or self-citations appear in the abstract or described structure that reduce any result to its inputs by construction. The work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5787 in / 1012 out tokens · 29156 ms · 2026-05-25T04:12:36.963444+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 5 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[2] Mark Ackerman, Trevor Darrell, and Daniel J Weitzner. 2001. Privacy in context. Human–Computer Interaction 16, 2-4 (2001), 167–176.
[3] Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–28.
[4] AllenAI. 2025. allenai/WildChat. https://huggingface.co/datasets/allenai/WildChat. (2025).
[5] Azure. 2025. What is PII detection in Azure Language? https://learn.microsoft.com/en-us/azure/ai-services/language-service/personally-identifiable-information/overview?tabs=text-pii. (2025).
[6] Srinadh Bhojanapalli, Ayan Chakrabarti, Andreas Veit, Michal Lukasik, Himanshu Jain, Frederick Liu, Yin-Wen Chang, and Sanjiv Kumar. 2021. Leveraging redundancy in attention with reuse transformers. arXiv preprint arXiv:2110.06821 (2021).
[7] AR Chayapathi, G Sunil Kumar, Manjunath BE Swamy, J Thriveni, and KR Venugopal. 2021. Survey and comparison of string matching algorithms. Turkish Journal of Computer and Mathematics Education 12, 12 (2021), 1471–1491.
[8] Claude. 2025. Prompt caching. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching. (2025).
[9] Franklin C Crow. 1984. Summed-area tables for texture mapping. In Proceedings of the 11th annual conference on Computer graphics and interactive techniques. 207–212.
[10] Deepmind. 2025. deepmind/narrativeqa. https://huggingface.co/datasets/deepmind/narrativeqa. (2025).
[11] Xiaowan Dong, Zhuojia Shen, John Criswell, Alan L Cox, and Sandhya Dwarkadas. 2018. Shielding Software From Privileged Side-Channel Attacks. In 27th USENIX Security Symposium (USENIX Security 18). 1441–1458.
[12] Javier Ferrando, Gerard I Gállego, and Marta R Costa-Jussà. 2022. Measuring the mixing of contextual information in the transformer. arXiv preprint arXiv:2203.04212 (2022).
[13] Gemini. 2025. Context Caching. https://ai.google.dev/gemini-api/docs/caching. (2025).
[14] Mark Harris, Shubhabrata Sengupta, and John D Owens. 2007. Parallel prefix sum (scan) with CUDA. GPU gems 3, 39 (2007), 851–876.
[15] Justin Hensley, Thorsten Scheuermann, Greg Coombe, Montek Singh, and Anselmo Lastra. 2005. Fast summed-area table generation and its applications. In Computer Graphics Forum, Vol. 24. Amsterdam: North Holland, 1982-, 547–556.
[16] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37 (2024), 1270–1303.
[17] Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. 2024. EPIC: Efficient Position-Independent Caching for Serving Large Language Models. arXiv preprint arXiv:2410.15332 (2024).
[18] Hao Jiang and Sian-Jheng Lin. 2020. A rolling hash algorithm and the implementation to LZ4 data compression. IEEE Access 8 (2020), 35529–35534.
[19] Shuowei Jin, Xueshen Liu, Qingzhao Zhang, and Z Morley Mao. 2024. Compute or load kv cache? why not both? arXiv preprint arXiv:2410.03065 (2024).
[20] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626.
[21] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172.
[22] Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. 2024. A survey on large language model acceleration based on kv cache management. arXiv preprint arXiv:2412.19442 (2024).
[23] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024).
[24] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, and Ruby B Lee. 2016. Catalyst: Defeating last-level cache side channel attacks in cloud computing. In 2016 IEEE international symposium on high performance computer architecture (HPCA). IEEE, 406–418.
[25] LMSys. 2025. lmsys/lmsys-chat-1m. https://huggingface.co/datasets/lmsys/lmsys-chat-1m. (2025).
[26] Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, and Anima Anandkumar. 2025. Headinfer: Memory-efficient llm inference by head-wise offloading. arXiv preprint arXiv:2502.12574 (2025).
[27] Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, and Zhan Qin. 2025. Shadow in the cache: Unveiling and mitigating privacy risks of kv-cache in llm inference. arXiv preprint arXiv:2508.09442 (2025).
[28] Erika McCallister. 2010. Guide to protecting the confidentiality of personally identifiable information. Diane Publishing.
[29] Microsoft. 2024. microsoft/msmarco. https://huggingface.co/datasets/microsoft/ms_marco. (2024).
[30] Microsoft. 2025. Presidio: Data Protection and De-identification SDK. https://microsoft.github.io/presidio/. (2025).
[31] MistralAI. 2025. mistralai/Mistral-7B-v0.1. https://huggingface.co/mistralai/Mistral-7B-v0.1. (2025).
[32] Behrang Mohit. 2014. Named entity recognition. In Natural language processing of semitic languages. Springer, 221–245.
[33] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30, 1 (2007), 3–26.
[34] Hyoungwook Nam, Raghavendra Pradyumna Pothukuchi, Bo Li, Nam Sung Kim, and Josep Torrellas. 2023. Defensive ml: Defending architectural side-channels with adversarial obfuscation. arXiv preprint arXiv:2302.01474 (2023).
[35] Arvind Narayanan and Vitaly Shmatikov. 2010. Myths and fallacies of "personally identifiable information". Commun. ACM 53, 6 (2010), 24–26.
[36] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology (2023).
[37] Helen Nissenbaum. 2018. Respecting context to protect privacy: Why meaning matters. Science and engineering ethics 24, 3 (2018), 831–852.
[38] Albert Agisha Ntwali, Luca Rück, and Martin Heckmann. 2025. Detection of Personal Data in Structured Datasets Using a Large Language Model. arXiv preprint arXiv:2506.22305 (2025).
[39] Nvidia. 2025. Structuring Applications to Secure the KV Cache. https://developer.nvidia.com/blog/structuring-applications-to-secure-the-kv-cache/. (2025).
[40] OpenAI. 2025. OpenAI Privacy Filter. https://huggingface.co/openai/privacy-filter. (2025).
[41] OpenAI. 2025. Prompt caching. https://platform.openai.com/docs/guides/prompt-caching. (2025).
[42] Zixuan Pang, Wenhao Wang, and Yong Liao. 2024. Cache partitioning for mitigating timing side-channel attacks in llm serving systems. In 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC). IEEE, 1238–1245.
[43] Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad, DB Emerson, Shubhankar Mohapatra, and Xi He. 2026. CAPID: Context-Aware PII Detection for Question-Answering Systems. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 320–331.
[44] Qasper. 2025. allenai/qasper. https://huggingface.co/datasets/allenai/qasper. (2025).
[45] Qwen. 2025. Qwen/Qwen2.5-14B. https://huggingface.co/Qwen/Qwen2.5-14B. (2025).
[46] Qwen. 2025. Qwen/Qwen2.5-7B. https://huggingface.co/Qwen/Qwen2.5-7B. (2025).
[47] Kartik Ramkrishnan, Antonia Zhai, Stephen McCamant, and Pen Chung Yew
[48] New attacks and defenses for randomized caches. arXiv preprint arXiv:1909.12302 (2019).
[49] RyokoAI. 2025. RyokoAI/ShareGPT52K. https://huggingface.co/datasets/RyokoAI/ShareGPT52K. (2025).
[50] Paul M Schwartz and Daniel J Solove. 2011. The PII problem: Privacy and a new concept of personally identifiable information. NYUL rev. 86 (2011), 1814.
[51] Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, and Rui Hou. 2024. The early bird catches the leak: Unveiling timing side channels in llm serving systems. arXiv preprint arXiv:2409.20002 (2024).
[52] Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev. 2023. Beyond memorization: Violating privacy via inference with large language models. arXiv preprint arXiv:2310.07298 (2023).
[53] Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, and Min Zhang. 2025. Accurate kv cache quantization with outlier tokens tracing. arXiv preprint arXiv:2505.10938 (2025).
[54] Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, and Gongyi Wang. 2024. Razorattention: Efficient kv cache compression through retrieval heads. arXiv preprint arXiv:2407.15891 (2024).
[55] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[57] Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284 (2019).
[58] Guangtao Wang, Shubhangi Upasani, Chen Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, and Urmish Thakker. 2025. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference. arXiv preprint arXiv:2503.08879 (2025).
[59] Longxiang Wang, Xiang Zheng, Xuhao Zhang, Yao Zhang, Ye Wu, and Cong Wang. 2026. OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services. arXiv preprint arXiv:2602.20595 (2026).
[60] Mario Werner, Thomas Unterluggauer, Lukas Giner, Michael Schwarz, Daniel Gruss, and Stefan Mangard. 2019. ScatterCache: thwarting cache attacks via cache set randomization. In 28th USENIX Security Symposium (USENIX Security 19). 675–692.
[61] Guanlong Wu, Taojie Wang, Yao Zhang, Zheng Zhang, Jianyu Niu, Ye Wu, and Yinqian Zhang. 2026. When Cache Poisoning Meets LLM Systems: Semantic Cache Poisoning and Its Countermeasures. (2026).
[62] Guanlong Wu, Zheng Zhang, Yao Zhang, Weili Wang, Jianyu Niu, Ye Wu, and Yinqian Zhang. 2025. I know what you asked: Prompt leakage via kv-cache sharing in multi-tenant llm serving. In Proceedings of the 2025 Network and Distributed System Security (NDSS) Symposium. San Diego, CA, USA.
[63] Yale-LILY. 2025. QMSum. https://github.com/Yale-LILY/QMSum. (2025).
[64] Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, and Xiuzhen Cheng. 2024. On protecting the data privacy of large language models (llms): A survey. In 2024 International Conference on Meta Computing (ICMC). IEEE, 1–12.
[65] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the Twentieth European Conference on Computer Systems. 94–109.
[66] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 1, 2 (2023).
[67] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems 37 (2024), 62557–62583.
[68] Xinyao Zheng, Husheng Han, Shangyi Shi, Qiyan Fang, Zidong Du, Xing Hu, and Qi Guo. 2024. Inputsnatch: Stealing input in llm services via timing side-channel attacks. arXiv preprint arXiv:2411.18191 (2024).

Context from the paper (Appendix A, Defense Effectiveness Evaluations; A.1, Attack Settings of Contextual Leakage): We evaluate contextual leakage by measuring whether sensitive informati...
