
arXiv: 2606.21238 · v1 · pith:GPSPUVZZnew · submitted 2026-06-19 · cs.DC · cs.AI · cs.LG

Recency/Frequency Adaptive KV Caching for Large Language Model Serving

Pith reviewed 2026-06-26 13:22 UTC · model grok-4.3

classification: cs.DC · cs.AI · cs.LG
keywords: KV cache · adaptive caching · LLM inference · cache hit rate · time to first token · recency frequency · vLLM

The pith

Dynamically allocating KV cache space between recent and frequent blocks improves hit rates and reduces latency in LLM serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes managing key-value caches during large language model inference by dynamically deciding how much space to assign to recently used blocks versus frequently used blocks. Standard least-recently-used eviction suffers when unrelated workloads flush each other's entries. The adaptive approach aims to raise the fraction of cache hits and shorten the time until the first output token appears. Gains appear on both synthetic document question answering tasks and real conversation logs, and the method extends to batched requests.
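
The flushing problem can be reproduced in a few lines. The sketch below is illustrative only: the block ids, the capacity of 6 blocks, and the helper name `lru_hits` are hypothetical, not taken from the paper or from vLLM. It replays a trace in which a repeatedly used document prefix is interrupted by an unrelated request that touches more one-off blocks than the cache can hold.

```python
from collections import OrderedDict

def lru_hits(trace, capacity):
    """Replay block ids through an LRU cache and return one hit flag per access."""
    cache = OrderedDict()
    flags = []
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
            flags.append(True)
        else:
            flags.append(False)
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return flags

# Hypothetical workload: a hot 4-block prefix is read three times,
# an unrelated request then touches 8 one-off blocks, and the hot prefix returns.
hot = [f"doc-A:{i}" for i in range(4)]
scan = [f"other:{i}" for i in range(8)]
trace = hot * 3 + scan + hot

flags = lru_hits(trace, capacity=6)
hits_before_scan = sum(flags[: len(hot) * 3])
hits_after_scan = sum(flags[-len(hot):])
print(hits_before_scan, hits_after_scan)  # prints "8 0"
```

The hot prefix hits on every reuse until the unrelated request arrives, and then misses on all four blocks afterwards, even though it was by far the most frequently used content in the trace.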

Core claim

We integrate adaptive caching that dynamically allocates cache space between recently and frequently occurring KV blocks. Evaluations show that it improves the KV cache hit rate by up to 10.8% and reduces time to first token by up to 12.6% over naive vLLM on synthetic document question answering workloads, and 2.1% and 2.0% respectively on real-world conversation workloads. The method generalizes well to batch inference and demonstrates clear interpretability while effectively accommodating diverse workloads.

What carries the argument

The recency/frequency adaptive allocation mechanism that dynamically balances cache space between recent and frequent KV blocks.
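
The figure captions on this page name the Adaptive Replacement Cache (ARC), with resident lists T1 and T2 and ghost lists B1 and B2. The sketch below follows the classical ARC rules as a rough illustration of how such a balance can adapt. Whether the paper's vLLM integration uses exactly these rules is an assumption, the class name `ARCCache` is hypothetical, and capacity is counted in blocks rather than tokens.

```python
from collections import OrderedDict

class ARCCache:
    """Classical ARC over block ids (a sketch, not the paper's implementation)."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                  # adaptive target size of T1 (recency side)
        self.t1 = OrderedDict()     # resident, seen once recently
        self.t2 = OrderedDict()     # resident, seen at least twice
        self.b1 = OrderedDict()     # ghost ids evicted from T1
        self.b2 = OrderedDict()     # ghost ids evicted from T2

    def _replace(self, in_b2):
        """Demote one resident block to the matching ghost list."""
        if self.t1 and (len(self.t1) > self.p or (in_b2 and len(self.t1) == self.p)):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def access(self, key):
        """Touch a block; return True on a resident hit, False otherwise."""
        if key in self.t1:          # second touch: promote to the frequency list
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        if key in self.b1:          # ghost hit: recency side was too small, grow p
            self.p = min(self.c, self.p + max(1, len(self.b2) // len(self.b1)))
            self._replace(in_b2=False)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:          # ghost hit: frequency side was too small, shrink p
            self.p = max(0, self.p - max(1, len(self.b1) // len(self.b2)))
            self._replace(in_b2=True)
            del self.b2[key]
            self.t2[key] = None
            return False
        # Cold miss: make room, then insert at the MRU end of T1.
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(in_b2=False)
            else:
                self.t1.popitem(last=False)
        else:
            total = len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2)
            if total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(in_b2=False)
        self.t1[key] = None
        return False

# Hypothetical trace: a hot 4-block prefix, an 8-block one-off scan, the prefix again.
hot = [f"doc-A:{i}" for i in range(4)]
scan = [f"other:{i}" for i in range(8)]
arc = ARCCache(capacity=6)
flags = [arc.access(block) for block in hot * 3 + scan + hot]
final_hits = sum(flags[-len(hot):])
print(final_hits)  # prints 4
```

On this trace the reused prefix is promoted to T2 on its second touch, so the one-off blocks only displace each other in T1 and the prefix still hits on its return. A plain LRU cache of the same capacity would have evicted it during the scan.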

If this is right

  • KV cache hit rates rise by as much as 10.8 percent on synthetic document question answering workloads.
  • Time to first token falls by as much as 12.6 percent on the same synthetic workloads.
  • Real-world conversation workloads see smaller gains of 2.1 percent hit rate and 2.0 percent time to first token.
  • The policy works without conflict when requests are processed in batches.
  • Cache decisions remain interpretable across varying workload mixes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same balancing idea could lower the total memory needed to serve many simultaneous users.
  • Comparable recency-frequency rules might help other shared caches in distributed computing.
  • Running the method on models larger than those tested would show whether the gains hold at scale.

Load-bearing premise

The overhead of tracking frequency and recency and making allocation decisions stays low enough that the extra work does not erase the gains from higher hit rates.

What would settle it

A workload where the added tracking and decision steps raise total latency even though the hit rate improves.
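
One way to frame that test is a break-even calculation. The sketch below assumes a deliberately simple two-outcome model in which every request either hits or misses the prefix cache, with a fixed TTFT for each outcome and a fixed per-request bookkeeping overhead. The model, the function names, and all numbers are hypothetical and are not taken from the paper.

```python
def expected_ttft(hit_rate, t_hit, t_miss, overhead=0.0):
    """Mean TTFT in seconds when a request hits with probability hit_rate."""
    return overhead + hit_rate * t_hit + (1.0 - hit_rate) * t_miss

def break_even_overhead(h_base, h_adaptive, t_hit, t_miss):
    """Per-request overhead at which the adaptive policy exactly ties the baseline."""
    return (h_adaptive - h_base) * (t_miss - t_hit)

# Hypothetical values: 50 ms TTFT on a hit, 500 ms on a miss,
# and a hit rate that rises from 40% to 42%.
t_hit, t_miss = 0.05, 0.50
h_base, h_adaptive = 0.40, 0.42

limit = break_even_overhead(h_base, h_adaptive, t_hit, t_miss)
baseline = expected_ttft(h_base, t_hit, t_miss)
adaptive_at_limit = expected_ttft(h_adaptive, t_hit, t_miss, overhead=limit)
print(f"{limit * 1e3:.1f} ms")  # prints "9.0 ms"
```

Under these assumed numbers, any tracking cost above roughly 9 ms per request would leave the adaptive policy slower than the baseline despite its higher hit rate. Smaller hit-rate gains shrink that budget proportionally.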

Figures

Figures reproduced from arXiv: 2606.21238 by Bogdan Nicolae, Meghana Madhyastha, Randal Burns, Robert Underwood, Yang Shen.

Figure 1: Overview of adaptive KV caching for LLM serving.
Figure 2: Structure of the Adaptive Replacement Cache. The cache consists of a recency queue (L1) and a frequency queue (L2) that are split into physical caches T1 and T2 and ghost caches B1 and B2 that track metadata. Requests that hit in the ghost caches are used to adapt the length of the T1 and T2 queues.
Figure 3: Document QA workload on QuALITY that compares static two-queue replacement (DBL) and adaptive two-queue replacement (ARC) with LRU. Panels: (a) KV cache hit rate, (b) average TTFT.
Figure 4: Document QA workload on WikiQA.
Figure 5: Performance comparison on the multi-turn conversation workload.
Figure 6: KV cache hit rate and average TTFT under different batch sizes on the QuALITY dataset. The cache size is 20k tokens.
Figure 7: Dynamic evolution of the T1 Ratio across different workloads and cache capacities.
Original abstract

Key-value (KV) caching is a powerful technique for accelerating large language model inference and generation. Inference workloads are large and diverse, which makes them difficult to cache effectively. Existing cache management strategies adopt the least-recently-used policy for evicting cache blocks. However, LRU leads to multiple unrelated workloads flushing each other's caches. To address this, we integrate adaptive caching that dynamically allocates cache space between recently and frequently occurring KV blocks. Evaluations show that it improves the KV cache hit rate by up to 10.8% and reduces time to first token by up to 12.6% over naive vLLM on synthetic document question answering workloads, and 2.1% and 2.0% respectively on real-world conversation workloads. The method generalizes well to batch inference and demonstrates clear interpretability while effectively accommodating diverse workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a recency/frequency adaptive KV caching method for LLM serving that dynamically allocates cache space between recently and frequently occurring KV blocks to mitigate cache interference from LRU under diverse workloads. It claims empirical improvements over naive vLLM of up to 10.8% in KV cache hit rate and 12.6% in time-to-first-token on synthetic document QA workloads, with 2.1% and 2.0% gains respectively on real-world conversation workloads, plus generalization to batch inference and interpretability.

Significance. If the adaptive allocator's overhead proves negligible, the approach could improve serving efficiency for mixed inference workloads by better balancing recency and frequency signals. The dual synthetic/real-world evaluation and emphasis on interpretability are strengths that would support practical relevance in distributed LLM systems if the net gains hold after overhead accounting.

major comments (1)
  1. [Evaluation] Evaluation section: the reported TTFT reductions (up to 12.6% synthetic, 2.0% real-world) are presented without any quantification or measurement of the overhead from frequency counter maintenance, reallocation decisions, or data-structure costs in the adaptive allocator. This is load-bearing for the central claim, as unaccounted costs could erase the modest real-world margins.
minor comments (1)
  1. [Abstract] Abstract: the description of the adaptive allocation mechanism is high-level only and provides no pseudocode or parameter definitions (for example, for the balance factor), hindering immediate assessment of implementation complexity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for identifying the need to quantify allocator overhead, which is essential for validating the net performance gains. We will revise the evaluation section accordingly.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported TTFT reductions (up to 12.6% synthetic, 2.0% real-world) are presented without any quantification or measurement of the overhead from frequency counter maintenance, reallocation decisions, or data-structure costs in the adaptive allocator. This is load-bearing for the central claim, as unaccounted costs could erase the modest real-world margins.

    Authors: We agree that the manuscript should quantify these costs. The current evaluation reports only end-to-end TTFT and hit-rate improvements without isolating the allocator's overhead. In the revised manuscript we will add micro-benchmark results measuring (1) CPU cycles and wall-clock time for frequency-counter updates, (2) latency of the reallocation decision logic, and (3) memory and cache-miss overhead of the additional data structures, both in isolation and as a fraction of total TTFT. These measurements will be reported for the same synthetic and real-world workloads used in the original experiments.
    Revision: yes
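
The first of those measurements could take roughly the following shape. This is a sketch of the measurement pattern only: the helper `time_per_update`, the synthetic key stream, and the use of `OrderedDict` and `Counter` as stand-ins for the cache metadata are all assumptions, and Python timings say nothing about the cost inside vLLM's block manager.

```python
import time
from collections import Counter, OrderedDict

def time_per_update(update, keys, repeats=5):
    """Best-of-repeats mean wall-clock seconds per call of update(key)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for key in keys:
            update(key)
        best = min(best, (time.perf_counter() - start) / len(keys))
    return best

# Hypothetical stream of block ids cycling over 1024 resident blocks.
keys = [i % 1024 for i in range(20_000)]
recency = OrderedDict((k, None) for k in range(1024))
freq = Counter()

def recency_only(key):
    recency.move_to_end(key)        # LRU-style bookkeeping

def recency_plus_frequency(key):
    recency.move_to_end(key)        # same recency bookkeeping
    freq[key] += 1                  # plus a per-block frequency counter

base = time_per_update(recency_only, keys)
extra = time_per_update(recency_plus_frequency, keys)
print(f"recency only: {base * 1e9:.0f} ns/update; with counter: {extra * 1e9:.0f} ns/update")
```

Reporting the difference between the two figures as a fraction of measured TTFT would give the kind of isolation the referee asks for.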

Circularity Check

0 steps flagged

No circularity: empirical method with external baseline comparison

Full rationale

The paper describes an engineering technique for adaptive recency/frequency KV cache allocation and reports measured improvements (hit rate up to 10.8%, TTFT up to 12.6%) against the external vLLM baseline on both synthetic and real workloads. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The central claims rest on observable runtime metrics that can be independently reproduced or falsified, satisfying the self-contained criterion with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The adaptive method implies a mechanism for dynamic allocation, which typically involves free parameters for weighting recency vs frequency. Based on abstract only.

free parameters (1)
  • allocation balance factor
    The dynamic allocation between recency and frequency likely requires at least one tunable parameter to decide the split, though not detailed in abstract.
axioms (1)
  • domain assumption: LRU is the baseline eviction policy that can be improved by frequency consideration
    The paper starts from the observation that LRU leads to flushing issues in diverse workloads.


