
arxiv: 2507.21433 · v3 · pith:RLLBD3WSnew · submitted 2025-07-29 · 💻 cs.LG · cs.AI

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

Pith reviewed 2026-05-19 03:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KV cache · large reasoning models · inference serving · cache sharing · collaborative filtering · throughput optimization · zero-copy reuse · QoS

The pith

ReasonCache reuses similar KV cache blocks in large reasoning models via collaborative filtering to raise serving throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models produce long sequences of tokens during inference, and each token adds KV cache entries that quickly exhaust memory and throttle concurrent requests. The paper observes that many intermediate reasoning steps are highly similar and therefore generate highly similar KV states across layers. ReasonCache applies a collaborative filtering algorithm to locate reusable cache blocks and shares them zero-copy instead of recomputing or evicting them. Experiments report a peak throughput increase of 89.2 percent and average gains of 40-60 percent, together with accuracy equal to or higher than that of existing cache policies. The approach directly targets the QoS bottleneck that arises when many users query the same model at once.
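To make the memory pressure concrete, the size of a standard per-token KV cache can be worked out directly: one key vector and one value vector per layer and per KV head. The sketch below is a back-of-the-envelope calculation only. The model shape (64 layers, 8 KV heads, head dimension 128, 16-bit values) and the 32,768-token reasoning trace are hypothetical round numbers, not figures taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache held by one sequence of `num_tokens` tokens."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical configuration chosen for round numbers (an assumption, not from the paper).
per_token = kv_cache_bytes(1, num_layers=64, num_kv_heads=8, head_dim=128)
per_request = kv_cache_bytes(32_768, num_layers=64, num_kv_heads=8, head_dim=128)

print(per_token // 1024, "KiB per token")         # 256 KiB per token
print(per_request // 1024**3, "GiB per request")  # 8 GiB per request
```

Under these assumed numbers a single long reasoning trace occupies 8 GiB, so a handful of concurrent requests can fill an accelerator's memory. That is the pressure block sharing is meant to relieve.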

Core claim

LRMs frequently generate highly similar intermediate reasoning steps, which correspond to highly similar KV cache states across layers. A Collaborative Filtering Algorithm can efficiently identify the reusable blocks and enable zero-copy cache reuse, yielding a peak throughput improvement of 89.2 percent and average gains of 40-60 percent while maintaining higher accuracy than prior KV cache management techniques.

What carries the argument

Collaborative Filtering Algorithm that locates reusable KV cache blocks from similar reasoning steps to support zero-copy reuse.
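One common way to realize zero-copy reuse in a paged KV cache is a per-request block table whose entries point at reference-counted physical blocks. The sketch below illustrates that general pattern only. The names `BlockPool`, `Sequence`, and `append_block` are hypothetical, the step that decides which block is reusable is reduced to a `reusable_id` argument, and none of this is claimed to match the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class BlockPool:
    """Physical KV blocks, tracked here only by id and reference count."""
    refcount: dict = field(default_factory=dict)
    next_id: int = 0

    def allocate(self) -> int:
        block_id = self.next_id
        self.next_id += 1
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        # Zero-copy: the caller receives the same physical id; only the count changes.
        self.refcount[block_id] += 1
        return block_id

    def release(self, block_id: int) -> None:
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            del self.refcount[block_id]  # the block can now be reclaimed


@dataclass
class Sequence:
    """Per-request block table: logical block index -> physical block id."""
    block_table: list = field(default_factory=list)


def append_block(pool, seq, reusable_id=None):
    """Append one logical block, sharing `reusable_id` when a match was found."""
    if reusable_id is not None and reusable_id in pool.refcount:
        seq.block_table.append(pool.share(reusable_id))
    else:
        seq.block_table.append(pool.allocate())


pool = BlockPool()
a, b = Sequence(), Sequence()
append_block(pool, a)                 # a owns physical block 0
append_block(pool, b, reusable_id=0)  # b maps its first block onto block 0
append_block(pool, b)                 # b allocates a fresh block 1
print(a.block_table, b.block_table, pool.refcount)  # [0] [0, 1] {0: 2, 1: 1}
```

In this pattern a shared block must be treated as read-only. A real system would also need a policy, such as copy-on-write, for the case where a request has to modify a block that other requests still reference.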

If this is right

  • Higher throughput allows the same hardware to serve more concurrent users without violating latency targets.
  • Lower per-request memory footprint reduces the cost of deploying large reasoning models in production clusters.
  • Accuracy stays equal to or above that of existing KV cache management techniques, plausibly because sharing keeps cache entries that eviction-based methods would discard.
  • Inference systems gain headroom to accommodate longer reasoning chains without increasing hardware allocation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reuse pattern could be tested on other long-generation workloads such as code synthesis or multi-step planning where repetition is common.
  • Integration with existing dynamic batching schedulers might compound the throughput gains by aligning cache hits across different user requests.
  • If KV-state similarity holds across model scales, the technique could reduce the memory capacity required for serving clusters without retraining.

Load-bearing premise

Large reasoning models often produce highly similar intermediate reasoning steps that map to sufficiently similar KV cache states for safe identification and reuse without accuracy loss.

What would settle it

A set of traces from an LRM where the collaborative filtering algorithm selects blocks that produce a measurable drop in final answer accuracy or where similar reasoning steps fail to produce cache states close enough for reuse.
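A check of this kind could be scripted as a paired comparison: run the same problems with and without cache sharing, then look at both the aggregate accuracy drop and the individual problems that regress. The sketch below assumes per-problem correctness flags are already available. The data is made up for illustration and the helper names are hypothetical.

```python
def accuracy(correct_flags):
    """Fraction of problems answered correctly."""
    return sum(correct_flags) / len(correct_flags)


def reuse_regressions(baseline, shared):
    """Indices of problems solved without sharing but missed with sharing."""
    return [i for i, (b, s) in enumerate(zip(baseline, shared)) if b and not s]


# Made-up per-problem outcomes (True = final answer correct); not results from the paper.
baseline = [True, True, False, True, True, False, True, True]
shared = [True, True, False, True, False, False, True, True]

drop = accuracy(baseline) - accuracy(shared)
print(f"accuracy drop: {drop:.3f}")                               # accuracy drop: 0.125
print("regressed problems:", reuse_regressions(baseline, shared))  # regressed problems: [4]
```

Listing the regressed problems matters as much as the aggregate number, because those traces are where one would inspect which shared blocks were selected.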

Figures

Figures reproduced from arXiv: 2507.21433 by Hong Xu, Jingzong Li, Kaiwen Chen, Minchen Yu, Xin Tan.

Figure 1: An example of redundant thinking in reasoning models, as demonstrated by QwQ-32B
Figure 2: Redundant thinking across different reasoning
Figure 4: Main workflow of MemShare. The system consists of a Collaborative Filtering Algorithm and a Paged Attention
Figure 5: Paged Attention adapted KV sharing mechanism
Figure 6: Performance of MemShare across models and benchmarks using different threshold settings.
Figure 7: Performance comparison of similarity measure
Original abstract

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the substantial memory overhead from their long, auto-regressive inference processes severely limits throughput and increases latency, thereby affecting the quality of service for concurrent users. We observe that LRMs frequently generate highly similar intermediate reasoning steps, which, in turn, correspond to highly similar KV cache states across layers. Building on this insight, we propose ReasonCache, a novel KV cache management approach designed to improve the QoS of AI inference systems. ReasonCache utilizes a Collaborative Filtering Algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse. Experimental evaluation demonstrates that ReasonCache achieves a peak throughput improvement of 89.2% and an average gain of 40-60%, leading to more responsive and cost-effective AI inference services. Notably, this performance is achieved while maintaining higher accuracy compared to existing KV cache management techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes ReasonCache, a KV cache management system for Large Reasoning Models that observes frequent similarities in intermediate reasoning steps across generations and uses a Collaborative Filtering Algorithm to detect reusable KV cache blocks for zero-copy sharing. It reports a peak throughput improvement of 89.2% and average gains of 40-60% while claiming higher accuracy than prior KV cache techniques.

Significance. If the core assumption holds and the reported gains are reproducible, the work could meaningfully improve throughput and cost-efficiency for serving long-context reasoning models without accuracy degradation, addressing a practical bottleneck in production inference systems.

major comments (3)
  1. [Abstract] The central throughput claims (89.2% peak, 40-60% average) and the assertion of 'higher accuracy' are presented without any reference to experimental setup, baselines, number of runs, statistical significance, or error bars, rendering the quantitative results unverifiable from the given information.
  2. [Method description] The method relies on the claim that 'highly similar' intermediate reasoning steps produce KV cache states similar enough for exact zero-copy reuse. No section derives or measures whether the blocks flagged by the Collaborative Filtering Algorithm are numerically identical (as opposed to merely correlated) across layers; small numerical differences would accumulate in attention outputs and alter token distributions over long generations.
  3. [Evaluation] The accuracy comparison to existing KV cache management techniques is load-bearing for the overall contribution, yet the manuscript provides no concrete evidence (e.g., per-layer KV difference metrics or ablation on reuse safety) that detected blocks can be shared without introducing error.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement of the specific Collaborative Filtering variant employed and the similarity threshold used for block reuse.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity in the abstract and the need for stronger empirical support for the zero-copy reuse safety and accuracy claims. We address each point below and commit to revisions that will incorporate additional analysis and metrics without altering the core contributions.

  1. Referee: [Abstract] The central throughput claims (89.2% peak, 40-60% average) and the assertion of 'higher accuracy' are presented without any reference to experimental setup, baselines, number of runs, statistical significance, or error bars, rendering the quantitative results unverifiable from the given information.

    Authors: We agree that the abstract would benefit from additional context to improve verifiability. The full manuscript details the experimental setup in Section 5, including the specific LRMs evaluated, reasoning benchmarks, comparison baselines (standard KV cache policies), and that throughput and accuracy results are averaged over multiple independent runs. To address this directly, we will revise the abstract to include a concise reference to the evaluation methodology, such as noting the benchmarks and that gains are reported as averages with observed variance. revision: yes

  2. Referee: [Method description] The method relies on the claim that 'highly similar' intermediate reasoning steps produce KV cache states similar enough for exact zero-copy reuse. No section derives or measures whether the blocks flagged by the Collaborative Filtering Algorithm are numerically identical (as opposed to merely correlated) across layers; small numerical differences would accumulate in attention outputs and alter token distributions over long generations.

    Authors: This concern about numerical identity versus correlation is well-taken, as accumulation of small errors could indeed affect long generations. The manuscript grounds the approach in empirical observations of similarity in reasoning steps and corresponding KV states, with the collaborative filtering algorithm selecting blocks for reuse based on those patterns. However, we do not currently report explicit per-layer numerical difference metrics (e.g., L2 norms or cosine similarity thresholds) to quantify how close the reused blocks are to the originals. We will add this analysis in a revised methods or evaluation section, including measurements confirming that differences remain below levels that impact attention outputs or token distributions. revision: yes

  3. Referee: [Evaluation] The accuracy comparison to existing KV cache management techniques is load-bearing for the overall contribution, yet the manuscript provides no concrete evidence (e.g., per-layer KV difference metrics or ablation on reuse safety) that detected blocks can be shared without introducing error.

    Authors: We recognize that the accuracy claims, including the assertion of higher accuracy relative to prior techniques, require more direct supporting evidence to be fully convincing. The current evaluation reports overall accuracy metrics alongside throughput gains but does not include dedicated ablations on reuse safety or per-layer KV difference metrics. In the revised manuscript, we will add these elements: per-layer KV cache difference statistics for reused blocks, an ablation comparing accuracy with and without sharing, and checks confirming no degradation in token distributions. This will provide the concrete evidence needed to substantiate safe reuse. revision: yes
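The per-layer difference metrics mentioned in the responses above (L2 norms and cosine similarity between an original block and the block reused in its place) could be computed along the following lines. This is a minimal sketch under stated assumptions: each layer's KV block is flattened to a non-zero list of floats, the function names are hypothetical, and the values are toy data rather than measurements.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened KV blocks (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def relative_l2(u, v):
    """L2 distance between the blocks, scaled by the norm of the original block `u`."""
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return diff / math.sqrt(sum(x * x for x in u))


def per_layer_report(original, reused):
    """`original` and `reused` each hold one flattened KV block per layer."""
    return [
        {"layer": i, "cosine": cosine_similarity(o, r), "rel_l2": relative_l2(o, r)}
        for i, (o, r) in enumerate(zip(original, reused))
    ]


# Toy values: layer 0 is reused exactly, layer 1 differs slightly.
original = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
reused = [[1.0, 0.0, 2.0], [0.5, 0.4, 0.6]]
for row in per_layer_report(original, reused):
    print(row)
```

Reporting both metrics is a deliberate choice. Cosine similarity ignores differences in magnitude, so two blocks can score close to 1 while still differing in scale. The relative L2 distance catches that case, which bears on the referee's distinction between correlated and numerically identical blocks.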

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper rests on an empirical observation that LRMs produce similar intermediate reasoning steps with corresponding KV cache states, then applies an external Collaborative Filtering Algorithm to detect reusable blocks for zero-copy sharing. Performance claims (throughput gains and accuracy) are validated via experimental evaluation rather than any mathematical derivation that reduces to fitted inputs or self-referential definitions. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way that collapses the central argument. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified observation that reasoning steps produce similar KV states and that collaborative filtering can locate them efficiently; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5711 in / 1054 out tokens · 40840 ms · 2026-05-19T03:10:20.634474+00:00 · methodology

