ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
Pith reviewed 2026-05-19 03:10 UTC · model grok-4.3
The pith
ReasonCache reuses similar KV cache blocks in large reasoning models via collaborative filtering to raise serving throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LRMs frequently generate highly similar intermediate reasoning steps that correspond to highly similar KV cache states across layers; a Collaborative Filtering Algorithm can efficiently identify reusable blocks and enable zero-copy cache reuse, yielding a peak throughput improvement of 89.2 percent and average gains of 40-60 percent while maintaining higher accuracy than prior KV cache management techniques.
What carries the argument
Collaborative Filtering Algorithm that locates reusable KV cache blocks from similar reasoning steps to support zero-copy reuse.
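One way to picture the lookup, as a minimal sketch and not the paper's algorithm: each cached block carries a signature vector, and a new reasoning step reuses the cached block whose signature is most similar, provided the cosine similarity clears a threshold. The function `find_reusable_block`, the signature vectors, and the `0.95` threshold are all assumptions made for illustration; this summary does not specify the Collaborative Filtering Algorithm, which may work quite differently.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_reusable_block(
    query: Sequence[float],
    cached: Dict[int, Sequence[float]],
    threshold: float = 0.95,  # hypothetical value, not taken from the paper
) -> Optional[int]:
    """Return the id of the most similar cached block, or None if none clears the threshold."""
    best_id, best_sim = None, threshold
    for block_id, signature in cached.items():
        sim = cosine_similarity(query, signature)
        if sim >= best_sim:
            best_id, best_sim = block_id, sim
    return best_id


# Made-up signatures for two cached blocks.
cached_signatures = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
}
hit = find_reusable_block([0.99, 0.05, 0.0], cached_signatures)   # close to block 0
miss = find_reusable_block([0.6, 0.6, 0.5], cached_signatures)    # close to neither
print(hit, miss)
```

The threshold is the knob that trades hit rate against reuse safety: a lower value finds more candidates but admits blocks that differ more from what the model would have computed.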
If this is right
- Higher throughput allows the same hardware to serve more concurrent users without violating latency targets.
- Lower per-request memory footprint reduces the cost of deploying large reasoning models in production clusters.
- Accuracy remains stable or improves relative to eviction-based KV cache methods, plausibly because reuse keeps cached context that eviction would discard.
- Inference systems gain headroom to accommodate longer reasoning chains without increasing hardware allocation.
Where Pith is reading between the lines
- The same reuse pattern could be tested on other long-generation workloads such as code synthesis or multi-step planning where repetition is common.
- Integration with existing dynamic batching schedulers might compound the throughput gains by aligning cache hits across different user requests.
- If KV-state similarity holds across model scales, the technique could reduce the memory capacity required for serving clusters without retraining.
Load-bearing premise
Large reasoning models often produce highly similar intermediate reasoning steps that map to sufficiently similar KV cache states for safe identification and reuse without accuracy loss.
What would settle it
A set of traces from an LRM where the collaborative filtering algorithm selects blocks that produce a measurable drop in final answer accuracy or where similar reasoning steps fail to produce cache states close enough for reuse.
Original abstract
Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the substantial memory overhead from their long, auto-regressive inference processes severely limits throughput and increases latency, thereby affecting the quality of service for concurrent users. We observe that LRMs frequently generate highly similar intermediate reasoning steps, which, in turn, correspond to highly similar KV cache states across layers. Building on this insight, we propose ReasonCache, a novel KV cache management approach designed to improve the QoS of AI inference systems. ReasonCache utilizes a Collaborative Filtering Algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse. Experimental evaluation demonstrates that ReasonCache achieves a peak throughput improvement of 89.2% and an average gain of 40-60%, leading to more responsive and cost-effective AI inference services. Notably, this performance is achieved while maintaining higher accuracy compared to existing KV cache management techniques.
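Zero-copy reuse can be pictured as two sequences pointing at the same physical block instead of duplicating it. The sketch below assumes a paged block pool with reference counts, in the spirit of paged KV cache managers; the class `BlockPool` and its methods are hypothetical and are not ReasonCache's interface.

```python
from typing import Dict, List


class BlockPool:
    """Toy pool of physical KV blocks with reference counting (hypothetical)."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}

    def allocate(self) -> int:
        """Take a fresh physical block for newly computed KV entries."""
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        """Reuse an existing block: bump its count, copy nothing."""
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        """Drop one reference; return the block to the free list at zero."""
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


pool = BlockPool(num_blocks=4)

# Sequence A computes a reasoning step and stores it in a new block.
seq_a = [pool.allocate()]
# Sequence B reaches a step judged reusable and maps the same physical block.
seq_b = [pool.share(seq_a[0])]

blocks_in_use = len(pool.refcount)      # one physical block backs both sequences
pool.release(seq_a[0])                  # A finishes; B still holds the block
still_held = seq_b[0] in pool.refcount
pool.release(seq_b[0])                  # last reference gone, block is free again
print(blocks_in_use, still_held, len(pool.free))
```

A production system would also need shared blocks to be read-only or protected by copy-on-write; that is left out here to keep the sharing step visible.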
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ReasonCache, a KV cache management system for Large Reasoning Models that observes frequent similarities in intermediate reasoning steps across generations and uses a Collaborative Filtering Algorithm to detect reusable KV cache blocks for zero-copy sharing. It reports a peak throughput improvement of 89.2% and average gains of 40-60% while claiming higher accuracy than prior KV cache techniques.
Significance. If the core assumption holds and the reported gains are reproducible, the work could meaningfully improve throughput and cost-efficiency for serving long-context reasoning models without accuracy degradation, addressing a practical bottleneck in production inference systems.
major comments (3)
- [Abstract] The central throughput claims (89.2% peak, 40-60% average) and the assertion of 'higher accuracy' are presented without any reference to experimental setup, baselines, number of runs, statistical significance, or error bars, rendering the quantitative results unverifiable from the given information.
- [Method description] The method relies on the claim that 'highly similar' intermediate reasoning steps produce KV cache states similar enough for exact zero-copy reuse. No section derives or measures whether the blocks flagged by the Collaborative Filtering Algorithm are numerically identical (as opposed to merely correlated) across layers; small numerical differences would accumulate in attention outputs and alter token distributions over long generations.
- [Evaluation] The accuracy comparison to existing KV cache management techniques is load-bearing for the overall contribution, yet the manuscript provides no concrete evidence (e.g., per-layer KV difference metrics or ablation on reuse safety) that detected blocks can be shared without introducing error.
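The concern about small numerical differences can be made concrete with a standard perturbation bound for a single attention head and a single query. This derivation is not from the paper; the symbols $\epsilon_i$ and $\delta_i$ (the key and value differences between a reused block and the block the model would have computed) are introduced here for illustration.

```latex
% Single attention head, one query q, cached keys k_i and values v_i, head dimension d.
% Reused cache entries: k_i' = k_i + \epsilon_i, \quad v_i' = v_i + \delta_i.
\begin{align*}
a_i &= \frac{\exp(q^\top k_i/\sqrt{d})}{\sum_j \exp(q^\top k_j/\sqrt{d})}, &
o &= \sum_i a_i v_i, \\
\eta &= \max_i \frac{|q^\top \epsilon_i|}{\sqrt{d}}
      \le \frac{\|q\|\,\max_i\|\epsilon_i\|}{\sqrt{d}}, &
e^{-2\eta} &\le \frac{a_i'}{a_i} \le e^{2\eta}, \\
o' - o &= \sum_i (a_i' - a_i)\,v_i + \sum_i a_i'\,\delta_i, \\
\|o' - o\| &\le \bigl(e^{2\eta} - 1\bigr)\max_i\|v_i\| + \max_i\|\delta_i\|.
\end{align*}
```

The bound is zero when the reused block is numerically identical and grows with the size of the key and value differences. It covers one head at one decoding step only; it says nothing about how the error compounds across layers and over a long generation, which is the part the referee asks the authors to measure.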
minor comments (1)
- [Abstract] The abstract would benefit from a brief statement of the specific Collaborative Filtering variant employed and the similarity threshold used for block reuse.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity in the abstract and the need for stronger empirical support for the zero-copy reuse safety and accuracy claims. We address each point below and commit to revisions that will incorporate additional analysis and metrics without altering the core contributions.
- Referee: [Abstract] The central throughput claims (89.2% peak, 40-60% average) and the assertion of 'higher accuracy' are presented without any reference to experimental setup, baselines, number of runs, statistical significance, or error bars, rendering the quantitative results unverifiable from the given information.
  Authors: We agree that the abstract would benefit from additional context to improve verifiability. The full manuscript details the experimental setup in Section 5, including the specific LRMs evaluated, the reasoning benchmarks, and the comparison baselines (standard KV cache policies), and reports throughput and accuracy results averaged over multiple independent runs. To address this directly, we will revise the abstract to include a concise reference to the evaluation methodology, such as noting the benchmarks and that gains are reported as averages with observed variance.
  Revision: yes
- Referee: [Method description] The method relies on the claim that 'highly similar' intermediate reasoning steps produce KV cache states similar enough for exact zero-copy reuse. No section derives or measures whether the blocks flagged by the Collaborative Filtering Algorithm are numerically identical (as opposed to merely correlated) across layers; small numerical differences would accumulate in attention outputs and alter token distributions over long generations.
  Authors: This concern about numerical identity versus correlation is well-taken, as accumulation of small errors could indeed affect long generations. The manuscript grounds the approach in empirical observations of similarity in reasoning steps and corresponding KV states, with the collaborative filtering algorithm selecting blocks for reuse based on those patterns. However, we do not currently report explicit per-layer numerical difference metrics (e.g., L2 norms or cosine similarity thresholds) to quantify how close the reused blocks are to the originals. We will add this analysis in a revised methods or evaluation section, including measurements confirming that differences remain below levels that impact attention outputs or token distributions.
  Revision: yes
- Referee: [Evaluation] The accuracy comparison to existing KV cache management techniques is load-bearing for the overall contribution, yet the manuscript provides no concrete evidence (e.g., per-layer KV difference metrics or ablation on reuse safety) that detected blocks can be shared without introducing error.
  Authors: We recognize that the accuracy claims, including the assertion of higher accuracy relative to prior techniques, require more direct supporting evidence to be fully convincing. The current evaluation reports overall accuracy metrics alongside throughput gains but does not include dedicated ablations on reuse safety or per-layer KV difference metrics. In the revised manuscript, we will add these elements: per-layer KV cache difference statistics for reused blocks, an ablation comparing accuracy with and without sharing, and checks confirming no degradation in token distributions. This will provide the concrete evidence needed to substantiate safe reuse.
  Revision: yes
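The per-layer difference statistics discussed in these responses could be computed along the following lines. This is a minimal sketch under stated assumptions: each KV block is flattened to a plain list of floats, blocks are non-zero, and the helper names `block_difference` and `per_layer_report` are hypothetical. The toy values are made up and are not results from the paper.

```python
import math
from typing import Dict, List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def block_difference(original: Sequence[float], reused: Sequence[float]) -> Dict[str, float]:
    """Relative L2 error and cosine similarity between two flattened KV blocks.

    Assumes both blocks have non-zero norm.
    """
    diff = [a - b for a, b in zip(original, reused)]
    dot = sum(a * b for a, b in zip(original, reused))
    norm_o, norm_r = l2_norm(original), l2_norm(reused)
    return {
        "rel_l2": l2_norm(diff) / norm_o,
        "cosine": dot / (norm_o * norm_r),
    }


def per_layer_report(
    original_layers: List[Sequence[float]],
    reused_layers: List[Sequence[float]],
) -> List[Dict[str, float]]:
    """One metrics dict per layer, in layer order."""
    return [block_difference(o, r) for o, r in zip(original_layers, reused_layers)]


# Toy data: two layers, each block flattened to a short vector (made-up values).
original = [[3.0, 4.0], [1.0, 0.0]]
reused = [[3.0, 4.0], [0.0, 1.0]]

report = per_layer_report(original, reused)
for layer, metrics in enumerate(report):
    print(layer, metrics)
```

In the toy data, layer 0 is an exact match (relative error 0, cosine 1) while layer 1 is orthogonal. A block can therefore look reusable at one layer and not at another, which is why a per-layer breakdown is more informative than a single aggregate number.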
Circularity Check
No significant circularity in the derivation chain
The paper rests on an empirical observation that LRMs produce similar intermediate reasoning steps with corresponding KV cache states, then applies an external Collaborative Filtering Algorithm to detect reusable blocks for zero-copy sharing. Performance claims (throughput gains and accuracy) are validated via experimental evaluation rather than any mathematical derivation that reduces to fitted inputs or self-referential definitions. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way that collapses the central argument. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.