ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

Hong Xu; Jingzong Li; Kaiwen Chen; Minchen Yu; Xin Tan

arxiv: 2507.21433 · v3 · pith:RLLBD3WSnew · submitted 2025-07-29 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-19 03:10 UTC · model grok-4.3

keywords: KV cache, large reasoning models, inference serving, cache sharing, collaborative filtering, throughput optimization, zero-copy reuse, QoS

The pith

ReasonCache reuses similar KV cache blocks in large reasoning models via collaborative filtering to raise serving throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models produce long token sequences during inference, and every generated token adds KV cache entries that quickly exhaust memory and throttle concurrent requests. The paper observes that many intermediate reasoning steps are highly similar and therefore generate nearly identical KV states across layers. ReasonCache applies a collaborative filtering algorithm to locate reusable cache blocks and shares them zero-copy instead of recomputing or evicting them. Experiments report a peak throughput increase of 89.2 percent and average gains of 40-60 percent, with accuracy reported as higher than that of existing KV cache management techniques. The approach directly targets the QoS bottleneck that arises when many users query the same model at once.
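To make the memory pressure concrete, here is a back-of-the-envelope sketch of per-request KV cache size and how many requests fit in a fixed budget. The model shape, the 40 GiB budget, and the `shared_fraction` knob are assumptions chosen for illustration, not figures from the paper, and treating shared blocks as free is a deliberate simplification.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache held by one sequence."""
    # The factor 2 covers the key tensor and the value tensor of every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


def max_concurrent(budget_bytes, per_request_bytes, shared_fraction=0.0):
    """Requests that fit in the budget under a simplified sharing model."""
    # shared_fraction is a hypothetical knob: the share of a request's blocks
    # assumed to be already resident because another request produced them.
    effective = per_request_bytes * (1.0 - shared_fraction)
    return int(budget_bytes // effective)


# Hypothetical model shape and memory budget (assumptions, not from the paper).
per_request = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, seq_len=16_384)
budget = 40 * 1024**3  # 40 GiB reserved for KV cache

baseline = max_concurrent(budget, per_request)
with_sharing = max_concurrent(budget, per_request, shared_fraction=0.25)
print(per_request // 1024**3, "GiB per request;", baseline, "->", with_sharing, "requests")
```

Under these assumed numbers, one 16K-token reasoning trace holds 4 GiB of cache, so any fraction of blocks that can be shared translates directly into extra concurrent requests.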

Core claim

LRMs frequently generate highly similar intermediate reasoning steps that correspond to highly similar KV cache states across layers; a Collaborative Filtering Algorithm can efficiently identify reusable blocks and enable zero-copy cache reuse, yielding a peak throughput improvement of 89.2 percent and average gains of 40-60 percent while maintaining higher accuracy than prior KV cache management techniques.

What carries the argument

Collaborative Filtering Algorithm that locates reusable KV cache blocks from similar reasoning steps to support zero-copy reuse.
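This summary does not spell out how the Collaborative Filtering Algorithm scores candidates, so the following is only a minimal sketch of the general idea. It assumes each cache block carries a signature vector and that cosine similarity above a threshold marks a block as reusable. `SharedBlockPool`, `acquire`, `release`, and the `threshold` value are hypothetical names and settings, not the paper's implementation. Sharing is modelled as a reference count on one physical block, so no data is copied.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SharedBlockPool:
    """Toy pool in which several logical blocks may map to one physical block."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed similarity cut-off for reuse
        self.signatures = {}        # physical block id -> signature vector
        self.refcount = {}          # physical block id -> number of users
        self.next_id = 0

    def acquire(self, signature):
        """Return (block_id, reused); reuse the most similar block at or above the threshold."""
        best_id, best_sim = None, self.threshold
        for block_id, sig in self.signatures.items():
            sim = cosine(signature, sig)
            if sim >= best_sim:
                best_id, best_sim = block_id, sim
        if best_id is not None:
            self.refcount[best_id] += 1  # zero-copy: only the count changes
            return best_id, True
        block_id = self.next_id
        self.next_id += 1
        self.signatures[block_id] = list(signature)
        self.refcount[block_id] = 1
        return block_id, False

    def release(self, block_id):
        """Drop one reference and free the block when nobody uses it."""
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            del self.refcount[block_id]
            del self.signatures[block_id]


pool = SharedBlockPool(threshold=0.95)
first = pool.acquire([1.0, 0.0])
near_duplicate = pool.acquire([0.99, 0.05])
unrelated = pool.acquire([0.0, 1.0])
```

In this toy run the near-duplicate signature maps onto the first block while the unrelated one gets a fresh block. The linear scan is only for clarity; a real system would need an index to keep lookups cheap.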

If this is right

Higher throughput allows the same hardware to serve more concurrent users without violating latency targets.
Lower per-request memory footprint reduces the cost of deploying large reasoning models in production clusters.
Accuracy remains stable or improves relative to eviction-based KV cache methods because reuse avoids recomputation errors.
Inference systems gain headroom to accommodate longer reasoning chains without increasing hardware allocation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reuse pattern could be tested on other long-generation workloads such as code synthesis or multi-step planning where repetition is common.
Integration with existing dynamic batching schedulers might compound the throughput gains by aligning cache hits across different user requests.
If KV-state similarity holds across model scales, the technique could reduce the memory capacity required for serving clusters without retraining.

Load-bearing premise

Large reasoning models often produce highly similar intermediate reasoning steps that map to sufficiently similar KV cache states for safe identification and reuse without accuracy loss.

What would settle it

A set of traces from an LRM where the collaborative filtering algorithm selects blocks that produce a measurable drop in final answer accuracy or where similar reasoning steps fail to produce cache states close enough for reuse.

Figures

Figures reproduced from arXiv: 2507.21433 by Hong Xu, Jingzong Li, Kaiwen Chen, Minchen Yu, Xin Tan.

**Figure 2.** Redundant thinking across different reasoning

**Figure 4.** Main workflow of MemShare. The system consists of a Collaborative Filtering Algorithm and a Paged Attention

**Figure 5.** Paged Attention adapted KV sharing mechanism

**Figure 6.** Performance of MemShare across models and benchmarks using different threshold settings.

**Figure 7.** Performance comparison of similarity measure

Original abstract

Large Reasoning Models (LRMs) are becoming integral to many AI inference systems, enhancing their capabilities with advanced reasoning. However, deploying these models in production environments presents a significant QoS challenge: the substantial memory overhead from their long, auto-regressive inference processes severely limits throughput and increases latency, thereby affecting the quality of service for concurrent users. We observe that LRMs frequently generate highly similar intermediate reasoning steps, which, in turn, correspond to highly similar KV cache states across layers. Building on this insight, we propose ReasonCache, a novel KV cache management approach designed to improve the QoS of AI inference systems. ReasonCache utilizes a Collaborative Filtering Algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse. Experimental evaluation demonstrates that ReasonCache achieves a peak throughput improvement of 89.2% and an average gain of 40-60%, leading to more responsive and cost-effective AI inference services. Notably, this performance is achieved while maintaining higher accuracy compared to existing KV cache management techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ReasonCache applies collaborative filtering to spot reusable KV blocks from similar reasoning steps in LRMs and claims large throughput gains with no accuracy drop, but the abstract gives almost no experimental details to back it up.

The main takeaway is that this paper takes the observation of repeated reasoning patterns in LRMs and turns it into a cache-sharing system using collaborative filtering to find blocks that can be reused with zero-copy. They report a peak throughput lift of 89.2 percent and average gains of 40-60 percent while saying accuracy stays higher than standard approaches. That is the core pitch.

The work extends earlier KV cache management ideas from regular LLM serving into the longer, more repetitive outputs that reasoning models produce. The practical focus on concurrent serving and memory pressure is clear, and the choice of collaborative filtering as the matching tool is a straightforward way to avoid brute-force comparison of every block. If the full experiments show clean comparisons against reasonable baselines and confirm that the reused blocks really are identical rather than just close, the gains could matter for anyone running these models at scale.

The soft spot is the lack of visible experimental setup. The abstract states the numbers and the accuracy claim but does not describe the models, the workloads, the exact baselines, or any error bars or statistical checks. That makes it hard to judge whether the improvements come from the sharing itself or from other changes in eviction or scheduling.

The stress-test point about similarity versus numerical identity is worth pressing. If two KV blocks are only correlated and not exactly the same, small differences can compound across layers and long generations, which would undermine the zero-copy safety argument. The paper needs to show either that the algorithm only reuses truly identical blocks or that it measures the actual output divergence.

This is aimed at systems people who build inference servers for reasoning models. A reader who cares about practical memory optimizations and throughput under concurrency would get something useful from the approach. It is not a foundational result, but the problem is real and the method is concrete enough that it deserves a serious referee to check the experiments and the reuse safety claims. I would send it to peer review.

Referee Report

3 major / 1 minor

Summary. The paper proposes ReasonCache, a KV cache management system for Large Reasoning Models that observes frequent similarities in intermediate reasoning steps across generations and uses a Collaborative Filtering Algorithm to detect reusable KV cache blocks for zero-copy sharing. It reports a peak throughput improvement of 89.2% and average gains of 40-60% while claiming higher accuracy than prior KV cache techniques.

Significance. If the core assumption holds and the reported gains are reproducible, the work could meaningfully improve throughput and cost-efficiency for serving long-context reasoning models without accuracy degradation, addressing a practical bottleneck in production inference systems.

major comments (3)

[Abstract] Abstract: The central throughput claims (89.2% peak, 40-60% average) and the assertion of 'higher accuracy' are presented without any reference to experimental setup, baselines, number of runs, statistical significance, or error bars, rendering the quantitative results unverifiable from the given information.
[Method description] The method relies on the claim that 'highly similar' intermediate reasoning steps produce KV cache states similar enough for exact zero-copy reuse. No section derives or measures whether the blocks flagged by the Collaborative Filtering Algorithm are numerically identical (as opposed to merely correlated) across layers; small numerical differences would accumulate in attention outputs and alter token distributions over long generations.
[Evaluation] The accuracy comparison to existing KV cache management techniques is load-bearing for the overall contribution, yet the manuscript provides no concrete evidence (e.g., per-layer KV difference metrics or ablation on reuse safety) that detected blocks can be shared without introducing error.
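The per-layer difference metrics requested in these comments could look roughly like the sketch below. It assumes each layer's KV block has been flattened to a plain list of floats; `relative_l2` and `per_layer_report` are hypothetical helper names, and the sample values are made up purely to exercise the code.

```python
import math


def relative_l2(original, reused):
    """L2 distance between two blocks, normalised by the norm of the original."""
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(original, reused)))
    norm = math.sqrt(sum(x * x for x in original))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else float("inf")
    return diff / norm


def per_layer_report(original_layers, reused_layers):
    """One record per layer: exact equality plus the relative L2 gap."""
    report = []
    for layer, (orig, reused) in enumerate(zip(original_layers, reused_layers)):
        report.append({
            "layer": layer,
            "identical": orig == reused,  # bitwise-equal values, not just close
            "rel_l2": relative_l2(orig, reused),
        })
    return report


# Made-up two-layer example: layer 0 matches exactly, layer 1 drifts slightly.
original = [[1.0, 2.0], [3.0, 4.0]]
reused = [[1.0, 2.0], [3.0, 4.5]]
report = per_layer_report(original, reused)
for row in report:
    print(row)
```

Reporting exact equality separately from the relative gap keeps the distinction the referee draws between numerically identical blocks and merely similar ones.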

minor comments (1)

[Abstract] The abstract would benefit from a brief statement of the specific Collaborative Filtering variant employed and the similarity threshold used for block reuse.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity in the abstract and the need for stronger empirical support for the zero-copy reuse safety and accuracy claims. We address each point below and commit to revisions that will incorporate additional analysis and metrics without altering the core contributions.

Referee: [Abstract] Abstract: The central throughput claims (89.2% peak, 40-60% average) and the assertion of 'higher accuracy' are presented without any reference to experimental setup, baselines, number of runs, statistical significance, or error bars, rendering the quantitative results unverifiable from the given information.

Authors: We agree that the abstract would benefit from additional context to improve verifiability. The full manuscript details the experimental setup in Section 5, including the specific LRMs evaluated, reasoning benchmarks, comparison baselines (standard KV cache policies), and that throughput and accuracy results are averaged over multiple independent runs. To address this directly, we will revise the abstract to include a concise reference to the evaluation methodology, such as noting the benchmarks and that gains are reported as averages with observed variance. revision: yes

Referee: [Method description] The method relies on the claim that 'highly similar' intermediate reasoning steps produce KV cache states similar enough for exact zero-copy reuse. No section derives or measures whether the blocks flagged by the Collaborative Filtering Algorithm are numerically identical (as opposed to merely correlated) across layers; small numerical differences would accumulate in attention outputs and alter token distributions over long generations.

Authors: This concern about numerical identity versus correlation is well-taken, as accumulation of small errors could indeed affect long generations. The manuscript grounds the approach in empirical observations of similarity in reasoning steps and corresponding KV states, with the collaborative filtering algorithm selecting blocks for reuse based on those patterns. However, we do not currently report explicit per-layer numerical difference metrics (e.g., L2 norms or cosine similarity thresholds) to quantify how close the reused blocks are to the originals. We will add this analysis in a revised methods or evaluation section, including measurements confirming that differences remain below levels that impact attention outputs or token distributions. revision: yes

Referee: [Evaluation] The accuracy comparison to existing KV cache management techniques is load-bearing for the overall contribution, yet the manuscript provides no concrete evidence (e.g., per-layer KV difference metrics or ablation on reuse safety) that detected blocks can be shared without introducing error.

Authors: We recognize that the accuracy claims, including the assertion of higher accuracy relative to prior techniques, require more direct supporting evidence to be fully convincing. The current evaluation reports overall accuracy metrics alongside throughput gains but does not include dedicated ablations on reuse safety or per-layer KV difference metrics. In the revised manuscript, we will add these elements: per-layer KV cache difference statistics for reused blocks, an ablation comparing accuracy with and without sharing, and checks confirming no degradation in token distributions. This will provide the concrete evidence needed to substantiate safe reuse. revision: yes
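One way the promised check on token distributions might be expressed is a KL divergence between the next-token distribution produced with the original cache and the one produced with a shared block. This is a sketch under that assumption only; the logits below are invented, and `softmax` and `kl_divergence` are illustrative helpers rather than anything taken from the manuscript.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero in q."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Invented logits: one decoding step without sharing and the same step with a reused block.
baseline_probs = softmax([2.0, 1.0, 0.0])
shared_probs = softmax([2.0, 1.0, 0.5])

same = kl_divergence(baseline_probs, baseline_probs)
drift = kl_divergence(baseline_probs, shared_probs)
print(f"KL without change: {same:.6f}, KL with reused block: {drift:.6f}")
```

Tracking this value per decoding step over a long generation would show whether small cache differences stay bounded or accumulate.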

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper rests on an empirical observation that LRMs produce similar intermediate reasoning steps with corresponding KV cache states, then applies an external Collaborative Filtering Algorithm to detect reusable blocks for zero-copy sharing. Performance claims (throughput gains and accuracy) are validated via experimental evaluation rather than any mathematical derivation that reduces to fitted inputs or self-referential definitions. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way that collapses the central argument. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified observation that reasoning steps produce similar KV states and that collaborative filtering can locate them efficiently; no explicit free parameters, axioms, or invented entities are stated in the abstract.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 10 internal anchors

[3] Abdin, M.; Agarwal, S.; Awadallah, A.; Balachandran, V.; Behl, H.; Chen, L.; de Rosa, G.; Gunasekar, S.; Javaheripi, M.; Joshi, N.; Kauffmann, P.; Lara, Y.; Mendes, C. C. T.; Mitra, A.; Nushi, B.; Papailiopoulos, D.; Saarikivi, O.; Shah, S.; Shrivastava, V.; Vineet, V.; Wu, Y.; Yousefi, S.; and Zheng, G. 2025. Phi-4-reasoning Technical Report. arXiv:2504.21318

[4] Bai, G.; Liu, J.; Bu, X.; He, Y.; Liu, J.; Zhou, Z.; Lin, Z.; Su, W.; Ge, T.; Zheng, B.; et al. 2024. Mt-bench-101: A Fine-grained Benchmark for Evaluating Large Language Models in Multi-turn Dialogues. arXiv preprint arXiv:2402.14762

[5] Chen, X.; Xu, J.; Liang, T.; He, Z.; Pang, J.; Yu, D.; Song, L.; Liu, Q.; Zhou, M.; Zhang, Z.; et al. 2024. Do Not Think That Much for 2+ 3=? on the Overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187

[6] Child, R.; Gray, S.; Radford, A.; and Sutskever, I. 2019. Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509

[7] Collins, L.; Parulekar, A.; Mokhtari, A.; Sanghavi, S.; and Shakkottai, S. 2024. In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness. In Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.; and Zhang, C., eds., Advances in Neural Information Processing Systems, volume 37, 92638--92696. Curran Ass...

[8] Ge, S.; Zhang, Y.; Liu, L.; Zhang, M.; Han, J.; and Gao, J. 2024. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. arXiv:2310.01801

[9] Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. 2025. Deepseek-r1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948

[10] Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring Mathematical Problem Solving with the MATH Dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1

[11] Jaech, A.; Kalai, A.; Lerer, A.; Richardson, A.; El-Kishky, A.; Low, A.; Helyar, A.; Madry, A.; Beutel, A.; Carney, A.; et al. 2024. Openai o1 System Card. arXiv preprint arXiv:2412.16720

[12] Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C. H.; Gonzalez, J.; Zhang, H.; and Stoica, I. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, 611--626

[13] Li, Y.; Huang, Y.; Yang, B.; Venkitesh, B.; Locatelli, A.; Ye, H.; Cai, T.; Lewis, P.; and Chen, D. 2024. Snapkv: LLM Knows What You Are Looking for Before Generation. Advances in Neural Information Processing Systems, 37: 22947--22970

[14] Li, Y.; Jiang, H.; Wu, Q.; Luo, X.; Ahn, S.; Zhang, C.; Abdi, A. H.; Li, D.; Gao, J.; Yang, Y.; and Qiu, L. 2025. SCBench: A KV Cache-Centric Analysis of Long-Context Methods. arXiv:2412.10319

[15] Liu, A.; Liu, J.; Pan, Z.; He, Y.; Haffari, R.; and Zhuang, B. 2024. Minicache: KV Cache Compression in Depth Dimension for Large Language Models. Advances in Neural Information Processing Systems, 37: 139997--140031

[16] MAA. 2025. American Invitational Mathematics Examination - AIME

[17] Patel, P.; Choukse, E.; Zhang, C.; Shah, A.; Goiri, Í.; Maleki, S.; and Bianchini, R. 2024. Splitwise: Efficient Generative LLM Inference using Phase Splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 118--132. IEEE

[18] Qwen. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning

[19] Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

[20] Rein, D.; Hou, B. L.; Stickland, A. C.; Petty, J.; Pang, R. Y.; Dirani, J.; Michael, J.; and Bowman, S. R. 2023. GPQA: A Graduate Level Google-Proof QA Benchmark. arXiv:2311.12022

[21] Ribar, L.; Chelombiev, I.; Hudlass-Galley, L.; Blake, C.; Luschi, C.; and Orr, D. 2023. Sparq Attention: Bandwidth-Efficient LLM Inference. arXiv preprint arXiv:2312.04985

[22] Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; and Li, H. 2021. Efficient Attention: Attention with Linear Complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 3531--3539

[23] Tan, X.; Chen, Y.; Jiang, Y.; Chen, X.; Yan, K.; Duan, N.; Zhu, Y.; Jiang, D.; and Xu, H. 2025. DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training. arXiv:2502.07590

[24] Tang, J.; Zhao, Y.; Zhu, K.; Xiao, G.; Kasikci, B.; and Han, S. 2024. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. arXiv:2406.10774

[25] Virmaux, A.; and Scaman, K. 2018. Lipschitz Regularity of Deep Neural Networks: Analysis and Efficient Estimation. Advances in Neural Information Processing Systems, 31

[26] Wan, Z.; Wu, X.; Zhang, Y.; Xin, Y.; Tao, C.; Zhu, Z.; Wang, X.; Luo, S.; Xiong, J.; Wang, L.; et al. 2025. D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models. In ICLR

[27] Wang, Z.; Jin, B.; Yu, Z.; and Zhang, M. 2024. Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks. arXiv preprint arXiv:2407.08454

[28] Xiao, G.; Tian, Y.; Chen, B.; Han, S.; and Lewis, M. 2023. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453

[29] Xu, F.; Hao, Q.; Zong, Z.; Wang, J.; Zhang, Y.; Wang, J.; Lan, X.; Gong, J.; Ouyang, T.; Meng, F.; et al. 2025. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models. arXiv preprint arXiv:2501.09686

[30] Yang, J. Y.; Kim, B.; Bae, J.; Kwon, B.; Park, G.; Yang, E.; Kwon, S. J.; and Lee, D. 2024. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization. arXiv preprint arXiv:2402.18096

[31] Zaheer, M.; Guruganesh, G.; Dubey, K. A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. 2020. Big Bird: Transformers for Longer Sequences. Advances in neural information processing systems, 33: 17283--17297

[32] Zhang, Z.; Sheng, Y.; Zhou, T.; Chen, T.; Zheng, L.; Cai, R.; Song, Z.; Tian, Y.; Ré, C.; Barrett, C.; et al. 2023. H2o: Heavy-hitter Oracle for Efficient Generative Inference of Large Language Models. Advances in Neural Information Processing Systems, 36: 34661--34710

[33] Zheng, L.; Yin, L.; Xie, Z.; Sun, C. L.; Huang, J.; Yu, C. H.; Cao, S.; Kozyrakis, C.; Stoica, I.; Gonzalez, J. E.; et al. 2024. Sglang: Efficient Execution of Structured Language Model Programs. Advances in Neural Information Processing Systems, 37: 62557--62583

[34] Zhong, Y.; Liu, S.; Chen, J.; Hu, J.; Zhu, Y.; Liu, X.; Jin, X.; and Zhang, H. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 193--210
