ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Hanxu Hou; Hao Shi; Linqi Song; Mengzhe Ruan; Peisong Wang; Shuang Qiu; Weizhi Fei; Wenhao Liu; Xiangyuan Wang; Yunhe Li

arxiv: 2606.11164 · v1 · pith:QWD3YKJRnew · submitted 2026-06-09 · 💻 cs.AI

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang, Linqi Song, Shuang Qiu

Pith reviewed 2026-06-27 13:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords KV cache compression · reasoning models · chain-of-thought · budget allocation · decoding-time methods · LLM inference optimization · mathematical reasoning

The pith

ReasonAlloc improves reasoning accuracy by allocating limited KV cache in a hierarchical layer-then-head manner instead of uniformly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that uniform budget allocation across layers and heads is suboptimal for the stepwise demands of long chain-of-thought reasoning in LLMs. It proposes ReasonAlloc, which uses an offline preallocation of layer budgets according to a captured Reasoning Wave pattern and an online reallocation among heads based on utility. This training-free approach is evaluated on mathematical reasoning tasks with several distilled reasoning models and demonstrates better results than uniform, SnapKV, and Pyramid-RKV baselines, particularly when the total KV budget is small. A reader would care because KV cache growth is a major bottleneck for deploying extended reasoning on resource-limited hardware.
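
To make the bottleneck concrete, here is a rough back-of-the-envelope sketch of KV cache size. The configuration values (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumptions chosen to resemble a Llama-style 8B model with grouped-query attention; they are not taken from the paper. Reading a "budget" as the number of tokens retained per KV head in every layer is likewise an assumption.

```python
def kv_cache_bytes(tokens, num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `tokens` positions.

    All default sizes are illustrative assumptions, not values from the paper.
    """
    # The leading 2 counts one key vector and one value vector per head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return tokens * per_token


MIB = 1024 ** 2

full_cot = kv_cache_bytes(32_768)  # an uncompressed 32K-token reasoning trace
budgeted = kv_cache_bytes(512)     # cache capped at 512 retained tokens

print(f"full: {full_cot / MIB:.0f} MiB, budgeted: {budgeted / MIB:.0f} MiB")
# prints "full: 4096 MiB, budgeted: 64 MiB"
```

Under these assumptions a single sequence costs 128 KiB per generated token, which is why a fixed budget matters far more for long reasoning traces than for short answers.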

Core claim

ReasonAlloc treats decoding-time KV cache compression as a hierarchical budget allocation problem consisting of offline layer-wise preallocation that follows an architecture-driven Reasoning Wave and online head-wise reallocation to real-time high-utility heads. When integrated with token eviction, this yields superior performance on MATH-500 and AIME 2024 benchmarks across DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B models compared to uniform-budget and other non-uniform methods, with the biggest improvements at budgets of 128-512 tokens.
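
The two-level split can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the Reasoning Wave or the head utility is turned into budgets, so the proportional rule, the per-slot floor, and the names `proportional_split` and `hierarchical_budgets` are all assumptions, and the demand and utility numbers are made up.

```python
def proportional_split(total, weights, floor=0):
    """Split an integer `total` across non-empty `weights`, at least `floor` each.

    Uses largest-remainder rounding so the parts always sum to `total`.
    Weights are assumed to be non-negative.
    """
    n = len(weights)
    spare = total - floor * n
    if spare < 0:
        raise ValueError("total is too small to give every slot its floor")
    wsum = sum(weights)
    # Fall back to an even split when no slot reports any demand.
    shares = [spare * w / wsum for w in weights] if wsum > 0 else [spare / n] * n
    budgets = [int(s) for s in shares]
    leftover = spare - sum(budgets)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - budgets[i], reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return [floor + b for b in budgets]


def hierarchical_budgets(total, layer_demand, head_utility, layer_floor=0, head_floor=0):
    """Layers first (offline demand), then heads inside each layer (online utility)."""
    layer_budgets = proportional_split(total, layer_demand, layer_floor)
    return [proportional_split(b, u, head_floor)
            for b, u in zip(layer_budgets, head_utility)]


# Hypothetical 4-layer, 4-head model with 2048 token slots in total.
layer_demand = [1, 3, 2, 2]                 # stand-in for a precomputed wave
head_utility = [[1, 1, 1, 1], [5, 1, 3, 1],
                [2, 2, 1, 1], [1, 0, 0, 1]]  # stand-in for real-time utility
budgets = hierarchical_budgets(2048, layer_demand, head_utility,
                               layer_floor=64, head_floor=16)
for layer, row in enumerate(budgets):
    print(layer, row, sum(row))
```

Passing equal weights at both levels reduces this sketch to the uniform allocation that the paper argues against, which makes it a convenient baseline switch.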

What carries the argument

The Reasoning Wave, defined as the architecture-driven demand pattern for layer-wise KV budgets, paired with a real-time utility metric for head-wise reallocation during autoregressive decoding.

If this is right

Outperforms uniform allocation and existing methods like SnapKV on math reasoning at constrained cache sizes.
Introduces negligible overhead and works as a plug-in with current eviction policies.
The gains are consistent across different model sizes and architectures tested.
Most effective when the KV budget is severely limited, allowing longer reasoning chains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Reasoning Wave may indicate fundamental differences in how early and late layers contribute to reasoning computations.
Similar hierarchical allocation could be explored for non-reasoning long-context tasks such as document summarization.
If the pattern holds across more models, it could lead to architecture-specific default cache profiles.

Load-bearing premise

A stable Reasoning Wave demand pattern exists that can be identified once offline and then used for all future instances without adjustment.

What would settle it

Running the method with layer budgets assigned randomly instead of according to the precomputed Reasoning Wave and finding no accuracy difference on the benchmarks would falsify the value of the specific allocation pattern.
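
One way to build such a control, sketched under assumptions: shuffle the wave-derived layer budgets so that the total and the set of budget values stay fixed and only the layer-to-budget assignment changes. The budget values and the function name `shuffled_control` are hypothetical, and the paper does not describe this experiment.

```python
import random


def shuffled_control(layer_budgets, seed=0):
    """Return a random permutation of the per-layer budgets.

    The total budget and the multiset of values are preserved, so any
    accuracy gap against the original order isolates the assignment itself.
    """
    rng = random.Random(seed)  # seeded so the control is reproducible
    control = list(layer_budgets)
    rng.shuffle(control)
    return control


wave_budgets = [288, 736, 512, 512]  # hypothetical wave-derived layer budgets
controls = [shuffled_control(wave_budgets, seed=s) for s in range(5)]
for c in controls:
    print(c, sum(c))
```

A shuffle can return the original order by chance, so a real experiment would likely discard such draws and average accuracy over several seeds.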

Original abstract

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ReasonAlloc's offline Reasoning Wave preallocation needs more proof of stability to support the claimed gains at small budgets.

The punchline here is that ReasonAlloc's gains at small budgets depend on an offline layer preallocation that assumes a stable Reasoning Wave, but there's no sign in the abstract that this pattern generalizes beyond the evaluated cases.

The new part is the hierarchical split for the decoding phase in reasoning models. It takes an architecture-driven layer budget captured once offline and pairs it with real-time head reallocation based on utility. This is distinct from the uniform or prefill-focused methods it compares against, and it is training-free, which is a plus for practical use. It does well by claiming to work as a plug-in with existing token eviction and showing the biggest improvements when the KV budget is tight, like 128-512 tokens, on those math benchmarks with the DeepSeek and AceReason models.

The soft spots are around the evidence and assumptions. The abstract mentions outperformance over R-KV, SnapKV, and Pyramid-RKV but skips any details on controls, how the wave is computed, or whether the same layer allocation works across different problems or held-out trajectories. If the wave shifts with harder problems, the static preallocation could be acting like a tuned hyperparameter rather than a robust feature. This is the same weakness flagged above as the load-bearing premise.

This is for people building or deploying reasoning LLMs where KV cache size limits long CoT runs on limited hardware. A reader focused on inference optimization would find the idea useful to consider, even if they have to dig into the full paper for the implementation.

It deserves peer review because it targets a genuine bottleneck with a simple, hierarchical idea that could be tested further. I would recommend sending it out for review rather than desk rejecting.

Referee Report

2 major / 1 minor

Summary. The paper proposes ReasonAlloc, a training-free hierarchical decoding-time KV cache budget allocation method for long CoT reasoning in LLMs. It combines an offline layer-wise preallocation that captures a claimed architecture-driven 'Reasoning Wave' pattern with an online head-wise reallocation based on real-time utility. Evaluations on MATH-500 and AIME 2024 using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B report outperformance over uniform-budget R-KV, SnapKV, and Pyramid-RKV, with largest gains at small budgets (128-512 tokens); the method is presented as plug-and-play with existing token-eviction policies and low overhead.

Significance. If the offline preallocation generalizes and the reported gains are robust, the approach could meaningfully improve memory efficiency for autoregressive reasoning without retraining, addressing a practical bottleneck in long-CoT inference. The training-free design and hierarchical split are positive features, but the central empirical claims rest on unverified stability of the Reasoning Wave across instances and models.

major comments (2)

[Abstract] The central claim of outperformance at small budgets (128-512 tokens) rests on the offline layer-wise preallocation capturing a stable, architecture-driven 'Reasoning Wave' that can be computed once and reused. No evidence is shown that this pattern remains near-optimal across different problems (e.g., MATH-500 vs. AIME 2024), held-out trajectories, or model scales; if the wave varies, the static preallocation reduces to an arbitrary fixed mask plus the online allocator, undermining the reported gains over Pyramid-RKV.
[Abstract] No details are provided on experimental controls, statistical significance testing, error bars, number of runs, or the precise measurement and validation procedure for the Reasoning Wave (e.g., how layer demand is quantified offline). This absence makes it impossible to assess whether the outperformance is load-bearing or sensitive to evaluation-set tuning.

minor comments (1)

[Abstract] The abstract refers to 'Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget)' without citing its origin or providing implementation details for reproducibility.
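
For readers unfamiliar with this kind of baseline, a static, monotonically decreasing layer budget can be sketched as below. The linear schedule, the `top_to_bottom_ratio` parameter, and the function name are assumptions made for illustration; the abstract does not specify Pyramid-RKV's actual rule.

```python
def decreasing_layer_budgets(total_budget, num_layers, top_to_bottom_ratio=4.0):
    """Linearly decreasing per-layer budgets that sum to `total_budget`.

    The first layer's weight is `top_to_bottom_ratio` times the last layer's.
    Hypothetical schedule for illustration; not Pyramid-RKV's actual rule.
    """
    if num_layers < 2:
        return [total_budget] * num_layers
    step = (top_to_bottom_ratio - 1.0) / (num_layers - 1)
    weights = [top_to_bottom_ratio - step * i for i in range(num_layers)]
    wsum = sum(weights)
    budgets = [int(total_budget * w / wsum) for w in weights]
    # Hand the rounding leftover to the earliest layers, which keeps the
    # sequence non-increasing.
    for i in range(total_budget - sum(budgets)):
        budgets[i % num_layers] += 1
    return budgets


print(decreasing_layer_budgets(2048, 4))
```

Unlike an allocation derived from measured demand, this schedule is fixed by depth alone, which is the property the paper contrasts with its own preallocation.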

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the corresponding revisions.

Referee: [Abstract] The central claim of outperformance at small budgets (128-512 tokens) rests on the offline layer-wise preallocation capturing a stable, architecture-driven 'Reasoning Wave' that can be computed once and reused. No evidence is shown that this pattern remains near-optimal across different problems (e.g., MATH-500 vs. AIME 2024), held-out trajectories, or model scales; if the wave varies, the static preallocation reduces to an arbitrary fixed mask plus the online allocator, undermining the reported gains over Pyramid-RKV.

Authors: We agree that the manuscript would be strengthened by explicit evidence that the Reasoning Wave is stable across problem distributions. Our current results show the same qualitative pattern on three models spanning Llama and Qwen architectures, which supports the architecture-driven interpretation, but we do not directly compare preallocations derived from MATH-500 versus AIME 2024. In the revision we will add (i) a side-by-side visualization of the layer-demand vectors obtained from each benchmark and (ii) a cross-benchmark transfer experiment that applies the MATH-500-derived preallocation to AIME 2024 (and vice versa) while keeping the online head-wise allocator fixed. These additions will clarify whether the static component is load-bearing or largely subsumed by the online allocator.

Revision: yes

Referee: [Abstract] No details are provided on experimental controls, statistical significance testing, error bars, number of runs, or the precise measurement and validation procedure for the Reasoning Wave (e.g., how layer demand is quantified offline). This absence makes it impossible to assess whether the outperformance is load-bearing or sensitive to evaluation-set tuning.

Authors: We acknowledge that the current manuscript omits these methodological details. The revised version will contain a new subsection that (a) specifies the exact offline metric used to compute layer demand (average per-layer attention mass over a small set of sampled CoT trajectories), (b) reports the number of independent runs performed for each configuration, (c) includes error bars on the primary accuracy-vs-budget plots, and (d) describes the statistical tests applied to the reported improvements. These additions will allow readers to evaluate robustness and sensitivity to evaluation-set choice.

Revision: yes
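
The checks promised in these responses can be sketched as follows, under stated assumptions. How attention mass is measured per layer is not specified, so it is treated here as a given input; the earth mover's distance is one possible way to compare two layer-demand vectors and is a choice made for this sketch. All function names and numbers are hypothetical.

```python
import math
import statistics


def layer_demand(attention_mass):
    """attention_mass[t][l]: mass measured at layer l on sampled trajectory t.

    Returns the per-layer mean over trajectories, normalized to sum to 1.
    """
    num_layers = len(attention_mass[0])
    means = [sum(traj[l] for traj in attention_mass) / len(attention_mass)
             for l in range(num_layers)]
    total = sum(means)
    return [m / total for m in means]


def emd_1d(p, q):
    """Earth mover's distance between two distributions over layer indices.

    With unit spacing between layers this is the sum of absolute differences
    of the running (cumulative) sums.
    """
    cum, dist = 0.0, 0.0
    for a, b in zip(p, q):
        cum += a - b
        dist += abs(cum)
    return dist


def mean_and_stderr(run_accuracies):
    """Mean and standard error over independent runs (needs at least 2 runs)."""
    mean = statistics.mean(run_accuracies)
    stderr = statistics.stdev(run_accuracies) / math.sqrt(len(run_accuracies))
    return mean, stderr


# Made-up measurements for a 3-layer toy model, two trajectories per benchmark.
wave_a = layer_demand([[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])
wave_b = layer_demand([[0.3, 0.4, 0.3], [0.3, 0.5, 0.2]])
print("EMD between the two waves:", emd_1d(wave_a, wave_b))
print("accuracy mean, stderr:", mean_and_stderr([0.52, 0.55, 0.50]))
```

A small distance between waves derived from different benchmarks would support reusing one preallocation, while a large one would point toward the tuned-hyperparameter reading raised by the referee.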

Circularity Check

0 steps flagged

No circularity; empirical method with no self-referential derivations

The provided abstract and description contain no equations, parameter-fitting steps, or derivations that reduce outputs to inputs by construction. The offline layer-wise preallocation is presented as capturing an observed architecture-driven pattern (Reasoning Wave) in a training-free manner, with online head-wise reallocation based on real-time utility; evaluations compare against baselines on standard benchmarks without any self-citation chains or uniqueness theorems invoked. No load-bearing claims reduce to fitted parameters renamed as predictions or ansatzes smuggled via prior self-work. This is a standard empirical contribution with independent content, so the circularity score is 0; whether the Reasoning Wave assumption is valid is a separate question from whether the argument is circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified existence of a stable Reasoning Wave pattern and the assumption that real-time head utility can be measured without additional training or significant overhead.

axioms (1)

domain assumption: Existence of a consistent architecture-driven 'Reasoning Wave' demand pattern across layers during reasoning decoding
Invoked to justify the offline layer-wise preallocation strategy

invented entities (1)

Reasoning Wave (no independent evidence)
purpose: To describe and exploit layer-wise context demand patterns for budget preallocation
New term introduced to motivate the offline component; no independent evidence provided in the abstract

pith-pipeline@v0.9.1-grok · 5820 in / 1303 out tokens · 19949 ms · 2026-06-27T13:07:51.593409+00:00 · methodology
