ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Akshat Ramachandran; Brucek Khailany; Charbel Sakr; Marina Neseem; Rangharajan Venkatesan; Tushar Krishna

arxiv: 2510.01290 · v2 · submitted 2025-10-01 · 💻 cs.LG

Pith reviewed 2026-05-18 10:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords KV cache compression · chain of thought · reasoning models · attention sparsity · inference optimization · quantization · token eviction · large language models

The pith

ThinKV maintains near-lossless accuracy in reasoning models by compressing the KV cache to less than 5% of original size using thought-adaptive eviction and quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models produce extended chains of thought that cause the key-value cache to grow rapidly and exhaust GPU memory. The paper shows that attention sparsity patterns distinguish different types of thoughts according to their importance for reaching the final answer. It therefore assigns quantization precision to tokens according to the importance of the thought they belong to, and progressively removes tokens from less critical thoughts as the reasoning continues. A supporting kernel extends PagedAttention so that the memory slots freed by eviction can be reused efficiently. The reported result is substantially higher inference throughput while answer quality stays nearly unchanged on mathematics and coding tasks.

Core claim

Attention sparsity within the chain of thought reveals distinct thought types that differ in importance to the final answer. ThinKV therefore uses a hybrid strategy that assigns quantization precision according to thought importance and progressively evicts tokens belonging to less critical thoughts as reasoning trajectories evolve. When paired with a kernel that reuses memory slots from evicted tokens without compaction overhead, the method reduces the KV cache to under 5 percent of its original size while preserving near-lossless accuracy on mathematics and coding benchmarks for models such as DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason.
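
To make the hybrid strategy concrete, here is a minimal sketch of one compression pass. It is not the authors' implementation. The thresholds, bit-widths, keep fractions, the `Thought` record, and the choice to treat higher sparsity as lower importance are all assumptions made for illustration; the source states none of them.

```python
import math
from dataclasses import dataclass

# Hypothetical settings: the source does not give thresholds, bit-widths,
# or keep fractions. Treating high sparsity as low importance is also an
# assumption made only for this sketch.
SPARSITY_LOW = 0.6
SPARSITY_HIGH = 0.9
BITS = {"high": 8, "medium": 4, "low": 2}
KEEP_FRACTION = {"high": 1.0, "medium": 0.5, "low": 0.25}


@dataclass
class Thought:
    token_ids: list      # positions of this thought's tokens in the sequence
    token_scores: list   # per-token attention mass (hypothetical signal)
    sparsity: float      # fraction of near-zero attention weights in the segment


def classify(thought):
    """Map a thought's attention sparsity to an importance level."""
    if thought.sparsity < SPARSITY_LOW:
        return "high"
    if thought.sparsity < SPARSITY_HIGH:
        return "medium"
    return "low"


def compress(thoughts):
    """Return {token_id: bits} for retained tokens; evicted tokens are absent."""
    plan = {}
    for thought in thoughts:
        level = classify(thought)
        keep = max(1, math.ceil(len(thought.token_ids) * KEEP_FRACTION[level]))
        # Within a thought, retain the tokens with the largest attention mass.
        ranked = sorted(zip(thought.token_scores, thought.token_ids), reverse=True)
        for _, token_id in ranked[:keep]:
            plan[token_id] = BITS[level]
    return plan
```

A progressive variant would call `compress` again as new thoughts arrive, so that older low-importance thoughts shrink further over time. The sketch performs a single pass only.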

What carries the argument

The thought-adaptive hybrid quantization-eviction strategy that classifies tokens by importance using attention sparsity patterns observed in the chain of thought.

If this is right

Preserves near-lossless accuracy on mathematics and coding benchmarks across several reasoning models.
Delivers up to 5.8 times higher inference throughput compared with prior state-of-the-art cache compression methods.
Reduces KV cache memory footprint to less than 5 percent of the uncompressed size.
Enables efficient memory reuse through an extended PagedAttention kernel that avoids compaction overhead after evictions.
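
The slot-reuse idea in the last point can be sketched with a free list. This is a CPU-side illustration only, under the assumption that each token occupies one fixed-size slot. The `SlotPool` class and its methods are hypothetical names and do not describe the paper's kernel or any PagedAttention API.

```python
class SlotPool:
    """Fixed pool of KV slots. Evicted slots go on a free list, so the
    surviving entries never have to be moved (no compaction)."""

    def __init__(self, num_slots):
        # Reversed so that pop() hands out slot 0 first.
        self.free = list(range(num_slots - 1, -1, -1))
        self.slot_of = {}  # token position -> physical slot index

    def append(self, token_pos):
        """Place a new token's KV entry in any free slot."""
        if not self.free:
            raise MemoryError("no free KV slot; evict before appending")
        slot = self.free.pop()
        self.slot_of[token_pos] = slot
        return slot

    def evict(self, token_pos):
        """Release a token's slot for reuse without touching other tokens."""
        self.free.append(self.slot_of.pop(token_pos))
```

Because lookups go through the `slot_of` mapping, a newly generated token can land in a hole left by an evicted one while every other token keeps its physical location.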

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same attention-based importance classification could be tested for compressing context windows in long-document generation tasks outside reasoning.
Hardware schedulers might incorporate similar progressive eviction logic to support larger context sizes on existing accelerators.
Further experiments could check whether the observed thought-type distinctions remain stable when models are fine-tuned on different reasoning domains.

Load-bearing premise

Attention sparsity patterns reliably reveal distinct thought types of varying importance within the chain of thought, so that selective eviction and quantization can be performed without harming final answer quality.

What would settle it

Running ThinKV on a reasoning model and observing a clear accuracy drop on a standard mathematics benchmark after the same attention-based thought classification and eviction would show the central premise does not hold.

Original abstract

The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over state-of-the-art baselines.
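
For a sense of scale, the sketch below works through the standard KV cache size formula and shows how eviction and quantization multiply. The model dimensions, sequence length, keep fraction, and bit-width are hypothetical and not taken from the paper, and the fraction ignores quantization metadata such as scales.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    """Uncompressed KV cache size for one sequence (factor 2 = keys + values)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def compressed_fraction(keep_fraction, bits, baseline_bits=16):
    """Share of the original cache left after evicting tokens and quantizing
    the rest; metadata overhead is ignored."""
    return keep_fraction * bits / baseline_bits


# Hypothetical configuration: 32 layers, 8 KV heads, head_dim 128,
# 32768 generated tokens, 16-bit (2-byte) elements.
full = kv_cache_bytes(32, 8, 128, 32768, 2)
# Hypothetical policy: keep 20% of tokens at 4 bits each.
fraction = compressed_fraction(0.2, 4)
print(f"full cache: {full / 2**30:.1f} GiB, compressed share: {fraction:.0%}")
```

Under these assumed numbers, keeping one token in five at a quarter of the baseline precision lands at the 5% mark. Neither eviction nor quantization alone reaches that ratio at such moderate settings.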

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ThinKV gives a practical hybrid way to shrink KV caches in long CoT reasoning by tagging thoughts via attention sparsity and mixing quantization with eviction, plus a paged kernel for reuse.

ThinKV is worth a look if you're dealing with memory limits on long reasoning traces. The core idea is to watch attention patterns to figure out which parts of the chain of thought matter less, then compress or drop those tokens selectively. What the paper actually brings is a hybrid method that quantizes some tokens and evicts others, based on classifying thoughts by importance inferred from sparsity. They also built a kernel on top of PagedAttention to reuse the freed memory slots efficiently, which avoids the usual overhead of reorganizing the cache.

The experiments run on distilled reasoning models and a couple of others, on math and coding tasks, which is reasonable for testing reasoning. They report nearly unchanged accuracy with less than 5% of the cache and throughput up to 5.8 times better than baselines. If the full results include proper ablations and multiple runs, this could be a practical win for running these models on smaller GPUs.

One area to check is whether the importance labels stay reliable over long sequences. The stress test raises a fair point: a token that gets low attention early might still be crucial for a conclusion many steps later. The paper should demonstrate that its progressive eviction doesn't lose those critical pieces, perhaps through some recovery mechanism or by showing that failure cases are rare. Without that, the near-lossless claim could be brittle on harder problems.

Overall, this is aimed at engineers and researchers focused on inference optimization for large language models. Anyone trying to deploy chain-of-thought systems with tight memory budgets would find the approach and numbers relevant. The method is grounded in observed patterns rather than pure theory, which makes it straightforward to test. I think it should go to peer review. The practical results are strong enough to warrant detailed feedback on the experimental design and the robustness of the thought classification.

Referee Report

2 major / 2 minor

Summary. The paper proposes ThinKV, a thought-adaptive KV cache compression method for long chain-of-thought reasoning models. It observes that attention sparsity patterns distinguish thought types of varying importance, then applies a hybrid quantization-eviction policy that progressively reduces precision and evicts tokens from lower-importance thoughts while extending PagedAttention with a custom kernel to reuse freed memory slots without compaction. Experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding tasks report near-lossless accuracy at under 5% of the original KV cache size together with up to 5.8x higher inference throughput versus prior baselines.

Significance. If the empirical claims hold, ThinKV would materially lower the memory barrier for extended reasoning trajectories, allowing longer CoT without proportional hardware scaling and delivering substantial throughput gains. The practical kernel extension for paged memory reuse and the hybrid eviction-quantization design constitute concrete engineering contributions that could be adopted in production inference stacks.

major comments (2)

[Abstract and §4 (Experiments)] The central 'near-lossless accuracy' claim with <5% KV cache is presented without error bars, data-split details, or exclusion criteria. This omission directly undermines verification of the accuracy-throughput tradeoff that constitutes the paper's primary result.
[§3.2 (Thought-Adaptive Eviction)] The premise that attention sparsity at eviction time reliably labels entire thoughts as low-importance and stationary is load-bearing for the safety of eviction. No analysis is provided of whether early low-attention tokens can carry information required only 20–50 steps later; if stationarity fails on even a modest fraction of trajectories, the near-lossless guarantee at 5% cache size does not hold.
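
One way to probe the stationarity concern in the second comment is to replay an uncompressed attention trace and measure how much attention later steps place on tokens that would have been evicted. The sketch below assumes such a trace is available as a list of per-step attention rows. The function name and the synthetic matrix are hypothetical, and this is not an analysis from the paper.

```python
def late_attention(attn, evicted, evict_step):
    """Largest attention mass that any query step after `evict_step` places
    on the evicted token positions, measured on an uncompressed trace.

    attn[t][j] is the attention weight from query step t to key position j.
    """
    worst = 0.0
    for step in range(evict_step + 1, len(attn)):
        row = attn[step]
        mass = sum(row[j] for j in evicted if j < len(row))
        worst = max(worst, mass)
    return worst


# Synthetic trace (not real model output): position 1 looks unimportant
# through step 2, then receives most of the attention at step 3.
trace = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.10, 0.10, 0.80, 0.00],
    [0.05, 0.60, 0.05, 0.30],
]
print(late_attention(trace, evicted={1}, evict_step=1))
```

A value close to zero across many trajectories would support the stationarity assumption. A large value, as in this synthetic trace, marks a case where eviction discarded a token that later steps relied on.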

minor comments (2)

[§5 (Kernel Implementation)] Clarify the exact reuse mechanism for evicted slots and whether any additional synchronization cost is incurred under concurrent decoding.
[Table 2] Report the precise KV-cache size percentages and throughput numbers for each baseline rather than only the best-case 5.8x figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [Abstract and §4 (Experiments)] The central 'near-lossless accuracy' claim with <5% KV cache is presented without error bars, data-split details, or exclusion criteria. This omission directly undermines verification of the accuracy-throughput tradeoff that constitutes the paper's primary result.

Authors: We agree that additional statistical details would improve verifiability. In the revised version we will report error bars as standard deviations computed over five independent runs with different random seeds for all accuracy metrics. We will also specify the exact data splits (e.g., the MATH and GSM8K subsets) and state that no trajectories were excluded beyond standard filtering for syntactically valid CoT outputs. These changes will be reflected in both the abstract and §4.
revision: yes

Referee: [§3.2 (Thought-Adaptive Eviction)] The premise that attention sparsity at eviction time reliably labels entire thoughts as low-importance and stationary is load-bearing for the safety of eviction. No analysis is provided of whether early low-attention tokens can carry information required only 20–50 steps later; if stationarity fails on even a modest fraction of trajectories, the near-lossless guarantee at 5% cache size does not hold.

Authors: The concern about potential non-stationarity is well-taken. Section 3.2 grounds the eviction policy in observed attention sparsity patterns that distinguish thought importance as trajectories evolve. Although we do not provide a dedicated forward-looking analysis of whether early low-attention tokens may become relevant 20–50 steps later, the hybrid policy continuously re-assesses importance, and the empirical results across three models and multiple benchmarks show near-lossless accuracy at <5% cache size. We will add a short discussion of this stationarity assumption and its empirical support in the revised manuscript; a full theoretical treatment or new long-horizon ablation would require substantial additional experiments beyond the current scope.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical framework based on observed attention patterns

The paper frames ThinKV as an applied engineering method driven by empirical observations of attention sparsity revealing thought types in CoT trajectories. No derivation chain, first-principles result, or prediction is claimed that reduces by construction to fitted inputs, self-definitions, or self-citation load-bearing premises. The hybrid quantization-eviction strategy and kernel extension are presented as direct responses to measured patterns and performance needs rather than any self-referential loop. The approach remains self-contained against external benchmarks as a compression technique without requiring the target accuracy claims to be presupposed in its construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on one key domain assumption about attention patterns and introduces a small number of adaptation parameters whose exact values are not detailed in the abstract.

free parameters (1)

thought importance thresholds
Parameters used to classify thoughts and decide eviction or quantization levels based on attention sparsity.

axioms (1)

domain assumption: Attention sparsity reveals distinct thought types with varying importance within chain-of-thought reasoning.
This observation is stated as the basis for assigning token precision and performing progressive eviction.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning
cs.LG 2026-05 unverdicted novelty 6.0
AMS KV compression adaptively partitions the cache by attention mass regions and assigns quotas to protect contiguous reasoning blocks during long-context LLM inference.

ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0
ArborKV uses search-structure awareness to evict low-reuse KV states in Tree-of-Thoughts inference, delivering up to 4x memory savings with near-full accuracy retention.

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
cs.LG 2026-05 unverdicted novelty 6.0
OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

MEMENTO: Teaching LLMs to Manage Their Own Context
cs.AI 2026-04 unverdicted novelty 6.0
MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
cs.AI 2025-12 conditional novelty 6.0
SkipKV performs sentence-level KV eviction using similarity scoring and dynamically adjusts hidden states via a steering vector to produce shorter, accurate CoT outputs, delivering up to 26.7% higher accuracy and 1.7x...