
arxiv: 2510.01290 · v2 · submitted 2025-10-01 · 💻 cs.LG

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Pith reviewed 2026-05-18 10:54 UTC · model grok-4.3

classification 💻 cs.LG
keywords KV cache compression · chain of thought · reasoning models · attention sparsity · inference optimization · quantization · token eviction · large language models

The pith

ThinKV maintains near-lossless accuracy in reasoning models while compressing the KV cache to less than 5% of its original size through thought-adaptive eviction and quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models produce extended chains of thought, so the key-value (KV) cache grows rapidly and exhausts GPU memory. The paper observes that attention sparsity patterns distinguish types of thoughts that differ in importance for reaching the final answer. It therefore sets the quantization precision of each token according to the importance of the thought it belongs to, and progressively evicts tokens from less critical thoughts as reasoning continues. A supporting kernel extends PagedAttention so that the memory slots freed by eviction are reused without compaction. The reported result is up to 5.8x higher inference throughput than prior baselines, with near-lossless accuracy on mathematics and coding tasks.
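
To make the memory pressure concrete, the following is a minimal sketch of the standard KV cache size calculation. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) and the 32,768-token trajectory are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold keys and values for seq_len tokens."""
    # The factor 2 covers the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)

# Hypothetical model shape, chosen only for illustration.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=32_768)
budget = 0.05 * full  # the "under 5%" target stated in the summary

print(f"full cache: {full / 2**30:.2f} GiB")    # 4.00 GiB
print(f"5% budget:  {budget / 2**30:.2f} GiB")  # 0.20 GiB
```

Under these assumed dimensions a single 32K-token trajectory needs 4 GiB of cache, so a 5% budget leaves roughly 0.2 GiB per sequence.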

Core claim

Attention sparsity within the chain of thought reveals distinct thought types that differ in importance to the final answer. ThinKV therefore uses a hybrid strategy that assigns quantization precision according to thought importance and progressively evicts tokens belonging to less critical thoughts as reasoning trajectories evolve. When paired with a kernel that reuses memory slots from evicted tokens without compaction overhead, the method reduces the KV cache to under 5 percent of its original size while preserving near-lossless accuracy on mathematics and coding benchmarks for models such as DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason.
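
The hybrid strategy described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the tier names, bit-widths, sparsity cut points, the direction of the sparsity-to-importance mapping, and the keep-the-most-recent-tokens rule are all hypothetical, since this summary does not specify them.

```python
from dataclasses import dataclass

# Hypothetical tiers and bit-widths; the paper's actual values are not given here.
BITS_BY_TIER = {"high": 8, "medium": 4, "low": 2}

@dataclass
class Thought:
    token_ids: list   # cache positions of this thought's tokens
    sparsity: float   # fraction of near-zero attention weights, in [0, 1]

def sparsity(attn_row, eps=0.01):
    """Fraction of attention weights below eps for one query row."""
    return sum(w < eps for w in attn_row) / len(attn_row)

def classify(thought, low_cut=0.5, high_cut=0.8):
    """Map sparsity to an importance tier (cut points and direction assumed)."""
    if thought.sparsity < low_cut:
        return "high"
    if thought.sparsity < high_cut:
        return "medium"
    return "low"

def plan(thoughts, keep_fraction_low=0.25):
    """One step: return (bit-width per kept token, list of evicted tokens)."""
    bits, evicted = {}, []
    for t in thoughts:
        tier = classify(t)
        ids = t.token_ids
        if tier == "low":
            # Assumed rule: keep only the most recent tokens of the thought.
            n_keep = max(1, int(len(ids) * keep_fraction_low))
            evicted.extend(ids[:-n_keep])
            ids = ids[-n_keep:]
        for i in ids:
            bits[i] = BITS_BY_TIER[tier]
    return bits, evicted

thoughts = [Thought([0, 1, 2, 3], 0.3), Thought([4, 5, 6, 7], 0.9)]
bits, evicted = plan(thoughts)
print(bits)     # {0: 8, 1: 8, 2: 8, 3: 8, 7: 2}
print(evicted)  # [4, 5, 6]
```

Calling `plan` again on later steps with updated sparsity values would shrink low-tier thoughts further, which is one way to read "progressively evicts".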

What carries the argument

The thought-adaptive hybrid quantization-eviction strategy that classifies tokens by importance using attention sparsity patterns observed in the chain of thought.

If this is right

  • Preserves near-lossless accuracy on mathematics and coding benchmarks across several reasoning models.
  • Delivers up to 5.8 times higher inference throughput compared with prior state-of-the-art cache compression methods.
  • Reduces KV cache memory footprint to less than 5 percent of the uncompressed size.
  • Enables efficient memory reuse through an extended PagedAttention kernel that avoids compaction overhead after evictions.
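
The memory-reuse idea in the last bullet can be illustrated with a free list. This is a host-side Python sketch of the general technique only; it does not model PagedAttention block tables or the paper's GPU kernel, and `SlotPool` is a hypothetical name.

```python
class SlotPool:
    """Fixed pool of KV slots; evicted slots are recycled through a free list."""

    def __init__(self, num_slots):
        # Reversed so that pop() hands out slot 0 first.
        self.free = list(range(num_slots - 1, -1, -1))
        self.slot_of = {}  # token position -> physical slot

    def write(self, token_pos):
        """Assign a physical slot to a new token and return its index."""
        if not self.free:
            raise MemoryError("no free KV slot; evict before writing")
        slot = self.free.pop()
        self.slot_of[token_pos] = slot
        return slot

    def evict(self, token_pos):
        """Release a token's slot; surviving tokens are not moved."""
        self.free.append(self.slot_of.pop(token_pos))

pool = SlotPool(4)
for pos in range(4):
    pool.write(pos)       # tokens 0..3 occupy slots 0..3
pool.evict(1)             # slot 1 becomes free, nothing is compacted
print(pool.write(4))      # 1: the new token reuses the freed slot
```

Because surviving tokens keep their slots, eviction costs one list append rather than a copy of the remaining cache, at the price of tokens no longer being stored in position order.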

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention-based importance classification could be tested for compressing context windows in long-document generation tasks outside reasoning.
  • Hardware schedulers might incorporate similar progressive eviction logic to support larger context sizes on existing accelerators.
  • Further experiments could check whether the observed thought-type distinctions remain stable when models are fine-tuned on different reasoning domains.

Load-bearing premise

Attention sparsity patterns reliably reveal distinct thought types of varying importance within the chain of thought, so that selective eviction and quantization can be performed without harming final answer quality.

What would settle it

If ThinKV, run on a reasoning model with the same attention-based thought classification and eviction, produced a clear accuracy drop on a standard mathematics benchmark, the central premise would not hold.

Original abstract

The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ThinKV, a thought-adaptive KV cache compression method for long chain-of-thought reasoning models. It observes that attention sparsity patterns distinguish thought types of varying importance, then applies a hybrid quantization-eviction policy that progressively reduces precision and evicts tokens from lower-importance thoughts while extending PagedAttention with a custom kernel to reuse freed memory slots without compaction. Experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding tasks report near-lossless accuracy at under 5% of the original KV cache size together with up to 5.8x higher inference throughput versus prior baselines.

Significance. If the empirical claims hold, ThinKV would materially lower the memory barrier for extended reasoning trajectories, allowing longer CoT without proportional hardware scaling and delivering substantial throughput gains. The practical kernel extension for paged memory reuse and the hybrid eviction-quantization design constitute concrete engineering contributions that could be adopted in production inference stacks.

major comments (2)
  1. Abstract and §4 (Experiments): the central 'near-lossless accuracy' claim with <5% KV cache is presented without error bars, data-split details, or exclusion criteria. This omission directly undermines verification of the accuracy-throughput tradeoff that constitutes the paper's primary result.
  2. §3.2 (Thought-Adaptive Eviction): the premise that attention sparsity at eviction time reliably labels entire thoughts as low-importance and stationary is load-bearing for the safety of eviction. No analysis is provided of whether early low-attention tokens can carry information required only 20–50 steps later; if stationarity fails on even a modest fraction of trajectories, the near-lossless claim at <5% cache size does not hold.
minor comments (2)
  1. §5 (Kernel Implementation): clarify the exact reuse mechanism for evicted slots and whether any additional synchronization cost is incurred under concurrent decoding.
  2. Table 2: report the precise KV-cache size percentages and throughput numbers for each baseline rather than only the best-case 5.8x figure.
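
Major comment 1 asks for error bars. One common way to report them is the mean and sample standard deviation over repeated runs with different seeds, sketched below. The accuracy values are placeholders for illustration and are not results from the paper.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Mean and sample standard deviation over independent runs."""
    return mean(accuracies), stdev(accuracies)

# Placeholder accuracies for five seeds; NOT results from the paper.
runs = [0.50, 0.52, 0.48, 0.51, 0.49]
m, s = summarize(runs)
print(f"accuracy = {m:.3f} ± {s:.3f}")
```

`statistics.stdev` uses the n-1 denominator, which is the usual choice when a handful of runs stands in for the full distribution of seeds.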

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: Abstract and §4 (Experiments): the central 'near-lossless accuracy' claim with <5% KV cache is presented without error bars, data-split details, or exclusion criteria. This omission directly undermines verification of the accuracy-throughput tradeoff that constitutes the paper's primary result.

    Authors: We agree that additional statistical details would improve verifiability. In the revised version we will report error bars as standard deviations computed over five independent runs with different random seeds for all accuracy metrics. We will also specify the exact data splits (e.g., the MATH and GSM8K subsets) and state that no trajectories were excluded beyond standard filtering for syntactically valid CoT outputs. These changes will be reflected in both the abstract and §4.
    revision: yes

  2. Referee: §3.2 (Thought-Adaptive Eviction): the premise that attention sparsity at eviction time reliably labels entire thoughts as low-importance and stationary is load-bearing for the safety of eviction. No analysis is provided of whether early low-attention tokens can carry information required only 20–50 steps later; if stationarity fails on even a modest fraction of trajectories, the near-lossless claim at <5% cache size does not hold.

    Authors: The concern about potential non-stationarity is well-taken. Section 3.2 grounds the eviction policy in observed attention sparsity patterns that distinguish thought importance as trajectories evolve. Although we do not provide a dedicated forward-looking analysis of whether early low-attention tokens may become relevant 20–50 steps later, the hybrid policy continuously re-assesses importance, and the empirical results across three models and multiple benchmarks show near-lossless accuracy at <5% cache size. We will add a short discussion of this stationarity assumption and its empirical support in the revised manuscript; a full theoretical treatment or a new long-horizon ablation would require substantial additional experiments beyond the current scope.
    revision: partial
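
The stationarity question raised in point 2 could be probed with a measurement along the following lines. This is a hypothetical diagnostic, not an analysis from the paper: the function name, the thresholds `low` and `high`, and the use of a full uncompressed attention trace are all assumptions.

```python
def late_revival_rate(attn, evict_step, low=0.01, high=0.05, window=(20, 50)):
    """Fraction of tokens flagged low-attention at evict_step that receive
    attention >= high at some step inside the later window.

    attn[t][j] is the attention weight from query step t to token j (j <= t),
    recorded from an uncompressed run.
    """
    flagged = [j for j, w in enumerate(attn[evict_step]) if w < low]
    if not flagged:
        return 0.0
    start = evict_step + window[0]
    stop = min(evict_step + window[1], len(attn) - 1)
    revived = 0
    for j in flagged:
        # A token "revives" if any later query attends to it strongly.
        if any(attn[t][j] >= high for t in range(start, stop + 1)):
            revived += 1
    return revived / len(flagged)

# Toy causal trace: 60 steps, row t has t + 1 entries.
attn = [[0.0] * (t + 1) for t in range(60)]
attn[5] = [0.5, 0.001, 0.001, 0.2, 0.2, 0.098]  # tokens 1 and 2 look unimportant
attn[30][1] = 0.3                               # token 1 matters again at step 30
print(late_revival_rate(attn, evict_step=5))    # 0.5
```

A rate near zero across many trajectories would support the stationarity assumption; a substantial rate would indicate that eviction decisions made from attention at a single step can discard tokens that are needed later.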

Circularity Check

0 steps flagged

No significant circularity; empirical framework based on observed attention patterns

Full rationale

The paper frames ThinKV as an applied engineering method driven by empirical observations of attention sparsity revealing thought types in CoT trajectories. It claims no derivation chain, first-principles result, or prediction that reduces by construction to fitted inputs, self-definitions, or load-bearing self-citations. The hybrid quantization-eviction strategy and kernel extension are presented as direct responses to measured patterns and performance needs rather than as part of any self-referential loop. As a compression technique, the approach is evaluated against external benchmarks, and its construction does not presuppose the accuracy claims it is tested on.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on one key domain assumption about attention patterns and introduces a small number of adaptation parameters whose exact values are not detailed in the abstract.

free parameters (1)
  • thought importance thresholds
    Parameters used to classify thoughts and decide eviction or quantization levels based on attention sparsity.
axioms (1)
  • domain assumption: Attention sparsity reveals distinct thought types with varying importance within chain-of-thought reasoning.
    This observation is stated as the basis for assigning token precision and performing progressive eviction.


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    AMS KV compression adaptively partitions the cache by attention mass regions and assigns quotas to protect contiguous reasoning blocks during long-context LLM inference.

  2. ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    ArborKV uses search-structure awareness to evict low-reuse KV states in Tree-of-Thoughts inference, delivering up to 4x memory savings with near-full accuracy retention.

  3. OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

  4. MEMENTO: Teaching LLMs to Manage Their Own Context

    cs.AI 2026-04 unverdicted novelty 6.0

    MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

  5. SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

    cs.AI 2025-12 conditional novelty 6.0

    SkipKV performs sentence-level KV eviction using similarity scoring and dynamically adjusts hidden states via a steering vector to produce shorter, accurate CoT outputs, delivering up to 26.7% higher accuracy and 1.7x...