arxiv: 2510.08525 · v3 · pith:T25BRUPEnew · submitted 2025-10-09 · 💻 cs.CL

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

classification 💻 cs.CL
keywords reasoning, cache, heads, compression, generation, directly, essential, existing

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others with constant-size KV cache. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20–60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.
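
The allocation step described in the abstract (full cache for reasoning-critical heads, constant-size cache for the rest) can be illustrated with a small sketch. This is not the RLKV implementation: the per-head scores are hypothetical placeholders for whatever the RL probe would learn, and the constant-size cache is assumed here to be a few initial "sink" tokens plus a recent window, which the abstract does not specify. The names `allocate_head_budgets`, `cache_reduction`, `sink`, and `window` are invented for illustration.

```python
from typing import List


def allocate_head_budgets(
    scores: List[float],
    keep_fraction: float,
    seq_len: int,
    sink: int = 4,      # assumed: initial tokens always kept by compressed heads
    window: int = 64,   # assumed: most recent tokens kept by compressed heads
) -> List[int]:
    """Return the number of cached tokens for each head.

    The top `keep_fraction` of heads by score keep the full sequence;
    every other head keeps a constant-size cache of sink + window tokens.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    n_full = round(keep_fraction * len(scores))
    # Rank head indices from highest to lowest (hypothetical) importance.
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    full_heads = set(ranked[:n_full])
    # A constant budget can never exceed the tokens actually generated so far.
    const_budget = min(seq_len, sink + window)
    return [seq_len if h in full_heads else const_budget for h in range(len(scores))]


def cache_reduction(budgets: List[int], seq_len: int) -> float:
    """Fraction of KV cache entries saved relative to a full cache on every head."""
    return 1.0 - sum(budgets) / (len(budgets) * seq_len)


# Hypothetical importance scores for 8 heads; half of them keep a full cache.
scores = [0.9, 0.1, 0.7, 0.2, 0.05, 0.8, 0.3, 0.15]
budgets = allocate_head_budgets(scores, keep_fraction=0.5, seq_len=1024)
print(budgets)
print(f"{cache_reduction(budgets, 1024):.3f}")
```

Because the compressed heads hold a fixed number of tokens, the saving in this toy setup grows with sequence length and approaches the fraction of compressed heads for long generations.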

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...

  2. EarlyTom: Early Token Compression Completes Fast Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    EarlyTom is a training-free early token compression method inside the vision encoder with decoupled spatial selection that reduces TTFT up to 2.65x and FLOPs 61% on LLaVA-OneVision-7B while keeping accuracy comparable...

  3. CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

    cs.LG 2026-02 unverdicted novelty 6.0

    CompilerKV uses offline-compiled retention tables as portable priors to achieve SOTA prefill-only KV compression performance across backbones at low token budgets.

  4. HARD-KV: Head-Adaptive Regularization for Decoding-time KV Compression

    cs.LG 2026-06 unverdicted novelty 5.0

    HARD-KV bridges dynamic head-adaptive KV cache compression with static inference engine constraints via Cascade Cache and Logits Calibration, reporting up to 2x throughput gains on long-context math benchmarks.