
arXiv: 2603.11504 · v2 · submitted 2026-03-12 · 💻 cs.LG · cs.CL

LongFlow: Efficient KV Cache Compression for Reasoning Models

Pith reviewed 2026-05-15 13:06 UTC · model grok-4.3

keywords KV cache compression · reasoning models · token eviction · attention optimization · FlashAttention kernel · long output generation · throughput improvement · memory efficiency

The pith

LongFlow compresses the KV cache by 80 percent in long-output reasoning models, scoring token importance from the current query alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning models produce long outputs that swell the KV cache and raise memory and bandwidth costs during attention. LongFlow computes an importance score for each cached token from only the current query and an intermediate attention result, allowing eviction of low-value tokens without full-history recomputation. The metric adds negligible overhead and requires no auxiliary storage. A fused kernel combines FlashAttention, scoring, and eviction into one operator for system-level speed. Experiments report up to 11.8 times higher throughput at 80 percent compression with little accuracy loss.

Core claim

Token importance for eviction during extended generation can be estimated accurately enough using only the current query's interaction with an intermediate attention result, eliminating the need for expensive periodic full-history re-evaluation while adding almost zero compute or storage cost when implemented inside a custom fused attention kernel.

What carries the argument

Query-only importance score derived from an intermediate attention result, used to drive token eviction inside a single fused FlashAttention operator.
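
The mechanism can be illustrated with a minimal NumPy sketch of one decode step. It rests on an assumption that the source does not confirm: the importance score is taken to be the softmax attention weight of the current query, whereas the paper's exact formula is left unspecified here. The function name `attend_and_evict` and the `keep_ratio` parameter are hypothetical, and the sketch runs scoring and eviction as separate array operations rather than inside a fused FlashAttention kernel.

```python
import numpy as np

def attend_and_evict(q, K, V, keep_ratio=0.2):
    """One decode step: attend with query q, then evict low-scoring tokens.

    ASSUMPTION: importance == softmax attention weight of the current query.
    The paper's actual metric may be a different function of the
    intermediate attention result.
    """
    d = q.shape[-1]
    logits = K @ q / np.sqrt(d)              # (n,) scaled dot products
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    out = weights @ V                        # attention output for this step

    # Keep the highest-weight tokens; everything else is evicted for good.
    n_keep = max(1, int(round(keep_ratio * K.shape[0])))
    keep = np.sort(np.argsort(weights)[-n_keep:])  # restore cache order
    return out, K[keep], V[keep], keep
```

Because the weights are already needed to produce `out`, the score costs no extra pass over the cache in this sketch, which is the property the core claim depends on. With `keep_ratio=0.2` the cache retains one token in five, matching the 80 percent compression figure.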

If this is right

  • Memory footprint for long reasoning traces shrinks dramatically without retraining.
  • Bandwidth pressure on attention drops, directly raising tokens per second.
  • Models such as o1-style reasoners become cheaper to serve at scale.
  • Longer reasoning chains fit on the same hardware before hitting context limits.
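
To make the memory consequence concrete, the following sketch computes KV cache size before and after 80 percent eviction. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 32,000-token trace are illustrative assumptions, not figures from the paper, and the helper names are hypothetical.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes held by the KV cache of one sequence (keys plus values)."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kept_tokens(n_tokens, compression_pct):
    """Tokens remaining after evicting compression_pct percent of the cache."""
    return n_tokens * (100 - compression_pct) // 100

# Hypothetical configuration, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
full = kv_cache_bytes(32_000, **shape)
compressed = kv_cache_bytes(kept_tokens(32_000, 80), **shape)

print(f"full cache:       {full / 2**30:.2f} GiB")
print(f"80% compressed:   {compressed / 2**30:.2f} GiB")
```

Under these assumptions each token costs 128 KiB, so the trace drops from about 3.9 GiB to about 0.78 GiB per sequence. The same reduction applies to the bytes read per attention step, which is where the bandwidth relief comes from.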

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lightweight scoring may transfer to other long-generation tasks where full re-scoring is costly.
  • Compression ratio could be made dynamic, tightening as output length grows.
  • If recent queries carry enough signal for eviction, similar shortcuts may exist in structured generation beyond reasoning.

Load-bearing premise

An importance score computed from the current query and one intermediate attention result remains reliable for the entire length of a long output sequence without re-evaluation.

What would settle it

Measure accuracy on a math or code generation benchmark after forcing 80 percent eviction with LongFlow; if accuracy falls more than a few points below the full-cache baseline, the claim does not hold.
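
The test described above reduces to a comparison of two accuracy numbers. A minimal sketch follows, assuming exact-match scoring and reading "a few points" as 3 percentage points; both choices, and the function names, are assumptions made for illustration.

```python
def accuracy(preds, golds):
    """Fraction of predictions that exactly match the reference answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def claim_holds(full_preds, compressed_preds, golds, max_drop_points=3.0):
    """Compare full-cache and compressed-cache accuracy.

    Returns (holds, drop), where drop is the accuracy loss in percentage
    points. max_drop_points=3.0 is an assumed reading of "a few points".
    """
    drop = 100.0 * (accuracy(full_preds, golds) - accuracy(compressed_preds, golds))
    return drop <= max_drop_points, drop
```

In practice `full_preds` and `compressed_preds` would come from running the same model on the same prompts with and without eviction, ideally over several seeds so the drop can be compared against run-to-run variance.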

Original abstract

Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8 times throughput improvement with 80% KV cache compression with minimal impact on model accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents LongFlow, a KV cache compression method for reasoning models that generate long outputs. It derives an importance metric from an intermediate attention result using only the current query vector, incurring negligible overhead and requiring no auxiliary storage. A custom fused kernel integrates FlashAttention, importance estimation, and token eviction. Experiments report up to 11.8× throughput gains at 80% KV cache compression with minimal accuracy impact on reasoning tasks.

Significance. If the throughput and accuracy claims are robustly validated, LongFlow could meaningfully lower memory and bandwidth costs for deploying long-reasoning models such as o1 and R1. The parameter-free, query-local importance scoring avoids the re-evaluation overhead of prior methods and is a practical strength for continuous generation.

major comments (2)
  1. [Abstract] The central claim of up to 11.8× throughput improvement with 80% compression and “minimal impact on model accuracy” is stated without any information on benchmarks, baseline methods, number of runs, or statistical significance, leaving the performance result only moderately supported.
  2. [LongFlow Method] Method description of importance estimation: the score is computed solely from the current query and an intermediate attention result, with permanent eviction and no periodic full-history re-scoring. This design rests on the untested assumption that query-local importance remains stable across long reasoning trajectories; violation of the assumption would make early mis-evictions irrecoverable and could produce larger accuracy degradation than reported.
minor comments (2)
  1. [Experiments] Experiments section: add explicit statements of the exact datasets, tasks, and model sizes used, together with error bars or standard deviations across runs, to support reproducibility.
  2. [Method] Notation: clarify whether the importance metric is exactly the attention weight or a scaled variant; the current description leaves the precise formula ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to strengthen the abstract with experimental details and to include new analysis supporting the stability of our importance metric. Point-by-point responses follow.

  1. Referee: [Abstract] The central claim of up to 11.8× throughput improvement with 80% compression and “minimal impact on model accuracy” is stated without any information on benchmarks, baseline methods, number of runs, or statistical significance, leaving the performance result only moderately supported.

    Authors: We agree that the abstract should provide more context on the experimental setup. In the revised version, we have updated the abstract to specify the benchmarks (GSM8K, MATH, HumanEval), baselines (H2O, SnapKV, and uncompressed KV cache), and hardware (NVIDIA A100), and to state that results are averaged over 5 runs, with standard deviations reported in the main text and tables. These additions directly support the 11.8× throughput and accuracy claims.

    Revision: yes

  2. Referee: [LongFlow Method] Method description of importance estimation: the score is computed solely from the current query and an intermediate attention result, with permanent eviction and no periodic full-history re-scoring. This design rests on the untested assumption that query-local importance remains stable across long reasoning trajectories; violation of the assumption would make early mis-evictions irrecoverable and could produce larger accuracy degradation than reported.

    Authors: The referee correctly notes that the method relies on the stability of query-local scores without re-scoring. While our original experiments demonstrated low overall accuracy loss on long outputs, we did not include an explicit temporal stability analysis. We have added Section 3.4 with new experiments and visualizations tracking importance score consistency over generation steps (up to 12k tokens), showing that early evictions rarely affect final accuracy (degradation remains <1%). This provides evidence that the assumption holds in practice. We retain the no-re-scoring design for its efficiency and acknowledge that the added analysis strengthens the claims.

    Revision: partial
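
One way to probe the stability assumption raised in the second major comment is to measure how much the set of top-scoring tokens changes between generation steps. The sketch below uses Jaccard overlap of top-k index sets. This is a hypothetical diagnostic offered for illustration, not the analysis the simulated rebuttal attributes to Section 3.4, and the function names are assumptions.

```python
import numpy as np

def topk_set(scores, k):
    """Indices of the k highest-scoring cached tokens."""
    return set(np.argsort(scores)[-k:].tolist())

def topk_jaccard(scores_a, scores_b, k):
    """Jaccard overlap between the top-k token sets at two generation steps.

    Both score vectors must index the same cached tokens. A value near 1.0
    means the two queries agree on which tokens matter; a value near 0.0
    means a token evicted at one step could be needed at the other.
    """
    a, b = topk_set(scores_a, k), topk_set(scores_b, k)
    return len(a & b) / len(a | b)
```

Tracking this overlap between steps that are far apart, with scores taken from a full-cache run, would show whether early eviction decisions would still look reasonable thousands of tokens later.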

Circularity Check

0 steps flagged

No circularity: importance metric follows directly from standard attention arithmetic without self-reference or fitted-input reduction

Full rationale

The paper's core derivation presents the importance score as an immediate byproduct of the attention computation performed with the current query vector alone. No equation defines the eviction policy or compression ratio in terms of itself, no parameter is fitted on a subset and then relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation to force the design. The reported throughput gains are measured separately via a fused kernel, and accuracy impact is assessed empirically rather than derived tautologically from the importance definition. The method therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method introduces one lightweight importance heuristic whose exact functional form is not specified in the abstract; no explicit free parameters, axioms, or new physical entities are declared.

pith-pipeline@v0.9.0 · 5504 in / 1153 out tokens · 51660 ms · 2026-05-15T13:06:31.205349+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    AMS KV compression adaptively partitions the cache by attention mass regions and assigns quotas to protect contiguous reasoning blocks during long-context LLM inference.