arxiv: 2602.16284 · v2 · pith:XID3AKEVnew · submitted 2026-02-18 · 💻 cs.LG

Fast KV Compaction via Attention Matching

keywords compaction · attention · space · compact · contexts · fast · highly · latent

Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
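
As a rough illustration of what matching attention outputs for a single KV head could look like, the NumPy sketch below keeps a subset of keys and then fits compact values by least squares, which is one example of a subproblem with a closed-form solution. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `compact_head`, the use of a set of reference queries `Q`, and the choice to keep the keys with the highest total attention weight are all hypothetical, and the attention-mass term mentioned in the abstract is not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compact_head(Q, K, V, t):
    """Compact one KV head from n entries down to t entries.

    Q: (m, d) reference queries (assumed to be available; hypothetical).
    K: (n, d) keys, V: (n, d_v) values, t: compact cache size.
    Returns compact keys, compact values, kept indices, and the
    full-cache attention outputs used as the fitting target.
    """
    d = K.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d))   # (m, n) full attention weights
    O = A @ V                           # (m, d_v) full attention outputs

    # Assumption: keep the t keys that receive the most total attention.
    keep = np.sort(np.argsort(A.sum(axis=0))[-t:])
    Ck = K[keep]

    # Attention weights over the compact keys only.
    Ac = softmax(Q @ Ck.T / np.sqrt(d))  # (m, t)

    # Least-squares fit: choose Cv to minimize ||Ac @ Cv - O||_F.
    Cv, *_ = np.linalg.lstsq(Ac, O, rcond=None)
    return Ck, Cv, keep, O

rng = np.random.default_rng(0)
Q = rng.normal(size=(64, 16))
K = rng.normal(size=(256, 16))
V = rng.normal(size=(256, 16))
Ck, Cv, keep, O = compact_head(Q, K, V, t=32)

Ac = softmax(Q @ Ck.T / np.sqrt(K.shape[1]))
err_fit = np.linalg.norm(Ac @ Cv - O)         # fitted compact values
err_naive = np.linalg.norm(Ac @ V[keep] - O)  # original values of kept keys
```

Because the original values of the kept keys are one feasible choice for the compact values, the least-squares fit cannot do worse than them on the reference queries; how well either generalizes to unseen queries is a separate question that this sketch does not address.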



Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Nearly Optimal Attention Coresets

    cs.DS 2026-05 unverdicted novelty 8.0

    ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

  2. Parallel Context Compaction for Long-Horizon LLM Agent Serving

    cs.AI 2026-05 unverdicted novelty 6.0

    Parallel compaction for LLM agent context management provides predictable volume control and reduces wall time versus sequential baselines on HotpotQA and LoCoMo.

  3. Nectar: Neural Estimation of Cached-Token Attention via Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.

  4. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.

  5. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

  6. A Simple Plug-in for Improving Eviction-Based KV Cache Compression

    cs.LG 2026-05 unverdicted novelty 4.0

    VECTOR augments eviction-based KV cache compression with three-way token routing that combines importance scoring and offline regression-based reconstructability estimation to improve quality at high compression ratios.

  7. Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

    cs.CL 2026-04 unverdicted novelty 4.0

    A minimalist retrieval-and-generation framework using turn isolation and query-driven pruning outperforms complex memory systems by directly addressing signal sparsity and dual-level redundancy in dialogues.