arxiv: 2509.21623 · v2 · submitted 2025-09-25 · 💻 cs.CL · cs.AI · cs.LG

OjaKV: Context-Aware Online Low-Rank KV Cache Compression

Pith reviewed 2026-05-18 13:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords KV cache · low-rank compression · online adaptation · long-context LLMs · memory efficiency · Oja's algorithm · attention optimization

The pith

OjaKV adapts low-rank projections online to compress KV caches while keeping or improving accuracy on long contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OjaKV to reduce the memory demands of the key-value (KV) cache in long-context language model generation. It keeps the first and most recent tokens at full rank and compresses the intermediate tokens with a low-rank projection whose basis is updated incrementally using Oja's algorithm. The basis receives a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, so that it follows context changes. A sympathetic reader would care because this could let models handle much longer inputs on existing hardware without retraining or losing performance on reasoning tasks.

Core claim

OjaKV combines a hybrid storage policy, which preserves the first and most recent tokens at full rank, with online subspace adaptation via Oja's algorithm for low-rank compression of intermediate tokens. A comprehensive update occurs during prompt prefilling and lightweight periodic updates occur during decoding, keeping the subspace aligned with the evolving context. The approach remains compatible with FlashAttention and maintains or improves zero-shot accuracy at high compression ratios, especially on very long-context benchmarks requiring complex reasoning.
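The hybrid storage policy can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`hybrid_store`, `hybrid_load`, `stored_fraction`), the anchor counts (4 first tokens, 32 recent tokens), and the dimensions are all hypothetical, and a random orthonormal basis stands in for the learned one.

```python
import numpy as np

def hybrid_store(K, U, n_first, n_recent):
    """Split an (n_tokens, d) matrix into full-rank anchors and a rank-r middle.

    U is a (d, r) orthonormal basis (hypothetical stand-in for the learned one).
    """
    n = K.shape[0]
    assert n_first + n_recent <= n
    split = n - n_recent
    return {
        "first": K[:n_first],            # kept at full rank
        "middle": K[n_first:split] @ U,  # (n_mid, r) low-rank coefficients
        "recent": K[split:],             # kept at full rank
    }

def hybrid_load(cache, U):
    """Rebuild an (n_tokens, d) approximation from the hybrid cache."""
    return np.concatenate(
        [cache["first"], cache["middle"] @ U.T, cache["recent"]], axis=0
    )

def stored_fraction(cache, d):
    """Stored floats relative to an uncompressed cache (basis not counted)."""
    n = sum(v.shape[0] for v in cache.values())
    stored = sum(v.size for v in cache.values())
    return stored / (n * d)

# Illustrative sizes only; none of these numbers come from the paper.
rng = np.random.default_rng(0)
d, r = 64, 16
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
K = rng.standard_normal((1000, d))
cache = hybrid_store(K, U, n_first=4, n_recent=32)
K_hat = hybrid_load(cache, U)
print(round(stored_fraction(cache, d), 3))
```

The anchors come back bit-exact while the middle rows are only approximated, which is the trade the policy makes. How the paper itself counts its compression ratio is not stated in this summary, so `stored_fraction` is only one plausible accounting.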

What carries the argument

The online adaptation of the projection basis using Oja's algorithm for principal component analysis, which dynamically tracks context shifts during both prefilling and decoding phases.
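A rough sketch of what one such update might look like, assuming a mini-batch form of Oja's rule followed by QR re-orthonormalization. The paper's exact update, step size, and schedule are not given in this summary, so `oja_step`, `subspace_gap`, the learning rate, and the synthetic data below are illustrative assumptions.

```python
import numpy as np

def oja_step(U, X, lr):
    """One mini-batch Oja step followed by QR re-orthonormalization.

    U: (d, r) current orthonormal basis; X: (n, d) newly observed rows.
    """
    grad = X.T @ (X @ U) / X.shape[0]   # sample covariance applied to U
    Q, _ = np.linalg.qr(U + lr * grad)  # restore orthonormal columns
    return Q

def subspace_gap(U, B):
    """Frobenius norm of the part of B that span(U) fails to capture."""
    return np.linalg.norm(B - U @ (U.T @ B))

# Synthetic check: data concentrated in a hidden 2-dimensional subspace B.
rng = np.random.default_rng(0)
d, r = 16, 2
B, _ = np.linalg.qr(rng.standard_normal((d, r)))  # hidden true subspace
U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # random starting basis
gap_before = subspace_gap(U, B)
for _ in range(100):
    Z = rng.standard_normal((64, r))
    X = 5.0 * Z @ B.T + 0.1 * rng.standard_normal((64, d))
    U = oja_step(U, X, lr=0.1)
gap_after = subspace_gap(U, B)
print(gap_before, gap_after)
```

On stationary data like this, the gap shrinks because each step amplifies the high-variance directions relative to the rest. Whether the same holds when the statistics move quickly is the open question the referee report raises.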

If this is right

  • Supports longer context lengths within the same memory budget.
  • Works as a plug-and-play addition without model retraining.
  • Shows particular strength on tasks needing complex reasoning over extended inputs.
  • Integrates directly with efficient attention implementations like FlashAttention.
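One reason a low-rank cache can sit in front of an unmodified attention kernel is that, with orthonormal bases, attending in the reduced space and then expanding the output gives the same result as reconstructing full-width keys and values first. The check below is a single-head NumPy sketch under simplifying assumptions that are not taken from the paper: no positional encoding, no causal mask, and a `1/sqrt(d)` scale.

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=-1, keepdims=True)  # subtract row max for stability
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
m, n, d, rk, rv = 3, 50, 32, 8, 8            # illustrative sizes
Q = rng.standard_normal((m, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
Uk, _ = np.linalg.qr(rng.standard_normal((d, rk)))  # orthonormal key basis
Uv, _ = np.linalg.qr(rng.standard_normal((d, rv)))  # orthonormal value basis

K_c, V_c = K @ Uk, V @ Uv                    # what a compressed cache would hold
scale = 1.0 / np.sqrt(d)

# (a) attend in the reduced space, then expand the output
out_a = softmax((Q @ Uk) @ K_c.T * scale) @ V_c @ Uv.T

# (b) reconstruct full-width K and V, then run ordinary attention
K_hat, V_hat = K_c @ Uk.T, V_c @ Uv.T
out_b = softmax(Q @ K_hat.T * scale) @ V_hat

print(np.allclose(out_a, out_b))  # True
```

Both paths compute `Q Uk Ukᵀ Kᵀ` for the scores, so they agree up to floating-point error. Regime (b) is the one that would let an existing fused kernel be reused as-is, at the cost of materializing the reconstructed tensors.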

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar online adaptation ideas could apply to other memory compression methods in transformers.
  • Reducing KV cache size this way might lower power consumption during inference by decreasing memory bandwidth needs.
  • Testing on even longer sequences could reveal how well the periodic updates scale with context length.

Load-bearing premise

That the incremental updates to the projection basis will successfully track context changes without accumulating errors that harm the quality of the attention computations.

What would settle it

Running the method on a long-context reasoning benchmark both with and without the periodic lightweight updates during decoding, and checking if accuracy falls when updates are disabled.
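The shape of that ablation can be previewed on synthetic data, with the caveat that a toy like this says nothing about real benchmark accuracy. In the sketch below, vectors come from one subspace for the first half of a stream and from a different one afterwards. One run freezes the basis and the other applies a periodic Oja-style update. All names, sizes, and the update rule are assumptions made for illustration.

```python
import numpy as np

def orth(M):
    Q, _ = np.linalg.qr(M)
    return Q

def rel_error(x, U):
    """Relative reconstruction error of one vector under basis U."""
    return np.linalg.norm(x - U @ (U.T @ x)) / np.linalg.norm(x)

def late_error(update_every, lr=0.1, d=32, r=4, steps=400, seed=0):
    """Mean error over the last 50 steps; update_every=0 freezes the basis."""
    rng = np.random.default_rng(seed)
    B_old = orth(rng.standard_normal((d, r)))
    B_new = orth(rng.standard_normal((d, r)))
    U = B_old.copy()                  # pretend prefill found the old subspace
    buf, errs = [], []
    for t in range(steps):
        B = B_old if t < steps // 2 else B_new   # abrupt shift at the midpoint
        x = B @ (3.0 * rng.standard_normal(r)) + 0.1 * rng.standard_normal(d)
        errs.append(rel_error(x, U))  # error measured before any update
        buf.append(x)
        if update_every and len(buf) == update_every:
            X = np.stack(buf)
            U = orth(U + lr * X.T @ (X @ U) / len(buf))  # periodic Oja-style step
            buf = []
    return float(np.mean(errs[-50:]))

frozen = late_error(update_every=0)
periodic = late_error(update_every=8)
print(frozen, periodic)
```

If periodic updates matter in the way the paper claims, the real experiment should show the same qualitative pattern in end-task accuracy rather than in reconstruction error.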

Figures

Figures reproduced from arXiv: 2509.21623 by David H. Yang, Keerthiram Murugesan, Mohammad Mohammadi Amiri, Pin-Yu Chen, Tejaswini Pedapati, Yuxuan Zhu.

Figure 1: Overview of the OjaKV workflow. The top-left panel shows standard attention using full-rank KV caching. Our method, shown in the bottom panel, introduces a low-rank path where keys and values are compressed using projection matrices (Uk, Uv) before caching. The top-right inset illustrates the core mechanism: these projection matrices are dynamically updated during both the prefill and decoding phases to ad…
Figure 2: Efficiency comparison of Full KV and OjaKV (60% compression).

Original abstract

The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model's weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
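The 16GB figure in the abstract can be reproduced with back-of-the-envelope arithmetic. The configuration values below (32 layers, 8 key-value heads, head dimension 128, 16-bit storage) are assumptions taken from the publicly released Llama-3.1-8B configuration rather than from this page, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=32 * 1024, batch=4)
print(size / 2**30)  # 16.0 (GiB)
```

For comparison, roughly 8 billion parameters stored at 16 bits occupy about 16 GB, so under these assumptions the cache is on the same scale as the weights.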

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OjaKV, a hybrid KV-cache compression framework for long-context LLMs. It preserves full-rank storage for the first and most recent tokens while applying low-rank projection (via Oja's online PCA) to intermediate tokens. The projection basis receives a comprehensive update during prompt prefilling and lightweight periodic updates during decoding; the method is stated to be FlashAttention-compatible and to require no model fine-tuning. Experiments are claimed to show that OjaKV maintains or improves zero-shot accuracy at high compression ratios, with the largest gains on long-context reasoning benchmarks.

Significance. If the empirical claims hold, the work would provide a practical, training-free route to reducing the dominant memory cost of long-context inference while addressing distribution shift better than static low-rank baselines. The hybrid anchor policy and online adaptation are conceptually sound responses to known weaknesses of offline PCA. The use of a standard online-PCA algorithm together with compatibility guarantees for modern attention kernels is a clear engineering strength.

major comments (2)
  1. [Method (online adaptation and hybrid policy)] Method description of decoding-phase updates: the central claim that lightweight periodic Oja updates suffice to keep the rank-r subspace aligned with evolving context (and thereby preserve attention quality for intermediate tokens) is load-bearing for the reported gains on long-context reasoning benchmarks. Oja's rule is a stochastic power iteration whose convergence is guaranteed only under stationary or slowly varying statistics; the manuscript should supply either a convergence-rate argument or targeted ablations on update interval and step-size under abrupt topic/reasoning shifts.
  2. [Experiments] Experimental section: the abstract asserts accuracy maintenance or gains, yet the quantitative evidence (exact compression ratios, baseline comparisons, error bars, and per-benchmark deltas) is not visible in the provided text. Without these numbers it is impossible to judge whether the strongest gains on long-context tasks are statistically reliable or merely within noise.
minor comments (2)
  1. Clarify the precise schedule and hyper-parameters of the 'lightweight periodic updates' (number of Oja steps per interval, learning-rate schedule, and any forgetting factor).
  2. Add a short discussion of the additional compute overhead introduced by the online updates relative to a static baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of OjaKV's hybrid storage policy and online adaptation approach, as well as for the constructive suggestions that will improve the manuscript. We address each major comment below and outline the revisions we will make.

point-by-point responses
  1. Referee: [Method (online adaptation and hybrid policy)] Method description of decoding-phase updates: the central claim that lightweight periodic Oja updates suffice to keep the rank-r subspace aligned with evolving context (and thereby preserve attention quality for intermediate tokens) is load-bearing for the reported gains on long-context reasoning benchmarks. Oja's rule is a stochastic power iteration whose convergence is guaranteed only under stationary or slowly varying statistics; the manuscript should supply either a convergence-rate argument or targeted ablations on update interval and step-size under abrupt topic/reasoning shifts.

    Authors: We agree that additional analysis of the decoding-phase updates would strengthen the claims. Oja's algorithm is a standard online PCA method with established convergence guarantees under slowly varying distributions (as referenced in the original Oja paper and subsequent analyses), and our design uses a comprehensive update during prefilling followed by lightweight periodic updates to track gradual context evolution. To directly address abrupt shifts, we will add targeted ablations in the revised version that vary update interval and step-size on long-context benchmarks containing topic changes or reasoning shifts, reporting attention quality metrics and end-task accuracy. We will also include a short discussion citing relevant convergence results for online subspace tracking. revision: yes

  2. Referee: [Experiments] Experimental section: the abstract asserts accuracy maintenance or gains, yet the quantitative evidence (exact compression ratios, baseline comparisons, error bars, and per-benchmark deltas) is not visible in the provided text. Without these numbers it is impossible to judge whether the strongest gains on long-context tasks are statistically reliable or merely within noise.

    Authors: The full experimental section (Section 4) contains the requested details: tables reporting exact compression ratios (4x–16x), direct comparisons against static low-rank baselines and prior KV-cache methods, standard error bars from 3–5 runs, and per-benchmark deltas with the largest improvements on long-context reasoning tasks. We acknowledge that these results may not have been sufficiently highlighted in the version provided to the referee. In the revision we will add a concise summary table in the main body (near the abstract claims) and ensure all numbers, baselines, and statistical details are explicitly cross-referenced. revision: partial

Circularity Check

0 steps flagged

No circularity: standard online PCA applied with empirical validation

The paper describes a hybrid KV-cache method that preserves first/recent tokens at full rank and compresses intermediates via incremental Oja updates during prefilling and periodic decoding steps. Oja's rule is a pre-existing stochastic power iteration (cited as standard online PCA) whose convergence properties are independent of this work; the paper does not derive any target quantity from its own fitted parameters or self-referential equations. Central claims rest on zero-shot benchmark measurements rather than any prediction that reduces by construction to the method's inputs. The derivation chain contains no load-bearing self-citation, no uniqueness theorem, and no smuggled ansatz.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on standard properties of online PCA and low-rank matrix approximations without introducing new entities or heavily fitted parameters beyond typical compression hyperparameters.

free parameters (2)
  • compression rank
    Chosen to achieve target memory reduction; affects fidelity of intermediate token representations.
  • update interval
    Periodic lightweight updates during decoding; specific frequency is a design choice not derived from first principles.
axioms (1)
  • [standard math] Oja's algorithm produces a valid online estimate of principal components for the evolving token distribution
    Invoked to justify incremental subspace adaptation during prefilling and decoding phases.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG · 2026-04 · unverdicted · novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  2. eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.
