MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection

Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng · 2024 · arXiv 2410.14731

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

representative citing papers

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perplexity cost.

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

Quantization Dominates Rank Reduction for KV-Cache Compression

cs.LG · 2026-04-13 · conditional · novelty 6.0

Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves ordering.

OjaKV: Context-Aware Online Low-Rank KV Cache Compression

cs.CL · 2025-09-25 · unverdicted · novelty 6.0

OjaKV introduces hybrid full-rank storage for key tokens combined with online low-rank KV cache compression via Oja's algorithm to support memory-efficient long-context LLM inference.

citing papers explorer

Showing 5 of 5 citing papers.

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference cs.CV · 2026-05-19 · unverdicted · none · ref 44
RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales cs.LG · 2026-04-22 · unverdicted · none · ref 18
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perplexity cost.
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization cs.LG · 2026-05-18 · unverdicted · none · ref 45
OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
Quantization Dominates Rank Reduction for KV-Cache Compression cs.LG · 2026-04-13 · conditional · none · ref 11
Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves ordering.
OjaKV: Context-Aware Online Low-Rank KV Cache Compression cs.CL · 2025-09-25 · unverdicted · none · ref 9
OjaKV introduces hybrid full-rank storage for key tokens combined with online low-rank KV cache compression via Oja's algorithm to support memory-efficient long-context LLM inference.

MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection

fields

years

verdicts

representative citing papers

citing papers explorer