Accurate and efficient 2-bit KV cache quantization with dynamic channel-wise precision boost
4 Pith papers cite this work.
fields: cs.LG (4)
years: 2026 (4)
citing papers explorer
- RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
  RateQuant delivers optimal mixed-precision KV cache quantization by per-quantizer distortion fitting followed by closed-form reverse waterfilling, reducing perplexity by 70% versus KIVI at 2.5 average bits on Qwen3-8B.
- OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
  OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
- SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference
  Spherical KV introduces angle-domain attention with spherical key parameterization and rate-distortion retention to cut KV cache residency while preserving efficient paged decoding.
- SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
  Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.