Accurate and efficient 2-bit KV cache quantization with dynamic channel-wise precision boost

Kitty: Accurate, efficient 2-bit KV cache quantization with dynamic channel-wise precision boost · 2025 · arXiv 2511.18643

4 Pith papers cite this work.

representative citing papers

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

cs.LG · 2026-04-22 · conditional · novelty 7.0

RateQuant delivers optimal mixed-precision KV cache quantization by per-quantizer distortion fitting followed by closed-form reverse waterfilling, reducing perplexity by 70% versus KIVI at 2.5 average bits on Qwen3-8B.

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Spherical KV introduces angle-domain attention with spherical key parameterization and rate-distortion retention to cut KV cache residency while preserving efficient paged decoding.

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.

citing papers explorer

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory · cs.LG · 2026-04-22 · conditional · none · ref 23
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization · cs.LG · 2026-05-18 · unverdicted · none · ref 6
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference · cs.LG · 2026-05-13 · unverdicted · none · ref 25
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving · cs.LG · 2026-04-21 · unverdicted · none · ref 18
