No token left behind: Reliable KV cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 3

years: 2026 (8), 2025 (3)

verdicts: unverdicted 11

representative citing papers

RoPE-Aware Bit Allocation for KV-Cache Quantization

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

Block-GTQ performs RoPE-aware greedy bit allocation on KV caches using per-block energy scores, cutting logit MAE 32-80% versus uniform TQ-MSE and lifting long-context task scores substantially at 2-3 bits per dimension.
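To make the idea of greedy bit allocation from per-block energy scores concrete, here is a minimal sketch. It is not Block-GTQ itself: it assumes the textbook high-resolution distortion model D_i(b) = energy_i * 4^(-b), ignores RoPE structure entirely, and the names `greedy_bit_allocation`, `model_distortion`, and the example energies are hypothetical.

```python
import heapq

def greedy_bit_allocation(energies, total_bits, max_bits=8):
    """Assign integer bit widths to blocks, one bit at a time.

    Assumption (not from the summary): distortion of block i at b bits is
    energies[i] * 4**(-b), so each extra bit removes 3/4 of what is left.
    """
    bits = [0] * len(energies)
    # Max-heap (via negation) keyed by the distortion the next bit removes.
    heap = [(-0.75 * e, i) for i, e in enumerate(energies)]
    heapq.heapify(heap)
    remaining = total_bits
    while remaining > 0 and heap:
        neg_gain, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < max_bits:
            # Under the assumed model, the following bit gains a quarter as much.
            heapq.heappush(heap, (neg_gain / 4.0, i))
    return bits

def model_distortion(energies, bits):
    """Total distortion under the same assumed model."""
    return sum(e * 4.0 ** (-b) for e, b in zip(energies, bits))

# Hypothetical energy scores for four blocks, with a budget of 2 bits per block.
energies = [16.0, 4.0, 1.0, 0.25]
greedy = greedy_bit_allocation(energies, total_bits=8)
uniform = [2, 2, 2, 2]
print(greedy, model_distortion(energies, greedy), model_distortion(energies, uniform))
```

Under this model the greedy schedule spends more bits on high-energy blocks and fewer on low-energy ones, which is the qualitative behavior the summary attributes to energy-based allocation.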

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
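The general idea of searching for a per-block scale, rather than taking the absmax scale, can be sketched as follows. This is an illustration under loud assumptions: it uses a signed integer grid as a stand-in for the NVFP4 format, a plain linear sweep as a stand-in for whatever search ScaleSearch actually performs, and hypothetical names (`quantize_block`, `search_scale`) and data.

```python
def quantize_block(block, scale, qmax=7):
    """Round to a signed integer grid in [-qmax, qmax], then dequantize."""
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in block]

def mse(block, recon):
    return sum((x - y) ** 2 for x, y in zip(block, recon)) / len(block)

def search_scale(block, qmax=7, steps=64, lo=0.5, hi=1.0):
    """Sweep candidate scales in [lo*base, hi*base] and keep the lowest-MSE one.

    base = max|x| / qmax is the usual absmax scale and is always a candidate,
    so the result is never worse than absmax scaling on this block.
    """
    base = max(abs(x) for x in block) / qmax
    if base == 0.0:
        return 1.0, 0.0  # all-zero block: any scale reconstructs it exactly
    best_scale, best_err = base, mse(block, quantize_block(block, base, qmax))
    for k in range(steps + 1):
        scale = base * (lo + (hi - lo) * k / steps)
        err = mse(block, quantize_block(block, scale, qmax))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

# Hypothetical block with one outlier, where clipping the outlier may pay off.
block = [0.1, -0.2, 0.15, 0.05, -0.1, 0.12, 3.0, -0.07]
absmax_scale = max(abs(x) for x in block) / 7
absmax_err = mse(block, quantize_block(block, absmax_scale))
best_scale, best_err = search_scale(block)
print(absmax_err, best_err)
```

Smaller scales clip the largest value but resolve the small ones more finely; the sweep simply picks whichever trade-off gives the lowest error on the block.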

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

cs.LG · 2025-04-28 · unverdicted · novelty 6.0

TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a factor of approximately 2.7.
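The pipeline named in this summary, a random rotation followed by per-coordinate scalar quantization, can be sketched in a few lines of standard-library Python. This is an illustration only: the uniform min-max quantizer is a simple stand-in and is not claimed to be the quantizer TurboQuant uses, and all function names and the dimension `d = 16` are hypothetical.

```python
import math
import random

def random_rotation(d, rng):
    """Orthonormal d x d matrix from Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:
            rows.append([a / norm for a in v])
    return rows

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def uniform_quantize(y, bits):
    """Uniform scalar quantizer applied to each coordinate independently."""
    levels = 2 ** bits - 1
    lo, hi = min(y), max(y)
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in y]
    return codes, lo, step

def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]

def l2(v):
    return math.sqrt(sum(a * a for a in v))

rng = random.Random(0)
d = 16
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
R = random_rotation(d, rng)
y = matvec(R, x)                      # rotate
codes, lo, step = uniform_quantize(y, bits=4)
x_hat = matvec(transpose(R), dequantize(codes, lo, step))  # rotate back
err = l2([a - b for a, b in zip(x, x_hat)])
print(err, l2(x))
```

Because the rotation is orthonormal, it preserves norms, so the reconstruction error in the original space equals the quantization error in the rotated space; here that error is at most `sqrt(d) * step / 2`.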

Rethinking LoRA Memory Through the Lens of KV Cache Compression

cs.CL · 2026-06-04 · unverdicted · novelty 5.0

Document LoRA acts as decoding-time parametric memory that recovers 13-21 ROUGE-L points under heavy KV cache compression in QA, performing best when the base model encodes the document and the adapter is used only at generation with QA supervision.
