MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang · 2024 · arXiv 2405.14366

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

cs.CL · 2025-02-04 · unverdicted · novelty 7.0

KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.

S2O: Early Stopping for Sparse Attention via Online Permutation

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

cs.LG · 2025-03-25 · unverdicted · novelty 6.0

LogQuant applies log-based filtering for 2-bit KV cache quantization in LLMs, claiming 25% higher throughput, 60% larger batches, and 40-200% accuracy gains on math/code tasks versus existing compression approaches.

citing papers explorer

Showing 3 of 3 citing papers.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression cs.CL · 2025-02-04 · unverdicted · none · ref 60
S2O: Early Stopping for Sparse Attention via Online Permutation cs.LG · 2026-02-26 · unverdicted · none · ref 15
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation cs.LG · 2025-03-25 · unverdicted · none · ref 13
