ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

2024 · arXiv 2405.14256

6 Pith papers cite this work.

representative citing papers

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

cs.CV · 2026-06-06 · unverdicted · novelty 7.0

HACK++ is a head-aware KV cache compression framework for VAR models that decouples current-scale attention from historical cache under adaptive per-head budgets to achieve near-lossless generation at 30% attention and 10% cache budgets.

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Spherical KV combines angle-domain attention using spherical key codes with rate-distortion retention to cut KV cache residency and HBM traffic while keeping a paged, fusion-friendly decode path.

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache

cs.LG · 2026-07-01 · unverdicted · novelty 5.0

GSRQ applies a gain-shape variant of K-means inside residual quantization to improve directional fidelity, raising LongBench accuracy from 11.34 to 33.54 at 1-bit on LLaMA-3-8B.

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

IndexMem proposes a learned KV importance predictor paired with a latent memory module to enable bounded KV cache size for long-context inference, reporting gains on RULER, Needle-in-a-Haystack, and LongBench across multiple LLMs.

Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG

cs.CR · 2025-06-04 · unverdicted · novelty 5.0

Introduces NPAS and AV Filter using LLM attention weights to defend RAG against poisoning, reporting up to 20% accuracy gains while adaptive attacks reach 35% success.

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

cs.AI · 2026-05-21 · unverdicted · novelty 4.0 · 2 refs

Meta-Soft dynamically synthesizes targeted soft tokens from a learnable meta-library using Gumbel-Softmax and applies attention-flow integration to compress KV cache while attempting to preserve evicted context information.

citing papers explorer

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

cs.CV · 2026-06-06 · unverdicted · none · ref 39

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

cs.LG · 2026-05-13 · unverdicted · none · ref 10

GSRQ: Gain-Shape Residual Quantization for Sub-1-bit KV Cache

cs.LG · 2026-07-01 · unverdicted · none · ref 7

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

cs.CL · 2026-05-25 · unverdicted · none · ref 8

Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG

cs.CR · 2025-06-04 · unverdicted · none · ref 37

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

cs.AI · 2026-05-21 · unverdicted · none · ref 12 · 2 links
