Pqcache: Product quantization-based kvcache for long context llm inference

Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui · 2024 · arXiv 2407.12820

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

cs.AR · 2026-04-20 · unverdicted · novelty 6.0

AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

cs.LG · 2026-04-12 · unverdicted · novelty 6.0

IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

cs.LG · 2024-09-16 · conditional · novelty 6.0

RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-3% of the data.

citing papers explorer

Showing 3 of 3 citing papers.

AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization cs.AR · 2026-04-20 · unverdicted · none · ref 77
AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.
IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs cs.LG · 2026-04-12 · unverdicted · none · ref 12
IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval cs.LG · 2024-09-16 · conditional · none · ref 118
RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-3% of the data.

Pqcache: Product quantization-based kvcache for long context llm inference

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer