DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2024 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers:
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads