IEEE Transactions on Computers , volume=
1 Pith paper cites this work. Polarity classification is still indexing.
citation-role summary: baseline 1
citation-polarity summary
fields: cs.LG 1
years: 2024 1
verdicts: CONDITIONAL 1
roles: baseline 1
polarities: baseline 1
representative citing papers
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-3% of the data.
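
The retrieval idea in the RetrievalAttention summary can be illustrated with a small sketch. This is not the paper's attention-aware algorithm: an exact top-k inner-product search stands in for the CPU-based ANNS index, the data is random, and every name here (`retrieve_top_k`, `attend`, `captured_mass`) is hypothetical. It only shows the general pattern of running softmax attention over a small retrieved subset of the KV cache (2% in this example) instead of over every cached vector.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, indices):
    """Softmax attention restricted to the KV pairs at `indices`."""
    indices = list(indices)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * dot(query, keys[i]) for i in indices])
    out_dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices))
            for d in range(out_dim)]

def retrieve_top_k(query, keys, k):
    """Assumed stand-in for an ANNS lookup: exact top-k by inner product."""
    order = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return order[:k]

def captured_mass(query, keys, indices):
    """Share of the full softmax weight that falls on the retrieved keys."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * dot(query, key) for key in keys])
    return sum(weights[i] for i in indices)

# Synthetic KV cache (assumption: random Gaussian vectors, not real model states).
random.seed(0)
dim, n, k = 8, 1000, 20  # k / n = 2% of the cached vectors
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = [random.gauss(0, 1) for _ in range(dim)]

top = retrieve_top_k(query, keys, k)
approx = attend(query, keys, values, top)       # attention over retrieved subset
full = attend(query, keys, values, range(n))    # full attention for reference
mass = captured_mass(query, keys, top)
max_error = max(abs(a - b) for a, b in zip(full, approx))
print(f"retrieved {k}/{n} keys, softmax mass {mass:.3f}, max error {max_error:.3f}")
```

How close `approx` comes to `full` depends on how concentrated the attention weights are. On random data like this the gap can be noticeable, so the printed numbers say nothing about the accuracy reported for RetrievalAttention.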