A fast optimization view: Reformulating single layer attention in LLM based on tensor and SVM trick, and solving it in matrix multiplication time
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
citation-role summary: background 1
citation-polarity summary
fields: cs.LG 2
verdicts: UNVERDICTED 2
roles: background 1
polarities: unclear 1
representative citing papers
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
citing papers explorer
- Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels
  MoA framework derives a denotational normal form for attention that eliminates all intermediate arrays by algebraic construction, yielding O(n_dk + n_dv) memory traffic with numerical verification against PyTorch.
- H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
  H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.