DuoAttention distinguishes retrieval heads, which require a full KV cache, from streaming heads, which can use a constant-length cache, to reduce memory and latency in long-context LLM inference.
Fast transformer decoding: One write-head is all you need
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
representative citing papers
Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme sparsity.
The localization method is presented as a unifying framework connecting kernel methods, MeanShift, Hopfield networks, LLE, fuzzy inference, denoising autoencoders, and Transformers via local models and the localization trick.
citing papers explorer
- Prune, Update and Trim: Robust Structured Pruning for Large Language Models
- The General Theory of Localization Methods