2 Pith papers cite this work.
citing papers explorer
- DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
  DELTA partitions layers into full, delta, and sparse groups and selects salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens by up to 4.25x and achieving a 1.54x speedup.
- CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
  CSAttention precomputes fixed-size, query-centric lookup tables during offline prefill to enable fast table-lookup decoding, delivering accuracy nearly identical to full attention and up to a 4.6x speedup at 95% sparsity for 32K-128K contexts.