TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention, 2024
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
STS: Efficient Sparse Attention with Speculative Token Sparsity
STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy loss.