Efficient memory management for large language model serving with PagedAttention
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- STS: Efficient Sparse Attention with Speculative Token Sparsity
  STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy loss.
- DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training
  DynaTrain introduces a Virtual Parameter Space abstraction to enable sub-second online parallelism reconfiguration for elastic LLM training on models up to 235B parameters.