Current LLMs remain robust to high levels of inference-time context sparsity across diverse tasks, enabling up to 10x acceleration via sparse kernels.
SOCKET: SOft Collision Kernel EsTimator for Sparse Attention
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Exploiting sparsity during long-context inference is key to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends on efficient scoring and selection at inference time. We revisit Locality-Sensitive Hashing (LSH) and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Traditional LSH yields binary collision signals that limit ranking quality and require substantial memory to perform well. In contrast, soft LSH accumulates graded collision evidence across hash tables, preserving top-k ordering with significantly less memory. This reframes LSH from a candidate generator into a principled scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without ad hoc voting and matches or surpasses prior sparse attention methods across multiple long-context benchmarks. With a custom CUDA scoring kernel and a Flash Decode Triton backend, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention.
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Inference Time Context Sparsity: Illusion or Opportunity?
Current LLMs remain robust to high levels of inference-time context sparsity across diverse tasks, enabling up to 10x acceleration via sparse kernels.