SampleAttention: Near-lossless acceleration of long context LLM inference with adaptive structured sparse attention

Zhu, Q · 2024 · arXiv 2406.15486

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

cs.DC · 2026-03-26 · unverdicted · novelty 7.0

GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing pipelines.

CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

CSAttention precomputes fixed-size query-centric lookup tables in offline prefill to enable fast table-lookup decoding, delivering near-identical accuracy to full attention and up to 4.6x speedup at 95% sparsity for 32K-128K contexts.

citing papers explorer

Showing 3 of 3 citing papers.

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving · cs.DC · 2026-03-26 · unverdicted · none · ref 24
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache · cs.LG · 2026-05-07 · unverdicted · none · ref 57
CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference · cs.LG · 2026-03-30 · unverdicted · none · ref 11
