AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90% block sparsity.
Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity,
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.DC 4verdicts
UNVERDICTED 4representative citing papers
BLEST is a Tensor Core-accelerated BFS framework using BVSS representation and optimized MMA layouts that reports average speedups of 22x over GAP on real-world graphs.
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.
citing papers explorer
-
AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90% block sparsity.
-
Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs
BLEST is a Tensor Core-accelerated BFS framework using BVSS representation and optimized MMA layouts that reports average speedups of 22x over GAP on real-world graphs.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
-
FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.