Bridging the gap between unstructured SpMM and structured sparse tensor cores
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background 1
Citation-polarity summary
Verdicts: UNVERDICTED 2
Roles: background 1
Polarities: background 1
Representative citing papers
SWOT enables intra-collective reconfiguration in optical networks to hide reconfiguration overhead during collective communications, achieving up to 89.7% reduction in completion time versus static topologies.
citing papers explorer
- AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
  AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90% block sparsity.
- Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks
  SWOT enables intra-collective reconfiguration in optical networks to hide reconfiguration overhead during collective communications, achieving up to 89.7% reduction in completion time versus static topologies.