Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, et al · 2023 · arXiv 6292.362630

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures

cs.DC · 2026-04-20 · unverdicted · novelty 7.0

AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90% block sparsity.

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

cs.DC · 2026-04-18 · unverdicted · novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters

cs.DC · 2025-10-13 · unverdicted · novelty 5.0

FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.

citing papers explorer

Showing 3 of 3 citing papers.

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures cs.DC · 2026-04-20 · unverdicted · none · ref 19
AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90% block sparsity.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention cs.DC · 2026-04-18 · unverdicted · none · ref 37
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters cs.DC · 2025-10-13 · unverdicted · none · ref 54
FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.

Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer