CUTLASS: CUDA Templates for Linear Algebra Subroutines
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.PL (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
Neptune introduces dependency-breaking fusion with algebraic corrections for reduction sequences, generating FlashAttention-like kernels from plain attention code with 1.35x average speedup across ten benchmarks and four GPU architectures.