Performance, Design, and Autotuning of Batched GEMM for GPUs
2 Pith papers cite this work.
citing papers explorer
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
  FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
- An Efficient Batch Solver for the Singular Value Decomposition on GPUs
  A new GPU-oriented batch SVD solver based on the one-sided Jacobi method delivers significant speedups over vendor libraries and prior open-source implementations across precisions and matrix shapes.