Performance, Design, and Autotuning of Batched GEMM for GPUs

Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, Jack Dongarra · 2016 · DOI 10.1007/978-3-319-41321-1_2

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

An Efficient Batch Solver for the Singular Value Decomposition on GPUs

cs.MS · 2026-01-25 · unverdicted · novelty 6.0

A new GPU-oriented batch SVD solver based on the one-sided Jacobi method delivers significant speedups over vendor libraries and prior open-source implementations across precisions and matrix shapes.

citing papers explorer

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · none · ref 1

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

An Efficient Batch Solver for the Singular Value Decomposition on GPUs

cs.MS · 2026-01-25 · unverdicted · none · ref 5

A new GPU-oriented batch SVD solver based on the one-sided Jacobi method delivers significant speedups over vendor libraries and prior open-source implementations across precisions and matrix shapes.
