Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs

Alejandro Vaquero; Alexei Strelchenko; Evan Weinberg; M. A. Clark; Mathias Wagner

arxiv: 1710.09745 · v2 · pith:BG5IX6LEnew · submitted 2017-10-26 · ✦ hep-lat · physics.comp-ph



keywords: operations, solvers, block-krylov, gpus, implementations, krylov, matrix-vector, memory-bandwidth


Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.
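To make the quadratic-versus-linear claim concrete, the following is a minimal sketch of a toy memory-traffic model for the block update Y += X A, where X and Y hold m vectors of length n and A is a small m x m coefficient matrix. It is not taken from the paper or from QUDA: the function names `block_axpy_naive` and `block_axpy_fused` are hypothetical, the traffic counter only tallies vector-element loads and stores, and it assumes A is small enough to stay in registers or cache. Under those assumptions, issuing m*m separate axpy sweeps moves 3*n*m*m elements, while a fused sweep that loads each row of X and Y once moves 3*n*m.

```python
import copy


def block_axpy_naive(X, Y, A):
    """Update Y[r][i] += sum_j X[r][j] * A[j][i] as m*m separate axpy sweeps.

    Each sweep streams one column of X and one column of Y through memory,
    so every inner update is counted as 2 loads + 1 store.
    Returns the number of vector-element loads/stores (toy model).
    """
    n, m = len(X), len(A)
    traffic = 0
    for i in range(m):          # target vector
        for j in range(m):      # source vector
            for r in range(n):  # one full axpy sweep over the vector length
                Y[r][i] += X[r][j] * A[j][i]
                traffic += 3
    return traffic


def block_axpy_fused(X, Y, A):
    """Same update, but each row of X and Y is loaded once and reused.

    The m*m multiply-adds per row happen on local copies (standing in for
    registers), so only 2*m loads and m stores per row are counted.
    """
    n, m = len(X), len(A)
    traffic = 0
    for r in range(n):
        x_row = list(X[r])      # m loads
        y_row = list(Y[r])      # m loads
        traffic += 2 * m
        for i in range(m):
            acc = y_row[i]
            for j in range(m):
                acc += x_row[j] * A[j][i]
            y_row[i] = acc
        Y[r] = y_row            # m stores
        traffic += m
    return traffic


# Small demonstration with exactly representable values.
n, m = 8, 4
X = [[float(r + j) for j in range(m)] for r in range(n)]
Y0 = [[float(r * j) for j in range(m)] for r in range(n)]
A = [[float(i - j) for i in range(m)] for j in range(m)]

Y_naive = copy.deepcopy(Y0)
Y_fused = copy.deepcopy(Y0)
t_naive = block_axpy_naive(X, Y_naive, A)
t_fused = block_axpy_fused(X, Y_fused, A)

assert Y_naive == Y_fused                   # both variants compute the same update
print(t_naive, t_fused, t_naive / t_fused)  # 384 96 4.0
```

In this model the traffic ratio equals the block size m, which is the sense in which the fused variant scales linearly rather than quadratically in the number of right-hand sides. Real GPU kernels involve further considerations (register pressure, tiling over m, reductions for inner products) that this sketch does not attempt to capture.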
