pith. sign in

arxiv: 1607.01249 · v1 · pith:AHYAECYYnew · submitted 2016-07-05 · 💻 cs.MS · cs.PF

TTC: A Tensor Transposition Compiler for Multiple Architectures

classification 💻 cs.MS cs.PF
keywords compilerarchitecturesintelmultiplecornerdomain-specifichaswellknights
0
0 comments X
read the original abstract

We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g.,~Intel Haswell, AMD Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel's latest C++ compiler on the Haswell and Knights Corner architecture, TTC yields speedups of up to $8\times$ and $32\times$, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.