TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

[Online] · 2018 · cs.LG · arXiv 1802.04799

8 Pith papers cite this work.

abstract

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms, such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs), requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art, hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open-sourced and in production use inside several major companies.
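
To make "operator fusion" concrete, here is a minimal sketch in plain Python, not TVM's implementation or API. It assumes two elementwise operators (a bias add followed by ReLU); the function names `bias_add`, `relu`, `unfused`, and `fused_bias_relu` are hypothetical. The unfused version writes a full intermediate buffer and reads it back in a second pass, while the fused version computes both operators in one loop with no intermediate.

```python
# Hypothetical illustration of operator fusion (not TVM code):
# bias add followed by ReLU, executed unfused vs. fused.

def bias_add(xs, b):
    # Unfused step 1: materializes a full intermediate list.
    return [x + b for x in xs]

def relu(xs):
    # Unfused step 2: a second pass over the intermediate list.
    return [x if x > 0.0 else 0.0 for x in xs]

def unfused(xs, b):
    tmp = bias_add(xs, b)  # intermediate buffer written out
    return relu(tmp)       # and read back in

def fused_bias_relu(xs, b):
    # Fused: a single pass, no intermediate buffer; each input is read once.
    out = []
    for x in xs:
        y = x + b
        out.append(y if y > 0.0 else 0.0)
    return out

data = [-2.0, -0.5, 0.0, 1.5, 3.0]
print(fused_bias_relu(data, 1.0))  # [0.0, 0.5, 1.0, 2.5, 4.0]
```

Both versions return the same values; the benefit of fusion on real hardware comes from the memory traffic the intermediate buffer no longer costs.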

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Prism: Symbolic Superoptimization of Tensor Programs

cs.PL · 2026-04-16 · unverdicted · novelty 8.0

Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x speedup over prior superoptimizers on LLM workloads.

Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

cs.PL · 2026-04-16 · unverdicted · novelty 7.0

Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

cs.PL · 2025-10-09 · conditional · novelty 6.0

Neptune introduces dependency-breaking fusion with algebraic corrections for reduction sequences, generating FlashAttention-like kernels from plain attention code with 1.35x average speedup across ten benchmarks and four GPU architectures.
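
The "algebraic correction" idea for reduction sequences can be illustrated with the well-known online-softmax rescaling used in FlashAttention-style kernels. This is a generic sketch, not Neptune's algorithm; the function names are hypothetical. Normally the sum of `exp(x - max)` depends on a completed max reduction, forcing two passes; keeping a running max and rescaling the running sum whenever the max grows breaks that dependency so both reductions fuse into one pass.

```python
import math

def softmax_two_pass(xs):
    m = max(xs)                           # reduction 1: max
    s = sum(math.exp(x - m) for x in xs)  # reduction 2 depends on finished max
    return [math.exp(x - m) / s for x in xs]

def online_max_and_sum(xs):
    # One pass: running max m and running sum s of exp(x - m).
    m = float("-inf")
    s = 0.0
    for x in xs:
        m_new = max(m, x)
        # Algebraic correction: rescale the old sum from base m to base m_new.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, s

def softmax_online(xs):
    m, s = online_max_and_sum(xs)
    return [math.exp(x - m) / s for x in xs]

xs = [1.0, 3.0, 2.0, 0.5]
print(softmax_online(xs))
```

On the first element `exp(-inf)` evaluates to `0.0`, so the empty running sum is discarded cleanly.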

Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition

cs.AR · 2026-04-17 · unverdicted · novelty 6.0

A co-design framework using approximate matrix decomposition and genetic algorithms delivers 33% average latency reduction in TinyML CNN FPGA accelerators with 1.3% average accuracy loss versus standard systolic arrays.

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

cs.DC · 2026-04-13 · unverdicted · novelty 6.0

R³ optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.

A Multi-Dimensional, Per-Pass Empirical Study of the LLVM Optimization Pipeline

cs.SE · 2026-06-30 · unverdicted · novelty 5.0

An empirical study decomposes the LLVM -O3 pipeline into cumulative prefixes and quantifies per-pass effects on 30 kernels, finding non-monotonic behavior, back-loaded gains, and a 46.35% idealized upper bound on phase-interference losses.
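
The cumulative-prefix methodology can be sketched as simple bookkeeping, assuming a runtime is measured after applying each prefix of the pass list. The function `per_pass_deltas` and the runtimes below are hypothetical placeholders, not data from the study. Each pass's effect is the difference between consecutive prefix measurements, and a negative delta exposes non-monotonic behavior.

```python
def per_pass_deltas(prefix_runtimes):
    """Return the runtime saved by each pass.

    prefix_runtimes[k] is the runtime after applying the first k passes;
    index 0 is the unoptimized baseline. A negative delta means that pass
    made the program slower.
    """
    return [prefix_runtimes[k - 1] - prefix_runtimes[k]
            for k in range(1, len(prefix_runtimes))]

# Made-up runtimes (ms) for prefixes of length 0..3; not real measurements.
runtimes = [100.0, 90.0, 92.0, 60.0]
deltas = per_pass_deltas(runtimes)
print(deltas)  # [10.0, -2.0, 32.0]
```

The deltas telescope, so their sum always equals the total gain `runtimes[0] - runtimes[-1]`; a large final delta is what a back-loaded pipeline looks like.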

On Efficient Scaling of GNNs via IO-Aware Layers Implementations

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

IO-aware GPU kernels for SpMM convolutions, degree-aware reductions, and fused attention layers deliver median speedups of 1.6-2.6x (up to 10x) and memory reductions up to 76x over DGL/PyG baselines on realistic graphs.

HTAM: Hierarchical Transition-Attended Memory for Operator Optimization

cs.CL · 2026-05-28 · unverdicted · novelty 5.0

HTAM builds a Hierarchical Transition Graph to organize coarse global directions and detailed local strategies for guiding LLM-based CUDA kernel optimization, improving results on KernelBench.
