Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Zhe Jia , Marco Maggioni , Benjamin Staiger , Daniele P. Scarpazza

classification 💻 cs.DC cs.PF

keywords nvidia volta architecture details instruction kepler level line

Novel NVIDIA GPU designs are introduced every year. This rapid architectural and technological progression, coupled with manufacturers' reluctance to disclose low-level details, makes it difficult for even the most proficient GPU software designers to stay up to date with technological advances at the microarchitectural level. To address this dearth of public microarchitectural information on novel NVIDIA GPUs, independent researchers have resorted to microbenchmark-based dissection and discovery. This has led to a prolific line of publications that shed light on instruction encoding and on the geometry and features of each level of the memory hierarchy, namely research describing the performance and behavior of the Kepler, Maxwell and Pascal architectures. In this technical report, we continue this line of research by presenting the microarchitectural details of the NVIDIA Volta architecture, discovered through microbenchmarks and instruction set disassembly. Additionally, we quantitatively compare our Volta findings against those for its predecessors, Kepler, Maxwell and Pascal.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU
cs.DC 2026-04 unverdicted novelty 7.0

Ocean uses HyperLogLog estimators to skip the costly symbolic phase of GPU SpGEMM, pairs it with dynamic workflow choice and a shared-plus-global hash accumulator, and reports 1.4-2.8x speedups over prior GPU implementations.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
cs.DC 2026-04 unverdicted novelty 7.0

A simulation study shows that cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
cs.LG 2022-05 accept novelty 7.0

FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
AME-PIM: Can Memory be Your Next Tensor Accelerator?
cs.AR 2026-04 unverdicted novelty 6.0

The paper maps RISC-V AME matrix instructions to HBM-PIM micro-kernels via a PEP-based model and reduction-free outer-product dataflow, achieving up to 14.9 GFLOP/s on Samsung Aquabolt-XL.
AdaSplash-2: Faster Differentiable Sparse Attention
cs.LG 2026-04 unverdicted novelty 6.0

AdaSplash-2 introduces a histogram-based initialization for the α-entmax normalizer that cuts iterations to 1-2 and, with a sparsity-aware GPU kernel, matches or beats FlashAttention-2 training speed at moderate-to-hi...
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
cs.LG 2023-07 accept novelty 6.0

FlashAttention-2 achieves roughly 2x speedup over FlashAttention by parallelizing attention across thread blocks and distributing work within blocks, reaching 50-73% of theoretical peak FLOPs/s on A100 GPUs.
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
cs.DC 2026-05 unverdicted novelty 5.0

Microbenchmark-driven analytical models for B200 and MI300A achieve 1.31% and 0.09% MAE on validation kernels, far outperforming roofline baselines whose error exceeds 95%.
M100: An Orchestrated Dataflow Architecture Powering General AI Computing
cs.LG 2026-04 unverdicted novelty 5.0

M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.
CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments
cs.DC 2026-04 unverdicted novelty 4.0

A warp-tiled CUDA kernel for depthwise convolution delivers a 3.26x runtime reduction versus a naive baseline and a 1.29x end-to-end training speedup, using counter-free analysis in cloud settings.