Dissecting the nvidia hopper architecture through microbenchmarking and multiple level analysis

Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Hongyuan Liu, Qiang Wang, Xiaowen Chu · 2025 · arXiv 2501.12084

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures

cs.DC · 2026-04-20 · unverdicted · novelty 7.0

AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90% block sparsity.

TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments

cs.AR · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

TLX introduces MIMW-based extensions to Triton that let developers orchestrate warp-group execution and asynchronous hardware features while preserving blocked programming productivity, with kernels deployed in large-scale training and inference.

Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis

cs.AR · 2026-05-01 · unverdicted · novelty 6.0

Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.

Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures

cs.DC · 2026-05-05 · unverdicted · novelty 5.0

Microbenchmark-driven analytical models for B200 and MI300A achieve 1.31% and 0.09% MAE on validation kernels, far outperforming roofline baselines exceeding 95% error.

citing papers explorer

Showing 4 of 4 citing papers.

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures cs.DC · 2026-04-20 · unverdicted · none · ref 16
AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90% block sparsity.
TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments cs.AR · 2026-05-11 · unverdicted · none · ref 16 · 2 links
TLX introduces MIMW-based extensions to Triton that let developers orchestrate warp-group execution and asynchronous hardware features while preserving blocked programming productivity, with kernels deployed in large-scale training and inference.
Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis cs.AR · 2026-05-01 · unverdicted · none · ref 9
Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures cs.DC · 2026-05-05 · unverdicted · none · ref 21
Microbenchmark-driven analytical models for B200 and MI300A achieve 1.31% and 0.09% MAE on validation kernels, far outperforming roofline baselines exceeding 95% error.

Dissecting the nvidia hopper architecture through microbenchmarking and multiple level analysis

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer