Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system

Charlene Yang, Thorsten Kurth, Samuel Williams · 2020 · DOI 10.1002/cpe.5547

4 Pith papers cite this work. Polarity classification is still indexing.



representative citing papers

Apple Neural Engine: Architecture, Programming, and Performance

cs.AR · 2026-06-21 · unverdicted · novelty 8.0

The paper documents the Apple Neural Engine through reverse engineering, covering its architecture, dispatch mechanisms, weight compression, and roofline performance, based on measurements from M1 and M5 chips and analysis of private runtime components.

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

cs.LG · 2026-06-24 · conditional · novelty 7.0

KernelPro combines LLM code generation, roofline-guided tool orchestration, and domain-adapted MCTS to produce GPU kernels that outperform prior automated and some hand-tuned baselines on KernelBench and VeOmni workloads.

COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC

cs.PF · 2026-04-24 · conditional · novelty 6.0

COMPASS formalizes HPC configuration questions as ML tasks on traces, quantifies recommendation trustworthiness, and delivers 65.93% lower average job turnaround time plus 80.93% lower node usage versus prior methods in simulator tests.

Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures

cs.LG · 2026-06-11 · unverdicted · novelty 4.0

Profiling of Med-DDPM shows that cuDNN kernels dominate training; enabling TF32 Tensor Cores and a 3D channels-last layout reduces SM cycles by up to 100x and raises Tensor Core utilization on A100 without quality loss.

citing papers explorer

Showing 4 of 4 citing papers.

Apple Neural Engine: Architecture, Programming, and Performance · cs.AR · 2026-06-21 · unverdicted · none · ref 12
Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization · cs.LG · 2026-06-24 · conditional · none · ref 33
COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC · cs.PF · 2026-04-24 · conditional · none · ref 170
Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures · cs.LG · 2026-06-11 · unverdicted · none · ref 58
