Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
Pith reviewed 2026-05-08 18:18 UTC · model grok-4.3
The pith
Analytical models from microbenchmarks predict performance on Blackwell and MI300A GPUs with mean errors of 1.31% and 0.09%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Analytical performance models grounded in systematic microbenchmark characterization accurately predict kernel execution times on NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A), with mean absolute errors of 1.31% and 0.09% on validation sets of 21 and 27 kernels respectively, while naive roofline baselines exceed 95% error.
What carries the argument
Microbenchmark-driven analytical performance models that capture architecture-specific features such as Tensor Memory (TMEM), asynchronous bulk copy (TMA), tensor cores for Blackwell, and Infinity Cache hierarchy, VGPR constraints, and occupancy for CDNA3.
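The occupancy piece of the CDNA3 model can be illustrated with a toy calculation: resident wavefronts per SIMD are capped both by a hardware wave-slot limit and by the vector-register (VGPR) budget. A minimal sketch, in which both parameter defaults are assumed placeholders rather than the paper's measured values:

```python
# Toy illustration of a VGPR-limited occupancy term for a CDNA3-style SIMD.
# Both defaults below are assumed placeholders, not the paper's parameters.

def occupancy(vgprs_per_wave: int,
              vgpr_budget: int = 512,   # VGPRs available per SIMD (assumed)
              max_waves: int = 8) -> int:  # hardware wave-slot cap (assumed)
    """Number of concurrently resident wavefronts on one SIMD."""
    if vgprs_per_wave <= 0:
        raise ValueError("vgprs_per_wave must be positive")
    return min(max_waves, vgpr_budget // vgprs_per_wave)

print(occupancy(64))   # 8: register-light kernel hits the wave-slot cap
print(occupancy(256))  # 2: register-heavy kernel is VGPR-limited
```

Register pressure thus enters the model as a hard divisor on concurrency, which is why VGPR constraints appear alongside Infinity Cache parameters in the CDNA3 model.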
Load-bearing premise
The microbenchmark suite isolates and measures all performance-critical components such that the fitted parameters accurately compose for real kernels without unaccounted interactions.
What would settle it
A validation kernel whose measured performance deviates substantially from the model's prediction even after accounting for all modeled components like memory hierarchies and occupancy.
Original abstract
Rapidly evolving GPU architectures featuring complex memory hierarchies, matrix units, and varied precision formats continue to widen the gap between theoretical peaks and achievable performance. We design and develop analytical performance models for NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A) grounded in systematic microbenchmark characterization. For Blackwell, the model captures Tensor Memory (TMEM), asynchronous bulk copy (TMA), and 5th-generation tensor cores; for CDNA3, the model captures Infinity Cache hierarchy, VGPR constraints, and occupancy. Validation yields 1.31% MAE on B200 (21 kernels) and 0.09% on MI300A (27 kernels), while naive roofline baselines exceed 95% error on the same kernels. We further validate the models using Rodinia 3.1 and SPEChpc 2021 Tiny. The models are updated with HBM bandwidth, capacity, and cache parameters and applied to H200 (Hopper) and MI250X (CDNA2), indicating that no major restructuring of the models is needed. All models and benchmarks will be released as open-source upon acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops microbenchmark-driven analytical performance models for NVIDIA Blackwell B200 and AMD CDNA3 MI300A GPUs. For B200 the model incorporates TMEM, TMA, and 5th-generation tensor cores; for MI300A it incorporates the Infinity Cache hierarchy, VGPR constraints, and occupancy. Parameters are fitted exclusively to dedicated microbenchmarks and then used to predict performance on 21 independent kernels (B200) and 27 kernels (MI300A), yielding 1.31% and 0.09% MAE respectively—far below naive roofline baselines (>95% error). The same model structures are applied to H200 and MI250X after updating only HBM bandwidth, capacity, and cache parameters. Additional validation uses Rodinia 3.1 and SPEChpc 2021 Tiny suites. All models and benchmarks are to be released open-source.
Significance. If the compositionality of the microbenchmark-derived parameters holds, the work supplies a practical, architecture-portable analytical framework that substantially outperforms roofline models on real kernels. The low reported errors on independent application suites, the explicit separation of microbenchmark fitting from validation kernels, and the planned open-source release are concrete strengths that would make the models immediately useful for performance engineers and architects working on rapidly evolving GPU platforms.
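For contrast, the naive roofline baseline the paper measures against reduces to a single max over two lower bounds. A minimal sketch, with placeholder peak numbers rather than any device's actual specifications:

```python
# Minimal sketch of a naive roofline baseline: predicted time is the larger
# of the compute-bound and bandwidth-bound lower bounds. Peaks are placeholders.

def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bw: float) -> float:
    """Roofline lower bound on kernel execution time, in seconds."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Example: a 1 GFLOP kernel moving 4 GB on a 100 TFLOP/s, 2 TB/s device.
print(roofline_time(1e9, 4e9, 100e12, 2e12))  # 0.002: memory-bound
```

Because this bound ignores caches, occupancy, and asynchronous copy engines, it is unsurprising that it can miss real kernel times by the >95% margins the paper reports.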
major comments (1)
- [Abstract and validation section] The central claim that microbenchmark parameters compose accurately for validation kernels rests on the assumption that all performance-critical interactions (TMEM+TMA, VGPR+occupancy, etc.) are either independently measured or provably additive in the model equations. The abstract and validation description report impressive MAE numbers but do not provide the explicit combination formulas or an analysis showing that the validation kernels do not exercise unbenchmarked regimes or cross-effects. This is load-bearing for the generality claim.
minor comments (2)
- [Abstract] The abstract states that the models are updated with HBM bandwidth, capacity, and cache parameters for H200/MI250X; a short table or paragraph listing the exact parameter values used for each architecture would improve clarity and reproducibility.
- [Abstract] The manuscript promises open-source release of models and benchmarks; including a placeholder repository URL or DOI at submission time would strengthen the reproducibility claim.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the constructive major comment. We address the point directly below and will incorporate revisions to strengthen the explicit presentation of model compositionality.
Point-by-point responses
Referee: [Abstract and validation section] The central claim that microbenchmark parameters compose accurately for validation kernels rests on the assumption that all performance-critical interactions (TMEM+TMA, VGPR+occupancy, etc.) are either independently measured or provably additive in the model equations. The abstract and validation description report impressive MAE numbers but do not provide the explicit combination formulas or an analysis showing that the validation kernels do not exercise unbenchmarked regimes or cross-effects. This is load-bearing for the generality claim.
Authors: We agree that the abstract is high-level by design and that the validation section emphasizes empirical results. The explicit combination formulas are derived in Sections 3 (B200) and 4 (MI300A), where parameters from microbenchmarks are composed via bottleneck analysis (e.g., effective memory bandwidth as the minimum across HBM, Infinity Cache, and TMA limits; tensor throughput as the minimum of tensor-core peak, TMEM bandwidth, and TMA copy rate; occupancy as a function of VGPR usage and wavefront scheduling). These sections detail the additive and min/max structure of the equations. Section 5 selects the 21/27 validation kernels to cover distinct regimes (compute-bound, memory-bound, mixed-precision, and irregular access patterns) that exercise the modeled interactions in combination. The reported MAE values (1.31% and 0.09%) on these independent kernels provide empirical support that unbenchmarked cross-effects are negligible within the tested space. To make this compositionality transparent in the validation context and directly respond to the concern, we will add a dedicated paragraph to the validation section that (i) restates the key combination formulas, (ii) maps each formula to the kernel categories used, and (iii) discusses the coverage of potential cross-effects with reference to the microbenchmark suite. This revision will be included in the next version.
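The min/max composition the authors describe can be written schematically. The paper's actual equations live in its Sections 3 and 4; every rate below is an illustrative stand-in in consistent units, not a fitted value:

```python
# Schematic of the bottleneck (min/max) composition described in the rebuttal.
# All rates are illustrative stand-ins in consistent units, not fitted values.

def effective_bandwidth(hbm_bw: float, cache_bw: float, tma_rate: float) -> float:
    # Memory traffic cannot flow faster than the slowest stage in its path.
    return min(hbm_bw, cache_bw, tma_rate)

def tensor_throughput(tc_peak: float, tmem_rate: float, tma_feed: float) -> float:
    # Tensor-core math is capped by its peak and by the TMEM/TMA feed rates.
    return min(tc_peak, tmem_rate, tma_feed)

def predicted_time(flops: float, bytes_moved: float,
                   compute_rate: float, mem_rate: float) -> float:
    # The slower of the compute and memory phases sets the kernel time.
    return max(flops / compute_rate, bytes_moved / mem_rate)
```

A kernel prediction then chains these: derive the effective memory rate and the delivered tensor rate, and take the bottleneck of the two phases. The referee's concern is precisely whether this chaining remains valid when components interact.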
Circularity Check
No significant circularity: the models are fitted exclusively to microbenchmarks and validated on independent kernels.
Full rationale
The derivation chain begins with systematic microbenchmark characterization to isolate and measure architecture-specific components (TMEM/TMA/tensor cores for Blackwell; Infinity Cache/VGPR/occupancy for CDNA3), from which a small number of parameters are fitted. These fitted models are then applied to entirely separate validation sets (21 kernels for B200, 27 for MI300A, plus Rodinia 3.1 and SPEChpc 2021 Tiny). The reported MAE values are computed on these held-out kernels rather than on the microbenchmark data used for fitting. No equations reduce a prediction to its own inputs by construction, no self-citations are load-bearing for the central claims, and no uniqueness theorems or ansatzes are smuggled in. The extension to H200/MI250X uses only parameter updates without model restructuring. The chain is therefore self-contained against external benchmarks.
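The held-out evaluation described above can be sketched as a percentage-error computation over validation kernels never seen during fitting. The paper reports its errors as percentages, read here as mean absolute percentage error; the numbers below are invented for illustration:

```python
# Sketch of a held-out error metric: percentage error averaged over
# validation kernels never used for fitting. All numbers here are invented.

def mae_percent(predicted: list[float], measured: list[float]) -> float:
    """Mean absolute percentage error of predictions against measurements."""
    assert predicted and len(predicted) == len(measured)
    errs = [abs(p - m) / m * 100.0 for p, m in zip(predicted, measured)]
    return sum(errs) / len(errs)

# Illustrative held-out kernel times (ms): model predictions vs. measurements.
pred = [1.02, 4.95, 10.1]
meas = [1.00, 5.00, 10.0]
print(round(mae_percent(pred, meas), 2))  # 1.33
```

Computing the metric only on such held-out kernels is what keeps the fit/validate split non-circular.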
Axiom & Free-Parameter Ledger
free parameters (3)
- HBM bandwidth and capacity
- Cache hierarchy parameters
- Occupancy and VGPR constraints
axioms (2)
- domain assumption: Microbenchmarks accurately isolate and measure individual performance components without interference.
- domain assumption: Performance of complex kernels can be predicted by composing the measured component models.