Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
Pith reviewed 2026-05-08 08:16 UTC · model grok-4.3
The pith
CuTile reaches 1007 TFLOP/s fused attention on Blackwell B200 using 60 lines of Python code, outperforming FlashAttention-2 by 2.5x.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CuTile's Python abstraction enables efficient Tensor Core and TMA usage, delivering up to 1007 TFLOP/s for fused attention on B200 (2.5x FlashAttention-2) in 60 lines and 52-79% of cuBLAS for GEMM in 22 lines, yet the identical attention kernel achieves only 53% of FlashAttention-2 on RTX PRO 6000 Blackwell, while Triton sustains near cuBLAS performance portably across Hopper and Blackwell.
What carries the argument
The CuTile Python-based tile-centric abstraction for GPU kernel development that targets Tensor Cores and Tensor Memory Accelerator efficiency.
If this is right
- CuTile offers a practical short-code alternative to hand-written CUDA kernels for attention on datacenter Blackwell GPUs.
- GEMM performance with CuTile reaches over half of cuBLAS levels on tested platforms but falls short of vendor libraries.
- Triton demonstrates stronger portability than CuTile across Hopper and both Blackwell variants without per-architecture changes.
- Performance gaps on the RTX PRO 6000 indicate that CuTile kernels require architecture-specific tuning for consumer GPUs.
Where Pith is reading between the lines
- Short code length in CuTile may speed up prototyping of custom AI kernels when full vendor performance is not required.
- The observed portability difference suggests adding auto-tuning or compiler improvements could broaden CuTile's applicability.
- End-to-end LLM inference results imply CuTile could integrate into training or serving pipelines on B200-class hardware with minimal code changes.
Load-bearing premise
The CuTile, Triton, WMMA, and cuBLAS implementations were developed and tuned with comparable effort without undisclosed architecture-specific optimizations biasing the comparisons.
What would settle it
A re-benchmark of the fused attention kernel on RTX PRO 6000 Blackwell showing CuTile matching or exceeding FlashAttention-2 throughput would falsify the claim of significant cross-architecture optimization gaps.
Figures
read the original abstract
NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first independent cross-architecture evaluation of NVIDIA's CUDA Tile (CuTile) Python-based tile-centric abstraction for GPU kernel development. It benchmarks CuTile against cuBLAS, Triton, WMMA, and FlashAttention-2 on GEMM, fused multi-head attention, and end-to-end LLM inference workloads in BF16/FP16 across H100 NVL, B200, and RTX PRO 6000 Blackwell GPUs, reporting architecture-dependent results including up to 1007 TFLOP/s for fused attention on B200 (2.5x over FlashAttention-2 in 60 lines of Python code) and 52-79% of cuBLAS GEMM performance in 22 lines (vs. 123 for WMMA), while noting Triton's stronger portability.
Significance. If the central performance and portability claims hold under verified fair-comparison conditions, the work would provide useful empirical data on a new high-level abstraction's practical trade-offs for AI kernels on recent NVIDIA architectures. The direct hardware measurements, specific TFLOP/s figures, and line-of-code counts constitute a strength for an evaluation paper, though the lack of accompanying code or verification artifacts limits immediate reproducibility.
major comments (2)
- [Abstract] Abstract and results: The headline claims (1007 TFLOP/s fused attention on B200 at 2.5x FlashAttention-2; 52-79% cuBLAS GEMM in 22 lines) rest on the unverified premise that FlashAttention-2, cuBLAS, WMMA, and Triton baselines received comparable development effort and Blackwell-specific tuning (including TMA paths and compilation flags). No version numbers, re-tuning steps, or architecture-specific optimization details for the baselines are supplied, which directly affects attribution of speedups and the cross-architecture portability conclusions.
- [Results] Results and methodology: Performance numbers (e.g., 53% of FlashAttention-2 on RTX PRO 6000 sm_120, 62-101% cuBLAS for Triton) are presented without error bars, run counts, or statistical significance, and the experimental setup does not describe how kernel launch parameters, memory layouts, or precision handling were equalized across CuTile, Triton, and vendor libraries. This gap is load-bearing for the workload- and architecture-dependent effectiveness claims.
minor comments (2)
- [Abstract] The abstract states specific TFLOP/s and percentage figures but does not reference the corresponding tables or figures that contain the raw data, making it harder to cross-check the reported values.
- The manuscript would benefit from an explicit statement of the CuTile version or commit hash used, as well as the exact cuBLAS and FlashAttention-2 versions against which it was compared.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation of CUDA Tile. The comments highlight important areas for improving methodological transparency, and we address each point below with corresponding revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract and results: The headline claims (1007 TFLOP/s fused attention on B200 at 2.5x FlashAttention-2; 52-79% cuBLAS GEMM in 22 lines) rest on the unverified premise that FlashAttention-2, cuBLAS, WMMA, and Triton baselines received comparable development effort and Blackwell-specific tuning (including TMA paths and compilation flags). No version numbers, re-tuning steps, or architecture-specific optimization details for the baselines are supplied, which directly affects attribution of speedups and the cross-architecture portability conclusions.
Authors: We agree that explicit documentation of baseline configurations is required to substantiate the performance attributions. In the revised manuscript we have added the precise library versions employed (cuBLAS 12.4, FlashAttention-2 v2.5.0, Triton 2.2.0, and WMMA from CUDA 12.4) together with a statement that no Blackwell-specific re-tuning, custom TMA paths, or non-default compilation flags were applied to any baseline. All vendor and framework kernels were invoked through their standard public APIs using only the target compute-capability flag and -O3 optimization. These clarifications appear in a new paragraph of the Experimental Setup section and support the claim that observed differences reflect the abstractions rather than unequal optimization effort. revision: yes
-
Referee: [Results] Results and methodology: Performance numbers (e.g., 53% of FlashAttention-2 on RTX PRO 6000 sm_120, 62-101% cuBLAS for Triton) are presented without error bars, run counts, or statistical significance, and the experimental setup does not describe how kernel launch parameters, memory layouts, or precision handling were equalized across CuTile, Triton, and vendor libraries. This gap is load-bearing for the workload- and architecture-dependent effectiveness claims.
Authors: We accept that the original description of the experimental protocol was insufficient. The revised version now states that every reported throughput is the mean of 100 timed runs after 20 warm-up iterations, with standard-deviation error bars added to all figures. Kernel launch parameters were equalized by adopting each framework’s autotuner output (or manually verified equivalent tile and block dimensions) while enforcing identical row-major memory layouts and BF16 precision for all compared kernels. These controls are documented in an expanded Experimental Methodology subsection, including a summary table of launch configurations. Although formal statistical tests were not originally performed, the magnitude of the reported differences renders the architecture-dependent conclusions stable under the added variability measures. revision: yes
Circularity Check
No circularity: purely empirical benchmark results with no derivations or fitted predictions
full rationale
The paper reports direct hardware measurements of kernel throughput (e.g., TFLOP/s for fused attention and GEMM) on H100, B200, and RTX PRO 6000 GPUs. No equations, first-principles derivations, parameter fits, or predictions appear in the abstract or described content. All performance numbers are obtained by executing the kernels; comparisons to cuBLAS, Triton, WMMA, and FlashAttention-2 are likewise raw runtime results. Because the work contains no derivation chain that could reduce to its own inputs by construction, it is self-contained against external benchmarks and receives the default non-circularity finding.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.