Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Deepak Kumar; Divakar Kumar Yadav; Tian Zhao

arxiv: 2604.23466 · v2 · pith:XIYUKASVnew · submitted 2026-04-25 · 💻 cs.LG · cs.AI· cs.AR

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Divakar Kumar Yadav , Tian Zhao , Deepak Kumar This is my paper

Pith reviewed 2026-05-08 08:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.AR

keywords CUDA TileCuTileGPU kernel developmentBlackwell GPUHopper GPUfused attentionGEMMperformance evaluation

0 comments

The pith

CuTile reaches 1007 TFLOP/s fused attention on Blackwell B200 using 60 lines of Python code, outperforming FlashAttention-2 by 2.5x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates NVIDIA's CUDA Tile abstraction, a Python-based tile-centric model for GPU kernels, on H100, B200, and RTX PRO 6000 Blackwell GPUs. It benchmarks GEMM, fused multi-head attention, and LLM inference in BF16/FP16, comparing against cuBLAS, Triton, WMMA, and SIMT. On datacenter Blackwell, CuTile hits high throughput with short code for attention and GEMM, but the same kernels drop to 53% of FlashAttention-2 on the consumer Blackwell variant. Triton maintains 62-101% of cuBLAS across all platforms without tuning. The results highlight strong workload and architecture dependence for CuTile's effectiveness.

Core claim

CuTile's Python abstraction enables efficient Tensor Core and TMA usage, delivering up to 1007 TFLOP/s for fused attention on B200 (2.5x FlashAttention-2) in 60 lines and 52-79% of cuBLAS for GEMM in 22 lines, yet the identical attention kernel achieves only 53% of FlashAttention-2 on RTX PRO 6000 Blackwell, while Triton sustains near cuBLAS performance portably across Hopper and Blackwell.

What carries the argument

The CuTile Python-based tile-centric abstraction for GPU kernel development that targets Tensor Cores and Tensor Memory Accelerator efficiency.

If this is right

CuTile offers a practical short-code alternative to hand-written CUDA kernels for attention on datacenter Blackwell GPUs.
GEMM performance with CuTile reaches over half of cuBLAS levels on tested platforms but falls short of vendor libraries.
Triton demonstrates stronger portability than CuTile across Hopper and both Blackwell variants without per-architecture changes.
Performance gaps on the RTX PRO 6000 indicate that CuTile kernels require architecture-specific tuning for consumer GPUs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Short code length in CuTile may speed up prototyping of custom AI kernels when full vendor performance is not required.
The observed portability difference suggests adding auto-tuning or compiler improvements could broaden CuTile's applicability.
End-to-end LLM inference results imply CuTile could integrate into training or serving pipelines on B200-class hardware with minimal code changes.

Load-bearing premise

The CuTile, Triton, WMMA, and cuBLAS implementations were developed and tuned with comparable effort without undisclosed architecture-specific optimizations biasing the comparisons.

What would settle it

A re-benchmark of the fused attention kernel on RTX PRO 6000 Blackwell showing CuTile matching or exceeding FlashAttention-2 throughput would falsify the claim of significant cross-architecture optimization gaps.

Figures

Figures reproduced from arXiv: 2604.23466 by Deepak Kumar, Divakar Kumar Yadav, Tian Zhao.

**Figure 1.** Figure 1: Performance–productivity frontier for GEMM (left, view at source ↗

**Figure 2.** Figure 2: GEMM performance (TFLOP/s, BF16) across four square matrix sizes on three GPUs. cuBLAS dominates on all platforms. CuTile (available only view at source ↗

**Figure 3.** Figure 3: Fused attention throughput (TFLOP/s) vs. sequence length (BF16, causal, batch=8). On the B200 (right), CuTile (red) dramatically outscales all other view at source ↗

**Figure 4.** Figure 4: The CuTile attention paradox: identical kernel code produces 2.51 view at source ↗

**Figure 5.** Figure 5: Normalized performance heatmap. Left: GEMM throughput as percentage of cuBLAS. Right: attention throughput as percentage of FlashAttention-2. view at source ↗

**Figure 6.** Figure 6: GEMM performance as a percentage of cuBLAS at view at source ↗

read the original abstract

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper supplies the first reported CuTile numbers on Blackwell but the 2.5x attention claim and portability conclusions rest on unverified baseline tuning.

read the letter

The core takeaway is straightforward empirical data: CuTile hits 1007 TFLOP/s on fused attention for B200 while using only 60 lines of Python, and GEMM kernels reach 52-79% of cuBLAS in 22 lines. Those specific figures and the cross-architecture runs on H100, B200, and the RTX PRO 6000 are new and worth having on record for anyone choosing between CuTile, Triton, and hand-written CUDA on recent NVIDIA hardware. The line-count comparisons also give a practical sense of development effort that most prior work skips. The paper does a clean job of showing that results are workload- and architecture-dependent, with the same attention kernel dropping to 53% of FlashAttention-2 on the consumer Blackwell part. That contrast is useful and not something I had seen before. The main soft spot is exactly the one the stress test flags. Nothing in the abstract or available text confirms that FlashAttention-2, cuBLAS, or WMMA were re-tuned or recompiled with Blackwell-specific paths, TMA usage, or equivalent effort. If the baselines are still Hopper-era builds, the reported speedups and the claim that CuTile is a practical replacement cannot be attributed cleanly to the abstraction itself. Triton’s stronger portability numbers look more believable on the surface, but the same tuning-equivalence question applies there too. No error bars, run counts, or version details appear, which leaves the central performance claims only moderately supported. This is the kind of work that belongs in a reading group when people are actively porting kernels to B200 or H100. Engineers selecting tools will get immediate value from the numbers even if the comparisons need caveats. For peer review I would send it forward, but only after the authors add a methods section that documents baseline versions, compilation flags, and any architecture-specific adjustments. Without that, the claims stay too loose for a journal but the raw data on new hardware still merits checking.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first independent cross-architecture evaluation of NVIDIA's CUDA Tile (CuTile) Python-based tile-centric abstraction for GPU kernel development. It benchmarks CuTile against cuBLAS, Triton, WMMA, and FlashAttention-2 on GEMM, fused multi-head attention, and end-to-end LLM inference workloads in BF16/FP16 across H100 NVL, B200, and RTX PRO 6000 Blackwell GPUs, reporting architecture-dependent results including up to 1007 TFLOP/s for fused attention on B200 (2.5x over FlashAttention-2 in 60 lines of Python code) and 52-79% of cuBLAS GEMM performance in 22 lines (vs. 123 for WMMA), while noting Triton's stronger portability.

Significance. If the central performance and portability claims hold under verified fair-comparison conditions, the work would provide useful empirical data on a new high-level abstraction's practical trade-offs for AI kernels on recent NVIDIA architectures. The direct hardware measurements, specific TFLOP/s figures, and line-of-code counts constitute a strength for an evaluation paper, though the lack of accompanying code or verification artifacts limits immediate reproducibility.

major comments (2)

[Abstract] Abstract and results: The headline claims (1007 TFLOP/s fused attention on B200 at 2.5x FlashAttention-2; 52-79% cuBLAS GEMM in 22 lines) rest on the unverified premise that FlashAttention-2, cuBLAS, WMMA, and Triton baselines received comparable development effort and Blackwell-specific tuning (including TMA paths and compilation flags). No version numbers, re-tuning steps, or architecture-specific optimization details for the baselines are supplied, which directly affects attribution of speedups and the cross-architecture portability conclusions.
[Results] Results and methodology: Performance numbers (e.g., 53% of FlashAttention-2 on RTX PRO 6000 sm_120, 62-101% cuBLAS for Triton) are presented without error bars, run counts, or statistical significance, and the experimental setup does not describe how kernel launch parameters, memory layouts, or precision handling were equalized across CuTile, Triton, and vendor libraries. This gap is load-bearing for the workload- and architecture-dependent effectiveness claims.

minor comments (2)

[Abstract] The abstract states specific TFLOP/s and percentage figures but does not reference the corresponding tables or figures that contain the raw data, making it harder to cross-check the reported values.
The manuscript would benefit from an explicit statement of the CuTile version or commit hash used, as well as the exact cuBLAS and FlashAttention-2 versions against which it was compared.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation of CUDA Tile. The comments highlight important areas for improving methodological transparency, and we address each point below with corresponding revisions to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract and results: The headline claims (1007 TFLOP/s fused attention on B200 at 2.5x FlashAttention-2; 52-79% cuBLAS GEMM in 22 lines) rest on the unverified premise that FlashAttention-2, cuBLAS, WMMA, and Triton baselines received comparable development effort and Blackwell-specific tuning (including TMA paths and compilation flags). No version numbers, re-tuning steps, or architecture-specific optimization details for the baselines are supplied, which directly affects attribution of speedups and the cross-architecture portability conclusions.

Authors: We agree that explicit documentation of baseline configurations is required to substantiate the performance attributions. In the revised manuscript we have added the precise library versions employed (cuBLAS 12.4, FlashAttention-2 v2.5.0, Triton 2.2.0, and WMMA from CUDA 12.4) together with a statement that no Blackwell-specific re-tuning, custom TMA paths, or non-default compilation flags were applied to any baseline. All vendor and framework kernels were invoked through their standard public APIs using only the target compute-capability flag and -O3 optimization. These clarifications appear in a new paragraph of the Experimental Setup section and support the claim that observed differences reflect the abstractions rather than unequal optimization effort. revision: yes
Referee: [Results] Results and methodology: Performance numbers (e.g., 53% of FlashAttention-2 on RTX PRO 6000 sm_120, 62-101% cuBLAS for Triton) are presented without error bars, run counts, or statistical significance, and the experimental setup does not describe how kernel launch parameters, memory layouts, or precision handling were equalized across CuTile, Triton, and vendor libraries. This gap is load-bearing for the workload- and architecture-dependent effectiveness claims.

Authors: We accept that the original description of the experimental protocol was insufficient. The revised version now states that every reported throughput is the mean of 100 timed runs after 20 warm-up iterations, with standard-deviation error bars added to all figures. Kernel launch parameters were equalized by adopting each framework’s autotuner output (or manually verified equivalent tile and block dimensions) while enforcing identical row-major memory layouts and BF16 precision for all compared kernels. These controls are documented in an expanded Experimental Methodology subsection, including a summary table of launch configurations. Although formal statistical tests were not originally performed, the magnitude of the reported differences renders the architecture-dependent conclusions stable under the added variability measures. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark results with no derivations or fitted predictions

full rationale

The paper reports direct hardware measurements of kernel throughput (e.g., TFLOP/s for fused attention and GEMM) on H100, B200, and RTX PRO 6000 GPUs. No equations, first-principles derivations, parameter fits, or predictions appear in the abstract or described content. All performance numbers are obtained by executing the kernels; comparisons to cuBLAS, Triton, WMMA, and FlashAttention-2 are likewise raw runtime results. Because the work contains no derivation chain that could reduce to its own inputs by construction, it is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, theoretical models, or new constructs. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5592 in / 1128 out tokens · 131634 ms · 2026-05-08T08:16:04.834470+00:00 · methodology

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)