arxiv: 2603.01960 · v2 · submitted 2026-03-02 · 💻 cs.LG · cs.AI

TiledAttention: a CUDA Tile SDPA Kernel for PyTorch

Pith reviewed 2026-05-15 17:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords TiledAttention · scaled dot-product attention · SDPA · online softmax · PyTorch kernel · CUDA tile · attention performance · kernel research

The pith

TiledAttention offers a Python-modifiable SDPA kernel that achieves large speedups over eager attention in PyTorch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TiledAttention as a forward operator for scaled dot-product attention on NVIDIA GPUs. It implements the operator through a tile-based approach exposed as a PyTorch function, using online softmax and streaming of key-value tiles to maintain realistic performance. The approach allows direct edits to tile shapes, staging, and memory layout from Python, avoiding the need for full low-level CUDA rewrites. Benchmarks across sequence lengths, head dimensions, and precisions show it outperforms standard eager attention paths while remaining usable directly in PyTorch workflows.

Core claim

TiledAttention follows the established online-softmax formulation for scaled dot-product attention but realizes it through a schedule-level implementation in a high-level tile language that permits rapid modifications to parameters such as tile shapes and shared-memory layouts while retaining tiled streaming of K and V.
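
For reference, the online-softmax recurrence this claim refers to is usually written as below. This is the standard FlashAttention-style form, not an excerpt from the paper. The symbols $Q_i$ (one query tile), $K_j, V_j$ (the $j$-th key/value tile), $m$, $\ell$, $O$, the head dimension $d$, and the tile count $T$ are notation chosen here for illustration.

```latex
\begin{aligned}
S_{ij} &= \frac{Q_i K_j^{\top}}{\sqrt{d}}, \\
m_i^{(j)} &= \max\!\left(m_i^{(j-1)},\ \operatorname{rowmax}(S_{ij})\right), \\
P_{ij} &= \exp\!\left(S_{ij} - m_i^{(j)}\right), \\
\ell_i^{(j)} &= e^{\,m_i^{(j-1)} - m_i^{(j)}}\,\ell_i^{(j-1)} + \operatorname{rowsum}(P_{ij}), \\
O_i^{(j)} &= e^{\,m_i^{(j-1)} - m_i^{(j)}}\,O_i^{(j-1)} + P_{ij} V_j, \\
O_i &= \operatorname{diag}\!\left(\ell_i^{(T)}\right)^{-1} O_i^{(T)},
\qquad m_i^{(0)} = -\infty,\ \ \ell_i^{(0)} = 0,\ \ O_i^{(0)} = 0 .
\end{aligned}
```

Only one $K_j, V_j$ tile and the running statistics $m$, $\ell$, $O$ are live at any step, which is why the full score matrix never has to be materialized.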

What carries the argument

The tiled streaming of K and V combined with online softmax updates, implemented at the schedule level for direct Python-level edits to shapes and staging.
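
The streaming loop described above can be sketched in plain Python. This is a didactic reference, assuming nothing about the paper's cuTile code: the function names `naive_attention`, `tiled_attention`, and `max_abs_diff` and the `tile_n` parameter are hypothetical, and the loops run on the CPU over nested lists rather than on GPU tiles.

```python
import math


def naive_attention(q, k, v):
    """Reference path: compute every score for a query row, then softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, tile_n=4):
    """Stream K/V in tiles of tile_n rows with online-softmax updates."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        m = -math.inf         # running row maximum
        l = 0.0               # running softmax denominator
        acc = [0.0] * dv      # running unnormalized output row
        for start in range(0, len(k), tile_n):
            k_tile = k[start:start + tile_n]
            v_tile = v[start:start + tile_n]
            scores = [scale * sum(a * b for a, b in zip(qi, kj))
                      for kj in k_tile]
            m_new = max(m, max(scores))
            alpha = math.exp(m - m_new)   # rescales earlier partial results
            p = [math.exp(s - m_new) for s in scores]
            l = alpha * l + sum(p)
            acc = [alpha * a + sum(pj * vj[d] for pj, vj in zip(p, v_tile))
                   for d, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])   # normalize once at the end
    return out


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    # Small deterministic inputs: 5 queries, 10 keys/values, head dim 4.
    q = [[((i * 7 + j * 3) % 5 - 2) / 2 for j in range(4)] for i in range(5)]
    k = [[((i * 5 + j * 2) % 7 - 3) / 3 for j in range(4)] for i in range(10)]
    v = [[((i * 3 + j) % 4 - 1.5) for j in range(3)] for i in range(10)]
    print(max_abs_diff(tiled_attention(q, k, v, tile_n=3),
                       naive_attention(q, k, v)))
```

Changing `tile_n` alters only how K and V are chunked, not the result, which is the property that makes schedule-level edits safe from a numerical standpoint.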

If this is right

  • Direct integration into PyTorch models for attention research without separate compilation steps.
  • Faster iteration on custom attention variants through schedule edits rather than template rewrites.
  • Reproducible speedups over torch_sdpa_math and standard eager paths across FP16 and BF16.
  • A middle option between rigid production fused kernels and slow pure Python attention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to testing new attention variants by swapping only the tile schedule without touching surrounding code.
  • Researchers might combine it with emerging precision formats to measure tradeoffs more quickly than with fixed kernels.
  • It points toward a broader pattern where high-level scheduling languages lower the cost of exploring kernel-level optimizations.

Load-bearing premise

Schedule-level changes in the tile implementation will preserve performance levels close to those of hand-tuned low-level code.
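
One way to explore this premise before timing anything is to enumerate candidate schedules and discard those whose on-chip working set cannot fit. The sketch below is a rough estimate under stated assumptions, not the cuTile API: `tile_working_set_bytes`, `candidate_schedules`, the byte-count model (one Q tile, `stages` copies of a K and a V tile, an FP32 accumulator plus running max and sum), and the budget passed in are all hypothetical.

```python
from itertools import product

BYTES_PER_ELEMENT = {"fp16": 2, "bf16": 2, "fp32": 4}


def tile_working_set_bytes(tile_m, tile_n, head_dim, dtype="fp16", stages=1):
    """Rough on-chip bytes for one program instance (assumed model)."""
    elem = BYTES_PER_ELEMENT[dtype]
    q_bytes = tile_m * head_dim * elem                 # one Q tile
    kv_bytes = 2 * stages * tile_n * head_dim * elem   # staged K and V tiles
    acc_bytes = tile_m * head_dim * 4 + 2 * tile_m * 4 # FP32 output, max, sum
    return q_bytes + kv_bytes + acc_bytes


def candidate_schedules(head_dim, budget_bytes, dtype="fp16",
                        tile_sizes=(32, 64, 128), stage_counts=(1, 2)):
    """List (tile_m, tile_n, stages) combinations that fit the budget."""
    fits = []
    for tile_m, tile_n, stages in product(tile_sizes, tile_sizes, stage_counts):
        size = tile_working_set_bytes(tile_m, tile_n, head_dim, dtype, stages)
        if size <= budget_bytes:
            fits.append((tile_m, tile_n, stages))
    return fits


if __name__ == "__main__":
    # The 100 KiB budget is an illustrative number, not a hardware spec.
    for schedule in candidate_schedules(head_dim=128, budget_bytes=100 * 1024):
        print(schedule, tile_working_set_bytes(*schedule[:2], 128, "fp16",
                                               schedule[2]))
```

A filter like this narrows the sweep; whether the surviving schedules actually hold their performance is exactly what the premise leaves to measurement.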

What would settle it

Run the kernel on a new sequence length or head dimension where measured throughput falls below that of the unfused eager baseline while numerical outputs remain correct.
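
The check above can be made mechanical. The sketch below assumes the common convention of counting only the two matrix multiplications (QKᵀ and PV) at 2 FLOPs per multiply-add, giving 4·B·H·S²·D forward FLOPs for the non-causal case; the paper's own throughput metric may be defined differently. All function names and the tolerance argument are hypothetical.

```python
def attention_flops(batch, heads, seq_len, head_dim, causal=False):
    """Forward FLOPs from the two matmuls; softmax work is ignored."""
    flops = 4 * batch * heads * seq_len * seq_len * head_dim
    # Causal masking skips roughly half of the score matrix.
    return flops // 2 if causal else flops


def tflops_per_second(flops, milliseconds):
    """Convert a FLOP count and a latency in ms to TFLOP/s."""
    return flops / (milliseconds * 1e-3) / 1e12


def premise_fails(tiled_ms, eager_ms, max_abs_error, tolerance):
    """True when outputs are within tolerance but tiled is slower than eager."""
    return max_abs_error <= tolerance and tiled_ms > eager_ms


if __name__ == "__main__":
    flops = attention_flops(batch=1, heads=1, seq_len=1024, head_dim=128)
    # The latencies below are placeholders, not measurements.
    print(tflops_per_second(flops, milliseconds=1.0))
    print(premise_fails(tiled_ms=2.0, eager_ms=1.5,
                        max_abs_error=1e-3, tolerance=2e-3))
```

Requiring correct outputs in `premise_fails` separates a genuine performance regression from a case where the kernel is simply broken at that shape.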

Figures

Figures reproduced from arXiv: 2603.01960 by Taimur Khan.

Figure 1. As sequence length S increases, SDPA increasingly dominates end-to-end throughput. Plot title: "Long context: attention dominates token throughput"; legend: Other ops, Attention.
Figure 2
Figure 3. Throughput versus sequence length for D=128 (FP16/BF16, non-causal).
Adjacent text (Section 4.2, Explicit baseline summary): To make the value proposition explicit for HPC users, we report TiledAttention not only against fused PyTorch SDPA, but also against unfused baselines
Figure 4. Explicit baseline comparison (FP16): TiledAttention vs fused PyTorch SDPA, math-SDPA, and eager attention.
Figure 5. Relative performance regime map (TiledAttention as % of fused PyTorch SDPA) for FP16 non-causal runs.
Adjacent text (Section 4.4, Profiling-guided bottleneck analysis): To interpret scaling trends, we profile representative shapes with Nsight Compute (counters) and Nsight Systems (timeline/API overhead). We track throughput, memory traffic, and stalls to identify when the kernel becomes memory-bound, what limits short-S, and which …
Figure 6. Normalized bandwidth proxy versus sequence length (FP16, D=128, non-causal).
Adjacent text (Section 4.5, Sensitivity to tiling parameters): A central advantage of expressing the kernel as a tile program is the ability to expose and sweep tiling parameters
Original abstract

TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. Algorithmically, TiledAttention follows the established FlashAttention-style online-softmax formulation; our novelty is the cuTile/TileIR implementation strategy, schedule-level modifiability, and reproducible benchmarking/profiling workflow. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch), explicit unfused baselines (torch_sdpa_math, standard eager attention), and forced backend probes (FlashAttention2, EfficientAttention, CuDNN Attention) across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents TiledAttention, a scaled dot-product attention (SDPA) forward operator implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function. It follows the established FlashAttention-style online-softmax formulation with tiled K/V streaming; the claimed novelty lies in the schedule-level modifiability (tile shapes, staging, shared-memory layout) from Python while retaining realistic performance. The authors provide a reproducible benchmarking harness on an NVIDIA DGX GB10 node and compare against PyTorch SDPA auto-dispatch, torch_sdpa_math, eager attention, and forced backends (FlashAttention2, EfficientAttention, CuDNN), reporting large speedups over eager paths and a practical balance between performance and customizability.

Significance. If the central performance claims hold, the work offers a useful engineering contribution by lowering the barrier to SDPA kernel experimentation through Python-level schedule edits without requiring full CUDA/CUTLASS rewrites. The reproducible benchmarking workflow and direct PyTorch integration are concrete strengths that could aid rapid research iteration in attention mechanisms.

major comments (2)
  1. [Abstract] Abstract: the central claim that TiledAttention provides a 'practical balance between performance and customizability' rests on the untested assumption that Python-level cuTile/TileIR schedule modifications (tile shapes, staging, shared-memory layout) incur negligible overhead relative to hand-tuned CUDA. No ablation data, before/after timing for modified vs. unmodified kernels, or quantitative overhead measurements are presented to substantiate that modifiability preserves the reported speedups over eager attention.
  2. [Abstract] Abstract and benchmarking description: while the manuscript states that 'TiledAttention delivers large speedups over standard eager attention paths,' the provided text contains no specific speedup factors, latency tables, or references to figures that would allow verification of the magnitude of improvement across sequence lengths, head dimensions, and precisions (FP16/BF16). This makes the performance claims difficult to assess independently.
minor comments (2)
  1. The abstract refers to an 'NVIDIA DGX GB10 node' without clarifying the exact GPU model or configuration details (e.g., SM count, memory bandwidth) that would aid reproducibility of the harness.
  2. [Abstract] Minor notation inconsistency: the abstract uses both 'SDPA' and 'scaled dot-product attention' without an initial definition on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the presentation of our claims.

  1. Referee: [Abstract] Abstract: the central claim that TiledAttention provides a 'practical balance between performance and customizability' rests on the untested assumption that Python-level cuTile/TileIR schedule modifications (tile shapes, staging, shared-memory layout) incur negligible overhead relative to hand-tuned CUDA. No ablation data, before/after timing for modified vs. unmodified kernels, or quantitative overhead measurements are presented to substantiate that modifiability preserves the reported speedups over eager attention.

    Authors: We acknowledge that the manuscript does not contain explicit ablation experiments quantifying the runtime overhead of performing schedule edits (tile shapes, staging, shared-memory layout) from Python relative to a static hand-tuned CUDA baseline. Our current evidence for the 'practical balance' consists of end-to-end speedups versus eager attention together with the observation that the same cuTile kernel can be edited without rewriting CUDA. To directly address the referee's concern, we will add a new ablation subsection in the revised version that reports before/after timings for representative schedule modifications on the same hardware. revision: yes

  2. Referee: [Abstract] Abstract and benchmarking description: while the manuscript states that 'TiledAttention delivers large speedups over standard eager attention paths,' the provided text contains no specific speedup factors, latency tables, or references to figures that would allow verification of the magnitude of improvement across sequence lengths, head dimensions, and precisions (FP16/BF16). This makes the performance claims difficult to assess independently.

    Authors: We agree that the abstract would be clearer with concrete numerical examples and explicit figure references. The full manuscript already contains the requested latency tables and plots across sequence lengths, head dimensions, and both FP16/BF16; however, the abstract summarizes them only qualitatively. In the revision we will insert representative speedup ranges (e.g., 4-12x versus torch_sdpa_math for sequence lengths 1k-8k) together with direct citations to the relevant figures and tables. revision: yes
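
The before/after timings promised in the first response could be collected with a harness along the following lines. This is a minimal sketch, not the paper's benchmarking code: `median_latency_ms` and `compare_schedules` are hypothetical names, and the `sync` hook is an assumption for asynchronous GPU launches, where one would pass something like `torch.cuda.synchronize` so the timer covers kernel completion.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, repeats=10, sync=None):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):       # warm caches and any lazy compilation
        fn()
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()                # drain earlier asynchronous work
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                # wait for this call to finish
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def compare_schedules(variants, **timing_kwargs):
    """Time each named zero-argument callable; return name -> median ms."""
    return {name: median_latency_ms(fn, **timing_kwargs)
            for name, fn in variants.items()}


if __name__ == "__main__":
    # Stand-in workloads; real use would wrap kernel calls with fixed inputs.
    results = compare_schedules({
        "baseline": lambda: sum(range(10_000)),
        "modified": lambda: sum(range(20_000)),
    })
    for name, ms in results.items():
        print(f"{name}: {ms:.4f} ms")
```

Reporting the median over repeated runs after a warmup keeps one-off compilation or cache effects from being attributed to the schedule change itself.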

Circularity Check

0 steps flagged

No circularity; direct implementation and benchmarking paper

The manuscript describes a practical CUDA kernel implementation in cuTile/TileIR for scaled dot-product attention, explicitly following the established FlashAttention online-softmax and tiled K/V streaming approach. No derivation chain, first-principles predictions, fitted parameters presented as outputs, or load-bearing self-citations exist; the central claims rest on external benchmarking against PyTorch SDPA, FlashAttention2, and other baselines. The work is self-contained as an engineering artifact with reproducible harnesses and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the established FlashAttention online-softmax formulation and standard cuTile execution semantics; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption FlashAttention-style online softmax and tiled K/V streaming produce correct attention results
    The paper states it follows the established FlashAttention-style online-softmax formulation.
  • domain assumption cuTile Python provides realistic GPU behavior for tiled kernels
    The implementation strategy assumes cuTile/TileIR delivers performance and correctness comparable to low-level CUDA.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
