TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
Pith reviewed 2026-05-15 17:39 UTC · model grok-4.3
The pith
TiledAttention offers a Python-modifiable SDPA kernel that achieves large speedups over eager attention in PyTorch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TiledAttention follows the established online-softmax formulation for scaled dot-product attention but realizes it through a schedule-level implementation in a high-level tile language that permits rapid modifications to parameters such as tile shapes and shared-memory layouts while retaining tiled streaming of K and V.
What carries the argument
The tiled streaming of K and V combined with online softmax updates, implemented at the schedule level for direct Python-level edits to shapes and staging.
If this is right
- Direct integration into PyTorch models for attention research without separate compilation steps.
- Faster iteration on custom attention variants through schedule edits rather than template rewrites.
- Reproducible speedups over torch_sdpa_math and standard eager paths across FP16 and BF16.
- A middle option between rigid production fused kernels and slow pure Python attention.
Where Pith is reading between the lines
- The method could extend to testing new attention variants by swapping only the tile schedule without touching surrounding code.
- Researchers might combine it with emerging precision formats to measure tradeoffs more quickly than with fixed kernels.
- It points toward a broader pattern where high-level scheduling languages lower the cost of exploring kernel-level optimizations.
Load-bearing premise
Schedule-level changes in the tile implementation will preserve performance levels close to those of hand-tuned low-level code.
What would settle it
Run the kernel on a new sequence length or head dimension where measured throughput falls below that of the unfused eager baseline while numerical outputs remain correct.
Figures
read the original abstract
TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. Algorithmically, TiledAttention follows the established FlashAttention-style online-softmax formulation; our novelty is the cuTile/TileIR implementation strategy, schedule-level modifiability, and reproducible benchmarking/profiling workflow. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch), explicit unfused baselines (torch_sdpa_math, standard eager attention), and forced backend probes (FlashAttention2, EffecientAttention, CuDNN Attention) across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents TiledAttention, a scaled dot-product attention (SDPA) forward operator implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function. It follows the established FlashAttention-style online-softmax formulation with tiled K/V streaming; the claimed novelty lies in the schedule-level modifiability (tile shapes, staging, shared-memory layout) from Python while retaining realistic performance. The authors provide a reproducible benchmarking harness on an NVIDIA DGX GB10 node and compare against PyTorch SDPA auto-dispatch, torch_sdpa_math, eager attention, and forced backends (FlashAttention2, EfficientAttention, CuDNN), reporting large speedups over eager paths and a practical balance between performance and customizability.
Significance. If the central performance claims hold, the work offers a useful engineering contribution by lowering the barrier to SDPA kernel experimentation through Python-level schedule edits without requiring full CUDA/CUTLASS rewrites. The reproducible benchmarking workflow and direct PyTorch integration are concrete strengths that could aid rapid research iteration in attention mechanisms.
major comments (2)
- [Abstract] Abstract: the central claim that TiledAttention provides a 'practical balance between performance and customizability' rests on the untested assumption that Python-level cuTile/TileIR schedule modifications (tile shapes, staging, shared-memory layout) incur negligible overhead relative to hand-tuned CUDA. No ablation data, before/after timing for modified vs. unmodified kernels, or quantitative overhead measurements are presented to substantiate that modifiability preserves the reported speedups over eager attention.
- [Abstract] Abstract and benchmarking description: while the manuscript states that 'TiledAttention delivers large speedups over standard eager attention paths,' the provided text contains no specific speedup factors, latency tables, or references to figures that would allow verification of the magnitude of improvement across sequence lengths, head dimensions, and precisions (FP16/BF16). This makes the performance claims difficult to assess independently.
minor comments (2)
- The abstract refers to an 'NVIDIA DGX GB10 node' without clarifying the exact GPU model or configuration details (e.g., SM count, memory bandwidth) that would aid reproducibility of the harness.
- [Abstract] Minor notation inconsistency: the abstract uses both 'SDPA' and 'scaled dot-product attention' without an initial definition on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the presentation of our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that TiledAttention provides a 'practical balance between performance and customizability' rests on the untested assumption that Python-level cuTile/TileIR schedule modifications (tile shapes, staging, shared-memory layout) incur negligible overhead relative to hand-tuned CUDA. No ablation data, before/after timing for modified vs. unmodified kernels, or quantitative overhead measurements are presented to substantiate that modifiability preserves the reported speedups over eager attention.
Authors: We acknowledge that the manuscript does not contain explicit ablation experiments quantifying the runtime overhead of performing schedule edits (tile shapes, staging, shared-memory layout) from Python relative to a static hand-tuned CUDA baseline. Our current evidence for the 'practical balance' consists of end-to-end speedups versus eager attention together with the observation that the same cuTile kernel can be edited without rewriting CUDA. To directly address the referee's concern, we will add a new ablation subsection in the revised version that reports before/after timings for representative schedule modifications on the same hardware. revision: yes
-
Referee: [Abstract] Abstract and benchmarking description: while the manuscript states that 'TiledAttention delivers large speedups over standard eager attention paths,' the provided text contains no specific speedup factors, latency tables, or references to figures that would allow verification of the magnitude of improvement across sequence lengths, head dimensions, and precisions (FP16/BF16). This makes the performance claims difficult to assess independently.
Authors: We agree that the abstract would be clearer with concrete numerical examples and explicit figure references. The full manuscript already contains the requested latency tables and plots across sequence lengths, head dimensions, and both FP16/BF16; however, the abstract summarizes them only qualitatively. In the revision we will insert representative speedup ranges (e.g., 4-12x versus torch_sdpa_math for sequence lengths 1k-8k) together with direct citations to the relevant figures and tables. revision: yes
Circularity Check
No circularity; direct implementation and benchmarking paper
full rationale
The manuscript describes a practical CUDA kernel implementation in cuTile/TileIR for scaled dot-product attention, explicitly following the established FlashAttention online-softmax and tiled K/V streaming approach. No derivation chain, first-principles predictions, fitted parameters presented as outputs, or load-bearing self-citations exist; the central claims rest on external benchmarking against PyTorch SDPA, FlashAttention2, and other baselines. The work is self-contained as an engineering artifact with reproducible harnesses and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption FlashAttention-style online softmax and tiled K/V streaming produce correct attention results
- domain assumption cuTile Python provides realistic GPU behavior for tiled kernels
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
TiledAttention uses cuTile Python tile program with online softmax updates and tiled K,V streaming
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
TiledAttention delivers large speedups over standard eager attention paths
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
NVIDIA Documentation, https://docs.nvidia
cutile python documentation. NVIDIA Documentation, https://docs.nvidia. com/cuda/cutile-python/, accessed 2026-01-31
work page 2026
-
[2]
Nsight Deep Learning Designer — developer.nvidia.com. https://developer. nvidia.com/nsight-dl-designer, [Accessed 26-02-2026]
work page 2026
-
[3]
NVIDIA Developer, https://developer.nvidia.com/cuda/tile, accessed 2026-01-31
Nvidia cuda tile. NVIDIA Developer, https://developer.nvidia.com/cuda/tile, accessed 2026-01-31
work page 2026
-
[4]
Tile ir — introduction. NVIDIA Documentation, https://docs.nvidia.com/cuda/ tile-ir/latest/sections/introduction.html, accessed 2026-01-31
work page 2026
-
[5]
Procedia Computer Science (2025)
Towards a european hpc/ai ecosystem. Procedia Computer Science (2025). https: //doi.org/10.1016/j.procs.2025.02.269, https://doi.org/10.1016/j.procs. 2025.02.269 12 Khan, T
-
[6]
Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014), https://arxiv.org/ abs/1409.0473
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[7]
Advances in Neural Information Processing Systems (2020)
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems (2020)
work page 2020
-
[8]
PaLM: Scaling Language Modeling with Pathways
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022), https://arxiv.org/abs/ 2204.02311
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Dao, T.: Flashattention-2: Faster attention with better parallelism and work par- titioning. arXiv preprint arXiv:2307.08691 (2023), https://arxiv.org/abs/2307. 08691
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Dao, T., Fu, D.Y., Ermon, S., Rudra, A., R´ e, C.: Flashattention: Fast and memory- efficient exact attention with io-awareness. arXiv preprint arXiv:2205.14135 (2022), https://arxiv.org/abs/2205.14135
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
In: Proceedings of NAACL- HLT (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL- HLT (2019)
work page 2019
-
[12]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021), https: //arxiv.org/abs/2010.11929
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[13]
arXiv preprint arXiv:2106.04554 (2021),https://arxiv.org/abs/2106.04554
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: A survey of transformers. arXiv preprint arXiv:2106.04554 (2021),https://arxiv.org/abs/2106.04554
-
[14]
https://doi.org/10.5281/ZENODO.18787737, https:// zenodo.org/doi/10.5281/zenodo.18787737
Khan, T.: Tiledattention on nvidia dgx gb10: Supplementary benchmark and nsight compute results (2026). https://doi.org/10.5281/ZENODO.18787737, https:// zenodo.org/doi/10.5281/zenodo.18787737
-
[15]
Learning Transferable Visual Models From Natural Language Supervision
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021),https://arxiv.org/abs/2103.00020
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
Efficient transformers: A survey
Tay, Y., Dehghani, M., Bahri, D., Metzler, D.: Efficient transformers: A survey. arXiv preprint arXiv:2009.06732 (2020),https://arxiv.org/abs/2009.06732
-
[17]
TOP500: Top500 june 2024 highlights (2024), https://www.top500.org/lists/ top500/2024/06/highs/, accessed 2026-02-16
work page 2024
-
[18]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023), https://arxiv. org/abs/2307.09288
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Advances in Neural Information Processing Systems (2017)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems (2017)
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.