The anatomy of a Triton attention kernel

Burkhard Ringlein, Jan van Lunteren, Radu Stoica, Thomas Parnell · 2025 · arXiv 2511.11581

3 Pith papers cite this work.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

cs.LG · 2026-02-03 · conditional · novelty 7.0

FlashSinkhorn delivers up to 32x forward and 161x end-to-end speedups for entropic OT on A100 GPUs via IO-aware Triton kernels that fuse log-domain updates and streaming transport application.

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

cs.DC · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.

WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

cs.PF · 2026-04-11 · unverdicted · novelty 6.0

WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with drastically lower overhead.

