FlashSinkhorn delivers up to 32x forward and 161x end-to-end speedups for entropic OT on A100 GPUs via IO-aware Triton kernels that fuse log-domain updates and streaming transport application.
The anatomy of a triton attention kernel
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3roles
background 2polarities
background 2representative citing papers
Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.
WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with drastically lower overhead.
citing papers explorer
-
FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU
FlashSinkhorn delivers up to 32x forward and 161x end-to-end speedups for entropic OT on A100 GPUs via IO-aware Triton kernels that fuse log-domain updates and streaming transport application.
-
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.
-
WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with drastically lower overhead.