ThunderKittens: Simple, Fast, and Adorable AI Kernels
4 Pith papers cite this work. Polarity classification is still indexing.
Representative citing papers (2026, unverdicted):
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and a PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5-second to 1-minute generations.
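The PBSA kernel itself isn't described in this summary. As a rough reference for the semantics a block-sparse attention kernel computes, here is a minimal PyTorch sketch in which a binary block mask (a stand-in for the paper's trainable selection) gates which key/value blocks each query block attends to; all names and shapes are illustrative.

```python
import torch

def block_sparse_attention(q, k, v, block_mask, block=64):
    """q, k, v: [seq, dim]; block_mask: [seq//block, seq//block] bool."""
    seq, dim = q.shape
    scores = (q @ k.T) / dim ** 0.5  # dense scores, for reference semantics only
    # Expand the block mask to token resolution and mask out skipped blocks.
    token_mask = block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(256, 64) for _ in range(3))
mask = torch.rand(4, 4) > 0.5   # stand-in for a learned block-selection mask
mask.fill_diagonal_(True)       # always attend within the local block
out = block_sparse_attention(q, k, v, mask)  # [256, 64]
```

A fused kernel in the spirit the summary describes would skip masked blocks entirely rather than materialize dense scores and mask them, which is where the speedup comes from.
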
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
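Nautilus's actual search space and cost model aren't given in this summary. As a toy illustration of what an auto-scheduling tiled-kernel compiler searches over, the sketch below enumerates candidate output tilings, prunes those whose modeled operand tiles would exceed a shared-memory budget, and keeps the fastest measured variant; the constants and the stand-in "kernel" are illustrative.

```python
import itertools, time
import torch

TK = 32  # modeled K-tile depth used only for the footprint estimate

def run_tiled_matmul(a, b, tile_m, tile_n):
    # Stand-in "kernel": block the output and compute it tile by tile.
    m, n = a.shape[0], b.shape[1]
    out = torch.empty(m, n)
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            out[i:i + tile_m, j:j + tile_n] = a[i:i + tile_m] @ b[:, j:j + tile_n]
    return out

def autotune(a, b, smem_budget=48 * 1024, dtype_bytes=4):
    best = None
    for tm, tn in itertools.product([64, 128, 256], repeat=2):
        # Footprint of the A and B operand tiles a real kernel would stage.
        if (tm * TK + TK * tn) * dtype_bytes > smem_budget:
            continue  # prune tilings that cannot fit in shared memory
        start = time.perf_counter()
        run_tiled_matmul(a, b, tm, tn)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, tm, tn)
    return best

a, b = torch.randn(512, 256), torch.randn(256, 512)
print(autotune(a, b))  # -> (seconds, tile_m, tile_n) of the fastest variant
```

Production compilers typically replace the exhaustive benchmarking loop with analytical cost models and hardware queries, but the feasibility-prune-then-measure structure is the same.
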
TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments
TLX introduces MIMW-based extensions to Triton that let developers orchestrate warp-group execution and asynchronous hardware features while preserving Triton's blocked-programming productivity, with kernels deployed in large-scale training and inference.
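TLX's syntax isn't shown in this summary, so as a host-side analogy for the warp-specialized, producer/consumer execution style it exposes, the sketch below pipelines tile "loads" and compute through a bounded queue, much as one warp group might asynchronously stage tiles into shared memory while another consumes them; everything here is a stand-in, not TLX code.

```python
import queue, threading
import torch

tiles = [torch.randn(64, 64) for _ in range(8)]
staged = queue.Queue(maxsize=2)   # double buffering: at most 2 tiles in flight

def producer():
    for t in tiles:
        staged.put(t.clone())     # stands in for an async copy into shared memory
    staged.put(None)              # sentinel: no more tiles

def consumer(results):
    while (t := staged.get()) is not None:
        results.append(t @ t.T)   # stands in for the math warp group's work

results = []
p = threading.Thread(target=producer)
c = threading.Thread(target=consumer, args=(results,))
p.start(); c.start(); p.join(); c.join()
print(len(results), "tiles processed")
```

The bounded queue plays the role that barriers and buffer slots play on the GPU: the producer can run ahead by only as many tiles as there are buffers.
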
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
DAK enables direct GPU access to remote memory for LLM inference by repurposing the Tensor Memory Accelerator (TMA) and applying a greedy offloading algorithm, achieving up to 3x gains over prefetching baselines on NVLink-C2C and 1.8x on PCIe.
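DAK's exact placement objective isn't reproduced here. The sketch below shows a generic greedy, knapsack-style offloading decision of the kind the summary describes: keep resident the tensors with the highest access frequency per byte until GPU memory is exhausted, and serve the rest via direct access to remote memory; the tensor names and sizes are hypothetical.

```python
def plan_offloading(tensors, hbm_budget_bytes):
    """tensors: list of (name, size_bytes, accesses_per_step)."""
    # Greedy rule: rank by accesses per byte, the benefit of staying on-GPU.
    ranked = sorted(tensors, key=lambda t: t[2] / t[1], reverse=True)
    resident, remote, used = [], [], 0
    for name, size, _ in ranked:
        if used + size <= hbm_budget_bytes:
            resident.append(name); used += size
        else:
            remote.append(name)   # served via direct access over NVLink/PCIe
    return resident, remote

tensors = [("kv_cache", 8 << 30, 100), ("weights", 12 << 30, 10),
           ("embeddings", 2 << 30, 1)]
print(plan_offloading(tensors, hbm_budget_bytes=16 << 30))
# -> (['kv_cache', 'embeddings'], ['weights'])
```
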