ThunderKittens: Simple, Fast, and Adorable AI Kernels
4 Pith papers cite this work. Polarity classification is still indexing.
Representative citing papers (2026, unverdicted):
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and a PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5-second to 1-minute generations.
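The PBSA kernel itself isn't described in this summary. As a rough reference for the semantics a block-sparse attention kernel computes, here is a minimal PyTorch sketch in which a binary block mask (a stand-in for the paper's trainable selection) gates which key/value blocks each query block attends to; all names and shapes are illustrative.

```python
import torch

def block_sparse_attention(q, k, v, block_mask, block=64):
    """q, k, v: [seq, dim]; block_mask: [seq//block, seq//block] bool."""
    seq, dim = q.shape
    scores = (q @ k.T) / dim ** 0.5  # dense scores, for reference semantics only
    # Expand the block mask to token resolution and mask out skipped blocks.
    token_mask = block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(256, 64) for _ in range(3))
mask = torch.rand(4, 4) > 0.5   # stand-in for a learned block-selection mask
mask.fill_diagonal_(True)       # always attend within the local block
out = block_sparse_attention(q, k, v, mask)  # [256, 64]
```

A fused kernel in the spirit the summary describes would skip masked blocks entirely rather than materialize dense scores and mask them, which is where the speedup comes from.
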
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
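Nautilus's actual search space and cost model aren't given in this summary. As a toy illustration of what an auto-scheduling tiled-kernel compiler searches over, the sketch below enumerates candidate output tilings, prunes those whose modeled operand tiles would exceed a shared-memory budget, and keeps the fastest measured variant; the constants and the stand-in "kernel" are illustrative.

```python
import itertools, time
import torch

TK = 32  # modeled K-tile depth used only for the footprint estimate

def run_tiled_matmul(a, b, tile_m, tile_n):
    # Stand-in "kernel": block the output and compute it tile by tile.
    m, n = a.shape[0], b.shape[1]
    out = torch.empty(m, n)
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            out[i:i + tile_m, j:j + tile_n] = a[i:i + tile_m] @ b[:, j:j + tile_n]
    return out

def autotune(a, b, smem_budget=48 * 1024, dtype_bytes=4):
    best = None
    for tm, tn in itertools.product([64, 128, 256], repeat=2):
        # Footprint of the A and B operand tiles a real kernel would stage.
        if (tm * TK + TK * tn) * dtype_bytes > smem_budget:
            continue  # prune tilings that cannot fit in shared memory
        start = time.perf_counter()
        run_tiled_matmul(a, b, tm, tn)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, tm, tn)
    return best

a, b = torch.randn(512, 256), torch.randn(256, 512)
print(autotune(a, b))  # -> (seconds, tile_m, tile_n) of the fastest variant
```

Production compilers typically replace the exhaustive benchmarking loop with analytical cost models and hardware queries, but the feasibility-prune-then-measure structure is the same.
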
TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments
TLX introduces MIMW-based extensions to Triton that let developers orchestrate warp-group execution and asynchronous hardware features while preserving Triton's blocked-programming productivity, with kernels deployed in large-scale training and inference.
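TLX's syntax isn't shown in this summary, so as a host-side analogy for the warp-specialized, producer/consumer execution style it exposes, the sketch below pipelines tile "loads" and compute through a bounded queue, much as one warp group might asynchronously stage tiles into shared memory while another consumes them; everything here is a stand-in, not TLX code.

```python
import queue, threading
import torch

tiles = [torch.randn(64, 64) for _ in range(8)]
staged = queue.Queue(maxsize=2)   # double buffering: at most 2 tiles in flight

def producer():
    for t in tiles:
        staged.put(t.clone())     # stands in for an async copy into shared memory
    staged.put(None)              # sentinel: no more tiles

def consumer(results):
    while (t := staged.get()) is not None:
        results.append(t @ t.T)   # stands in for the math warp group's work

results = []
p = threading.Thread(target=producer)
c = threading.Thread(target=consumer, args=(results,))
p.start(); c.start(); p.join(); c.join()
print(len(results), "tiles processed")
```

The bounded queue plays the role that barriers and buffer slots play on the GPU: the producer can run ahead by only as many tiles as there are buffers.
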
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
DAK enables direct GPU access to remote memory for LLM inference by repurposing the Tensor Memory Accelerator (TMA) and applying a greedy offloading algorithm, achieving up to 3x gains over prefetching baselines on NVLink-C2C and 1.8x on PCIe.
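DAK's exact placement objective isn't reproduced here. The sketch below shows a generic greedy, knapsack-style offloading decision of the kind the summary describes: keep resident the tensors with the highest access frequency per byte until GPU memory is exhausted, and serve the rest via direct access to remote memory; the tensor names and sizes are hypothetical.

```python
def plan_offloading(tensors, hbm_budget_bytes):
    """tensors: list of (name, size_bytes, accesses_per_step)."""
    # Greedy rule: rank by accesses per byte, the benefit of staying on-GPU.
    ranked = sorted(tensors, key=lambda t: t[2] / t[1], reverse=True)
    resident, remote, used = [], [], 0
    for name, size, _ in ranked:
        if used + size <= hbm_budget_bytes:
            resident.append(name); used += size
        else:
            remote.append(name)   # served via direct access over NVLink/PCIe
    return resident, remote

tensors = [("kv_cache", 8 << 30, 100), ("weights", 12 << 30, 10),
           ("embeddings", 2 << 30, 1)]
print(plan_offloading(tensors, hbm_budget_bytes=16 << 30))
# -> (['kv_cache', 'embeddings'], ['weights'])
```
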