Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives
6 Pith papers cite this work.
Years: 2026 (6)
Verdicts: UNVERDICTED (6)
citing papers explorer
- Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection
  A systematic analysis of 301 real-world code generation bugs in tile programs, categorizing root causes, symptoms, input patterns, test oracles, and fix strategies from curated GitHub reports.
- DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
  DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% while saving hundreds of thousands of GPU hours monthly.
- CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
  CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
- Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap
  Syncopate automatically overlaps compute and communication at fine chunk granularity inside a single fused Triton kernel, yielding 1.3x average and up to 4.7x end-to-end speedup on multi-GPU workloads.
- Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
  Ada-MK fuses LLM operators into persistent MegaKernels via MLIR DAG search and 3D shared-memory modeling, delivering up to 23.6% higher single-batch throughput than TensorRT-LLM on NVIDIA L20.
- UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
  UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.