Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives

Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Xin Liu · 2025 · arXiv 2503.20313

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection

cs.SE · 2026-05-19 · unverdicted · novelty 7.0

A systematic analysis of 301 real-world code generation bugs in tile programs, categorizing root causes, symptoms, input patterns, test oracles, and fix strategies from curated GitHub reports.

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

cs.PL · 2026-05-02 · unverdicted · novelty 6.0

DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% while saving hundreds of thousands of GPU hours monthly.

CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.

Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

cs.DC · 2026-01-28 · unverdicted · novelty 6.0

Syncopate automatically overlaps compute and communication at fine chunk granularity inside a single fused Triton kernel, yielding 1.3x average and up to 4.7x end-to-end speedup on multi-GPU workloads.

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

Ada-MK fuses LLM operators into persistent MegaKernels via MLIR DAG search and 3D shared-memory modeling, delivering up to 23.6% higher single-batch throughput than TensorRT-LLM on NVIDIA L20.

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

cs.DC · 2026-04-21 · unverdicted · novelty 5.0

UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

citing papers explorer

Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection · cs.SE · 2026-05-19 · unverdicted · none · ref 95
DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs · cs.PL · 2026-05-02 · unverdicted · none · ref 34
CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training · cs.LG · 2026-04-27 · unverdicted · none · ref 29
Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap · cs.DC · 2026-01-28 · unverdicted · none · ref 46
Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference · cs.CL · 2026-05-12 · unverdicted · none · ref 36
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training · cs.DC · 2026-04-21 · unverdicted · none · ref 52
