Triton-distributed: Programming overlapping kernels on distributed AI systems with the Triton compiler

Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yu · 2025 · arXiv 2504.19442

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel

cs.DC · 2026-04-14 · unverdicted · novelty 8.0

Event Tensor is a new compiler abstraction for dynamic megakernels that enables high-performance persistent GPU kernels with state-of-the-art LLM serving latency and reduced warmup overhead.

HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

cs.DC · 2026-05-22 · unverdicted · novelty 7.0

HyperParallel-MoE achieves up to 1.58x lower Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters via tile-level heterogeneous scheduling of AIC and AIV resources.

FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

cs.DC · 2026-04-21 · unverdicted · novelty 7.0

FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on the Hopper architecture.

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

cs.DC · 2025-05-16 · conditional · novelty 7.0

TokenWeave achieves up to 1.28x lower latency and 1.19x higher throughput for distributed LLM inference by enabling compute-communication overlap at small token counts via a fused AllReduce-RMSNorm kernel that uses only 2-8 SMs.

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.

Eliminating Hidden Serialization in Multi-Node Megakernel Communication

cs.DC · 2026-05-01 · conditional · novelty 6.0

Perseus removes serialization bottlenecks in multi-node megakernel MoE communication via batched per-destination fences and hardware fence flags, delivering up to 10.3x speedup on proxy transports and matching or exceeding GPU-direct RDMA.

DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

cs.AR · 2026-04-06 · conditional · novelty 6.0

DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains over baselines.

Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

cs.DC · 2026-01-28 · unverdicted · novelty 6.0

Syncopate automatically overlaps compute and communication at fine chunk granularity inside a single fused Triton kernel, yielding 1.3x average and up to 4.7x end-to-end speedup on multi-GPU workloads.

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

cs.DC · 2025-11-10 · unverdicted · novelty 6.0

DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing a performance gap of up to 4.5x versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.

DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

cs.DC · 2026-05-20 · unverdicted · novelty 5.0

DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

cs.DC · 2026-04-21 · unverdicted · novelty 5.0

UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

citing papers explorer

Showing 11 of 11 citing papers.

Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel

cs.DC · 2026-04-14 · unverdicted · none · ref 2

HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

cs.DC · 2026-05-22 · unverdicted · none · ref 27

FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

cs.DC · 2026-04-21 · unverdicted · none · ref 19

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

cs.DC · 2025-05-16 · conditional · none · ref 12

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · none · ref 81

Eliminating Hidden Serialization in Multi-Node Megakernel Communication

cs.DC · 2026-05-01 · conditional · none · ref 58

DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

cs.AR · 2026-04-06 · conditional · none · ref 126

Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

cs.DC · 2026-01-28 · unverdicted · none · ref 45

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

cs.DC · 2025-11-10 · unverdicted · none · ref 36

DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

cs.DC · 2026-05-20 · unverdicted · none · ref 16

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

cs.DC · 2026-04-21 · unverdicted · none · ref 51
