Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
11 Pith papers cite this work.
citing papers explorer
- Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
  Event Tensor is a new compiler abstraction for dynamic megakernels that enables high-performance persistent GPU kernels with state-of-the-art LLM serving latency and reduced warmup overhead.
- HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs
  HyperParallel-MoE achieves up to 1.58x lower Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters via tile-level heterogeneous scheduling of AIC and AIV resources.
- FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
  FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
- TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
  TokenWeave achieves up to 1.28x lower latency and 1.19x higher throughput for distributed LLM inference by enabling compute-communication overlap at small token counts via a fused AllReduce-RMSNorm kernel that uses only 2-8 SMs.
- NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding
  NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.
- Eliminating Hidden Serialization in Multi-Node Megakernel Communication
  Perseus removes serialization bottlenecks in multi-node megakernel MoE communication via batched per-destination fences and hardware fence flags, delivering up to 10.3x speedup on proxy transports and matching or exceeding GPU-direct RDMA.
- DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
  DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains over baselines.
- Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap
  Syncopate automatically overlaps compute and communication at fine chunk granularity inside a single fused Triton kernel, yielding 1.3x average and up to 4.7x end-to-end speedup on multi-GPU workloads.
- DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
  DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing up to 4.5x performance gap versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.
- DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
  DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.
- UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
  UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.