ISBN 9781450399159

Association for Computing Machinery · 2022 · arXiv 7955.356795

4 Pith papers cite this work.

representative citing papers

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

cs.DC · 2025-05-16 · conditional · novelty 7.0

TokenWeave achieves up to 1.28x lower latency and 1.19x higher throughput for distributed LLM inference by enabling compute-communication overlap at small token counts via a fused AllReduce-RMSNorm kernel that uses only 2-8 SMs.

TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments

cs.AR · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

TLX introduces MIMW-based extensions to Triton that let developers orchestrate warp-group execution and asynchronous hardware features while preserving blocked programming productivity, with kernels deployed in large-scale training and inference.

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

cs.PL · 2026-05-02 · unverdicted · novelty 6.0

DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% while saving hundreds of thousands of GPU hours monthly.

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

cs.DC · 2025-11-10 · unverdicted · novelty 6.0

DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing a performance gap of up to 4.5x versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.

citing papers explorer

Showing 4 of 4 citing papers.

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

cs.DC · 2025-05-16 · conditional · none · ref 10

TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments

cs.AR · 2026-05-11 · unverdicted · none · ref 34 · 2 links

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

cs.PL · 2026-05-02 · unverdicted · none · ref 29

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

cs.DC · 2025-11-10 · unverdicted · none · ref 33
