DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

· 2025 · cs.DC · arXiv 2511.06605

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Offloading communication to existing direct memory access (DMA) engines, available on most state-of-the-art commercial GPUs, has emerged as an interesting and low-cost solution to efficiently overlap computation and communication in machine learning (ML). That said, so far, the reach of DMA offloads has been limited to bandwidth-bound scenarios only (10s of MB to GB transfer sizes). In this work, we aim to break this barrier and expand the reach of DMA communication offloads to even latency-bound regions (KB to low MB). Specifically, we discuss in this work hitherto untapped features available in the state-of-the-art AMD Instinct$^{\mathrm{TM}}$ MI300X GPUs that render DMA communication offloads competitive even for latency-bound regions. We demonstrate the efficacy of these features at the operator-level (ML communication collectives such as all-gather and all-to-all), and also at the end-to-end workload-level (LLM inference). For the former, our optimized DMA offloads close up to 4.5$\times$ performance gap and deliver additional power savings (3-10%) for ML collectives as compared to state-of-the-art GPU core-based communication library, RCCL. For the latter, we demonstrate acceleration for LLM inference: up to 1.5$\times$ lower latency and up to 1.9$\times$ higher throughput over the state-of-the-art vLLM inference framework. We conclude with a discussion of AMD Instinct GPU runtime innovations that stand to expose these features and additionally identify future hardware-software co-design potential to further improve DMA offload efficiency.

representative citing papers

CompPow: A Case for Component-level GPU Power Management

cs.AR · 2026-05-21 · unverdicted · novelty 3.0

CompPow makes the case that component-aware power management inside GPUs can yield 10% higher energy efficiency and 5% better performance for ML workloads.

citing papers explorer

Showing 1 of 1 citing paper.

CompPow: A Case for Component-level GPU Power Management cs.AR · 2026-05-21 · unverdicted · none · ref 23 · internal anchor
CompPow makes the case that component-aware power management inside GPUs can yield 10% higher energy efficiency and 5% better performance for ML workloads.

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

fields

years

verdicts

representative citing papers

citing papers explorer