pith. sign in

arxiv: 2511.06605 · v2 · submitted 2025-11-10 · 💻 cs.DC · cs.AR

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

Pith reviewed 2026-05-18 00:24 UTC · model grok-4.3

classification 💻 cs.DC cs.AR
keywords DMA offloadsML communicationlatency-bound transfersGPU collectivesLLM inferenceAMD MI300XRCCL comparisoncommunication overlap
0
0 comments X

The pith

DMA offloads using untapped features in MI300X GPUs can compete with core-based libraries even for small latency-bound ML transfers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that DMA engines on commercial GPUs, long restricted to large bandwidth-bound transfers, can now handle smaller latency-bound communication in machine learning by using specific untapped hardware features. This expansion matters because communication overheads frequently limit ML performance, and effective offloading allows better overlap with computation to improve overall efficiency. Demonstrations at the level of collectives show meaningful gains over existing libraries, while end-to-end tests on LLM inference confirm lower latency and higher throughput.

Core claim

By exploiting hitherto untapped features available in the AMD Instinct MI300X GPUs, DMA communication offloads become competitive for latency-bound regions with transfer sizes from KB to low MB. Optimized offloads for ML collectives such as all-gather and all-to-all close up to 4.5 times the performance gap relative to the state-of-the-art GPU core-based library RCCL and deliver 3-10 percent additional power savings. The same approach accelerates full LLM inference workloads, achieving up to 1.5 times lower latency and up to 1.9 times higher throughput compared with the vLLM framework.

What carries the argument

Optimized DMA offloads that leverage specific untapped features in state-of-the-art AMD Instinct MI300X GPUs to support competitive performance in latency-bound communication.

If this is right

  • ML collectives achieve up to 4.5 times better performance and 3-10 percent power savings versus core-based libraries.
  • LLM inference sees up to 1.5 times lower latency and 1.9 times higher throughput than current frameworks.
  • Computation and communication overlap improves for small transfer sizes that previously could not use DMA offloads.
  • Runtime innovations can expose these GPU features for wider adoption in ML systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar untapped DMA features may exist on other GPU platforms and could be evaluated for comparable latency-bound gains.
  • Targeted hardware-software co-design could push the technique to even smaller transfer sizes or additional collective patterns.
  • Power reductions observed at scale could lower energy consumption in large ML training clusters.

Load-bearing premise

The untapped features in MI300X GPUs allow DMA offloads to handle latency-bound transfers competitively without introducing hidden overheads or workload-specific limitations.

What would settle it

Direct head-to-head latency and power measurements of all-gather and all-to-all collectives at KB to low-MB transfer sizes on MI300X hardware, checking whether DMA offloads close the stated performance gap to RCCL while avoiding any unaccounted slowdowns.

Figures

Figures reproduced from arXiv: 2511.06605 by Mahzabeen Islam, Mohamed Assem Ibrahim, Ryan Quach, Saleel Kudchadker, Shaizeen Aga, Suchita Pati.

Figure 1
Figure 1. Figure 1: Bridging the performance gap between DMA and GPU compute-unit (CU)-based collectives from RCCL with unique optimizations across the size spectrum. adoption across variety of domains. This in turn has led to ML training and inference to be often distributed over multiple GPUs thus necessitating focused optimization of inter-GPU communication costs. That is so as while slicing work across GPUs helps parallel… view at source ↗
Figure 2
Figure 2. Figure 2: ML collectives (left) All-gather (right) All-to-All. scheduling of DMA commands off the critical path in closing the performance gap of DMA collectives. We build DMA collectives prototypes on real hardware which harness the above DMA architecture capabilities. Us￾ing our prototypes for all-gather and all-to-all collectives, we demonstrate that we can considerably close the perfor￾mance gap between DMA and … view at source ↗
Figure 3
Figure 3. Figure 3: Concurrent Compute with Communication (left) using CUs (right) offloaded to sDMA engine. 2 ML Collectives and Need for Efficient Offload 2.1 All-gather and All-to-all ML Collectives As model sizes and inputs scale, efficiently slicing weight (and/or activation) tensors across multiple GPUs and gath￾ering them as needed is crucial for both performance (par￾allelism) and functionality (memory capacity). This… view at source ↗
Figure 6
Figure 6. Figure 6: Latency breakdown of an sDMA copy. DMA copy, which is the smallest unit of work applications can offload to DMAs with HIP/HSA API calls. 3.1 DMA Offload Benchmarking Methodology Since most of the DMA functionality is within AMD HIP run￾time (ROCr) [8] which handles several other tasks, we create a microbenchmark emulating only its DMA offload behavior (details in Section 3.2). As in ROCr, we use ROCt libra… view at source ↗
Figure 8
Figure 8. Figure 8: Broadcast-based allgather (bcst). in system memory, while schedule phase involve fetching commands from these queues to the sDMA engine. Thus, to minimize control and schedule time in ML collectives which require many-to-many GPU communication, it is essential to limit the number of sDMA commands issued. sDMA engines in a AMD Instinct™ MI300x support two types of commands that combine the functionality of … view at source ↗
Figure 10
Figure 10. Figure 10: Back-to-back copy-based collective (b2b). (Section 2.4), a simple implementation of the collective en￾tails each sDMA engine executing one independent copy as shown in [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Leveraging deterministic communication pat￾terns in ML to prelaunch collectives. used between all GPUs, each GPU only executes ∼half the number of swaps it requires, while the others are issued by the GPUs it requires the remaining swaps with. Note that since the CPU host issues these commands and they are pro￾cessed in parallel by sDMA engines across all GPUs, there is no performance advantage of balanci… view at source ↗
Figure 14
Figure 14. Figure 14: Speedup of sDMA Alltoall collective optimiza￾tions vs. RCCL. using the ROCt library which provides user-level APIs to cre￾ate DMA packets and interact with DMA engines through the GPU driver. For CU-based collectives, we use AMD’s state-of-the-art RCCL [9] library which has been tuned for each message size. We adjust the appropriate environment variables to scale the number of processes, enable perfor￾man… view at source ↗
Figure 15
Figure 15. Figure 15: Total power consumed by best DMA vs CU (RCCL) collectives size range. We list the best-performing implementations for different size ranges in Tables 1 and 2. 5.4 Collective Power Offloading communication to DMAs and freeing up GPU compute resources also stands to provide power savings. Fig￾ure 15 shows the total GPU power (including XCD, IOD and HBM as detailed in Section 2.4) consumed by AG collective a… view at source ↗
read the original abstract

Offloading communication to existing direct memory access (DMA) engines, available on most state-of-the-art commercial GPUs, has emerged as an interesting and low-cost solution to efficiently overlap computation and communication in machine learning (ML). That said, so far, the reach of DMA offloads has been limited to bandwidth-bound scenarios only (10s of MB to GB transfer sizes). In this work, we aim to break this barrier and expand the reach of DMA communication offloads to even latency-bound regions (KB to low MB). Specifically, we discuss in this work hitherto untapped features available in the state-of-the-art AMD Instinct$^{\mathrm{TM}}$ MI300X GPUs that render DMA communication offloads competitive even for latency-bound regions. We demonstrate the efficacy of these features at the operator-level (ML communication collectives such as all-gather and all-to-all), and also at the end-to-end workload-level (LLM inference). For the former, our optimized DMA offloads close up to 4.5$\times$ performance gap and deliver additional power savings (3-10%) for ML collectives as compared to state-of-the-art GPU core-based communication library, RCCL. For the latter, we demonstrate acceleration for LLM inference: up to 1.5$\times$ lower latency and up to 1.9$\times$ higher throughput over the state-of-the-art vLLM inference framework. We conclude with a discussion of AMD Instinct GPU runtime innovations that stand to expose these features and additionally identify future hardware-software co-design potential to further improve DMA offload efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that hitherto untapped features in AMD Instinct MI300X GPUs enable DMA offloads to become competitive for latency-bound ML communication (KB to low-MB transfers). It reports up to 4.5× gap closure versus RCCL for collectives such as all-gather and all-to-all, accompanied by 3-10% power savings, plus up to 1.5× lower latency and 1.9× higher throughput for end-to-end LLM inference versus vLLM.

Significance. If the empirical claims are substantiated with transparent methodology, the work would meaningfully broaden the scope of low-overhead DMA offloading in distributed ML systems, reducing GPU-core contention for small messages and improving overlap and energy efficiency in both training and inference workloads.

major comments (1)
  1. [§5.1] §5.1 (microbenchmarks for 1–100 KB messages): the central claim that the new DMA features eliminate fixed setup overheads for latency-bound transfers requires an explicit isolation of setup latency versus pure transfer time, plus any extra synchronization or cache-flush costs; without this breakdown the reported gap closure for small messages cannot be verified as general rather than workload-specific.
minor comments (2)
  1. [Abstract] Abstract: concrete speedup numbers are given without any mention of message-size ranges, number of runs, or hardware configuration details; these should be added for reproducibility.
  2. [Figure 4] Figure 4 (LLM inference results): axis labels and error-bar conventions are inconsistent with the collective plots in Figure 3; standardize formatting.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the suggestion to strengthen the microbenchmark analysis and will revise the paper accordingly to improve clarity and verifiability of our results.

read point-by-point responses
  1. Referee: [§5.1] §5.1 (microbenchmarks for 1–100 KB messages): the central claim that the new DMA features eliminate fixed setup overheads for latency-bound transfers requires an explicit isolation of setup latency versus pure transfer time, plus any extra synchronization or cache-flush costs; without this breakdown the reported gap closure for small messages cannot be verified as general rather than workload-specific.

    Authors: We agree that an explicit breakdown isolating setup latency from pure transfer time, along with synchronization and cache-flush costs, would make the claims more transparent and easier to verify as general rather than workload-specific. In the revised manuscript we will expand §5.1 with additional microbenchmark data that separately reports DMA engine setup time, data transfer time, synchronization overhead, and any cache-flush costs for the 1–100 KB range. These measurements will be presented both for the baseline RCCL path and for our DMA-offload implementation, allowing direct comparison of the fixed overhead components. We believe this addition will substantiate that the untapped MI300X features are responsible for the observed gap closure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with benchmark-driven claims

full rationale

The manuscript reports hardware-specific optimizations and micro-benchmark results for DMA offloads on MI300X GPUs, comparing against RCCL and vLLM. All performance claims (4.5× gap closure, 1.5–1.9× LLM gains) rest on direct measurements of latency, throughput, and power rather than any derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes appear; the work is self-contained against external baselines and contains no load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper; the central claims rest on hardware-specific optimizations and benchmarking rather than mathematical axioms, free parameters, or newly invented entities. No free parameters, axioms, or invented entities are identifiable from the abstract.

pith-pipeline@v0.9.0 · 5612 in / 1162 out tokens · 35106 ms · 2026-05-18T00:24:04.678462+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CompPow: A Case for Component-level GPU Power Management

    cs.AR 2026-05 unverdicted novelty 3.0

    CompPow makes the case that component-aware power management inside GPUs can yield 10% higher energy efficiency and 5% better performance for ML workloads.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    DMA Packets.https://people.freedesktop.org/~agd5f/dma_ packets.txt

    2014. DMA Packets.https://people.freedesktop.org/~agd5f/dma_ packets.txt

  2. [2]

    HIP Documentation.https://rocm.docs.amd.com/_/downloads/ HIP/en/docs-6.1.2/pdf/

    2024. HIP Documentation.https://rocm.docs.amd.com/_/downloads/ HIP/en/docs-6.1.2/pdf/

  3. [3]

    ROCR Documentation.https://rocm.docs.amd.com/_/ downloads/ROCR-Runtime/en/master/pdf/

    2024. ROCR Documentation.https://rocm.docs.amd.com/_/ downloads/ROCR-Runtime/en/master/pdf/

  4. [4]

    AMD ROCm documentation.https://rocm.docs.amd.com/en/ latest/

    2025. AMD ROCm documentation.https://rocm.docs.amd.com/en/ latest/

  5. [5]

    Tensile Documentation.https://rocm.docs.amd.com/_/ downloads/Tensile/en/latest/pdf/

    2025. Tensile Documentation.https://rocm.docs.amd.com/_/ downloads/Tensile/en/latest/pdf/

  6. [6]

    User Buffer Registration.https://docs.nvidia.com/deeplearning/ nccl/user-guide/docs/usage/bufferreg.html

    2025. User Buffer Registration.https://docs.nvidia.com/deeplearning/ nccl/user-guide/docs/usage/bufferreg.html

  7. [7]

    Anirudha Agrawal, Shaizeen Aga, Suchita Pati, and Mahzabeen Islam

  8. [8]

    arXiv:2412.14335 [cs.AR]https://arxiv.org/ abs/2412.14335

    Optimizing ML Concurrent Computation and Communication with GPU DMA Engines. arXiv:2412.14335 [cs.AR]https://arxiv.org/ abs/2412.14335

  9. [9]

    AMD. [n. d.].ROCm Runtime (ROCr).https://github.com/ROCm/ ROCR-Runtime

  10. [10]

    2025.ROCm Communication Collectives Library (RCCL).https: //github.com/ROCm/rccl

    AMD. 2025.ROCm Communication Collectives Library (RCCL).https: //github.com/ROCm/rccl

  11. [11]

    AMD. 2025. ROCm/rocBLAS: Next generation BLAS implementation for ROCm platform.https://github.com/ROCm/rocBLAS

  12. [12]

    Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, et al. 2024. Flux: Fast software-based communication overlap on gpus through kernel fusion.arXiv preprint arXiv:2406.06858(2024)

  13. [13]

    Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. 2023. Mscclang: Microsoft collective communication language. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 502–514

  14. [14]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  15. [15]

    Ali Hassani, Michael Isaev, Nic McDonald, Jie Ren, Vijay Thakkar, Haicheng Wu, and Humphrey Shi. 2024. Distributed GEMM

  16. [16]

    https://discuss.pytorch.org/t/distributed-w-torchtitan- introducing-async-tensor-parallelism-in-pytorch/209487

    Horace He, Less Wright, Luca Wehrstedt, Tianyu Liu, Wan- chao Liang. 2024. Introducing Async Tensor Parallelism in PyTorch. "https://discuss.pytorch.org/t/distributed-w-torchtitan- introducing-async-tensor-parallelism-in-pytorch/209487"

  17. [17]

    Changho Hwang, KyoungSoo Park, Ran Shu, Xinyuan Qu, Peng Cheng, and Yongqiang Xiong. 2023. {ARK}:{GPU-driven} Code Execution for Distributed Deep Learning. In20th USENIX Symposium on Net- worked Systems Design and Implementation (NSDI 23). 87–101

  18. [18]

    Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sa- bet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkow- icz, and Olli Saarikivi. 2022. Breaking the Computation and Commu- nication Abstraction Barrier in Distributed Machine Learning Work- loads. InProceedings of the 27th ACM International Conference on Architectural Support for Pr...

  19. [19]

    Benjamin Klenk, Nan Jiang, Greg Thorson, and Larry Dennison. 2020. An In-Network Architecture for Accelerating Shared-Memory Multi- processor Collectives. InACM/IEEE 47th Annual International Sympo- sium on Computer Architecture (ISCA). IEEE, IEEE Computer Society, Washington, DC, USA, 996–1009

  20. [20]

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catan- zaro. 2023. Reducing activation recomputation in large transformer models.Proceedings of Machine Learning and Systems5 (2023), 341– 353

  21. [21]

    NVIDIA. [n. d.].NCCL.https://github.com/NVIDIA/nccl

  22. [22]

    Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D Sinclair. 2024. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. InProceedings 12 DMA Collectives for Efficient ML Communication Offloads of the 29th ACM International Conference on Architectural Support for Programming Languages and Operatin...

  23. [23]

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yux- iong He. 2022. Deepspeed-moe: Advancing mixture-of-experts infer- ence and training to power next-generation ai scale. InInternational conference on machine learning. PMLR, 18332–18346

  24. [24]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

  25. [25]

    InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis

    Zero: Memory optimizations toward training trillion param- eter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16

  26. [26]

    Saeed Rashidi, Matthew Denton, Srinivas Sridharan, Sudarshan Srini- vasan, Amoghavarsha Suresh, Jade Nie, and Tushar Krishna. 2021. En- abling Compute-Communication Overlap in Distributed Deep Learn- ing Training Platforms. In2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, IEEE Press, Piscat- away, NJ, USA, 540–553...

  27. [27]

    Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, and Ziyue Yang

  28. [28]

    MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications.arXiv preprint arXiv:2504.09014(2025)

  29. [29]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRRabs/1909.08053 (2019), 9 pages. arXiv:1909.08053 [cs.CL]http: //arxiv.org/abs/1909.08053

  30. [30]

    Varsha Singhania, Shaizeen Aga, and Mohamed Assem Ibrahim. 2025. FinGraV: Methodology for Fine-Grain GPU Power Visibility and In- sights. In2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 96–107

  31. [31]

    Alan Smith, Eric Chapman, Chintan Patel, Raja Swaminathan, John Wuu, Tyrone Huang, Wonjun Jung, Alexander Kaganov, Hugh McIn- tyre, and Ramon Mangaser. 2024. 11.1 AMD InstinctTM MI300 Series Modular Chiplet Package – HPC and AI Accelerator for Exa-Class Sys- tems. In2024 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 67. 490–492. doi:10....

  32. [32]

    An Intel 3 Advanced FinFET Platform Technology for High Performance Computing and SOC Product Applications,

    Alan Smith, Gabriel H. Loh, John Wuu, Samuel Naffziger, Tyrone Huang, Hugh McIntyre, Ramon Mangaser, Wonjun Jung, and Raja Swaminathan. 2024. AMD Instinct™MI300X Accelerator: Packaging and Architecture Co-Optimization. In2024 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). 1–8. doi:10. 1109/VLSITECHNOLOGYANDCIR46783.2024.10631545

  33. [33]

    Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, Sameer Kumar, Tongfei Guo, Yuanzhong Xu, and Zongwei Zhou. 2022. Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models. In Proceedings of the 28th ACM International...

  34. [34]

    Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wen- lei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li- Wen Chang, Quan Chen, and Xin Liu. 2025. Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts. arXiv:2502.19811 [cs.DC]https://arxiv.org/abs/2502.19811

  35. [35]

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. Py- Torch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv:2304.11277 [cs.DC]https://arxiv...

  36. [36]

    Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chen- hui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yi- fan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu. 2025. Triton-distributed: Programming Overlap- ping Kernels on Distributed AI Syst...