DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
Pith reviewed 2026-05-07 14:55 UTC · model grok-4.3
The pith
Direct GPU access to remote memory via repurposed TMA outperforms prefetching and achieves near-optimal bandwidth aggregation for LLM inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that enabling direct GPU access to remote memory significantly outperforms prefetching, achieving optimal aggregate system bandwidth. DAK repurposes TMA to asynchronously fetch offloaded weights and KV caches directly from remote memory into GPU shared memory, paired with a greedy algorithm for optimal per-operation offloading ratios, active congestion control, and TMA multicast to eliminate interconnect bottlenecks and read amplification.
What carries the argument
Repurposed Tensor Memory Accelerator (TMA) performing asynchronous direct fetches from remote memory tiers into SMEM, guided by greedy offloading-ratio selection and multicast congestion control.
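The paper's appendix formalizes the offloading-ratio choice as minimizing total access time: min over {x_i} of Σ_i C_i / EB(x_i), subject to Σ_i C_i x_i = R Σ_i C_i and 0 ≤ x_i ≤ 1, where C_i is operation i's data footprint, x_i its offloaded fraction, EB(x_i) its effective bandwidth, and R the overall offload ratio forced by HBM capacity. A minimal greedy sketch under an assumed overlap model (local and remote streams proceed concurrently and the slower one dominates) could look like the following; the Op fields, bandwidth figures, and step size are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    size: float      # bytes the operation must read (C_i)
    b_local: float   # achievable HBM bandwidth for this op (GB/s, assumed)
    b_remote: float  # achievable remote-link bandwidth for this op (GB/s, assumed)

def op_time(op, x):
    # Local and remote streams overlap; the slower stream dominates.
    return max(x * op.size / op.b_remote, (1.0 - x) * op.size / op.b_local)

def greedy_ratios(ops, R, step=0.01):
    """Greedily satisfy the capacity constraint sum(C_i * x_i) >= R * sum(C_i),
    starting every op at its bandwidth-aggregating optimum
    x* = b_remote / (b_local + b_remote)."""
    x = {op.name: op.b_remote / (op.b_local + op.b_remote) for op in ops}
    total = sum(op.size for op in ops)
    offloaded = sum(x[op.name] * op.size for op in ops)
    # Push extra bytes to remote where the marginal slowdown per byte is
    # smallest. (If R sits below the free optimum, a symmetric pull-down
    # pass would apply; omitted here for brevity.)
    while offloaded < R * total:
        best = min((op for op in ops if x[op.name] < 1.0),
                   key=lambda op: (op_time(op, x[op.name] + step)
                                   - op_time(op, x[op.name])) / (step * op.size))
        x[best.name] = min(1.0, x[best.name] + step)
        offloaded = sum(x[o.name] * o.size for o in ops)
    return x
```

With uniform bandwidths the marginal costs tie and the extra offload can land anywhere; the greedy order only matters when different operations see different effective bandwidths, which is the case the paper's per-operation ratios target.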
If this is right
- HBM capacity is fully available for active computation rather than holding prefetched data.
- Pipeline bubbles from prefetch waits disappear because fetches occur asynchronously into SMEM.
- Aggregate system bandwidth approaches the sum of local and remote link capacities.
- Read amplification and interconnect hotspots are removed by multicast and congestion control.
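The third bullet is arithmetic over link capacities: if a kernel streams a fraction x of its bytes over the remote link while reading the rest from HBM, and both streams overlap, choosing x so the two streams finish together yields their summed bandwidth. The GH200-class numbers below are illustrative, not measurements from the paper:

```python
B_HBM = 4000.0   # GB/s, local HBM (illustrative GH200-class figure)
B_LINK = 450.0   # GB/s, NVLink-C2C remote link (illustrative)

# Fraction streamed remotely so both overlapped streams finish together:
#   x / B_LINK == (1 - x) / B_HBM  =>  x = B_LINK / (B_HBM + B_LINK)
x_star = B_LINK / (B_HBM + B_LINK)

# At that split, effective bandwidth is the sum of the two capacities.
aggregate = B_HBM + B_LINK
print(f"x* = {x_star:.3f}, aggregate = {aggregate:.0f} GB/s")
```

Any other split leaves one of the two streams idle at the end, which is why a fixed all-remote prefetch stream cannot reach the same aggregate.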
Where Pith is reading between the lines
- The same direct-access pattern could apply to other bandwidth-bound GPU workloads such as graph analytics or scientific simulation.
- If TMA is unavailable on future GPUs, equivalent low-overhead remote loads would require new hardware primitives.
- Memory-tier designers might prioritize low-latency direct-access links over larger local HBM if the performance gap persists.
Load-bearing premise
The Tensor Memory Accelerator can be repurposed for low-overhead asynchronous direct fetches from remote memory without new contention or hardware changes beyond what is already available.
What would settle it
A measurement on an NVLink-C2C system showing that TMA direct fetches add more latency or interconnect contention than an optimized prefetch baseline would falsify the claimed bandwidth optimality and speedup.
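Part of what such a measurement would probe is already visible in a two-stage pipeline model: prefetching hides a layer's fetch behind the previous layer's compute, so a bubble opens exactly when the fetch outlasts the compute, whereas direct access shrinks per-layer fetch time by splitting bytes across HBM and the link. A toy timing model, with made-up per-layer times rather than the paper's measurements:

```python
def pipelined_time(n_layers, t_compute, t_fetch):
    """Two-stage pipeline: the fetch for layer i+1 overlaps the compute of
    layer i, so the slower stage sets the per-layer rhythm after a
    one-fetch lead-in."""
    return t_fetch + n_layers * max(t_compute, t_fetch)

# Illustrative per-layer times in ms, not measurements from the paper:
# prefetching pulls all offloaded bytes over the link into HBM (link-bound),
# while direct access splits each layer's bytes across HBM and the link.
t_compute = 1.0
t_fetch_prefetch = 2.0
t_fetch_direct = 0.9
print(pipelined_time(32, t_compute, t_fetch_prefetch))  # 66.0 (bubbles)
print(pipelined_time(32, t_compute, t_fetch_direct))    # 32.9 (fetch hidden)
```

If an optimized prefetch baseline keeps t_fetch at or below t_compute under the same offload ratio, the modeled gap closes, which is exactly the falsifying measurement described above.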
Original abstract
LLM inference is constrained by GPU memory capacity and bandwidth. Tiered memory architectures mitigate this by allowing the GPU to offload memory to the remote tier. However, existing memory offloading frameworks rely on prefetching data into local GPU HBM. This approach underutilizes system resources by introducing HBM contention, squandering memory capacity, and creating pipeline bubbles. We show that enabling direct GPU access to remote memory significantly outperforms prefetching, achieving optimal aggregate system bandwidth. We propose DAK, an end-to-end direct-access memory offloading framework that repurposes the Tensor Memory Accelerator (TMA) to asynchronously fetch offloaded weights and KV caches directly from remote memory into GPU shared memory (SMEM). To maximize remote access performance, DAK introduces a greedy algorithm to determine optimal per-operation offloading ratios, alongside active congestion control and TMA multicast to eliminate interconnect bottlenecks and read amplification. Evaluations across diverse architectures show that DAK achieves near-optimal bandwidth aggregation, with up to 3× performance gains on NVLink-C2C and 1.8× on PCIe systems compared to state-of-the-art memory offloading baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DAK, an end-to-end direct-access memory offloading framework for LLM inference that repurposes the Tensor Memory Accelerator (TMA) to asynchronously fetch offloaded weights and KV caches directly from remote memory tiers into GPU shared memory (SMEM). It introduces a greedy algorithm for determining per-operation offloading ratios, active congestion control, and TMA multicast to address interconnect bottlenecks and read amplification. The central claim is that this direct-access approach significantly outperforms traditional prefetching into HBM by achieving near-optimal aggregate system bandwidth, with reported gains of up to 3× on NVLink-C2C and 1.8× on PCIe systems relative to state-of-the-art memory offloading baselines.
Significance. If the empirical results hold under broader conditions, the work could have substantial practical impact on scaling LLM inference beyond single-GPU memory limits by better exploiting tiered memory architectures without HBM contention or pipeline stalls. The focus on hardware-level measurements and systems optimizations (rather than parameter-fitted models) provides concrete guidance for architects and practitioners working with NVLink, PCIe, and similar interconnects.
major comments (2)
- Abstract: The reported performance gains (up to 3× on NVLink-C2C and 1.8× on PCIe) and the claim of 'near-optimal bandwidth aggregation' are presented without any description of experimental methodology, baseline implementations, workload characteristics (e.g., model sizes, batch sizes, sequence lengths), hardware configurations, or error bars. This absence directly undermines verification of the central optimality claim, as the outperformance could be an artifact of unstated test conditions rather than a general advantage of direct TMA-based access.
- Abstract (and implied evaluation): The assumption that TMA can be repurposed for low-overhead asynchronous direct fetches from remote memory without introducing new interconnect contention, read amplification, or requiring non-standard hardware features is load-bearing for the 'optimal efficiency' result. No evidence is provided that the implementation avoids these issues under varied workloads or that the multicast and congestion control fully mitigate them; if TMA usage relies on undocumented behaviors or specific NVLink/PCIe setups, the gains would not generalize.
minor comments (1)
- The abstract uses 'optimal aggregate system bandwidth' and 'near-optimal' interchangeably without a precise definition or metric (e.g., relative to theoretical peak or a specific formula); this should be clarified with an explicit optimality criterion.
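One way to make the requested criterion explicit, suggested here rather than taken from the paper, is a roofline-style ratio of achieved aggregate bandwidth to the sum of local and remote link capacities:

```python
def bandwidth_optimality(bytes_moved, elapsed_s, b_local, b_remote):
    """Achieved aggregate bandwidth as a fraction of the theoretical peak
    in which local and remote streams fully overlap (1.0 = optimal)."""
    achieved = bytes_moved / elapsed_s   # bytes per second
    peak = b_local + b_remote            # bytes per second, roofline
    return achieved / peak

# e.g. 4.0 TB moved in one second against a 4000 + 450 GB/s roofline:
print(round(bandwidth_optimality(4.0e12, 1.0, 4.0e12, 4.5e11), 3))  # 0.899
```

Under this definition "near-optimal" would mean a ratio close to 1.0 across workloads, which gives the abstract's claim a checkable meaning.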
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions have been made to strengthen the presentation while maintaining the accuracy of our claims.
Point-by-point responses
- Referee: Abstract: The reported performance gains (up to 3× on NVLink-C2C and 1.8× on PCIe) and the claim of 'near-optimal bandwidth aggregation' are presented without any description of experimental methodology, baseline implementations, workload characteristics (e.g., model sizes, batch sizes, sequence lengths), hardware configurations, or error bars. This absence directly undermines verification of the central optimality claim, as the outperformance could be an artifact of unstated test conditions rather than a general advantage of direct TMA-based access.
Authors: We agree that the abstract, constrained by length, omits key experimental details that would aid immediate verification of the results. The full manuscript provides these in the Evaluation section, including workload characteristics (models from 7B to 70B parameters, batch sizes 1–64, sequence lengths up to 8K), hardware setups (specific NVLink-C2C and PCIe configurations), baseline implementations (state-of-the-art prefetching frameworks), and error bars from repeated runs. To address the concern directly, we have revised the abstract to include a concise summary of the methodology, workloads, hardware, baselines, and statistical reporting. This change makes the central claims more self-contained without altering the reported gains. Revision: yes
- Referee: Abstract (and implied evaluation): The assumption that TMA can be repurposed for low-overhead asynchronous direct fetches from remote memory without introducing new interconnect contention, read amplification, or requiring non-standard hardware features is load-bearing for the 'optimal efficiency' result. No evidence is provided that the implementation avoids these issues under varied workloads or that the multicast and congestion control fully mitigate them; if TMA usage relies on undocumented behaviors or specific NVLink/PCIe setups, the gains would not generalize.
Authors: Our TMA repurposing relies exclusively on documented NVIDIA APIs for asynchronous memory operations, as described in the System Design and Implementation sections, with no use of undocumented behaviors. The Evaluation section presents hardware-level measurements across multiple architectures and workloads demonstrating that direct access, augmented by the greedy offloading algorithm, active congestion control, and TMA multicast, achieves near-optimal aggregate bandwidth without introducing additional contention or read amplification. To strengthen the generalization argument, we have added a new subsection with further results under varied workloads (different batch sizes and model scales) confirming consistent mitigation of bottlenecks. Revision: partial
Circularity Check
No circularity: empirical systems design with independent hardware measurements
Full rationale
The paper presents an engineering framework (DAK) that repurposes TMA for direct remote fetches, introduces a greedy offloading-ratio algorithm, congestion control, and multicast, then validates via end-to-end benchmarks on NVLink-C2C and PCIe hardware. No equations, predictions, or first-principles results are claimed; performance numbers (3×/1.8× gains, near-optimal bandwidth) are reported from direct measurement rather than derived from fitted parameters or self-referential definitions. The central claim therefore rests on external hardware behavior, not on any reduction of outputs to the paper's own inputs or prior self-citations.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: TMA hardware supports asynchronous direct fetches from remote memory into SMEM with acceptable latency and without side effects on compute.
- Domain assumption: The interconnect supports multicast and congestion control sufficient to eliminate read amplification.
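The second assumption is easiest to see as counting: without multicast, every SM (or GPU) consuming a shared tile pulls its own copy across the interconnect, multiplying link traffic by the consumer count; multicast sends the tile once and fans it out. A toy count, with an illustrative tile size:

```python
def link_bytes(tile_bytes, n_consumers, multicast):
    # Without multicast each consumer fetches its own copy of the shared
    # tile over the link; with multicast one transfer serves all consumers.
    return tile_bytes if multicast else tile_bytes * n_consumers

TILE = 128 * 1024  # 128 KiB tile, illustrative
print(link_bytes(TILE, 16, multicast=False) // TILE)  # 16 (16x amplification)
print(link_bytes(TILE, 16, multicast=True) // TILE)   # 1
```

Whether the hardware multicast path actually delivers this 1x traffic under DAK's access pattern is exactly what the assumption asserts.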
invented entities (1)
- DAK framework: no independent evidence