pith. machine review for the scientific record. sign in

arxiv: 2604.26074 · v1 · submitted 2026-04-28 · 💻 cs.DC

Recognition: unknown

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

Authors on Pith no claims yet

Pith reviewed 2026-05-07 14:55 UTC · model grok-4.3

classification 💻 cs.DC
keywords GPU memory offloadingLLM inferencedirect memory accessTensor Memory Acceleratortiered memorybandwidth aggregationNVLinkPCIe
0
0 comments X

The pith

Direct GPU access to remote memory via repurposed TMA outperforms prefetching and achieves near-optimal bandwidth aggregation for LLM inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing offloading methods prefetch data into GPU HBM, which creates contention, wastes capacity, and stalls pipelines. DAK instead lets the GPU fetch offloaded weights and KV caches directly from remote tiers using the Tensor Memory Accelerator for asynchronous loads into shared memory. A greedy algorithm sets per-operation offloading ratios while congestion control and multicast remove bottlenecks and read amplification. Across NVLink-C2C and PCIe systems this yields up to 3x and 1.8x speedups over prior baselines while using aggregate bandwidth close to the hardware maximum.

Core claim

We show that enabling direct GPU access to remote memory significantly outperforms prefetching, achieving optimal aggregate system bandwidth. DAK repurposes TMA to asynchronously fetch offloaded weights and KV caches directly from remote memory into GPU shared memory, paired with a greedy algorithm for optimal per-operation offloading ratios, active congestion control, and TMA multicast to eliminate interconnect bottlenecks and read amplification.

What carries the argument

Repurposed Tensor Memory Accelerator (TMA) performing asynchronous direct fetches from remote memory tiers into SMEM, guided by greedy offloading-ratio selection and multicast congestion control.

If this is right

  • HBM capacity is fully available for active computation rather than holding prefetched data.
  • Pipeline bubbles from prefetch waits disappear because fetches occur asynchronously into SMEM.
  • Aggregate system bandwidth approaches the sum of local and remote link capacities.
  • Read amplification and interconnect hotspots are removed by multicast and congestion control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same direct-access pattern could apply to other bandwidth-bound GPU workloads such as graph analytics or scientific simulation.
  • If TMA is unavailable on future GPUs, equivalent low-overhead remote loads would require new hardware primitives.
  • Memory-tier designers might prioritize low-latency direct-access links over larger local HBM if the performance gap persists.

Load-bearing premise

The Tensor Memory Accelerator can be repurposed for low-overhead asynchronous direct fetches from remote memory without new contention or hardware changes beyond what is already available.

What would settle it

A measurement on an NVLink-C2C system showing that TMA direct fetches add more latency or interconnect contention than an optimized prefetch baseline would falsify the claimed bandwidth optimality and speedup.

Figures

Figures reproduced from arXiv: 2604.26074 by Jiaxin Lin, Shouxu Lin, Zhiyuan Guo.

Figure 1
Figure 1. Figure 1: Comparison of a prefetch-based system and our direct-access￾based system, measured on GH200 with OPT-30b. latency penalties. To overcome the challenge of utilizing re￾mote capacity without suffering interconnect latency penal￾ties, prefetching-based frameworks have been proposed. In the context of LLM inference, these frameworks offload a subset of weights and KV caches in remote memory and hide the remote… view at source ↗
Figure 3
Figure 3. Figure 3: Direct memory offload vs. copy data paths. LLM Pipeline Orchestration: Another line of work builds customized GPU kernels or inference pipelines that explic￾itly control when data is moved from remote memory into local GPU HBM. These systems frame memory offloading as a scheduling optimization problem to maximize the over￾lap between remote memory communication and active GPU computation. For example, Flex… view at source ↗
Figure 4
Figure 4. Figure 4: DAK overview. concurrently direct access HBM and host DRAM. This con￾current access is achieved through fine-grained weight par￾titioning and optimized GPU kernel running on partitioned memory view at source ↗
Figure 5
Figure 5. Figure 5: DAK partitions and executes a matrix multiplication. cache-coherent Grace Hopper Architecture), requiring the ker￾nel to handle disparate data paths and memory characteristics. In 4.3, we present a congestion control and TMA multicast mechanism that mitigates intra-GPU congestion and handles non-cacheable host memory access. 4 Design 4.1 DAK Design LLM inference is dominated by matrix multiplication opera￾… view at source ↗
Figure 6
Figure 6. Figure 6: Effective bandwidth of memory/compute-bound operations. 4.2.1 The Suboptimality of Uniform Offloading Given a global offloading ratio OR (determined by total mem￾ory footprint and available GPU memory), the system must distribute this budget across individual operations. A naive ap￾proach is to apply a uniform offloading ratio, where every op￾eration offloads exactly OR fraction of its involving matrices t… view at source ↗
Figure 7
Figure 7. Figure 7: Local/remote memory access congestion on GH200. Increased NSM_Host or Nin f light reduces SM-to-HBM bandwidth. xi ∈ [0,1] to each operation Fi . The sum of offloaded data from all operations should match the global ratio constraint. Our objective is to minimize the end-to-end execution time, which is the sum of individual operation latencies. Using the effective bandwidth model, the latency of operation Fi… view at source ↗
Figure 8
Figure 8. Figure 8: Performance w/ batch size = 8 under vary offloading ratios. Testbed & Workloads: We evaluate our system across two distinct NVIDIA GPU architectures to capture a range of interconnect technologies and GPU generations: • Grace Hopper (GH200): An NVLink-C2C-based system featuring a 900 GB/s host-GPU interconnect and 480 GB of host memory. The GPU includes 96 GB of local HBM3 with 4.0 TB/s of bandwidth. • RTX… view at source ↗
Figure 9
Figure 9. Figure 9: Performance w/ batch size = 512 under vary offloading ratios As shown in Fig. 8c and Fig. 8d, on the PCIe-based RTX 6000 Blackwell system, DAK remains strictly better to all prefetching schemes. It achieves near-ideal aggregate sys￾tem memory bandwidth, demonstrating a 4% improvement through offloading that closely approaches the theoretical limits of the PCIe Gen5 interconnect. At low-to-medium of￾fload r… view at source ↗
Figure 11
Figure 11. Figure 11: Compare greedy with uniform offloading. Batch size is 512. sweeping offload ratios, we evaluate DAK’s capability to efficiently run large models on GPUs with limited memory capacity, with the global offloading ratio determined by the real available GPU memory. As shown in view at source ↗
Figure 12
Figure 12. Figure 12: Effectiveness of congestion control and kernel alignment on different matrix sizes. 0 50 100 Offloading Ratio (%) 0 500 1000 Effective Bandwidth (GB/s) N = 512 0 50 100 Offloading Ratio (%) 0 200 400 600 Effective Bandwidth (GB/s) N = 1024 DAK w/o Multicast DAK view at source ↗
Figure 13
Figure 13. Figure 13: GEMM of weights (7168,7168) and hidden states (7168, N). greedy strategy against a uniform offloading baseline using a batch size of 512, which is a workload that has a mix of compute- and memory-bound operations. As shown in view at source ↗
Figure 14
Figure 14. Figure 14: Figure shows the performance for GH200 under different configurations. The table shows the memory footprint for different con￾figurations and the corresponding global memory offload ratio. Putting everything together, our offloading algorithm satis￾fies constraint 4 in Phase 1 and constraint 5 in Phase 2 respec￾tively, making it optimal. In Phase 3, any feasible allocation is optimal. Therefore, the propo… view at source ↗
read the original abstract

LLM inference is constrained by GPU memory capacity and bandwidth. Tiered memory architectures mitigate this by allowing the GPU to offload memory to the remote tier. However, existing memory offloading frameworks rely on prefetching data into local GPU HBM. This approach underutilizes system resources by introducing HBM contention, squandering memory capacity, and creating pipeline bubbles. We show that enabling direct GPU access to remote memory significantly outperforms prefetching, achieving optimal aggregate system bandwidth. We propose DAK, an end-to-end direct-access memory offloading framework that repurposes the Tensor Memory Accelerator (TMA) to asynchronously fetch offloaded weights and KV caches directly from remote memory into GPU shared memory (SMEM). To maximize remote access performance, DAK introduces a greedy algorithm to determine optimal per-operation offloading ratios, alongside active congestion control and TMA multicast to eliminate interconnect bottlenecks and read amplification. Evaluations across diverse architectures show that DAK achieves near-optimal bandwidth aggregation, with up to 3$\times$ performance gains on NVLink-C2C and 1.8$\times$ on PCIe systems compared to state-of-the-art memory offloading baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes DAK, an end-to-end direct-access memory offloading framework for LLM inference that repurposes the Tensor Memory Accelerator (TMA) to asynchronously fetch offloaded weights and KV caches directly from remote memory tiers into GPU shared memory (SMEM). It introduces a greedy algorithm for determining per-operation offloading ratios, active congestion control, and TMA multicast to address interconnect bottlenecks and read amplification. The central claim is that this direct-access approach significantly outperforms traditional prefetching into HBM by achieving near-optimal aggregate system bandwidth, with reported gains of up to 3× on NVLink-C2C and 1.8× on PCIe systems relative to state-of-the-art memory offloading baselines.

Significance. If the empirical results hold under broader conditions, the work could have substantial practical impact on scaling LLM inference beyond single-GPU memory limits by better exploiting tiered memory architectures without HBM contention or pipeline stalls. The focus on hardware-level measurements and systems optimizations (rather than parameter-fitted models) provides concrete guidance for architects and practitioners working with NVLink, PCIe, and similar interconnects.

major comments (2)
  1. Abstract: The reported performance gains (up to 3× on NVLink-C2C and 1.8× on PCIe) and the claim of 'near-optimal bandwidth aggregation' are presented without any description of experimental methodology, baseline implementations, workload characteristics (e.g., model sizes, batch sizes, sequence lengths), hardware configurations, or error bars. This absence directly undermines verification of the central optimality claim, as the outperformance could be an artifact of unstated test conditions rather than a general advantage of direct TMA-based access.
  2. Abstract (and implied evaluation): The assumption that TMA can be repurposed for low-overhead asynchronous direct fetches from remote memory without introducing new interconnect contention, read amplification, or requiring non-standard hardware features is load-bearing for the 'optimal efficiency' result. No evidence is provided that the implementation avoids these issues under varied workloads or that the multicast and congestion control fully mitigate them; if TMA usage relies on undocumented behaviors or specific NVLink/PCIe setups, the gains would not generalize.
minor comments (1)
  1. The abstract uses 'optimal aggregate system bandwidth' and 'near-optimal' interchangeably without a precise definition or metric (e.g., relative to theoretical peak or a specific formula); this should be clarified with an explicit optimality criterion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions have been made to strengthen the presentation while maintaining the accuracy of our claims.

read point-by-point responses
  1. Referee: Abstract: The reported performance gains (up to 3× on NVLink-C2C and 1.8× on PCIe) and the claim of 'near-optimal bandwidth aggregation' are presented without any description of experimental methodology, baseline implementations, workload characteristics (e.g., model sizes, batch sizes, sequence lengths), hardware configurations, or error bars. This absence directly undermines verification of the central optimality claim, as the outperformance could be an artifact of unstated test conditions rather than a general advantage of direct TMA-based access.

    Authors: We agree that the abstract, constrained by length, omits key experimental details that would aid immediate verification of the results. The full manuscript provides these in the Evaluation section, including workload characteristics (models from 7B to 70B parameters, batch sizes 1-64, sequence lengths up to 8K), hardware setups (specific NVLink-C2C and PCIe configurations), baseline implementations (state-of-the-art prefetching frameworks), and error bars from repeated runs. To address the concern directly, we have revised the abstract to include a concise summary of the methodology, workloads, hardware, baselines, and statistical reporting. This change makes the central claims more self-contained without altering the reported gains. revision: yes

  2. Referee: Abstract (and implied evaluation): The assumption that TMA can be repurposed for low-overhead asynchronous direct fetches from remote memory without introducing new interconnect contention, read amplification, or requiring non-standard hardware features is load-bearing for the 'optimal efficiency' result. No evidence is provided that the implementation avoids these issues under varied workloads or that the multicast and congestion control fully mitigate them; if TMA usage relies on undocumented behaviors or specific NVLink/PCIe setups, the gains would not generalize.

    Authors: Our TMA repurposing relies exclusively on documented NVIDIA APIs for asynchronous memory operations, as described in the System Design and Implementation sections, with no use of undocumented behaviors. The Evaluation section presents hardware-level measurements across multiple architectures and workloads demonstrating that direct access, augmented by the greedy offloading algorithm, active congestion control, and TMA multicast, achieves near-optimal aggregate bandwidth without introducing additional contention or read amplification. To strengthen the generalization argument, we have added a new subsection with further results under varied workloads (different batch sizes and model scales) confirming consistent mitigation of bottlenecks. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical systems design with independent hardware measurements

full rationale

The paper presents an engineering framework (DAK) that repurposes TMA for direct remote fetches, introduces a greedy offloading-ratio algorithm, congestion control, and multicast, then validates via end-to-end benchmarks on NVLink-C2C and PCIe hardware. No equations, predictions, or first-principles results are claimed; performance numbers (3×/1.8× gains, near-optimal bandwidth) are reported from direct measurement rather than derived from fitted parameters or self-referential definitions. The central claim therefore rests on external hardware behavior, not on any reduction of outputs to the paper's own inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on hardware assumptions about TMA behavior and interconnect properties rather than new mathematical axioms or fitted constants. No free parameters are explicitly introduced; the greedy algorithm is presented as determining ratios without hand-tuned values.

axioms (2)
  • domain assumption TMA hardware supports asynchronous direct fetches from remote memory into SMEM with acceptable latency and without side effects on compute.
    Invoked in the description of DAK's core mechanism.
  • domain assumption Interconnect supports multicast and congestion control sufficient to eliminate read amplification.
    Required for the claimed elimination of bottlenecks.
invented entities (1)
  • DAK framework no independent evidence
    purpose: End-to-end direct-access offloading system
    New software stack proposed in the paper.

pith-pipeline@v0.9.0 · 5508 in / 1376 out tokens · 75681 ms · 2026-05-07T14:55:49.881770+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 8 canonical work pages

  1. [1]

    Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 117–134, 2024

  2. [2]

    Llm in a flash: Efficient large language model inference with lim- ited memory

    Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Be- lenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient large language model inference with lim- ited memory. InProceedings of the 62nd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12562–12...

  3. [3]

    In-depth analyses of unified virtual memory system for gpu accelerated computing

    Tyler Allen and Rong Ge. In-depth analyses of unified virtual memory system for gpu accelerated computing. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Anal- ysis, pages 1–15, 2021

  4. [4]

    Deepspeed-inference: enabling efficient in- ference of transformer models at unprecedented scale

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Am- mar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient in- ference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, ...

  5. [5]

    Moe-lightning: High-throughput moe inference on memory-constrained gpus

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xi- aoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Za- haria, and Ion Stoica. Moe-lightning: High-throughput moe inference on memory-constrained gpus. InPro- ceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 715–...

  6. [6]

    Cheng, Z

    Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, et al. Mirage persistent ker- nel: A compiler and runtime for mega-kernelizing tensor programs.arXiv preprint arXiv:2512.22219, 2025

  7. [7]

    Perfor- mance evaluation of advanced features in cuda unified memory

    Steven Chien, Ivy Peng, and Stefano Markidis. Perfor- mance evaluation of advanced features in cuda unified memory. In2019 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), pages 50–57. IEEE, 2019

  8. [8]

    Compute express link (cxl) specifica- tion, revision 3.1

    CXL Consortium. Compute express link (cxl) specifica- tion, revision 3.1. Technical report, Compute Express Link Consortium, November 2023. The foundational spec enabling peer-to-peer accelerator memory pooling and fabric routing

  9. [9]

    Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

  10. [10]

    Hetero- geneous die-to-die interfaces: Enabling more flexible chiplet interconnection systems

    Yinxiao Feng, Dong Xiang, and Kaisheng Ma. Hetero- geneous die-to-die interfaces: Enabling more flexible chiplet interconnection systems. InProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 930–943, 2023

  11. [11]

    Understanding data movement in tightly coupled heterogeneous systems: A case study with the grace hopper superchip.arXiv preprint arXiv:2408.11556, 2024

    Luigi Fusco, Mikhail Khalilov, Marcin Chrapek, Girid- har Chukkapalli, Thomas Schulthess, and Torsten Hoe- fler. Understanding data movement in tightly coupled heterogeneous systems: A case study with the grace hopper superchip.arXiv preprint arXiv:2408.11556, 2024

  12. [12]

    AI and memory wall.IEEE Micro, 44(1):14–24, 2024

    Amir Gholami, Zhewei Yao, Sehoon Kim, Michael W Mahoney, and Kurt Keutzer. AI and memory wall.IEEE Micro, 44(1):14–24, 2024

  13. [13]

    Memory pooling with cxl.IEEE Micro, 43(2):48–57, 2023

    Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Sangwon Lee, and Myoungsoo Jung. Memory pooling with cxl.IEEE Micro, 43(2):48–57, 2023

  14. [14]

    Swapad- visor: Pushing deep learning beyond the gpu memory limit via smart swapping

    Chien-Chin Huang, Gu Jin, and Jinyang Li. Swapad- visor: Pushing deep learning beyond the gpu memory limit via smart swapping. InProceedings of the Twenty- Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1341–1355, 2020

  15. [15]

    Neo: Saving gpu memory crisis with cpu offloading for online llm inference.Proceedings of Machine Learning and Systems, 7, 2025

    Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. Neo: Saving gpu memory crisis with cpu offloading for online llm inference.Proceedings of Machine Learning and Systems, 7, 2025

  16. [16]

    Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models.arXiv preprint arXiv:2402.07033, 2024

    Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models.arXiv preprint arXiv:2402.07033, 2024

  17. [17]

    The breakthrough memory solutions for improved performance on llm inference.IEEE Micro, 44(3):40–48, 2024

    Byeongho Kim, Sanghoon Cha, Sangsoo Park, Jieun Lee, Sukhan Lee, Shin-haeng Kang, Jinin So, Kyungsoo Kim, Jin Jung, Jong-Geon Lee, et al. The breakthrough memory solutions for improved performance on llm inference.IEEE Micro, 44(3):40–48, 2024

  18. [18]

    Lia: A single-gpu llm inference acceleration with coopera- tive amx-enabled cpu-gpu computation and cxl offload- ing

    Hyungyo Kim, Nachuan Wang, Qirong Xia, Jinghan Huang, Amir Yazdanbakhsh, and Nam Sung Kim. Lia: A single-gpu llm inference acceleration with coopera- tive amx-enabled cpu-gpu computation and cxl offload- ing. InProceedings of the 52nd Annual International Symposium on Computer Architecture, pages 544–558, 2025

  19. [19]

    Pond: Cxl-based memory pooling systems for cloud platforms

    Huaicheng Li, Daniel S Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, et al. Pond: Cxl-based memory pooling systems for cloud platforms. InProceedings of the 28th ACM International Confer- ence on Architectural Support for Programming Lan- guages and Operating Systems, Volume 2, ...

  20. [20]

    Fenghuang: Next-generation memory orchestration for ai inferenc- ing.arXiv preprint arXiv:2511.10753, 2025

    Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuo- tao Xu, Peng Cheng, and Lidong Zhou. Fenghuang: Next-generation memory orchestration for ai inferenc- ing.arXiv preprint arXiv:2511.10753, 2025

  21. [21]

    An evaluation of unified memory technology on nvidia gpus

    Wenqiang Li, Guanghao Jin, Xuewen Cui, and Simon See. An evaluation of unified memory technology on nvidia gpus. In2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing, pages 1092–1098. IEEE, 2015

  22. [22]

    Cutlass 3.0: Fast linear algebra in cuda c++

    NVIDIA. Cutlass 3.0: Fast linear algebra in cuda c++. https://github.com/NVIDIA/cutlass, 2023

  23. [23]

    NVIDIA Blackwell GeForce RTX 50 Series Opens New World of AI Computer Graphics

    NVIDIA. NVIDIA Blackwell GeForce RTX 50 Series Opens New World of AI Computer Graphics. NVIDIA Newsroom, January 2025

  24. [24]

    NVIDIA RTX PRO Blackwell GPU Archi- tecture

    NVIDIA. NVIDIA RTX PRO Blackwell GPU Archi- tecture. Technical report, NVIDIA Corporation, 2025

  25. [25]

    NVIDIA H100 Tensor Core GPU Architecture

    NVIDIA Corporation. NVIDIA H100 Tensor Core GPU Architecture. Whitepaper, NVIDIA, 2022

  26. [26]

    NVIDIA GH200 Grace Hopper Superchip Architecture

    NVIDIA Corporation. NVIDIA GH200 Grace Hopper Superchip Architecture. Technical whitepaper, NVIDIA, 2023

  27. [27]

    Cuda c programming guide: Us- ing the tensor memory accelerator (tma), 2024

    NVIDIA Corporation. Cuda c programming guide: Us- ing the tensor memory accelerator (tma), 2024. Ac- cessed: 2026-04-15

  28. [28]

    Interconnect bandwidth heterogeneity on amd mi250x and infinity fabric.arXiv preprint arXiv:2302.14827, 2023

    Carl Pearson. Interconnect bandwidth heterogeneity on amd mi250x and infinity fabric.arXiv preprint arXiv:2302.14827, 2023

  29. [29]

    Pytorch documentation: Scaled dot product attention

    PyTorch Contributors. Pytorch documentation: Scaled dot product attention. https://docs.pytorch.org/ docs/stable/generated/torch.nn.functional. scaled_dot_product_attention.html, 2024. Accessed: 2026-04-15

  30. [30]

    {Zero-offload}: De- mocratizing {billion-scale} model training

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Am- inabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. {Zero-offload}: De- mocratizing {billion-scale} model training. In2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021

  31. [31]

    Harnessing integrated cpu-gpu system memory for hpc: a first look into grace hopper

    Gabin Schieffer, Jacob Wahlgren, Jie Ren, Jennifer Faj, and Ivy Peng. Harnessing integrated cpu-gpu system memory for hpc: a first look into grace hopper. InPro- ceedings of the 53rd International Conference on Paral- lel Processing, pages 199–209, 2024

  32. [32]

    Flashattention- 3: Fast and accurate attention with asynchrony and low- precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention- 3: Fast and accurate attention with asynchrony and low- precision.Advances in Neural Information Processing Systems, 37:68658–68685, 2024

  33. [33]

    Flexgen: High-throughput generative inference of large language models with a single gpu

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. InInternational Conference on Machine Learning, pages 31094–31116. PMLR, 2023

  34. [34]

    Thunderkittens: Simple, fast, and adorable ai kernels.arXiv preprint arXiv:2410.20399, 2024

    Benjamin F Spector, Simran Arora, Aaryan Singhal, Daniel Y Fu, and Christopher Ré. Thunderkittens: Simple, fast, and adorable ai kernels.arXiv preprint arXiv:2410.20399, 2024

  35. [35]

    Aqua: Network-accelerated memory offloading for llms in scale-up gpu domains

    Abhishek Vijaya Kumar, Gianni Antichi, and Rachee Singh. Aqua: Network-accelerated memory offloading for llms in scale-up gpu domains. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 48–62, 2025

  36. [36]

    Memory is all you need: An overview of compute-in-memory architectures for accel- erating large language model inference.arXiv preprint arXiv:2406.08413, 2024

    Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, and Toyotaro Suzumura. Memory is all you need: An overview of compute-in-memory architectures for accel- erating large language model inference.arXiv preprint arXiv:2406.08413, 2024

  37. [37]

    Pie: Pooling cpu memory for llm inference.arXiv preprint arXiv:2411.09317, 2024

    Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, and Ion Sto- ica. Pie: Pooling cpu memory for llm inference.arXiv preprint arXiv:2411.09317, 2024. A Formal Proof of Optimal Greedy Offload The optimization problem can be written as: min {xi} ∑ i Ci EB(x i) (1) s.t. ∑ i Cixi =R ∑ i Ci,(2) 0≤x i ≤1,∀i,(3) Next, we explain why the greedy algorithm proposed in §4.2 is...