Interconnect Bandwidth Heterogeneity on AMD MI250X and Infinity Fabric
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
fields: cs.DC 2
verdicts: UNVERDICTED 2
roles: background 1
polarities: background 1
representative citing papers
A survey categorizing vendor mechanisms and user-level libraries for GPU-centric communication within and across nodes, with discussion of benefits, challenges, and open questions.
citing papers explorer
- DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
  DAK enables direct GPU access to remote memory for LLM inference via TMA repurposing and a greedy offloading algorithm, achieving up to 3x gains over prefetching baselines on NVLink-C2C and 1.8x on PCIe.
- The Landscape of GPU-Centric Communication
  A survey categorizing vendor mechanisms and user-level libraries for GPU-centric communication within and across nodes, with discussion of benefits, challenges, and open questions.