Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.
A survey categorizing vendor mechanisms and user-level libraries for GPU-centric communication within and across nodes, with discussion of benefits, challenges, and open questions.
citing papers explorer
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis
Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.
-
The Landscape of GPU-Centric Communication
A survey categorizing vendor mechanisms and user-level libraries for GPU-centric communication within and across nodes, with discussion of benefits, challenges, and open questions.