Runtime compression of MPI messages to improve the performance and scalability of parallel applications
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 5
roles: background 1
polarities: background 1
citing papers explorer
- NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
  NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
- FusionRCG: Orchestrating Recursive Computation Graphs across GPU Memory Hierarchies
  FusionRCG uses liveness-aware graph orchestration, Cartesian-to-spherical fusion, and multi-tier kernels to cut intermediate data by up to 7.7x and deliver 3.09x SCF speedup on A100 GPUs.
- RegDem: Increasing GPU Performance via Shared Memory Register Spilling
  RegDem translates SASS code to spill registers to shared memory, increasing occupancy and delivering 9% geometric mean speedup over nvcc with peaks of 18%.
- Extending UNIQuE: Quantum Simulation Speedup for the HHL Algorithm
  Classical emulation of the HHL algorithm via extended UNIQuE scales exponentially only with qubit count and shows runtime advantage over state-vector simulation for small linear systems.
- The Landscape of GPU-Centric Communication
  A survey categorizing vendor mechanisms and user-level libraries for GPU-centric communication within and across nodes, with discussion of benefits, challenges, and open questions.