Fine-grained automated failure management for extreme-scale GPU accelerated systems

Jiakun Yan, Marc Snir · 2025 · arXiv 2285.375988

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms

cs.DC · 2026-05-12 · unverdicted · novelty 4.0

Charm++ techniques enable efficient overdecomposition on multi-vendor GPGPU distributed systems.

Sustaining Exascale Performance: Lessons from HPL and HPL-MxP on Aurora

cs.DC · 2026-04-10 · unverdicted · novelty 4.0

Aurora reached 1.01 EF/s FP64 HPL and 11.64 EF/s HPL-MxP through locality-aware mapping, CPU-GPU pipelining, mixed-precision orchestration, and hybrid resilience on a large Intel GPU-based system.

citing papers explorer

Showing 2 of 2 citing papers.

Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms

cs.DC · 2026-05-12 · unverdicted · none · ref 1

Sustaining Exascale Performance: Lessons from HPL and HPL-MxP on Aurora

cs.DC · 2026-04-10 · unverdicted · none · ref 33
