Characterizing GPU resilience and impact on AI/HPC systems
2 Pith papers cite this work. Polarity classification is still being indexed.
Years: 2026 (2). Verdicts: unverdicted (2).
Citing papers
- The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
  Large-scale GPU fault injection shows that NaN/inf outcomes make up only 1% of silent data corruptions (SDCs), single-bit flips account for under 40%, and corruption addresses are periodic, supporting distribution-aware modeling.
- When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry
  Detachment-class GPU failures show minimal numeric precursors and are visible mainly through structural monitoring collapse; jointly modeling thermal drift and pipeline degradation increases early-warning lead time over GPU-only detection.