When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry

Freja Nordsiek; Julian M. Kunkel; Michael Bidollahkhani

arxiv: 2603.28781 · v2 · submitted 2026-03-17 · 💻 cs.DC · cs.LG

Pith reviewed 2026-05-15 09:41 UTC · model grok-4.3

keywords GPU failures · early warning · observability · telemetry · detachment failures · structural monitoring · HPC · monitoring degradation

The pith

GPU detachment failures show almost no numeric precursor and are detected earlier by jointly modeling GPU telemetry with structural monitoring degradation signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that many GPU failures in production HPC and AI systems occur as abrupt detachments at the driver or interconnect level rather than through gradual drift in temperature or utilization. These events produce little change in standard numeric telemetry but instead manifest as structural collapse in the monitoring pipeline, including lost samples, rising scrape latency, time-series gaps, and the sudden absence of device metrics. The proposed framework combines utilization-aware thermal signatures from GPUs with these pipeline-health indicators and demonstrates longer early-warning lead times on real production traces than GPU-only methods. A reader would care because current monitoring often misses these quiet failures until the GPU is already unavailable, wasting resources on long jobs. The approach is evaluated using correlated GPU, node, monitoring, and scheduler data from a live cluster.

Core claim

Detachment-class GPU failures exhibit minimal numeric precursor and are primarily observable through structural telemetry collapse, while joint modeling of utilization-aware thermal drift signatures in GPU telemetry and monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance increases early-warning lead time compared to GPU-only detection.

What carries the argument

The observability-aware early-warning framework that jointly models GPU utilization-aware thermal drift signatures with monitoring-pipeline degradation indicators including scrape latency, sample loss, and device-metric disappearance.
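The pipeline-health indicators named here can be computed from scrape records alone, without reading any GPU metric value. The following is a minimal Python sketch, not the paper's implementation. The names `pipeline_health` and `missing_devices`, the `(timestamp, scrape_duration)` record layout, the 15-second scrape interval, and the 1.5x gap tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass
class PipelineHealth:
    sample_loss: float            # fraction of expected scrapes that never arrived
    gap_count: int                # gaps longer than the tolerance
    max_gap_s: float              # longest interval without a sample
    mean_scrape_latency_s: float  # average scrape duration of received samples


def pipeline_health(
    samples: Sequence[Tuple[float, float]],  # (timestamp_s, scrape_duration_s), sorted
    window_start: float,
    window_end: float,
    interval_s: float,
    gap_factor: float = 1.5,  # hypothetical tolerance, not taken from the paper
) -> PipelineHealth:
    expected = int((window_end - window_start) // interval_s)
    received = len(samples)
    loss = max(0.0, 1.0 - received / expected) if expected > 0 else 0.0

    # Include the window edges so that a trailing silence counts as a gap.
    times = [window_start] + [t for t, _ in samples] + [window_end]
    gaps = [b - a for a, b in zip(times, times[1:])]
    long_gaps = [g for g in gaps if g > gap_factor * interval_s]

    latency = sum(d for _, d in samples) / received if received else float("nan")
    return PipelineHealth(loss, len(long_gaps), max(gaps), latency)


def missing_devices(expected_ids: Iterable[str], seen_ids: Iterable[str]) -> List[str]:
    # Device-metric disappearance: devices that should report but did not.
    return sorted(set(expected_ids) - set(seen_ids))


# Illustrative data: scrapes slow down and then stop halfway through the window.
health = pipeline_health(
    [(15, 0.1), (30, 0.1), (45, 0.4), (60, 0.9)],
    window_start=0, window_end=120, interval_s=15,
)
gone = missing_devices({"gpu0", "gpu1"}, {"gpu0"})
```

In this made-up window, half of the expected scrapes are missing and the final gap is four intervals long, even though every received GPU value could look normal. That is the sense in which the signal is structural rather than numeric.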

If this is right

Detachment failures become detectable primarily through collapse of structural telemetry rather than numeric drift.
Joint modeling of GPU and monitoring signals extends the available lead time for early-warning actions.
GPU-only detection systematically misses the structural component of these failures on production systems.
Actionable intervention becomes possible before the GPU is fully unavailable to schedulers and jobs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing monitoring stacks could treat pipeline integrity metrics as first-class inputs rather than secondary health checks.
The same joint-modeling pattern may apply to quiet failures in CPUs, storage, or interconnects.
The released dataset allows direct comparison of alternative detection methods on identical traces.
Schedulers could use these signals for proactive job checkpointing or migration before detachment completes.

Load-bearing premise

Structural indicators such as scrape latency increase, sample loss, and device-metric disappearance reliably precede GPU unavailability instead of appearing only as concurrent symptoms.

What would settle it

Telemetry traces in which structural degradation signals appear only after or at the exact moment of GPU detachment, providing no measurable lead time.
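The test described above reduces to comparing, per failure event, the onset time of a structural signal with the detachment time. A hedged Python sketch of that comparison follows. The helper names, the threshold-crossing definition of onset, and the example values are assumptions for illustration; the paper's own onset definition is not given in the text reviewed here.

```python
from typing import List, Optional, Sequence, Tuple

Series = Sequence[Tuple[float, float]]  # (timestamp_s, indicator_value), sorted by time


def first_crossing(series: Series, threshold: float) -> Optional[float]:
    # Timestamp of the first sample whose value exceeds the threshold, if any.
    for t, value in series:
        if value > threshold:
            return t
    return None


def precedence_summary(
    events: Sequence[Tuple[Series, float]],  # (indicator series, detachment time)
    threshold: float,
) -> Tuple[List[Optional[float]], float]:
    # Lead is detachment time minus onset time; None means the signal never fired.
    leads: List[Optional[float]] = []
    for series, detach_time in events:
        onset = first_crossing(series, threshold)
        leads.append(None if onset is None else detach_time - onset)
    # Only a strictly positive lead counts as a precursor rather than a symptom.
    preceded = sum(1 for lead in leads if lead is not None and lead > 0)
    return leads, preceded / len(leads)


# Illustrative scrape-latency traces (seconds) for three hypothetical failures.
events = [
    ([(0, 0.1), (15, 0.1), (30, 0.5), (45, 1.2)], 60),  # signal rises before detachment
    ([(0, 0.1), (15, 0.1), (30, 0.9)], 30),             # signal rises at detachment
    ([(0, 0.1), (15, 0.1)], 30),                        # signal never rises
]
leads, fraction_preceded = precedence_summary(events, threshold=0.4)
```

If most events on real traces looked like the second or third case, with zero or undefined lead, the load-bearing premise would fail. A large share of strictly positive leads would support it.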

Figures

Figures reproduced from arXiv: 2603.28781 by Freja Nordsiek, Julian M. Kunkel, Michael Bidollahkhani.

Original abstract

GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node, monitoring, and scheduler signals can be correlated. Results show that detachment failures exhibit minimal numeric precursor and are primarily observable through structural telemetry collapse, while joint modeling increases early-warning lead time compared to GPU-only detection. The dataset used in this study is publicly available at https://doi.org/10.5281/zenodo.19052367.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags real detachment failures in GPU clusters that lack numeric precursors and shows structural monitoring signals can extend warning time, but the timing evidence needs more detail to confirm precedence over coincidence.

The main takeaway is that some GPU failures in production clusters detach at the driver or interconnect level with almost no advance signal in standard numeric telemetry such as temperature or utilization. Instead, the first clear signs appear in the monitoring pipeline itself, as increased scrape latency, sample loss, and disappearing device metrics. The authors combine these structural indicators with utilization-aware thermal drift to build a joint early-warning model and report longer lead times than GPU-only detection on GWDG production data. They also release the dataset, which is useful for follow-up work.

This is a practical extension of existing observability ideas rather than a new paradigm, but it targets a genuine pain point in large-scale HPC and AI setups. The production correlation of GPU, node, monitoring, and scheduler signals gives the results some grounding.

The soft spot is the evaluation. The abstract claims minimal numeric precursors and better lead time from the joint model, yet supplies no specifics on how lead time was measured, what statistical controls were used, or how the authors ruled out the structural signals simply appearing at the same moment as the failure. The concern about coincidence rather than precedence is worth checking in the full methods; if the timing analysis is thin, the actionable-early-warning claim weakens.

This is for cluster operators and systems researchers who deal with silent GPU failures in production. A reader working on monitoring pipelines would pick up usable ideas here. I would send it to peer review. The core observation is relevant and the data release helps, so referees can push for clearer timing evidence and confirm how much the joint approach actually adds.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an observability-aware early-warning framework for GPU 'detachment-class' failures in HPC/AI systems. These failures exhibit minimal numeric precursors in telemetry and are instead signaled by structural degradation in the monitoring pipeline (scrape latency increase, sample loss, time-series gaps, device-metric disappearance). The framework jointly models utilization-aware thermal drift signatures together with these structural indicators. It is evaluated on production telemetry from GWDG GPU nodes (with correlated GPU, node, monitoring, and scheduler signals) and claims that joint modeling yields greater early-warning lead time than GPU-only detection. The dataset is released publicly.

Significance. If the reported lead-time gains are robustly demonstrated, the work addresses a practical gap in large-scale GPU observability by showing that structural monitoring signals can provide actionable warning where numeric drift is absent. The public dataset release supports reproducibility and further research on failure precursors in production environments.

major comments (2)

[Abstract / Evaluation] Abstract and evaluation description: the central claim that joint modeling 'increases early-warning lead time' compared to GPU-only detection is not supported by any reported methodology for measuring lead time, defining the prediction horizon, applying statistical controls, or handling potential confounds such as concurrent symptoms. Without these details the data-to-claim link cannot be assessed.
[Abstract] Abstract: the assertion that detachment failures 'exhibit minimal numeric precursor and are primarily observable through structural telemetry collapse' requires explicit timing analysis showing that structural indicators (scrape latency, sample loss, metric disappearance) reliably precede driver/interconnect-level unavailability rather than coinciding with it; the current text provides no such analysis.

minor comments (2)

[Abstract] The abstract refers to 'utilization-aware thermal drift signatures' without defining the exact features or drift-detection method used; a concise definition or reference to the relevant subsection would improve clarity.
[Abstract] The public dataset DOI is given, but the manuscript does not state the exact time span, number of nodes, or failure instances included; adding these summary statistics would aid readers in assessing generalizability.
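To make the first minor comment concrete, one possible reading of a "utilization-aware thermal drift signature" is the temperature left unexplained once utilization is accounted for. The Python sketch below illustrates that reading only. The linear temperature-versus-utilization model, the function names, and the numbers are assumptions and are not taken from the paper, which is exactly the gap the comment points at.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    # Ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def thermal_drift(
    ref_util: Sequence[float], ref_temp: Sequence[float],
    cur_util: Sequence[float], cur_temp: Sequence[float],
) -> float:
    # Fit the healthy temperature/utilization relation on a reference window,
    # then report how far recent temperatures sit above that expectation.
    slope, intercept = fit_line(ref_util, ref_temp)
    residuals = [t - (intercept + slope * u) for u, t in zip(cur_util, cur_temp)]
    return sum(residuals) / len(residuals)


# Illustrative values: utilization in percent, temperature in degrees Celsius.
drift_c = thermal_drift(
    ref_util=[0, 50, 100], ref_temp=[40, 55, 70],
    cur_util=[20, 80], cur_temp=[49, 67],
)
```

Under this reading, a GPU running hot at high load yields no drift, while a GPU running a few degrees warmer than its own load-temperature relation predicts does. A definition of this kind in the manuscript would let readers judge how weak the numeric precursor is.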

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to improve clarity on methodology and analysis without altering the core claims or results.

Referee: [Abstract / Evaluation] Abstract and evaluation description: the central claim that joint modeling 'increases early-warning lead time' compared to GPU-only detection is not supported by any reported methodology for measuring lead time, defining the prediction horizon, applying statistical controls, or handling potential confounds such as concurrent symptoms. Without these details the data-to-claim link cannot be assessed.

Authors: We agree that the abstract and evaluation description require more explicit detail on lead-time measurement to support the claim. In the revised manuscript we will expand both sections to define lead time as the interval from the first joint-model trigger (thermal drift plus structural indicator) to confirmed failure, specify the prediction horizon, describe statistical controls via matched non-failure baseline intervals, and note how concurrent symptoms are handled through multi-signal correlation. These additions will be drawn from the existing evaluation framework and will not change the reported results.

revision: yes

Referee: [Abstract] Abstract: the assertion that detachment failures 'exhibit minimal numeric precursor and are primarily observable through structural telemetry collapse' requires explicit timing analysis showing that structural indicators (scrape latency, sample loss, metric disappearance) reliably precede driver/interconnect-level unavailability rather than coinciding with it; the current text provides no such analysis.

Authors: The full manuscript contains timestamp-correlation analysis across the GWDG production traces demonstrating that structural indicators precede unavailability. To satisfy the request for explicit mention in the abstract, we will revise the abstract to include a concise statement of this precedence (average lead of structural signals over driver-level loss). We will also highlight the timing results more prominently in the evaluation section.

revision: yes
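The lead-time definition and the matched-baseline control described in the first response can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' evaluation code. The fusion rule is not specified in the text reviewed here, so the sketch assumes the joint detector fires at the earlier of the two triggers (OR-fusion); the event layout and all numbers are invented.

```python
from statistics import median
from typing import Optional, Sequence


def or_fusion(gpu_trigger: Optional[float], structural_trigger: Optional[float]) -> Optional[float]:
    # Assumed fusion rule: the joint detector fires when either detector fires.
    candidates = [t for t in (gpu_trigger, structural_trigger) if t is not None]
    return min(candidates) if candidates else None


def lead_time_s(trigger: Optional[float], failure: float) -> float:
    # Interval from first trigger to confirmed failure; zero if no advance warning.
    if trigger is None or trigger >= failure:
        return 0.0
    return float(failure - trigger)


def false_alarm_rate(flags: Sequence[bool]) -> float:
    # Share of matched non-failure intervals in which the detector fired.
    return sum(1 for fired in flags if fired) / len(flags)


# Hypothetical failure events: trigger and failure timestamps in seconds.
events = [
    {"gpu": None, "structural": 940.0, "failure": 1000.0},
    {"gpu": 1900.0, "structural": 1950.0, "failure": 2000.0},
    {"gpu": None, "structural": None, "failure": 3000.0},
]
# Hypothetical matched baseline intervals: (gpu fired, structural fired).
baseline = [(False, False), (False, True), (False, False), (True, False)]

gpu_leads = [lead_time_s(e["gpu"], e["failure"]) for e in events]
joint_leads = [lead_time_s(or_fusion(e["gpu"], e["structural"]), e["failure"]) for e in events]
gpu_far = false_alarm_rate([g for g, _ in baseline])
joint_far = false_alarm_rate([g or s for g, s in baseline])

median_gpu_lead = median(gpu_leads)
median_joint_lead = median(joint_leads)
```

Under OR-fusion the joint lead time can never be shorter than the GPU-only lead time, so a reported gain is informative only alongside the false-alarm rate on the baseline intervals. In this invented example the median lead rises from 0 to 60 seconds while the false-alarm rate doubles, which is the trade-off the referee's request for statistical controls is meant to expose.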

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an observability framework for GPU failures using joint modeling of numeric drift and structural indicators (scrape latency, sample loss, metric disappearance) evaluated on external production telemetry from GWDG. No equations, parameter fits, or self-citations appear in the derivation; the central claims rest on empirical correlation of signals rather than reducing to fitted inputs or self-referential definitions. The dataset is publicly released, allowing independent verification outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the framework description does not introduce new mathematical constructs or postulated objects.
