
arxiv: 1907.01019 · v1 · pith:SEVDCUFJnew · submitted 2019-07-01 · 💻 cs.DC

Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

Pith reviewed 2026-05-25 11:20 UTC · model grok-4.3

classification 💻 cs.DC
keywords fault injection · failure analysis · supercomputing · Gemini network · Cray XE · recovery mechanisms · Blue Waters

The pith

Fault injection experiments on Cielo characterize how Gemini network and Cray node faults propagate to unrecoverable failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports a campaign of fault injection experiments on the Cielo Cray XE supercomputer to clarify failure causes and propagation patterns first observed in Blue Waters field data. Faults are introduced by commands that disable network links, directional connections, nodes, and blades, with resulting sequences tracked through system logs and network performance counters. The work maps the progression from fault to error to failure, documents recovery behavior in the Gemini interconnect and compute nodes, and measures effects on both the overall system and running applications at varying scales. A central aim is to isolate and reproduce the specific fault combinations that produce unrecoverable states so that new tests for resilience can be built.

Core claim

The experiments characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, measure the impact of failures on the system and on user applications at different scales, and identify and recreate fault scenarios that induce unrecoverable failures.

What carries the argument

Special input commands that bring down network links, directional connections, nodes, and blades, combined with analysis of logs and network performance counter data.
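
The pairing of injection commands with log analysis can be sketched in a few lines of Python. This is a minimal illustration, not the paper's tooling: the `Event` record, the event kinds, the `window_s` parameter, and the sample data are all hypothetical and do not reflect the actual Cray log format or the HPCArrow implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float        # seconds since campaign start (hypothetical unit)
    component: str  # illustrative link or node identifier
    kind: str       # "inject", "error", "recovery_ok", or "failure"


def sequences_after_injection(events, window_s=60.0):
    """Group the events that follow each injection within a time window.

    Returns a list of (injection_event, [follow-up kinds in time order]).
    """
    ordered = sorted(events, key=lambda e: e.t)
    result = []
    for inj in (e for e in ordered if e.kind == "inject"):
        follow = [
            e.kind
            for e in ordered
            if e.kind != "inject" and inj.t < e.t <= inj.t + window_s
        ]
        result.append((inj, follow))
    return result


# Made-up example data, for illustration only.
log = [
    Event(0.0, "link-A", "inject"),
    Event(1.5, "link-A", "error"),
    Event(20.0, "link-A", "recovery_ok"),
    Event(100.0, "node-B", "inject"),
    Event(101.0, "node-B", "error"),
    Event(130.0, "node-B", "failure"),
]

for inj, follow in sequences_after_injection(log):
    print(inj.component, "->", follow)
```

A fixed time window is the simplest way to attribute log events to an injection; a real campaign would likely also need to match on the affected component and account for overlapping injections.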

If this is right

  • Recovery mechanisms operating inside the Gemini network and Cray compute nodes can be directly observed and documented.
  • The scale-dependent effects of network and node failures on system throughput and application correctness become measurable.
  • Concrete fault combinations that produce unrecoverable states can be recreated on demand for use in new test suites.
  • The injection and analysis approach will require targeted extensions before it can be applied to Cray XC Aries systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The identified unrecoverable scenarios suggest concrete priorities for where checkpointing or redundancy should be strengthened in future large-scale runs.
  • Field-data-guided injection campaigns could be repeated on other interconnect technologies to compare resilience properties across platforms.
  • The work implies that application developers may benefit from test harnesses that deliberately trigger the exact fault patterns shown to defeat recovery.

Load-bearing premise

The injected faults obtained by shutting down links, connections, nodes, and blades accurately reproduce the failure causes and propagation paths recorded in Blue Waters field data.

What would settle it

A comparison showing that real Blue Waters field failures exhibit error propagation or recovery patterns absent from the sequences produced by the link and node shutdown commands would falsify the claim that the injections represent observed field behavior.
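
One way such a comparison might be set up is sketched below, assuming each failure has already been reduced to an ordered sequence of event categories. The function name and the category labels are hypothetical and chosen only for illustration; neither dataset is represented here, and the sample sequences are made up.

```python
def unmatched_field_patterns(field_sequences, injected_sequences):
    """Return field patterns that no injection run reproduced, sorted."""
    injected = {tuple(seq) for seq in injected_sequences}
    field = {tuple(seq) for seq in field_sequences}
    return sorted(field - injected)


# Made-up sequences with hypothetical category names.
field_runs = [
    ["link_down", "reroute", "recovered"],
    ["link_down", "reroute", "reroute_timeout", "system_wide_outage"],
]
injected_runs = [
    ["link_down", "reroute", "recovered"],
]

print(unmatched_field_patterns(field_runs, injected_runs))
```

A non-empty result would point to field behavior that the injections did not produce. Exact sequence matching is deliberately strict; a practical analysis might instead compare subsequences or tolerate reordering.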

Figures

Figures reproduced from arXiv: 1907.01019 by Amanda Bonnie, Annette Greiner, Ann Gentile, Bill Krammer, Daniel Chen, Fei Deng, Jason Repik, Jeremy Enos, Jim Brandt, Larry Kaplan, Mike Mason, Mike Showerman, Ravishankar K. Iyer, Saurabh Jha, Valerio Formicola, Zbigniew Kalbarczyk.

Figure 1: Target components of fault injection experiments.
Figure 2: Recovery procedures of the Cray Gemini high-speed …
Figure 3: HPCArrow: A network fault injection tool for HPC systems. Data produced during an experiment or campaign can be further analyzed using tools like LogDiver. The steps taken by HPCArrow to launch fault injection experiments are shown as S1, S2, ..., S6. … is allowed to run further fault injection campaigns. HPCArrow reports the results of the campaign on the user console and collects all the relevant data for …
Figure 4: Single link failure on connection X+ and restore.
Figure 6: Single connection fault injection as a sequence of link …
Figure 7: Double connection injection (8 links in total) with …
Original abstract

We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA's Blue Waters. We use the data collected from the logs and from network performance counter data 1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, 2) to understand the impact of failures on the system and the user applications at different scale, and 3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands to bring down network links, directional connections, nodes, and blades. We present extensions that will be needed to apply our methodologies of injection and analysis to the Cray XC (Aries) systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents fault injection experiments on the Cielo Cray XE supercomputer (Gemini interconnect) in which administrative commands are used to bring down network links, directional connections, nodes, and blades. Using logs and network performance counters, the work characterizes fault-error-failure sequences and recovery mechanisms, quantifies impacts on system and applications at varying scales, and identifies scenarios that produce unrecoverable failures, with the explicit aim of explaining field observations from Blue Waters and creating new tests for system and application design. It also outlines the extensions needed to apply the methodology to Cray XC (Aries) systems.

Significance. If the injected faults are shown to be representative, the study supplies concrete empirical sequences and scale-dependent impact data that are otherwise difficult to obtain from production logs alone; the direct use of production-class hardware and the dual collection of logs plus counters constitute a methodological strength for HPC reliability research.

major comments (2)
  1. [Abstract] The central claim (Abstract) that the injection campaign improves understanding of the Blue Waters field failure data rests on the unverified premise that command-driven shutdowns produce fault-error-failure sequences and recovery behavior representative of real hardware, firmware, or environmental faults. No quantitative mapping—such as overlap in error codes, latency distributions, or recovery success rates—is reported between the two datasets.
  2. Because the paper provides no validation that the injected faults reproduce the signatures or propagation paths observed in the Blue Waters logs, the characterizations of sequences, unrecoverable scenarios, and scale-dependent impacts cannot yet be treated as explanatory of the field data.
minor comments (1)
  1. [Abstract] The final sentence, on extensions to Aries systems, is stated at a high level; a brief outline of the required changes to the injection and analysis methodology would improve clarity.
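
Two of the quantitative checks named in the first major comment, overlap in error codes and recovery success rates, could be computed along the following lines. This is a hedged sketch under stated assumptions: the code labels, outcome labels, and both function names are hypothetical, and the sample values are made up rather than taken from Cielo or Blue Waters.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of error codes (1.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recovery_success_rate(outcomes):
    """Fraction of outcomes equal to 'recovered'; None for an empty list."""
    if not outcomes:
        return None
    return sum(1 for o in outcomes if o == "recovered") / len(outcomes)


# Made-up values with hypothetical code and outcome labels.
injected_codes = ["E1", "E2", "E3"]
field_codes = ["E2", "E3", "E4", "E5"]
injected_outcomes = ["recovered", "recovered", "failed", "recovered"]
field_outcomes = ["recovered", "failed"]

print(jaccard(injected_codes, field_codes))
print(recovery_success_rate(injected_outcomes))
print(recovery_success_rate(field_outcomes))
```

Set overlap ignores how often each code occurs, so a frequency-weighted measure may be more informative when the two datasets differ greatly in size.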

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need to strengthen the connection between our fault injection experiments and the Blue Waters field data. We address the major comments point by point below. While we maintain that the experiments provide valuable characterizations on production hardware, we agree that the lack of explicit quantitative validation is a limitation and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim (Abstract) that the injection campaign improves understanding of the Blue Waters field failure data rests on the unverified premise that command-driven shutdowns produce fault-error-failure sequences and recovery behavior representative of real hardware, firmware, or environmental faults. No quantitative mapping—such as overlap in error codes, latency distributions, or recovery success rates—is reported between the two datasets.

    Authors: We acknowledge that the manuscript does not include a quantitative mapping (e.g., error code overlap or latency distributions) between the injected faults and Blue Waters logs. The injections were chosen to reproduce the categories of faults observed in the field (link and node failures) on identical Gemini hardware, allowing direct observation of sequences and impacts that are difficult to isolate in production logs. We will revise the abstract and add a limitations discussion to clarify that the connection is based on fault-type similarity rather than statistical equivalence, and note this as an area for future work. revision: yes

  2. Referee: Because the paper provides no validation that the injected faults reproduce the signatures or propagation paths observed in the Blue Waters logs, the characterizations of sequences, unrecoverable scenarios, and scale-dependent impacts cannot yet be treated as explanatory of the field data.

    Authors: The paper's intent is to use controlled injections to recreate and analyze the fault scenarios inferred from Blue Waters, rather than to claim identical reproduction of all signatures. Qualitative alignment in error patterns and recovery behaviors was observed, but we agree that without explicit validation the explanatory link remains inferential. We will add text in the discussion section to address this distinction and temper the claims regarding direct explanation of the field data. revision: yes

Circularity Check

0 steps flagged

No circularity: direct experimental observations without derivations or fitted predictions

Full rationale

The paper reports fault injection experiments on Cielo using admin commands to bring down links/nodes/blades, followed by log and counter analysis to characterize fault-error-failure sequences. No equations, parameters, predictions, or derivations appear. The central claims rest on direct observation of injected faults and their observed impacts, cross-referenced to prior Blue Waters field data without reducing to self-fitted quantities or self-citation chains. This matches the default non-circular case for experimental work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen injection method produces fault sequences representative of real field failures; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Faults injected via special commands to disable links, connections, nodes, and blades produce sequences representative of observed field failures.
    The experiments are motivated by matching Blue Waters observations, making this assumption load-bearing for the claimed characterization and scenario identification.

pith-pipeline@v0.9.0 · 5747 in / 1105 out tokens · 49330 ms · 2026-05-25T11:20:47.088966+00:00 · methodology

