Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo
Pith reviewed 2026-05-25 11:20 UTC · model grok-4.3
The pith
Fault injection experiments on Cielo characterize how Gemini network and Cray node faults propagate to unrecoverable failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The experiments characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, assess the impact of failures on the system and on user applications at different scales, and identify and recreate fault scenarios that induce unrecoverable failures.
What carries the argument
Special input commands that bring down network links, directional connections, nodes, and blades, combined with analysis of logs and network performance counter data.
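The pairing of injections with log analysis can be illustrated with a minimal sketch. It assumes log records have already been parsed into timestamped events; the `LogEvent` fields, the category labels, and the sample records are hypothetical and do not come from the paper or from any Cray log format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogEvent:
    timestamp: float  # seconds since the start of the experiment (assumed unit)
    component: str    # hypothetical identifier for a link, node, or blade
    category: str     # illustrative label: "error", "recovery", or "failure"


def events_after_injection(events: List[LogEvent], injection_time: float,
                           window_s: float) -> List[LogEvent]:
    """Return events inside the window that follows an injection, oldest first."""
    end = injection_time + window_s
    selected = [e for e in events if injection_time <= e.timestamp <= end]
    return sorted(selected, key=lambda e: e.timestamp)


def category_sequence(events: List[LogEvent]) -> List[str]:
    """Collapse consecutive repeats so the ordered chain of categories is visible."""
    sequence: List[str] = []
    for event in events:
        if not sequence or sequence[-1] != event.category:
            sequence.append(event.category)
    return sequence


# Made-up records for illustration only; they are not data from the paper.
sample_log = [
    LogEvent(5.0, "link-a", "error"),      # before the injection, ignored
    LogEvent(12.0, "link-b", "error"),
    LogEvent(13.5, "link-b", "error"),
    LogEvent(20.0, "link-b", "recovery"),
    LogEvent(95.0, "node-c", "failure"),   # outside the 60 s window, ignored
]

observed = category_sequence(events_after_injection(sample_log, 10.0, 60.0))
print(observed)
```

A fixed time window is the simplest possible attribution rule; a real campaign would likely also filter by the component that was brought down.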
If this is right
- Recovery mechanisms operating inside the Gemini network and Cray compute nodes can be directly observed and documented.
- The scale-dependent effects of network and node failures on system throughput and application correctness become measurable.
- Concrete fault combinations that produce unrecoverable states can be recreated on demand for use in new test suites.
- The injection and analysis approach will require targeted extensions before it can be applied to Cray XC Aries systems.
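One way the scale-dependent effect in the second point might be expressed is as a relative slowdown per job size. The sketch below assumes paired run times (a fault-free baseline and a run with an injected fault) are available for each node count; the function names and the input layout are illustrative assumptions, not the paper's metric.

```python
from typing import Dict, Tuple


def relative_slowdown(baseline_s: float, faulted_s: float) -> float:
    """Fractional increase in run time caused by the injected fault."""
    if baseline_s <= 0:
        raise ValueError("baseline run time must be positive")
    return (faulted_s - baseline_s) / baseline_s


def slowdown_by_scale(runs: Dict[int, Tuple[float, float]]) -> Dict[int, float]:
    """Map node count -> relative slowdown, ordered by node count.

    `runs` maps a node count to (baseline seconds, faulted seconds).
    """
    return {nodes: relative_slowdown(base, faulted)
            for nodes, (base, faulted) in sorted(runs.items())}
```

For example, `slowdown_by_scale({1024: (100.0, 150.0), 64: (100.0, 110.0)})` would report a larger fractional slowdown for the 1024-node entry than for the 64-node entry; those numbers are invented for illustration.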
Where Pith is reading between the lines
- The identified unrecoverable scenarios suggest concrete priorities for where checkpointing or redundancy should be strengthened in future large-scale runs.
- Field-data-guided injection campaigns could be repeated on other interconnect technologies to compare resilience properties across platforms.
- The work implies that application developers may benefit from test harnesses that deliberately trigger the exact fault patterns shown to defeat recovery.
Load-bearing premise
The injected faults obtained by shutting down links, connections, nodes, and blades accurately reproduce the failure causes and propagation paths recorded in Blue Waters field data.
What would settle it
A comparison showing that real Blue Waters field failures exhibit error propagation or recovery patterns absent from the sequences produced by the link and node shutdown commands would falsify the claim that the injections represent observed field behavior.
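The comparison described above can be sketched as a set difference over short event patterns. This assumes both datasets have been reduced to ordered lists of event labels; the labels and the choice of fixed-length n-grams are assumptions made for illustration, not the paper's procedure.

```python
from typing import Iterable, List, Set, Tuple

Pattern = Tuple[str, ...]


def ngrams(sequence: List[str], n: int) -> Set[Pattern]:
    """All contiguous runs of n labels found in one event sequence."""
    return {tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)}


def all_patterns(sequences: Iterable[List[str]], n: int) -> Set[Pattern]:
    """Union of the n-grams found across many sequences."""
    patterns: Set[Pattern] = set()
    for sequence in sequences:
        patterns |= ngrams(sequence, n)
    return patterns


def patterns_missing_from_injection(field_sequences: Iterable[List[str]],
                                    injected_sequences: Iterable[List[str]],
                                    n: int = 2) -> Set[Pattern]:
    """Patterns seen in field data that no injected run reproduced."""
    return all_patterns(field_sequences, n) - all_patterns(injected_sequences, n)


# Hypothetical label sequences, invented for this example.
field = [["link_error", "reroute", "recovered"],
         ["link_error", "reroute", "reroute_failed"]]
injected = [["link_error", "reroute", "recovered"]]

missing = patterns_missing_from_injection(field, injected)
print(missing)
```

A non-empty result would point at field behavior that the shutdown commands did not reproduce, which is the kind of evidence that would count against the premise.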
Original abstract
We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA's Blue Waters. We use the data collected from the logs and from network performance counter data 1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, 2) to understand the impact of failures on the system and the user applications at different scale, and 3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands to bring down network links, directional connections, nodes, and blades. We present extensions that will be needed to apply our methodologies of injection and analysis to the Cray XC (Aries) systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents fault injection experiments on the Cielo Cray XE supercomputer (Gemini interconnect) in which administrative commands are used to bring down network links, directional connections, nodes, and blades. Using logs and network performance counters, the work characterizes fault-error-failure sequences and recovery mechanisms, quantifies impacts on system and applications at varying scales, and identifies scenarios that produce unrecoverable failures, with the explicit aim of explaining field observations from Blue Waters and generating new tests for Cray XC (Aries) systems.
Significance. If the injected faults are shown to be representative, the study supplies concrete empirical sequences and scale-dependent impact data that are otherwise difficult to obtain from production logs alone; the direct use of production-class hardware and the dual collection of logs plus counters constitute a methodological strength for HPC reliability research.
major comments (2)
- [Abstract] The central claim that the injection campaign improves understanding of the Blue Waters field failure data rests on the unverified premise that command-driven shutdowns produce fault-error-failure sequences and recovery behavior representative of real hardware, firmware, or environmental faults. No quantitative mapping (such as overlap in error codes, latency distributions, or recovery success rates) is reported between the two datasets.
- Because the paper provides no validation that the injected faults reproduce the signatures or propagation paths observed in the Blue Waters logs, the characterizations of sequences, unrecoverable scenarios, and scale-dependent impacts cannot yet be treated as explanatory of the field data.
minor comments (1)
- [Abstract] The final paragraph on extensions to Aries systems is stated at a high level; a brief outline of the required changes to the injection and analysis methodology would improve clarity.
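Two of the measures named in the first major comment, error-code overlap and recovery success rate, could be computed along the following lines. This is a hedged sketch: the function names are hypothetical, Jaccard similarity is one possible overlap measure rather than one the paper or the referee specifies, and the inputs are assumed to be already extracted from the two datasets.

```python
from typing import Iterable, Set


def jaccard_overlap(field_codes: Iterable[str],
                    injected_codes: Iterable[str]) -> float:
    """Share of distinct error codes that appear in both datasets."""
    field: Set[str] = set(field_codes)
    injected: Set[str] = set(injected_codes)
    union = field | injected
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(field & injected) / len(union)


def recovery_success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of recovery attempts that succeeded (True means success)."""
    results = list(outcomes)
    if not results:
        raise ValueError("at least one recovery attempt is required")
    return sum(results) / len(results)
```

Comparing `recovery_success_rate` for field events against the same rate for injected events, alongside the code overlap, would give a first-order version of the mapping the referee asks for.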
Simulated Author's Rebuttal
We thank the referee for highlighting the need to strengthen the connection between our fault injection experiments and the Blue Waters field data. We address the major comments point by point below. While we maintain that the experiments provide valuable characterizations on production hardware, we agree that the lack of explicit quantitative validation is a limitation and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The central claim that the injection campaign improves understanding of the Blue Waters field failure data rests on the unverified premise that command-driven shutdowns produce fault-error-failure sequences and recovery behavior representative of real hardware, firmware, or environmental faults. No quantitative mapping (such as overlap in error codes, latency distributions, or recovery success rates) is reported between the two datasets.
  Authors: We acknowledge that the manuscript does not include a quantitative mapping (e.g., error code overlap or latency distributions) between the injected faults and Blue Waters logs. The injections were chosen to reproduce the categories of faults observed in the field (link and node failures) on identical Gemini hardware, allowing direct observation of sequences and impacts that are difficult to isolate in production logs. We will revise the abstract and add a limitations discussion to clarify that the connection is based on fault-type similarity rather than statistical equivalence, and note this as an area for future work.
  Revision: yes
- Referee: Because the paper provides no validation that the injected faults reproduce the signatures or propagation paths observed in the Blue Waters logs, the characterizations of sequences, unrecoverable scenarios, and scale-dependent impacts cannot yet be treated as explanatory of the field data.
  Authors: The paper's intent is to use controlled injections to recreate and analyze the fault scenarios inferred from Blue Waters, rather than to claim identical reproduction of all signatures. Qualitative alignment in error patterns and recovery behaviors was observed, but we agree that without explicit validation the explanatory link remains inferential. We will add text in the discussion section to address this distinction and temper the claims regarding direct explanation of the field data.
  Revision: yes
Circularity Check
No circularity: direct experimental observations without derivations or fitted predictions
The paper reports fault injection experiments on Cielo using admin commands to bring down links/nodes/blades, followed by log and counter analysis to characterize fault-error-failure sequences. No equations, parameters, predictions, or derivations appear. The central claims rest on direct observation of injected faults and their observed impacts, cross-referenced to prior Blue Waters field data without reducing to self-fitted quantities or self-citation chains. This matches the default non-circular case for experimental work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Faults injected via special commands to disable links, connections, nodes, and blades produce sequences representative of observed field failures.