Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

arxiv: 1907.01019 · v1 · pith:SEVDCUFJnew · submitted 2019-07-01 · 💻 cs.DC

Valerio Formicola, Saurabh Jha, Daniel Chen, Fei Deng, Amanda Bonnie, Mike Mason, Jim Brandt, Ann Gentile, Larry Kaplan, Jason Repik, Jeremy Enos, Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Bill Krammer

Pith reviewed 2026-05-25 11:20 UTC · model grok-4.3

classification 💻 cs.DC

keywords fault injection · failure analysis · supercomputing · Gemini network · Cray XE · recovery mechanisms · Blue Waters

The pith

Fault injection experiments on Cielo characterize how Gemini network and Cray node faults propagate to unrecoverable failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports a campaign of fault injection experiments on the Cielo Cray XE supercomputer to clarify failure causes and propagation patterns first observed in Blue Waters field data. Faults are introduced by commands that disable network links, directional connections, nodes, and blades, with resulting sequences tracked through system logs and network performance counters. The work maps the progression from fault to error to failure, documents recovery behavior in the Gemini interconnect and compute nodes, and measures effects on both the overall system and running applications at varying scales. A central aim is to isolate and reproduce the specific fault combinations that produce unrecoverable states so that new tests for resilience can be built.

Core claim

The experiments characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, measure the impact of failures on the system and on user applications at different scales, and identify and recreate fault scenarios that induce unrecoverable failures.

What carries the argument

Special input commands that bring down network links, directional connections, nodes, and blades, combined with analysis of logs and network performance counter data.
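
A minimal sketch of how such a log analysis might group events into fault-error-failure chains. The `Event` record, the four `kind` labels, and the fixed time window are illustrative assumptions, not the paper's log format or tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    t: float   # seconds since the start of the experiment
    kind: str  # hypothetical labels: "fault", "error", "recovery", "failure"
    msg: str   # free-text log message


def chains(events: List[Event], window: float) -> List[List[Event]]:
    """Group events into one chain per injected fault.

    A chain starts at each "fault" event and collects every later event
    that occurs before the next fault and within `window` seconds of the
    fault that opened the chain.
    """
    result: List[List[Event]] = []
    current: List[Event] = []
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind == "fault":
            if current:
                result.append(current)
            current = [ev]
        elif current and ev.t - current[0].t <= window:
            current.append(ev)
    if current:
        result.append(current)
    return result


def outcome(chain: List[Event]) -> str:
    """Classify a chain by the most severe event kind it contains."""
    kinds = {e.kind for e in chain}
    if "failure" in kinds:
        return "failure"
    if "recovery" in kinds:
        return "recovered"
    return "unknown"
```

Attributing each event only to the most recent fault keeps the sketch short. Overlapping injections, such as the multi-link scenarios shown in the figures, would need a richer attribution rule.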

If this is right

Recovery mechanisms operating inside the Gemini network and Cray compute nodes can be directly observed and documented.
The scale-dependent effects of network and node failures on system throughput and application correctness become measurable.
Concrete fault combinations that produce unrecoverable states can be recreated on demand for use in new test suites.
The injection and analysis approach will require targeted extensions before it can be applied to Cray XC Aries systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The identified unrecoverable scenarios suggest concrete priorities for where checkpointing or redundancy should be strengthened in future large-scale runs.
Field-data-guided injection campaigns could be repeated on other interconnect technologies to compare resilience properties across platforms.
The work implies that application developers may benefit from test harnesses that deliberately trigger the exact fault patterns shown to defeat recovery.

Load-bearing premise

The injected faults obtained by shutting down links, connections, nodes, and blades accurately reproduce the failure causes and propagation paths recorded in Blue Waters field data.

What would settle it

A comparison showing that real Blue Waters field failures exhibit error propagation or recovery patterns absent from the sequences produced by the link and node shutdown commands would falsify the claim that the injections represent observed field behavior.
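
The comparison described above could be sketched as a set difference over event transitions. This is an illustration under stated assumptions: the sequences are lists of event-type labels already extracted from logs, and both the label names and the choice of adjacent-pair transitions as the "pattern" unit are hypothetical rather than taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

Transition = Tuple[str, str]


def transitions(sequences: Iterable[List[str]]) -> Set[Transition]:
    """Collect every adjacent (event, next_event) pair seen in any sequence."""
    seen: Set[Transition] = set()
    for seq in sequences:
        seen.update(zip(seq, seq[1:]))
    return seen


def unmatched_field_transitions(
    field: Iterable[List[str]], injected: Iterable[List[str]]
) -> Set[Transition]:
    """Return transitions observed in field data but never in injected runs."""
    return transitions(field) - transitions(injected)
```

A non-empty result would point to propagation steps in the field data that the injected shutdowns never produced. An empty result would not by itself establish that the injections are representative, since adjacent pairs ignore timing and longer-range ordering.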

Figures

Figures reproduced from arXiv: 1907.01019 by Amanda Bonnie, Annette Greiner, Ann Gentile, Bill Krammer, Daniel Chen, Fei Deng, Jason Repik, Jeremy Enos, Jim Brandt, Larry Kaplan, Mike Mason, Mike Showerman, Ravishankar K. Iyer, Saurabh Jha, Valerio Formicola, Zbigniew Kalbarczyk.

**Figure 2.** Recovery procedures of the Cray Gemini high-speed …

**Figure 3.** HPCArrow: A network fault injection tool for HPC systems. Data produced during an experiment or campaign can be further analyzed using tools like LogDiver. The steps taken by HPCArrow to launch fault injection experiments are shown as S1, S2, ..., S6.

**Figure 4.** Single link failure on connection X+ and restore.

**Figure 6.** Single connection fault injection as a sequence of link …

**Figure 7.** Double connection injection (8 links in total) with …

Original abstract

We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of failure causes and propagation that we observed in the field failure data analysis of NCSA's Blue Waters. We use the data collected from the logs and from network performance counter data 1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, 2) to understand the impact of failures on the system and the user applications at different scale, and 3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands to bring down network links, directional connections, nodes, and blades. We present extensions that will be needed to apply our methodologies of injection and analysis to the Cray XC (Aries) systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Cielo injections add targeted failure data but the match to Blue Waters real faults rests on an assumption without clear quantitative backing.

The paper reports new fault-injection runs on the Cielo Cray XE machine that try to reproduce and explain the failure sequences seen earlier in Blue Waters field logs. The authors shut down links, directional connections, nodes, and blades via admin commands, then track the resulting error and recovery behavior through logs and network counters. The goals are to map fault-error-failure chains in the Gemini network and Cray nodes, measure scale effects on applications, and flag unrecoverable cases for future test design. They also note what would need to change for Aries-based systems. That is the concrete addition: fresh experimental traces from a production-scale machine tied to an existing field study. The work is straightforward observational systems research and stays within its scope.

The soft spot is the untested claim that command-driven shutdowns produce sequences representative of the Blue Waters failures. Real faults often involve transient hardware errors or firmware timing that differ from clean administrative removal, yet the description gives no side-by-side comparison of error codes, latency distributions, or recovery rates between the injected and field data. Without that mapping or the full methods section, it is hard to judge how much the new observations actually explain the earlier field results.

This is useful reading for people who run or analyze large Cray installations and need concrete examples of failure propagation. It is not a general method or framework, so the audience is narrow. The experimental intent is honest and the topic matters for production resilience work, so it should go to referees who can check the missing validation details and the raw traces.

Referee Report

2 major / 1 minor

Summary. The paper presents fault injection experiments on the Cielo Cray XE supercomputer (Gemini interconnect) in which administrative commands are used to bring down network links, directional connections, nodes, and blades. Using logs and network performance counters, the work characterizes fault-error-failure sequences and recovery mechanisms, quantifies impacts on system and applications at varying scales, and identifies scenarios that produce unrecoverable failures, with the explicit aim of explaining field observations from Blue Waters and generating new tests for system and application design. It also describes the extensions needed to apply the methodology to Cray XC (Aries) systems.

Significance. If the injected faults are shown to be representative, the study supplies concrete empirical sequences and scale-dependent impact data that are otherwise difficult to obtain from production logs alone; the direct use of production-class hardware and the dual collection of logs plus counters constitute a methodological strength for HPC reliability research.

major comments (2)

[Abstract] The central claim (Abstract) that the injection campaign improves understanding of the Blue Waters field failure data rests on the unverified premise that command-driven shutdowns produce fault-error-failure sequences and recovery behavior representative of real hardware, firmware, or environmental faults. No quantitative mapping—such as overlap in error codes, latency distributions, or recovery success rates—is reported between the two datasets.
Because the paper provides no validation that the injected faults reproduce the signatures or propagation paths observed in the Blue Waters logs, the characterizations of sequences, unrecoverable scenarios, and scale-dependent impacts cannot yet be treated as explanatory of the field data.
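
One way the quantitative mapping requested above could be computed is sketched here, assuming error codes and recovery latencies have already been extracted from both datasets. The function names and inputs are hypothetical. The Jaccard index and the two-sample Kolmogorov-Smirnov statistic are standard measures chosen for illustration, not methods the paper reports using.

```python
from bisect import bisect_right
from typing import Iterable, Sequence


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Overlap of two error-code sets: |A and B| / |A or B| (1.0 if both empty)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def ks_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of `x`
    and `y`. Both CDFs are step functions that change only at sample
    points, so checking every sample value is sufficient.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # fraction of x that is <= v
        fy = bisect_right(ys, v) / len(ys)  # fraction of y that is <= v
        gap = max(gap, abs(fx - fy))
    return gap
```

A Jaccard value near 1 and a KS statistic near 0 would indicate similar error vocabularies and latency distributions. Turning the KS statistic into a significance level additionally requires the sample sizes, which this sketch leaves out.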

minor comments (1)

[Abstract] The closing statement on extensions to Aries systems is made at a high level; a brief outline of the required changes to the injection and analysis methodology would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need to strengthen the connection between our fault injection experiments and the Blue Waters field data. We address the major comments point by point below. While we maintain that the experiments provide valuable characterizations on production hardware, we agree that the lack of explicit quantitative validation is a limitation and will revise the manuscript accordingly.

Referee: [Abstract] The central claim (Abstract) that the injection campaign improves understanding of the Blue Waters field failure data rests on the unverified premise that command-driven shutdowns produce fault-error-failure sequences and recovery behavior representative of real hardware, firmware, or environmental faults. No quantitative mapping—such as overlap in error codes, latency distributions, or recovery success rates—is reported between the two datasets.

Authors: We acknowledge that the manuscript does not include a quantitative mapping (e.g., error code overlap or latency distributions) between the injected faults and Blue Waters logs. The injections were chosen to reproduce the categories of faults observed in the field (link and node failures) on identical Gemini hardware, allowing direct observation of sequences and impacts that are difficult to isolate in production logs. We will revise the abstract and add a limitations discussion to clarify that the connection is based on fault-type similarity rather than statistical equivalence, and note this as an area for future work. revision: yes
Referee: [—] Because the paper provides no validation that the injected faults reproduce the signatures or propagation paths observed in the Blue Waters logs, the characterizations of sequences, unrecoverable scenarios, and scale-dependent impacts cannot yet be treated as explanatory of the field data.

Authors: The paper's intent is to use controlled injections to recreate and analyze the fault scenarios inferred from Blue Waters, rather than to claim identical reproduction of all signatures. Qualitative alignment in error patterns and recovery behaviors was observed, but we agree that without explicit validation the explanatory link remains inferential. We will add text in the discussion section to address this distinction and temper the claims regarding direct explanation of the field data. revision: yes

Circularity Check

0 steps flagged

No circularity: direct experimental observations without derivations or fitted predictions

The paper reports fault injection experiments on Cielo using admin commands to bring down links/nodes/blades, followed by log and counter analysis to characterize fault-error-failure sequences. No equations, parameters, predictions, or derivations appear. The central claims rest on direct observation of injected faults and their observed impacts, cross-referenced to prior Blue Waters field data without reducing to self-fitted quantities or self-citation chains. This matches the default non-circular case for experimental work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen injection method produces fault sequences representative of real field failures; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Faults injected via special commands to disable links, connections, nodes, and blades produce sequences representative of observed field failures.
The experiments are motivated by matching Blue Waters observations, making this assumption load-bearing for the claimed characterization and scenario identification.
