Live Forensics for Distributed Storage Systems

Jeremy Enos; Mark Dalton; Mike Showerman; Ravishankar K. Iyer; Saurabh Jha; Shengkun Cui; Tianyin Xu; William T. Kramer; Zbigniew T. Kalbarczyk

arxiv: 1907.10203 · v1 · pith:BKNLM43Nnew · submitted 2019-07-24 · 💻 cs.DC · cs.LG

Saurabh Jha, Shengkun Cui, Tianyin Xu, Jeremy Enos, Mike Showerman, Mark Dalton, Zbigniew T. Kalbarczyk, William T. Kramer, Ravishankar K. Iyer

Pith reviewed 2026-05-24 17:07 UTC · model grok-4.3

classification 💻 cs.DC cs.LG

keywords live forensics · distributed storage · performance diagnosis · root cause analysis · stochastic modeling · I/O failures · differential observability

The pith

Kaleidoscope identifies root causes of performance problems in large-scale distributed storage by running live forensics every five minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kaleidoscope as a system designed to diagnose application performance issues in distributed storage systems, whether from individual component failures or resource contention. The design draws from observed I/O failures in a peta-scale storage system. Kaleidoscope relies on three features to achieve this: temporal and spatial differential observability for end-to-end I/O monitoring, stochastic modeling of component health via domain-guided functions that incorporate path redundancy and measurement uncertainty, and comparison of reliability and performance metrics between similar healthy and unhealthy components to determine the most likely root causes. Deployment results show the system operates at five-minute intervals while correctly attributing 95.8 percent of real-world issues and adding negligible overhead.

Core claim

Kaleidoscope supports live forensics for application performance problems in large-scale distributed storage systems by using temporal and spatial differential observability for I/O request monitoring, modeling the health of storage components as a stochastic process with domain-guided functions that account for path redundancy and measurement uncertainty, and attributing the most likely root causes through observed differences in reliability and performance metrics between similar types of healthy and unhealthy components.
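
The third feature, comparing metrics between similar healthy and unhealthy components, can be illustrated with a minimal sketch. This is not the paper's implementation: the metric names, the sample values, and the choice of a median/MAD robust z-score are all assumptions made for illustration. The idea is only that a metric on which the unhealthy component deviates most from its healthy peers is a candidate root cause.

```python
from statistics import median


def robust_z(value, reference):
    """Robust z-score of `value` against a reference sample (median and MAD)."""
    med = median(reference)
    mad = median(abs(x - med) for x in reference)
    # 1.4826 rescales MAD to match a standard deviation for normal data;
    # the tiny fallback guards against a zero spread in the reference sample.
    scale = 1.4826 * mad if mad > 0 else 1e-9
    return (value - med) / scale


def rank_suspect_metrics(unhealthy, healthy_peers):
    """Rank metrics by how far the unhealthy component deviates from its peers.

    unhealthy:     dict mapping metric name -> value for the suspect component
    healthy_peers: list of dicts with the same keys, one per healthy component
    """
    scores = {}
    for name, value in unhealthy.items():
        reference = [peer[name] for peer in healthy_peers]
        scores[name] = abs(robust_z(value, reference))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical metrics for three healthy servers and one suspect server.
healthy = [
    {"loadavg": 4.0, "disk_await_ms": 5.0, "net_errors": 0.0},
    {"loadavg": 5.0, "disk_await_ms": 6.0, "net_errors": 1.0},
    {"loadavg": 6.0, "disk_await_ms": 5.5, "net_errors": 0.0},
]
suspect = {"loadavg": 60.0, "disk_await_ms": 6.0, "net_errors": 0.0}

ranking = rank_suspect_metrics(suspect, healthy)
for metric, score in ranking:
    print(f"{metric}: deviation score {score:.2f}")
```

With these made-up values the load metric ranks first, which mirrors the kind of load-related diagnosis the page describes, but the scoring rule itself is a stand-in.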

What carries the argument

Stochastic modeling of storage component health via domain-guided functions that capture path redundancy and measurement uncertainty, combined with differential observability and metric comparison to isolate root causes.
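
To make "path redundancy" concrete, here is a minimal sketch using the textbook series-parallel availability formula. It is an assumption for illustration, not the paper's domain-guided functions: it treats component failures as independent, treats a redundant stage as up when at least one replica is up, and uses invented availability values.

```python
from math import prod


def stage_availability(replicas):
    """Availability of one stage with redundant replicas.

    The stage is up if at least one replica is up (independence assumed).
    """
    return 1.0 - prod(1.0 - a for a in replicas)


def path_availability(stages):
    """Availability of a serial path: every stage on the path must be up."""
    return prod(stage_availability(replicas) for replicas in stages)


# Hypothetical per-component availabilities along one client-to-storage path.
redundant_path = [
    [0.99],        # client
    [0.95, 0.95],  # two redundant network routers
    [0.90, 0.90],  # a high-availability pair of servers
    [0.99],        # storage target
]
single_path = [[0.99], [0.95], [0.90], [0.99]]

redundant = path_availability(redundant_path)
single = path_availability(single_path)
print(f"with redundancy:    {redundant:.4f}")
print(f"without redundancy: {single:.4f}")
```

The comparison shows why a health model that ignores redundancy would overstate how often a path is unavailable; how the paper actually weights redundancy and measurement uncertainty is not shown on this page.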

If this is right

Storage operators can diagnose and respond to performance problems at five-minute granularity without accumulating significant monitoring cost.
Root causes can be attributed for the large majority of observed I/O performance issues in production peta-scale systems.
The same differential and stochastic approach can distinguish between failure-induced and contention-induced performance degradation.
Live forensics becomes feasible on systems where traditional post-mortem analysis is too slow for operational needs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modeling approach could be adapted to other distributed systems that exhibit path redundancy, such as compute clusters or network fabrics.
Frequent live attribution might enable automated remediation loops that act before user-visible slowdowns compound.
If the stochastic functions generalize across hardware generations, the same system could track health trends over multi-year deployments.

Load-bearing premise

The domain-guided stochastic functions accurately capture path redundancy and measurement uncertainty so that metric differences between healthy and unhealthy components reliably point to the root cause.

What would settle it

A deployment run in which root-cause attribution accuracy falls below 95.8 percent or monitoring overhead becomes measurable across repeated five-minute intervals.

Figures

Figures reproduced from arXiv: 1907.10203 by Jeremy Enos, Mark Dalton, Mike Showerman, Ravishankar K. Iyer, Saurabh Jha, Shengkun Cui, Tianyin Xu, William T. Kramer, Zbigniew T. Kalbarczyk.

**Figure 1.** Common patterns of I/O failures. Notation: "hb" is the heartbeat process, "srv" is the service process, and each box represents a storage component (e.g., a data server).

… of many other object-based POSIX storage systems, such as IBM GPFS [61], BeeGFS [31], Ceph [73], and GlusterFS [16]. Monitoring overhead. Store-Ping-based monitors have been deployed on PetaStore for two years. The monitors measure the complet…

**Figure 2.** CDF of I/O request completion time under component failures (“Failure”) and no failures (“Normal”). Recoverable plot labels: count versus I/O await [s]; 10484 outlier points not shown.

**Figure 5.** Correlation between load (measured by loadavg) and latency. (“Comp. T.” is the completion time of I/O requests.)

… that time, the loadavg increased from 60 to as high as 350. The 50th and 99th percentile durations of extreme I/O were found to be 12 and 227 minutes, respectively. High load. I/O request completion time increases with the load on the storage servers. High load conditions are caused by a flood o…

**Figure 6.** An overview of Kaleidoscope. Kaleidoscope consists of three components for monitoring, failure localization, and failure diagnosis (marked in gray).

Many high-impact zero-day failures can be prevented if the faulty or unhealthy components can be detected and the corresponding potential causes can be diagnosed earlier, before they lead to user-visible impact.

**Figure 7.** An illustration of the FG model. Only the paths C1 to OSD1 and C2 to OSD2 are shown, for clarity. Redundancies and other network components have also been removed for clarity.

Thus, the path availability AP must explicitly model such redundancies (e.g., LNETs and HA-pairs) while estimating the availability of a path. The model described above can be represented using a factor graph that models the interac…

**Figure 8.** Histogram of duration of components in unhealthy state. Recoverable plot labels: count versus I/O completion time [s]; series scratch-fs, project-fs, home-fs; timeout/failure marker.

**Figure 9.** Completion time of I/O requests measured by Kaleidoscope’s WrEx Store-Pings.

… with every 20 points on the graph) for the WrEx Store-Pings. (We omit RmEx and CrWr because of the page limit.) We can see that 99% of WrEx completed within one second (SLO), and only 0.14% failed with a timeout. With Kaleidoscope, it is efficient to nail down to the anomalies and perform live forensics (e.g., the load-related re…

**Figure 10.** Outages visible from Kaleidoscope.

We show that the false positive ratio is very low, and our interactions with PetaStore operators confirm its usefulness. One potential caveat is that Kaleidoscope assumes that Store-Pings experience the same I/O behavior as real applications, which may not hold in all cases. On the other hand, using Kaleidoscope ML-components retrospectively on trace-data generated by S…
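
The Figure 9 excerpt reports two summary numbers for Store-Ping probes: the share completing within a one-second SLO and the share failing with a timeout. A minimal sketch of how such a summary could be computed from probe results follows. The function name, the convention of recording a timeout as `None`, and the sample data are assumptions for illustration; the values are synthetic and are not PetaStore measurements.

```python
def summarize_probes(completion_times, slo_seconds=1.0):
    """Summarize probe outcomes.

    completion_times: list of completion times in seconds, with None
                      standing for a probe that timed out (assumed convention).
    Returns the fraction of probes within the SLO and the fraction timed out.
    """
    total = len(completion_times)
    if total == 0:
        raise ValueError("no probe results to summarize")
    timeouts = sum(1 for t in completion_times if t is None)
    within = sum(
        1 for t in completion_times if t is not None and t <= slo_seconds
    )
    return {"within_slo": within / total, "timeout": timeouts / total}


# Synthetic probe results: eight within the SLO, one slow, one timed out.
sample = [0.2, 0.4, 0.9, 1.5, None, 0.1, 0.3, 0.8, 0.05, 0.6]
summary = summarize_probes(sample)
print(f"within SLO: {summary['within_slo']:.1%}")
print(f"timed out:  {summary['timeout']:.1%}")
```

Tracking both fractions separately matters because a slow-but-completed probe and a timed-out probe point to different conditions.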

Original abstract

We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that account for path redundancy and uncertainty in measurements, and 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore, and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues, with negligible monitoring overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Kaleidoscope puts together differential monitoring and stochastic component modeling into a live forensics pipeline for large storage, with a real peta-scale deployment as its main asset, but the modeling validation is missing from the abstract.

Kaleidoscope is a system for live root-cause analysis of performance problems in distributed storage. It monitors I/O requests using temporal and spatial differential observability, models component health as a stochastic process using domain-guided functions that factor in path redundancy and measurement uncertainty, and then attributes causes by comparing reliability and performance metrics between similar healthy and unhealthy components. The authors developed it from a study of I/O failures in the PetaStore system and deployed it there.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Kaleidoscope, a system for live forensics of performance problems in large-scale distributed storage systems. It is motivated by a study of I/O failures in the anonymized PetaStore system. The design relies on three features: temporal and spatial differential observability for monitoring, modeling storage component health as a stochastic process using domain-guided functions that incorporate path redundancy and measurement uncertainty, and attributing root causes by observing differences in metrics between healthy and unhealthy components. The evaluation claims that Kaleidoscope can perform live forensics at 5-minute intervals, identifying root causes for 95.8% of real-world performance issues with negligible overhead.

Significance. If the modeling and evaluation hold up under scrutiny, the work would offer a valuable contribution to distributed systems by enabling frequent, low-overhead diagnosis of performance issues in petascale storage environments, which could lead to improved reliability and faster troubleshooting.

major comments (2)

[Abstract] The central claim of 95.8% root-cause attribution success is stated without any accompanying description of the evaluation methodology, choice of baselines, statistical error bars, or precise definition of what constitutes a 'real-world performance issue'. This omission makes the primary empirical result impossible to verify or reproduce based on the provided text.
[Abstract (second key feature)] The domain-guided stochastic functions used to model component health are described as accounting for path redundancy and measurement uncertainty, yet no derivation, fitting procedure, or validation experiment is referenced. Since the root-cause attribution in feature 3 depends directly on reliable metric differences between healthy and unhealthy states, the lack of evidence for this modeling step is load-bearing for the reported accuracy.

minor comments (1)

[Abstract] The abstract refers to 'our study of I/O failures' but does not indicate whether this study is presented in the paper or is prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below by clarifying where the requested details appear in the manuscript and offering targeted revisions to the abstract for improved clarity.

Referee: [Abstract] The central claim of 95.8% root-cause attribution success is stated without any accompanying description of the evaluation methodology, choice of baselines, statistical error bars, or precise definition of what constitutes a 'real-world performance issue'. This omission makes the primary empirical result impossible to verify or reproduce based on the provided text.

Authors: The abstract is intentionally concise. The evaluation methodology, definition of real-world performance issues (drawn from the PetaStore I/O failure study), baselines, and statistical analysis including error bars are fully detailed in Sections 5 (Evaluation Setup) and 6 (Results). We will revise the abstract to include a brief clause referencing the real PetaStore deployment and 5-minute interval evaluation to aid readers.

revision: partial

Referee: [Abstract (second key feature)] The domain-guided stochastic functions used to model component health are described as accounting for path redundancy and measurement uncertainty, yet no derivation, fitting procedure, or validation experiment is referenced. Since the root-cause attribution in feature 3 depends directly on reliable metric differences between healthy and unhealthy states, the lack of evidence for this modeling step is load-bearing for the reported accuracy.

Authors: The derivation of the stochastic health functions, their incorporation of path redundancy and measurement uncertainty, the fitting procedure, and validation experiments (both synthetic and on PetaStore traces) are presented in Section 4. These steps directly support the metric comparison in feature 3. We will add a parenthetical reference to Section 4 in the abstract.

revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on external evaluation and domain-guided modeling without self-referential reduction

The abstract and provided text describe three features for Kaleidoscope: differential observability, stochastic health modeling via domain-guided functions, and metric differences for root-cause attribution. No equations, parameter-fitting procedures, predictions derived from fitted inputs, or self-citations are shown that would make any step equivalent to its inputs by construction. The 95.8% accuracy is reported from deployment on external real-world PetaStore data, keeping the derivation self-contained against external benchmarks. The domain-guided aspect is an assumption but not a circular construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The design rests on the unverified assumption that domain-guided stochastic functions can be constructed to represent component health under redundancy and noise; no free parameters or invented entities are visible in the abstract.

axioms (1)

domain assumption: Domain-guided functions exist that accurately model storage-component health while accounting for path redundancy and measurement uncertainty.
Invoked as the second key feature that drives the entire attribution step.

pith-pipeline@v0.9.0 · 5728 in / 1175 out tokens · 19102 ms · 2026-05-24T17:07:22.417961+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "modeling the health of storage components as a stochastic process using domain-guided functions that account for path redundancy and uncertainty in measurements... observing differences in reliability and performance metrics between similar types of healthy and unhealthy components"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "We built a system model by using the factor graph (FG) formalization, which infers component health by ingesting the monitoring data collected by Store-Pings."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages

[1] getloadavg - Linux man page. https://linux.die.net/man/3/getloadavg
[2] Jenkins CI/CD. https://jenkins.io/. Accessed: 2019-02-06
[3] Lustre filesystem. http://lustre.org/. Accessed: 2019-02-06
[4] LUSTRE Community BOF: Lustre in HPC and Emerging Data Markets: Roadmap, Features and Challenges. https://sc18.supercomputing.org/proceedings/bof/bof_pages/bof176.html
[5] Parallel file system I/O Benchmark. https://github.com/LLNL/ior
[6] Aguilera, M. K., Lann, G. L., and Toueg, S. On the Impact of Fast Failure Detectors on Real-Time Fault-Tolerant Systems. In Proceedings of the 16th International Symposium on Distributed Computing (DISC’02) (Toulouse, France, Oct. 2002)
[7] Alagappan, R., Ganesan, A., Patel, Y., Pillai, T. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Correlated Crash Vulnerabilities. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16) (Savannah, GA, November 2016)
[8] Amazon. Amazon FSx for Lustre. https://aws.amazon.com/fsx/lustre/. Accessed: 2017-12-06
[9] Arzani, B., Ciraci, S., Chamon, L., Zhu, Y., Liu, H., Padhye, J., Loo, B. T., and Outhred, G. 007: Democratically Finding the Cause of Packet Drops. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI’18) (Renton, WA, USA, Apr. 2018)
[10] Avizienis, A., Laprie, J., Randell, B., and Landwehr, C. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing 1, 1 (2004), 11–33
[11] Azure Customer Advisory Team. Parallel Virtual File Systems on Microsoft Azure. https://azure.microsoft.com/mediahandler/files/resourcefiles/parallel-virtual-file-systems-on-microsoft-azure/Parallel_Virtual_File_Systems_on_Microsoft_Azure.pdf. Accessed: 2019-04-01
[12] Bairavasundaram, L. N., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., Goodson, G. R., and Schroeder, B. An analysis of data corruption in the storage stack. ACM Transactions on Storage (TOS) 4, 3 (2008), 8
[13] Bairavasundaram, L. N., Goodson, G. R., Pasupathy, S., and Schindler, J. An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’07) (San Diego, California, USA, June 2007)
[14] Bairavasundaram, L. N., Goodson, G. R., Schroeder, B., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. An Analysis of Data Corruption in the Storage Stack. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08) (San Jose, CA, Feb. 2008)
[15] Bairavasundaram, L. N., Rungta, M., Agrawal, N., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Swift, M. M. Analyzing the Effects of Disk-Pointer Corruption. In Proceedings of the 2008 IEEE International Conference on Dependable Systems and Networks (DSN’08) (Anchorage, Alaska, June 2008)
[16] Boyer, E. B., Broomfield, M. C., and Perrotti, T. A. Glusterfs one storage server to rule them all
[17] Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. Lof: Identifying density-based local outliers. SIGMOD Rec. 29, 2 (May 2000), 93–104
[18] Brown, A., and Patterson, D. A. Embracing failure: A case for recovery-oriented computing (roc). In High Performance Transaction Processing Symposium (2001), vol. 10, pp. 3–8
[19] Cao, Z., Tarasov, V., Raman, H. P., Hildebrand, D., and Zadok, E. On the performance variation in modern storage stacks. In 15th USENIX Conference on File and Storage Technologies (FAST 17) (2017), pp. 329–344
[20] Castro, R., Coates, M., Liang, G., Nowak, R., and Yu, B. Network tomography: Recent developments. Statistical science (2004), 499–517
[21] Chandra, T. D., and Toueg, S. Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM 43, 2 (Mar. 1996), 225–267
[22] Chen, C., Chen, Y., and Roth, P. C. Dosas: Mitigating the resource contention in active storage systems. In 2012 IEEE International Conference on Cluster Computing (Sep. 2012), pp. 164–172
[23] Felber, P., Defago, X., Guerraoui, R., and Oser, P. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications (Sep. 1999), pp. 132–141
[24] Ford, D., Labelle, F., Popovici, F. I., Stokely, M., Truong, V.-A., Barroso, L., Grimes, C., and Quinlan, S. Availability in Globally Distributed Storage Systems. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI’10) (Vancouver, BC, Canada, Oct. 2010)
[25] Ganesan, A., Alagappan, R., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to File-System Faults. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST’17) (Santa Clara, CA, Feb. 2017)
[26] Geng, Y., Liu, S., Yin, Z., Naik, A., Prabhakar, B., Rosenblum, M., and Vahdat, A. SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19) (Boston, MA, USA, Feb. 2019)
[27] Gunawi, H. S., Rubio-González, C., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Liblit, B. EIO: Error Handling is Occasionally Correct. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08) (San Jose, CA, USA, Feb. 2008)
[28] Gunawi, H. S., Suminto, R. O., Sears, R., Golliher, C., Sundararaman, S., Lin, X., Emami, T., Sheng, W., Bidokhti, N., McCaffrey, C., et al. Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS) 14, 3 (2018), 23
[29] Guo, C., Yuan, L., Xiang, D., Dang, Y., Huang, R., Maltz, D., Liu, Z., Wang, V., Pang, B., Chen, H., Lin, Z.-W., and Kurien, V. Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. In Proceedings of the 2015 ACM SIGCOMM Conference (SIGCOMM’15) (London, United Kingdom, Aug. 2015)
[30] Hayashibara, N., Défago, X., Yared, R., and Katayama, T. The ϕ Accrual Failure Detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’04) (Florianópolis, Brazil, Oct. 2004)
[31] Heichler, J. An introduction to BeeGFS, 2014
[32] Holland, M., and Gibson, G. Parity declustering for continuous operation in redundant disk arrays. Tech. rep., CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE, 1992
[33] Huang, P., Guo, C., Lorch, J. R., Zhou, L., and Dang, Y. Capturing and Enhancing In Situ System Observability for Failure Detection. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18) (Carlsbad, CA, USA, Oct. 2018)
[34] Huang, P., Guo, C., Zhou, L., Lorch, J. R., Dang, Y., Chintalapati, M., and Yao, R. Gray Failure: The Achilles’ Heel of Cloud-Scale Systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HOTOS’17) (Whistler, BC, Canada, May 2017)
[35] Ji, X., Yang, B., Zhang, T., Ma, X., Zhu, X., Wang, X., El-Sayed, N., Zhai, J., Liu, W., and Xue, W. Automatic, application-aware i/o forwarding resource allocation. In 17th USENIX Conference on File and Storage Technologies (FAST 19) (2019), pp. 265–279
[36] Jiang, W., Hu, C., Zhou, Y., and Kanevsky, A. Are Disks the Dominant Contributor for Storage Failures?: A Comprehensive Study of Storage Subsystem Failure Characteristics. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08) (San Jose, CA, Feb. 2008)
[37] Jones, E., Oliphant, T., and Peterson, P. SciPy: Open source scientific tools for Python
[38] Kalbarczyk, Z., Iyer, R. K., and Wang, L. Application fault tolerance with armor middleware. IEEE Internet Computing 9, 2 (March 2005), 28–37
[39] Khan, O., Burns, R., Plank, J. S., Pierce, W., and Huang, C. Rethinking erasure codes for cloud file systems: Minimizing I/O for recovery and degraded reads. In FAST-2012: 10th Usenix Conference on File and Storage Technologies (San Jose, February 2012)
[40] Khan, O., Burns, R. C., Plank, J. S., and Huang, C. In search of i/o-optimal recovery from disk failures. In HotStorage (2011)
[41] Kim, S., Kim, H., Lee, J., and Jeong, J. Enlightening the i/o path: a holistic approach for application performance. In 15th USENIX Conference on File and Storage Technologies (FAST 17) (2017), pp. 345–358
[42] Koller, D., Friedman, N., and Bach, F. Probabilistic graphical models: principles and techniques. MIT press, 2009
[43] Kraft, S., Casale, G., Krishnamurthy, D., Greer, D., and Kilpatrick, P. Performance models of storage contention in cloud environments. Software & Systems Modeling 12, 4 (Oct 2013), 681–704
[44] Leners, J. B., Gupta, T., Aguilera, M. K., and Walfish, M. Taming uncertainty in distributed systems with help from the network. In Proceedings of the Tenth European Conference on Computer Systems (New York, NY, USA, 2015), EuroSys ’15, ACM, pp. 9:1–9:16
[45] Leners, J. B., Gupta, T., Aguilera, M. K., and Walfish, M. Taming Uncertainty in Distributed Systems with Help from the Network. In Proceedings of the 10th European Conference on Computer Systems (EuroSys’15) (Bordeaux, France, Apr. 2015)
[46] Leners, J. B., Wu, H., Hung, W.-L., Aguilera, M. K., and Walfish, M. Detecting failures in distributed systems with the falcon spy network. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, pp. 279–294
[47] Leners, J. B., Wu, H., Hung, W.-L., Aguilera, M. K., and Walfish, M. Detecting Failures in Distributed Systems with the Falcon Spy Network. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11) (Cascais, Portugal, Oct. 2011)
[48] Lu, L., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Lu, S. A Study of Linux File System Evolution. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13) (San Jose, CA, Feb. 2013)
[49] Ma, A., Traylor, R., Douglis, F., Chamness, M., Lu, G., Sawyer, D., Chandra, S., and Hsu, W. Raidshield: characterizing, monitoring, and proactively protecting against disk failures. ACM Transactions on Storage (TOS) 11, 4 (2015), 17
[50] Ma, L., He, T., Swami, A., Towsley, D., Leung, K. K., and Lowe, J. Node Failure Localization via Network Tomography. In Proceedings of the 2014 Conference on Internet Measurement Conference (New York, NY, USA, 2014), IMC ’14, ACM, pp. 195–208
[51]

A Large-Scale Study of Flash Memory Failures in the Field

Meza, J., Wu, Q., Kumar, S., and Mutlu, O. A Large-Scale Study of Flash Memory Failures in the Field. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’15) (Portland, Oregon, USA, June 2015)

work page 2015
[52]

C., Isaacs, R., and Welch, B

Mogul, J. C., Isaacs, R., and Welch, B. Thinking about Availability in Large Service Infrastructures. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HOTOS’17) (Whistler, BC, Canada, May 2017)

work page 2017
[53]

R., and Lui, J

Muntz, R. R., and Lui, J. C. Performance analysis of disk arrays under failure. Computer Science Department, University of California, 1990

work page 1990
[54]

Narayanan, I., W ang, D., Jeon, M., Sharma, B., Caulfield, L., Siva- subramaniam, A., Cutler, B., Liu, J., Khessib, B., and V aid, K.SSD Failures in Datacenters: What? When? And Why? InProceedings of the 9th ACM International on Systems and Storage Conference (SYSTOR’16) (Haifa, Israel, June 2016)

work page 2016
[55]

Neal, R. M. Probabilistic inference using markov chain monte carlo methods

[56]

Pillai, T. S., Chidambaram, V., Alagappan, R., Al-Kiswany, S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14) (Broomfield, CO, Oct. 2014)

[57]

Pinheiro, E., Weber, W.-D., and Barroso, L. A. Failure Trends in a Large Disk Drive Population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07) (San Jose, CA, Feb. 2007)

[58]

Prabhakaran, V., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Model-Based Failure Analysis of Journaling File Systems. In Proceedings of the 2005 IEEE International Conference on Dependable Systems and Networks (DSN’05) (Yokohama, Japan, June 2005)

[59]

Rubio-González, C., Gunawi, H. S., Liblit, B., Arpaci-Dusseau, R. H., and Arpaci-Dusseau, A. C. Error Propagation Analysis for File Systems. In Proceedings of the 30th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’09) (Dublin, Ireland, June 2009)

[60]

Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2 (Apr. 2016), e55

[61]

Schmuck, F. B., and Haskin, R. L. GPFS: A Shared-Disk File System for Large Computing Clusters. In FAST (2002), vol. 2

[62]

Schroeder, B., Damouras, S., and Gill, P. Understanding latent sector errors and how to protect against them. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST’10) (San Jose, CA, USA, Feb. 2010)

[63]

Schroeder, B., and Gibson, G. A. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07) (San Jose, CA, Feb. 2007)

[64]

Schroeder, B., Lagisetty, R., and Merchant, A. Flash Reliability in Production: The Expected and the Unexpected. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST’16) (Santa Clara, CA, USA, Feb. 2016)

[65]

Schroeder, B., Pinheiro, E., and Weber, W.-D. DRAM Errors in the Wild: A Large-scale Field Study. In Proceedings of the 2009 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’09) (Seattle, WA, USA, June 2009)

[66]

Semeraro, B. D., Sisneros, R., Fullop, J., and Bauer, G. H. It takes a village: Monitoring the Blue Waters supercomputer. In 2014 IEEE International Conference on Cluster Computing (CLUSTER) (Sep. 2014), pp. 392–399

[67]

Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K. B., Stearley, J., Shalf, J., and Gurumurthi, S. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15) (Istanbul, Turkey, Mar. 2015)

[68]

Taerat, N., Brandt, J., Gentile, A., Wong, M., and Leangsuksun, C. Baler: deterministic, lossless log message clustering tool. Computer Science-Research and Development 26, 3-4 (2011), 285

[69]

Tan, C., Jin, Z., Guo, C., Zhang, T., Wu, H., Deng, K., Bi, D., and Xiang, D. Netbouncer: active device and link failure localization in data center networks. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation (2019), USENIX Association, pp. 599–613

[70]

Tan, C., Jin, Z., Guo, C., Zhang, T., Wu, H., Deng, K., Bi, D., and Xiang, D. NetBouncer: Active Device and Link Failure Localization in Data Center Networks. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19) (Boston, MA, USA, Feb. 2019)

[71]

van Renesse, R., Minsky, Y., and Hayden, M. A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (London, UK, 1998), Middleware ’98, Springer-Verlag, pp. 55–70

[72]

Wan, J., Wang, J., Yang, Q., and Xie, C. S2-raid: A new raid architecture for fast data recovery. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (2010), IEEE, pp. 1–9

[73]

Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D., and Maltzahn, C. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation (2006), USENIX Association, pp. 307–320

[74]

Wu, M., Sun, X.-H., and Jin, H. Performance under failures of high-end computing. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (New York, NY, USA, 2007), SC ’07, ACM, pp. 48:1–48:11

[75]

Wu, S., Jiang, H., Feng, D., Tian, L., and Mao, B. Workout: I/o workload outsourcing for boosting raid reconstruction performance. In FAST (2009), vol. 9, pp. 239–252

[76]

Xin, Q., Miller, E. L., Schwarz, S. J. T. J. E., and Long, D. D. E. Impact of failure on interconnection networks for large storage systems. In 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST’05) (April 2005), pp. 189–196

[77]

Zhang, Q., Yu, G., Guo, C., Dang, Y., Swanson, N., Yang, X., Yao, R., Chintalapati, M., Krishnamurthy, A., and Anderson, T. Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI’18) (Renton, WA, USA, Apr. 2018)
