
arxiv: 1907.10203 · v1 · pith:BKNLM43Nnew · submitted 2019-07-24 · 💻 cs.DC · cs.LG

Live Forensics for Distributed Storage Systems

Pith reviewed 2026-05-24 17:07 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords live forensics · distributed storage · performance diagnosis · root cause analysis · stochastic modeling · I/O failures · differential observability

The pith

Kaleidoscope identifies root causes of performance problems in large-scale distributed storage by running live forensics every five minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kaleidoscope as a system designed to diagnose application performance issues in distributed storage systems, whether from individual component failures or resource contention. The design draws from observed I/O failures in a peta-scale storage system. Kaleidoscope relies on three features to achieve this: temporal and spatial differential observability for end-to-end I/O monitoring, stochastic modeling of component health via domain-guided functions that incorporate path redundancy and measurement uncertainty, and comparison of reliability and performance metrics between similar healthy and unhealthy components to determine the most likely root causes. Deployment results show the system operates at five-minute intervals while correctly attributing 95.8 percent of real-world issues and adding negligible overhead.

Core claim

Kaleidoscope supports live forensics for application performance problems in large-scale distributed storage systems by using temporal and spatial differential observability for I/O request monitoring, modeling the health of storage components as a stochastic process with domain-guided functions that account for path redundancy and measurement uncertainty, and attributing the most likely root causes through observed differences in reliability and performance metrics between similar types of healthy and unhealthy components.
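
As a rough illustration of the spatial side of differential observability, the sketch below localizes suspect components from end-to-end probe outcomes: any component that lies on a successful path is treated as healthy, and whatever remains on the failed paths is a suspect. This is plain boolean tomography, assumed here for illustration only. The probe format, the component names, and the `localize` helper are hypothetical and are not taken from the paper.

```python
from typing import Iterable


def localize(probes: Iterable[tuple[tuple[str, ...], bool]]) -> set[str]:
    """Return components seen only on failed paths (boolean tomography).

    Each probe is (path, ok): the components the request traversed and
    whether it completed in time. Hypothetical format for illustration.
    """
    healthy: set[str] = set()
    on_failed_path: set[str] = set()
    for path, ok in probes:
        if ok:
            healthy.update(path)
        else:
            on_failed_path.update(path)
    return on_failed_path - healthy


# Hypothetical probe results from two clients to four storage targets.
probes = [
    (("client1", "lnet1", "oss1", "ost1"), True),
    (("client1", "lnet1", "oss2", "ost3"), False),
    (("client2", "lnet2", "oss2", "ost4"), False),
    (("client2", "lnet2", "oss1", "ost2"), True),
]
suspects = localize(probes)
```

In this toy input the shared server and the two targets behind it stay on the suspect list; repeating the probes over successive intervals (the temporal side) would be one way to narrow that set further.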

What carries the argument

Stochastic modeling of storage component health via domain-guided functions that capture path redundancy and measurement uncertainty, combined with differential observability and metric comparison to isolate root causes.
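
To make the role of path redundancy concrete, here is a minimal sketch of a series-parallel availability calculation: a path is up only if every stage is up, and a redundant stage is up if at least one of its members is up. It assumes independent component failures and uses made-up availabilities; the function names are hypothetical, and the paper's domain-guided functions may differ.

```python
from math import prod


def group_availability(members: list[float]) -> float:
    """Availability of a redundant group: up if at least one member is up."""
    return 1.0 - prod(1.0 - a for a in members)


def path_availability(stages: list[list[float]]) -> float:
    """Availability of a path whose stages must all be up (series of groups)."""
    return prod(group_availability(stage) for stage in stages)


# Hypothetical numbers: a single client link, a redundant pair of routers,
# and a high-availability pair of servers.
stages = [[0.99], [0.9, 0.9], [0.8, 0.8]]
a_path = path_availability(stages)
```

Under these assumptions, a failed probe along a redundant path says less about any single member than a failed probe along a non-redundant one, which is why redundancy has to enter the health model explicitly.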

If this is right

  • Storage operators can diagnose and respond to performance problems at five-minute granularity without accumulating significant monitoring cost.
  • Root causes can be attributed for the large majority of observed I/O performance issues in production peta-scale systems.
  • The same differential and stochastic approach can distinguish between failure-induced and contention-induced performance degradation.
  • Live forensics becomes feasible on systems where traditional post-mortem analysis is too slow for operational needs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modeling approach could be adapted to other distributed systems that exhibit path redundancy, such as compute clusters or network fabrics.
  • Frequent live attribution might enable automated remediation loops that act before user-visible slowdowns compound.
  • If the stochastic functions generalize across hardware generations, the same system could track health trends over multi-year deployments.

Load-bearing premise

The domain-guided stochastic functions accurately capture path redundancy and measurement uncertainty so that metric differences between healthy and unhealthy components reliably point to the root cause.
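
One simple way to picture the comparison this premise depends on is to rank metrics by how far an unhealthy component sits from healthy peers of the same type. The sketch below uses a standardized difference (z-score) as the ranking statistic, which is an assumption made for illustration rather than the paper's attribution method; the metric names and values are hypothetical.

```python
from statistics import mean, stdev


def rank_metrics(healthy: dict[str, list[float]],
                 unhealthy: dict[str, float]) -> list[tuple[str, float]]:
    """Rank metrics by how far the unhealthy value sits from healthy peers."""
    scores = []
    for name, value in unhealthy.items():
        peers = healthy[name]
        spread = stdev(peers) or 1e-9  # avoid division by zero for flat metrics
        scores.append((name, abs(value - mean(peers)) / spread))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical metrics collected from healthy servers of the same type.
healthy = {
    "loadavg": [58.0, 60.0, 62.0, 60.0],
    "disk_await_ms": [4.0, 5.0, 6.0, 5.0],
}
unhealthy = {"loadavg": 350.0, "disk_await_ms": 7.0}
ranking = rank_metrics(healthy, unhealthy)
```

The top-ranked metric is only a candidate cause. If the health labels themselves are wrong, the ranking inherits that error, which is the dependency the premise describes.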

What would settle it

A deployment run in which root-cause attribution accuracy falls clearly below the reported 95.8 percent, or in which monitoring overhead becomes non-negligible across repeated five-minute intervals.

Figures

Figures reproduced from arXiv: 1907.10203 by Jeremy Enos, Mark Dalton, Mike Showerman, Ravishankar K. Iyer, Saurabh Jha, Shengkun Cui, Tianyin Xu, William T. Kramer, Zbigniew T. Kalbarczyk.

Figure 1: Common patterns of I/O failures. Notation: "hb" is the heartbeat process and "srv" is the service process; each box represents a storage component (e.g., a data server).
… of many other object-based POSIX storage systems, such as IBM GPFS [61], BeeGFS [31], Ceph [73], and GlusterFS [16]. Monitoring overhead. Store-Ping-based monitors have been deployed on PetaStore for two years. The monitors measure the complet…
Figure 2: CDF of I/O request completion time under component failures (“Failure”) and no failures (“Normal”).
Plot labels: Count vs. I/O Await [sec]; 10484 outlier points not shown.
Figure 5: Correlation between load (measured by loadavg) and latency. (“Comp. T.” is the completion time of I/O requests.)
… that time, the loadavg increased from 60 to as high as 350. The 50th and 99th percentile durations of extreme I/O were found to be 12 and 227 minutes, respectively. High load. I/O request completion time increases with the load on the storage servers. High load conditions are caused by a flood o…
Figure 6: An overview of Kaleidoscope. Kaleidoscope consists of three components for monitoring, failure localization, and failure diagnosis (marked in gray).
Many high-impact zero-day failures can be prevented if the faulty or unhealthy components can be detected and the corresponding potential causes can be diagnosed earlier, before they lead to user-visible impact.
Figure 7: An illustration of the FG model. Only the paths C1 to OSD1, and C2 to OSD2 are shown, for clarity. Redundancies and other network components have also been removed for clarity.
Thus, the path availability AP must explicitly model such redundancies (e.g., LNETs and HA-pairs) while estimating the availability of a path. The model described above can be represented using a factor graph that models the interac…
Figure 8: Histogram of duration of components in unhealthy state.
Plot labels: Count, I/O Completion Time [sec], Timeout/failure; series: scratch-fs, project-fs, home-fs.
Figure 9: Completion time of I/O requests measured by Kaleidoscope’s WrEx Store-Pings.
… with every 20 points on the graph) for the WrEx Store-Pings. (We omit RmEx and CrWr because of the page limit.) We can see that 99% of WrEx completed within one second (SLO), and only 0.14% failed with a timeout. With Kaleidoscope, it is efficient to narrow down the anomalies and perform live forensics (e.g., the load-related re…
Figure 10: Outages visible from Kaleidoscope.
We show that the false positive ratio is very low, and our interactions with PetaStore operators confirm its usefulness. One potential caveat is that Kaleidoscope assumes that Store-Pings experience the same I/O behavior as real applications, which may not hold in all cases. On the other hand, using Kaleidoscope ML-components retrospectively on trace-data generated by S…
Original abstract

We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that account for path redundancy and uncertainty in measurements, and 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore, and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues, with negligible monitoring overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Kaleidoscope, a system for live forensics of performance problems in large-scale distributed storage systems. It is motivated by a study of I/O failures in the anonymized PetaStore system. The design relies on three features: temporal and spatial differential observability for monitoring, modeling storage component health as a stochastic process using domain-guided functions that incorporate path redundancy and measurement uncertainty, and attributing root causes by observing differences in metrics between healthy and unhealthy components. The evaluation claims that Kaleidoscope can perform live forensics at 5-minute intervals, identifying root causes for 95.8% of real-world performance issues with negligible overhead.

Significance. If the modeling and evaluation hold up under scrutiny, the work would offer a valuable contribution to distributed systems by enabling frequent, low-overhead diagnosis of performance issues in petascale storage environments, which could lead to improved reliability and faster troubleshooting.

major comments (2)
  1. [Abstract] The central claim of 95.8% root-cause attribution success is stated without any accompanying description of the evaluation methodology, choice of baselines, statistical error bars, or precise definition of what constitutes a 'real-world performance issue'. This omission makes the primary empirical result impossible to verify or reproduce based on the provided text.
  2. [Abstract (second key feature)] The domain-guided stochastic functions used to model component health are described as accounting for path redundancy and measurement uncertainty, yet no derivation, fitting procedure, or validation experiment is referenced. Since the root-cause attribution in feature 3 depends directly on reliable metric differences between healthy and unhealthy states, the lack of evidence for this modeling step is load-bearing for the reported accuracy.
minor comments (1)
  1. [Abstract] The abstract refers to 'our study of I/O failures' but does not indicate whether this study is presented in the paper or is prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below by clarifying where the requested details appear in the manuscript and offering targeted revisions to the abstract for improved clarity.

  1. Referee: [Abstract] The central claim of 95.8% root-cause attribution success is stated without any accompanying description of the evaluation methodology, choice of baselines, statistical error bars, or precise definition of what constitutes a 'real-world performance issue'. This omission makes the primary empirical result impossible to verify or reproduce based on the provided text.

    Authors: The abstract is intentionally concise. The evaluation methodology, definition of real-world performance issues (drawn from the PetaStore I/O failure study), baselines, and statistical analysis including error bars are fully detailed in Sections 5 (Evaluation Setup) and 6 (Results). We will revise the abstract to include a brief clause referencing the real PetaStore deployment and 5-minute interval evaluation to aid readers.
    revision: partial

  2. Referee: [Abstract (second key feature)] The domain-guided stochastic functions used to model component health are described as accounting for path redundancy and measurement uncertainty, yet no derivation, fitting procedure, or validation experiment is referenced. Since the root-cause attribution in feature 3 depends directly on reliable metric differences between healthy and unhealthy states, the lack of evidence for this modeling step is load-bearing for the reported accuracy.

    Authors: The derivation of the stochastic health functions, their incorporation of path redundancy and measurement uncertainty, the fitting procedure, and validation experiments (both synthetic and on PetaStore traces) are presented in Section 4. These steps directly support the metric comparison in feature 3. We will add a parenthetical reference to Section 4 in the abstract.
    revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on external evaluation and domain-guided modeling without self-referential reduction


The abstract and provided text describe three features for Kaleidoscope: differential observability, stochastic health modeling via domain-guided functions, and metric differences for root-cause attribution. No equations, parameter-fitting procedures, predictions derived from fitted inputs, or self-citations are shown that would make any step equivalent to its inputs by construction. The 95.8% accuracy is reported from deployment on external real-world PetaStore data, keeping the derivation self-contained against external benchmarks. The domain-guided aspect is an assumption but not a circular construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The design rests on the unverified assumption that domain-guided stochastic functions can be constructed to represent component health under redundancy and noise; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption Domain-guided functions exist that accurately model storage-component health while accounting for path redundancy and measurement uncertainty.
    Invoked as the second key feature that drives the entire attribution step.
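
As one hedged illustration of what modeling health under measurement uncertainty can look like, the sketch below keeps a Beta posterior over a component's probe success probability, so that a handful of probes yields a wider uncertainty than many probes with a similar success rate. The Beta-Bernoulli choice, the uniform prior, and the `beta_posterior` helper are assumptions for illustration and are not the paper's domain-guided functions.

```python
def beta_posterior(successes: int, failures: int,
                   prior_a: float = 1.0, prior_b: float = 1.0) -> tuple[float, float]:
    """Posterior mean and variance of a component's probe success probability."""
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var


# Hypothetical observation windows: 3 of 4 probes succeeded vs. 30 of 40.
few_mean, few_var = beta_posterior(3, 1)
many_mean, many_var = beta_posterior(30, 10)
```

Both windows have a 75% raw success rate, but the posterior variance is larger for the short window, which is the kind of uncertainty an attribution step would need to carry forward.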

pith-pipeline@v0.9.0 · 5728 in / 1175 out tokens · 19102 ms · 2026-05-24T17:07:22.417961+00:00 · methodology



Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages

  1. [1]

    https://linux.die.net/man/3/getloadavg

    getloadavg - Linux man page. https://linux.die.net/man/3/getloadavg

  2. [2]

    https://jenkins.io/

    Jenkins CI/CD. https://jenkins.io/. Accessed: 2019-02-06

  3. [3]

    http://lustre.org/

    Lustre filesystem. http://lustre.org/. Accessed: 2019-02-06

  4. [4]

    https://sc18.supercomputing.org/proceedings/bof/bof_pages/bof176.html

    LUSTRE Community BOF: Lustre in HPC and Emerging Data Markets: Roadmap, Features and Challenges. https://sc18.supercomputing.org/proceedings/bof/bof_pages/bof176.html

  5. [5]

    https://github.com/LLNL/ior

    Parallel file system I/O Benchmark. https://github.com/LLNL/ior

  6. [6]

    K., Lann, G

    Aguilera, M. K., Lann, G. L., and Toueg, S. On the Impact of Fast Failure Detectors on Real-Time Fault-Tolerant Systems. In Proceedings of the 16th International Symposium on Distributed Computing (DISC’02) (Toulouse, France, Oct. 2002)

  7. [7]

    S., Arpaci-Dusseau, A

    Alagappan, R., Ganesan, A., Patel, Y., Pillai, T. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Correlated Crash Vulnerabilities. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16) (Savannah, GA, November 2016)

  8. [8]

    Amazon FSx for Lustre

    Amazon. Amazon FSx for Lustre. https://aws.amazon.com/fsx/lustre/. Accessed: 2017-12-06

  9. [9]

    T., and Outhred, G

    Arzani, B., Ciraci, S., Chamon, L., Zhu, Y., Liu, H., Padhye, J., Loo, B. T., and Outhred, G. 007: Democratically Finding the Cause of Packet Drops. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI’18) (Renton, WA, USA, Apr. 2018)

  10. [10]

    Basic concepts and taxonomy of dependable and secure computing

    Avizienis, A., Laprie, J., Randell, B., and Landwehr, C. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing 1, 1 (2004), 11–33

  11. [11]

    Parallel Virtual File Systems on Microsoft Azure

    Azure Customer Advisory Team. Parallel Virtual File Systems on Microsoft Azure. https://azure.microsoft.com/mediahandler/files/resourcefiles/parallel-virtual-file-systems-on-microsoft-azure/Parallel_Virtual_File_Systems_on_Microsoft_Azure.pdf. Accessed: 2019-04-01

  12. [12]

    N., Arpaci-Dusseau, A

    Bairavasundaram, L. N., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., Goodson, G. R., and Schroeder, B. An analysis of data corruption in the storage stack. ACM Transactions on Storage (TOS) 4, 3 (2008), 8

  13. [13]

    N., Goodson, G

    Bairavasundaram, L. N., Goodson, G. R., Pasupathy, S., and Schindler, J. An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’07) (San Diego, California, USA, June 2007)

  14. [14]

    N., Goodson, G

    Bairavasundaram, L. N., Goodson, G. R., Schroeder, B., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. An Analysis of Data Corruption in the Storage Stack. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08) (San Jose, CA, Feb. 2008)

  15. [15]

    N., Rungta, M., Agrawal, N., Arpaci-Dusseau, A

    Bairavasundaram, L. N., Rungta, M., Agrawal, N., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Swift, M. M. Analyzing the Effects of Disk-Pointer Corruption. In Proceedings of the 2008 IEEE International Conference on Dependable Systems and Networks (DSN’08) (Anchorage, Alaska, June 2008)

  16. [16]

    B., Broomfield, M

    Boyer, E. B., Broomfield, M. C., and Perrotti, T. A. Glusterfs one storage server to rule them all

  17. [17]

    M., Kriegel, H.-P., Ng, R

    Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. Lof: Identifying density-based local outliers. SIGMOD Rec. 29, 2 (May 2000), 93–104

  18. [18]

    Brown, A., and Patterson, D. A. Embracing failure: A case for recovery-oriented computing (roc). In High Performance Transaction Processing Symposium (2001), vol. 10, pp. 3–8

  19. [19]

    P., Hildebrand, D., and Zadok, E

    Cao, Z., Tarasov, V., Raman, H. P., Hildebrand, D., and Zadok, E. On the performance variation in modern storage stacks. In 15th USENIX Conference on File and Storage Technologies (FAST 17) (2017), pp. 329–344

  20. [20]

    Network tomography: Recent developments

    Castro, Rui and Coates, Mark and Liang, Gang and Nowak, Robert and Yu, Bin. Network tomography: Recent developments. Statistical science (2004), 499–517

  21. [21]

    D., and Toueg, S

    Chandra, T. D., and Toueg, S. Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM 43, 2 (Mar. 1996), 225–267

  22. [22]

    Chen, C., Chen, Y., and Roth, P. C. Dosas: Mitigating the resource contention in active storage systems. In 2012 IEEE International Conference on Cluster Computing (Sep. 2012), pp. 164–172

  23. [23]

    Failure detectors as first class objects

    Felber, P., Defago, X., Guerraoui, R., and Oser, P. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications (Sep. 1999), pp. 132–141

  24. [24]

    I., Stokely, M., Truong, V.-A., Barroso, L., Grimes, C., and Quinlan, S

    Ford, D., Labelle, F., Popovici, F. I., Stokely, M., Truong, V.-A., Barroso, L., Grimes, C., and Quinlan, S. Availability in Globally Distributed Storage Systems. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI’10) (Vancouver, BC, Canada, Oct. 2010)

  25. [25]

    C., and Arpaci-Dusseau, R

    Ganesan, A., Alagappan, R., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to File-System Faults. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST’17) (Santa Clara, CA, Feb. 2017)

  26. [26]

    In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19) (Boston, MA, USA, Feb

    Geng, Y., Liu, S., Yin, Z., Naik, A., Prabhakar, B., Rosenblum, M., and Vahdat, A. SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19) (Boston, MA, USA, Feb. 2019)

  27. [27]

    S., Rubio-González, C., Arpaci-Dusseau, A

    Gunawi, H. S., Rubio-González, C., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Liblit, B. EIO: Error Handling is Occasionally Correct. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08) (San Jose, CA, USA, Feb. 2008)

  28. [28]

    S., Suminto, R

    Gunawi, H. S., Suminto, R. O., Sears, R., Golliher, C., Sundararaman, S., Lin, X., Emami, T., Sheng, W., Bidokhti, N., McCaffrey, C., et al. Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS) 14, 3 (2018), 23

  29. [29]

    In Proceedings of the 2015 ACM SIGCOMM Conference (SIGCOMM’15) (London, United Kingdom, Aug

    Guo, C., Yuan, L., Xiang, D., Dang, Y., Huang, R., Maltz, D., Liu, Z., Wang, V., Pang, B., Chen, H., Lin, Z.-W., and Kurien, V. Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. In Proceedings of the 2015 ACM SIGCOMM Conference (SIGCOMM’15) (London, United Kingdom, Aug. 2015)

  30. [30]

    In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’04) (Florianópolis, Brazil, Oct

    Hayashibara, N., Défago, X., Yared, R., and Katayama, T. The ϕ Accrual Failure Detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’04) (Florianópolis, Brazil, Oct. 2004)

  31. [31]

    An introduction to BeeGFS, 2014

    Heichler, J. An introduction to BeeGFS, 2014

  32. [32]

    Parity declustering for continuous operation in redundant disk arrays

    Holland, M., and Gibson, G. Parity declustering for continuous operation in redundant disk arrays. Tech. rep., CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE, 1992

  33. [33]

    R., Zhou, L., and Dang, Y

    Huang, P., Guo, C., Lorch, J. R., Zhou, L., and Dang, Y. Capturing and Enhancing In Situ System Observability for Failure Detection. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18) (Carlsbad, CA, USA, Oct. 2018)

  34. [34]

    R., Dang, Y., Chintalapati, M., and Yao, R. Gray Failure: The Achilles’ Heel of Cloud-Scale Systems

    Huang, P., Guo, C., Zhou, L., Lorch, J. R., Dang, Y., Chintalapati, M., and Yao, R. Gray Failure: The Achilles’ Heel of Cloud-Scale Systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HOTOS’17) (Whistler, BC, Canada, May 2017)

  35. [35]

    Automatic, application-aware i/o forwarding resource allocation

    Ji, X., Yang, B., Zhang, T., Ma, X., Zhu, X., Wang, X., El-Sayed, N., Zhai, J., Liu, W., and Xue, W. Automatic, application-aware i/o forwarding resource allocation. In 17th USENIX Conference on File and Storage Technologies (FAST 19) (2019), pp. 265–279

  36. [36]

    In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08) (San Jose, CA, Feb

    Jiang, W., Hu, C., Zhou, Y., and Kanevsky, A. Are Disks the Dominant Contributor for Storage Failures?: A Comprehensive Study of Storage Subsystem Failure Characteristics. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08) (San Jose, CA, Feb. 2008)

  37. [37]

    SciPy: Open source scientific tools for Python

    Jones, E., Oliphant, T., and Peterson, P. SciPy: Open source scientific tools for Python

  38. [38]

    K., and Wang, L. Application fault tolerance with armor middleware

    Kalbarczyk, Z., Iyer, R. K., and Wang, L. Application fault tolerance with armor middleware. IEEE Internet Computing 9, 2 (March 2005), 28–37

  39. [39]

    S., Pierce, W., and Huang, C. Rethinking erasure codes for cloud file systems: Minimizing I/O for recovery and degraded reads

    Khan, O., Burns, R., Plank, J. S., Pierce, W., and Huang, C. Rethinking erasure codes for cloud file systems: Minimizing I/O for recovery and degraded reads. In FAST-2012: 10th Usenix Conference on File and Storage Technologies (San Jose, February 2012)

  40. [40]

    C., Plank, J

    Khan, O., Burns, R. C., Plank, J. S., and Huang, C. In search of i/o-optimal recovery from disk failures. In HotStorage (2011)

  41. [41]

    Enlightening the i/o path: a holistic approach for application performance

    Kim, S., Kim, H., Lee, J., and Jeong, J. Enlightening the i/o path: a holistic approach for application performance. In 15th USENIX Conference on File and Storage Technologies (FAST 17) (2017), pp. 345–358

  42. [42]

    Probabilistic graphical models: principles and techniques

    Koller, D., Friedman, N., and Bach, F. Probabilistic graphical models: principles and techniques. MIT press, 2009

  43. [43]

    Performance models of storage contention in cloud environments

    Kraft, S., Casale, G., Krishnamurthy, D., Greer, D., and Kilpatrick, P. Performance models of storage contention in cloud environments. Software & Systems Modeling 12, 4 (Oct 2013), 681–704

  44. [44]

    B., Gupta, T., Aguilera, M

    Leners, J. B., Gupta, T., Aguilera, M. K., and Walfish, M. Taming uncertainty in distributed systems with help from the network. In Proceedings of the Tenth European Conference on Computer Systems (New York, NY, USA, 2015), EuroSys ’15, ACM, pp. 9:1–9:16

  45. [45]

    B., Gupta, T., Aguilera, M

    Leners, J. B., Gupta, T., Aguilera, M. K., and Walfish, M. Taming Uncertainty in Distributed Systems with Help from the Network. In Proceedings of the 10th European Conference on Computer Systems (EuroSys’15) (Bordeaux, France, Apr. 2015)

  46. [46]

    B., Wu, H., Hung, W.-L., Aguilera, M

    Leners, J. B., Wu, H., Hung, W.-L., Aguilera, M. K., and Walfish, M. Detecting failures in distributed systems with the falcon spy network. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, pp. 279–294

  47. [47]

    B., Wu, H., Hung, W.-L., Aguilera, M

    Leners, J. B., Wu, H., Hung, W.-L., Aguilera, M. K., and Walfish, M. Detecting Failures in Distributed Systems with the Falcon Spy Network. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11) (Cascais, Portugal, Oct. 2011)

  48. [48]

    C., Arpaci-Dusseau, R

    Lu, L., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Lu, S. A Study of Linux File System Evolution. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13) (San Jose, CA, Feb. 2013)

  49. [49]

    Raidshield: characterizing, monitoring, and proactively protecting against disk failures

    Ma, A., Traylor, R., Douglis, F., Chamness, M., Lu, G., Sawyer, D., Chandra, S., and Hsu, W. Raidshield: characterizing, monitoring, and proactively protecting against disk failures. ACM Transactions on Storage (TOS) 11, 4 (2015), 17

  50. [50]

    K., and Lowe, J

    Ma, L., He, T., Swami, A., Towsley, D., Leung, K. K., and Lowe, J. Node Failure Localization via Network Tomography. In Proceedings of the 2014 Conference on Internet Measurement Conference (New York, NY, USA, 2014), IMC ’14, ACM, pp. 195–208

  51. [51]

    A Large-Scale Study of Flash Memory Failures in the Field

    Meza, J., Wu, Q., Kumar, S., and Mutlu, O. A Large-Scale Study of Flash Memory Failures in the Field. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’15) (Portland, Oregon, USA, June 2015)

  52. [52]

    C., Isaacs, R., and Welch, B

    Mogul, J. C., Isaacs, R., and Welch, B. Thinking about Availability in Large Service Infrastructures. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HOTOS’17) (Whistler, BC, Canada, May 2017)

  53. [53]

    R., and Lui, J

    Muntz, R. R., and Lui, J. C. Performance analysis of disk arrays under failure. Computer Science Department, University of California, 1990

  54. [54]

    Narayanan, I., Wang, D., Jeon, M., Sharma, B., Caulfield, L., Sivasubramaniam, A., Cutler, B., Liu, J., Khessib, B., and Vaid, K. SSD Failures in Datacenters: What? When? And Why? In Proceedings of the 9th ACM International on Systems and Storage Conference (SYSTOR’16) (Haifa, Israel, June 2016)

  55. [55]

    Neal, R. M. Probabilistic inference using markov chain monte carlo methods

  56. [56]

    S., Chidambaram, V., Alagappan, R., Al-Kiswany, S., Arpaci-Dusseau, A

    Pillai, T. S., Chidambaram, V., Alagappan, R., Al-Kiswany, S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14) (Broomfield, CO, Oct. 2014)

  57. [57]

    Pinheiro, E., Weber, W.-D., and Barroso, L. A. Failure Trends in a Large Disk Drive Population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07) (San Jose, CA, Feb. 2007)

  58. [58]

    C., and Arpaci-Dusseau, R

    Prabhakaran, V., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Model-Based Failure Analysis of Journaling File Systems. In Proceedings of the 2005 IEEE International Conference on Dependable Systems and Networks (DSN’05) (Yokohama, Japan, June 2005)

  59. [59]

    S., Liblit, B., Arpaci-Dusseau, R

    Rubio-González, C., Gunawi, H. S., Liblit, B., Arpaci-Dusseau, R. H., and Arpaci-Dusseau, A. C. Error Propagation Analysis for File Systems. In Proceedings of the 30th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’09) (Dublin, Ireland, June 2009)

  60. [60]

    V., and Fonnesbeck, C

    Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2 (Apr. 2016), e55

  61. [61]

    GPFS: A Shared-Disk File System for Large Computing Clusters

    Schmuck, Frank B and Haskin, Roger L. GPFS: A Shared-Disk File System for Large Computing Clusters. In FAST (2002), vol. 2

  62. [62]

    Understanding latent sector errors and how to protect against them

    Schroeder, B., Damouras, S., and Gill, P. Understanding latent sector errors and how to protect against them. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST’10) (San Jose, CA, USA, Feb. 2010)

  63. [63]

    Schroeder, B., and Gibson, G. A. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07) (San Jose, CA, Feb. 2007)

  64. [64]

    Flash Reliability in Production: The Expected and the Unexpected

    Schroeder, B., Lagisetty, R., and Merchant, A. Flash Reliability in Production: The Expected and the Unexpected. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST’16) (Santa Clara, CA, USA, Feb. 2016)

  65. [65]

    DRAM Errors in the Wild: A Large-scale Field Study

    Schroeder, B., Pinheiro, E., and Weber, W.-D. DRAM Errors in the Wild: A Large-scale Field Study. In Proceedings of the 2009 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’09) (Seattle, WA, USA, June 2009)

  66. [66]

    Semeraro, B. D., Sisneros, R., Fullop, J., and Bauer, G. H. It takes a village: Monitoring the blue waters supercomputer. In 2014 IEEE International Conference on Cluster Computing (CLUSTER) (Sep. 2014), pp. 392–399

  67. [67]

    Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

    Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K. B., Stearley, J., Shalf, J., and Gurumurthi, S. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15) (Istanbul, Turkey, Mar. 2015)

  68. [68]

    Baler: deterministic, lossless log message clustering tool

    Taerat, N., Brandt, J., Gentile, A., Wong, M., and Leangsuksun, C. Baler: deterministic, lossless log message clustering tool. Computer Science-Research and Development 26, 3-4 (2011), 285

  69. [69]

    Netbouncer: active device and link failure localization in data center networks

    Tan, C., Jin, Z., Guo, C., Zhang, T., Wu, H., Deng, K., Bi, D., and Xiang, D. Netbouncer: active device and link failure localization in data center networks. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation (2019), USENIX Association, pp. 599–613

  70. [70]

    NetBouncer: Active Device and Link Failure Localization in Data Center Networks

    Tan, C., Jin, Z., Guo, C., Zhang, T., Wu, H., Deng, K., Bi, D., and Xiang, D. NetBouncer: Active Device and Link Failure Localization in Data Center Networks. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19) (Boston, MA, USA, Feb. 2019)

  71. [71]

    A gossip-style failure detection service

    van Renesse, R., Minsky, Y., and Hayden, M. A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing (London, UK, 1998), Middleware ’98, Springer-Verlag, pp. 55–70

  72. [72]

    S2-raid: A new raid architecture for fast data recovery

    Wan, J., Wang, J., Yang, Q., and Xie, C. S2-raid: A new raid architecture for fast data recovery. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (2010), IEEE, pp. 1–9

  73. [73]

    Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D., and Maltzahn, C. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation (2006), USENIX Association, pp. 307–320

  74. [74]

    Performance under failures of high-end computing

    Wu, M., Sun, X.-H., and Jin, H. Performance under failures of high-end computing. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (New York, NY, USA, 2007), SC ’07, ACM, pp. 48:1–48:11

  75. [75]

    Workout: I/o workload outsourcing for boosting raid reconstruction performance

    Wu, S., Jiang, H., Feng, D., Tian, L., and Mao, B. Workout: I/o workload outsourcing for boosting raid reconstruction performance. In FAST (2009), vol. 9, pp. 239–252

  76. [76]

    Xin, Q., Miller, E. L., Schwarz, S. J. T. J. E., and Long, D. D. E. Impact of failure on interconnection networks for large storage systems. In 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST’05) (April 2005), pp. 189–196

  77. [77]

    Zhang, Q., Yu, G., Guo, C., Dang, Y., Swanson, N., Yang, X., Yao, R., Chintalapati, M., Krishnamurthy, A., and Anderson, T. Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI’18) (Renton, WA, USA, Apr. 2018)