Live Forensics for Distributed Storage Systems
Pith reviewed 2026-05-24 17:07 UTC · model grok-4.3
The pith
Kaleidoscope identifies root causes of performance problems in large-scale distributed storage by running live forensics every five minutes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kaleidoscope supports live forensics for application performance problems in large-scale distributed storage systems by using temporal and spatial differential observability for I/O request monitoring, modeling the health of storage components as a stochastic process with domain-guided functions that account for path redundancy and measurement uncertainty, and attributing the most likely root causes through observed differences in reliability and performance metrics between similar types of healthy and unhealthy components.
What carries the argument
Stochastic modeling of storage component health via domain-guided functions that capture path redundancy and measurement uncertainty, combined with differential observability and metric comparison to isolate root causes.
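The differential idea can be illustrated with a minimal sketch. It is not Kaleidoscope's implementation: the component names, the latency values, and the use of a plain z-score are all assumptions made for illustration. A component's latest latency is compared against its own recent history (temporal) and against the latest readings of similar components (spatial).

```python
from statistics import mean, stdev

def z_score(value, baseline):
    """Standard score of value against a baseline sample (0.0 if no spread)."""
    if len(baseline) < 2:
        return 0.0
    spread = stdev(baseline)
    return 0.0 if spread == 0 else (value - mean(baseline)) / spread

def differential_scores(history, current, component):
    """Return (temporal, spatial) deviation of one component's latest latency.

    history: dict mapping component name to a list of past latencies (ms)
    current: dict mapping component name to the latest latency (ms)
    """
    # Temporal: compare against the component's own past behaviour.
    temporal = z_score(current[component], history[component])
    # Spatial: compare against the latest readings of its peers.
    peers = [v for name, v in current.items() if name != component]
    spatial = z_score(current[component], peers)
    return temporal, spatial

# Hypothetical data: four similar storage targets, one of them slow.
history = {
    "ost0": [10, 11, 9, 10],
    "ost1": [10, 12, 11, 10],
    "ost2": [9, 10, 10, 11],
    "ost3": [10, 10, 11, 9],
}
current = {"ost0": 10, "ost1": 11, "ost2": 48, "ost3": 10}

temporal, spatial = differential_scores(history, current, "ost2")
healthy_temporal, healthy_spatial = differential_scores(history, current, "ost0")
```

In this toy data only `ost2` deviates strongly on both axes, which is the kind of contrast between unhealthy and healthy peers that the attribution step relies on.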
If this is right
- Storage operators can diagnose and respond to performance problems at five-minute granularity without incurring significant monitoring overhead.
- Root causes can be attributed for the large majority of observed I/O performance issues in production peta-scale systems.
- The same differential and stochastic approach can distinguish between failure-induced and contention-induced performance degradation.
- Live forensics becomes feasible on systems where traditional post-mortem analysis is too slow for operational needs.
Where Pith is reading between the lines
- The modeling approach could be adapted to other distributed systems that exhibit path redundancy, such as compute clusters or network fabrics.
- Frequent live attribution might enable automated remediation loops that act before user-visible slowdowns compound.
- If the stochastic functions generalize across hardware generations, the same system could track health trends over multi-year deployments.
Load-bearing premise
The domain-guided stochastic functions accurately capture path redundancy and measurement uncertainty so that metric differences between healthy and unhealthy components reliably point to the root cause.
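To make the premise concrete, here is a small sketch of health inference over redundant paths. It is an assumption-laden toy, not the paper's model: the prior, the noise and detection probabilities, the component names, and the exact enumeration over states are all chosen here for illustration. Each probe crosses a set of components; it fails with high probability if any of them is unhealthy and with a small probability otherwise (measurement uncertainty).

```python
from itertools import product

def posterior_unhealthy(components, probes, prior_bad=0.05, noise=0.02, detect=0.9):
    """Posterior probability that each component is unhealthy.

    probes: list of (path, failed) pairs, where path is a tuple of component
    names the probe crossed and failed is True if the probe was slow or lost.
    prior_bad, noise, and detect are hypothetical parameters.
    """
    weights = {c: 0.0 for c in components}
    total = 0.0
    # Enumerate every joint healthy/unhealthy assignment (fine for tiny systems).
    for states in product([False, True], repeat=len(components)):
        bad = dict(zip(components, states))
        w = 1.0
        for c in components:
            w *= prior_bad if bad[c] else 1 - prior_bad
        for path, failed in probes:
            # A probe is likely to fail if any component on its path is unhealthy.
            p_fail = detect if any(bad[c] for c in path) else noise
            w *= p_fail if failed else 1 - p_fail
        total += w
        for c in components:
            if bad[c]:
                weights[c] += w
    return {c: weights[c] / total for c in components}

# Hypothetical topology: two servers can each reach two disks.
components = ["server_a", "server_b", "disk_1", "disk_2"]
probes = [
    (("server_a", "disk_1"), True),
    (("server_b", "disk_1"), True),
    (("server_a", "disk_2"), False),
    (("server_b", "disk_2"), False),
]
post = posterior_unhealthy(components, probes)
suspect = max(post, key=post.get)
```

Because both servers reach `disk_2` without trouble, the redundant paths exonerate the servers and the posterior concentrates on `disk_1`. Exhaustive enumeration grows exponentially with the number of components, so a real system would need an approximate inference method.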
What would settle it
A deployment run in which root-cause attribution accuracy falls below 95.8 percent or monitoring overhead is no longer negligible across repeated five-minute intervals.
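Whether an observed accuracy really falls below 95.8 percent depends on how many issues were examined. The sketch below uses a Wilson score interval for a proportion. The counts are hypothetical, since the text here does not state how many performance issues were evaluated, and the choice of interval is an assumption rather than anything the paper prescribes.

```python
from math import sqrt

CLAIMED_ACCURACY = 0.958  # figure quoted in the abstract

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def contradicts_claim(successes, trials):
    """True if the whole interval lies below the claimed accuracy."""
    _, high = wilson_interval(successes, trials)
    return high < CLAIMED_ACCURACY

# Hypothetical replication runs with 100 diagnosed issues each.
run_a = contradicts_claim(90, 100)  # 90% observed accuracy
run_b = contradicts_claim(96, 100)  # 96% observed accuracy
```

With these made-up counts, the first run would count as evidence against the claim while the second would not, even though both observed accuracies differ from 95.8 percent.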
Original abstract
We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that account for path redundancy and uncertainty in measurements, and 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore, and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues, with negligible monitoring overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Kaleidoscope, a system for live forensics of performance problems in large-scale distributed storage systems. It is motivated by a study of I/O failures in the anonymized PetaStore system. The design relies on three features: temporal and spatial differential observability for monitoring, modeling storage component health as a stochastic process using domain-guided functions that incorporate path redundancy and measurement uncertainty, and attributing root causes by observing differences in metrics between healthy and unhealthy components. The evaluation claims that Kaleidoscope can perform live forensics at 5-minute intervals, identifying root causes for 95.8% of real-world performance issues with negligible overhead.
Significance. If the modeling and evaluation hold up under scrutiny, the work would offer a valuable contribution to distributed systems by enabling frequent, low-overhead diagnosis of performance issues in petascale storage environments, which could lead to improved reliability and faster troubleshooting.
major comments (2)
- [Abstract] The central claim of 95.8% root-cause attribution success is stated without any accompanying description of the evaluation methodology, choice of baselines, statistical error bars, or precise definition of what constitutes a 'real-world performance issue'. This omission makes the primary empirical result impossible to verify or reproduce based on the provided text.
- [Abstract (second key feature)] The domain-guided stochastic functions used to model component health are described as accounting for path redundancy and measurement uncertainty, yet no derivation, fitting procedure, or validation experiment is referenced. Since the root-cause attribution in feature 3 depends directly on reliable metric differences between healthy and unhealthy states, the lack of evidence for this modeling step is load-bearing for the reported accuracy.
minor comments (1)
- [Abstract] The abstract refers to 'our study of I/O failures' but does not indicate whether this study is presented in the paper or is prior work.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below by clarifying where the requested details appear in the manuscript and offering targeted revisions to the abstract for improved clarity.
- Referee: [Abstract] The central claim of 95.8% root-cause attribution success is stated without any accompanying description of the evaluation methodology, choice of baselines, statistical error bars, or precise definition of what constitutes a 'real-world performance issue'. This omission makes the primary empirical result impossible to verify or reproduce based on the provided text.
  Authors: The abstract is intentionally concise. The evaluation methodology, definition of real-world performance issues (drawn from the PetaStore I/O failure study), baselines, and statistical analysis including error bars are fully detailed in Sections 5 (Evaluation Setup) and 6 (Results). We will revise the abstract to include a brief clause referencing the real PetaStore deployment and 5-minute interval evaluation to aid readers.
  Revision: partial
- Referee: [Abstract (second key feature)] The domain-guided stochastic functions used to model component health are described as accounting for path redundancy and measurement uncertainty, yet no derivation, fitting procedure, or validation experiment is referenced. Since the root-cause attribution in feature 3 depends directly on reliable metric differences between healthy and unhealthy states, the lack of evidence for this modeling step is load-bearing for the reported accuracy.
  Authors: The derivation of the stochastic health functions, their incorporation of path redundancy and measurement uncertainty, the fitting procedure, and validation experiments (both synthetic and on PetaStore traces) are presented in Section 4. These steps directly support the metric comparison in feature 3. We will add a parenthetical reference to Section 4 in the abstract.
  Revision: partial
Circularity Check
No circularity; claims rest on external evaluation and domain-guided modeling without self-referential reduction
The abstract and provided text describe three features for Kaleidoscope: differential observability, stochastic health modeling via domain-guided functions, and metric differences for root-cause attribution. No equations, parameter-fitting procedures, predictions derived from fitted inputs, or self-citations are shown that would make any step equivalent to its inputs by construction. The 95.8% accuracy is reported from deployment on external real-world PetaStore data, keeping the derivation self-contained against external benchmarks. The domain-guided aspect is an assumption but not a circular construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Domain-guided functions exist that accurately model storage-component health while accounting for path redundancy and measurement uncertainty.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "modeling the health of storage components as a stochastic process using domain-guided functions that accounts for path redundancy and uncertainty in measurements... observing differences in reliability and performance metrics between similar types of healthy and unhealthy components"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We built a system model by using the factor graph (FG) formalization, which infers component health by ingesting the monitoring data collected by Store-Pings."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.