EasyCrash: Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures
Pith reviewed 2026-05-25 16:48 UTC · model grok-4.3
The pith
EasyCrash selectively persists HPC data objects in NVM so that 54% of the crashes that would otherwise fail to recompute correctly end in correct recomputations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EasyCrash decides how to selectively persist application data objects during execution so that, after a crash, the application can restart using the NVM-resident objects and recompute to a correct outcome; the approach rests on the observation that many HPC applications already possess enough intrinsic fault tolerance to make this possible.
What carries the argument
EasyCrash framework for deciding selective persistence of application data objects to NVM
If this is right
- 54% of crashes that cannot correctly recompute are transformed into correct computations
- 82% of crashes can successfully recompute when EasyCrash is combined with application intrinsic fault tolerance
- Up to 24% (15% average) improvement in system efficiency when EasyCrash is used together with a traditional checkpoint scheme
- 1.5% average performance overhead from the selective persistence decisions
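The 54% and 82% figures above can be related by simple arithmetic. The sketch below assumes, which this summary does not state, that both percentages are measured over the same population of crashes. Under that assumption, the fraction of crashes that already recompute correctly without EasyCrash would be roughly 61%. The function name `implied_baseline` is illustrative and does not come from the paper.

```python
def implied_baseline(conversion_rate: float, combined_success: float) -> float:
    """Solve b + conversion_rate * (1 - b) = combined_success for b.

    b is the (assumed) fraction of crashes that recompute correctly
    through intrinsic fault tolerance alone.
    """
    return (combined_success - conversion_rate) / (1.0 - conversion_rate)


# Figures quoted in the summary: 54% conversion, 82% combined success.
baseline = implied_baseline(0.54, 0.82)

# Sanity check: plugging the baseline back in reproduces the combined rate.
recombined = baseline + 0.54 * (1.0 - baseline)

print(f"implied baseline success: {baseline:.3f}")
print(f"recombined success:       {recombined:.3f}")
```

If the two percentages were measured over different crash sets, this back-of-the-envelope relation would not hold.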
Where Pith is reading between the lines
- Reducing the volume of data written to traditional storage checkpoints becomes feasible once NVM already holds enough objects for recovery
- The same selective-persistence logic could be applied to other transient hardware faults beyond full crashes
- Applications whose fault tolerance varies across phases would benefit from making the persistence decisions dynamic rather than static
Load-bearing premise
Many HPC applications possess sufficient intrinsic fault tolerance so that selective persistence of a subset of data objects will allow correct recomputation after a crash.
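A toy illustration of what intrinsic fault tolerance can look like, not taken from the paper's benchmarks: a Jacobi solver on a strictly diagonally dominant system converges from any starting vector, so a restart from a memory image that mixes fresh and stale values still reaches the correct answer. The matrix, the crash point, and the choice of which entry was persisted are all assumptions made for this sketch.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep: x_i <- (b_i - sum_{j != i} A_ij * x_j) / A_ii."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def iterate(A, b, x, steps):
    """Run a fixed number of Jacobi sweeps."""
    for _ in range(steps):
        x = jacobi_step(A, b, x)
    return x


def solve(A, b, x0, tol=1e-10, max_iters=1000):
    """Iterate until successive iterates differ by less than tol."""
    x = list(x0)
    for _ in range(max_iters):
        new_x = jacobi_step(A, b, x)
        if max(abs(p - q) for p, q in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    raise RuntimeError("Jacobi iteration did not converge")


# Strictly diagonally dominant system whose exact solution is [1, 1, 1].
A = [[10.0, 2.0, 1.0], [1.0, 8.0, 2.0], [2.0, 1.0, 9.0]]
b = [13.0, 11.0, 12.0]
start = [0.0, 0.0, 0.0]

# Reference run with no crash.
crash_free = solve(A, b, start)

# Hypothetical crash after 5 sweeps: only x[0] was persisted; x[1] and x[2]
# in the surviving image are stale values from the start of the run.
at_crash = iterate(A, b, start, steps=5)
nvm_image = [at_crash[0], start[1], start[2]]
restarted = solve(A, b, nvm_image)

print(crash_free)
print(restarted)
```

Codes without this self-correcting structure would not recover this way, which is why the premise is load-bearing.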
What would settle it
An experiment that runs the same applications under EasyCrash and records no measurable rise in the fraction of crashes that produce correct results compared with the non-selective baseline.
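One way such an experiment could be scored is a one-sided two-proportion z-test on the counts of correct outcomes. This is a generic statistical sketch, not the paper's evaluation procedure; the function name and the counts in the usage lines are placeholders, not measured data.

```python
import math


def two_proportion_z(correct_a, total_a, correct_b, total_b):
    """One-sided test of H1: success rate of A exceeds that of B.

    Returns (z, p_value) using the pooled-variance z statistic.
    """
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    variance = pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b)
    if variance == 0.0:
        raise ValueError("all trials had the same outcome; z is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Upper-tail probability of the standard normal: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Placeholder counts for illustration only (not results from the paper):
# A = runs with selective persistence, B = baseline runs.
z_same, p_same = two_proportion_z(50, 100, 50, 100)
z_diff, p_diff = two_proportion_z(82, 100, 61, 100)

print(f"identical rates: z={z_same:.2f}, p={p_same:.3f}")
print(f"different rates: z={z_diff:.2f}, p={p_diff:.5f}")
```

A large p-value on real crash-injection counts would correspond to the "no measurable rise" outcome described above.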
Original abstract
Emerging non-volatile memory (NVM) is promising for building future HPC. Leveraging the non-volatility of NVM as main memory, we can restart the application using data objects remaining on NVM when the application crashes. This paper explores this solution to handle HPC under failures, based on the observation that many HPC applications have good enough intrinsic fault tolerance. To improve the possibility of successful recomputation with correct outcomes and ignorable performance loss, we introduce EasyCrash, a framework to decide how to selectively persist application data objects during application execution. Our evaluation shows that EasyCrash transforms 54% of crashes that cannot correctly recompute into the correct computation while incurring a negligible performance overhead (1.5% on average). Using EasyCrash and application intrinsic fault tolerance, 82% of crashes can successfully recompute. When EasyCrash is used with a traditional checkpoint scheme, it enables up to 24% improvement (15% on average) in system efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EasyCrash, a framework for selectively persisting subsets of application data objects in non-volatile memory (NVM) so that HPC applications can restart and recompute correctly after crashes. It rests on the observation that many HPC codes possess sufficient intrinsic fault tolerance. The evaluation claims that EasyCrash converts 54% of otherwise non-recomputable crashes into correct executions (1.5% average overhead), raises the overall success rate to 82%, and yields 15% average (up to 24%) system-efficiency gains when combined with traditional checkpointing.
Significance. If the empirical results are reproducible and the intrinsic-tolerance premise holds beyond the evaluated codes, the work would offer a practical way to reduce checkpoint frequency and storage overhead in future NVM-based HPC platforms by exploiting application resilience rather than full-state persistence.
major comments (2)
- [Abstract] Abstract: the concrete percentages (54% transformation rate, 82% success rate, 1.5% overhead, 15% efficiency gain) are stated without any accompanying information on the number of benchmarks, crash-injection methodology, output-validation procedure, number of trials per crash point, or statistical measures (error bars, confidence intervals). Because these figures constitute the central empirical support for the framework, the absence of this information makes the claims impossible to assess for robustness or generalizability.
- [Abstract / Introduction] The load-bearing assumption that selective persistence decisions can reliably convert failing runs into correct ones is invoked as an 'observation' but is not accompanied by a characterization of the tolerance (e.g., which data objects are critical, how crash timing affects tolerance, or an independent test that the chosen objects suffice). If this tolerance is fragile or code-specific, the selective-persistence policy cannot deliver the reported conversion rates.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the clarity of our empirical claims and the presentation of our core assumptions. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract: the concrete percentages (54% transformation rate, 82% success rate, 1.5% overhead, 15% efficiency gain) are stated without any accompanying information on the number of benchmarks, crash-injection methodology, output-validation procedure, number of trials per crash point, or statistical measures (error bars, confidence intervals). Because these figures constitute the central empirical support for the framework, the absence of this information makes the claims impossible to assess for robustness or generalizability.
  Authors: We agree that the abstract, constrained by length, omits key methodological context that would help readers assess the reported figures. The full manuscript (Sections 4.1–4.2 and 5) specifies five HPC benchmarks, crash injection at multiple random points during execution, output validation by comparing against golden runs, ten trials per injection point with standard deviations, and error bars on all figures. We will revise the abstract to include a concise statement on the number of benchmarks and crash-injection approach, plus a pointer to the evaluation section for full details and statistical measures.
  Revision: yes
- Referee: [Abstract / Introduction] The load-bearing assumption that selective persistence decisions can reliably convert failing runs into correct ones is invoked as an 'observation' but is not accompanied by a characterization of the tolerance (e.g., which data objects are critical, how crash timing affects tolerance, or an independent test that the chosen objects suffice). If this tolerance is fragile or code-specific, the selective-persistence policy cannot deliver the reported conversion rates.
  Authors: The manuscript (Section 3) derives the selective-persistence policy from profiling that identifies critical versus non-critical objects and evaluates recomputation success across crash points injected at early, middle, and late execution stages. We will add an explicit characterization paragraph in the introduction that gives concrete examples of critical objects for one benchmark, discusses timing sensitivity, and references the independent validation through the reported 54% conversion rate across the evaluated codes.
  Revision: yes
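This page does not describe how the profiling results are turned into a persistence policy. As a purely hypothetical sketch of what "critical versus non-critical objects" could mean operationally, the code below ranks profiled objects by estimated gain in recomputation success per unit of persistence cost and picks greedily under a cost budget. The class, the field names, the object names, and every number are assumptions made for illustration; none of them is the paper's algorithm or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    name: str
    success_gain: float   # assumed: profiled gain in recomputation success
    persist_cost: float   # assumed: profiled runtime overhead (%), must be > 0


def choose_objects(objects, cost_budget):
    """Greedy pick by gain-per-cost until the overhead budget is exhausted."""
    ranked = sorted(
        objects, key=lambda o: o.success_gain / o.persist_cost, reverse=True
    )
    chosen, spent = [], 0.0
    for obj in ranked:
        if obj.success_gain <= 0.0:
            continue  # persisting this object would not help recovery
        if spent + obj.persist_cost <= cost_budget:
            chosen.append(obj)
            spent += obj.persist_cost
    return chosen, spent


# Hypothetical profile of one application (illustrative values only).
profile = [
    DataObject("solution_vector", 0.30, 0.8),
    DataObject("residual", 0.10, 0.5),
    DataObject("scratch_buffer", 0.01, 1.0),
    DataObject("iteration_counter", 0.15, 0.01),
]

chosen, spent = choose_objects(profile, cost_budget=1.5)
print([obj.name for obj in chosen], round(spent, 2))
```

Greedy selection is a heuristic and is not guaranteed to be optimal for a budgeted selection problem; it is used here only because it is short to state.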
Circularity Check
No circularity; empirical framework evaluation independent of self-citations or definitional loops
full rationale
The paper reports experimental outcomes from implementing and testing the EasyCrash framework on HPC applications. The central premise (intrinsic fault tolerance allowing selective persistence) is presented as an observation that motivates the work, not derived from prior self-citations or internal definitions. No equations, fitted parameters, or load-bearing self-citations appear in the provided text; success rates and efficiency gains are measured directly from runs rather than forced by construction. This is a standard empirical systems paper with no reduction of claims to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Many HPC applications have good enough intrinsic fault tolerance.