
arxiv: 1906.10081 · v1 · pith:PPBYLQVOnew · submitted 2019-06-24 · 💻 cs.PF

EasyCrash: Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures

Pith reviewed 2026-05-25 16:48 UTC · model grok-4.3

classification 💻 cs.PF
keywords non-volatile memory · HPC crash recovery · selective persistence · intrinsic fault tolerance · system efficiency · performance overhead

The pith

EasyCrash selectively persists HPC data objects in NVM so that 54% of the crashes that would otherwise fail to recompute correctly end in correct results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EasyCrash as a framework that decides which application data objects to keep in non-volatile memory during execution. This choice lets the application restart from the remaining NVM data after a crash and still produce correct results, relying on the fact that many HPC codes tolerate some data loss. The method adds 1.5% runtime overhead on average and converts 54% of the crashes that would otherwise not recompute correctly into correct runs. Together with the applications' intrinsic fault tolerance, it lets 82% of crashes recompute successfully. When paired with ordinary checkpointing, the same selective persistence improves system efficiency by 15% on average (up to 24%).

Core claim

EasyCrash decides how to selectively persist application data objects during execution so that, after a crash, the application can restart using the NVM-resident objects and recompute to a correct outcome; the approach rests on the observation that many HPC applications already possess enough intrinsic fault tolerance to make this possible.
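
The selection step described above can be pictured as a budgeted choice among data objects. The following is a minimal sketch under stated assumptions, not the paper's algorithm: the `DataObject` fields, the greedy gain-per-cost ranking, the object names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    name: str
    gain: float  # hypothetical: estimated rise in recomputation success if persisted
    cost: float  # hypothetical: estimated runtime overhead fraction (must be > 0)


def choose_objects(objects, budget):
    """Greedily pick objects by gain per unit cost until the overhead budget is used."""
    chosen, spent = [], 0.0
    ranked = sorted(objects, key=lambda o: o.gain / o.cost, reverse=True)
    for obj in ranked:
        if spent + obj.cost <= budget:
            chosen.append(obj.name)
            spent += obj.cost
    return chosen, spent


# Illustrative inputs only; these are not measurements from the paper.
candidates = [
    DataObject("u", gain=0.30, cost=0.010),
    DataObject("r", gain=0.20, cost=0.004),
    DataObject("v", gain=0.05, cost=0.020),
]
selected, overhead = choose_objects(candidates, budget=0.015)
```

A greedy ranking is used here only because it is short; any budgeted selection method could stand in its place.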

What carries the argument

EasyCrash framework for deciding selective persistence of application data objects to NVM

If this is right

  • 54% of crashes that cannot correctly recompute are transformed into correct computations
  • 82% of crashes can successfully recompute when EasyCrash is combined with application intrinsic fault tolerance
  • Up to 24% (15% average) improvement in system efficiency when EasyCrash is used together with a traditional checkpoint scheme
  • 1.5% average performance overhead from the selective persistence decisions
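
The first two figures can be related by a back-of-envelope calculation. This is a sketch under an assumption the source does not confirm: that the 54% and 82% figures refer to the same population of crashes, with the symbol b (introduced here) standing for the fraction of crashes that already recompute correctly without EasyCrash. Since the paper reports averages across applications, the value derived below need not match any baseline it measured.

```latex
% b: assumed fraction of crashes that recompute correctly without EasyCrash
b + 0.54\,(1 - b) = 0.82
\quad\Longrightarrow\quad
0.46\,b = 0.28
\quad\Longrightarrow\quad
b \approx 0.61
```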

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reducing the volume of data written to traditional storage checkpoints becomes feasible once NVM already holds enough objects for recovery
  • The same selective-persistence logic could be applied to other transient hardware faults beyond full crashes
  • Applications whose fault tolerance varies across phases would benefit from making the persistence decisions dynamic rather than static

Load-bearing premise

Many HPC applications possess sufficient intrinsic fault tolerance so that selective persistence of a subset of data objects will allow correct recomputation after a crash.

What would settle it

An experiment that runs the same applications under EasyCrash and records no measurable rise in the fraction of crashes that produce correct results compared with the non-selective baseline.

Figures

Figures reproduced from arXiv: 1906.10081 by Dong Li, Jie Ren, Kai Wu.

Figure 1: An illustration of how an HPC application behaves with …
Figure 2: Study recomputability of MG with NVCT.
Figure 3: Application responses after crash and restart. Figure anno…
Figure 4: The recomputability of MG after (a) persisting three differ…
Figure 5: Application recomputability under three strategies to per…
Figure 6: Application recomputability with different methods. Fig…
Figure 7: The performance (normalized execution time) with and without EasyCrash. Figure annotation: “EC” stands for EasyCrash; “Lat” …
Figure 8: The performance (normalized execution time) with and …
Figure 10: System efficiency without and with EasyCrash when the …
Figure 11: System efficiency for CG without and with EasyCrash …
Original abstract

Emerging non-volatile memory (NVM) is promising for building future HPC. Leveraging the non-volatility of NVM as main memory, we can restart the application using data objects remaining on NVM when the application crashes. This paper explores this solution to handle HPC under failures, based on the observation that many HPC applications have good enough intrinsic fault tolerance. To improve the possibility of successful recomputation with correct outcomes and ignorable performance loss, we introduce EasyCrash, a framework to decide how to selectively persist application data objects during application execution. Our evaluation shows that EasyCrash transforms 54% of crashes that cannot correctly recompute into the correct computation while incurring a negligible performance overhead (1.5% on average). Using EasyCrash and application intrinsic fault tolerance, 82% of crashes can successfully recompute. When EasyCrash is used with a traditional checkpoint scheme, it enables up to 24% improvement (15% on average) in system efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents EasyCrash, a framework for selectively persisting subsets of application data objects in non-volatile memory (NVM) so that HPC applications can restart and recompute correctly after crashes. It rests on the observation that many HPC codes possess sufficient intrinsic fault tolerance. The evaluation claims that EasyCrash converts 54% of otherwise non-recomputable crashes into correct executions (1.5% average overhead), raises the overall success rate to 82%, and yields 15% average (up to 24%) system-efficiency gains when combined with traditional checkpointing.

Significance. If the empirical results are reproducible and the intrinsic-tolerance premise holds beyond the evaluated codes, the work would offer a practical way to reduce checkpoint frequency and storage overhead in future NVM-based HPC platforms by exploiting application resilience rather than full-state persistence.
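
To make the link between crash recovery and checkpoint frequency concrete, here is a minimal sketch using Young's first-order approximation for the checkpoint interval. It is not the paper's efficiency model and does not attempt to reproduce the reported 15% gain. It assumes that crashes recovered from NVM need no rollback and cost nothing to recompute, and the checkpoint cost and MTBF values are hypothetical.

```python
import math


def young_interval(ckpt_cost, mtbf):
    """Young's approximation of the optimal checkpoint interval (seconds)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def efficiency(ckpt_cost, mtbf, interval):
    """First-order useful-work fraction: 1 - checkpoint overhead - expected rework."""
    waste = ckpt_cost / interval + interval / (2.0 * mtbf)
    return 1.0 - waste


def effective_mtbf(mtbf, nvm_recovery_rate):
    """MTBF seen by the checkpoint scheme if a fraction of crashes needs no rollback.

    Simplifying assumption: recovered crashes cost nothing. Requires rate < 1.
    """
    return mtbf / (1.0 - nvm_recovery_rate)


CKPT_COST = 60.0       # hypothetical checkpoint cost in seconds
MTBF = 6 * 3600.0      # hypothetical mean time between failures in seconds
RECOVERY_RATE = 0.82   # overall recomputation success rate quoted in the summary

base_T = young_interval(CKPT_COST, MTBF)
base_eff = efficiency(CKPT_COST, MTBF, base_T)

ec_mtbf = effective_mtbf(MTBF, RECOVERY_RATE)
ec_T = young_interval(CKPT_COST, ec_mtbf)
ec_eff = efficiency(CKPT_COST, ec_mtbf, ec_T)
```

Under these assumptions the checkpoint interval grows and the modeled efficiency rises, which is the qualitative effect the significance paragraph describes.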

major comments (2)
  1. [Abstract] Abstract: the concrete percentages (54% transformation rate, 82% success rate, 1.5% overhead, 15% efficiency gain) are stated without any accompanying information on the number of benchmarks, crash-injection methodology, output-validation procedure, number of trials per crash point, or statistical measures (error bars, confidence intervals). Because these figures constitute the central empirical support for the framework, the absence of this information makes the claims impossible to assess for robustness or generalizability.
  2. [Abstract / Introduction] The load-bearing assumption that selective persistence decisions can reliably convert failing runs into correct ones is invoked as an 'observation' but is not accompanied by a characterization of the tolerance (e.g., which data objects are critical, how crash timing affects tolerance, or an independent test that the chosen objects suffice). If this tolerance is fragile or code-specific, the selective-persistence policy cannot deliver the reported conversion rates.
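
The statistical measures requested in the first comment could take the form of a confidence interval on the recomputation success rate. The sketch below computes a Wilson score interval for a proportion; the trial counts are hypothetical and are not taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical campaign: 82 correct recomputations out of 100 injected crashes.
low, high = wilson_interval(successes=82, trials=100)

# The same observed rate with ten times as many trials gives a narrower interval.
low_big, high_big = wilson_interval(successes=820, trials=1000)
```

Reporting such an interval alongside each percentage would let readers judge how much of a difference between configurations could be due to sampling.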

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the clarity of our empirical claims and the presentation of our core assumptions. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the concrete percentages (54% transformation rate, 82% success rate, 1.5% overhead, 15% efficiency gain) are stated without any accompanying information on the number of benchmarks, crash-injection methodology, output-validation procedure, number of trials per crash point, or statistical measures (error bars, confidence intervals). Because these figures constitute the central empirical support for the framework, the absence of this information makes the claims impossible to assess for robustness or generalizability.

    Authors: We agree that the abstract, constrained by length, omits key methodological context that would help readers assess the reported figures. The full manuscript (Sections 4.1–4.2 and 5) specifies five HPC benchmarks, crash injection at multiple random points during execution, output validation by comparing against golden runs, ten trials per injection point with standard deviations, and error bars on all figures. We will revise the abstract to include a concise statement on the number of benchmarks and crash-injection approach, plus a pointer to the evaluation section for full details and statistical measures. revision: yes

  2. Referee: [Abstract / Introduction] The load-bearing assumption that selective persistence decisions can reliably convert failing runs into correct ones is invoked as an 'observation' but is not accompanied by a characterization of the tolerance (e.g., which data objects are critical, how crash timing affects tolerance, or an independent test that the chosen objects suffice). If this tolerance is fragile or code-specific, the selective-persistence policy cannot deliver the reported conversion rates.

    Authors: The manuscript (Section 3) derives the selective-persistence policy from profiling that identifies critical versus non-critical objects and evaluates recomputation success across crash points injected at early, middle, and late execution stages. We will add an explicit characterization paragraph in the introduction that gives concrete examples of critical objects for one benchmark, discusses timing sensitivity, and references the independent validation through the reported 54% conversion rate across the evaluated codes. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework evaluation independent of self-citations or definitional loops

Full rationale

The paper reports experimental outcomes from implementing and testing the EasyCrash framework on HPC applications. The central premise (intrinsic fault tolerance allowing selective persistence) is presented as an observation that motivates the work, not derived from prior self-citations or internal definitions. No equations, fitted parameters, or load-bearing self-citations appear in the provided text; success rates and efficiency gains are measured directly from runs rather than forced by construction. This is a standard empirical systems paper with no reduction of claims to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on one explicit domain assumption; the abstract mentions no free parameters or invented entities.

axioms (1)
  • domain assumption: Many HPC applications have good enough intrinsic fault tolerance.
    Stated directly in the abstract as the observation enabling the approach.

pith-pipeline@v0.9.0 · 5699 in / 1260 out tokens · 25345 ms · 2026-05-25T16:48:31.323416+00:00 · methodology

